Using Adobe PDF Extract API to maintain sentence structure
Amid the challenges of extracting text data from PDF files, MTEC launched a new project to evaluate
Environment, Social, and Governance (ESG) issues. ESG is attracting attention as it relates to a company’s
long-term growth. To objectively evaluate ESG, it is important to understand a company’s initiatives, rather
than just numerical figures. There is a focus on integrated reports, which add non-financial information such
as corporate governance, corporate social responsibility (CSR), and intellectual property to a company’s
financial information. However, in order to fully understand the content of dozens of pages in an integrated
report, maintaining the sentence structure of extracted text is essential. While researching ways to address
this, Mr. Naritomi came across the Adobe PDF Extract API.
“I discovered a blog post in English and became interested after seeing the accuracy of the extracted
sentence presented there as an example,” says Mr. Naritomi. “There are several tools that recognize text in
PDF files, but this was the only one that claimed to maintain the sentence structure. After consulting with
Adobe, we decided to conduct an enterprise trial to verify the performance of the Adobe PDF Extract API.”
“We found that Adobe Acrobat DC OCR and Adobe PDF Extract
API resulted in the highest quality work.”
Mr. Yusuke Naritomi
Financial Engineer, Development Group 2, Research Department, MTEC
Improving speed of the analysis and verification cycle
An integrated report issued by a company listed on the former First Section of the Tokyo Stock Exchange was
used for this trial.
Masahiro Shimizu, Senior Financial Engineer in Development Group 1 of the Research Department, explains
the aim of the project. “Information that is disseminated by ESG-related companies tends to be filled with
similar words and phrases. Therefore, it is important for the text to be extracted in a manner that preserves
the sentence structure and preserves the meaning of sentences and paragraphs to highlight the differences
in the initiatives pursued by each company. Therefore, the most important point was to evaluate the ability
of this tool to maintain sentence structure while extracting the text information from dozens of pages in an
integrated report.”
The Adobe PDF Extract API uses Adobe Sensei, the proprietary AI and machine learning engine from Adobe.
“Consistent and high-quality PDF conversion is very important to scaling text extraction,” says Mr. Naritomi.
“We found that Adobe Acrobat DC OCR and Adobe PDF Extract API resulted in the highest quality work. We
took the PDF files in Amazon S3 that we wanted to convert and placed them in a separate folder, then we
used Acrobat DC to convert the text in the PDF. Then we use the Adobe PDF Extract API to extract the text
and output it as a JSON file. The text extraction process was fast, including the time required for OCR. This
contributed to speeding up the analysis and verification cycle.”
The data that is output as a JSON file requires a minimal amount of organization by Mr. Naritomi before
being passed on to a researcher.
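
As a rough illustration of what that light organization could involve, the Python sketch below groups body text under the heading that precedes it. It assumes the JSON follows the general shape of the Extract API's structured output, an "elements" list whose entries carry a "Path" (for example "//Document/H1" or "//Document/P[2]") and a "Text" value. The sample data and the group_by_heading helper are hypothetical and do not represent MTEC's actual process.

```python
import json
import re

# Hypothetical sample shaped like the Extract API's structured JSON output.
SAMPLE = json.loads("""
{"elements": [
  {"Path": "//Document/H1", "Text": "Governance "},
  {"Path": "//Document/P", "Text": "The board met twelve times. "},
  {"Path": "//Document/P[2]", "Text": "Two outside directors were added. "},
  {"Path": "//Document/Figure"},
  {"Path": "//Document/H1[2]", "Text": "Environment "},
  {"Path": "//Document/P[3]", "Text": "Emissions fell year on year. "}
]}
""")

# Matches the last path segment of a heading element: H, H1, H2[3], ...
HEADING_TAG = re.compile(r"^H\d*(\[\d+\])?$")


def group_by_heading(doc):
    """Return a list of {"heading", "body"} sections in reading order."""
    sections = []
    current = {"heading": None, "body": []}
    for element in doc.get("elements", []):
        text = element.get("Text", "").strip()
        if not text:
            continue  # skip figures and other elements without text
        tag = element.get("Path", "").rsplit("/", 1)[-1]
        if HEADING_TAG.match(tag):
            # A new heading closes the section collected so far.
            if current["heading"] is not None or current["body"]:
                sections.append(current)
            current = {"heading": text, "body": []}
        else:
            current["body"].append(text)
    if current["heading"] is not None or current["body"]:
        sections.append(current)
    return sections


if __name__ == "__main__":
    for section in group_by_heading(SAMPLE):
        print(section["heading"], "->", " ".join(section["body"]))
```

Keeping each paragraph as a separate string under its heading preserves the sentence and paragraph boundaries that the trial set out to evaluate.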
“In the past, we had to separate some sentences and splice others together to decipher the meaning of
the text data extracted from PDF files,” says Mr. Shimizu. “The Adobe PDF Extract API not only eliminates
the need for such work, but also offers unprecedented features such as the ability to identify headings and
body text. Another use is to determine the extent to which the 17 SDGs are present based on the text of
an integrated report, and to determine from that where each company is focused related to the SDGs, or
corporate materiality. By comparing these results with other companies in the same industry, it is possible to
measure a company’s level of focus on any particular issue.”
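
The article does not describe how MTEC measures SDG presence. Purely as an illustration of the idea, the Python sketch below counts keyword mentions per goal, converts them to shares of all SDG mentions in a report, and compares a company's shares with the average of its industry peers. The keyword lists, the three goals chosen, and the function names are all assumptions made for this example.

```python
import re
from collections import Counter

# Illustrative keyword lists for three of the 17 SDGs (assumed, not MTEC's).
SDG_KEYWORDS = {
    "SDG 5 Gender Equality": ["gender", "women", "diversity"],
    "SDG 7 Affordable and Clean Energy": ["renewable", "solar", "energy efficiency"],
    "SDG 13 Climate Action": ["climate", "emissions", "carbon"],
}


def sdg_shares(text):
    """Share of all SDG keyword hits in `text` that belong to each goal."""
    lowered = text.lower()
    counts = Counter()
    for goal, keywords in SDG_KEYWORDS.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            counts[goal] += len(re.findall(pattern, lowered))
    total = sum(counts.values())
    return {goal: (counts[goal] / total if total else 0.0) for goal in SDG_KEYWORDS}


def focus_vs_peers(company_text, peer_texts):
    """Company share minus the mean peer share, per goal."""
    own = sdg_shares(company_text)
    peers = [sdg_shares(text) for text in peer_texts]
    result = {}
    for goal in SDG_KEYWORDS:
        peer_mean = sum(p[goal] for p in peers) / len(peers) if peers else 0.0
        result[goal] = own[goal] - peer_mean
    return result


if __name__ == "__main__":
    company = ("We cut carbon emissions and expanded solar capacity. "
               "Climate risk is reviewed annually.")
    peer = "Women hold more management roles; gender diversity targets were met."
    for goal, delta in focus_vs_peers(company, [peer]).items():
        print(f"{goal}: {delta:+.2f}")
```

A positive value suggests a company puts more emphasis on that goal than its peers do. Simple keyword counting ignores context, so a real analysis would likely rely on the preserved sentence structure rather than isolated words.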