Established: 1988
Employees: 45
Marunouchi, Chiyoda-ku, Tokyo
www.mtec-institute.co.jp/en/
Speeds up analysis and
verification cycles
Products:
Adobe Acrobat Services
Adobe PDF Extract API
Objectives
Automatically recognize awkward line breaks in
sentences
Use data extraction that recognizes text styles
and images in PDFs
Support further business growth
Decrease time to conduct PDF text extractions
Results
Enables extraction of highly accurate
sentences rather than words with pre-OCR
Maintains document structure to enable
analysis that captures meaning of sentences
Conducts more accurate surveys and expands
scope of business
Accelerates analysis and verification cycle
Automating PDF data
extractions.
MTEC uses Adobe PDF Extract API to
improve speed and accuracy of automatic
text extraction from financial data PDF files.
Established in 1988, Mitsubishi UFJ Trust Investment Technology Institute (MTEC) provides asset management,
risk management, data analytics, and data analysis consulting services primarily to its parent company,
Mitsubishi UFJ Trust and Banking Corporation, and its group companies. In 2022, the company began utilizing
the knowledge and expertise it had accumulated over many years to provide investment advisory services to
users beyond the group’s framework.
The scope of data science analysis is expanding to include unstructured data such as natural language. In this
context, MTEC, which integrates mathematical science and information science to solve problems in financial
operations, has focused on integrated reports of listed companies as one of its new targets for analysis. The
company’s efforts to adopt Adobe PDF Extract API as a tool for extracting text data from PDF files have
contributed to improving the speed of the analysis and validation cycle, by completing the extraction of text
data from integrated reports at high speed.
There is a strong demand in the field of financial engineering for
analysis that includes market psychology, in addition to numerical
data. Text extraction that preserves sentence structure
is of great significance in the world of financial data science.
Mr. Yusuke Naritomi
Financial Engineer, Development Group 2, Research Department, MTEC
Continued evolution of financial data science: analysis of
natural language
MTEC focuses on analyzing market data, financial data, and other numerical data that affect the movement
of stock market prices. It provides its parent company, Mitsubishi UFJ Trust and Banking Corporation, and
its group companies with the mathematical models needed to make investment and financing decisions.
The institute has been an early adopter in using text in financial statements and other documents in data
analysis.
“It is becoming increasingly difficult to conduct more accurate analysis using conventional methods that
only follow numerical data. There is a strong demand in the field of financial engineering for analysis that
includes market psychology, in addition to numerical data,” explains Yusuke Naritomi, a Financial Engineer in
Development Group 2 of the Research Department. “Text extraction that preserves sentence structure is of
great significance in the world of financial data science.”
Under these circumstances, a new issue emerged. How can the text data of timely disclosure information
and a variety of reports be extracted from PDF files with high accuracy and efficiency?
“In the past, we used free software to extract text information from timely disclosure information in PDF files.
However, problems such as the incorrect interpretation of character strings at line breaks made it difficult to
maintain the sentence structure of the extracted text,” explains Mr. Naritomi. “That is fine if you simply want
to pick out words contained in the text and quantify how frequently they occur. However, that method of
analysis does not include the meaning of the text. In terms of improving our service quality, maintaining the
sentence structure of text data extracted from PDF files has become important.”
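The distinction Mr. Naritomi draws can be illustrated with a minimal Python sketch. It is not MTEC's code: the sample lines and all function names are hypothetical, and the sentence splitter is a deliberately simple rule that assumes English punctuation. It shows that word-frequency counts survive hard line breaks, while sentence-level units do not unless the lines are rejoined first.

```python
import re
from collections import Counter

# Hypothetical text as a line-oriented extractor might return it:
# one list item per visual line of the PDF page.
raw_lines = [
    "The company reduced emissions by",
    "12 percent. It also appointed two",
    "independent directors.",
]


def word_counts(lines):
    """Count word frequencies; line breaks do not change the result."""
    words = re.findall(r"[a-z0-9]+", " ".join(lines).lower())
    return Counter(words)


def naive_units(lines):
    """Treat every extracted line as a unit, which cuts sentences apart."""
    return [line.strip() for line in lines if line.strip()]


def rejoined_sentences(lines):
    """Join the lines first, then split after sentence-ending punctuation."""
    text = " ".join(line.strip() for line in lines)
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


if __name__ == "__main__":
    print(word_counts(raw_lines).most_common(3))
    print(naive_units(raw_lines))         # three fragments
    print(rejoined_sentences(raw_lines))  # two complete sentences
```

A real report also contains headings, tables, and hyphenated words that a rule this simple would mishandle, which is the kind of problem a structure-aware extractor is meant to address.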
Using Adobe PDF Extract API to maintain sentence structure
Amid the challenges of extracting text data from PDF files, MTEC launched a new project to evaluate
Environment, Social, and Governance (ESG) issues. ESG is attracting attention as it relates to a company’s
long-term growth. To objectively evaluate ESG, it is important to understand a company’s initiatives, rather
than just numerical figures. There is a focus on integrated reports, which add non-financial information such
as corporate governance, corporate social responsibility (CSR), and intellectual property to a company’s
financial information. However, in order to fully understand the content of dozens of pages in an integrated
report, maintaining the sentence structure of extracted text is essential. While researching ways to address
this, Mr. Naritomi came across the Adobe PDF Extract API.
“I discovered a blog post in English and became interested after seeing the accuracy of the extracted
sentence presented there as an example,” says Mr. Naritomi. “There are several tools that recognize text in
PDF files, but this was the only one that claimed to maintain the sentence structure. After consulting with
Adobe, we decided to conduct an enterprise trial to verify the performance of the Adobe PDF Extract API.”
We found that Adobe Acrobat DC OCR and Adobe PDF Extract
API resulted in the highest quality work.
Mr. Yusuke Naritomi
Financial Engineer, Development Group 2, Research Department, MTEC
Improving speed of the analysis and verification cycle
An integrated report issued by a company listed on the former First Section of the Tokyo Stock Exchange was
used for this trial.
Masahiro Shimizu, Senior Financial Engineer in Development Group 1 of the Research Department, explains
the aim of the project. “Information that is disseminated by ESG-related companies tends to be filled with
similar words and phrases. Therefore, it is important for the text to be extracted in a manner that preserves
the sentence structure and preserves the meaning of sentences and paragraphs to highlight the differences
in the initiatives pursued by each company. Therefore, the most important point was to evaluate the ability
of this tool to maintain sentence structure while extracting the text information from dozens of pages in an
integrated report.”
The Adobe PDF Extract API uses Adobe Sensei, the proprietary AI and machine learning engine from Adobe.
“Consistent and high-quality PDF conversion is very important to scaling text extraction,” says Mr. Naritomi.
“We found that Adobe Acrobat DC OCR and Adobe PDF Extract API resulted in the highest quality work. We
took the PDF files in Amazon S3 that we wanted to convert and placed them in a separate folder, then we
used Acrobat DC to convert the text in the PDF. Then we used the Adobe PDF Extract API to extract the text
and output it as a JSON file. The text extraction process was fast, including the time required for OCR. This
contributed to speeding up the analysis and verification cycle.”
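As an illustration of the last step in that workflow, the following Python sketch reads extraction output in JSON form and separates headings from body text. It is a sketch under stated assumptions, not MTEC's pipeline: the `elements`, `Path`, and `Text` field names and the `//Document/H1` style paths are assumed from Adobe's published output format and should be checked against real output, the sample document is invented, and `classify` and `split_headings_and_body` are hypothetical helper names.

```python
import json

# Invented, heavily trimmed stand-in for an extraction result.
# ASSUMPTION: each element carries a "Path" describing its role in the
# document structure and, for textual elements, a "Text" string.
sample = json.loads("""
{"elements": [
  {"Path": "//Document/H1", "Text": "Sustainability Policy", "Page": 0},
  {"Path": "//Document/P", "Text": "We aim to cut emissions.", "Page": 0},
  {"Path": "//Document/Figure", "Page": 0},
  {"Path": "//Document/P[2]", "Text": "Governance was strengthened.", "Page": 1}
]}
""")


def classify(path):
    """Map a structure path to 'heading', 'body', or 'other'."""
    tag = path.rsplit("/", 1)[-1].split("[")[0]  # "P[2]" -> "P"
    if tag == "Title" or (tag.startswith("H") and tag[1:].isdigit()):
        return "heading"
    if tag == "P":
        return "body"
    return "other"


def split_headings_and_body(doc):
    """Return (kind, text) pairs in reading order, skipping non-text elements."""
    pairs = []
    for element in doc.get("elements", []):
        text = element.get("Text")
        kind = classify(element.get("Path", ""))
        if text and kind != "other":
            pairs.append((kind, text.strip()))
    return pairs


if __name__ == "__main__":
    for kind, text in split_headings_and_body(sample):
        print(f"{kind:8} {text}")
```

Keeping the heading and body labels alongside the text lets a downstream analysis treat each paragraph as a unit under its section title rather than as a flat stream of words.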
The data that is output as a JSON file requires a minimal amount of organization by Mr. Naritomi before
being passed on to a researcher.
“In the past, we had to separate some sentences and splice others together to decipher the meaning of
the text data extracted from PDF files,” says Mr. Shimizu. “The Adobe PDF Extract API not only eliminates
the need for such work, but also offers unprecedented features such as the ability to identify headings and
body text. Another use is to determine the extent to which the 17 SDGs are present based on the text of
an integrated report, and to determine from that where each company is focused related to the SDGs, or
corporate materiality. By comparing these results with other companies in the same industry, it is possible to
measure a company’s level of focus on any particular issue.”
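The SDG comparison Mr. Shimizu describes can be pictured with a small Python sketch. The source does not say how MTEC scores SDG presence, so this uses plain keyword matching as a stand-in technique. The keyword lists, the three goals chosen, the sample sentences, and the `sdg_focus` function are all hypothetical.

```python
from collections import Counter

# Hypothetical keyword lists for three of the 17 SDGs, for illustration only.
SDG_KEYWORDS = {
    "SDG 5 (Gender Equality)": ["gender", "women"],
    "SDG 7 (Affordable and Clean Energy)": ["renewable energy", "solar"],
    "SDG 13 (Climate Action)": ["emissions", "climate"],
}


def sdg_focus(sentences):
    """Return, per goal, its share of all sentence-level keyword matches."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        for goal, keywords in SDG_KEYWORDS.items():
            # Count a sentence once per goal, however many keywords it contains.
            if any(keyword in lowered for keyword in keywords):
                counts[goal] += 1
    total = sum(counts.values())
    return {goal: (counts[goal] / total if total else 0.0) for goal in SDG_KEYWORDS}


if __name__ == "__main__":
    report_sentences = [
        "We cut emissions by 10 percent.",
        "Solar capacity doubled.",
        "Climate risk is reviewed by the board.",
        "Revenue grew.",
    ]
    for goal, share in sdg_focus(report_sentences).items():
        print(f"{goal}: {share:.2f}")
```

Running the same function over reports from several companies in one industry would give comparable focus profiles. The matching works on whole sentences, which is why extraction that keeps sentences intact matters here.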
© 2023 Adobe. All rights reserved.
Adobe, the Adobe logo, and Connect are either registered trademarks or trademarks of Adobe in the United States and/or other countries. All other
trademarks are the property of their respective owners.
The Adobe PDF Extract API ... offers unprecedented features such
as the ability to identify headings and body text.
Mr. Masahiro Shimizu
Senior Financial Engineer, Development Group 1, Research Department, MTEC
Extracting text from a variety of PDF files, not just integrated reports
MTEC is currently working on automating the text extraction process with the Adobe PDF Extract API. It has
also started using the tool for PDF files other than integrated reports.
“In addition to integrated reports, we are currently attempting to extract text from the Timely Disclosure
network (TDnet) service provided by the Japan Exchange Group (JPX),” says Mr. Naritomi. “We are
also looking into text extraction for a variety of reports, such as CSR reports, from approximately 4,000
companies listed on the JPX.”
The company is also considering expanding these services to its parent company, Mitsubishi UFJ Trust and
Banking Corporation, and other group companies.
“In the case of trust banks, many operations have documents that have not been converted to data,” says
Mr. Shimizu. “We believe that the conversion of such text into data is also meaningful from the perspective of
improving operational efficiency.”
Adobe Acrobat Services licenses are available for the Adobe PDF Extract API, as well as for the Document
Generation API and the PDF Services API. MTEC is considering the use of Adobe Acrobat Services in the
future for automatic document generation.
* This information is current as of September 2022.