PDF Java Toolkit

Extracting text from PDF files

Text extraction refers to a set of APIs that let users find and extract text within PDF documents. The basic unit of text is a word, so the text extraction feature must delineate text logically into words and make the list of words, along with related information, available to the user. Information about words includes location, font, bounding box, and character widths.

To learn more about how PDF manages text, see section 9, “Text,” on page 237 of the ISO 32000 document. This document is available from the web store of the International Organization for Standardization (ISO).

The purpose of the text extraction feature is to provide users with the following abilities:

  • Find text on a page known to be in a certain location
  • Search and index PDF content
  • Repurpose PDF text content
  • Enable search engines to handle PDF documents that hold content more complex than simple text
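
As a rough illustration of the first two abilities, the sketch below filters a list of words by a page region and builds a small word-to-page index. It does not use the toolkit's own classes: the Word type, its fields, and the wordsInRegion and indexByPage helpers are hypothetical stand-ins, and it assumes that each extracted word carries a page number and a bounding box in PDF user space units.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class WordRegionSketch {

    /** Hypothetical stand-in for an extracted word; not a toolkit class. */
    static final class Word {
        final String text;
        final int page;
        // Bounding box corners, assumed to be in PDF user space units.
        final double left, bottom, right, top;

        Word(String text, int page, double left, double bottom, double right, double top) {
            this.text = text;
            this.page = page;
            this.left = left;
            this.bottom = bottom;
            this.right = right;
            this.top = top;
        }
    }

    /** Returns the words on the given page whose bounding box lies entirely inside the region. */
    static List<Word> wordsInRegion(List<Word> words, int page,
                                    double left, double bottom, double right, double top) {
        List<Word> hits = new ArrayList<>();
        for (Word w : words) {
            boolean inside = w.left >= left && w.right <= right
                    && w.bottom >= bottom && w.top <= top;
            if (w.page == page && inside) {
                hits.add(w);
            }
        }
        return hits;
    }

    /** Builds a simple inverted index: lower-cased word text mapped to the pages it appears on. */
    static Map<String, List<Integer>> indexByPage(List<Word> words) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Word w : words) {
            String key = w.text.toLowerCase(Locale.ROOT);
            List<Integer> pages = index.computeIfAbsent(key, k -> new ArrayList<>());
            if (!pages.contains(w.page)) {
                pages.add(w.page);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Sample data made up for this sketch.
        List<Word> words = Arrays.asList(
                new Word("Invoice", 1, 72, 700, 130, 714),
                new Word("Total", 1, 400, 100, 435, 112),
                new Word("total", 2, 72, 700, 105, 712));

        // Words in the top band of a US Letter page (612 x 792 units).
        for (Word w : wordsInRegion(words, 1, 0, 650, 612, 792)) {
            System.out.println(w.text);
        }
        System.out.println(indexByPage(words));
    }
}
```

In a real application the list of words would come from the text extraction API rather than from hand-built sample data, and the region test might accept partial overlap instead of full containment.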

Text found within an annotation or a form field in a PDF document is not considered part of the text in the PDF document, but it is still possible to extract this content. This is described under Text Extraction from PDF Files.

The Text Extraction APIs do not extract text from metadata associated with a PDF file.

Text extraction from a PDF document may fail if a font is embedded in the document as a subset but no ToUnicode table specific to that font is provided. In that case the API will probably not be able to identify the font, and the resulting text might be unreadable. Also, if the document is password protected or encrypted, the API may not be able to extract text from the PDF unless the user provides the owner password with sufficient access rights.
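
One way an application might notice unreadable output is to check the extracted text for characters that rarely occur in ordinary prose. The sketch below is a heuristic under stated assumptions, not toolkit behavior: it assumes that unmapped glyphs show up as the Unicode replacement character (U+FFFD) or as private use code points, which may or may not hold for a given document. The class name, method names, and threshold are hypothetical.

```java
public class ExtractionSanityCheck {

    /**
     * Returns the fraction of code points in the text that are U+FFFD or
     * belong to a Unicode private use area. Empty text yields 0.0.
     */
    static double suspiciousRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // step over surrogate pairs correctly
            total++;
            if (cp == 0xFFFD || Character.getType(cp) == Character.PRIVATE_USE) {
                suspicious++;
            }
        }
        return (double) suspicious / total;
    }

    /** Flags text whose suspicious ratio exceeds the caller-chosen threshold. */
    static boolean looksUnreadable(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }

    public static void main(String[] args) {
        // Made-up samples: plain text, and text made only of private use code points.
        System.out.println(looksUnreadable("Quarterly report", 0.3));
        System.out.println(looksUnreadable("\uE001\uE002\uE003", 0.3));
    }
}
```

A check like this cannot catch every failure, for example text that maps to valid but wrong characters, so it is best treated as a prompt for manual review rather than a definitive test.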