PDF Java Toolkit

Text Extraction from PDF Files

Adobe PDF Java Toolkit supports text extraction from PDF files, which makes it possible to save the text content of a PDF as plain text.
Text extraction draws from two areas of the PDF document: form XObjects in a page’s content stream, and form fields and annotations.
PDF Java Toolkit presents text as Java objects that can be iterated. To get the text, a user application takes the following steps.

  1. Iterate over the objects.
  2. Retrieve the text.
  3. Save it in the desired format.

The Text Extraction from XObjects example shows how to implement these steps.
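
As a rough illustration of the three steps, the sketch below uses a plain java.util.Iterator as a stand-in for the toolkit's text objects. The class and method names (ExtractionLoopDemo, saveAsPlainText) are hypothetical and are not part of PDF Java Toolkit. The sketch only shows the shape of the loop, not how the toolkit produces the objects.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in: not a PDF Java Toolkit class.
final class ExtractionLoopDemo {

    /** Writes each text object to 'out', separated by single spaces. */
    static void saveAsPlainText(Iterator<?> textObjects, Writer out) {
        try {
            while (textObjects.hasNext()) {
                Object textObject = textObjects.next();    // 1. Iterate over the objects.
                String text = String.valueOf(textObject);  // 2. Retrieve the text.
                out.write(text);                           // 3. Save it (plain text here).
                if (textObjects.hasNext()) {
                    out.write(' ');
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringWriter out = new StringWriter();
        // Assumed input: a list of words standing in for extracted text objects.
        saveAsPlainText(List.of("This", "is", "a", "Background.").iterator(), out);
        System.out.println(out);
    }
}
```

Any Writer can be passed in, so the same loop can target a file or an in-memory buffer.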

Text Extraction from form XObjects in a page’s content stream

This section discusses text objects in form XObjects. A form XObject is a PDF content stream that is a self-contained description of any sequence of graphics objects (including path objects, text objects, and sampled images).
For more detail, see Section 8.10, “Form XObjects,” in the ISO 32000 Reference, page 217.

Example: Content stream containing an XObject
q
0.8987885 -0.4383698 0.4383698 0.8987885 209.2356262 460.8054199 cm
/GS0 gs
/Fm0 Do
Q
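
The six operands of the cm operator above form the matrix [a b c d e f]. The e and f entries translate the coordinate system, and a through d carry rotation and scale. The sketch below recovers the rotation angle and the x-axis scale from a and b, assuming the matrix has no skew. CmMatrixDemo and its methods are hypothetical helpers, not PDF Java Toolkit API.

```java
// Hypothetical helper for reading a PDF transformation matrix [a b c d e f].
final class CmMatrixDemo {

    /** Rotation in degrees encoded by a and b (counterclockwise is positive). */
    static double rotationDegrees(double a, double b) {
        return Math.toDegrees(Math.atan2(b, a));
    }

    /** Scale along the x axis: the length of the transformed unit vector (a, b). */
    static double scaleX(double a, double b) {
        return Math.hypot(a, b);
    }

    public static void main(String[] args) {
        // Operands taken from the content stream example.
        double a = 0.8987885, b = -0.4383698;
        double e = 209.2356262, f = 460.8054199;

        System.out.printf("rotation: %.2f degrees%n", rotationDegrees(a, b));
        System.out.printf("x scale:  %.4f%n", scaleX(a, b));
        System.out.printf("origin moved to (%.2f, %.2f)%n", e, f);
    }
}
```

For the operands in the example, this works out to a rotation of roughly 26 degrees clockwise with no scaling, so the form XObject is drawn tilted at the translated origin.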
Example: XObject Fm0 in the resource dictionary
<< /Type /XObject
/Subtype /Form
/FormType 1
/BBox [0 0 1000 1000]
/Matrix [1 0 0 1 0 0]
/Resources << /ProcSet [/PDF] >>
/Length 58
>>
stream
0.25 0.333 1 rg
0 i BT
/TT0 1 Tf
0 Tc 0 Tw 0 Ts 100 Tz 0 Tr 24 0 0 24 0 -24 Tm

(This is a Background.)Tj
ET
endstream
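
To make the link between the stream above and extracted text concrete, the sketch below pulls literal strings that are shown with the Tj operator out of a content stream held in a Java String. It is a deliberately naive illustration and not how PDF Java Toolkit extracts text: it ignores font encodings, hex strings, and TJ arrays, and it does not decode octal or \n-style escapes. TjLiteralDemo and shownStrings are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified scanner for "(...) Tj" sequences.
final class TjLiteralDemo {

    /** Collects literal strings "(...)" that are directly followed by the Tj operator. */
    static List<String> shownStrings(String content) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < content.length()) {
            if (content.charAt(i) != '(') {
                i++;
                continue;
            }
            i++; // skip the opening parenthesis
            StringBuilder text = new StringBuilder();
            int depth = 1; // literal strings may contain balanced parentheses
            while (i < content.length() && depth > 0) {
                char ch = content.charAt(i++);
                if (ch == '\\' && i < content.length()) {
                    text.append(content.charAt(i++)); // keep the escaped character as-is
                } else if (ch == '(') {
                    depth++;
                    text.append(ch);
                } else if (ch == ')') {
                    depth--;
                    if (depth > 0) {
                        text.append(ch);
                    }
                } else {
                    text.append(ch);
                }
            }
            // Accept the string only if the next token is Tj.
            int j = i;
            while (j < content.length() && Character.isWhitespace(content.charAt(j))) {
                j++;
            }
            if (depth == 0 && content.startsWith("Tj", j)) {
                result.add(text.toString());
                i = j + 2;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String stream = "0 i BT\n/TT0 1 Tf\n24 0 0 24 0 -24 Tm\n(This is a Background.)Tj\nET";
        System.out.println(shownStrings(stream));
    }
}
```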
Example: Text Extraction

Here is a Java code sample.

TextExtractionSampleJavaProgram

Text extraction from Form Fields and Annotations

PDF Java Toolkit does not provide “text extraction services” for annotations and form fields. Instead, text can be obtained from the appropriate dictionary entries. Quads are not computed, and the word content is not run through the disambiguation algorithm. The only position information available is the location entry (Rect) of the field or annotation on the page. To learn more, see Section 12.7, “Interactive Forms,” in the ISO 32000 Reference, page 430.

Text Extraction from Annotations

An annotation associates an object such as a note, sound, or movie with a location on a page of a PDF document. The optional Annots entry in a page object holds an array of annotation dictionaries, each representing an annotation associated with the given page. A given annotation dictionary may be referenced from the Annots array of only one page.
The entries that are relevant in the context of Text Extraction are listed below.

Key | Type | Presence | Description
Type | Name | Optional | The type of PDF object that this dictionary describes; if present, must be Annot for an annotation dictionary.
Subtype | Name | Required | The type of annotation that this dictionary describes.
Rect | Rectangle | Required | The annotation rectangle, defining the location of the annotation on the page in default user space units.
Contents | Text string | Optional | Text to be displayed for the annotation. If this type of annotation does not display text, an alternate description of the annotation's contents in human-readable form. In either case, this text is useful when extracting the document's contents in support of accessibility to users with disabilities or for other purposes.
M | Date or string | Optional | The date and time when the annotation was most recently modified. Viewer applications should be prepared to accept and display a string in any format.
T | Text string | Optional | The text label to be displayed in the title bar of the annotation's pop-up window when open and active. By convention, this entry identifies the user who added the annotation.
V | Various | Optional; inheritable | The field's value. The format varies depending on the field type. See the descriptions of individual field types for further information.
RV | Various | Optional; inheritable | The rich-text version of the field's value. The format varies depending on the field type.
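
Because the M entry may hold either a PDF date or an arbitrary string, code that works with the raw value needs a fallback. The sketch below tries the PDF date syntax D:YYYYMMDDHHmmSSOHH'mm' and returns an empty Optional for anything else, so the caller can display the original string unchanged. PdfDateDemo is a hypothetical helper, not PDF Java Toolkit API. Treating a missing time zone as UTC is an assumption made only for this illustration.

```java
import java.time.DateTimeException;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for the raw string form of the /M entry.
final class PdfDateDemo {

    // D:YYYYMMDDHHmmSSOHH'mm' where every field after the year is optional.
    private static final Pattern PDF_DATE = Pattern.compile(
            "D:(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
            + "(?:([Z+-])(?:(\\d{2})'?)?(?:(\\d{2})'?)?)?");

    /** Returns the parsed date, or empty if the value is not in PDF date syntax. */
    static Optional<OffsetDateTime> parse(String value) {
        if (value == null) {
            return Optional.empty();
        }
        Matcher m = PDF_DATE.matcher(value.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        try {
            int sign = "-".equals(m.group(7)) ? -1 : 1;
            // Assumption: no zone information is treated as UTC.
            ZoneOffset offset = ZoneOffset.ofHoursMinutes(
                    sign * num(m.group(8), 0), sign * num(m.group(9), 0));
            return Optional.of(OffsetDateTime.of(
                    num(m.group(1), 0), num(m.group(2), 1), num(m.group(3), 1),
                    num(m.group(4), 0), num(m.group(5), 0), num(m.group(6), 0),
                    0, offset));
        } catch (DateTimeException e) {
            return Optional.empty(); // for example month 13 or hour 25
        }
    }

    private static int num(String digits, int fallback) {
        return digits == null ? fallback : Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        for (String raw : new String[] {"D:20230115103000+05'30'", "last Tuesday"}) {
            // Fall back to the raw string when it is not a PDF date.
            String shown = parse(raw).map(OffsetDateTime::toString).orElse(raw);
            System.out.println(shown);
        }
    }
}
```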
Example: Extract information from a list of annotations

In this example, you iterate over the pages in the PDF document. For each page, you enumerate its annotations and extract the required information as shown below.

// Requires: import java.util.Iterator;
// pdfPage is the page currently being processed.
PDFAnnotationList annotations = pdfPage.getAnnotationList();
// Get the annotation iterator.
Iterator<PDFAnnotation> annotIterator = annotations.iterator();
while (annotIterator.hasNext())
{
    // Get the next annotation.
    PDFAnnotation pdfAnnotation = annotIterator.next();

    // Get the /Contents entry as a String.
    String annotationContent = pdfAnnotation.getContents();

    // Get the /M entry (modification date) as an ASDate.
    ASDate modificationDate = pdfAnnotation.getModificationDate();

    // Get the location of the annotation as a PDFRectangle.
    PDFRectangle annotLocation = pdfAnnotation.getRect();

    // All markup annotations have the /T entry.
    if (pdfAnnotation instanceof PDFAnnotationMarkup)
    {
        // Get the /T entry as a String.
        String title = ((PDFAnnotationMarkup) pdfAnnotation).getTitle();
    }
}

Annotation Types

The type of annotation is identified in the Annotation dictionary’s Subtype entry. See Section 12.5, “Annotations,” in the ISO 32000 Reference.
Many annotation types are defined as markup annotations because they are used primarily to mark up PDF documents. These annotations have text that appears as part of the annotation.
Annotations can be broadly classified into markup and non-markup annotations.

Markup Annotations

Markup annotations can be divided into the following groups:

  • Free text annotations display text directly on the page. The annotation’s Contents entry specifies the displayed text.
  • Most other markup annotations have an associated pop-up window that may contain text. The annotation’s Contents entry specifies the text to be displayed when the pop-up window is opened. These include text, line, square, circle, polygon, polyline, highlight, underline, squiggly-underline, strikeout, rubber stamp, caret, ink, and file attachment annotations.
  • Sound annotations do not have a pop-up window but may also have associated text specified by the Contents entry.
  • A subset of markup annotations, namely highlight, underline, squiggly-underline, and strikeout annotations, is known as text markup annotations.

Non-Markup Annotations

The pop-up annotation type typically does not appear by itself; it is associated with a markup annotation that uses it to display text.

Note: The Contents entry for a pop-up annotation is relevant only if it has no parent. In that case, it represents the text of the annotation.

For all other annotation types (Link, Movie, Widget, PrinterMark, and TrapNet), the Contents entry provides an alternate representation of the annotation’s contents in human-readable form, which is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes.
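
The classification above can be summarized as a lookup from the Subtype name to the meaning of the Contents entry. The sketch below is one way to express it. ContentsRoleDemo and its Role enum are hypothetical and not part of PDF Java Toolkit. The subtype names are the ones defined in ISO 32000.

```java
import java.util.Set;

// Hypothetical helper: maps an annotation Subtype to the meaning of /Contents.
final class ContentsRoleDemo {

    enum Role {
        DISPLAYED_TEXT,          // free text: shown directly on the page
        POPUP_TEXT,              // markup annotations with a pop-up window
        ASSOCIATED_TEXT,         // sound: no pop-up, but may carry text
        TEXT_ONLY_IF_NO_PARENT,  // pop-up: relevant only without a parent
        ALTERNATE_DESCRIPTION    // all other types
    }

    private static final Set<String> POPUP_MARKUP = Set.of(
            "Text", "Line", "Square", "Circle", "Polygon", "PolyLine",
            "Highlight", "Underline", "Squiggly", "StrikeOut",
            "Stamp", "Caret", "Ink", "FileAttachment");

    static Role roleOfContents(String subtype) {
        if ("FreeText".equals(subtype)) {
            return Role.DISPLAYED_TEXT;
        }
        if (POPUP_MARKUP.contains(subtype)) {
            return Role.POPUP_TEXT;
        }
        if ("Sound".equals(subtype)) {
            return Role.ASSOCIATED_TEXT;
        }
        if ("Popup".equals(subtype)) {
            return Role.TEXT_ONLY_IF_NO_PARENT;
        }
        return Role.ALTERNATE_DESCRIPTION;
    }

    public static void main(String[] args) {
        for (String subtype : new String[] {"FreeText", "Highlight", "Popup", "Widget"}) {
            System.out.println(subtype + " -> " + roleOfContents(subtype));
        }
    }
}
```

An extraction routine could use such a lookup to decide whether Contents should be treated as page text, as a comment, or as an accessibility description.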

Text Extraction from Form Fields

For form fields, the contents saved by Acrobat are indicated by the V entry.

Key | Type | Presence | Description
V | Various | Optional; inheritable | The field's value. The format varies depending on the field type. See the descriptions of individual field types for further information.
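
“Inheritable” means that when a field dictionary has no V entry of its own, the value is taken from the nearest ancestor in the field hierarchy, reached through the Parent entry. The sketch below illustrates that lookup with plain java.util.Map objects standing in for field dictionaries. InheritableEntryDemo is a hypothetical helper. Whether PDFField.getValueList() performs this lookup internally is not stated here, so treat the sketch as an explanation of the rule rather than of the toolkit's behavior.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of inheritable field attributes, using Maps as dictionaries.
final class InheritableEntryDemo {

    private static final int MAX_DEPTH = 64; // guards against cyclic Parent links

    /** Returns the entry from the field or its nearest ancestor, or null if absent. */
    static Object resolveInheritable(Map<String, Object> field, String key) {
        Map<String, Object> current = field;
        for (int depth = 0; current != null && depth < MAX_DEPTH; depth++) {
            if (current.containsKey(key)) {
                return current.get(key);
            }
            current = asDictionary(current.get("Parent"));
        }
        return null;
    }

    @SuppressWarnings("unchecked")
    private static Map<String, Object> asDictionary(Object value) {
        return (value instanceof Map) ? (Map<String, Object>) value : null;
    }

    public static void main(String[] args) {
        Map<String, Object> parent = new HashMap<>();
        parent.put("T", "address");
        parent.put("V", "12 Main Street");

        Map<String, Object> child = new HashMap<>();
        child.put("T", "line1");
        child.put("Parent", parent);

        // The child has no V entry, so the parent's value is used.
        System.out.println(resolveInheritable(child, "V"));
    }
}
```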

To get the V entry using the PDF Java Toolkit methods, use the PDFField.getValueList() method as in the following example.

Example: Field Value Extraction
// Requires: import java.util.Iterator; import java.util.List;
PDFInteractiveForm iforms = pdfDocument.getInteractiveForm();
// Get the field iterator.
Iterator<?> fieldIterator = iforms.iterator();
// Iterate over the form fields.
while (fieldIterator.hasNext())
{
    // Get the next field.
    PDFField pdfField = (PDFField) fieldIterator.next();

    // The value of a field can be a list of strings.
    // In most cases this is just a name or a single string.
    // In any case, the values are always represented as a list of strings.
    List<?> valueList = pdfField.getValueList();
    if (valueList != null)
    {
        // Iterate over the list of values.
        Iterator<?> valueIterator = valueList.iterator();
        while (valueIterator.hasNext())
        {
            String value = (String) valueIterator.next();
        }
    }
}

Note: The PDFField and PDFAnnotation classes are described in the Javadoc for PDF Java Toolkit.