Adobe PDF Java Toolkit supports text extraction from PDF files. Text extraction makes it possible to save the text content of a PDF document as plain text.
Text extraction draws from two areas of the PDF document: form XObjects in a page's content stream, and form fields and annotations.
PDF Java Toolkit presents text as Java objects that can be iterated. To get the text, a user application takes the following steps:
- Iterate over the objects.
- Retrieve the text.
- Save it in the desired format.
The Text Extraction from XObjects example shows how to implement these steps.
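The three steps can also be sketched without the Toolkit itself. The following is a minimal, self-contained illustration in which the `ExtractedWord` class and the sample words are hypothetical stand-ins for the objects the Toolkit would supply; none of these names belong to the PDF Java Toolkit API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a text object returned by an extractor.
// It is NOT a PDF Java Toolkit class.
class ExtractedWord {
    private final String text;

    ExtractedWord(String text) {
        this.text = text;
    }

    String getText() {
        return text;
    }
}

class ExtractionStepsSketch {

    // Steps 1 and 2: iterate over the objects and retrieve the text of each one.
    static String collectText(Iterator<ExtractedWord> words) {
        StringBuilder plainText = new StringBuilder();
        while (words.hasNext()) {
            if (plainText.length() > 0) {
                plainText.append(' ');
            }
            plainText.append(words.next().getText());
        }
        return plainText.toString();
    }

    // Step 3: save the text in the desired format, here a UTF-8 text file.
    static void saveAsPlainText(String text, Path target) throws IOException {
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        List<ExtractedWord> words = Arrays.asList(
                new ExtractedWord("This"), new ExtractedWord("is"),
                new ExtractedWord("a"), new ExtractedWord("Background."));
        String text = collectText(words.iterator());
        Path target = Files.createTempFile("extracted", ".txt");
        saveAsPlainText(text, target);
        System.out.println(text + " -> " + target);
    }
}
```

Keeping the iteration and the saving in separate methods means the output format can change (plain text, XML, a database row) without touching the loop that walks the text objects.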
Text Extraction from form XObjects in a page’s content stream
This section discusses text objects present in Form XObjects. A Form XObject is a PDF content stream that is a self-contained description of any sequence of graphics objects (including path objects, text objects, and sampled images).
For more detail, see Section 8.10, “Form XObjects,” in the ISO 32000 Reference, page 217. This document is available from the web store of the International Organization for Standardization (ISO).
Example: Content stream containing an XObject
q 0.8987885 -0.4383698 0.4383698 0.8987885 209.2356262 460.8054199 cm /GS0 gs /Fm0 Do Q Q
Example: XObject Fm0 in the resource dictionary
<<
  /Type /XObject
  /Subtype /Form
  /FormType 1
  /BBox [0 0 1000 1000]
  /Matrix [1 0 0 1 0 0]
  /Resources << /ProcSet [/PDF] >>
  /Length 58
>>
stream
0.25 0.333 1 rg
0 i
BT
/TT0 1 Tf
0 Tc 0 Tw 0 Ts 100 Tz 0 Tr 24 0 0 24 0 -24 Tm
(This is a Background.)Tj
ET
endstream
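Where the extracted text sits on the page depends on the `cm` operator that precedes `/Fm0 Do` and on the `Tm` operator inside the form. The sketch below is plain matrix arithmetic, not Toolkit code. It assumes that `Fm0` in the content stream is the XObject shown above, that the form's `/Matrix` is the identity as listed, and that no other transformation is in effect. The class name `FormPlacementSketch` is hypothetical.

```java
// A minimal sketch of PDF matrix arithmetic; it does not use the Toolkit.
class FormPlacementSketch {

    // Applies a PDF matrix [a b c d e f] to the point (x, y):
    // x' = a*x + c*y + e, y' = b*x + d*y + f.
    static double[] apply(double[] m, double x, double y) {
        return new double[] {
            m[0] * x + m[2] * y + m[4],
            m[1] * x + m[3] * y + m[5]
        };
    }

    // Rotation angle in degrees encoded in the matrix (counterclockwise is positive).
    static double rotationDegrees(double[] m) {
        return Math.toDegrees(Math.atan2(m[1], m[0]));
    }

    public static void main(String[] args) {
        // Operands of the cm operator in the page content stream.
        double[] cm = {0.8987885, -0.4383698, 0.4383698, 0.8987885,
                       209.2356262, 460.8054199};
        // Translation part of the Tm operator inside the form: 24 0 0 24 0 -24.
        double[] origin = apply(cm, 0, -24);
        System.out.printf("rotation: %.1f degrees%n", rotationDegrees(cm));
        System.out.printf("text origin on page: (%.2f, %.2f)%n", origin[0], origin[1]);
    }
}
```

With these operands the form is rotated by about 26 degrees clockwise, and the text origin lands near (198.71, 439.23) in page space.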
Example: Text Extraction
Here is a Java code sample.
Text extraction from Form Fields and Annotations
PDF Java Toolkit does not provide text extraction services for annotations and form fields. Instead, text can be obtained from the appropriate dictionary entries. Quads are not computed, and the word content is not run through the disambiguation algorithm. The position information available is limited to the entry that gives the location of the field or annotation on the page (the Rect entry). To learn more, see Section 12.7, “Interactive Forms,” in the ISO 32000 Reference, page 430. This document is available from the web store of the International Organization for Standardization (ISO).
Text Extraction from Annotations
An annotation associates an object such as a note, sound, or movie with a location on a page of a PDF document. The optional Annots entry in a page object holds an array of annotation dictionaries, each representing an annotation associated with the given page. A given annotation dictionary may be referenced from the Annots array of only one page.
The entries that are relevant in the context of Text Extraction are listed below.
Key | Type | Value | Description |
---|---|---|---|
Type | Name | Optional | The type of PDF object that this dictionary describes; if present, must be Annot for an annotation dictionary. |
Subtype | Name | Required | The type of annotation that this dictionary describes. |
Rect | Rectangle | Required | The annotation rectangle, defining the location of the annotation on the page in default user space units. |
Contents | Text string | Optional | Text to be displayed for the annotation or, if this type of annotation does not display text, an alternate description of the annotation's contents in human-readable form. In either case, this text is useful when extracting the document's contents in support of accessibility to users with disabilities or for other purposes. |
M | Date or string | Optional | The date and time when the annotation was most recently modified. Viewer applications should be prepared to accept and display a string in any format. |
T | Text string | Optional | The text label to be displayed in the title bar of the annotation's pop-up window when open and active. By convention, this entry identifies the user who added the annotation. |
V | Various | Optional; inheritable | The field's value. The format varies depending on the field type. See the descriptions of individual field types for further information. |
RV | Various | Optional; inheritable | The rich text version of the field's value. The format varies depending on the field type. |
Example: Extract information from a list of annotations
In this example, you iterate over all the pages in the PDF document. For each page, you enumerate the annotations and extract the required information as shown below.
// Requires java.util.Iterator.
PDFAnnotationList annotations = pdfPage.getAnnotationList();
// Get the annotation iterator.
Iterator<PDFAnnotation> annotIterator = annotations.iterator();
while (annotIterator.hasNext()) {
    // Get the next annotation.
    PDFAnnotation pdfAnnotation = annotIterator.next();
    // Get the /Contents entry as a String.
    String annotationContent = pdfAnnotation.getContents();
    // Get the /M entry (modification date) as an ASDate.
    ASDate modificationDate = pdfAnnotation.getModificationDate();
    // Get the location of the annotation as a PDFRectangle.
    PDFRectangle annotLocation = pdfAnnotation.getRect();
    // All markup annotations have the /T entry.
    if (pdfAnnotation instanceof PDFAnnotationMarkup) {
        // Get the /T entry as a String.
        String title = ((PDFAnnotationMarkup) pdfAnnotation).getTitle();
    }
}
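The M entry may hold either a PDF date string or a string in any other format. When the raw string is handled directly, rather than through the Toolkit's ASDate, a tolerant parser is useful. The sketch below is a hypothetical helper, not part of the PDF Java Toolkit. It assumes the `D:YYYYMMDDHHmmSSOHH'mm'` date syntax from ISO 32000 and, as a simplifying assumption, treats a missing time zone as UTC. Anything that does not match is reported as unparsed so that the caller can fall back to displaying the original string.

```java
import java.time.DateTimeException;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the PDF Java Toolkit.
class PdfDateSketch {

    // YYYYMMDDHHmmSSOHH'mm' where every part after the year is optional.
    private static final Pattern PDF_DATE = Pattern.compile(
        "^(\\d{4})(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?(\\d{2})?"
        + "(?:([Zz+-])(?:(\\d{2})'?(?:(\\d{2})'?)?)?)?$");

    // Returns the parsed date, or empty if the string is not a PDF date.
    static Optional<OffsetDateTime> parse(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        String s = raw.trim();
        if (s.startsWith("D:")) {
            s = s.substring(2);
        }
        Matcher m = PDF_DATE.matcher(s);
        if (!m.matches()) {
            return Optional.empty();
        }
        try {
            int sign = "-".equals(m.group(7)) ? -1 : 1;
            // Assumption: a missing offset is treated as UTC.
            ZoneOffset offset = ZoneOffset.ofHoursMinutes(
                sign * field(m, 8, 0), sign * field(m, 9, 0));
            return Optional.of(OffsetDateTime.of(
                field(m, 1, 0), field(m, 2, 1), field(m, 3, 1),
                field(m, 4, 0), field(m, 5, 0), field(m, 6, 0), 0, offset));
        } catch (DateTimeException e) {
            // Digits matched the pattern but are out of range, e.g. month 13.
            return Optional.empty();
        }
    }

    // Reads a numeric group, using the fallback when the group is absent.
    private static int field(Matcher m, int group, int fallback) {
        String value = m.group(group);
        return value == null ? fallback : Integer.parseInt(value);
    }
}
```

Returning an `Optional` keeps the two cases from the table apart: a present value is a real date, while an empty result means the entry should be shown as an opaque string.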
Annotation Types
The type of annotation is identified in the annotation dictionary's Subtype entry. See Section 12.5, “Annotations,” in the ISO 32000 Reference. This document is available from the web store of the International Organization for Standardization (ISO).
Many annotation types are defined as markup annotations because they are used primarily to mark up PDF documents. These annotations have text that appears as part of the annotation.
Annotations can be broadly classified into markup and non-markup annotations.
Markup Annotations
Markup annotations can be divided into the following groups:
- Free text annotations display text directly on the page. The annotation's Contents entry specifies the displayed text.
- Most other markup annotations have an associated pop-up window that may contain text. The annotation's Contents entry specifies the text to be displayed when the pop-up window is opened. These include text, line, square, circle, polygon, polyline, highlight, underline, squiggly-underline, strikeout, rubber stamp, caret, ink, and file attachment annotations.
- Sound annotations do not have a pop-up window but may also have associated text specified by the Contents entry.
- A subset of markup annotations is called text markup annotations: highlight, underline, squiggly-underline, and strikeout.
Non-Markup Annotations
The pop-up annotation type typically does not appear by itself; it is associated with a markup annotation that uses it to display text.
For all other annotation types (Link, Movie, Widget, PrinterMark, and TrapNet), the Contents entry provides an alternate representation of the annotation's contents in human-readable form, which is useful when extracting the document's contents in support of accessibility to users with disabilities or for other purposes.
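The grouping above determines how an extractor should interpret the Contents entry. The sketch below maps a Subtype name to the role that Contents plays. It is a hypothetical helper, not part of the PDF Java Toolkit. The Subtype names are those defined in ISO 32000, and treating any unlisted subtype as an alternate description is an assumption made for this example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the PDF Java Toolkit.
class ContentsRoleSketch {

    enum Role {
        DISPLAYED_ON_PAGE,      // free text: Contents is drawn on the page
        POPUP_TEXT,             // markup annotation with a pop-up window
        ASSOCIATED_TEXT,        // sound: no pop-up, but may carry text
        POPUP_WINDOW,           // pop-up: displays text for its markup annotation
        ALTERNATE_DESCRIPTION   // everything else: human-readable description
    }

    // Markup annotation subtypes whose Contents is shown in a pop-up window.
    private static final Set<String> POPUP_MARKUP = new HashSet<>(Arrays.asList(
        "Text", "Line", "Square", "Circle", "Polygon", "PolyLine", "Highlight",
        "Underline", "Squiggly", "StrikeOut", "Stamp", "Caret", "Ink",
        "FileAttachment"));

    static Role roleOfContents(String subtype) {
        if ("FreeText".equals(subtype)) {
            return Role.DISPLAYED_ON_PAGE;
        }
        if (POPUP_MARKUP.contains(subtype)) {
            return Role.POPUP_TEXT;
        }
        if ("Sound".equals(subtype)) {
            return Role.ASSOCIATED_TEXT;
        }
        if ("Popup".equals(subtype)) {
            return Role.POPUP_WINDOW;
        }
        // Link, Movie, Widget, PrinterMark, TrapNet, and (by assumption) any other subtype.
        return Role.ALTERNATE_DESCRIPTION;
    }
}
```

An extractor could use the role to decide, for example, whether Contents belongs in the page's visible text or in a separate list of comments.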
Text Extraction from Form Fields
For form fields, the contents saved by Acrobat are indicated by the V entry.
Key | Type | Value | Description |
---|---|---|---|
V | Various | Optional; inheritable | The field's value. The format varies depending on the field type. See the descriptions of individual field types for further information. |
To get the V entry using the PDF Java Toolkit methods, use the PDFField.getValueList() method as in the following example.
Example: Field Value Extraction
// Requires java.util.Iterator and java.util.List.
PDFInteractiveForm iforms = pdfDocument.getInteractiveForm();
// Get the field iterator.
Iterator<?> fieldIterator = iforms.iterator();
// Iterate over the form fields.
while (fieldIterator.hasNext()) {
    // Get the next field.
    PDFField pdfField = (PDFField) fieldIterator.next();
    // The value of a field can be a list of strings.
    // In most cases this is just a name or a single string.
    // In any case, the values are always represented as a list of strings.
    List<?> valueList = pdfField.getValueList();
    if (valueList != null) {
        Iterator<?> valueIterator = valueList.iterator();
        while (valueIterator.hasNext()) {
            // Use the value here, for example by writing it to the output.
            String value = (String) valueIterator.next();
        }
    }
}
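The V entry is marked as inheritable: when a field dictionary has no V entry of its own, the value comes from the nearest ancestor in the field hierarchy. The sketch below illustrates that lookup with a hypothetical `FieldNode` class that models only a parent link and an optional value. It is not a Toolkit class, and whether `PDFField.getValueList()` performs this lookup internally is not stated here.

```java
import java.util.Optional;

// Hypothetical model of a field dictionary; not a PDF Java Toolkit class.
class FieldNode {
    final FieldNode parent;  // null for a root field
    final String value;      // this node's own V entry, or null if absent

    FieldNode(FieldNode parent, String value) {
        this.parent = parent;
        this.value = value;
    }

    // Returns the nearest V entry, starting at this node and walking up the parent chain.
    Optional<String> effectiveValue() {
        for (FieldNode node = this; node != null; node = node.parent) {
            if (node.value != null) {
                return Optional.of(node.value);
            }
        }
        return Optional.empty();
    }
}
```

For example, a child created as `new FieldNode(root, null)` reports the value of `root`, while a child with its own value reports that value instead.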