The text extraction APIs generate a list of words in a PDF document that use Roman or Unicode encoding. Text extraction works for both non-structured and tagged PDF.
Since PDF Java Toolkit is Java-based, and Java provides native support for Unicode, the APIs provide the extracted text in Unicode format.
Beyond this, text extraction is designed to handle text in any standard encoding. Users may extract text from the entire document or from any element of the structure tree.
Determining Glyph Encoding
Determining the glyph encoding for non-tagged PDF is a complicated process. The basic outline is given in Section 9.10.2, “Mapping Character Codes to Unicode Values,” of the ISO 32000 document, page 292. This document is available from the web store of the International Organization for Standardization (ISO).
Basically, conversion to Unicode can be done if the font’s characters are identified using a character set that is known to the library. This character identification can occur if either the font uses a standard named encoding, or the characters in the font are identified by standard character names or CIDs in a well-known collection.
For Tagged PDF, the extraction of character information is somewhat less ambiguous. Section 14.8.3 of the ISO 32000 document, “Basic Layout Model” (page 581), dictates that producers of tagged PDF should provide enough information to perform the Unicode mapping according to one of the methods given in Section 9.10.2.
Tagged PDF also carries information for dealing with artifacts (such as hyphens), which aids word disambiguation.
Treatment of Glyphs and Fonts
In general, glyphs whose names are not recognizable and that do not have /ToUnicode entries cannot be converted to Unicode. See the lists of character sets and encodings given in Appendix D of the ISO 32000 Reference, page 651.
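As a rough illustration of what "recognizable" can mean for a glyph name, the sketch below resolves a name either through a lookup table or through the "uniXXXX" naming convention described in the Adobe Glyph List specification. The class GlyphNameResolver and its four-entry table are hypothetical and are not part of PDF Java Toolkit; a real table holds thousands of names, and only the single-group form of "uniXXXX" is handled here.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper for illustration only; not a PDF Java Toolkit class. */
final class GlyphNameResolver {
    // Tiny illustrative subset of standard glyph names (assumption: a real
    // implementation would load a complete glyph list).
    private static final Map<String, Integer> KNOWN_NAMES = Map.of(
            "space", 0x0020,
            "hyphen", 0x002D,
            "A", 0x0041,
            "eacute", 0x00E9);

    private GlyphNameResolver() {}

    /** Returns the Unicode text for a glyph name, or empty if the name is not recognizable. */
    static Optional<String> toUnicode(String glyphName) {
        if (glyphName == null) {
            return Optional.empty();
        }
        Integer known = KNOWN_NAMES.get(glyphName);
        if (known != null) {
            return Optional.of(new String(Character.toChars(known)));
        }
        // "uni" followed by exactly four uppercase hex digits names one UTF-16 code unit.
        if (glyphName.matches("uni[0-9A-F]{4}")) {
            int codeUnit = Integer.parseInt(glyphName.substring(3), 16);
            boolean isSurrogate = codeUnit >= 0xD800 && codeUnit <= 0xDFFF;
            if (!isSurrogate) {
                return Optional.of(String.valueOf((char) codeUnit));
            }
        }
        // Unrecognized name: without a /ToUnicode entry there is nothing to fall back on.
        return Optional.empty();
    }
}
```

With this sketch, a font-specific name such as "g123" yields an empty result, which corresponds to the case where the glyph cannot be converted unless the font supplies a /ToUnicode entry.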
All fonts except Type 3 fonts have a built-in encoding. The four possible encodings are:
- StandardEncoding
- MacRomanEncoding
- WinAnsiEncoding
- PDFDocEncoding
It is possible to override this encoding by populating the Encoding entry of a PDF font dictionary with an encoding dictionary. See Table 114, “Entries in an encoding dictionary,” page 263, in Section 9.6.6, “Character Encoding,” of the ISO 32000 document.
For text extraction, StandardEncoding is applied to simple fonts that lack an encoding.
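The override and fallback rules above can be sketched as follows. This is a minimal illustration, assuming the encoding dictionary's Differences array has already been read into a Java list of integers (codes) and strings (glyph names). The class SimpleFontEncoding, its method names, and the three-entry tables are hypothetical; they are not PDF Java Toolkit APIs, and the MacRomanEncoding and PDFDocEncoding tables are omitted for brevity.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not a PDF Java Toolkit class. */
final class SimpleFontEncoding {
    // Illustrative excerpts only: three codes from each full 256-entry table.
    private static final Map<Integer, String> STANDARD =
            Map.of(39, "quoteright", 65, "A", 96, "quoteleft");
    private static final Map<Integer, String> WIN_ANSI =
            Map.of(39, "quotesingle", 65, "A", 96, "grave");

    private SimpleFontEncoding() {}

    /** Picks the base table; a missing name falls back to StandardEncoding. */
    static Map<Integer, String> baseEncoding(String name) {
        if (name == null || name.equals("StandardEncoding")) {
            return STANDARD;
        }
        if (name.equals("WinAnsiEncoding")) {
            return WIN_ANSI;
        }
        throw new IllegalArgumentException("Table not included in this sketch: " + name);
    }

    /**
     * Overlays a Differences array on the base table. Each integer sets the
     * current code; each name that follows is assigned to consecutive codes.
     */
    static Map<Integer, String> effectiveEncoding(String baseName, List<Object> differences) {
        Map<Integer, String> table = new HashMap<>(baseEncoding(baseName));
        int code = 0;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;
            } else {
                table.put(code++, (String) item);
            }
        }
        return table;
    }
}
```

For example, calling effectiveEncoding(null, List.of(65, "Aacute", "B")) starts from StandardEncoding and reassigns code 65 to "Aacute" and code 66 to "B", while code 39 keeps its StandardEncoding name "quoteright".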
To map a character code to its Unicode value for simple fonts, or a character code to a character identifier for composite fonts, PDF Java Toolkit uses the CMap file present in the /ToUnicode entry of the font dictionary. For a detailed explanation of /ToUnicode CMaps, see Section 9.10.3, “ToUnicode CMaps,” of the ISO 32000 document.
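To give a feel for what a /ToUnicode CMap contains, the following sketch reads the beginbfchar/endbfchar sections of an already decoded CMap stream and builds a table from character code to Unicode text. It is an assumption-laden illustration rather than the toolkit's implementation: the class ToUnicodeSketch is hypothetical, bfrange sections and codespace ranges are ignored, and destination values are assumed to be well-formed UTF-16BE hex strings.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not a PDF Java Toolkit class. */
final class ToUnicodeSketch {
    // Matches the body of each "N beginbfchar ... endbfchar" section.
    private static final Pattern SECTION =
            Pattern.compile("beginbfchar(.*?)endbfchar", Pattern.DOTALL);
    // Matches one "<srcCode> <dstUnicode>" pair of hex strings.
    private static final Pattern PAIR =
            Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>");

    private ToUnicodeSketch() {}

    /** Builds a character-code to Unicode-text table from bfchar entries only. */
    static Map<Integer, String> parseBfChars(String cmapText) {
        Map<Integer, String> mapping = new HashMap<>();
        Matcher section = SECTION.matcher(cmapText);
        while (section.find()) {
            Matcher pair = PAIR.matcher(section.group(1));
            while (pair.find()) {
                int code = Integer.parseInt(pair.group(1), 16);
                mapping.put(code, decodeUtf16Be(pair.group(2)));
            }
        }
        return mapping;
    }

    // Destination values are UTF-16BE code units written as hex digits.
    private static String decodeUtf16Be(String hex) {
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return new String(bytes, StandardCharsets.UTF_16BE);
    }
}
```

Note that a single character code may map to several Unicode characters. An entry such as `<0003> <00660069>` maps code 3 to the two-character string "fi", which is how a ligature glyph can be extracted as ordinary text.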