Text extraction in PDF Java Toolkit takes three steps:
- Parse the content streams of a PDF page, including nested streams reached through the Do operator, to locate text objects.
- Apply word disambiguation rules.
- Generate a list of words.
Form XObjects are treated as nested content streams. To detect them, we look for the Do operator in a page's content stream.
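The recursion into nested content streams can be sketched as follows. This is a minimal illustration, not the PDF Java Toolkit API: the class `FormXObjectWalk`, the method `collectText`, and the map of form names to stream text are all hypothetical, and the tokenizer assumes whitespace-separated tokens with no spaces inside string operands.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; it does not use any PDF Java Toolkit classes.
public class FormXObjectWalk {

    // Arbitrary depth limit (an assumption) so that forms referencing
    // each other in a cycle cannot recurse forever.
    private static final int MAX_DEPTH = 16;

    /** Collects strings shown by Tj, descending into forms named by Do. */
    static List<String> collectText(String stream, Map<String, String> forms) {
        List<String> out = new ArrayList<>();
        walk(stream, forms, out, 0);
        return out;
    }

    private static void walk(String stream, Map<String, String> forms,
                             List<String> out, int depth) {
        if (depth > MAX_DEPTH) {
            return;
        }
        String operand = null; // most recent token that was not Tj or Do
        for (String token : stream.trim().split("\\s+")) {
            if (token.equals("Tj")) {
                // Text-showing operator: the operand is a literal string "(...)".
                if (operand != null && operand.length() >= 2
                        && operand.startsWith("(") && operand.endsWith(")")) {
                    out.add(operand.substring(1, operand.length() - 1));
                }
            } else if (token.equals("Do")) {
                // Do paints a named XObject. Only names found in the forms map
                // are treated as nested content streams; other names (images,
                // for example) are skipped.
                if (operand != null && operand.startsWith("/")) {
                    String nested = forms.get(operand.substring(1));
                    if (nested != null) {
                        walk(nested, forms, out, depth + 1);
                    }
                }
            } else {
                operand = token;
            }
        }
    }

    public static void main(String[] args) {
        String page = "BT (Hello) Tj ET /Fm1 Do BT (World) Tj ET";
        Map<String, String> forms = Map.of("Fm1", "BT (nested) Tj ET");
        System.out.println(collectText(page, forms)); // prints [Hello, nested, World]
    }
}
```

Because the nested stream is walked at the point where Do appears, text inside a form lands in the output in the same position it occupies in the page's drawing order.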
Each of these areas has its own features and issues, which are discussed in "Determining Glyph Encoding" and "Sorting and Packaging the List of Words".