Model

Text extraction in PDF Java Toolkit takes three steps:

  1. Parse the content streams of a PDF page to locate text objects.
    1. Process Do operators as nested content streams.
  2. Apply word disambiguation rules.
  3. Generate a list of words.

Form XObjects are treated as nested content streams. To detect them, we look for the Do operator in a page's content stream.
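
The nested handling of Do can be sketched as below. This is a minimal illustration and does not use the PDF Java Toolkit API: the content stream is assumed to be already tokenized into strings, and the names `DoOperatorSketch`, `collectText`, `forms`, and `MAX_DEPTH` are hypothetical. Only the Tj and Do operators are recognized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DoOperatorSketch {
    // Hypothetical limit guarding against Form XObjects that reference each other in a cycle.
    private static final int MAX_DEPTH = 16;

    /**
     * Collects the operands of Tj (show text) operators in stream order,
     * descending into a Form XObject wherever a Do operator names one.
     * "forms" maps an XObject name (assumed layout) to its tokenized stream.
     */
    public static List<String> collectText(List<String> tokens, Map<String, List<String>> forms) {
        List<String> out = new ArrayList<>();
        collect(tokens, forms, 0, out);
        return out;
    }

    private static void collect(List<String> tokens, Map<String, List<String>> forms,
                                int depth, List<String> out) {
        if (depth > MAX_DEPTH) {
            return;
        }
        // Both operators handled here take one operand, the token just before them.
        for (int i = 1; i < tokens.size(); i++) {
            String operand = tokens.get(i - 1);
            String operator = tokens.get(i);
            if (operator.equals("Tj")) {
                out.add(operand);
            } else if (operator.equals("Do")) {
                // Do may also name a non-form XObject (for example an image); those are skipped.
                List<String> nested = forms.get(operand);
                if (nested != null) {
                    collect(nested, forms, depth + 1, out);
                }
            }
        }
    }

    public static void main(String[] args) {
        List<String> page = List.of("BT", "(Hello)", "Tj", "ET", "/Fm1", "Do", "BT", "(!)", "Tj", "ET");
        Map<String, List<String>> forms = Map.of("/Fm1", List.of("BT", "(World)", "Tj", "ET"));
        System.out.println(collectText(page, forms)); // prints [(Hello), (World), (!)]
    }
}
```

Recursing at the point where Do appears keeps the text of the Form XObject in the same position it occupies in the page's drawing order, which matters for the later word disambiguation step.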

Each of these steps has its own features and known issues, which are discussed in "Determining Glyph Encoding" and "Sorting and Packaging the List of Words".