PDF Java Toolkit

Word Extraction

Assumptions for word breaking

If two or more consecutive structure elements or marked-content sequences have an ActualText entry, they should be treated as if no word break is present between them.

In a non-structured document, pages are in reading order, and the reading order within a page is the drawing order of text operators on that page. In a structured document, the structure elements are in reading order, and the reading order within an element is the order of occurrence in the element’s /Kids array. Reading order within a marked-content sequence or form XObject referenced from a structure element is the drawing order of text operators in that sequence or form.

In structured PDF, words do not span structure elements. The first character of a structure element is the first character of the first word in that element. Similarly, the last character of a structure element is the last character of the last word in that element.

Word breaking characters
  • A period or comma is a word-breaking character unless it is preceded or followed by a numeral, in which case it is included as part of a “numeral word.”
  • Multiple non-whitespace word-breaking characters are themselves considered a word. For example, the string “!@#$% ^&*” contains two words.
  • Multiple whitespace characters have no additional word-breaking meaning.
  • The hyphen character is not a word-breaking character.
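
The rules above can be sketched as a small tokenizer. This is only an illustration under stated assumptions, not the toolkit's implementation: the class WordBreakSketch and its methods are hypothetical names, and because the list above does not enumerate every word-breaking character, the sketch assumes that anything other than a letter, a digit, or a hyphen is word-breaking.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the word-breaking rules; not part of PDF Java Toolkit.
final class WordBreakSketch {
    private enum Kind { SPACE, WORD, BREAK }

    // Classify the character at index i of s.
    private static Kind classify(String s, int i) {
        char c = s.charAt(i);
        if (Character.isWhitespace(c)) return Kind.SPACE;
        // Assumption: letters, digits, and the hyphen are ordinary word characters.
        if (Character.isLetterOrDigit(c) || c == '-') return Kind.WORD;
        if (c == '.' || c == ',') {
            boolean digitBefore = i > 0 && Character.isDigit(s.charAt(i - 1));
            boolean digitAfter = i + 1 < s.length() && Character.isDigit(s.charAt(i + 1));
            // A period or comma next to a numeral belongs to a "numeral word".
            if (digitBefore || digitAfter) return Kind.WORD;
        }
        // Assumption: every other non-whitespace character is word-breaking.
        return Kind.BREAK;
    }

    // Split text into words; a run of word-breaking characters is itself a word.
    static List<String> split(String text) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Kind currentKind = Kind.SPACE;
        for (int i = 0; i < text.length(); i++) {
            Kind kind = classify(text, i);
            // A change of kind (including to or from whitespace) ends the run.
            if (kind != currentKind && current.length() > 0) {
                words.add(current.toString());
                current.setLength(0);
            }
            if (kind != Kind.SPACE) current.append(text.charAt(i));
            currentKind = kind;
        }
        if (current.length() > 0) words.add(current.toString());
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("!@#$% ^&*"));
        System.out.println(split("a well-known value, 3.14 today."));
    }
}
```

Read literally, the numeral rule also keeps a comma attached when it directly follows a digit at the end of a number, so the sketch treats such a comma as part of the numeral word.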
Line Endings

A new line of text starts a new word unless the previous line ended with a hyphen or en dash. Stated more rigorously, a character that is not a hyphen or en dash, and that is followed by a character whose baseline is not “aligned” with its own, denotes the end of a word.
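
As a rough illustration of the line-ending rule, the sketch below merges the last word of a line with the first word of the next line when the line ends in a hyphen or en dash. The class LineJoinSketch is a hypothetical name, the input is assumed to be already split into words per line, and keeping the hyphen in the merged word is an assumption, since the text above does not say whether it is dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the line-ending rule; not part of PDF Java Toolkit.
final class LineJoinSketch {
    // True if a line ending in this word continues into the next line.
    static boolean continuesOnNextLine(String lastWordOnLine) {
        if (lastWordOnLine.isEmpty()) return false;
        char last = lastWordOnLine.charAt(lastWordOnLine.length() - 1);
        return last == '-' || last == '\u2013'; // hyphen or en dash
    }

    // Flatten per-line word lists into one list, joining words split across lines.
    static List<String> joinLines(List<List<String>> lines) {
        List<String> words = new ArrayList<>();
        boolean carry = false; // previous non-empty line ended with a hyphen or en dash
        for (List<String> line : lines) {
            for (int i = 0; i < line.size(); i++) {
                String word = line.get(i);
                if (i == 0 && carry) {
                    // Assumption: the hyphen stays in the merged word.
                    int lastIndex = words.size() - 1;
                    words.set(lastIndex, words.get(lastIndex) + word);
                } else {
                    words.add(word);
                }
            }
            if (!line.isEmpty()) {
                carry = continuesOnNextLine(line.get(line.size() - 1));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(joinLines(List.of(
                List.of("word", "extrac-"),
                List.of("tion", "rules"))));
    }
}
```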

Baseline alignment

Two characters are considered to have aligned baselines if the second character’s origin is located within 10% of the height of the larger of the two characters in the direction perpendicular to the baseline of the first character.

In general, large fluctuations in character spacing, fonts, color, size, and the like are ignored for the purposes of identifying words. Additionally, no word breaking information can be gleaned from the usage pattern of the various text operators.
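
The baseline alignment test described above can be written as a short geometric check. The following is a minimal sketch with hypothetical names (BaselineSketch, Glyph, baselinesAligned). It assumes each character is described by its origin, its height, and the angle of its baseline in radians, and it treats "within 10%" as inclusive.

```java
// Hypothetical sketch of the baseline alignment test; not part of PDF Java Toolkit.
final class BaselineSketch {
    // Assumed minimal description of a character on the page.
    static final class Glyph {
        final double originX;
        final double originY;
        final double height;
        final double baselineAngle; // radians, 0 = horizontal left-to-right baseline

        Glyph(double originX, double originY, double height, double baselineAngle) {
            this.originX = originX;
            this.originY = originY;
            this.height = height;
            this.baselineAngle = baselineAngle;
        }
    }

    static boolean baselinesAligned(Glyph first, Glyph second) {
        double dx = second.originX - first.originX;
        double dy = second.originY - first.originY;
        // Unit vector perpendicular to the first character's baseline.
        double normalX = -Math.sin(first.baselineAngle);
        double normalY = Math.cos(first.baselineAngle);
        // Distance of the second origin from the first baseline, measured along the normal.
        double offset = Math.abs(dx * normalX + dy * normalY);
        // Tolerance is 10% of the height of the larger character.
        double tolerance = 0.10 * Math.max(first.height, second.height);
        return offset <= tolerance;
    }

    public static void main(String[] args) {
        Glyph a = new Glyph(0.0, 0.0, 10.0, 0.0);
        Glyph sameLine = new Glyph(6.0, 0.5, 10.0, 0.0);
        Glyph nextLine = new Glyph(0.0, -12.0, 10.0, 0.0);
        System.out.println(baselinesAligned(a, sameLine));
        System.out.println(baselinesAligned(a, nextLine));
    }
}
```

Only the offset perpendicular to the baseline is measured, so a large gap along the baseline does not affect the result. This matches the statement that fluctuations in character spacing are ignored.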

Sorting and Packaging the List of Words

You can get a list of words on a page by using a Word object and following these guidelines.

See the Text Extraction example for a code listing.

  • Iterate over a Word list.
  • Get a Unicode string representing the word.
  • Get the page number where the word is located.
  • Get a list of bounding quads for the word.
  • Iterate over the list of quads.
  • Convert quads to strings.
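
The steps above follow a simple nested-loop shape, sketched below. The Word and Quad classes here are hypothetical stand-ins defined only for this illustration; they are not the toolkit's own classes, and the real method names and signatures should be taken from the Text Extraction example and the toolkit's API documentation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of iterating a word list; these are not PDF Java Toolkit classes.
final class WordListSketch {
    // Stand-in for a bounding quad: four (x, y) corners.
    static final class Quad {
        final double[] corners; // x1, y1, x2, y2, x3, y3, x4, y4

        Quad(double... corners) {
            if (corners.length != 8) {
                throw new IllegalArgumentException("a quad needs four (x, y) corners");
            }
            this.corners = corners.clone();
        }

        // Convert the quad to a string.
        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder("[");
            for (int i = 0; i < 8; i += 2) {
                if (i > 0) sb.append(", ");
                sb.append(String.format(Locale.ROOT, "(%.1f, %.1f)", corners[i], corners[i + 1]));
            }
            return sb.append(']').toString();
        }
    }

    // Stand-in for an extracted word.
    static final class Word {
        final String text;      // Unicode string representing the word
        final int pageNumber;   // page where the word is located
        final List<Quad> quads; // bounding quads; a word may have more than one

        Word(String text, int pageNumber, List<Quad> quads) {
            this.text = text;
            this.pageNumber = pageNumber;
            this.quads = quads;
        }
    }

    // Iterate over the words, then over each word's quads, building one line per word.
    static List<String> describe(List<Word> words) {
        List<String> lines = new ArrayList<>();
        for (Word word : words) {
            StringBuilder sb = new StringBuilder();
            sb.append(word.text).append(" (page ").append(word.pageNumber).append("):");
            for (Quad quad : word.quads) {
                sb.append(' ').append(quad);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        Word hello = new Word("Hello", 1,
                List.of(new Quad(0.0, 0.0, 10.0, 0.0, 10.0, 5.0, 0.0, 5.0)));
        describe(List.of(hello)).forEach(System.out::println);
    }
}
```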