Home Page | Datalogics Developer Resources

PDF Java Toolkit

Model

Text extraction in PDF Java Toolkit takes three steps:

  1. Parse the content streams of a PDF page to locate text objects.
    1. Process Do operators as nested content streams.
  2. Apply word disambiguation rules.
  3. Generate a list of words.
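
Steps 2 and 3 can be illustrated with a toy model. The sketch below is not the PDF Java Toolkit API: the `WordGrouper` class, the `Glyph` type, and the gap-threshold rule are all hypothetical and chosen only to show what "disambiguating" glyphs into words can look like. It assumes the glyphs of a single line arrive in reading order with known horizontal positions.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of steps 2 and 3: positioned glyphs in, words out. Not the toolkit's API. */
final class WordGrouper {

    /** Hypothetical positioned glyph: its character, left edge, and advance width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Assumed rule (for illustration only): a word ends at a space glyph, or when
     * the gap between the previous glyph's right edge and the next glyph's left
     * edge is larger than maxGap.
     */
    static List<String> toWords(List<Glyph> glyphs, double maxGap) {
        List<String> words = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double previousRight = Double.NaN; // NaN until the first glyph has been seen

        for (Glyph g : glyphs) {
            boolean wideGap = !Double.isNaN(previousRight) && g.x - previousRight > maxGap;
            if (g.ch == ' ' || wideGap) {
                flush(current, words);
            }
            if (g.ch != ' ') {
                current.append(g.ch);
            }
            previousRight = g.x + g.width;
        }
        flush(current, words);
        return words;
    }

    /** Moves the pending characters, if any, into the word list. */
    private static void flush(StringBuilder current, List<String> words) {
        if (current.length() > 0) {
            words.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Calling `WordGrouper.toWords(glyphs, 2.0)` on glyphs that are tightly spaced within a word and widely spaced between words returns one string per word. A real extractor also has to handle multiple lines, rotation, and font-dependent spacing, which this sketch ignores.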

Form XObjects are treated as nested content streams. To detect them, we look for the Do operator in a page's content stream.
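
The nested-stream idea can be sketched as a recursive walk. The code below is a simplified model and not the PDF Java Toolkit API: `ContentStreamWalker`, `Instruction`, and the single name-to-form map are hypothetical. It assumes the content stream has already been parsed into operator/operand pairs, treats `Tj` as the only text-showing operator, and looks up every `Do` operand in one shared map (in a real PDF each form XObject carries its own resources).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of step 1: collect text from a page, descending into form XObjects. */
final class ContentStreamWalker {

    /** Arbitrary nesting limit so that a form referring to itself cannot recurse forever. */
    private static final int MAX_DEPTH = 32;

    /** Hypothetical pre-parsed instruction: an operator name plus one string operand. */
    static final class Instruction {
        final String operator;
        final String operand;

        Instruction(String operator, String operand) {
            this.operator = operator;
            this.operand = operand;
        }
    }

    /** Returns the text strings in the order they are encountered, including nested ones. */
    static List<String> collectText(List<Instruction> page,
                                    Map<String, List<Instruction>> forms) {
        List<String> out = new ArrayList<>();
        walk(page, forms, out, 0);
        return out;
    }

    private static void walk(List<Instruction> stream,
                             Map<String, List<Instruction>> forms,
                             List<String> out,
                             int depth) {
        if (depth > MAX_DEPTH) {
            return;
        }
        for (Instruction instruction : stream) {
            if ("Tj".equals(instruction.operator)) {
                // Text-showing operator: record its string operand.
                out.add(instruction.operand);
            } else if ("Do".equals(instruction.operator)) {
                // Do names an XObject. If it is a known form, walk it as a nested stream.
                List<Instruction> nested = forms.get(instruction.operand);
                if (nested != null) {
                    walk(nested, forms, out, depth + 1);
                }
                // Otherwise the name refers to something else (for example an image)
                // and contributes no text in this model.
            }
        }
    }
}
```

Because the walk descends at the point where `Do` appears, text inside a form is reported in the position where the form is painted, between the page-level text that precedes and follows it.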

Each of these steps has its own features and issues, which are discussed in "Determining Glyph Encoding" and "Sorting and Packaging the List of Words".

© 2021 Datalogics, Inc. All rights reserved