OCR Processing

Adobe PDF Library Java/Maven

OCR processing uses a default set of fonts, but you can add fonts for other languages to the CandidateFontNames member of the OCRParams object.

For example, to process text in Simplified Chinese, add a font that covers it to the CandidateFontNames list:

import java.util.ArrayList;
import java.util.List;

// Copy the current candidate fonts so the defaults are preserved.
List<String> newFontNames = new ArrayList<>();
for (String fontName : ocrParams.getCandidateFontNames()) {
    newFontNames.add(fontName);
}

// "Chin" refers to the font file Chin.ttf.
newFontNames.add("Chin");
ocrParams.setCandidateFontNames(newFontNames);
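
If several fonts are added in different places, the copy-and-append step above can be wrapped in a small helper that also avoids listing the same font twice. The following is a minimal sketch, not part of the Adobe PDF Library API: the class name CandidateFonts and the method withAdded are hypothetical, and the helper only manipulates plain string lists.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Adobe PDF Library API.
final class CandidateFonts {
    private CandidateFonts() {}

    /**
     * Returns a new list containing the current candidate font names followed
     * by the extra names. Any name that has already been seen is skipped, so
     * each font appears once, at its first position.
     */
    static List<String> withAdded(Iterable<String> current, String... extras) {
        // LinkedHashSet keeps insertion order and ignores repeated names.
        Set<String> merged = new LinkedHashSet<>();
        for (String name : current) {
            merged.add(name);
        }
        for (String name : extras) {
            merged.add(name);
        }
        return new ArrayList<>(merged);
    }
}
```

Assuming an ocrParams object as in the example above, a call might look like ocrParams.setCandidateFontNames(CandidateFonts.withAdded(ocrParams.getCandidateFontNames(), "Chin")).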

Make sure you provide enough fonts to cover the languages and scripts that might appear in the images embedded in your source PDF documents. If your PDF documents contain images with text in multiple languages, supply fonts for each of those languages, especially if more than one language appears within a single sentence or phrase. For example, an image might contain text written in the Korean alphabet alongside Western (Arabic) numerals...

The OCR processing engine uses the first font it finds that can successfully render the text recognized in an image. If you are providing fonts for a Latin alphabet, order the candidate fonts so that proportional fonts appear before non-proportional (monospaced) fonts.
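
Because the engine takes the first usable font, the order of the candidate list matters. The sketch below shows one way to move proportional fonts ahead of non-proportional ones while keeping the relative order within each group. It is an illustration under assumptions: the class FontOrdering and the method proportionalFirst are hypothetical names, and the caller must supply the set of monospaced font names, since nothing here asks the library which fonts are monospaced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Adobe PDF Library API.
final class FontOrdering {
    private FontOrdering() {}

    /**
     * Returns a new list with every name that is NOT in the monospaced set
     * first, followed by the names that are in it. The original relative
     * order is preserved inside each group (a stable partition).
     */
    static List<String> proportionalFirst(List<String> names, Set<String> monospaced) {
        List<String> proportional = new ArrayList<>();
        List<String> fixedWidth = new ArrayList<>();
        for (String name : names) {
            if (monospaced.contains(name)) {
                fixedWidth.add(name);
            } else {
                proportional.add(name);
            }
        }
        proportional.addAll(fixedWidth);
        return proportional;
    }
}
```

The reordered list can then be passed to setCandidateFontNames in the same way as the list built earlier.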

The quality of the output provided by the OCR engine depends on the fonts you choose. Decorative fonts, such as Zapf Chancery, generally provide poor results. Try to use standard block fonts that would appear in a novel or magazine instead.

The OCR engine does not support languages whose text is not written from left to right, including:

  • Korean vertical
  • Hebrew
  • Arabic
  • Urdu
  • Persian
  • Syriac
  • Sindhi
  • Kurdish with Arabic script
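
If you have a sample of the text you expect in your images (for instance, from a document title or a known caption), you can check in advance whether it contains right-to-left characters. The following is a rough sketch using only the standard java.lang.Character API. The class ScriptDirection and the method containsRightToLeft are hypothetical names, and the check covers right-to-left scripts such as Hebrew and Arabic only; it cannot detect vertical layouts such as Korean vertical.

```java
// Hypothetical helper for illustration; not part of the Adobe PDF Library API.
final class ScriptDirection {
    private ScriptDirection() {}

    /**
     * Returns true if the sample contains at least one character whose
     * Unicode directionality is right-to-left (for example Hebrew) or
     * right-to-left Arabic (for example Arabic, Persian, Urdu, Syriac).
     */
    static boolean containsRightToLeft(String sample) {
        for (int i = 0; i < sample.length(); ) {
            int codePoint = sample.codePointAt(i);
            byte direction = Character.getDirectionality(codePoint);
            if (direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT
                    || direction == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                return true;
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(codePoint);
        }
        return false;
    }
}
```

A sample that returns true is likely to fall under the unsupported languages listed above, so you may want to skip OCR for such documents or handle them separately.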

TIP:

Use the GetAvailableLanguages() method of the OCREngine class to list the available languages, and the IsLanguageAvailable() method to check whether a specific language is available.