Best Practices

OCR Processing

Adobe PDF Library Java/Maven

A default set of fonts for OCR processing ships with the Maven package. To support additional languages, add font names to the candidateFontNames member of the OCRParams object.

For example, to add a font for Simplified Chinese text, extend the candidateFontNames list:

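import java.util.ArrayList;
import java.util.List;

// ocrParams is an existing OCRParams instance.
// Copy the current candidate fonts so the defaults are kept.
List<String> newFontNames = new ArrayList<String>();
for (String fontName : ocrParams.getCandidateFontNames()) {
    newFontNames.add(fontName);
}

// Append the extra font by name; Chin.ttf is the name of the font file.
newFontNames.add("Chin");
ocrParams.setCandidateFontNames(newFontNames);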

Make sure you provide enough fonts to cover the expected languages and scripts in your source PDF documents. If documents contain text in multiple languages, supply fonts for each, especially if more than one language appears within a single sentence. The OCR engine will use the first font that can successfully render text from a graphics image, so the order of the candidate fonts matters. For Latin alphabets, order the fonts so that proportional fonts appear before non-proportional (monospaced) fonts.
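Because the first usable font wins, it can help to reorder the candidate list before passing it to setCandidateFontNames. The following is a minimal sketch of one way to do that in plain Java. The CandidateFontOrder class and its proportionalFirst method are hypothetical helpers, not part of the Adobe PDF Library API, and the caller is assumed to know which of its font names are monospaced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Adobe PDF Library API.
final class CandidateFontOrder {
    private CandidateFontOrder() {}

    /**
     * Returns a new list with the proportional fonts first, followed by the
     * fonts named in the monospaced set. The relative order inside each
     * group is preserved, and the input list is not modified.
     */
    static List<String> proportionalFirst(List<String> fontNames, Set<String> monospaced) {
        List<String> proportional = new ArrayList<>();
        List<String> fixedWidth = new ArrayList<>();
        for (String name : fontNames) {
            if (monospaced.contains(name)) {
                fixedWidth.add(name);
            } else {
                proportional.add(name);
            }
        }
        proportional.addAll(fixedWidth);
        return proportional;
    }
}
```

The returned list can then be supplied as the candidate font names. The set of monospaced names has to come from your own knowledge of the fonts you install; this sketch does not inspect font files.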

The quality of the output depends on the fonts you choose. Decorative fonts such as Zapf Chancery generally provide poor results. Use standard block fonts that would appear in a novel or magazine instead.

Use GetAvailableLanguages() on the OCREngine class to list installed language packs, and IsLanguageAvailable() to check if a specific language is present.
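A check like this is often done once at startup, before any documents are processed. The sketch below shows the comparison step only, in plain Java. It assumes the installed languages have already been obtained (for example from GetAvailableLanguages()) and converted to a collection of strings; the LanguageCheck class, the missingLanguages method, and the language identifiers used with it are hypothetical and are not part of the library.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the Adobe PDF Library API.
final class LanguageCheck {
    private LanguageCheck() {}

    /**
     * Returns the required languages that do not appear in the available
     * collection, in the order they were requested.
     */
    static List<String> missingLanguages(Collection<String> required, Collection<String> available) {
        Set<String> installed = new HashSet<>(available);
        List<String> missing = new ArrayList<>();
        for (String language : required) {
            if (!installed.contains(language)) {
                missing.add(language);
            }
        }
        return missing;
    }
}
```

If the returned list is not empty, the application can report the missing language packs up front instead of producing poor OCR output later.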

Request a Language

Datalogics works with the languages/scripts supported by Tesseract. Contact us for more information about a specific language.