OCR Processing

Adobe PDF Library .NET

Optical Character Recognition (OCR) is the process that converts an image of text into a machine-readable text format. For example, if you scan a form or a receipt, your computer saves the scan as an image file — you can't edit, search, or count words in it. OCR converts the image into a text document with its contents stored as text data, making it editable and searchable.

Adobe PDF Library provides a default set of fonts for OCR processing, and you can add fonts for other languages.

Language Options

Datalogics offers NuGet packages containing OCR training data for the most requested languages.

If the language you need is not among them, Datalogics works with the languages and scripts supported by Tesseract. Contact us for more information.

Adding a Language

After installing the NuGet package for your language, add the name of its font to the candidateFontNames property. Using Chinese – Simplified as an example:

// Copy the existing candidate font names into a new list.
List<string> newFontNames = new List<string>();
foreach (string fontName in ocrParams.candidateFontNames) {
    newFontNames.Add(fontName);
}

// Append the font for the new language; "Chin" refers to the font file Chin.ttf.
newFontNames.Add("Chin");
ocrParams.candidateFontNames = newFontNames;

Provide enough fonts to cover every language and script you expect in your source PDF documents. If documents contain text in multiple languages, supply fonts for each, especially if more than one language appears within a single sentence. The OCR engine uses the first font that can successfully render text from a graphics image, so the order of the fonts matters. For Latin alphabets, set up OCR processing so that proportional fonts appear before non-proportional (monospaced) fonts.

The quality of the output depends on the fonts you choose. Decorative fonts such as Zapf Chancery generally provide poor results. Use standard block fonts that would appear in a novel or magazine instead.
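The ordering advice above can be sketched as a small helper that merges groups of font names in priority order and drops duplicates. This is only an illustration: MergeInPriorityOrder is a hypothetical helper, not part of Adobe PDF Library, and every font name except "Chin" is a placeholder. The sketch assumes that the order of the resulting list is the order in which the OCR engine tries the fonts.

```csharp
using System;
using System.Collections.Generic;

public static class CandidateFontOrdering
{
    // Hypothetical helper (not part of Adobe PDF Library).
    // Merges groups of font names so that earlier groups come first,
    // skipping any name that has already been added (case-insensitive).
    public static List<string> MergeInPriorityOrder(params IEnumerable<string>[] groups)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (IEnumerable<string> group in groups)
        {
            foreach (string name in group)
            {
                if (seen.Add(name))
                {
                    result.Add(name);
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        // Placeholder names; substitute the fonts you actually have installed.
        var proportionalLatin = new List<string> { "ExampleSerif", "ExampleSans" };
        var monospacedLatin = new List<string> { "ExampleMono" };
        var otherScripts = new List<string> { "Chin" }; // Chin.ttf from the example above

        // Proportional Latin fonts first, then monospaced, then other scripts.
        List<string> ordered = MergeInPriorityOrder(
            proportionalLatin, monospacedLatin, otherScripts);

        Console.WriteLine(string.Join(", ", ordered));

        // In real code the result would then be assigned as in the earlier snippet:
        // ocrParams.candidateFontNames = ordered;
    }
}
```

Keeping the merge separate from the assignment makes it easy to review the final order before handing it to the OCR parameters.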

Use GetAvailableLanguages() on the OCREngine class to list installed language packs, and IsLanguageAvailable() to check if a specific language file is present.

The OCR engine does not support right-to-left or vertical scripts, including Chinese vertical, Korean vertical, Japanese vertical, Hebrew, Arabic, Urdu, Persian, Syriac, Sindhi, and Kurdish with Arabic script.
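One way to combine the two points above is to validate the languages a job needs before starting OCR. The sketch below is an assumption-laden illustration, not the library's API: FindProblems is a hypothetical helper, the installed list stands in for whatever GetAvailableLanguages() reports, and the language names are placeholders because the exact identifiers the engine uses are not shown here. Check the API reference for the real return types and naming.

```csharp
using System;
using System.Collections.Generic;

public static class OcrLanguageCheck
{
    // Scripts listed above as unsupported (right-to-left or vertical).
    // The spelling of each entry is an assumption made for this sketch.
    private static readonly HashSet<string> UnsupportedScripts =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "Chinese vertical", "Korean vertical", "Japanese vertical",
            "Hebrew", "Arabic", "Urdu", "Persian", "Syriac", "Sindhi",
            "Kurdish (Arabic script)"
        };

    // Hypothetical helper: returns one message per requested language that
    // is either an unsupported script or missing from the installed list.
    public static List<string> FindProblems(
        IEnumerable<string> requested, IEnumerable<string> installed)
    {
        var installedSet = new HashSet<string>(installed, StringComparer.OrdinalIgnoreCase);
        var problems = new List<string>();
        foreach (string language in requested)
        {
            if (UnsupportedScripts.Contains(language))
            {
                problems.Add(language + ": right-to-left or vertical script, not supported");
            }
            else if (!installedSet.Contains(language))
            {
                problems.Add(language + ": language pack not installed");
            }
        }
        return problems;
    }

    public static void Main()
    {
        // Stand-in for the installed language packs reported by the engine.
        var installed = new[] { "English", "Chinese - Simplified" };
        var requested = new[] { "English", "Hebrew", "German" };

        foreach (string problem in FindProblems(requested, installed))
        {
            Console.WriteLine(problem);
        }
    }
}
```

Taking the installed list as a plain parameter keeps the check independent of the engine, so it can be unit tested without any language packs present.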