OCR Processing
Get Started
Optical Character Recognition (OCR) is the process that converts an image of text into a machine-readable text format. For example, if you scan a form or a receipt, your computer saves the scan as an image file — you can't edit, search, or count words in it. OCR converts the image into a text document with its contents stored as text data, making it editable and searchable.
Adobe PDF Library provides a default set of fonts for OCR processing, and you can add fonts for other languages.
Language Options
Datalogics offers NuGet packages for the most requested language options. Each language listed below has its own training data package on NuGet.
Chinese – Simplified
Chinese – Traditional
Dutch
English
French
German
Italian
Japanese
Korean
Portuguese
Spanish
Adding a Language
After installing the NuGet package for your language, add the language's font to the candidateFontNames property. The following example uses Chinese – Simplified:
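    // Requires: using System.Collections.Generic;
    // Copy the existing candidate fonts so the defaults are preserved.
    List<string> newFontNames = new List<string>();
    foreach (string fontName in ocrParams.candidateFontNames) {
        newFontNames.Add(fontName);
    }
    // Append the font for the new language; Chin.ttf is the name of the font file.
    newFontNames.Add("Chin");
    ocrParams.candidateFontNames = newFontNames;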
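When several language packages are installed, the same copy-and-append step repeats for each font. The following is a minimal sketch, not part of Adobe PDF Library: FontListHelper and AppendDistinct are hypothetical names, and the case-insensitive duplicate check is an assumption about how font names should be compared. The sketch only assumes that the candidate font names can be enumerated as strings, as in the example above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Adobe PDF Library.
public static class FontListHelper
{
    // Returns a new list holding the existing names, followed by each
    // addition that is not already present. Names are compared without
    // regard to case (an assumption made for this sketch).
    public static List<string> AppendDistinct(IEnumerable<string> existing, params string[] additions)
    {
        var result = new List<string>(existing);
        var seen = new HashSet<string>(result, StringComparer.OrdinalIgnoreCase);
        foreach (string name in additions)
        {
            // HashSet.Add returns false when the name was already seen.
            if (seen.Add(name))
            {
                result.Add(name);
            }
        }
        return result;
    }
}
```

With such a helper, the example above could be written as a single assignment of FontListHelper.AppendDistinct(ocrParams.candidateFontNames, "Chin") back to ocrParams.candidateFontNames, assuming the property accepts a List<string> as it does in that example.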
Make sure you provide enough fonts to cover the expected languages and scripts in your source PDF documents. If documents contain text in multiple languages, supply fonts for each — especially if more than one language appears within a single sentence. The OCR engine will use the first font that can successfully render text from a graphics image. For Latin alphabets, set up OCR processing so that proportional fonts appear before non-proportional fonts.
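One way to follow the ordering advice for Latin alphabets is to sort the candidate list before assigning it to the OCR parameters. This is a sketch under stated assumptions: FontOrdering and ProportionalFirst are hypothetical names, the font names in the comment are placeholders, and the caller decides which names count as non-proportional, because nothing in this section describes how to classify fonts automatically.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Adobe PDF Library.
public static class FontOrdering
{
    // Moves proportional fonts ahead of non-proportional (monospaced) ones.
    // The caller supplies the set of names it treats as monospaced,
    // for example { "Courier New" } (a placeholder name).
    public static List<string> ProportionalFirst(IEnumerable<string> fontNames, ISet<string> monospaced)
    {
        // Enumerable.OrderBy is a stable sort, so fonts keep their
        // original relative order within each of the two groups.
        return fontNames
            .OrderBy(name => monospaced.Contains(name) ? 1 : 0)
            .ToList();
    }
}
```

Because the sort is stable, any priority already expressed by the order of the list is preserved within the proportional group and within the monospaced group. That matters here because the OCR engine uses the first font that can render the text.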
The quality of the output depends on the fonts you choose. Decorative fonts such as Zapf Chancery generally provide poor results. Use standard block fonts that would appear in a novel or magazine instead.
Call GetAvailableLanguages() on the OCREngine class to list the installed language packs, and IsLanguageAvailable() to check whether a specific language file is present.
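A check like this is most useful before processing starts, so that a missing language pack is reported up front. The exact signatures of GetAvailableLanguages() and IsLanguageAvailable() are not given here, so the sketch below does not call them directly. Instead it takes a delegate that the caller would wire to the availability check. LanguageCheck and FindMissing are hypothetical names, and identifying languages by string is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of Adobe PDF Library.
public static class LanguageCheck
{
    // Returns the required languages for which isAvailable reports false,
    // in the order they were requested. The delegate stands in for the
    // library's availability check, whose signature is assumed here.
    public static List<string> FindMissing(IEnumerable<string> required, Func<string, bool> isAvailable)
    {
        return required
            .Where(language => !isAvailable(language))
            .ToList();
    }
}
```

If the returned list is not empty, the application can stop and name the missing packages instead of producing incomplete OCR output.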