OCR Processing

.NET Framework Adobe PDF Library SDK


Optical Character Recognition (OCR) is the process of converting an image of text into machine-readable text. For example, if you scan a form or a receipt, your computer saves the scan as an image file, so you can’t use a text editor to edit, search, or count the words in it. OCR converts the image into a text document whose contents are stored as text data, so it can be edited and searched.

Adobe PDF Library provides a default set of fonts for OCR processing, and you can add fonts for other languages.

Language Options

Chinese – Simplified

Chinese – Traditional

Adding a Language

After installing the NuGet package for your language, add the new font to CandidateFontNames as follows. We’ll use Chinese – Simplified as an example:

// Copy the current candidate fonts into a new list
List<string> newFontNames = new List<string>();
foreach (string fontName in ocrParams.candidateFontNames)
{
    newFontNames.Add(fontName);
}

// Append the new font; Chin.ttf is the name of the font file
newFontNames.Add("Chin");
ocrParams.candidateFontNames = newFontNames;
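If you add fonts for several languages, the copy-and-append loop above can be wrapped in a small helper. The following is a minimal sketch, not part of the SDK: FontNameHelper and AppendFontNames are hypothetical names, and the sketch assumes only that the existing font names can be enumerated as strings, as the loop above does. It also skips names that are already in the list, which is a design choice of this sketch rather than a documented requirement.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Adobe PDF Library.
public static class FontNameHelper
{
    // Returns a new list holding the existing names followed by any extra
    // names that are not already present. Order is preserved and the
    // comparison is exact (case-sensitive).
    public static List<string> AppendFontNames(
        IEnumerable<string> existing, params string[] extra)
    {
        var seen = new HashSet<string>();
        var merged = new List<string>();

        foreach (string name in existing)
        {
            if (seen.Add(name))
            {
                merged.Add(name);
            }
        }

        foreach (string name in extra)
        {
            if (seen.Add(name))
            {
                merged.Add(name);
            }
        }

        return merged;
    }
}
```

Under these assumptions, the example above would reduce to ocrParams.candidateFontNames = FontNameHelper.AppendFontNames(ocrParams.candidateFontNames, "Chin");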

Provide enough fonts to cover every language and script that might appear in the images embedded in your source PDF documents. If those images contain text in multiple languages, supply fonts for each language, especially when more than one language appears within a single sentence or phrase. For example, an image might contain text written in the Korean alphabet alongside Western (Arabic) numerals. The OCR engine uses the first font it finds that can successfully render the text recognized in an image. If you are providing fonts for a Latin alphabet, order the candidate fonts so that proportional fonts appear before non-proportional (monospaced) fonts.
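Because the engine takes the first font that works, the order of the candidate list matters. The sketch below shows one way to put proportional fonts ahead of non-proportional ones before assigning the list. FontCandidate and FontOrdering are hypothetical types introduced for illustration; the SDK does not report whether a font is proportional, so that flag is something you would supply yourself.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical type: a font name plus a caller-supplied flag.
public sealed class FontCandidate
{
    public string Name { get; }
    public bool IsProportional { get; }

    public FontCandidate(string name, bool isProportional)
    {
        Name = name;
        IsProportional = isProportional;
    }
}

// Hypothetical helper, not part of Adobe PDF Library.
public static class FontOrdering
{
    // Returns the font names with proportional fonts first. OrderBy is a
    // stable sort, so fonts within each group keep their original order.
    public static List<string> ProportionalFirst(IEnumerable<FontCandidate> fonts)
    {
        return fonts
            .OrderBy(f => f.IsProportional ? 0 : 1)
            .Select(f => f.Name)
            .ToList();
    }
}
```

The font names you pass must still match fonts that are actually available to the OCR engine; the sketch only decides their order.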

The quality of the OCR output depends on the fonts you choose. Decorative fonts, such as Zapf Chancery, generally produce poor results; prefer standard text fonts of the kind used in a novel or magazine. Currently, Datalogics’ OCR engine does not support languages whose characters are not written from left to right, including:

Korean vertical

Hebrew

Arabic

Urdu

Persian

Syriac

Sindhi

Kurdish with Arabic script
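If your application accepts a language choice from a user or a configuration file, it may help to reject the languages listed above before starting OCR. The following is a minimal sketch using plain strings; OcrLanguageCheck is a hypothetical class, and the names are copied from the list above rather than taken from any SDK enumeration. The list above is introduced with "including", so it may not be exhaustive.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical pre-flight check, not part of Adobe PDF Library.
public static class OcrLanguageCheck
{
    // Names copied from the unsupported-language list above.
    private static readonly HashSet<string> Unsupported =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "Korean vertical",
            "Hebrew",
            "Arabic",
            "Urdu",
            "Persian",
            "Syriac",
            "Sindhi",
            "Kurdish with Arabic script"
        };

    // True if the name matches an entry in the list, ignoring case and
    // surrounding whitespace. A false result does not guarantee support.
    public static bool IsListedAsUnsupported(string languageName)
    {
        return languageName != null && Unsupported.Contains(languageName.Trim());
    }
}
```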

Tip:

Use the GetAvailableLanguages() method of the OCREngine class to list the available languages, and the IsLanguageAvailable() method to check whether a specific language file is available.
