OCR Processing
Adobe PDF Library .NET
Get Started
Optical Character Recognition, or OCR, is the process that converts an image of text into a machine-readable text format. For example, if you scan a form or a receipt, your computer saves the scan as an image file, meaning you can’t use a text editor to edit, search, or count the words in the image. OCR converts the image into a text document with its contents stored as text data, therefore it can be edited and searched.
With Adobe PDF Library, a default set of fonts for OCR processing is available, but you can add fonts from other languages.
Language Options
Datalogics offers NuGet packages for the most requested language options we offer. Clicking the links below will take you to the specific language training data for Adobe PDF Library.
If you don't see the language you need, please note that Datalogics works with the languages/scripts supported by Tesseract. Contact us for more information.
Adding Language
After installing the NuGet package for your language, update the CandidateFontNames as follows. We’ll use Chinese – Simplified as an example:
Make sure you provide enough fonts to cover the expected languages and scripts that might appear in the images embedded in your source PDF documents. If you have PDF documents with images that feature text in multiple languages, supply fonts applicable to each of those languages, especially if more than one language appears within a single sentence or phrase. For example, an image might feature text written in the Korean alphabet but also featuring western (Arabic) numerals. The OCR processing engine will use the first font that it finds that can successfully render text drawn from a graphics image. If you are providing fonts for a Latin alphabet, set up OCR processing so that proportional fonts appear before non-proportional fonts.
The quality of the output provided by the OCR engine depends on the fonts you choose. Decorative fonts, such as Zapf Chancery, generally provide poor results. Try to use standard block fonts that would appear in a novel or magazine instead.
TIP:
The GetAvailableLanguages() method of the OCREngine class can be used to list the available languages and the IsLanguageAvailble() method can be used to check if a specific language file is available.
Currently, Datalogics’ OCR engine does not support languages where characters are presented from left to right, or vertical, including Chinese vertical, Korean vertical, Japanese vertical, Hebrew, Arabic, Urdu, Persian, Syriac, Sindhi, and Kurdish with Arabic script.