Adobe® PDF Library

Required Components, .NET Core

The .NET Core interface requires its native Windows dependencies when deploying to Windows, its native Linux dependencies when deploying to Linux, and its native macOS dependencies when deploying to macOS.

For .NET Core, the application must include Datalogics.PDFL.dll on Windows, Linux, and macOS, along with the native library for the target platform:

  • DL180PDFLPINVOKE.dll (Windows)
  • libDL180PDFLPINVOKE.so (Linux)
  • libDL180PDFLPINVOKE.dylib (macOS)
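The platform-to-file mapping in the list above can be sketched in code, for instance as part of a startup check. This is a minimal illustration using only standard .NET APIs. The class and method names (PdflNativeNames, PInvokeLibraryFor, CurrentPlatform) are hypothetical and are not part of the Library.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of the Adobe PDF Library.
public static class PdflNativeNames
{
    // Returns the native library file name listed above for the given platform.
    public static string PInvokeLibraryFor(OSPlatform platform)
    {
        if (platform == OSPlatform.Windows) return "DL180PDFLPINVOKE.dll";
        if (platform == OSPlatform.Linux) return "libDL180PDFLPINVOKE.so";
        if (platform == OSPlatform.OSX) return "libDL180PDFLPINVOKE.dylib";
        throw new PlatformNotSupportedException("No native library is listed for " + platform);
    }

    // Detects which of the three supported platforms the application is running on.
    public static OSPlatform CurrentPlatform()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) return OSPlatform.Windows;
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Linux)) return OSPlatform.Linux;
        if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX)) return OSPlatform.OSX;
        throw new PlatformNotSupportedException("Unsupported operating system");
    }
}
```

Calling PdflNativeNames.PInvokeLibraryFor(PdflNativeNames.CurrentPlatform()) yields the file name to look for next to Datalogics.PDFL.dll.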

For Windows systems only, the Microsoft Visual Studio 2017 C++ Runtime .dll files are also required.  Copy the .dll files for the Microsoft runtime libraries from the Microsoft.VC141.CRT subdirectory into the executable directory, the same directory where you copy Datalogics.PDFL.dll. It is important to use the runtime libraries supplied with the specific distribution to ensure the correct version will be found.
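The copy step described above could be scripted as part of a build or deployment task. The sketch below uses only System.IO. The class name VcRuntimeCopier and both directory parameters are hypothetical, and the caller supplies the actual location of the Microsoft.VC141.CRT subdirectory within the distribution.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of the Adobe PDF Library.
public static class VcRuntimeCopier
{
    // Copies every .dll file from the runtime directory into the executable
    // directory and returns the number of files copied. Existing files are
    // overwritten so that the versions supplied with the distribution are used.
    public static int CopyRuntimeDlls(string crtDirectory, string executableDirectory)
    {
        if (!Directory.Exists(crtDirectory))
            throw new DirectoryNotFoundException("Runtime directory not found: " + crtDirectory);

        Directory.CreateDirectory(executableDirectory);
        int copied = 0;
        foreach (string source in Directory.GetFiles(crtDirectory, "*.dll"))
        {
            string target = Path.Combine(executableDirectory, Path.GetFileName(source));
            File.Copy(source, target, overwrite: true);
            copied++;
        }
        return copied;
    }
}
```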

The table below lists the rest of the distribution requirements.

Windows                   Linux                       macOS                           Description
DL180PDFL.dll             libDL180pdfl.so             DL180pdfl.framework             PDF Library primary DLL file
DL180ACE.dll              libDL180ACE.so              DL180ACE.framework              Adobe Color Engine
DL180AdobeXMP.dll         libDL180AdobeXMP.so         DL180AdobeXMP.framework         XMP metadata
DL180AGM.dll              libDL180AGM.so              DL180AGM.framework              Adobe Graphics Manager printing engine
DL180ARE.dll              libDL180ARE.so              DL180ARE.framework              Adobe Raster Express
DL180AXE8SharedExpat.dll  libDL180AXE8SharedExpat.so  DL180AXE8SharedExpat.framework  XML processing
DL180BIB.dll              libDL180BIB.so              DL180BIB.framework              Bravo interface binder
DL180BIBUtils.dll         libDL180BIBUtils.so         DL180BIBUtils.framework         Bravo interface binder utilities
DL180CoolType.dll         libDL180CoolType.so         DL180CoolType.framework         CoolType Typography Engine
DL180JP2K.dll             libDL180JP2K.so             DL180JP2K.framework             JPEG2000 Library
icuuc68.dll               libicuuc.68.so              libicuuc.dylib                  International Components for Unicode
icudt68.dll               libicudata.68.so            libicudata.dylib                International Components for Unicode
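A deployment can be checked against the table above before the application starts. The following sketch reports which required entries are absent from a directory. It assumes the files sit together in a single deployment directory, which this document does not state explicitly. The class DeploymentCheck and its members are hypothetical names. Only the Linux column is written out here, and the Windows or macOS column can be substituted in the same way.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the Adobe PDF Library.
public static class DeploymentCheck
{
    // Linux file names copied from the table above.
    public static readonly string[] LinuxCoreLibraries =
    {
        "libDL180pdfl.so", "libDL180ACE.so", "libDL180AdobeXMP.so",
        "libDL180AGM.so", "libDL180ARE.so", "libDL180AXE8SharedExpat.so",
        "libDL180BIB.so", "libDL180BIBUtils.so", "libDL180CoolType.so",
        "libDL180JP2K.so", "libicuuc.68.so", "libicudata.68.so"
    };

    // Returns the names that exist neither as a file nor as a directory.
    // The directory test covers macOS .framework bundles, which are folders.
    public static List<string> FindMissing(string deployDirectory, IEnumerable<string> requiredNames)
    {
        var missing = new List<string>();
        foreach (string name in requiredNames)
        {
            string path = Path.Combine(deployDirectory, name);
            if (!File.Exists(path) && !Directory.Exists(path))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

An empty result from DeploymentCheck.FindMissing(directory, DeploymentCheck.LinuxCoreLibraries) means every listed file was found.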

Plug-ins

There are three optional plug-ins with supporting library files:

DL180XPS2PDF.ppi       XPS to PDF conversion
DL180PDFlattener.ppi   Transparency Flattener
DL180PDFProcessor.ppi  PDF/A and PDF/X conversions

These two library files are needed if you want to use the plug-ins:

Windows               Linux                   macOS
DL180pdfport.dll      libDL180PDFPort.so      DL180PDFPort.framework
DL180pdfsettings.dll  libDL180pdfsettings.so  DL180pdfsettings.framework

Optical Character Recognition (OCR) processing

The Java, .NET and .NET Core interfaces for the Adobe PDF Library provide an Optical Character Recognition (OCR) feature that can recognize text in the images in a PDF document. The OCR utility recognizes text within each image and then allows you to save that text to a new PDF export file, with the text placed underneath the image where it was found.

The OCR engine library is stored in the Binaries directory, under DotNETCore:

dltesseract4.dll       Windows
libdltesseract4.so     Linux
libdltesseract4.dylib  macOS

Two sample programs, AddTextToImage and AddTextToDocument, show how to use OCR processing with .NET Core. These samples are found in the OpticalCharacterRecognition folder.

Working with Fonts and Languages

The Adobe PDF Library uses the Tesseract 4 OCR Engine. The tessdata4 directory holds the language files that the OCR engine uses to identify text in images in PDF documents. The default languages offered include English, Dutch, French, German, Italian, Spanish, Portuguese, Mandarin, Japanese, and Korean, but many more languages are available:

https://github.com/tesseract-ocr/tessdoc/blob/master/Data-Files-in-different-versions.md

The Adobe PDF Library provides a default set of fonts for OCR processing, and this set should serve for the languages that we ship with the product. You can add fonts for other languages to the candidateFontNames list within the OCRParams object.

If you want to use OCR for processing documents in a different language, download the files you need from this repository:

https://github.com/tesseract-ocr/tessdata/tree/4.0.0

Move the files to the tessdata4 folder.

For example, to set the language for OCR to Hindi, update the candidateFontNames list to include the Hindi font:

using System.Collections.Generic;

// Copy the current candidate font names into a new list.
List<string> newFontNames = new List<string>();
foreach (string fontName in ocrParams.candidateFontNames)
{
    newFontNames.Add(fontName);
}
// Append the Hindi font; Hind.ttf is the name of the font file.
newFontNames.Add("Hind");
ocrParams.candidateFontNames = newFontNames;

Add the Tesseract training data file for Hindi, hin.traineddata, to the tessdata4 folder.

Make sure that when you use OCR processing, you provide enough fonts to cover the languages and scripts that might appear in the images embedded in your source PDF documents. If you have PDF documents with images that contain text in multiple languages, supply fonts applicable to each of those languages, especially if more than one language appears within a single sentence or phrase. For example, an image might contain text written in the Korean alphabet alongside western (Arabic) numerals.

The OCR processing engine will use the first font that it finds that can successfully render text drawn from a graphics image. If you are providing fonts for a Latin alphabet, set up OCR processing so that proportional fonts appear before non-proportional fonts.
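Because the first usable font wins, the order of the candidate list matters. The sketch below shows one way to move proportional fonts ahead of non-proportional ones before assigning the list. The class FontOrdering is a hypothetical name, and the set of non-proportional font names is something the caller must supply, since this document does not say how the Library classifies fonts.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of the Adobe PDF Library.
public static class FontOrdering
{
    // Places proportional fonts first and non-proportional fonts last.
    // OrderBy is a stable sort, so the original relative order is kept
    // within each of the two groups.
    public static List<string> ProportionalFirst(
        IEnumerable<string> fontNames,
        ISet<string> nonProportionalNames)
    {
        return fontNames
            .OrderBy(name => nonProportionalNames.Contains(name) ? 1 : 0)
            .ToList();
    }
}
```

The returned list can then be assigned to the candidateFontNames list in the same way as in the Hindi example.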

The quality of the output provided by the OCR engine depends on the fonts you choose. Decorative fonts, such as Zapf Chancery, generally provide poor results. Try to use standard block fonts that would appear in a novel or magazine instead.

The OCR engine does not support languages where characters are not presented from left to right, including:

  • Korean vertical
  • Hebrew
  • Arabic
  • Urdu
  • Persian
  • Syriac
  • Sindhi
  • Kurdish with Arabic script

A pair of APIs is provided to list the languages available for OCR processing and to determine whether a specific language file is present in the tessdata4 folder.
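The names of those APIs are not given here, so the sketch below is a file-system stand-in rather than the Library's own calls. It assumes that each language is represented by a file named <code>.traineddata, as with hin.traineddata in the Hindi example. The class TessdataInspector and its methods are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in, not the Library's language APIs.
public static class TessdataInspector
{
    // Lists the language codes that have a .traineddata file in the folder,
    // sorted in ordinal order.
    public static List<string> ListLanguages(string tessdataDirectory)
    {
        var codes = new List<string>();
        foreach (string file in Directory.EnumerateFiles(tessdataDirectory, "*.traineddata"))
        {
            codes.Add(Path.GetFileNameWithoutExtension(file));
        }
        codes.Sort(StringComparer.Ordinal);
        return codes;
    }

    // Reports whether the training data file for one language code is present.
    public static bool HasLanguage(string tessdataDirectory, string languageCode)
    {
        return File.Exists(Path.Combine(tessdataDirectory, languageCode + ".traineddata"));
    }
}
```

For example, TessdataInspector.HasLanguage(tessdataPath, "hin") would confirm that the Hindi file was copied into the tessdata4 folder, where tessdataPath is the caller's path to that folder.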

Note that the OCR engine in the Adobe PDF Library is compatible only with 64-bit Windows, Linux, and macOS platforms.

Resources

The files found in the folders under Resources are used for a variety of operations, including creating/setting text and extracting or parsing content. Some of these are font files. Datalogics recommends that you include all of the resources in the Resources tree with your distribution. If, however, you need to limit the total size of the files included in your application, some of the components in the Resources folder can be removed.

  • Font. This folder includes a collection of fonts, including CJKV fonts. CJKV fonts use multi-byte (16-bit) character codes and are mostly used for Chinese, Japanese, Korean, and Vietnamese text. These languages have so many characters that a single 8-bit code cannot represent all of them, so each character needs a double-byte code, and the resulting font files tend to be larger than typical fonts. You can leave these files out of your distribution package if you will not be processing documents with CJKV content.
  • CMap. Some fonts in PDF files use predefined mappings between character encodings and specific, predefined character identifier sets. These mappings are called Character Maps (CMaps). We recommend that they all be included with your distribution, even though they can be quite large.
  • Joboptions. The Joboptions file is used only by the XPS plug-in provided with the Library. You can leave this file out of your distribution if your application does not use this plug-in.
  • Color. This is used for rendering, printing, and conversion operations.
  • Unicode. Used for text extraction and for text conversion during printing, rendering, and conversion operations. Unicode is an international character encoding standard.
  • tessdata4. This folder contains language files for applications that use the OCR engine.  It can be removed if OCR is not used.

The .NET and Java interfaces will look for the Resources folder under the primary deployment folder. The Font, CMap, Color, and Unicode path names can be specified during the Library initialization.
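When the Resources tree is kept in a non-default location, the four configurable paths have to be computed before initialization. The sketch below only builds those paths from a chosen root. How they are handed to the Library at initialization is not covered in this section and is not shown. The class ResourcePaths is a hypothetical name, and the subfolder names are taken from the list above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the Adobe PDF Library.
public static class ResourcePaths
{
    // Builds the Font, CMap, Color, and Unicode paths under
    // <deploymentDirectory>/Resources, keyed by folder name.
    public static Dictionary<string, string> Resolve(string deploymentDirectory)
    {
        string root = Path.Combine(deploymentDirectory, "Resources");
        var paths = new Dictionary<string, string>();
        foreach (string name in new[] { "Font", "CMap", "Color", "Unicode" })
        {
            paths[name] = Path.Combine(root, name);
        }
        return paths;
    }
}
```

Passing AppContext.BaseDirectory as the argument resolves the paths relative to the running application's directory.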