PDF Alchemist

Using the PDF Alchemist SDK

The Datalogics PDF Alchemist SDK provides a simple C language API to convert PDF files to HTML, XML, or EPUB content. The API (in Windows, the link library) is contained within the SDK subdirectory of the product folder.

PDF Alchemist SDK supports the following platforms:

Operating System: Microsoft Windows
Architecture: x86_64
Binary SDK Files: PDFAlchemist_x64.lib, PDFAlchemist_x64.dll
Notes: Visual Studio 2013 required. Windows 7 or higher supported.

Operating System: Linux
Architecture: x86_64
Binary SDK Files: libPDFAlchemist.so
Notes: Verified on Ubuntu 14.04 and Red Hat Enterprise Linux 7 target systems; other Linux versions using Linux kernel 3.2 or higher are also supported.

PDF Alchemist SDK declarations are contained within the file PDFAlchemist.h.

Parameters Description
blackText Render any text found in a PDF document as black text.
cmapDir Path to the Adobe CMap files for supporting the conversion of text represented in CID form to Unicode. This is a C‐style string and must have a trailing slash character. The string must remain valid until the return of the PDF processing call.
disableAcroForm Emit Acrobat PDF forms as flattened content instead of interactive fixed‐layout HTML forms.
enableAcroformsReflow Emit PDF form documents as reflowable HTML forms. This parameter is currently experimental; it may generate unwanted results for forms whose appearance mixes PDF page elements and PDF form elements. The disableAcroForm setting overrides this parameter.
enablebookmarks The destinations found in the PDF outline in the source document are exported to the output HTML file as a series of bookmarks. This feature allows you to create a table of contents in your HTML output file based on the outline provided in the PDF source file. The program will write a separate file called bookmarks.html. This file will have an IFrame view with the table of contents on one pane and the HTML output in another.
enablecaptions By default, PDF Alchemist does not look for caption text under images that appear in a PDF document. Set this option to “true” to tell the software to look for text appearing under images and define that text, when found, as a caption.
enableInfographicDetection Detect infographic contents (including charts and diagrams).
enableLayoutDirectionDetection Detect CJK (Chinese/Japanese/Korean) vertical layout text.
enableLogging Output progress logging to the console (stdout).
enableXmlOutput Convert the PDF input file to XML. No HTML files are generated.
graphicsOutputDpi Specify the target resolution for images in the PDF document that you want to export to rasterized graphic files. Valid values are from 12 to 2400. Set a value higher than 200 DPI if you want to improve the resolution of graphics drawn from the PDF document so that they will look better in a browser window. Set a value lower than 200 DPI to generate export graphic files that are smaller.
merge_span Disable the output of style information and <span> tags to XML files.
ocrLanguage If you are using Optical Character Recognition (OCR) to pull text from images in an input PDF document (see ocrMode), the OCR utility assumes English-language text by default, but it supports other languages as well.

Use this parameter to select German (deu), French (fra), Italian (ita), Dutch (nld), Portuguese (por) or Spanish (spa) if your input files use one of these languages instead.

Note that you can select only one language at a time for this option.

Default: eng (English)

ocrMode By default, PDF Alchemist passes images in PDF files through to its output without looking for text in these images.

If this option is set to “tag,” PDF Alchemist uses optical character recognition (OCR) to scan images when converting PDF files. Any text found within an image is embedded in the alt attribute of the image reference.

If ocrMode is set to “replace,” OCR is turned on and the recognized text replaces the original image in the output file. This creates selectable text in the HTML or XML output, and the source image is removed. The text is also marked as OCR text within the export file, so that a person reviewing the output knows where the text came from and is warned that it might not have been recognized perfectly.

For an HTML or EPUB output file, any text generated from an image in the PDF input file using OCR is marked with this attribute:

data-ocr-text="true"

For an XML file, the attribute looks like this:

ocr-text="true"

Note: OCR processing can lead to substantial increases in processing time and should be enabled only when desired.

Default: off – No OCR processing is performed during conversion.
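Because OCR-derived text carries the markers quoted above, a reviewer can get a rough idea of how much of an output file came from OCR by counting them. The following is a minimal sketch that assumes the output file has already been read into a string; count_marker is a hypothetical helper and is not part of the SDK.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Marker strings quoted in the ocrMode description. */
#define OCR_MARKER_HTML "data-ocr-text=\"true\""
#define OCR_MARKER_XML  "ocr-text=\"true\""

/* Count non-overlapping occurrences of `marker` in `text`.
   Hypothetical helper for illustration; not an SDK function. */
size_t count_marker(const char *text, const char *marker)
{
    assert(text != NULL && marker != NULL);
    size_t len = strlen(marker);
    size_t count = 0;
    if (len == 0)
        return 0;
    for (const char *p = strstr(text, marker); p != NULL;
         p = strstr(p + len, marker))
        count++;
    return count;
}
```

Note that the XML marker is a substring of the HTML marker, so use OCR_MARKER_HTML for HTML or EPUB output and OCR_MARKER_XML only for XML output. A plain substring search will also miss any variation in spacing or quoting, so treat the count as an estimate.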

pageRanges PDF Alchemist converts every page in a PDF document to HTML, XML or EPUB output by default. Use this option to select a specific set or range of pages for processing. The rest of the pages in the document are ignored. Each page range is described with two page numbers separated by a hyphen, as in 1-4 for the first four pages of a document. Single pages and multiple ranges can be combined in a comma-separated list, as in 1,3-5,7,9,14-18. If a page range does not include a second value, PDF Alchemist processes all of the pages from that page number to the end of the document. So for “22-” PDF Alchemist would start at page 22 and process all of the remaining pages in the document.
purpose Specify the goal of the PDF conversion. Three enumeration values are provided:

kIndexing: Do not rasterize any text. This preserves text so it can be used to index the content to allow for efficient searching. But the appearance of the output might differ significantly from the appearance of the input PDF.

kBalanced: Create output for general purpose uses. In this case the need for searchable output is balanced with the need for output with an appearance that is similar to the original PDF input file. Text may be rasterized when required to preserve the visual appearance and ordering of text in relation to other elements.

kVisual: Reserved for future use.

removeHyphen Remove hyphens that appear at the end of lines in the PDF input document. These hyphens are used either to divide words at syllables or to create hyphenated phrases. Note that this algorithm does not interpret the text involved; it simply removes every line-ending hyphen. Therefore, enabling this option will cause hyphenated phrases that break across lines to be combined into single words.
removeInvisibleText By default, PDF Alchemist exports all of the text found in every layer of a PDF document, and in its native color, usually black. If a PDF document has white text, this text might not be visible against a white background in an HTML export file. Set this option to “true” to discard this white text so that it is not included in the export file. Note that PDF Alchemist can also convert all of the text found in a PDF document to black text for export. See blackText above.
singleFile Emit HTML and CSS in one file instead of separate HTML and CSS files. If singleFile is set to True, skipPageBackground and splitByBookmarkDepth must be zero (0).
skipHeaderFooter Do not output content detected as a repeated header or footer in HTML output.
skipPageBackground Do not output an image detected as the page background in HTML output.
splitByBookmarkDepth By default, PDF Alchemist generates output content in one section that contains all of the generated output. But you can direct the product to generate output in sections based on the bookmarks (table of contents entries) found in the source PDF document. The number you enter for this parameter determines how many bookmark levels are used to define the sections.

Enter “0” for no splitting.

Enter “1” if you want the output file to break the content into sections based on the first level of the table of contents (“Heading 1,” if you will).

Enter “2” to break the output into sections at two levels (Heading 1 and Heading 2).

Enter “3” for three section levels (Heading 1, Heading 2, and Heading 3).

splitByEveryNumberOfPages By default, PDF Alchemist generates output content in one section that contains all of the generated output. Enter a number for this parameter to split the content into sections of that many pages, following the page breaks in the original PDF input document. For example, if you enter “3,” PDF Alchemist takes the first three pages from the PDF input file and puts them into a single HTML output file. It then takes the next three pages and places them into a second HTML output file. The product continues to create new sections (output HTML files) until all of the content in the input PDF document is converted.

Note: If both splitByBookmarkDepth and splitByEveryNumberOfPages are set to a value other than zero, PDF Alchemist will use splitByBookmarkDepth and ignore splitByEveryNumberOfPages.
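The precedence rule in the note above, and the page-count arithmetic for splitByEveryNumberOfPages, can be sketched as follows. This is only an illustration: SplitMode and both functions are hypothetical names, the parameters are assumed to be non-negative integers, and the sketch assumes that the last output file simply holds whatever pages remain.

```c
#include <assert.h>

/* Hypothetical names for illustration; not SDK declarations. */
typedef enum { SPLIT_NONE, SPLIT_BY_BOOKMARK, SPLIT_BY_PAGES } SplitMode;

/* Which split setting takes effect when one or both values are non-zero. */
SplitMode effective_split_mode(int bookmarkDepth, int everyNumberOfPages)
{
    assert(bookmarkDepth >= 0 && everyNumberOfPages >= 0);
    if (bookmarkDepth != 0)
        return SPLIT_BY_BOOKMARK;   /* bookmark depth wins over page count */
    if (everyNumberOfPages != 0)
        return SPLIT_BY_PAGES;
    return SPLIT_NONE;
}

/* Number of output files when splitting every N pages (rounds up). */
int page_split_file_count(int totalPages, int everyNumberOfPages)
{
    assert(totalPages >= 0 && everyNumberOfPages > 0);
    return (totalPages + everyNumberOfPages - 1) / everyNumberOfPages;
}
```

For a 10-page document split every 3 pages, page_split_file_count returns 4: three files of three pages each and one file with the final page.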

tablesOnly Output only table content from a PDF document to an export file, and disregard text, images, and any other content found.

Default: false (all content is exported).

useAccurateGlyphBox Use glyph metrics based on the embedded font data instead of the information provided in the PDF font dictionary.
usePDFEmbeddedFont For fonts embedded in the PDF input, emit renditions of the fonts, and reference the generated fonts in the HTML output, instead of emitting only references to fonts.
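The pageRanges syntax described in the table above is simple enough to check before starting a conversion. The following is a minimal sketch of such a check; is_valid_page_ranges is a hypothetical helper, not an SDK function. It assumes that page numbers start at 1, that spaces are not allowed, and that a descending range such as 5-3 is an error. The SDK's own parser may be more lenient or more strict.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Returns 1 if `spec` looks like a page-range list such as
   "1,3-5,7,22-", and 0 otherwise. Hypothetical helper. */
int is_valid_page_ranges(const char *spec)
{
    assert(spec != NULL);
    const char *p = spec;
    for (;;) {
        char *end;
        if (!isdigit((unsigned char)*p))
            return 0;                      /* each item starts with a number */
        long first = strtol(p, &end, 10);
        if (first < 1)
            return 0;                      /* assumed: pages are 1-based */
        p = end;
        if (*p == '-') {
            p++;
            if (isdigit((unsigned char)*p)) {
                long last = strtol(p, &end, 10);
                if (last < first)
                    return 0;              /* assumed: no descending ranges */
                p = end;
            }
            /* no digits after '-': open-ended range such as "22-" */
        }
        if (*p == '\0')
            return 1;                      /* reached the end cleanly */
        if (*p != ',')
            return 0;                      /* items are comma-separated */
        p++;
    }
}
```

Calling is_valid_page_ranges("1,3-5,7,9,14-18") or is_valid_page_ranges("22-") returns 1, while an empty string or "1,,2" returns 0.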

To convert a PDF file to HTML or XML, use the processPdf API call:

Result processPdf (
     const char* pdfFilePath,
     const char* outputDir,
     Parameters* params,
     float* confidenceScore);

This API call works the same way for HTML and XML output. The only difference is that the enableXmlOutput parameter must be set to True to produce XML.

To convert a PDF file to EPUB, use the processPdf2Epub API call:

Result processPdf2Epub (
     const char* pdfFilePath,
     const char* outputDir,
     Parameters* params,
     float* confidenceScore);

Both calls accept the following parameters:

Parameter Description
pdfFilePath Input PDF file path
outputDir Directory to write output into. This directory must exist before calling. This may be either a relative or absolute path and it may be the current working directory.
params Parameters for conversion. This may be NULL.
confidenceScore Pointer to a float filled in by the call on successful conversion. This will be filled with a value between 0 and 100 to describe the system's confidence in the conversion. Complex layouts that might prompt the system to "guess" about the type of content and structure during the layout process tend to lead to lower confidence scores. Examples might include borderless tables or complex tables with multiple columns and large quantities of infographics.
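Putting the four arguments together, a call with default parameters might look like the sketch below. It checks that the output directory exists (using POSIX stat, so this part is Linux-oriented), passes NULL for params, and reports success. HAVE_PDFALCHEMIST_SDK is a hypothetical build flag: when it is not defined, the sketch uses stand-in declarations and a stub so that it compiles without the SDK. The stand-in enumerator order and values are assumptions, and convert_with_defaults and dir_exists are illustrative helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

#ifdef HAVE_PDFALCHEMIST_SDK          /* hypothetical build flag */
#include "PDFAlchemist.h"
#else
/* Stand-ins so the sketch compiles without the SDK. The names come from
   this document; the enumerator order and values are assumptions. */
typedef enum { kSuccess, kFailed, kLicenseInvalid, kParameterOutofRange,
               kInvalidPDFInput, kInternalError } Result;
typedef struct Parameters Parameters;
static Result processPdf(const char *pdfFilePath, const char *outputDir,
                         Parameters *params, float *confidenceScore)
{
    (void)pdfFilePath; (void)outputDir; (void)params;
    *confidenceScore = 100.0f;        /* placeholder, not a real score */
    return kSuccess;
}
#endif

/* Returns 1 if `path` names an existing directory (POSIX stat). */
static int dir_exists(const char *path)
{
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* Convert to HTML with default parameters (params == NULL).
   Returns 1 on success and stores the confidence score in *score. */
int convert_with_defaults(const char *pdfPath, const char *outDir,
                          float *score)
{
    assert(pdfPath != NULL && outDir != NULL && score != NULL);
    if (!dir_exists(outDir))
        return 0;                     /* outputDir must exist before the call */
    return processPdf(pdfPath, outDir, NULL, score) == kSuccess;
}
```

A caller would typically inspect the score after a successful return and flag low-confidence conversions for manual review; the threshold to use is a workflow decision.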

If NULL is passed as the Parameters structure pointer to either call, the following default conversion parameters are used:

Parameter Default value
blackText false
cmapDir No default path
disableAcroForm false
enableAcroformReflow false
enableBookmarks true
enablecaptions false
enableInfographicDetection true
enableLayoutDirectionDetection true
enableLogging false
enableXmlOutput false
graphicsOutputDpi 200
linkStyleUnspecified false
merge_span false
pageRanges all pages in document
purpose kBalanced
removeHyphen false
removeInvisibleText false
singleFile false
skipHeaderFooter true
skipPageBackground true
splitByBookmarkDepth 0 for HTML output; 1 for EPUB output
splitByEveryNumberOfPages 0
tablesOnly false
useAccurateGlyphBox true
usePDFEmbeddedFont true

For best results, we recommend explicitly supplying a Parameters structure with the specific conversion parameters that best fit your usage workflow.
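When you fill in a Parameters structure yourself, remember that cmapDir must end with a slash and that the string must stay valid until the processing call returns. The sketch below shows one way to normalize a directory path into a caller-owned buffer before assigning it. The helper name with_trailing_slash is hypothetical, and treating a trailing backslash as acceptable on Windows is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `path` into `out`, appending '/' if it does not already end with
   a slash. Returns 1 on success, 0 if `path` is empty or `out` is too
   small. Hypothetical helper; not an SDK function. */
int with_trailing_slash(const char *path, char *out, size_t outSize)
{
    assert(path != NULL && out != NULL);
    size_t len = strlen(path);
    if (len == 0)
        return 0;
    /* Assumption: a trailing backslash also counts on Windows. */
    int hasSlash = (path[len - 1] == '/' || path[len - 1] == '\\');
    size_t needed = len + (hasSlash ? 0 : 1) + 1;   /* +1 for terminator */
    if (needed > outSize)
        return 0;
    memcpy(out, path, len);
    if (!hasSlash)
        out[len++] = '/';
    out[len] = '\0';
    return 1;
}
```

Keep the buffer in a scope that outlives the conversion call, for example as a local variable in the function that calls processPdf, so that the pointer stored in cmapDir remains valid for the duration of the call.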

Return value:

Value Description
kSuccess The conversion completed successfully.
kFailed An error occurred during conversion.
kLicenseInvalid Indicates that a license could not be found or successfully used
kParameterOutofRange A parameter supplied in the control structure is outside the range of permitted values. For example, if graphicsOutputDpi is set to 3600 you will see this return value, because the highest output resolution allowed for a graphic file is 2400 DPI. No processing is performed.
kInvalidPDFInput The file provided for processing was not a valid PDF file or has syntax errors that prevent PDF Alchemist from being able to open and process it.
kInternalError An internal error occurred during processing. PDF Alchemist cannot provide any additional information. No output was generated.
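The return values above can be turned into readable messages, and the documented graphicsOutputDpi range (12 to 2400) can be checked before the call to avoid kParameterOutofRange. The following is a minimal sketch. HAVE_PDFALCHEMIST_SDK is a hypothetical build flag, and the stand-in Result enum used without it assumes an enumerator order that may not match the real header. The helper names and message texts are illustrative, and graphicsOutputDpi is assumed to be an integer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef HAVE_PDFALCHEMIST_SDK          /* hypothetical build flag */
#include "PDFAlchemist.h"
#else
/* Stand-in so the sketch compiles without the SDK; order is assumed. */
typedef enum { kSuccess, kFailed, kLicenseInvalid, kParameterOutofRange,
               kInvalidPDFInput, kInternalError } Result;
#endif

/* Map a result code to a short message (wording is illustrative). */
const char *result_message(Result r)
{
    switch (r) {
    case kSuccess:             return "conversion succeeded";
    case kFailed:              return "error during conversion";
    case kLicenseInvalid:      return "license not found or not usable";
    case kParameterOutofRange: return "parameter out of range";
    case kInvalidPDFInput:     return "input is not a valid PDF";
    case kInternalError:       return "internal error";
    }
    return "unrecognized result code";
}

/* Returns 1 on kSuccess; always stores an explanation in *message. */
int check_result(Result r, const char **message)
{
    assert(message != NULL);
    *message = result_message(r);
    return r == kSuccess;
}

/* Documented range for graphicsOutputDpi is 12 to 2400 inclusive. */
int is_valid_graphics_dpi(int dpi)
{
    return dpi >= 12 && dpi <= 2400;
}
```

With these helpers, a value of 3600 is rejected by is_valid_graphics_dpi before any processing is attempted, which matches the example given for kParameterOutofRange.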