PDF Alchemist

Using the PDF Alchemist SDK

The Datalogics PDF Alchemist SDK provides a simple C language API to convert PDF files to HTML, XML, or EPUB content. The API (on Windows, the link library) is contained within the SDK subdirectory of the product folder.

PDF Alchemist SDK supports the following platforms:

Operating System    Architecture    Binary SDK Files                              Notes
Microsoft Windows   x86 (32-bit)    PDFAlchemist_x86.lib, PDFAlchemist_x86.dll    Visual Studio 2013 required. Windows 7 or higher supported.
Microsoft Windows   x86_64          PDFAlchemist_x64.lib, PDFAlchemist_x64.dll    Visual Studio 2013 required. Windows 7 or higher supported.
Apple Mac OS X      x86_64          libPDFAlchemist.dylib                         Xcode 5.x or 6.x required. OS X 10.9 or higher supported.
Linux               x86_64          libPDFAlchemist.so                            Verified on Ubuntu 14.04 and Red Hat Enterprise Linux 7 target systems; other Linux versions using Linux kernel 3.2 or higher are also supported.

PDF Alchemist SDK declarations are contained within the file PDFAlchemist.h.

Parameters Description
cmapDir Path to the Adobe CMap files for supporting the conversion of text represented in CID form to Unicode. This is a C‐style string and must have a trailing slash character. The string must remain valid until the return of the PDF processing call.
disableAcroForm Emit Acrobat PDF forms as flattened content instead of interactive fixed‐layout HTML forms.
enableAcroformsReflow Emit PDF form documents as reflowable HTML forms. This parameter is currently experimental; it may generate unwanted results for forms whose appearance is a mix of PDF page elements and PDF form elements. The disableAcroForm setting overrides this parameter.
enableBookmarks Export the destinations found in the PDF outline of the source document to the output HTML as a series of bookmarks. This feature lets you create a table of contents in your HTML output based on the outline provided in the PDF source file. The program writes a separate file called bookmarks.html, which uses an IFrame view with the table of contents in one pane and the HTML output in the other.
enableInfographicDetection Detect infographic contents (including charts and diagrams).
enableLayoutDirectionDetection Detect CJK (Chinese/Japanese/Korean) vertical layout text.
enableLogging Output progress logging to the console (stdout).
enableOCR By default, PDF Alchemist passes images in PDF files through to its output without looking for textual data in these images. If this option is set to true, PDF Alchemist uses optical character recognition (OCR) to scan images when converting PDF files. Text found in an image is embedded in the alt attribute of the image reference as a textual substitute for the image. Note: OCR processing can substantially increase processing time and should be enabled only when desired. OCR processing is currently limited to English-language text.
enableXmlOutput Convert the PDF input file to XML. No HTML files are generated.
graphicsOutputDpi Specify the target resolution for images in the PDF document that you want to export to rasterized graphic files. Valid values are from 12 to 2400. Set a value higher than 200 DPI if you want to improve the resolution of graphics drawn from the PDF document so that they will look better in a browser window. Set a value lower than 200 DPI to generate export graphic files that are smaller.
merge_span Disable the output of style information and <span> tags to XML files.
purpose Specify the goal of the PDF conversion. Three enumeration values are provided:

kIndexing: Do not rasterize any text. This preserves text so it can be used to index the content to allow for efficient searching. But the appearance of the output might differ significantly from the appearance of the input PDF.

kBalanced: Create output for general purpose uses. In this case the need for searchable output is balanced with the need for output with an appearance that is similar to the original PDF input file. Text may be rasterized when required to preserve the visual appearance and ordering of text in relation to other elements.

kVisual: Reserved for future use.

removeHyphen Remove hyphens that appear at the end of lines in the PDF input document. These hyphens are used to divide words at syllables or to create hyphenated phrases. Note that the algorithm does not interpret the text involved; it simply removes every end-of-line hyphen. Enabling this option therefore causes hyphenated phrases that span lines to be combined into single words.
singleFile Emit HTML and CSS in one file instead of separate HTML and CSS files. If singleFile is set to True, skipPageBackground and splitByBookmarkDepth must be zero (0).
skipHeaderFooter Don’t output content detected as repeated headers and footers in the HTML output.
skipPageBackground Don’t output images detected as page backgrounds in the HTML output.
splitByBookmarkDepth By default, PDF Alchemist generates its output as one section that contains all of the generated content. You can instead direct the product to generate output in sections based on the bookmarks (table of contents entries) found in the source PDF document. The number you enter for this parameter determines how many bookmark levels are used to define sections.

Enter 0 for no splitting.

Enter “1” if you want the output to be broken into sections based on the first level of the table of contents (“Heading 1,” if you will).

Enter “2” for two section levels (Heading 1 and Heading 2).

Enter “3” for three section levels (Heading 1, Heading 2, and Heading 3).

splitByEveryNumberOfPages By default, PDF Alchemist generates its output as one section that contains all of the generated content. Enter a number for this parameter to split the content into sections based on the page breaks found in the original PDF input document. For example, if you enter “3”, PDF Alchemist takes the first three pages from the PDF input file and puts them into a single output file. It then takes the next three pages and places them into a second HTML output file. The product continues to create new sections (output HTML files) until all of the content in the input PDF document is converted.

Note: If both splitByBookmarkDepth and splitByEveryNumberOfPages are set to a value other than zero, PDF Alchemist uses splitByBookmarkDepth and ignores splitByEveryNumberOfPages.

useAccurateGlyphBox Use glyph metrics based on the embedded font data instead of the information provided in the PDF font dictionary.
usePDFEmbeddedFont For fonts embedded in the PDF input, emit renditions of the fonts, and reference the generated fonts in the HTML output, instead of emitting only references to fonts in the HTML output.
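
Several of the constraints described in the table above can be checked before a conversion is started. The following is a minimal sketch in C and is not part of the SDK. The helper names (has_trailing_slash, dpi_in_range, sections_for_page_split, page_split_applies) are hypothetical. Accepting a backslash as a trailing slash is an assumption made for Windows paths, and the section count assumes that leftover pages form a final, shorter section.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Documented limits for graphicsOutputDpi. */
#define MIN_GRAPHICS_DPI 12
#define MAX_GRAPHICS_DPI 2400

/* cmapDir must end with a slash. Treating '\\' like '/' is an assumption
   for Windows paths; the parameter description only says "slash". */
int has_trailing_slash(const char *dir)
{
    size_t len;
    assert(dir != NULL);
    len = strlen(dir);
    return len > 0 && (dir[len - 1] == '/' || dir[len - 1] == '\\');
}

/* Returns 1 when dpi lies within the documented 12..2400 range. */
int dpi_in_range(int dpi)
{
    return dpi >= MIN_GRAPHICS_DPI && dpi <= MAX_GRAPHICS_DPI;
}

/* Number of sections produced when splitting every everyN pages.
   everyN == 0 means no splitting, so everything stays in one section.
   Assumes leftover pages go into one final, shorter section. */
int sections_for_page_split(int pageCount, int everyN)
{
    assert(pageCount > 0 && everyN >= 0);
    if (everyN == 0) {
        return 1;
    }
    return (pageCount + everyN - 1) / everyN;  /* integer ceiling */
}

/* Returns 1 when page-based splitting takes effect. A nonzero bookmark
   depth takes precedence, as described in the note above. */
int page_split_applies(int bookmarkDepth, int everyN)
{
    return bookmarkDepth == 0 && everyN != 0;
}
```

Under these assumptions, a 10-page input with splitByEveryNumberOfPages set to 3 would yield four sections (3 + 3 + 3 + 1 pages).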

To convert a PDF file to HTML or XML, use the processPdf API call:

Result processPdf (
     const char* pdfFilePath,
     const char* outputDir,
     Parameters* params,
     Float *confidenceScore)

This API call works the same way for HTML and XML. The only difference is that the enableXmlOutput parameter must be set to True to produce XML.

To convert a PDF file to EPUB, use the processPdf2Epub API call:

Result processPdf2Epub (
     const char* pdfFilePath,
     const char* outputDir,
     Parameters* params,
     Float *confidenceScore)

Both calls accept the following parameters:

Parameter Description
pdfFilePath Input PDF file path
outputDir Directory to write output into. This directory must exist before calling. This may be either a relative or absolute path and it may be the current working directory.
params Parameters for conversion. This may be NULL.
confidenceScore Pointer to a float filled in by the call on successful conversion. This will be filled with a value between 0 and 100 to describe the system's confidence in the conversion. Complex layouts that might prompt the system to "guess" about the type of content and structure during the layout process tend to lead to lower confidence scores. Examples might include borderless tables or complex tables with multiple columns and large quantities of infographics.
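
Because outputDir must exist before either call is made, it can be useful to verify the directory first. The sketch below shows one possible pre-flight check and is not SDK code. The helper output_dir_exists is hypothetical, and it relies on the POSIX stat() call, so it assumes a Linux or OS X build (a Windows build would need an equivalent such as _stat).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if path names an existing directory,
   0 if it does not exist or is not a directory. Relative paths are
   resolved against the current working directory by stat(). */
int output_dir_exists(const char *path)
{
    struct stat st;
    assert(path != NULL);
    if (stat(path, &st) != 0) {
        return 0;  /* missing or inaccessible */
    }
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

A caller could run this check on the intended outputDir and create the directory, or report an error, before invoking the conversion call.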

If NULL is passed as the Parameters structure pointer to either call, the following default conversion parameters are used:

Parameter Default value
cmapDir No default path
disableAcroForm false
enableAcroformReflow false
enableBookmarks true
enableInfographicDetection true
enableLayoutDirectionDetection true
enableLogging false
enableXmlOutput false
graphicsOutputDpi 200
linkStyleUnspecified false
merge_span false
purpose kBalanced
removeHyphen false
singleFile false
skipHeaderFooter true
skipPageBackground true
splitByBookmarkDepth 0 for HTML output; 1 for EPUB output
splitByEveryNumberOfPages 0
useAccurateGlyphBox true
usePDFEmbeddedFont true

For best results, we recommend explicitly supplying a Parameters structure with the specific conversion parameters that best fit your usage workflow.
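
One way to follow this recommendation is to start from the documented defaults and override only the fields that matter for a given workflow. The sketch below is illustrative only. ExampleParameters and ExamplePurpose are stand-ins invented for this example, with assumed field types and spellings. The real Parameters structure is declared in PDFAlchemist.h and may differ. The helper example_set_defaults is likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the SDK's purpose enumeration (names from this document). */
typedef enum { kIndexing, kBalanced, kVisual } ExamplePurpose;

/* Stand-in for the SDK's Parameters structure. Field types, order, and
   spelling are assumptions; PDFAlchemist.h is authoritative. */
typedef struct {
    const char *cmapDir;
    int disableAcroForm;
    int enableAcroformReflow;
    int enableBookmarks;
    int enableInfographicDetection;
    int enableLayoutDirectionDetection;
    int enableLogging;
    int enableXmlOutput;
    int graphicsOutputDpi;
    int linkStyleUnspecified;
    int merge_span;
    ExamplePurpose purpose;
    int removeHyphen;
    int singleFile;
    int skipHeaderFooter;
    int skipPageBackground;
    int splitByBookmarkDepth;
    int splitByEveryNumberOfPages;
    int useAccurateGlyphBox;
    int usePDFEmbeddedFont;
} ExampleParameters;

/* Fill p with the defaults listed in the table above.
   forEpub selects the EPUB default for splitByBookmarkDepth. */
void example_set_defaults(ExampleParameters *p, int forEpub)
{
    assert(p != NULL);
    p->cmapDir = NULL;                      /* no default path */
    p->disableAcroForm = 0;
    p->enableAcroformReflow = 0;
    p->enableBookmarks = 1;
    p->enableInfographicDetection = 1;
    p->enableLayoutDirectionDetection = 1;
    p->enableLogging = 0;
    p->enableXmlOutput = 0;
    p->graphicsOutputDpi = 200;
    p->linkStyleUnspecified = 0;
    p->merge_span = 0;
    p->purpose = kBalanced;
    p->removeHyphen = 0;
    p->singleFile = 0;
    p->skipHeaderFooter = 1;
    p->skipPageBackground = 1;
    p->splitByBookmarkDepth = forEpub ? 1 : 0;
    p->splitByEveryNumberOfPages = 0;
    p->useAccurateGlyphBox = 1;
    p->usePDFEmbeddedFont = 1;
}
```

With a helper along these lines, a caller would set the defaults once and then change individual fields, for example turning on enableXmlOutput for XML output, before passing the structure to the conversion call.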

Return values:

Value Description
kSuccess The conversion completed successfully.
kFailed An error occurred during conversion.
kLicenseInvalid A license could not be found or could not be used successfully.
kParameterOutofRange A parameter supplied in the control structure is outside the range of permitted values. For example, setting graphicsOutputDpi to 3600 produces this return value, because the highest output resolution allowed for a graphic file is 2400 DPI. No processing is performed.
kInvalidPDFInput The file provided for processing is not a valid PDF file, or it has syntax errors that prevent PDF Alchemist from opening and processing it.
kInternalError An internal error occurred during processing. PDF Alchemist cannot provide any additional information. No output was generated.
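
The return values above lend themselves to a small reporting helper. The following sketch is an illustration, not SDK code. ExampleResult is a stand-in enumeration that reuses the value names from this document, its numeric values are assumptions, and the helpers example_result_message and example_is_input_problem are hypothetical. In a real program the Result type from PDFAlchemist.h would take the place of the stand-in.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the SDK's Result type. Names come from this document;
   the numeric values are assumptions. */
typedef enum {
    kSuccess,
    kFailed,
    kLicenseInvalid,
    kParameterOutofRange,
    kInvalidPDFInput,
    kInternalError
} ExampleResult;

/* Map a result code to a short message suitable for logging. */
const char *example_result_message(ExampleResult r)
{
    switch (r) {
    case kSuccess:             return "conversion completed";
    case kFailed:              return "error during conversion";
    case kLicenseInvalid:      return "license missing or unusable";
    case kParameterOutofRange: return "parameter out of range, nothing processed";
    case kInvalidPDFInput:     return "input is not a valid PDF";
    case kInternalError:       return "internal error, no output generated";
    }
    return "unrecognized result code";
}

/* Returns 1 for codes that point at what the caller supplied
   (a parameter value or the input file), 0 otherwise. */
int example_is_input_problem(ExampleResult r)
{
    return r == kParameterOutofRange || r == kInvalidPDFInput;
}
```

Separating caller-side problems from other failures is a design choice for this example; it lets an application decide whether to ask the user for different input or to report an environment or processing failure.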