PDF Alchemist

Using the PDF Alchemist SDK

The Datalogics PDF Alchemist SDK provides a simple C language API to convert PDF files to HTML, XML, JSON, EPUB, plain text, or CSV content. The API (in Windows, the link library) is contained within the SDK subdirectory of the product folder.

PDF Alchemist SDK supports the following platforms:

Operating System Architecture Binary SDK Files Notes
Microsoft Windows x86_64 PDFAlchemist_x64.lib, PDFAlchemist_x64.dll Visual Studio 2013 required. Windows 7 or higher supported.
Linux x86_64 libPDFAlchemist.so Verified on Ubuntu 14.04 and Red Hat Enterprise Linux 7 target systems; other Linux versions using Linux kernel 3.2 or higher are also supported.

PDF Alchemist SDK declarations are contained within the file PDFAlchemist.h.

Parameters Description
blackText Render any text found in a PDF document as black text.
borderlessTableDetectionOnPages By default, PDF Alchemist looks for tables without borders. Enter a range of page numbers to limit the search for borderless tables to part of a document. Set this option to “none” to disable the feature; in that case only tables with borders are rendered as tables in the output file, and any borderless tables found are exported as standard text.

Defaults to 1-, or all pages in a document, meaning that PDF Alchemist will by default look for and render borderless tables.

cmapDir Path to the Adobe CMap files for supporting the conversion of text represented in CID form to Unicode. This is a C‐style string and must have a trailing slash character. The string must remain valid until the return of the PDF processing call.
disableAcroForm Emit Acrobat PDF forms as flattened content instead of interactive fixed‐layout HTML forms.
enableAcroformReflow Emit PDF forms documents as reflowable HTML forms. This parameter is currently experimental; it may generate unwanted results for forms whose appearance is a mix of PDF page elements and PDF form elements. The disableAcroForm setting overrides this parameter.
enablebookmarks The destinations found in the PDF outline in the source document are exported to the output HTML file as a series of bookmarks. This feature allows you to create a table of contents in your HTML output file based on the outline provided in the PDF source file. The program will write a separate file called bookmarks.html. This file will have an IFrame view with the table of contents on one pane and the HTML output in another.
enablecaptions By default, PDF Alchemist does not look for caption text under images that appear in a PDF document. Set this option to “true” to tell the software to look for text appearing under images and define that text, when found, as a caption.
enableInfographicDetection Detect infographic contents (including charts and diagrams).
enableLayoutDirectionDetection Detect CJK (Chinese/Japanese/Korean) vertical layout text.
enableLogging Output progress logging to the console (stdout).
enableXmlOutput Convert the PDF input file to XML. No HTML files are generated.

Note: The enableXmlOutput parameter has been deprecated. We recommend you use the outputFormat parameter instead.

fontDirectoryPath Provide a custom folder name for the font (TTF) files extracted from a PDF source file.

Defaults to ./fonts.

fontFilenamePrefix Provide a custom prefix to add to the name of each TTF font file extracted from a PDF source file.

Defaults to no prefix name.

graphicsOutputDpi Specify the target resolution for images in the PDF document that you want to export to rasterized graphic files. Valid values are from 12 to 2400. Set a value higher than 200 DPI if you want to improve the resolution of graphics drawn from the PDF document so that they will look better in a browser window. Set a value lower than 200 DPI to generate export graphic files that are smaller.
imageDirectoryPath Provide a custom folder name for the image (PNG) files extracted from a PDF source file.

Defaults to ./images.

imageFilenamePrefix Provide a custom prefix to add to the name of each PNG image file extracted from a PDF source file.

Defaults to no prefix name.

merge_span Disable the output of style information and <span> tags to XML files.
ocrLanguage If you are using Optical Character Recognition (OCR) to pull text from images in an input PDF document (see ocrMode below), the OCR utility defaults to English language text. But the OCR utility supports other languages.

Use this parameter to select German (deu), French (fra), Italian (ita), Dutch (nld), Portuguese (por) or Spanish (spa) if your input files use one of these languages instead.

Note that you can select only one language at a time for this option.

Default: eng (English)

ocrMode By default, PDF Alchemist passes images in PDF files through to its output without looking for text in these images. If this option is set to “tag,” PDF Alchemist uses optical character recognition (OCR) to scan images when converting PDF files. Any text found within an image is embedded in the alt attribute of the image reference.

If ocrMode is set to “replace,” the OCR feature is turned on and the OCR text replaces the original image in the output file. The process creates selectable text in the HTML or XML output, and the source image is removed. The text is also tagged as OCR text within the export file. This allows the person reviewing the output file to know where the text came from, and it also serves as a warning, as the OCR text might not be rendered perfectly.

For an HTML or EPUB output file, any text generated from an image in the PDF input file using OCR is marked with this tag:


For an XML file, the tag looks like this:


And for a JSON file, it looks like this:

"ocr-text": "true"

Note: OCR processing can lead to substantial increases in processing time and should be enabled only when desired.

Default: off – No OCR processing is performed during conversion.

outputFilename Use this parameter to assign a custom file name for your export file.

Default: Name of the PDF input file

For XML, file name defaults to “exportedXML.xml”

outputFormat Use this parameter to specify the type of output to be generated by PDF Alchemist: HTML, XML, EPUB, JSON, CSV, or plain text (txt). Enumeration values include kHtml, kXml, kEPub, kJson, kCsv, and kPlainText.

Default: kHtml – generate HTML output

pageRanges PDF Alchemist converts every page in a PDF document to HTML, XML, JSON or EPUB output by default. Use this option to select a specific set or range of pages for processing. The rest of the pages in the document are ignored. You may include multiple page ranges in the statement. Each page range is described with two page numbers separated by a hyphen, as in 1-4 for the first four pages of a document. Single pages and ranges can be combined in a comma-separated list, as in 1,3-5,7,9,14-18. If a page range does not include a second value, PDF Alchemist processes all of the pages in the document from the page number provided onward. So for "22-" PDF Alchemist would start at page 22 and process all of the remaining pages in the document.
purpose Specify the goal of the PDF conversion. Three enumeration values are provided:

kIndexing: Do not rasterize any text. This preserves text so it can be used to index the content to allow for efficient searching. But the appearance of the output might differ significantly from the appearance of the input PDF.

kBalanced: Create output for general purpose uses. In this case the need for searchable output is balanced with the need for output with an appearance that is similar to the original PDF input file. Text may be rasterized when required to preserve the visual appearance and ordering of text in relation to other elements.

kVisual: Reserved for future use.

reflowText By default, PDF Alchemist reflows all of the text found in a PDF source document. Set this option to false to turn reflow off and add a line break at the end of every line of text.
removeHyphen Remove hyphens that appear at the end of lines in the PDF input document. The hyphens are used to divide words at syllables or create hyphenated phrases. Note that this algorithm does not interpret the text involved. It simply removes hyphens wherever they appear. Therefore enabling this option will cause hyphenated phrases that span lines to be combined into single words.
removeInvisibleText By default, PDF Alchemist exports all of the text found in every layer of a PDF document, and in its native color, usually black. If a PDF document has white text, this text might not be visible against a white background in an HTML export file. Set this option to “true” to discard this white text so that it is not included in the export file. Note that PDF Alchemist can also convert all of the text found in a PDF document to black text for export. See blackText above.
singleFile Emit HTML and CSS in one file instead of separate HTML and CSS files. If singleFile is set to true, skipPageBackground and splitByBookmarkDepth must both be zero (0).
skipHeaderFooter Do not output content detected as a repeated header or footer in HTML output. The PDF document must be at least four pages long for PDF Alchemist to identify any headers and footers as being repeated in that document.
skipPageBackground Do not output an image detected as a page background in HTML output.
splitByBookmarkDepth By default, PDF Alchemist generates a single section that contains all of the output. You can instead direct the product to split the output into sections based on the bookmarks (table of contents entries) found in the source PDF document. The number you enter for this parameter determines how many bookmark levels are used.

Enter 0 for no splitting.

Enter 1 to break the content into sections based on the first level of the table of contents (“Heading 1,” if you will).

Enter 2 to break the content into sections at the first two levels (Heading 1 and Heading 2).

Enter 3 to use three levels (Heading 1, Heading 2, and Heading 3).

splitByEveryNumberOfPages By default, PDF Alchemist generates a single section that contains all of the output. Enter a number for this parameter to split the content into sections based on the page breaks found in the original PDF input document. For example, you could enter 3, and PDF Alchemist would take the first three pages from the PDF input file and put them into a single HTML output file, then place the next three pages into a second HTML output file. The product continues to create new output HTML files in this way until all of the content in the input PDF document is converted.

Note: If both splitByBookmarkDepth and splitByEveryNumberOfPages are set to a value other than zero, PDF Alchemist will use splitByBookmarkDepth and ignore splitByEveryNumberOfPages.

stylesheetPath Provide a custom file name and path for the stylesheet (CSS) file created by PDF Alchemist.

Defaults to stylesheet.css.

tableBorders Specify how table borders will be handled in the export file.

  • always. All tables in the export file will have borders.
  • never. None of the tables in the export file will have borders.
  • detect. The product exports tables as it finds them in the source document, with or without borders.

Defaults to detect, meaning that tables are exported as they appear in the source document.

tablesOnly Only output table content from a PDF document to an export file and disregard text, images, and any other content found.

Defaults to false, meaning that all content is exported.

useAccurateGlyphBox Use glyph metrics based on the embedded font data instead of the information provided in the PDF font dictionary.
usePDFEmbeddedFont For fonts embedded in the PDF input, emit renditions of the fonts, along with references in the HTML output to the generated fonts, instead of emitting only references to fonts in the HTML output.
xsltStylesheetPath Provide the file name and path for an XSLT stylesheet. The stylesheet will be used to transform the XML output file. XSLT (Extensible Stylesheet Language Transformations) is a language for transforming XML documents, or for converting XML files into other file formats. PDF Alchemist supports XSLT language up to version 1.1.

If you use xsltStylesheetPath in creating output, you need to define XML as the output file type with outputFormat. The transformation you define in your XSLT stylesheet will be applied to PDF Alchemist's XML output, and the result will be saved by default as a .txt file. You may override this default by specifying a custom file name and file extension using the outputFilename option.

Note: PDF Alchemist provides a sample XSLT stylesheet called paragraph.xslt in the Samples directory. This sample selects paragraph content from the input document.

To learn more about XSLT, visit:



To convert a PDF file to HTML, XML, EPUB, JSON, CSV, or plain text, use the processPdf API call:

Result processPdf (
     const char* pdfFilePath,
     const char* outputDir,
     Parameters* params,
     Float *confidenceScore)

Where this call accepts the following parameters:

Parameter Description
pdfFilePath Input PDF file path
outputDir Directory to write output into. This directory must exist before calling. This may be either a relative or absolute path and it may be the current working directory.
params Pointer to control Parameters structure for conversion as documented above. This may be NULL.
confidenceScore Pointer to a float filled in by the call on successful conversion. This will be filled with a value between 0 and 100 to describe the system's confidence in the conversion. Complex layouts that might prompt the system to "guess" about the type of content and structure during the layout process tend to lead to lower confidence scores. Examples might include borderless tables or complex tables with multiple columns and large quantities of infographics.
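
Because outputDir must already exist when processPdf is called, a caller may want to check for it first. The following is a minimal sketch of such a check in C. The helper name output_dir_exists is hypothetical and is not part of the PDF Alchemist SDK. It relies only on the standard stat() call, with a fallback definition of S_ISDIR for compilers whose headers are assumed not to supply that macro.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Fallback for toolchains whose <sys/stat.h> lacks the POSIX S_ISDIR macro
   (assumed here to still provide S_IFMT and S_IFDIR). */
#ifndef S_ISDIR
#define S_ISDIR(m) (((m) & S_IFMT) == S_IFDIR)
#endif

/* Hypothetical helper, not part of the SDK.
   Returns 1 if `path` names an existing directory, 0 otherwise. */
int output_dir_exists(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return 0;                 /* path is missing or cannot be examined */
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

If the function returns 0, create the directory or report an error before passing the path to processPdf.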

When a NULL value for the Parameters structure pointer is specified to the call, the following set of default conversion parameters will be used:

Parameter Default Value
blackText false
borderlessTableDetectionOnPages 1-
cmapDir No default path
disableAcroForm false
enableAcroformReflow false
enableBookmarks true
enablecaptions false
enableInfographicDetection true
enableLayoutDirectionDetection true
enableLogging false
enableXmlOutput false (this has been deprecated)
fontDirectoryPath ./fonts
fontFilenamePrefix no prefix name
graphicsOutputDpi 300
imageDirectoryPath ./images
imageFilenamePrefix no prefix name
linkStyleUnspecified false
merge_span false
outputFormat kHtml (HTML output)
outputFilename Name of PDF input file
pageRanges all pages in document
purpose kBalanced
reflowText true
removeHyphen false
removeInvisibleText false
singleFile false
skipHeaderFooter true
skipPageBackground true
splitByBookmarkDepth 0 for HTML output, 1 for EPUB output
splitByEveryNumberOfPages 0
stylesheetPath stylesheet.css
tableBorders detect
tablesOnly false
useAccurateGlyphBox true
usePDFEmbeddedFont true
xsltStylesheetPath none
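
Several of the parameters documented earlier carry constraints that can be checked before a conversion starts: graphicsOutputDpi must be between 12 and 2400, cmapDir must end with a slash, and singleFile requires skipPageBackground and splitByBookmarkDepth to be zero. The sketch below illustrates those checks in C. ExampleSettings is a hypothetical structure written for this example; it is not the SDK's Parameters structure, whose field names and types should be taken from PDFAlchemist.h. Accepting a backslash as a trailing separator is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of a few documented settings. NOT the SDK's
   Parameters structure; the field types here are assumptions. */
typedef struct {
    int         graphicsOutputDpi;         /* documented range: 12 to 2400 */
    const char *cmapDir;                   /* NULL means "not set" */
    int         singleFile;                /* 0 = false, nonzero = true */
    int         skipPageBackground;        /* 0 = false, nonzero = true */
    int         splitByBookmarkDepth;      /* 0 = no splitting */
    int         splitByEveryNumberOfPages; /* 0 = no splitting */
} ExampleSettings;

typedef enum {
    CHECK_OK = 0,
    CHECK_DPI_OUT_OF_RANGE,
    CHECK_CMAPDIR_NO_TRAILING_SLASH,
    CHECK_SINGLEFILE_CONFLICT
} CheckResult;

/* Returns the first documented constraint that the settings violate. */
CheckResult check_settings(const ExampleSettings *s)
{
    assert(s != NULL);

    if (s->graphicsOutputDpi < 12 || s->graphicsOutputDpi > 2400)
        return CHECK_DPI_OUT_OF_RANGE;

    if (s->cmapDir != NULL) {
        size_t n = strlen(s->cmapDir);
        /* Assumption: a backslash also counts as a trailing slash. */
        if (n == 0 || (s->cmapDir[n - 1] != '/' && s->cmapDir[n - 1] != '\\'))
            return CHECK_CMAPDIR_NO_TRAILING_SLASH;
    }

    if (s->singleFile &&
        (s->skipPageBackground != 0 || s->splitByBookmarkDepth != 0))
        return CHECK_SINGLEFILE_CONFLICT;

    return CHECK_OK;
}

/* Returns 1 when page-count splitting would take effect. Per the note on
   splitByEveryNumberOfPages, bookmark splitting wins if both are nonzero. */
int page_split_is_effective(const ExampleSettings *s)
{
    assert(s != NULL);
    return s->splitByBookmarkDepth == 0 && s->splitByEveryNumberOfPages != 0;
}
```

Running a check like this before the conversion call makes it easier to tell a caller mistake apart from a failure inside the conversion itself.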

For best results, we recommend explicitly supplying a Parameters structure with the specific conversion parameters that best fit your usage workflow.
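
When supplying your own values, the pageRanges syntax is one of the easier settings to get wrong. As a hedged illustration, the following C sketch checks a string against the format described for pageRanges: comma-separated pages or ranges such as 1,3-5,7, and open-ended ranges such as 22-. The function name page_ranges_valid is hypothetical. The rules that page numbers start at 1, that the second page of a range may not precede the first, and that spaces are rejected are assumptions rather than documented SDK behavior.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Reads a positive decimal page number at *p and advances *p past it.
   Returns 0 if there are no digits, the value is 0, or it is too large. */
static long read_page(const char **p)
{
    long v = 0;

    if (!isdigit((unsigned char)**p))
        return 0;
    while (isdigit((unsigned char)**p)) {
        if (v > 100000000L)
            return 0;             /* guard against overflow */
        v = v * 10 + (**p - '0');
        (*p)++;
    }
    return v;
}

/* Hypothetical helper, not part of the SDK.
   Returns 1 if `spec` looks like a valid pageRanges string, 0 otherwise. */
int page_ranges_valid(const char *spec)
{
    const char *p = spec;

    assert(spec != NULL);
    if (*p == '\0')
        return 0;

    for (;;) {
        long first = read_page(&p);
        if (first == 0)
            return 0;

        if (*p == '-') {
            p++;
            /* "N-" with nothing after the hyphen is an open-ended range. */
            if (*p != ',' && *p != '\0') {
                long last = read_page(&p);
                if (last == 0 || last < first)
                    return 0;
            }
        }

        if (*p == '\0')
            return 1;             /* consumed the whole string */
        if (*p != ',')
            return 0;             /* unexpected character */
        p++;                      /* move on to the next item */
    }
}
```

For instance, page_ranges_valid("1,3-5,7,9,14-18") and page_ranges_valid("22-") both return 1, while page_ranges_valid("5-3") returns 0 under the assumptions listed above.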

Return value:

Value Description
kSuccess Indicates that the conversion completed successfully
kFailed Indicates that an error occurred during conversion
kLicenseInvalid Indicates that a license could not be found or successfully used
kParameterOutofRange A parameter supplied in the control structure is outside the range of permitted values. For example, if graphicsOutputDpi were set to 3600, you would see this return value, because the highest output resolution allowed for a graphic file is 2400 DPI. No processing is performed.
kInvalidPDFInput The file provided for processing was not a valid PDF file or has syntax errors that prevent PDF Alchemist from being able to open and process it.
kInternalError An internal error occurred during processing. PDF Alchemist cannot provide any additional information. No output was generated.
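
The return values above can be turned into readable messages for logs or user feedback. The sketch below shows one way to do that in C. The Result enumeration declared here is a stand-in so that the example compiles without the SDK: its enumerator names are taken from the table, but their order and numeric values are assumptions, and describe_result and report_result are hypothetical helpers. In a real build, include PDFAlchemist.h instead of the stand-in.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the SDK's Result type. Names come from the return value
   table; the order and numeric values are assumptions. */
typedef enum {
    kSuccess,
    kFailed,
    kLicenseInvalid,
    kParameterOutofRange,
    kInvalidPDFInput,
    kInternalError
} Result;

/* Hypothetical helper: maps a return value to a short message. */
const char *describe_result(Result r)
{
    switch (r) {
    case kSuccess:             return "conversion completed successfully";
    case kFailed:              return "an error occurred during conversion";
    case kLicenseInvalid:      return "license could not be found or used";
    case kParameterOutofRange: return "a parameter is outside its permitted range";
    case kInvalidPDFInput:     return "input is not a valid PDF or could not be opened";
    case kInternalError:       return "internal error; no output was generated";
    }
    return "unrecognized result code";
}

/* Hypothetical helper: writes a one-line summary to `out`. The confidence
   score is printed only for kSuccess, since it is filled in only on a
   successful conversion. Returns the value returned by fprintf. */
int report_result(FILE *out, Result r, float confidenceScore)
{
    assert(out != NULL);
    if (r == kSuccess)
        return fprintf(out, "converted (confidence %.0f of 100)\n",
                       confidenceScore);
    return fprintf(out, "conversion failed: %s\n", describe_result(r));
}
```

A caller would pass the value returned by processPdf, together with the confidence score it filled in, to report_result(stderr, result, score).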