The Datalogics PDF Alchemist SDK provides a simple C language API to convert PDF files to HTML, XML, JSON, or EPUB content. The API (on Windows, the link library) is located in the SDK subdirectory of the product folder.
PDF Alchemist SDK supports the following platforms:
|Operating System||Architecture||Binary SDK Files||Notes|
|Microsoft Windows||x86_64||PDFAlchemist_x64.lib||Visual Studio 2013 required.|
|PDFAlchemist_x64.dll||Windows 7 or higher supported.|
|Linux||x86_64||libPDFAlchemist.so||Verified on Ubuntu 14.04 and Red Hat Enterprise Linux 7 target systems;|
|other Linux versions using Linux kernel 3.2 or higher are also supported.|
PDF Alchemist SDK declarations are contained within the file PDFAlchemist.h.
|blackText||Render any text found in a PDF document as black text.|
|borderlessTableDetectionOnPages||By default, PDF Alchemist looks for tables without borders. Enter a range of page numbers to limit the search for borderless tables to only part of a document. Set this option to “none” to disable the feature; in that case only tables with borders are rendered as tables in the output file, and any borderless tables found are exported as standard text.
Defaults to 1-, or all pages in a document, meaning that PDF Alchemist will by default look for and render borderless tables.
|cmapDir||Path to the Adobe CMap files for supporting the conversion of text represented in CID form to Unicode. This is a C‐style string and must have a trailing slash character. The string must remain valid until the return of the PDF processing call.|
|disableAcroForm||Emit Acrobat PDF forms as flattened content instead of interactive fixed‐layout HTML forms.|
|enableAcroformReflow||Emit PDF form documents as reflowable HTML forms. This parameter is currently experimental; reflowForms may generate unwanted results for forms where the appearance of the form is a mix of PDF page elements and PDF form elements. The
|enablebookmarks||The destinations found in the PDF outline in the source document are exported to the output HTML file as a series of bookmarks. This feature allows you to create a table of contents in your HTML output file based on the outline provided in the PDF source file. The program will write a separate file called
|enablecaptions||By default, PDF Alchemist does not look for caption text under images that appear in a PDF document. Set this option to “true” to tell the software to look for text appearing under images and define that text, when found, as a caption.|
|enableInfographicDetection||Detect infographic contents (including charts and diagrams).|
|enableLayoutDirectionDetection||Detect CJK (Chinese/Japanese/Korean) vertical layout text.|
|enableLogging||Output progress logging to the console (stdout).|
|enableXmlOutput||Convert the PDF input file to XML. No HTML files are generated.
Note: The enableXmlOutput parameter has been deprecated. We recommend you use the outputFormat parameter instead.
|fontDirectoryPath||Provide a custom folder name for the font (TTF) files extracted from a PDF source file.
Defaults to /fonts.
|fontFilenamePrefix||Provide a custom prefix to add to the name of each TTF font file extracted from a PDF source file.
Defaults to no prefix name.
|graphicsOutputDpi||Specify the target resolution for images in the PDF document that you want to export to rasterized graphic files. Valid values are from 12 to 2400. Set a value higher than 200 DPI if you want to improve the resolution of graphics drawn from the PDF document so that they will look better in a browser window. Set a value lower than 200 DPI to generate export graphic files that are smaller.|
|imageDirectoryPath||Provide a custom folder name for the image (PNG) files extracted from a PDF source file.
Defaults to /images.
|imageFilenamePrefix||Provide a custom prefix to add to the name of each PNG image file extracted from a PDF source file.
Defaults to no prefix name.
|linkstyleunspecified||Make hyperlinks appear as standard HTML links. This feature ignores any link styles found in the source PDF document and uses the browser's default style for presenting links.|
|merge_span||Disable the output of style information and <span> tags to XML files.|
|ocrLanguage||If you are using Optical Character Recognition (OCR) to pull text from images in an input PDF document (see ocrMode), the OCR utility defaults to English-language text, but it supports other languages as well.
Use this parameter to select German (deu), French (fra), Italian (ita), Dutch (nld), Portuguese (por), or Spanish (spa) if your input files use one of these languages instead.
Note that you can select only one language at a time for this option.
Default: eng (English)
|ocrMode||By default, PDF Alchemist passes images in PDF files through to its output without looking for text in these images. If this option is set to “tag,” PDF Alchemist uses optical character recognition (OCR) to scan images when converting PDF files. Any text found within an image is embedded in the image reference alt attributes.
If ocrMode is set to “replace,” the OCR feature is turned on and the OCR text replaces the original image in the output file. The process creates selectable text in the HTML or XML output, and the source image is removed. The text is also tagged as OCR text within the export file. This allows the person reviewing the output file to know where the text came from, and it also serves as a warning, as the OCR text might not be rendered perfectly. For an HTML or EPUB output file, any text generated from an image in the PDF input file using OCR is marked with this tag:
For an XML file, the tag looks like this:
And for a JSON file, it looks like this:
Note: OCR processing can lead to substantial increases in processing time and should be enabled only when desired.
Default: off – No OCR processing is performed during conversion.
|outputFilename||Use this parameter to assign a custom file name for your export file.
Default: Name of the PDF input file
For XML output, the file name defaults to “exportedXML.xml”
|outputFormat||Use this parameter to specify the type of output that PDF Alchemist generates: HTML, XML, EPUB, or JSON. The enumeration values are kHtml, kXml, kEPub, and kJson.
Default: kHtml – generate HTML output
|pageRanges||PDF Alchemist converts every page in a PDF document to HTML, XML, JSON, or EPUB output by default. Use this option to select a specific set or range of pages for processing. The rest of the pages in the document are ignored. You may also include multiple page ranges in the statement. Each page range is described with two page numbers separated by a hyphen, as in 1-4 for the first four pages of a document, and ranges and single pages can be combined in a comma-separated list, as in 1,3-5,7,9,14-18. If a page range does not include a second value, PDF Alchemist processes all of the pages in the document from the page number provided onward. For example, for “22-” PDF Alchemist starts at page 22 and processes all of the remaining pages in the document.|
|purpose||Specify the goal of the PDF conversion. Three enumeration values are provided:
kIndexing: Do not rasterize any text. This preserves text so it can be used to index the content to allow for efficient searching. But the appearance of the output might differ significantly from the appearance of the input PDF.
kBalanced: Create output for general purpose uses. In this case the need for searchable output is balanced with the need for output with an appearance that is similar to the original PDF input file. Text may be rasterized when required to preserve the visual appearance and ordering of text in relation to other elements.
kVisual: Reserved for future use.
|reflowText||PDF Alchemist by default reflows all of the text found in a PDF source document. Select false to turn this option off, and to add a line break at the end of every line of text.|
|removeHyphen||Remove hyphens that appear at the end of lines in the PDF input document. The hyphens are used to divide words at syllables or create hyphenated phrases. Note that this algorithm does not interpret the text involved. It simply removes hyphens wherever they appear. Therefore enabling this option will cause hyphenated phrases that span lines to be combined into single words.|
|removeInvisibleText||By default, PDF Alchemist exports all of the text found in every layer of a PDF document, and in its native color, usually black. If a PDF document has white text, this text might not be visible against a white background in an HTML export file. Set this option to “true” to discard this white text so that it is not included in the export file. Note that PDF Alchemist can also convert all of the text found in a PDF document to black text for export. See blackText above.|
|singleFile||Emit HTML and CSS in one file instead of separate HTML and CSS files. If singleFile is set to True
|skipHeaderFooter||Don’t output content detected as repeated header and footer in HTML output. The PDF document must be at least four pages long for PDF Alchemist to identify any headers and footers as being repeated in that document.|
|skipPageBackground||Don’t output image detected as page background in HTML output.|
|splitByBookmarkDepth||By default, PDF Alchemist generates its output as a single section that contains all of the generated content. You can instead direct the product to split the output into sections based on the bookmarks (table of contents entries) found in the source PDF document. The number you enter for this parameter determines how many bookmark levels are used for splitting.
Enter “0” for no splitting.
Enter “1” if you want the output to be split into sections based on the first level of the table of contents (the equivalent of “Heading 1”).
Enter “2” to split at the first two levels (Heading 1 and Heading 2).
Enter “3” to split at the first three levels (Heading 1, Heading 2, and Heading 3).
|splitByEveryNumberOfPages||By default, PDF Alchemist generates its output as a single section that contains all of the generated content. Enter a number for this parameter to split the content into sections that each cover that many pages of the original PDF input document. For example, if you enter “3”, PDF Alchemist takes the first three pages of the PDF input file and puts them into a single output file, then takes the next three pages and places them into a second HTML output file. The product continues to create new sections (output HTML files) until all of the content in the input PDF document is converted.
Note: If both
|stylesheetPath||Provide a custom file name and path for the stylesheet (CSS) file created by PDF Alchemist.
Defaults to stylesheet.css.
|tableBorders||Specify how table borders are handled in the export file.
Defaults to detect: tables are exported as they appear in the source document.
|tablesOnly||Output only the table content from a PDF document to an export file, and disregard text, images, and any other content found.
Defaults to false: all content is exported.
|useAccurateGlyphBox||Use glyphs metrics based on the embedded font data instead of the info provided in PDF font dictionary.|
|usePDFEmbeddedFont||For fonts embedded in the PDF input, emit renditions of the fonts, and reference the generated fonts in the HTML output, instead of emitting only references to fonts in the HTML output.|
|xsltStylesheetPath||Provide the file name and path for an XSLT stylesheet. The stylesheet will be used to transform the XML output file. XSLT (Extensible Stylesheet Language Transformations) is a language for transforming XML documents, or for converting XML files into other file formats. PDF Alchemist supports XSLT language up to version 1.1.
If you use xsltStylesheetPath in creating output, you need to define XML as the output file type with outputFormat. The transformation you define in your XSLT stylesheet is applied to PDF Alchemist’s XML output, and the result is saved by default as a .txt file. You may override this default by specifying a custom file name and file extension using the outputFilename parameter.
Note: PDF Alchemist provides a sample XSLT stylesheet called xmltocsv-tables.xslt that demonstrates how to convert a PDF document into a CSV export file.
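The page range syntax used by pageRanges (and by borderlessTableDetectionOnPages) can be illustrated with a small parser. The following is a minimal sketch and not SDK code: the helper name page_in_ranges is hypothetical, and the SDK's own handling of whitespace or malformed input is not documented here and may differ.

```c
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the PDF Alchemist SDK.
 * Returns true if the 1-based page number `page` is selected by a
 * range string such as "1,3-5,7,9,14-18" or "22-".
 * Returns false for an empty or malformed string. */
bool page_in_ranges(const char *ranges, long page)
{
    const char *p = ranges;
    bool selected = false;

    if (ranges == NULL || *ranges == '\0' || page < 1)
        return false;

    while (*p != '\0') {
        char *end;
        long first, last;

        /* Every entry must start with a page number. */
        if (!isdigit((unsigned char)*p))
            return false;
        first = strtol(p, &end, 10);
        if (first < 1)
            return false;
        p = end;

        if (*p == '-') {
            p++;
            if (*p == ',' || *p == '\0') {
                /* Open-ended range such as "22-": runs to the last page. */
                last = LONG_MAX;
            } else {
                if (!isdigit((unsigned char)*p))
                    return false;
                last = strtol(p, &end, 10);
                if (last < first)
                    return false;
                p = end;
            }
        } else {
            /* A single page such as "7". */
            last = first;
        }

        if (page >= first && page <= last)
            selected = true;

        if (*p == ',') {
            p++;
            if (*p == '\0')
                return false;   /* trailing comma */
        } else if (*p != '\0') {
            return false;       /* unexpected character */
        }
    }
    return selected;
}
```

With the string “1,3-5,7,9,14-18”, this helper reports page 4 as selected and page 6 as skipped; with “22-”, every page from 22 onward is selected.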
To convert a PDF file to HTML, XML, EPUB, or JSON, use the processPdf API call:
Result processPdf ( const char* pdfFilePath, const char* outputDir, Parameters* params, Float *confidenceScore)
Where this call accepts the following parameters:
|pdfFilePath||Input PDF file path|
|outputDir||Directory to write output into. This directory must exist before calling. This may be either a relative or absolute path and it may be the current working directory.|
|params||Pointer to control Parameters structure for conversion as documented above. This may be NULL.|
|confidenceScore||Pointer to a float filled in by the call on successful conversion. This will be filled with a value between 0 and 100 to describe the system's confidence in the conversion. Complex layouts that might prompt the system to "guess" about the type of content and structure during the layout process tend to lead to lower confidence scores. Examples might include borderless tables or complex tables with multiple columns and large quantities of infographics.|
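Because outputDir must exist before processPdf is called, a caller may want to create the directory first. The following is a minimal sketch for POSIX systems (such as the supported Linux platforms), not SDK code: the helper name ensure_output_dir is hypothetical, it only creates the final path component, and a Windows build would need a different directory API.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper, not part of the PDF Alchemist SDK (POSIX only).
 * Returns true if `path` is an existing directory, or if it could be
 * created. Parent directories are not created. */
bool ensure_output_dir(const char *path)
{
    struct stat st;

    if (path == NULL || *path == '\0')
        return false;

    /* Already present: succeed only if it is a directory. */
    if (stat(path, &st) == 0)
        return S_ISDIR(st.st_mode);

    /* Any error other than "does not exist" is treated as a failure. */
    if (errno != ENOENT)
        return false;

    if (mkdir(path, 0755) == 0)
        return true;

    /* Another process may have created the directory in the meantime. */
    return errno == EEXIST && stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}
```

When the helper returns true, the same path can be passed as the outputDir argument of processPdf.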
When a NULL value for the Parameters structure pointer is specified to the call, the following set of default conversion parameters will be used:
|cmapDir||No default path|
|enableXmlOutput||false (this has been deprecated)|
|fontFilenamePrefix||no prefix name|
|imageFilenamePrefix||no prefix name|
|outputFilename||Name of PDF input file|
|pageRanges||all pages in document|
|splitByBookmarkDepth||0 for HTML output; 1 for EPUB output|
For best results, we recommend explicitly supplying a Parameters structure with the specific conversion parameters that best fit your usage workflow.
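When you fill in parameter values yourself, it can help to check them against the documented constraints before making the call. The sketch below covers two constraints stated in the parameter descriptions: graphicsOutputDpi must be between 12 and 2400, and cmapDir must end with a slash. Both helper names are hypothetical and not part of the SDK; the layout and field types of the Parameters structure are not described here, so the sketch works on plain values only.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helpers, not part of the PDF Alchemist SDK. */

/* The documented valid range for graphicsOutputDpi is 12 to 2400. */
bool graphics_dpi_in_range(int dpi)
{
    return dpi >= 12 && dpi <= 2400;
}

/* cmapDir must have a trailing slash character. Only '/' is checked here;
 * whether a backslash is also accepted on Windows is not stated. */
bool has_trailing_slash(const char *dir)
{
    size_t len;

    if (dir == NULL)
        return false;
    len = strlen(dir);
    return len > 0 && dir[len - 1] == '/';
}
```

A value that fails graphics_dpi_in_range, such as 3600, is the kind of input that the kParameterOutofRange return value reports.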
|kSuccess||Indicates that the conversion completed successfully|
|kFailed||Indicates that an error occurred during conversion|
|kLicenseInvalid||Indicates that a license could not be found or successfully used|
|kParameterOutofRange||A parameter supplied in the control structure is outside the range of permitted values. For example, if graphicsOutputDpi is set to 3600, this value is returned, because the highest output resolution allowed for a graphic file is 2400 DPI. No processing is performed.|
|kInvalidPDFInput||The file provided for processing was not a valid PDF file or has syntax errors that prevent PDF Alchemist from being able to open and process it.|
|kInternalError||An internal error occurred during processing. PDF Alchemist cannot provide any additional information. No output was generated.|
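A caller will usually want to turn these return values into log or error messages. The following is a minimal sketch of one way to do that. The enumeration shown is only a stand-in so that the example compiles without the SDK: its numeric values are assumptions, and in a real build you would include PDFAlchemist.h and remove the stand-in. The function name result_message is hypothetical.

```c
/* Stand-in for the Result type declared in PDFAlchemist.h. The member names
 * come from the table above; the numeric values are NOT known and are
 * assumed here. Remove this typedef when building against the SDK header. */
typedef enum {
    kSuccess,
    kFailed,
    kLicenseInvalid,
    kParameterOutofRange,
    kInvalidPDFInput,
    kInternalError
} Result;

/* Hypothetical helper: maps a result code to a short description. */
const char *result_message(Result r)
{
    switch (r) {
    case kSuccess:
        return "conversion completed successfully";
    case kFailed:
        return "an error occurred during conversion";
    case kLicenseInvalid:
        return "license could not be found or used";
    case kParameterOutofRange:
        return "parameter outside the permitted range; nothing processed";
    case kInvalidPDFInput:
        return "input is not a valid PDF or has blocking syntax errors";
    case kInternalError:
        return "internal error; no output generated";
    default:
        return "unrecognized result code";
    }
}
```

The default branch keeps the helper safe if a later SDK release adds return values that this list does not cover.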