The Datalogics PDF Alchemist SDK provides a simple C language API to convert PDF files to HTML, XML, or EPUB content. The API (on Windows, the link library) is contained within the SDK subdirectory of the product folder.
PDF Alchemist SDK supports the following platforms:
|Operating System||Architecture||Binary SDK Files||Notes|
|Microsoft Windows||x86_64||PDFAlchemist_x64.lib, PDFAlchemist_x64.dll||Visual Studio 2013 required. Windows 7 or higher supported.|
|Linux||x86_64||libPDFAlchemist.so||Verified on Ubuntu 14.04 and Red Hat Enterprise Linux 7 target systems; other Linux versions using Linux kernel 3.2 or higher are also supported.|
PDF Alchemist SDK declarations are contained within the file PDFAlchemist.h.
|cmapDir||Path to the Adobe CMap files for supporting the conversion of text represented in CID form to Unicode. This is a C‐style string and must have a trailing slash character. The string must remain valid until the return of the PDF processing call.|
|disableAcroForm||Emit Acrobat PDF forms as flattened content instead of interactive fixed‐layout HTML forms.|
|enableAcroformsReflow||Emit PDF forms documents as reflowable HTML forms. This parameter is currently experimental; reflowForms may generate unwanted results for forms where the appearance of the form is a mix of PDF page elements and PDF form elements. The
|enablebookmarks||The destinations found in the PDF outline in the source document are exported to the output HTML file as a series of bookmarks. This feature allows you to create a table of contents in your HTML output file based on the outline provided in the PDF source file. The program will write a separate file called
|enablecaptions||By default, PDF Alchemist does not look for caption text under images that appear in a PDF document. Set this option to “true” to tell the software to look for text appearing under images and define that text, when found, as a caption.|
|enableInfographicDetection||Detect infographic contents (including charts and diagrams).|
|enableLayoutDirectionDetection||Detect CJK (Chinese/Japanese/Korean) vertical layout text.|
|enableLogging||Output progress logging to the console (stdout).|
|enableXmlOutput||Convert the PDF input file to XML. No HTML files are generated.|
|graphicsOutputDpi||Specify the target resolution for images in the PDF document that you want to export to rasterized graphic files. Valid values are from 12 to 2400. Set a value higher than 200 DPI if you want to improve the resolution of graphics drawn from the PDF document so that they will look better in a browser window. Set a value lower than 200 DPI to generate export graphic files that are smaller.|
|linkstyleunspecified||Make hyperlinks appear as standard HTML links. This option ignores any link styles found in the source PDF document and uses the browser's default style for presenting links.|
|merge_span||Disable the output of style information and <span> tags to XML files.|
|ocrLanguage||If you are using Optical Character Recognition (OCR) to pull text from images in an input PDF document (see ocrMode), the OCR utility defaults to English-language text, but it supports other languages as well.
Use this parameter to select German (deu), French (fra), Italian (ita), Dutch (nld), Portuguese (por) or Spanish (spa) if your input files use one of these languages instead.
Note you can only select one language at a time for this option.
Default: eng (English)
|ocrMode||By default, PDF Alchemist passes images in PDF files through to its output without looking for text in these images. If this option is set to “tag,” PDF Alchemist uses optical character recognition (OCR) to scan images when converting PDF files. Any text found within an image is embedded in the image reference alt attributes.
If ocrMode is set to “replace,” the OCR feature is turned on and the OCR text replaces the original image in the output file. The process creates selectable text in the HTML or XML output, and the source image is removed. The text is also tagged as OCR text within the export file. This allows the person reviewing the output file to know where the text came from, and it also serves as a warning, as the OCR text might not be rendered perfectly.
For an HTML or EPUB output file, any text generated from an image in the PDF input file using OCR is marked with this tag:
For an XML file, the tag looks like this:
Note: OCR processing can lead to substantial increases in processing time and should be enabled only when desired.
Default: off – No OCR processing is performed during conversion.
|purpose||Specify the goal of the PDF conversion. Three enumeration values are provided:
kIndexing: Do not rasterize any text. This preserves text so it can be used to index the content to allow for efficient searching. But the appearance of the output might differ significantly from the appearance of the input PDF.
kBalanced: Create output for general purpose uses. In this case the need for searchable output is balanced with the need for output with an appearance that is similar to the original PDF input file. Text may be rasterized when required to preserve the visual appearance and ordering of text in relation to other elements.
kVisual: Reserved for future use.
|removeHyphen||Remove hyphens that appear at the end of lines in the PDF input document. The hyphens are used to divide words at syllables or create hyphenated phrases. Note that this algorithm does not interpret the text involved. It simply removes hyphens wherever they appear. Therefore enabling this option will cause hyphenated phrases that span lines to be combined into single words.|
|singleFile||Emit HTML and CSS in one file instead of separate HTML and CSS files. If singleFile is set to True
|skipHeaderFooter||Don’t output content detected as a repeated header or footer in the HTML output.|
|skipPageBackground||Don’t output an image detected as the page background in the HTML output.|
|splitByBookmarkDepth||By default, PDF Alchemist generates output content in one section that contains all of the generated output. You can instead direct the product to generate output in sections based on the bookmarks (table of contents entries) found in the source PDF document. The number you enter for this parameter determines how many bookmark levels are used to split the content.
Enter “0” for no splitting.
Enter “1” to break the content into sections based on the first level of the table of contents (“Heading 1,” if you will).
Enter “2” to break the content into sections based on the first two levels (Heading 1 and Heading 2).
Enter “3” to break the content into sections based on the first three levels (Heading 1, Heading 2, and Heading 3).
|splitByEveryNumberOfPages||By default, PDF Alchemist generates output content in one section that contains all of the generated output. Enter a number for this parameter to split the content into sections based on the page breaks found in the original PDF input document. For example, if you enter “3,” PDF Alchemist takes the first three pages from the PDF input file and puts them into a single HTML output file. It then takes the next three pages from the input PDF document and places them into a second HTML output file. The product continues to create new sections (output HTML files) until all of the content in the input PDF document is converted.
Note: If both
|useAccurateGlyphBox||Use glyph metrics based on the embedded font data instead of the information provided in the PDF font dictionary.|
|usePDFEmbeddedFont||For fonts embedded in the PDF input, emit renditions of the fonts, and reference the generated fonts in the HTML output, instead of emitting only references to fonts.|
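Several constraints in the parameter table can be checked before a conversion is attempted: cmapDir needs a trailing slash, graphicsOutputDpi must lie between 12 and 2400, and ocrLanguage accepts a fixed set of codes. The following is a minimal sketch of such checks in plain C. The helper names (`dpi_in_range`, `ocr_language_supported`, `with_trailing_slash`) are hypothetical and not part of PDFAlchemist.h, and the sketch deliberately does not touch the Parameters structure, whose field types are not described here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Limits copied from the graphicsOutputDpi description. */
#define DPI_MIN 12
#define DPI_MAX 2400

/* True if dpi lies in the documented 12..2400 range. */
bool dpi_in_range(int dpi) {
    return dpi >= DPI_MIN && dpi <= DPI_MAX;
}

/* True if code is one of the language codes listed for ocrLanguage. */
bool ocr_language_supported(const char *code) {
    static const char *const codes[] = {
        "eng", "deu", "fra", "ita", "nld", "por", "spa"
    };
    assert(code != NULL);
    for (size_t i = 0; i < sizeof codes / sizeof codes[0]; ++i) {
        if (strcmp(code, codes[i]) == 0) {
            return true;
        }
    }
    return false;
}

/* Copies dir into out (capacity cap bytes) and appends '/' if dir does
   not already end with one. Returns false for an empty dir or if the
   result does not fit. Only '/' is treated as a slash here; handling of
   a trailing backslash on Windows is left out of this sketch. */
bool with_trailing_slash(const char *dir, char *out, size_t cap) {
    assert(dir != NULL && out != NULL);
    size_t len = strlen(dir);
    if (len == 0) {
        return false;
    }
    bool has_slash = dir[len - 1] == '/';
    size_t need = len + (has_slash ? 0 : 1) + 1; /* +1 for the terminator */
    if (need > cap) {
        return false;
    }
    memcpy(out, dir, len);
    if (!has_slash) {
        out[len++] = '/';
    }
    out[len] = '\0';
    return true;
}
```

Because cmapDir must remain valid until the processing call returns, the buffer passed as `out` should outlive that call (for example, a buffer owned by the caller for the duration of the conversion).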
To convert a PDF file to HTML or XML, use the processPdf API call:
Result processPdf ( const char* pdfFilePath, const char* outputDir, Parameters* params, Float *confidenceScore)
This API call works the same way for HTML and XML. The only difference is that XML output requires setting the enableXmlOutput parameter to True.
To convert a PDF file to EPUB, use the processPdf2Epub API call:
Result processPdf2Epub ( const char* pdfFilePath, const char* outputDir, Parameters* params, Float *confidenceScore)
Both calls accept the following parameters:
|pdfFilePath||Input PDF file path|
|outputDir||Directory to write output into. This directory must exist before calling. This may be either a relative or absolute path and it may be the current working directory.|
|params||Parameters for conversion. This may be NULL.|
|confidenceScore||Pointer to a float filled in by the call on successful conversion. This will be filled with a value between 0 and 100 to describe the system's confidence in the conversion. Complex layouts that might prompt the system to "guess" about the type of content and structure during the layout process tend to lead to lower confidence scores. Examples might include borderless tables or complex tables with multiple columns and large quantities of infographics.|
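Two points from the table above lend themselves to small helpers around a conversion call: the output directory must already exist, and the confidence score is a value from 0 to 100 that the application has to interpret. The sketch below assumes a POSIX system (it uses `stat`). The helper names `output_dir_exists` and `needs_review`, and the idea of a review threshold, are illustrative assumptions and not part of the SDK.

```c
#define _POSIX_C_SOURCE 200809L /* for stat() and S_ISDIR on strict compilers */

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>

/* True if path names an existing directory. Relative paths, including
   ".", are resolved against the current working directory. This uses
   POSIX stat(); a Windows build would need a different call. */
bool output_dir_exists(const char *path) {
    struct stat st;
    if (path == NULL) {
        return false;
    }
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* True if a confidence score on the documented 0..100 scale falls below
   a caller-chosen threshold. The threshold is application policy; the
   SDK documentation does not define one. */
bool needs_review(float confidence, float threshold) {
    assert(threshold >= 0.0f && threshold <= 100.0f);
    return confidence < threshold;
}
```

An application might call `output_dir_exists` on outputDir before invoking processPdf or processPdf2Epub, and call `needs_review` on the returned score only after a successful conversion, since the score is filled in only on success.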
If a NULL pointer is passed for the Parameters structure to either call, the following set of default conversion parameters will be used:
|cmapDir||No default path|
|splitByBookmarkDepth||0 for HTML output; 1 for EPUB output|
For best results, we recommend explicitly supplying a Parameters structure with the specific conversion parameters that best fit your usage workflow.
|kSuccess||Indicates that the conversion completed successfully|
|kFailed||Indicates that an error occurred during conversion|
|kLicenseInvalid||Indicates that a license could not be found or successfully used|
|kParameterOutofRange||A parameter supplied in the control structure is outside the range of permitted values. For example, setting graphicsOutputDpi to 3600 produces this return value, because the highest output resolution allowed for a graphic file is 2400 DPI. No processing is performed.|
|kInvalidPDFInput||The file provided for processing was not a valid PDF file or has syntax errors that prevent PDF Alchemist from being able to open and process it.|
|kInternalError||An internal error occurred during processing. PDF Alchemist cannot provide any additional information. No output was generated.|
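A caller will usually want to turn the return value into a log or error message. The sketch below maps each code in the table above to a short description. So that it compiles on its own, it declares a stand-in `Result` enum using the documented names; the numeric values are placeholders, and real code would include PDFAlchemist.h instead of this declaration. `result_message` is a hypothetical helper, not an SDK function.

```c
/* Stand-in for the SDK's Result type. The enumerator names come from
   the return-value table; their order and values here are placeholders.
   Remove this typedef and include PDFAlchemist.h in real code. */
typedef enum {
    kSuccess,
    kFailed,
    kLicenseInvalid,
    kParameterOutofRange,
    kInvalidPDFInput,
    kInternalError
} Result;

/* Returns a short, static description for a result code. */
const char *result_message(Result r) {
    switch (r) {
    case kSuccess:             return "conversion succeeded";
    case kFailed:              return "an error occurred during conversion";
    case kLicenseInvalid:      return "license not found or not usable";
    case kParameterOutofRange: return "a parameter is out of range";
    case kInvalidPDFInput:     return "input is not a valid PDF";
    case kInternalError:       return "internal error; no output generated";
    }
    return "unrecognized result code";
}
```

Keeping the `switch` free of a `default` label lets many compilers warn when a new enumerator is added but not handled, while the final `return` still covers unexpected values at run time.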