PDF Alchemist

Descriptions of Command Line Options

We describe these parameters below.

Option name Comments
-blackText

Boolean “true” or “false”

By default, PDF Alchemist exports text from a PDF document in its native color. That color is usually black, but if a PDF uses colored text against a colored background, such as white text on a dark blue page, the original text might not be visible in the HTML export file. Set this option to “true” so that any text drawn from a PDF document is always rendered as black text in the export file.

Defaults to False. PDF Alchemist exports text from PDF documents in its native color, and does not convert it to black.

-borderlessTableDetectionOnPages

Comma-separated list of whole numbers, with dashes

Borderless table detection is a feature in PDF Alchemist where the product looks for tables without borders and then renders that content as tables in the export HTML or XML file. You may want to turn this function off for part of a PDF source document. That way, the product only renders tables with borders when exporting content in the section of the document that you specify. This might be useful if part of a source PDF document has content that might look like a table but in fact is not, such as a table of contents.

You can enter a page range to apply borderless table detection only to certain pages. Enter this value if you want to start at page five and continue until the end of the document:

5-

Or enter this value to direct PDF Alchemist to skip the first four pages of the PDF document, look for borderless tables on pages five through ten, and then stop:

5-10

You may also include multiple page ranges in the statement. In this example, you are telling PDF Alchemist to look for borderless tables on pages 1, 3, 4, 5, 7, 9, and 14-18 in a PDF document:

-borderlessTableDetectionOnPages 1,3-5,7,9,14-18

If you are working with a source document that you know does not have any tables, you can disable the Borderless Table Detection feature by setting this option to “none.”

Default: 1- – PDF Alchemist looks for tables without borders across the entire document.
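The page-range syntax described above can be read as follows. This is a minimal Python sketch of how such a value expands into page numbers. It is not PDF Alchemist's own parser, and the function name expand_page_ranges and its page_count argument are assumptions made for illustration.

```python
def expand_page_ranges(spec, page_count):
    """Expand a value such as "1,3-5,7,9,14-18" or "5-" into page numbers.

    Hypothetical helper for illustration only. page_count is the number
    of pages in the document; it is needed to resolve an open-ended
    range such as "5-".
    """
    if spec.strip().lower() == "none":
        return []
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            start = int(first)
            # An empty right-hand side means "to the end of the document".
            end = int(last) if last else page_count
        else:
            start = end = int(part)
        # Keep only pages that exist in the document.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)


print(expand_page_ranges("1,3-5,7,9,14-18", 20))
print(expand_page_ranges("5-", 8))
```

For a 20-page document, the first call lists pages 1, 3, 4, 5, 7, 9, and 14 through 18. For an 8-page document, the second call lists pages 5 through 8.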

-cmap

Path or directory name

A folder containing PDF format character code map (CMap) files used for converting text in character ID (CID) form to Unicode in the HTML output. The folder name must have a trailing slash character. To learn more about CMap files, see the description in the Datalogics Knowledge Base. A cmaps folder is included with the software installation package for PDF Alchemist.

Default:  “./cmaps/”

-enableBookmarks

Boolean “true” or “false”

The destinations found in the PDF outline in the source document (the table of contents) are exported to the output HTML file as a series of bookmarks. This HTML file contains a frame view of the generated PDF with the generated table of contents in one pane and the generated HTML content in a main viewing pane. This feature allows you to create a table of contents in your HTML output file based on the outline provided in the PDF source file. You can turn this feature off by setting the enableBookmarks value to False.

Defaults to True. The PDF file table of contents is generated.

-enableCaptions

Boolean “true” or “false”

By default, PDF Alchemist does not look for caption text under images that appear in a PDF document.  Set this option to “true” to tell the software to look for text appearing under images and define that text, when found, as a caption.

Defaults to False. PDF Alchemist  never looks for captions under graphics.

-enableInfographicDetection

Boolean “true” or “false”

By default, PDF Alchemist seeks to detect infographic content, such as charts and diagrams, and render them as separate export image files. But this process can interfere with how PDF Alchemist formats tables and hyperlinks. This option allows a user to turn this feature off.

Defaults to True. PDF Alchemist  detects infographic contents.

-flattenForms

Boolean “true” or “false”

By default, PDF Alchemist converts Acrobat Form documents (AcroForms) to interactive HTML forms in fixed-layout format. If this option is specified as true, PDF Alchemist will instead convert Acrobat PDF Forms in the PDF into flattened, non-interactive versions of the form that represent the appearance of the forms, using the normal appearance stream defined for each form element.

Defaults to False. Acrobat PDF forms will be converted to HTML forms.

-fontDirectoryPath

String, name of directory where font files are stored

By default PDF Alchemist extracts any fonts it finds embedded in the source PDF document and stores them in a separate /fonts directory.  You may want to provide a custom name for the directory where PDF Alchemist stores these font files. This could be useful if you are processing hundreds of PDF files for multiple clients and want to keep the output files distinct.

Enter a value to define the name of the output directory where the font files are stored. For example, you could use “/JonesMFG” as the name of the fonts directory. The directory is created relative to the output directory.

-fontDirectoryPath JonesMFG

You can also specify an absolute path. If you use an absolute path, the product must be able to form a relative path from the fonts to the style sheet file:

-fontDirectoryPath d:\pdfalchemist\JonesMFG

Default: /fonts

-fontFilenamePrefix

String, a value to assign to the name of every font file as a prefix

By default PDF Alchemist extracts any fonts it finds embedded in the source PDF document and stores them in a separate /fonts directory. You may want to assign a custom prefix to these font files. This could be useful if you are processing hundreds of files for multiple clients and want to keep them distinct.

Enter a value to add as a prefix to every font file.  For example, you could make the prefix “jonesmfg,” and the font files would be named jonesmfg1.ttf, jonesmfg2.ttf, jonesmfg3.ttf, and so on.

Defaults to digits, f0.ttf, f1.ttf, f2.ttf, and so on

-imageDPI

A whole number from 12 to 2400

Enter a number to specify the target resolution, in dots per inch (DPI), for images in the PDF document that you want to export to rasterized graphic files. To improve the resolution of graphics drawn from the PDF document, you could set this value to 600 or 1000 DPI, for example. The resulting image will look better in a browser, but the graphic file will also be larger. If you want the export PNG graphic files to be smaller, you could set the DPI to 72. The permitted range is from 12 DPI to 2400 DPI.

Defaults to 200 for 200 DPI
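To see why a higher DPI setting produces larger files, it helps to work out the pixel dimensions. The Python sketch below assumes that the pixel size of a rasterized image is its physical size in inches multiplied by the DPI value. The function name raster_size and the 4 x 3 inch figure are hypothetical, and the actual PNG file size also depends on compression.

```python
def raster_size(width_in, height_in, dpi):
    """Return (width_px, height_px) for an image rasterized at dpi."""
    # 12 to 2400 is the permitted range documented for -imageDPI.
    if not 12 <= dpi <= 2400:
        raise ValueError("imageDPI must be between 12 and 2400")
    return round(width_in * dpi), round(height_in * dpi)


# A hypothetical 4 x 3 inch figure at three settings.
for dpi in (72, 200, 600):
    width_px, height_px = raster_size(4, 3, dpi)
    print(dpi, width_px, height_px, width_px * height_px)
```

Under this assumption, the figure is 800 x 600 pixels at the default of 200 DPI and 2400 x 1800 pixels at 600 DPI. Tripling the DPI gives nine times as many pixels.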

-imageDirectoryPath

String, name of directory where image files are stored

You may want to provide a custom name for the directory where PDF Alchemist stores export image files extracted from a PDF source document. This could be useful if you are processing hundreds of PDF files for multiple clients, and want to keep the output files distinct.

Enter a value to define the name of the output directory where the image (PNG) output files are stored. For example, you could use “/JonesMFG” as the name of the image directory. The directory is created relative to the output directory.

-imageDirectoryPath JonesMFG

You can also specify an absolute path. If you use an absolute path, the product must be able to form a relative path from the image files to the style sheet file:

-imageDirectoryPath d:\pdfalchemist\JonesMFG

Defaults to /images

-imageFilenamePrefix

String, a value to assign to the name of every image file as a prefix

You may want to assign a custom prefix to the PNG graphics files that PDF Alchemist extracts from a source PDF document. This could be useful if you are processing hundreds of files for multiple clients, and want to keep them distinct.

Enter a value to add as a prefix to every PNG file.  For example, you could make the prefix “jonesmfg,” and the PNG files would be named jonesmfg1.png, jonesmfg2.png, jonesmfg3.png, and so on.

Defaults to digits, 0.png, 1.png, 2.png, and so on

-keepBackground

Boolean “true” or “false”

By default PDF Alchemist discards images that are determined to be background images. If this option is specified as “True” images that PDF Alchemist detects as background images will be retained in the output.

Defaults to False. Background images are discarded during conversion.

-keepEmbeddedFonts

Boolean “true” or “false”

By default PDF Alchemist emits fonts that are embedded in the PDF input to font files in the output directory. The product adds references to those fonts in its generated HTML. If this option is specified as “True,” PDF Alchemist will not emit these font files. Instead, PDF Alchemist will emit references for the font names found in the PDF file. This will cause fonts installed in the local target processing or viewing environment to be used when the HTML file is opened.

Defaults to False. Emit fonts embedded in the PDF as font files so that the HTML file will reference these generated fonts.

-keepHeaderFooter

Boolean “true” or “false”

By default PDF Alchemist discards page contents that it can determine are page headers and/or footers. This includes page numbers and titles. If this option is specified as “True” PDF Alchemist will preserve header and footer text in the output.

The PDF document must be at least four pages long for PDF Alchemist to identify any headers and footers as being repeated in that document.

Defaults to False. Headers and footers are discarded during conversion.

-logging

Boolean “true” or “false”

If set to “True” additional information is output to the console during conversion. This information is of limited interest to users and should only be used when instructed by a representative from Datalogics.

Defaults to False. Only critical messages will be written to the console during execution.

-mergeSpan

Boolean “true” or “false”

If set to “True,” the product discards HTML style values when exporting content to XML output files. Examples include references to fonts, indented text, and color. Any <span> tags needed to define where these style values apply are also removed. The result is cleaner text in the output file that is easier to parse for content.

Defaults to False. HTML style information and <span> tags will be preserved.

-ocrLanguage

String, “deu” / “eng” / “fra” / “ita” / “nld” / “por” / “spa”

If you are using Optical Character Recognition (OCR) to pull text from images in an input PDF document (see -ocrMode), the OCR utility defaults to English language text.  But the OCR utility supports other languages. Use this parameter to select German (deu), French (fra), Italian (ita), Dutch (nld), Portuguese (por) or Spanish (spa) if your input files use one of these languages instead. Note you can only select one language at a time for this option.

Default: eng (English)

-ocrMode

String, “off” / “tag” / “replace”

By default, PDF Alchemist passes images in PDF files through to its output without looking for text in these images.

If the -ocrMode option is set to “tag,” PDF Alchemist uses optical character recognition (OCR) to scan images when converting PDF files.  Any text found within an image is embedded in the image reference alt attributes.

If the -ocrMode option is set to “replace,” the OCR feature is turned on and the OCR text replaces the original image in the output file. The process creates selectable text in the HTML or XML output, and the source image is removed. The text is also tagged as OCR text within the export file. This allows the person reviewing the output file to know where the text came from, and it also serves as a warning, as the OCR text might not be rendered perfectly. For an HTML or EPUB output file, any text generated from an image in the PDF input file using OCR is marked with this tag:

data-ocr-text="true"

For an XML file, the tag looks like this:

ocr-text="true"

And for a JSON file, it looks like this:

"ocr-text": "true"

Note: OCR processing can lead to substantial increases in processing time and should be enabled only when desired.

Default: off – No OCR processing is performed during conversion.
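The OCR marker described above makes it possible to pick out OCR-derived text when reviewing HTML output. The Python sketch below uses the standard html.parser module to collect the text inside any element that carries data-ocr-text="true". The sample markup, including which element holds the attribute, is an assumption for illustration, and the names OcrTextFinder and find_ocr_text are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag; they must not change nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class OcrTextFinder(HTMLParser):
    """Collect the text inside elements marked data-ocr-text="true"."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # greater than 0 while inside an OCR-tagged element
        self.ocr_text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("data-ocr-text") == "true":
            self._depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth and data.strip():
            self.ocr_text.append(data.strip())


def find_ocr_text(html):
    """Return the text fragments found inside OCR-tagged elements."""
    finder = OcrTextFinder()
    finder.feed(html)
    finder.close()
    return finder.ocr_text


# Hypothetical output fragment: one native paragraph, one OCR paragraph.
sample = '<p>Native text</p><p data-ocr-text="true">Text <b>read</b> by OCR</p>'
print(" ".join(find_ocr_text(sample)))
```

A reviewer could run a check like this over converted files to list the passages that came from OCR and may need proofreading.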

-outputFormat

String “html” / “epub” / “xml” / “json”

Specify the type of output file: HTML, XML, EPUB, or JSON.

Defaults to “html” to generate HTML output.

-outputFilename

String, full name of output file

Create a custom name for your export file.

Default: Name of the PDF input file

For XML output, the file name defaults to “exportedXML.xml”

-pageRanges

Comma-separated list of whole numbers, with dashes

By default, PDF Alchemist converts every page in a PDF document to HTML, XML, JSON or EPUB output. But you can use this parameter to select a specific set or range of pages for processing. The rest of the pages in the document are ignored. You may also include multiple page ranges in the statement. Each page range is described with two page numbers separated by a hyphen:

-pageRanges 1-4

Process the first four pages and discard the rest.

-pageRanges 1,3-5,7,9,14-18

Process pages 1, 3, 4, 5, 7, 9, and 14-18.

-pageRanges 22-

Start at page 22 and process all of the pages to follow to the end of the document.

Default: All pages in document processed.

-purpose

String “indexing” or “balanced”

The type of output you generate using PDF Alchemist can vary depending on your goals. You may want to extract the content from a PDF document so that you can index the content as text for later searching. In that case it may not be important to you what the content looks like after the conversion process is complete. Or you may prefer instead to convert the content in a PDF document to HTML but preserve as much of the layout and appearance of the original PDF as possible.

PDF Alchemist supports two modes for the purpose option in the command-line interface:

indexing. PDF Alchemist will not rasterize any text. This preserves text for searching and indexing workflows. But the appearance of the output might differ significantly from the appearance of the input PDF.

balanced. The product creates output targeted to general purpose workflows. In this case the need for searchable output is balanced with the need for visually correct output. Text may be rasterized when required to preserve the visual appearance and ordering of text in relation to other elements.

Defaults to “balanced” to preserve original appearance if output is to HTML or EPUB.
Defaults to “indexing” if output is to XML or JSON.

-reflowForms

Boolean “true” or “false”

PDF Alchemist emits PDF forms documents as fixed-layout HTML forms by default. If this parameter is set to True, PDF Alchemist will instead convert PDF forms to reflowable HTML forms. This parameter is currently experimental. The -reflowForms option may generate unwanted results for forms where the appearance of the form is a mix of PDF page elements and PDF form elements.

Defaults to False to emit fixed-layout HTML forms.

-reflowText

Boolean, “true” / “false”

By default, PDF Alchemist reflows all of the text found in a PDF source document. Set this option to false to turn reflow off and add a line break at the end of every line of text.

Defaults to true to reflow text in the document

-removeHyphen

Boolean “true” or “false”

By default PDF Alchemist leaves in place hyphens that end lines in the PDF document. These hyphens are commonly used to divide words at syllables but can also be used for phrases. Set the option “-removeHyphen” to True if you want the product to remove these hyphens in the output file.

Note that this algorithm does not interpret the text involved. It simply removes hyphens wherever they appear at the end of a line. Therefore, enabling this option will cause hyphenated phrases that span lines to be combined.

Defaults to False. Do not remove trailing hyphens.
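The effect of removing trailing hyphens without interpreting the text can be illustrated with a short Python sketch. This is not the product's algorithm. The function join_hyphenated_lines is a hypothetical stand-in that drops any hyphen found at the end of a line and joins the line directly to the next one.

```python
def join_hyphenated_lines(lines):
    """Join lines, dropping a hyphen that ends a line (no dictionary check)."""
    pieces = []
    for line in lines:
        line = line.rstrip()
        if line.endswith("-"):
            # Drop the hyphen and join directly to the next line.
            pieces.append(line[:-1])
        else:
            pieces.append(line + " ")
    return "".join(pieces).strip()


print(join_hyphenated_lines(["The conver-", "sion is complete."]))
print(join_hyphenated_lines(["a state-of-the-", "art parser"]))
```

The first call restores the divided word and prints "The conversion is complete." The second shows the side effect: the phrase "state-of-the-art" loses its last hyphen and comes out as "state-of-theart".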

-removeInvisibleText

Boolean “true” or “false”

By default, PDF Alchemist exports all of the text found in every layer of a PDF document to an output file, and in its native color, usually black.  If a PDF document has white text, this text might not be visible against a white background in an HTML export file. This option allows you to discard this white text so that it is not included in the export file.

Note that PDF Alchemist can also convert all of the text found in a PDF document to black text for export. See the -blackText parameter.

Defaults to False. All text found in every layer in a PDF document is exported to the output file.

-singleFile

Boolean “true” or “false”

By default PDF Alchemist outputs separate HTML and CSS files. If set to “True” the PDF to HTML conversion will instead generate a single HTML file with the CSS style information included within that HTML file. Fonts and images will still be emitted as separate files in their respective fonts and images folders.

Defaults to False. Emit separate files for HTML and CSS information.

-splitByBookmarkDepth

Positive integer

By default, PDF Alchemist generates output content in one section that contains all of the generated output.

But you can direct the product to generate output in sections based on the bookmarks (table of contents entries) found in the source PDF document. The number you enter for this setting determines the bookmark levels used to split the output. If you enter “1,” the output file will break the content into sections based on the first level of the table of contents (Heading 1, for example). If you enter “2,” the output will break into sections at two levels (Heading 1 and Heading 2). Enter “3” for three section levels (Heading 1, Heading 2, and Heading 3).

For HTML defaults to 0, do not split output.
For EPUB defaults to 1, split output on highest level bookmarks.
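The relationship between the depth value and the resulting sections can be sketched in Python. The outline below is a hypothetical list of (level, title) pairs, with level 1 for top-level bookmarks, and section_starts is an illustrative helper rather than part of PDF Alchemist. It assumes that a new section begins at every bookmark whose level is no deeper than the value entered.

```python
def section_starts(bookmarks, depth):
    """Return titles of bookmarks that would begin a new section.

    bookmarks is a hypothetical list of (level, title) pairs in document
    order. A depth of 0 means the output is not split.
    """
    return [title for level, title in bookmarks if level <= depth]


# Hypothetical outline of a source PDF document.
outline = [
    (1, "Chapter 1"),
    (2, "1.1 Setup"),
    (3, "1.1.1 Windows"),
    (2, "1.2 Usage"),
    (1, "Chapter 2"),
]

for depth in (0, 1, 2):
    print(depth, section_starts(outline, depth))
```

With this outline, a depth of 1 starts sections only at the two chapters, while a depth of 2 also starts sections at "1.1 Setup" and "1.2 Usage".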

-splitByEveryNumberofPages

Positive integer

By default, PDF Alchemist generates output content in one section that contains all of the generated output. Enter a number for this value if you want to split the content into sections based on the page breaks found in the original PDF input document. For example, if you enter “3,” PDF Alchemist takes the first three pages from the PDF input file and puts them into a single output file. Then it takes the next three pages from the input PDF document and places them into a second HTML output file. The product continues to create new sections for the output content (output HTML files) until all of the content in the input PDF document is converted.

Defaults to 0, do not split output.
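The grouping described above can be sketched in Python as follows. The function split_every_n_pages is a hypothetical helper. It assumes a document with at least one page and assumes that the final section simply holds whatever pages remain when the page count is not an exact multiple of the value entered.

```python
def split_every_n_pages(page_count, n):
    """Return (first_page, last_page) pairs, one per output section."""
    if n <= 0:
        # 0 is the documented default: do not split the output.
        return [(1, page_count)]
    return [
        (start, min(start + n - 1, page_count))
        for start in range(1, page_count + 1, n)
    ]


print(split_every_n_pages(8, 3))
print(split_every_n_pages(8, 0))
```

For an 8-page document and a value of 3, this gives three sections covering pages 1-3, 4-6, and 7-8.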

-stylesheetPath

String, name and path of style sheet file

You may want to provide a custom name for the stylesheet (CSS) file PDF Alchemist creates when converting a PDF document to an HTML export file. This could be useful if you are processing hundreds of files for multiple clients and want to keep them distinct. Enter a value to assign a path and file name to the stylesheet file. The directory is created relative to the output directory. For example, you could name the style sheet file “jonesMFG.css” in the directory “/style”:

-stylesheetPath style/jonesMFG.css

You can also specify an absolute path. If you use an absolute path, the product must be able to form a relative path from the HTML file to the style sheet file:

-stylesheetPath d:\pdfalchemist\output\style\jonesMFG.css

Defaults to stylesheet.css

-tableBorders

String, “always”/ “never”/ “detect”

By default PDF Alchemist identifies tables in a source PDF document and then formats those tables in the output file to match. If a table in the source PDF document has borders, it will appear with borders in the output file. If the original table does not have borders, the matching table in the HTML output file will not have borders.

This option allows you to format table borders in the output files however you like. You can export the tables in the PDF document so that all of them have borders, or so that none of them have borders.

  • always. Always add borders to tables in the export file. Tables that do not have borders in the source PDF document will have borders added.
  • never. Never add borders to tables in the export file. If a table in the source PDF document has borders, the borders will be removed.
  • detect. Export all tables as they appear in the source document, with borders or without borders.

Defaults to detect. PDF Alchemist by default emits tables to the export file to match the tables that appear in the source PDF document. Tables with borders will appear with borders, tables without borders will not have borders in the export file.

-tablesOnly

Boolean “true” or “false”

By default PDF Alchemist emits everything it finds in a source PDF document to an export HTML, XML or EPUB file. If you only want to export the content found in tables in the PDF file, you could use this option. The product will send the content that it identifies as tables to the export file and ignore everything else in the document, including standard text, images, and graphics.

Defaults to False. PDF Alchemist emits all content found in the PDF document to the export file.

-xsltStylesheetPath

Full path and file name for XSLT stylesheet file

Provide the file name and path for an XSLT stylesheet. The stylesheet will be used to transform the XML output file. XSLT (Extensible Stylesheet Language Transformations) is a language for transforming XML documents, or for converting XML files into other file formats. PDF Alchemist supports XSLT language up to version 1.1.

If you use -xsltStylesheetPath in creating output, you also need to define XML as the output file type with -outputFormat. The transformation you define in your XSLT stylesheet will be applied to PDF Alchemist’s XML output, and the result will be saved by default as a .txt file. You may override this default by specifying a custom file name and file extension using the -outputFilename option.

Default: None

Note: PDF Alchemist provides a sample XSLT stylesheet called xmltocsv-tables.xslt that demonstrates how to convert a PDF document into a CSV export file.
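When options are assembled by a script, the dependency between -xsltStylesheetPath and -outputFormat can be checked before the conversion runs. The Python sketch below is a hypothetical helper, not part of PDF Alchemist. It assumes that every option is passed as an option name followed by its value, as in the examples in this section, and that Boolean options take the literal values "true" and "false". The executable name and the input file argument are not described here, so they are left out. The output file name tables.csv is an assumed example.

```python
def build_options(options):
    """Turn a dict of option names and values into a flat argument list.

    Hypothetical helper: it only formats options as "-name value" pairs
    and checks one documented rule, that -xsltStylesheetPath requires
    -outputFormat xml.
    """
    if "xsltStylesheetPath" in options and options.get("outputFormat") != "xml":
        raise ValueError("-xsltStylesheetPath requires -outputFormat xml")
    args = []
    for name, value in options.items():
        if isinstance(value, bool):
            # Assumption: Booleans are written as "true" or "false".
            value = "true" if value else "false"
        args += ["-" + name, str(value)]
    return args


print(build_options({
    "outputFormat": "xml",
    "xsltStylesheetPath": "xmltocsv-tables.xslt",
    "outputFilename": "tables.csv",
    "tablesOnly": True,
}))
```

Naming the output file explicitly with -outputFilename avoids the default .txt extension for the transformed result.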