PDF Alchemist

Using the PDF Alchemist Command Line Utility

Command Line Syntax

To run PDF Alchemist from the command line, type:

  1. the program name
  2. the name of the file in quotes
  3. the name of the output directory where the program will write the output EPUB or HTML and associated files

These are the only required parameters.

In this example, the output directory is called “export” under C:\Alchemist:
C:\Alchemist\pdfalchemist "pathfinder.pdf" c:\alchemist\export
The input PDF file does not need to be in the same directory as the program. As long as you enclose in quotes the path and file name of the input PDF and the path of the export directory, the directory names can contain blank spaces:
C:\Alchemist\pdfalchemist "E:\SJones\PDF Alchemist\PDF Alchemist test doc.PDF" "E:\SJones\PDF Alchemist\Export files"
Add any optional parameters to the end of the command, one after the other, separated by spaces. Also include a space between each parameter name and its value (generally “true”), as in “-keepEmbeddedFont true”.

In the following example, the command converts a PDF document called Pathfinder.PDF to HTML and stores the output in a folder called “export”. It also includes references to fonts in the HTML file, so that the HTML file looks for fonts already available on the local machine. It does not export fonts embedded in the PDF source file and then call those fonts from the export directory (we describe “keepEmbeddedFont” in more detail below).
C:\Alchemist\pdfalchemist "pathfinder.pdf" c:\alchemist\export -keepEmbeddedFont true
Make sure you include a dash in front of each parameter name, as in “-keepEmbeddedFont”.
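
The syntax rules above (program name, input file, output directory, then dash-prefixed parameters each followed by a value) can be sketched in Python for anyone who scripts conversions. This is only an illustration, not part of PDF Alchemist: the helper name build_command and the "E:\My PDFs" path are hypothetical, while the parameter names come from the table later in this section.

```python
def build_command(program, input_pdf, output_dir, options=None):
    """Return an argument list: program, input file, output directory, options."""
    args = [program, input_pdf, output_dir]
    for name, value in (options or {}).items():
        # Every parameter name needs a leading dash and is followed by its value.
        flag = name if name.startswith("-") else "-" + name
        if isinstance(value, bool):
            value = "true" if value else "false"
        args += [flag, str(value)]
    return args


cmd = build_command(
    r"C:\Alchemist\pdfalchemist",
    r"E:\My PDFs\pathfinder.pdf",  # hypothetical path containing a space
    r"C:\Alchemist\export",
    {"outputFormat": "epub", "imageDPI": 300},
)
print(cmd)
# To actually run the conversion (requires the program to be installed):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list lets Python's subprocess module handle the quoting of paths that contain blank spaces, so the quotes described above do not have to be added by hand.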

The default output consists of two files and two folders. The generated HTML file is called page1.html, and the product also provides a cascading style sheet, stylesheet.css.

The folders named /fonts and /images contain extracted and generated fonts and images that are referenced by the output and CSS files. If you convert a PDF form file into an HTML form, PDF Alchemist will also write a style sheet file called AcroForm.css.

If your input PDF contains bookmarks, PDF Alchemist will write a file called bookmarks.html, which holds a set of links from the bookmarks in the PDF document to the corresponding sections in the HTML output file.

When you use PDF Alchemist to generate EPUB output, all of the necessary files will be stored within the EPUB file written to the output directory.
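
A script that post-processes HTML output may want to know which files and folders to expect. The sketch below lists them from the description above. The function name expected_html_outputs is hypothetical, and it assumes that AcroForm.css and bookmarks.html are written directly into the output directory, which the text above does not state.

```python
from pathlib import PureWindowsPath


def expected_html_outputs(output_dir, has_bookmarks=False, is_form=False):
    """List the files and folders described above for an HTML conversion."""
    out = PureWindowsPath(output_dir)
    names = ["page1.html", "stylesheet.css", "fonts", "images"]
    if is_form:
        names.append("AcroForm.css")    # assumed location: top of the output directory
    if has_bookmarks:
        names.append("bookmarks.html")  # assumed location: top of the output directory
    return [str(out / name) for name in names]


for path in expected_html_outputs(r"c:\alchemist\export", has_bookmarks=True):
    print(path)
```

EPUB output is not covered here, because all of its files are stored inside the single EPUB file.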

If you simply type the program name and don’t enter the name of an input file or an output directory, the program will display a summary of the command syntax, with a list of the optional parameters.

We describe these parameters below.

Command Line Parameters

Each entry below gives the option name, followed by its type, its default, and comments.
-cmap
  Type: Path or directory name
  Default: None
  Comments: A folder containing PDF format character code map (CMap) files, used for converting text in character ID (CID) form to Unicode in the HTML output. The folder name must have a trailing slash character. To learn more about CMaps, see the description in Frequently Asked Questions.
-enableBookmarks
  Type: Boolean, “true” or “false”
  Default: True. Bookmarks are exported automatically.
  Comments: The destinations found in the PDF outline in the source document (the table of contents) are exported to the output HTML file as a series of bookmarks. This HTML file contains a frame view of the generated output, with the generated table of contents in one pane and the generated HTML content in a main viewing pane. This feature allows you to create a table of contents in your HTML output file based on the outline provided in the PDF source file. You can turn this feature off by setting the enableBookmarks value to False.
-flattenForms
  Type: Boolean, “true” or “false”
  Default: False. Acrobat PDF forms will be converted to HTML forms.
  Comments: By default, PDF Alchemist converts PDF forms to interactive HTML forms in fixed-layout format. In other words, you can use PDF Alchemist to turn a PDF form document created using Adobe Acrobat into a matching HTML form in which the text fields, check boxes, and other interactive features work. If you set "-flattenForms" to "True", PDF Alchemist will instead convert your Acrobat PDF form into a flattened HTML document. The content will appear, but it will not be interactive. PDF Alchemist will use the normal appearance stream defined for each form element.
-htmlLinkStyleUnspecified
  Type: Boolean, “true” or “false”
  Default: False. The style of any links found in the source PDF document will be preserved.
  Comments: Make hyperlinks appear as standard HTML links. If set to True, the feature ignores any link styles found in the source PDF document and uses the default style of the browser for presenting links.
-imageDPI
  Type: A whole number from 12 to 2400
  Default: 200 (200 DPI)
  Comments: Enter a number to specify the target resolution (in dots per inch) for images in the PDF document that you want to export to rasterized graphic files. If you want to improve the resolution of graphics drawn from the PDF document, you could, for example, set this to 600 or 1000 DPI, and the resulting image will look better in a browser, but the graphic file will also be larger. If you want the exported PNG graphic files to be smaller, you could set the DPI to 72. The permitted range is from 12 DPI to 2400 DPI.
-keepBackground
  Type: Boolean, “true” or “false”
  Default: False. Background images are discarded during conversion.
  Comments: By default, PDF Alchemist discards images that are determined to be background images. If this option is specified as "True", images that PDF Alchemist detects as background images will be retained in the output.
-keepEmbeddedFonts
  Type: Boolean, “true” or “false”
  Default: False. Emit fonts embedded in the PDF as font files so that the HTML file will reference these generated fonts.
  Comments: By default, PDF Alchemist emits fonts that are embedded in the PDF input to font files in the output directory. The product adds references to those fonts in its generated HTML. If this option is specified as "True", PDF Alchemist will not emit these font files. Instead, PDF Alchemist will emit references for the font names found in the PDF file. This will cause fonts installed in the local target processing or viewing environment to be used when the HTML file is opened.
-keepHeaderFooter
  Type: Boolean, “true” or “false”
  Default: False. Headers and footers are discarded during conversion.
  Comments: By default, PDF Alchemist discards page contents that it can determine are page headers or footers. This includes page numbers and titles. If this option is specified as "True", PDF Alchemist will preserve header and footer text in the output.
-logging
  Type: Boolean, “true” or “false”
  Default: False. Only critical messages will be written to the console during execution.
  Comments: If set to "True", additional information is output to the console during conversion. This information is of limited interest to users and should only be used when instructed by a representative from Datalogics.
-outputFormat
  Type: String, "html" or "epub"
  Default: "html", to generate HTML output.
  Comments: Specify the type of output file, either HTML or EPUB.
-purpose
  Type: String, "indexing" or "balanced"
  Default: "balanced", to preserve the original appearance.
  Comments: The type of output you generate using PDF Alchemist can vary depending on your goals. You may want to extract the content from a PDF document so that you can index the content as text for later searching. In that case, it may not be important to you what the content looks like after the conversion process is complete. Or you may prefer instead to convert the content in a PDF document to HTML but preserve as much of the layout and appearance of the original PDF as possible.
  PDF Alchemist supports two modes for the "-purpose" option in the command-line interface:
    indexing. PDF Alchemist will not rasterize any text. This preserves text for searching and indexing workflows, but the appearance of the output might differ significantly from the appearance of the input PDF.
    balanced. The product creates output targeted to general-purpose workflows. In this case, the need for searchable output is balanced with the need for visually correct output. Text may be rasterized when required to preserve the visual appearance and ordering of text in relation to other elements.
-reflowForms
  Type: Boolean, “true” or “false”
  Default: False, to emit fixed-layout HTML forms.
  Comments: PDF Alchemist emits PDF forms documents as fixed-layout HTML forms by default. If this parameter is set to True, PDF Alchemist will instead convert PDF forms to reflowable HTML forms. This parameter is currently experimental; reflowForms may generate unwanted results for forms where the appearance of the form is a mix of PDF page elements and PDF form elements.
-removeHyphen
  Type: Boolean, “true” or “false”
  Default: False. Do not remove trailing hyphens.
  Comments: By default, PDF Alchemist leaves in place hyphens that end lines in the PDF document. These hyphens are commonly used to divide words at syllables, but can also be used for phrases. Set the option “-removeHyphen” to True if you want the product to remove these hyphens in the output file.
  Note that this algorithm does not interpret the text involved. It simply removes these hyphens wherever they appear. Therefore, enabling this option will cause hyphenated phrases that span lines to be combined.
-singleFile
  Type: Boolean, “true” or “false”
  Default: False. Emit separate files for HTML and CSS information.
  Comments: By default, PDF Alchemist outputs separate HTML and CSS files. If set to "True", the PDF to HTML conversion will instead generate a single HTML file with the CSS style information included within that HTML file. Fonts and images will still be emitted as separate files in their respective fonts and images folders.
-splitByBookmarkDepth
  Type: Positive integer
  Default: For HTML, 0 (do not split output). For EPUB, 1 (split output on highest-level bookmarks).
  Comments: With a value of 0, PDF Alchemist will generate output content in one section that contains all of the generated output. But you can direct the product to generate output in sections based on the bookmarks (table of contents entries) found in the source PDF document. The number you enter for this setting determines the bookmark levels used to split the output. If you enter "1", the output will be broken into sections based on the first level of the table of contents (Heading 1, for example). If you enter "2", the output will be broken into sections based on the first two levels (Heading 1 and Heading 2). Enter "3" for three section levels (Heading 1, Heading 2, and Heading 3).
-splitByEveryNumberofPages
  Type: Positive integer
  Default: 0 (do not split output)
  Comments: With a value of 0, PDF Alchemist will generate output content in one section that contains all of the generated output. Enter a number for this value if you want to split the content into sections based on the page breaks found in the original PDF input document. For example, you could enter "3", and PDF Alchemist would take three pages from the PDF input file and put them into a single output file. Then it would take the next three pages from the input PDF document and place them into a second HTML output file. The product would continue to create new sections for the output content (output HTML files) until all of the content in the input PDF document is converted.
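
The two split options can be pictured with a short Python sketch. It only illustrates the behavior described in the comments above and is not PDF Alchemist's own implementation. The function names are hypothetical, and the outline is assumed to be a simple list of (level, title) pairs.

```python
def page_groups(total_pages, pages_per_section):
    """Group pages as -splitByEveryNumberofPages is described to do."""
    if pages_per_section <= 0:
        # A value of 0 means "do not split": one section holds every page.
        return [(1, total_pages)]
    return [
        (first, min(first + pages_per_section - 1, total_pages))
        for first in range(1, total_pages + 1, pages_per_section)
    ]


def section_starts(outline, depth):
    """Return the bookmark titles that would start a new section.

    Assumes a bookmark starts a section when its level is no deeper than
    the requested depth. An empty list means the output is not split.
    """
    return [title for level, title in outline if level <= depth]


print(page_groups(10, 3))   # [(1, 3), (4, 6), (7, 9), (10, 10)]

outline = [(1, "Intro"), (2, "Scope"), (1, "Usage"), (2, "Syntax"), (3, "Flags")]
print(section_starts(outline, 1))   # ['Intro', 'Usage']
print(section_starts(outline, 2))   # ['Intro', 'Scope', 'Usage', 'Syntax']
```

In the first call, a ten-page document split every three pages yields four sections, with the last section holding the single remaining page.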