PDF Alchemist

Product Features

Using PDF Alchemist

Reflowing text

PDF Alchemist analyzes the placement of characters, words, lines, and graphical elements in the source PDF and uses heuristics to reconstruct sentences and paragraphs as a person would read them. The resulting HTML document reflows text to fit browser windows of different sizes and merges the contents of multiple pages into a single continuous display. PDF Alchemist can also merge multiple columns on PDF pages into a single column and allow the user to resize fonts in a browser window. Finally, PDF Alchemist inserts images that appear within text flows into the HTML output file as inline image references, and can capture simple captions provided with images found in a source PDF document.

You can disable this feature by setting the --reflowText command line argument or the reflowText API parameter to false. This will prompt PDF Alchemist to add a line break at the end of every line of text in the source document.
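The difference between the two settings can be pictured with a small Python sketch. This is only a toy illustration of the behavior described above, not PDF Alchemist's actual algorithm; the function name render_lines and its reflow_text parameter are hypothetical.

```python
from html import escape


def render_lines(lines, reflow_text=True):
    """Toy renderer: merge source lines into one paragraph, or keep every line break."""
    cleaned = [escape(line.strip()) for line in lines if line.strip()]
    if reflow_text:
        # Reflow on: lines are joined so the browser can re-wrap the paragraph.
        return "<p>" + " ".join(cleaned) + "</p>"
    # Reflow off: a line break follows every line of the source text.
    return "<p>" + "".join(line + "<br>\n" for line in cleaned) + "</p>"


source_lines = ["PDF text is stored", "as positioned lines,", "not paragraphs."]
print(render_lines(source_lines, reflow_text=True))
print(render_lines(source_lines, reflow_text=False))
```

With reflow enabled the three source lines become one paragraph that wraps to the window width; with reflow disabled each source line keeps its own break.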

Cleaning up page breaks

PDF Alchemist analyzes PDF files to find and eliminate the artifacts of page breaks. By default the software automatically removes headers and footers, page numbers, and page background images.

Extracting fonts embedded in the PDF file

By default, PDF Alchemist extracts any fonts it finds embedded in the source PDF document and saves them to a separate ./fonts directory. You can set an optional parameter to turn this feature off and direct PDF Alchemist to convert the fonts used in the PDF document to font reference tags in the output HTML file. The HTML file then uses the fonts installed in the local environment instead. Not exporting font files reduces the size of the output.

Preserving style and layout

When PDF Alchemist converts a PDF document to HTML, the fonts remain the same. PDF Alchemist preserves bold, italic, underlined, colored, shaded, and strikethrough text in the HTML output. PDF Alchemist also preserves:

  • Justification of text--right, left, and center
  • Indents
  • Margin settings
  • Lists
  • Table layouts, including continuing tables across page boundaries
  • Links to internal anchors and bookmarks
  • Links to external web pages, by converting them to HTML href references. You can choose whether links keep their original appearance from the PDF or take on the standard HTML link appearance.
  • Vector art in the PDF, by converting it to raster images
  • Table of Contents (PDF outlines), by converting them to a frame-based view of the converted PDF document

PDF Alchemist can also convert and resample the images in the input PDF document to a common format, color space, and resolution.

Cleanly converting PDF Form documents into HTML forms

PDF Alchemist converts Acrobat PDF forms (AcroForm format) into HTML forms that can be filled out in a browser window or on a mobile device. The product also preserves the order of the form elements by creating a fixed-layout HTML file.

The product preserves the appearance of push button elements when converting to HTML, such as Submit or Reset buttons. The button attributes that are preserved include:

  • border color, width, and style
  • fill color
  • font size, color, name, style, and weight

The following PDF form Trigger events are supported, and will be converted from a PDF form to a matching HTML form:

  • E: mouse enter event. When the pointer enters the field.
  • X: mouse leave event. When the pointer exits the field.
  • D: mouse down event. When the mouse button is clicked without being released.
  • U: mouse up event. When the mouse button is released after a click.
  • Fo: receive focus event. Media clips only. The link area receives focus through a mouse over.
  • Bl: lose focus (blur) event. Media clips only. The focus moves to a different link area when the mouse is moved away.
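One way to picture the conversion is as a lookup from each PDF trigger key to a browser event. The Python sketch below pairs the keys listed above with standard DOM event names; the pairing and the helper names are assumptions made for illustration, and the HTML that PDF Alchemist actually generates may be structured differently.

```python
from html import escape

# Assumed pairing of PDF trigger keys with standard DOM event names.
PDF_TRIGGER_TO_DOM_EVENT = {
    "E": "mouseenter",   # pointer enters the field
    "X": "mouseleave",   # pointer exits the field
    "D": "mousedown",    # mouse button pressed
    "U": "mouseup",      # mouse button released
    "Fo": "focus",       # field receives focus
    "Bl": "blur",        # field loses focus
}


def dom_event_for(trigger):
    """Return the DOM event name for a PDF trigger key, or None if unsupported."""
    return PDF_TRIGGER_TO_DOM_EVENT.get(trigger)


def handler_attribute(trigger, script):
    """Build an inline HTML event-handler attribute for a supported trigger."""
    event = dom_event_for(trigger)
    if event is None:
        raise ValueError(f"unsupported trigger: {trigger!r}")
    return f'on{event}="{escape(script, quote=True)}"'


print(handler_attribute("U", "submitForm()"))
```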

PDF Alchemist converts most common PDF Form actions into JavaScript actions, including:

  • Submit-form
  • Launch
  • URI (URL or web site access)
  • Hide
  • Print

When possible, the product also converts JavaScript triggered by PDF Form actions into JavaScript actions for the appropriate HTML form element.

PDF Alchemist cannot convert digital signature fields or bar code fields into matching fields in an exported HTML file. If a digital signature appears in a PDF document, the product will remove the interactive elements of the field but preserve the appearance.

Optical Character Recognition (OCR)

PDF Alchemist provides an OCR tool that can scan graphic images in a PDF document, identify text within those images, and add that text to an export file when the OCR option is enabled. This output text is inserted as an "alt" attribute within the <img> tag that describes the source image file.
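Because the recognized text lands in the alt attribute, it can be read back from the HTML output with any HTML parser. The following Python sketch uses only the standard library; the sample markup, file name, and helper names are made up for illustration and are not taken from real PDF Alchemist output.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag that carries alt text."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        alt = attributes.get("alt")
        if alt:  # skip images with a missing or empty alt attribute
            self.alt_texts.append((attributes.get("src"), alt))


def collect_alt_text(html_text):
    collector = AltTextCollector()
    collector.feed(html_text)
    collector.close()
    return collector.alt_texts


# Hypothetical fragment of converted output.
sample = '<p>Figure 1</p><img src="images/img_1.png" alt="Quarterly revenue by region">'
print(collect_alt_text(sample))
```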

It is also possible to scan a graphic image in a PDF input file using the OCR tool, draw text from that image, and then replace that image in the output HTML or XML file with the text found in that image. The OCR text is flagged in the output file to make it easier to identify.

The OCR utility in PDF Alchemist supports English, Dutch, French, German, Italian, Spanish, and Portuguese.

Converting PDF input documents into CSV spreadsheet files or text files

You can also use PDF Alchemist to convert a PDF document into a Comma Separated Values (CSV) file or a plain text file. The CSV export format is useful if your PDF source document has data in tables. The values in the table fields can be copied to cells within a spreadsheet and then opened and displayed using a product like Excel. If you have a PDF document that is mostly text, and you want to export this text to a lightly formatted text (.txt) file, you can select the plain text option.

To use the CSV output, we provide a command line argument and an API parameter.

To use the plain text output, you can also use a command line argument and an API parameter.
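Once a table has been exported as CSV, it can be loaded with any CSV reader. The Python sketch below parses a small hand-written sample with the standard csv module; the sample data and the read_table helper are illustrative assumptions, not actual PDF Alchemist output.

```python
import csv
import io


def read_table(csv_text):
    """Parse CSV text into a list of rows, each row a list of cell strings."""
    return list(csv.reader(io.StringIO(csv_text)))


# Hypothetical table; the quoted cell shows that embedded commas survive parsing.
sample_csv = 'Item,Quantity,Price\n"Widget, large",4,19.99\nGasket,12,0.35\n'

rows = read_table(sample_csv)
header, body = rows[0], rows[1:]
for row in body:
    print(dict(zip(header, row)))
```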

Converting PDF input documents into JSON export files

When you use PDF Alchemist to convert a PDF document to a JSON export file, the product extracts the following types of content from the input PDF document:

  • Title
  • Tables
  • Paragraphs
  • Lists
  • Links

The first dictionary contains a single key-value pair with the title of the input document.

The "content" object holds the entire data set as an array. The array is populated by dictionaries with the following keys:

Key          Type     Usage
"page"       integer  The page number where the data resides in the original document.
"data-type"  string   The type of data within the dictionary. See the "Data Type table" below for the defined values.
"ocr-text"   bool     Set to true when the data was extracted via optical character recognition. This key value is optional and only appears when true.
"data"       array    Holds the data for the given type. The contents of the array change depending on the "data-type".

Data Type table

data-type    "data" array content
"paragraph"  The "data" array contains paragraph data as a single string. If the option --reflowText is used, multiple strings will be used to represent page breaks.
"table"      The "data" array contains a table represented as a set of arrays, each representing a row.
"header"     The "data" array contains header data as a single string. If the option --reflowText is used, multiple strings will be used to represent page breaks.
"footer"     The "data" array contains footer data as a single string. If the option --reflowText is used, multiple strings will be used to represent page breaks.
"list"       The "data" array contains a set of strings, each representing a singular item of the list.

To use the JSON output format, we provide a command line argument and an API parameter.
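As a rough picture of how an application might consume this structure, the Python sketch below loads a hand-written sample and groups the entries by "data-type". The sample is an assumption: it supposes that the file is a top-level array holding a title dictionary followed by a dictionary with the "content" array, and that the title key is named "title". Check a real export file before relying on that layout.

```python
import json

# Hand-written sample; the top-level layout and the "title" key are assumptions.
sample_output = """
[
  {"title": "Quarterly Report"},
  {"content": [
    {"page": 1, "data-type": "header", "data": ["Acme Corp"]},
    {"page": 1, "data-type": "paragraph", "data": ["Revenue grew in every region."]},
    {"page": 2, "data-type": "table", "data": [["Region", "Revenue"], ["North", "120"]]},
    {"page": 2, "data-type": "list", "data": ["First item", "Second item"]},
    {"page": 3, "data-type": "paragraph", "ocr-text": true, "data": ["Scanned note"]}
  ]}
]
"""


def content_entries(document):
    """Return the "content" array from whichever top-level dictionary holds it."""
    holder = next(item for item in document if "content" in item)
    return holder["content"]


def group_by_type(document):
    """Group content dictionaries by their "data-type" value."""
    grouped = {}
    for entry in content_entries(document):
        grouped.setdefault(entry["data-type"], []).append(entry)
    return grouped


def ocr_entries(document):
    """Return entries flagged as OCR text; the key is optional, so default to False."""
    return [e for e in content_entries(document) if e.get("ocr-text", False)]


document = json.loads(sample_output)
grouped = group_by_type(document)
for data_type, entries in grouped.items():
    print(data_type, [entry["page"] for entry in entries])
```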


If the source PDF document contains hyperlinks, these hyperlinks are preserved using Markdown syntax in the JSON output file, at the position where the link originally appears in the data.

Link data may appear within any of the data-types.
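Since links can turn up in any string, a consumer may want to pull them out or strip them. The Python sketch below assumes the links use the common inline Markdown form [text](url); the text above only says "Markdown syntax", so treat the pattern and the helper names as assumptions.

```python
import re

# Assumed inline Markdown link form: [link text](target)
MARKDOWN_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def extract_links(text):
    """Return (link text, target) pairs found in a string."""
    return MARKDOWN_LINK.findall(text)


def strip_links(text):
    """Replace each Markdown link with its visible text only."""
    return MARKDOWN_LINK.sub(r"\1", text)


sample = "See the [user guide](https://example.com/guide) for details."
print(extract_links(sample))
print(strip_links(sample))
```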