Using PDF Alchemist
PDF Alchemist analyzes the placement of characters, words, lines, and graphical elements in the source PDF and uses advanced heuristics to reconstruct sentences and paragraphs as a person would read them. The resulting HTML document reflows text to fit browser windows of different sizes and merges the contents of multiple pages into a single continuous display. At the user's request, PDF Alchemist can also merge multiple columns on PDF pages into a single column and allow fonts to be resized in a browser window. Finally, PDF Alchemist inserts images found in text flows into the HTML output file as inline image references, and can capture simple captions provided with images in the source PDF document.
Cleaning up page breaks
PDF Alchemist analyzes PDF files to find and eliminate the artifacts of page breaks. By default the software automatically removes headers and footers, page numbers, and page background images.
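The documentation does not describe how PDF Alchemist detects these artifacts. As a rough illustration of the general idea only, the sketch below assumes each page is already available as a list of text lines and drops first and last lines that repeat across most pages. The function name strip_repeated_lines and the min_share threshold are hypothetical and are not part of the product.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_share=0.6):
    """Drop first/last lines that repeat across pages (digits are ignored)."""

    def normalize(line):
        # Treat digits as interchangeable so "Page 3" and "Page 4" match.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for lines in pages:
        if not lines:
            continue
        # Count each page's first and last line at most once per page.
        for key in {normalize(lines[0]), normalize(lines[-1])}:
            counts[key] += 1

    # A line counts as a header or footer if it appears on enough pages.
    threshold = max(2, min_share * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and normalize(body[0]) in repeated:
            body = body[1:]
        if body and normalize(body[-1]) in repeated:
            body = body[:-1]
        cleaned.append(body)
    return cleaned


pages = [
    ["Annual Report", "Revenue grew in the first quarter.", "Page 1"],
    ["Annual Report", "Costs were flat.", "Page 2"],
    ["Annual Report", "The outlook is stable.", "Page 3"],
]
print(strip_repeated_lines(pages))
```

A real converter works from positioned glyphs and graphics rather than plain lines, so this is a simplification of the problem, not a description of the product's method.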
Extracting fonts embedded in the PDF file
By default, PDF Alchemist extracts any fonts embedded in the source PDF document and saves them to a separate /fonts directory. You can set an optional parameter to turn this feature off and direct PDF Alchemist to convert the fonts used in the PDF document to font reference tags in the output HTML file. The HTML file then uses the fonts installed in the local environment instead. Not exporting font files reduces the size of the output.
Preserving style and layout
When PDF Alchemist converts a PDF document to HTML, the fonts remain the same. PDF Alchemist preserves bold, italic, underlined, colored, shaded, and strikethrough text in the HTML output. PDF Alchemist also preserves:
- Justification of text: right, left, and center
- Margin settings
- Table layouts, including continuing tables across page boundaries
- Links to internal anchors and bookmarks
- Links to external web pages, by converting them to HTML href references. You can choose to keep the link appearance from the original PDF or convert links to the standard HTML link appearance.
- Vector art in the PDF, by converting it to raster images
- Table of Contents (PDF outlines), by converting them to a frame-based view of the converted PDF document
PDF Alchemist can also convert and resample the images in the input PDF document to a common format, color space, and resolution.
Cleanly converting PDF Form documents into HTML forms
PDF Alchemist converts Acrobat PDF forms (AcroForm format) into HTML forms that can be filled out in a browser window or on a mobile device. The product also preserves the order of the form elements by creating a fixed-layout HTML file.
The product preserves the appearance of push button elements when converting to HTML, such as Submit or Reset buttons. The button attributes that are preserved include:
- border color, width, and style
- fill color
- font size, color, name, style, and weight
The following PDF form trigger events are supported and are carried over from a PDF form to the matching HTML form:
- E: mouse enter event. When the pointer enters the field.
- X: mouse leave event. When the pointer exits the field.
- D: mouse down event. When the mouse button is clicked without being released.
- U: mouse up event. When the mouse button is released after a click.
- Fo: receive focus event. Media clips only. The link area receives focus through a mouseover.
- Bl: lose focus (blur) event. Media clips only. The focus moves to a different link area when the mouse is moved away.
- URI (URL or web site access)
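To make the list above concrete, here is a minimal sketch of how these trigger keys could line up with standard HTML event handler attributes. The pairing is an assumption based on the closest HTML equivalents, not the documented output of PDF Alchemist, and the names TRIGGER_TO_HTML_EVENT and button_html are hypothetical. The URI entry is left out because it would more naturally become a link than an event handler.

```python
from html import escape

# PDF trigger keys from the list above, paired with the HTML event
# attributes that most closely match them (an assumed mapping).
TRIGGER_TO_HTML_EVENT = {
    "E": "onmouseenter",
    "X": "onmouseleave",
    "D": "onmousedown",
    "U": "onmouseup",
    "Fo": "onfocus",
    "Bl": "onblur",
}


def button_html(label, actions):
    """Render a push button with a handler for each supported trigger."""
    attrs = []
    for trigger, script in actions.items():
        event = TRIGGER_TO_HTML_EVENT.get(trigger)
        if event is None:
            continue  # skip triggers that are not in the supported list
        attrs.append(f'{event}="{escape(script, quote=True)}"')
    attr_text = (" " + " ".join(attrs)) if attrs else ""
    return f'<button type="button"{attr_text}>{escape(label)}</button>'


# The handler scripts here are placeholders for illustration.
print(button_html("Submit", {"D": "highlight(this)", "Bl": "validate(this)"}))
```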
PDF Alchemist cannot convert digital signature fields or bar code fields into matching fields in an exported HTML file. If a digital signature appears in a PDF document, the product removes the interactive elements of the field but preserves its appearance.
Optical Character Recognition (OCR)
PDF Alchemist provides an OCR tool that can scan graphic images in a PDF document, identify text within those images, and add that text to the output file when the OCR option is enabled. This output text is inserted as an “alt” attribute within the <img> tag that describes the source image file.
It is also possible to use the OCR tool to scan a graphic image in a PDF input file, extract the text from that image, and replace the image in the output HTML or XML file with that text. The OCR text is flagged in the output file to make it easier to identify.
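As an illustration of what recognized text placed in an “alt” attribute might look like, the sketch below builds an <img> tag from an image path and a string of OCR text. The function name img_with_ocr_alt, the file path, and the sample text are hypothetical. The exact markup PDF Alchemist emits may differ.

```python
from html import escape


def img_with_ocr_alt(src, ocr_text):
    """Build an <img> tag whose alt attribute carries recognized text."""
    # Collapse line breaks and runs of spaces so the text fits one attribute.
    alt = " ".join(ocr_text.split())
    # Escape quotes and ampersands so the attribute values stay well-formed.
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt, quote=True)}">'


print(img_with_ocr_alt("images/figure1.png", 'Sales "2019"\nQ1 & Q2'))
```

Escaping matters here because OCR output can contain quotation marks or ampersands that would otherwise break the attribute.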
The OCR utility in PDF Alchemist supports English, Dutch, French, German, Italian, Spanish, and Portuguese.