Using PDF Alchemist
PDF Alchemist analyzes the placement of characters, words, lines and graphical elements in the source PDF and uses advanced heuristics to reconstruct sentences and paragraphs as a person would read them. The resulting HTML document reflows text for different browser windows of various sizes, and merges the contents from multiple pages into a single continuous display. PDF Alchemist can also merge multiple columns on PDF pages into a single column and allow for font resizing in a browser window, at user request. Finally, PDF Alchemist inserts images in text flows into the HTML output file as inline image references, and can capture simple captions provided with images found in a source PDF document.
Cleaning up page breaks
PDF Alchemist analyzes PDF files to find and eliminate the artifacts of page breaks. By default the software automatically removes headers and footers, page numbers, and page background images.
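The idea behind this kind of cleanup can be illustrated with a small sketch. This is not PDF Alchemist's actual algorithm, which is not documented here. It is a simplified heuristic that assumes each page is already available as a list of text lines, and it treats the first or last line of a page as a header or footer when the same line (ignoring digits) recurs on most pages. The function names and the `min_fraction` threshold are hypothetical.

```python
import re
from collections import Counter


def _key(line):
    # Collapse digit runs so "Page 3" and "Page 12" compare as equal.
    return re.sub(r"\d+", "#", line.strip())


def find_repeated_edges(pages, min_fraction=0.8):
    """Return normalized first/last lines that recur on most pages.

    `pages` is a list of pages, each a list of text lines, top to bottom.
    """
    counts = Counter()
    for lines in pages:
        if not lines:
            continue
        # A set keeps a one-line page from counting its only line twice.
        for edge in {_key(lines[0]), _key(lines[-1])}:
            counts[edge] += 1
    threshold = min_fraction * len(pages)
    return {edge for edge, n in counts.items() if n >= threshold}


def strip_repeated_edges(pages, min_fraction=0.8):
    """Drop the first and last line of each page when they look repeated."""
    repeated = find_repeated_edges(pages, min_fraction)
    cleaned = []
    for lines in pages:
        body = list(lines)
        if body and _key(body[0]) in repeated:
            body.pop(0)
        if body and _key(body[-1]) in repeated:
            body.pop()
        cleaned.append(body)
    return cleaned
```

Because the sketch only normalizes Arabic digits, footers numbered with Roman numerals (i, ii, iii) would not match one another and would be left in place.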
Extracting fonts embedded in the PDF file
By default, PDF Alchemist extracts any fonts it finds embedded in the source PDF document and saves them to a separate fonts directory. You can set an optional parameter to turn this feature off and direct PDF Alchemist to convert the fonts used in the PDF document to font reference tags in the output HTML file instead. The HTML file then uses the fonts installed in the local environment. Not exporting font files reduces the size of the output.
Preserving style and layout
When PDF Alchemist converts a PDF document to HTML, the fonts remain the same. PDF Alchemist preserves bold, italic, underlined, colored, shaded and strikethrough text in the HTML output. PDF Alchemist also preserves:
- Justification of text: right, left, and center
- Margin settings
- Table layouts, including continuing tables across page boundaries
- Links to internal anchors and bookmarks
- Links to external web pages, by converting them to HTML href references. You can choose whether links keep their original appearance from the PDF or take the standard HTML link appearance.
- Vector art in the PDF, by converting it to raster images
- Table of Contents (PDF outlines), by converting them to a frame-based view of the converted PDF document
PDF Alchemist can also convert and resample the images in the input PDF document to a common format, color space, and resolution.
Cleanly converting PDF Form documents into HTML forms
PDF Alchemist converts Acrobat PDF forms (AcroForm format) into HTML forms that can be filled out in a browser window or on a mobile device. The product also preserves the order of the form elements by creating a fixed-layout HTML file.
The product preserves the appearance of push button elements when converting to HTML, such as Submit or Reset buttons. The button attributes that are preserved include:
- border color, width, and style
- fill color
- font size, color, name, style, and weight
The following PDF form Trigger events are supported, and will be converted from a PDF form to a matching HTML form:
- E: mouse enter event. When the pointer enters the field.
- X: mouse leave event. When the pointer exits the field.
- D: mouse down event. When the mouse button is clicked without being released.
- U: mouse up event. When the mouse button is released after a click.
- Fo: receive focus event. Media clips only. The link area receives focus through a mouse over.
- Bl: lose focus (blur) event. Media clips only. The focus moves to a different link area when the mouse is moved away.
- URI (URL or web site access)
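As a rough illustration of how these trigger keys line up with browser events, the sketch below pairs each key from the list above with a standard DOM event name. The pairing is an assumption based on conventional usage, not PDF Alchemist's documented output, and the names `TRIGGER_TO_DOM_EVENT` and `dom_events_for` are hypothetical. URI is left out because it describes a link target rather than a pointer or focus event.

```python
# PDF trigger keys (from the list above) paired with DOM event names.
# The DOM names are an assumed, conventional mapping for illustration only.
TRIGGER_TO_DOM_EVENT = {
    "E": "mouseenter",   # pointer enters the field
    "X": "mouseleave",   # pointer exits the field
    "D": "mousedown",    # mouse button pressed
    "U": "mouseup",      # mouse button released
    "Fo": "focus",       # field receives focus
    "Bl": "blur",        # field loses focus
}


def dom_events_for(triggers):
    """Split PDF trigger keys into a supported mapping and an unsupported list."""
    supported = {
        t: TRIGGER_TO_DOM_EVENT[t] for t in triggers if t in TRIGGER_TO_DOM_EVENT
    }
    unsupported = [t for t in triggers if t not in TRIGGER_TO_DOM_EVENT]
    return supported, unsupported
```

For example, calling `dom_events_for(["E", "Bl", "K"])` reports `E` and `Bl` as supported and returns `K` in the unsupported list, since it is not among the triggers listed above.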
PDF Alchemist cannot convert digital signature fields or bar code fields into matching fields in an exported HTML file. If a digital signature appears in a PDF document, the product will remove the interactive elements of the field but preserve its appearance.
Optical Character Recognition (OCR)
PDF Alchemist provides an OCR tool that can scan graphic images in a PDF document, identify text within those images, and add that text to the exported file when the OCR option is enabled. This output text is inserted as an "alt" attribute within the <img> tag that describes the source image file.
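To review what the OCR step recognized, the alt text can be pulled back out of the converted file. The following is a minimal sketch using only the Python standard library. It assumes the output is ordinary HTML with `<img>` tags carrying `src` and `alt` attributes, as described above; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag with non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; self-closing tags also arrive here.
        if tag != "img":
            return
        attr_map = dict(attrs)
        alt = attr_map.get("alt")
        if alt:
            self.found.append((attr_map.get("src"), alt))


def collect_alt_text(html):
    """Return a list of (src, alt) tuples found in an HTML string."""
    parser = AltTextCollector()
    parser.feed(html)
    parser.close()
    return parser.found
```

Images with a missing or empty alt attribute are skipped, so the result lists only the images for which text was recorded.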
It is also possible to scan a graphic image in a PDF input file using the OCR tool, draw text from that image, and then replace that image in the output HTML or XML file with the text found in that image. The OCR text is flagged in the output file to make it easier to identify.
The OCR utility in PDF Alchemist supports English, Dutch, French, German, Italian, Spanish, and Portuguese.
Some issues in using PDF Alchemist
Converting files from a complicated format like PDF to HTML is a difficult process. PDF Alchemist quickly and efficiently produces HTML content, but it has some limits.
Acrobat PDF Forms
- Submit and Reset form actions only support the complete set of form fields in the PDF document. PDF Alchemist does not support predefined subsets of fields for Reset and Submit.
- The product only supports the FDF format for form submission.
- PDF Alchemist only supports the conversion of form files based on the Acrobat PDF Form (AcroForm) standard. PDF Alchemist does not support the conversion of XFA (XML Forms Architecture) forms or files.
Images and Line Art
- When PDF Alchemist finds an image repeated as part of a header or footer it discards the image with the header or footer.
- If PDF Alchemist detects images as page backgrounds in the PDF input document, it will discard them. Sometimes the product will discard valid images that it accidentally detects as backgrounds. You may decide to disable the detection and removal of background images to prevent this, but the HTML output file will be larger as a result.
- The product either converts line art on PDF pages into raster images in the HTML output or removes it from the output, depending on whether it detects the line art as necessary page content or as a page artifact. Converting line art to raster images preserves visual fidelity at some cost to file size.
Lists and Tables
- PDF Alchemist renders every list it finds in the input PDF document as an HTML unordered list (<ul>). The product will take the literal characters found in each list and add them to the unordered list in the HTML output file. That is, if the PDF document has a numbered list, the literal numbers 1, 2, 3, and so on will be copied to the unordered list in the HTML file; for a bulleted list, PDF Alchemist will copy a literal bullet character to the front of each row in the list. This can cause unexpected results when pasting the HTML into programs such as Microsoft Word. In Word, if you use the default list style, the exported list might end up with a duplicate set of bullets. To copy this content for editing, we suggest changing the style of lists to be without bullets, or removing the characters used as bullets in the PDF Alchemist output.
- Tables with cells that span multiple pages are emitted with separate cells, one before and one after the page break.
- PDF Alchemist does not support nested tables, or tables within tables.
- PDF Alchemist does not support lists nested within tables without borders.
- Annotations: PDF annotations that do not have appearance streams will not appear in the output HTML.
- Layers: In selecting layers (optional content) for processing, PDF Alchemist will use the default Optional Content Group (OCG) state. The output will reflect the layers that are visible by default when opening the PDF file.
- Page number removal: PDF Alchemist does not detect page numbers that are written as Roman numerals. These page numbers will not be removed.
- Password-protected PDFs: PDF Alchemist can only convert source PDF documents that are not password protected. The PDF must not have any Digital Rights Management settings or any other security or encryption.
- PDF Patterns and Shadings: PDF Alchemist does not convert highlighting or shading of text and table cells that uses PDF shading or pattern colorspaces. The product writes the affected text and table cells to the output without the original color, shading or pattern.
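For the list behavior described under Lists and Tables above, one way to remove the literal bullet characters or numbers before pasting the HTML elsewhere is a small post-processing pass. The sketch below is an illustration only. It assumes the marker appears as plain text directly after each opening `<li>` tag and is followed by whitespace, which may not match the exact markup PDF Alchemist emits. The function name and the set of marker characters are hypothetical.

```python
import re

# An opening <li> tag (with optional attributes), followed by a leading
# marker: a bullet-like character, or digits followed by "." or ")".
# The (?:\s[^>]*)? part keeps tags such as <link> from matching.
_MARKER = re.compile(r"(<li(?:\s[^>]*)?>)\s*(?:[•◦▪‣*-]|\d+[.)])\s+")


def strip_list_markers(html):
    """Remove literal bullets or numbers at the start of each list item."""
    return _MARKER.sub(r"\1", html)
```

A regular expression is enough for simple output like this, but it will miss markers wrapped in their own element (for example a `<span>`), and an HTML parser would be a safer choice for anything more involved.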