PDF Alchemist

Release Notes, PDF Alchemist

Version 2.3.9 (Monday June 3, 2019)

  • Added a command line statement, “-blackText.” A user can add this option to force all text emitted from a source PDF document to an HTML file to be rendered in black. Normally PDF text will be black anyway, but this option can be used in case a source document features white text on a dark background.
  • Added a command line statement, “-enableInfographicDetection.” It defaults to True, so by default the system detects infographic content, such as charts and diagrams, and renders it as separate export image files. Because this process can interfere with how PDF Alchemist formats tables and hyperlinks, the user can add this statement to a command to turn the feature off.
  • Added a set of PDF sample files for use in testing PDF Alchemist. When processed, these sample files demonstrate how the product manages large blocks of text and tables with and without borders, and how the OCR tool converts text in images, including text in tables.
  • Corrected a pair of issues related to tables, where text from tables with borders was duplicated, and where borders disappeared when a table was split across several pages in an HTML output file.

Version 2.3.8 (Friday May 3, 2019)

  • Corrected an issue in which complex tables composed of cells with many varied widths would not be preserved.
  • Improved detection of borderless tables containing column headers with multiple lines of text and cells with alignment differences.
  • Improved detection of borderless tables containing a single row that is shorter or longer than other rows.
  • Corrected a problem in which large images were incorrectly determined to be part of headers and were removed when keepHeaderFooter was not set.
  • Improved consistency of paragraph line spacing detection by updating the default property for each paragraph to match the previous paragraph.
  • Corrected a problem where some graphics were not completely exported if they had white filled areas.

Version 2.3.5 (April 16, 2019)

  • Completed a series of changes to how the software manages the licensing process, including how the license file is stored and accessed and how the system responds when the license expires.
  • Applied a variety of changes to make sure that the output from PDF Alchemist is deterministic, meaning that the font and the format are the same each time the system is run. This involves edits to the stylesheet and to how an EPUB output file is structured.
  • Added enableCaptions option to allow the software to look for caption text after images in PDF documents.
  • PDF Link annotations are now sent to more precise locations in HTML export files, as close as possible to the linked-to content.

Version 2.3.4 (January 29, 2019)

  • Improvements to table output, as well as improvements to the software’s ability to detect rows and columns in tables.
  • When processing images, the OCR utility does not remove images if no text is found in those images. Text attributes are not created.
  • All image types supported by PDF Alchemist are also supported for OCR processing, and all images are normalized to the same resolution.
  • Corrected a problem with generating EPUB output files on Linux platforms.

Version 2.3.1 (January 15, 2019)

  • PDF Alchemist now offers Optical Character Recognition (OCR) technology, allowing the software to scan images within PDF files and extract text found in those images. Users can select an option to add this OCR text to the export file as a supplement to the images, or they can replace the images with the recovered text. By default, the OCR tool supports English language text, but users can also select Dutch, French, German, Italian, Portuguese, or Spanish.
  • A new parameter, “ocrMode,” has been added to the command line and to the API. This allows a user to turn the OCR function on (OCR is disabled by default) and either add text extracted from an image to an export file or use that text to replace the image in the export file.
  • A new parameter, “ocrLanguage,” has been added to the command line and to the API. This allows a user to select the language to use for OCR processing when OCR is enabled.
  • The ability to reconstruct tables in HTML or XML output files has been significantly improved. This includes creating borderless tables and cells, better alignment of text and images in tables, and improved address blocks.
  • PDF Alchemist now honors “ActualText” span entries for text spans that contain this PDF tagging. If ActualText entries appear, these entries will override information found in the font ToUnicode tables or any font encoding information.
  • Fixed a problem where PDF Alchemist had been too aggressive in identifying images in PDF documents as background images and then removing those images.

Known Issues

  • The “replace” setting for ocrMode does not currently support images stored as JBIG2 or JPEG2000 images within the input PDF data stream. Contents from these images will be discarded. These image formats are supported using the “tag” setting for ocrMode, and the text generated using OCR is properly rendered.
  • The “replace” setting for ocrMode replaces images that do not have any OCR-detectable text with blank regions in its output, so these images disappear. Use the “tag” setting for ocrMode to scan images with OCR when you prefer to keep the images in the output.

Version 2.2.3 (October 3, 2018)

  • Reduced excessive spacing in rows in tables. After exporting table content from PDF documents to HTML files, the browser tended to create large spaces around the data in table cells when the paragraph tag, <p>, appeared in those cells. Added settings in the style sheets of 0 points for the top and bottom margins for each paragraph tag (<p>) in a table, eliminating these unnecessary spaces. As a result, the tables that now appear in HTML output feature spacing that matches the spacing seen in the original PDF document.
  • Improved table detection in cases where column headings in tables were center or left justified and subsequent rows were right justified. This allows the headings to align properly with the rows below them in the HTML output content.

Version 2.2.0 (August 22, 2018)

  • PDF Alchemist can now convert PDF documents into XML files.
  • When exporting a table from a PDF document, the total counts of rows and columns in that table are added to the <table> element to make analysis of the export process more efficient.
  • Corrected a problem with detecting and providing an appropriate font for text when exporting to XML. The system had been sending text to an XML file and replacing the original serif font, such as Times New Roman, with a sans-serif font, like Arial.
  • Improved table detection in cases where column headings did not align with the text below.
  • When generating XML output, the “-purpose” tag now defaults to “indexing.” In this case the text copied to the XML file is not formatted specifically for layout and appearance, but rather to make it easier to index and search. The user can manually change this setting to “balanced” instead.
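
The default described in the last item above can be summarized in a short Python sketch. It only combines two statements from these notes: XML output defaults to “indexing,” and the Version 2.1.3 notes give “balanced” as the general default. The function name and the format labels ("xml", "html", "epub") are hypothetical and are not part of the PDF Alchemist API.

```python
from typing import Optional

# The two purpose values named in these release notes.
VALID_PURPOSES = ("balanced", "indexing")


def effective_purpose(output_format: str, purpose: Optional[str] = None) -> str:
    """Return the purpose that applies when the user may have omitted it.

    Hypothetical helper for illustration only; the format labels are assumptions.
    """
    if purpose is not None:
        if purpose not in VALID_PURPOSES:
            raise ValueError(f"unknown purpose: {purpose!r}")
        return purpose
    # XML output defaults to "indexing"; other formats keep the "balanced" default.
    return "indexing" if output_format.lower() == "xml" else "balanced"


print(effective_purpose("xml"))               # indexing
print(effective_purpose("html"))              # balanced
print(effective_purpose("xml", "balanced"))   # balanced
```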

Version 2.1.7 (September 18, 2017)

  • Corrected a memory corruption issue that caused applications to crash when calling PDF Alchemist from .NET.

Version 2.1.3 (August 2, 2016)

  • When a customer extracts the PDF Alchemist software package during installation, the process now creates a folder for the software files whose name includes the version number of the PDF Alchemist release.
  • The documentation now describes the return values provided from the processPDF function.
  • Improved separation of table rows when text is indented in the first cell of a row.
  • Added control of hyphenation removal. Previous versions removed hyphens by default. Hyphens are now preserved by default, and you can use the removeHyphen parameter to enable removing them. Note that hyphenation removal is performed by a simple algorithm without language or dictionary support, so it is a good idea to run a few sample files through and then decide whether or not you need to use the removeHyphen parameter.
  • Added the “purpose” option to control the handling of element processing for different workflows. Defaults to “balanced.” Use “indexing” to optimize PDF Alchemist output for indexing/search applications. When “indexing” is specified as the purpose, preserving text as text is more important than preserving the appearance of the original PDF document when overlapping or transparent elements are involved.
  • Updates to improve detection and output of multi-column sub layouts on a page where the page is predominantly single-column.
  • Updates to better break lines when visual layout indicates that adjacent lines are different lines, instead of continued lines of a paragraph.
  • Resolves issue with order of images being written in output in cases where input might be in multiple columns.
  • Resolves issue with improper handling of some non-letter characters in unembedded fonts that do not specify an encoding.
  • Resolves issue with association of text to proper image in multi-column image and caption layouts.
  • Resolves rendering issue with PDFs having image masks that specify their bits per component value.
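
The hyphenation item above recommends testing sample files because removal works without dictionary support. The Python sketch below shows why that matters. It is an assumed, minimal stand-in for "a simple algorithm" and not the algorithm PDF Alchemist uses; the function name is hypothetical.

```python
import re


def remove_line_end_hyphens(text: str) -> str:
    """Join words split by a hyphen at a line break, with no dictionary check.

    Illustrative only: this is not PDF Alchemist's implementation.
    """
    # A hyphen that follows a letter and precedes a newline plus a lowercase
    # letter is treated as a layout hyphen and dropped together with the newline.
    return re.sub(r"(?<=[A-Za-z])-\n(?=[a-z])", "", text)


sample = "The docu-\nment describes a well-\nknown format."
print(remove_line_end_hyphens(sample))
# The document describes a wellknown format.
```

The first join is correct ("document"), but the second loses a hyphen that belonged to the word ("well-known"). A rule like this cannot tell the two cases apart without a word list, which is the kind of result worth checking for in your own sample files before enabling removeHyphen.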

Updates that should not impact output text or visual appearance:

  • EPUB output now defaults to writing new sections or chapters for every highest-level bookmark or Table of Contents (ToC) entry in the input PDF.
  • Resolves issues with EPUB output validation.
  • Resolved a problem with an erroneous member property, “width,” that had been written out for “col” members.
  • Resolves issue with EPUB output not having bookmarks/table of contents written when requested.
  • Resolves issue with invalid output being generated when splitting on every page, and when a page ended with a list item.
  • Improves error handling in cases where invalid PDF files are supplied for processing.

Version 2.0.3 (December 22, 2015)

  • Resolved line-breaking issue. Some lines that ended in numbers were being combined with the lines that followed, and the product was converting the lines into table-of-contents-style lists.
  • Resolved alignment issue for graphics, where some graphics that were right-aligned in the PDF input file were not right-aligned in output.

Version 2.0.2 (December 7, 2015)

  • Fixed issue where some images were left out of the output file because the software incorrectly associated them with page headers and footers.
  • Additional checking for line groups and associated splitting heuristics added for lines that end with delimiter symbols, including ‘]’, ‘)’, ‘}’, or ‘>’.
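
As a small illustration of the last item above, the Python sketch below tests whether a line ends with one of the four delimiter symbols listed. These notes do not describe what the splitting heuristics do once such a line is found, so the sketch covers only the detection step. The function name is hypothetical.

```python
# The closing delimiters named in the release note.
CLOSING_DELIMITERS = ("]", ")", "}", ">")


def ends_with_closing_delimiter(line: str) -> bool:
    """Return True if the line, ignoring trailing whitespace, ends with ], ), } or >."""
    return line.rstrip().endswith(CLOSING_DELIMITERS)


print(ends_with_closing_delimiter("see the summary (Table 3)"))  # True
print(ends_with_closing_delimiter("a plain line of text"))       # False
```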

Version 2.0.1 (November 24, 2015)

  • Fixed issue where text was left out of the output file because the software incorrectly associated that text with page headers and footers.
  • Fixed issue where the software was incorrectly formatting lists of content as borderless tables.
  • Fixed issue with table column positions, in which cells in a column were improperly centered or otherwise out of place.
  • Fixed issue with paragraph breaking when lines only contain one word per line.

Version 2.0.0 (October 20, 2015)

  • PDF Alchemist now supports generating EPUB output files.
  • The product supports dividing the source content into multiple output files for HTML or chapters for EPUB files. The content is split based on bookmarks found in the PDF document.
  • Support also added for dividing source content into multiple output files for HTML or chapters for EPUB files, with the content split by a pre-determined number of pages. For example, a document could be set to divide into a separate HTML output file every ten pages of the original PDF document.
  • Fixed problem in generating tables that caused multiple input cells to be merged into a single output cell.
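
The page-count split described above, with its ten-page example, comes down to simple range arithmetic. The Python sketch below shows which source pages would land in each output file. It is an illustration only: the function name is hypothetical, and it does not call PDF Alchemist or reflect how the product computes the split internally.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, pages_per_file: int) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges, one per output file."""
    if total_pages < 0 or pages_per_file < 1:
        raise ValueError("total_pages must be >= 0 and pages_per_file must be >= 1")
    return [
        (start, min(start + pages_per_file - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_file)
    ]


# A 25-page document split every ten pages yields three output files.
print(page_chunks(25, 10))  # [(1, 10), (11, 20), (21, 25)]
```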

Version 1.2 (October 9, 2015)

  • The system can now detect a PDF table of contents (outlines / “bookmarks”) and convert it into a Table of Contents view in a separate frame in an HTML file. Output is generated with the Bookmark View of the HTML as a separate file from the standard output, so that users can distribute the same output both with a bookmarks pane and without one.
  • Allows links to be generated in standard HTML formatting. The product can override formatting in the source PDF. This allows users, for example, to make invisible PDF links visible in the HTML output.
  • Image generation improvements & control over generated image resolution.
  • A variety of improvements and fixes, including improvements related to lists, tables, and form generation.
  • Support for converting AcroForms to reflowable HTML forms has improved, but the feature is still under development.

Version 1.1, Initial Production Release (September 10, 2015)

PDF Alchemist is available for Windows 64, Windows 32, Linux 64, and MacOS 64 platforms, and provides both Command Line access and API access.

The product accepts one PDF file as input and generates a zip file, containing a folder with a single HTML file (page1.html), a stylesheet (stylesheet.css), a fonts folder, and an images folder.
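
The output layout described above can be checked with Python's standard zipfile module. The sketch below is a hypothetical helper, not part of PDF Alchemist. It assumes every entry sits under a single top-level folder, and the folder name "output" and the image file name in the demonstration are invented for illustration.

```python
import io
import zipfile
from typing import List

# Entry names taken from the description above.
EXPECTED_FILES = ("page1.html", "stylesheet.css")
EXPECTED_DIRS = ("fonts", "images")


def missing_entries(zip_bytes: bytes) -> List[str]:
    """Return expected entries not found under the archive's top-level folder."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
    # Drop the top-level folder name from each entry path.
    relative = [name.split("/", 1)[1] for name in names if "/" in name]
    missing = [f for f in EXPECTED_FILES if f not in relative]
    for folder in EXPECTED_DIRS:
        if not any(entry.startswith(folder + "/") for entry in relative):
            missing.append(folder + "/")
    return missing


# Build a small archive in memory that deliberately omits the fonts folder.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("output/page1.html", "<html></html>")
    demo.writestr("output/stylesheet.css", "")
    demo.writestr("output/images/img1.png", b"")

print(missing_entries(buffer.getvalue()))  # ['fonts/']
```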

The software positions images and text in the output to match their placement in the PDF source document. The images are extracted from the PDF source document and referenced using inline tags in the HTML output.

It also formats tables in the PDF document as HTML tables to match the original, and detects page background images so that they also appear in the output as background images.

AcroForms are detected and converted into HTML forms.

  • Paragraphs that span across pages are combined into a single continuous text flow.
  • Multi-column text in the input is converted into single column reflowable and resizable text in output.
  • Running headers and footers are removed.
  • Extracted fonts are referenced in the stylesheet when possible.
  • Font styles, such as bold, italic, and underline, are preserved from the input document to the HTML export.
  • Text indentation is detected and preserved.
  • Text justification (right, center, left) is preserved.
  • Text flow margins are detected and preserved.
  • Numbered and bulleted lists are detected and preserved.
  • Internal links/references are detected and converted to HTML anchors and references in output.
  • Links to external URL addresses are detected and converted to HTML hyperlinks.