PDF Alchemist

Datalogics PDF Alchemist is a C/C++ Software Development Kit (SDK) that extracts text and images from PDF files and exports that content to HTML5, Extensible Markup Language (XML), or EPUB. It identifies and reconstructs the “text flows” within a PDF, meaning the reading order that links related text and paragraphs together. These text flows are often lost in PDF files, yet they are vital for repurposing the information locked within the document.

One advantage of the PDF format, and its original goal, is that a PDF page looks the same regardless of the hardware or software used to display it. PDF has been widely adopted, but that fidelity comes at a cost: a PDF typically records where text is drawn on the page rather than how the content is organized, so structural information is often lost when the PDF is created. PDF Alchemist helps you recover that structure. For example, a line of text near the top or bottom of a page is likely a page artifact (a header or footer), and a line of text that is larger and centered over other text likely indicates a heading followed by a paragraph.
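
The two layout cues just described can be illustrated with a minimal sketch. This is not PDF Alchemist's algorithm or API; the `Line` record, the `classify_line` function, and every threshold below are hypothetical, chosen only to show how position and font size might hint at a line's role.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one line of text on a page (units: points)."""
    text: str
    top: float        # distance from the top edge of the page to the line
    left: float       # x coordinate of the line's left edge
    right: float      # x coordinate of the line's right edge
    font_size: float


def classify_line(line, page_width, page_height, body_font_size,
                  margin_ratio=0.08, center_tolerance=0.05, heading_scale=1.2):
    """Guess a line's role from layout cues alone.

    All thresholds are illustrative assumptions, not values used by any
    real product.
    """
    # Cue 1: text in the top or bottom margin band is treated as a
    # running header or footer (a page artifact).
    margin = page_height * margin_ratio
    if line.top < margin or line.top > page_height - margin:
        return "artifact"

    # Cue 2: text that is noticeably larger than the body text and
    # roughly centered on the page is treated as a heading.
    line_center = (line.left + line.right) / 2
    is_centered = abs(line_center - page_width / 2) <= page_width * center_tolerance
    if is_centered and line.font_size > body_font_size * heading_scale:
        return "heading"

    return "body"


if __name__ == "__main__":
    # A US Letter page is 612 x 792 points; the body text is assumed to be 11 pt.
    sample = [
        Line("Annual Report", top=30, left=72, right=200, font_size=9),
        Line("Summary of Results", top=100, left=206, right=406, font_size=18),
        Line("Revenue grew during the year.", top=140, left=72, right=540, font_size=11),
    ]
    for line in sample:
        print(line.text, "->", classify_line(line, 612, 792, body_font_size=11))
```

A real analysis would need far more context than this, such as whether the same text repeats across pages or how columns are arranged, so the sketch only shows why position and font size are useful starting signals.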

PDF Alchemist analyzes the layout of text in a PDF the way a person would read it, linking related text and paragraphs together and optionally skipping page-specific artifacts such as running headers and footers. The output is saved as reflowable HTML or as an EPUB 3.0 file, and from there can be used in a variety of ways.

Sometimes, such as when you want to review the page layout of a PDF flyer or brochure, page fidelity is important. In other situations, accessing the content of a PDF is what matters instead. Consider:

  • An account manager on the road needs to review the updated terms and conditions paragraph in a contract before sending it to a prospective customer. She only needs to check the language of that one paragraph, and she is viewing it on her phone. Reflowable, resizable text would make it easier for her to find and approve that paragraph than paging through a PDF file and zooming and panning to read it.
  • An executive receives a financial report as a PDF document on his phone while on a flight from Los Angeles to New York. He wants to scroll down to the “bottom line” numbers on the last page. Doing this in HTML, where text is reflowable and resizable, is much easier than in PDF.
  • A workshop wants to distribute learning materials, currently in PDF format, to learners who expect to read the material on their phones. Converting the PDFs into EPUB eBooks lets each learner read the content comfortably, without the panning and zooming typically required to view PDFs on mobile devices.