PDF Alchemist

Datalogics PDF Alchemist is a C/C++ SDK for intelligently extracting text and images from PDFs and exporting them to HTML5 or EPUB. It employs sophisticated techniques to identify and reconstruct “text flows” within the PDF. These text flows are often lost when a PDF is created, yet they are vital for repurposing the information locked within it.

One of the advantages of the PDF format is that the appearance of a PDF page remains the same regardless of the hardware or software used to display it; this was the original goal of PDF. Because PDF met that goal, it has been widely adopted, and with wide adoption come a number of problems. PDF Alchemist helps you recover information that may have been lost when a PDF was created. Specifically, it helps recover the structure of a document’s content: for example, that a line of text near the top or bottom of a page is a page artifact, or that a line of text that is larger and centered over other text could indicate a heading followed by a paragraph.

PDF Alchemist analyzes the layout of text in a PDF the way a human reader would, linking related text and paragraphs together, with an option to skip page-specific artifacts such as running headers and footers. The output from PDF Alchemist is saved as reflowable HTML or as an EPUB 3.0 file, and from there it can be used in a variety of ways.
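To make the kind of layout reasoning described above more concrete, here is a minimal sketch in C++ of two such cues: a line near the top or bottom of a page is treated as a page artifact, and a larger, centered line is treated as a possible heading. This is not the PDF Alchemist API and does not describe how the SDK works internally. The `TextLine` struct, the `classify` function, and every threshold are hypothetical names and values chosen purely for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Hypothetical description of one extracted line of text.
// This is NOT a PDF Alchemist type; all fields are assumptions for illustration.
struct TextLine {
    std::string text;
    double top;        // distance from the top of the page to the top of the line, in points
    double height;     // height of the line, in points
    double left;       // left edge of the line, in points
    double right;      // right edge of the line, in points
    double font_size;  // font size, in points
};

enum class LineRole { Artifact, Heading, Body };

// Illustrative thresholds only; real documents would need tuning.
struct Heuristics {
    double margin_fraction = 0.08;   // top/bottom band treated as header/footer area
    double heading_scale = 1.2;      // how much larger than body text a heading must be
    double center_tolerance = 0.05;  // allowed offset from page center, as a fraction of page width
};

// Guess the role of a line from its position and size on the page.
LineRole classify(const TextLine& line,
                  double page_width,
                  double page_height,
                  double body_font_size,
                  const Heuristics& h = Heuristics{}) {
    assert(page_width > 0 && page_height > 0 && body_font_size > 0);

    // Cue 1: a line lying entirely in the top band, or starting in the bottom
    // band, is treated as a running header or footer.
    const double margin = h.margin_fraction * page_height;
    if (line.top + line.height <= margin || line.top >= page_height - margin) {
        return LineRole::Artifact;
    }

    // Cue 2: a line that is both larger than body text and centered on the
    // page is treated as a possible heading.
    const double line_center = (line.left + line.right) / 2.0;
    const bool centered =
        std::abs(line_center - page_width / 2.0) <= h.center_tolerance * page_width;
    const bool larger = line.font_size >= h.heading_scale * body_font_size;
    if (centered && larger) {
        return LineRole::Heading;
    }

    return LineRole::Body;
}
```

On a US Letter page (612 by 792 points) with 11-point body text, a 9-point line placed 20 points from the top would be classified as an artifact under these assumed thresholds, while an 18-point line centered on the page 100 points down would be classified as a heading. A real analysis would need many more cues, such as repetition across pages, column detection, and reading order.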

Sometimes, such as when you want to review the page layout of a PDF flyer or brochure, page fidelity is important. In other situations, accessing the content of a PDF is what matters instead. Consider:

  • An account manager on the road needs to review the updated terms and conditions paragraph in a contract before sending it to a prospective customer. She really just needs to check the language of that one specific paragraph, and she’s viewing it on her phone. Reflowable, resizable text would make it easier for her to find and approve that paragraph than paging through a PDF file, zooming and panning to reach the text in question.
  • An executive receives a financial report as a PDF document on his phone, while on a flight to New York from Los Angeles. He wants to scroll down to the “bottom line” numbers on the last page. Doing this in HTML, where text is reflowable and resizable, is much easier than in PDF.
  • A workshop wants to distribute learning materials, currently in PDF format, to an audience of learners who expect to read this material on their phones. Converting the PDF into EPUB format eBooks allows each of the students to read the content comfortably, without the panning and zooming typically required to view PDFs on mobile devices.