PDF Alchemist

Datalogics PDF Alchemist is a Software Development Kit (SDK) that intelligently extracts text and images from PDF files and exports that content to HTML5, Extensible Markup Language (XML), or EPUB 3.0 files. It identifies and reconstructs how text flows within the PDF so that this structure can be preserved in the output file.

One advantage of the popular PDF format is that a page looks the same regardless of the hardware or software used to display it; this was the original goal of the format. When a PDF is converted to another file format, however, the original document structure is sometimes lost. PDF Alchemist helps you recover this structure during conversion. For example, PDF Alchemist can recognize that a line of text near the top or bottom of a page is a page artifact (a header or footer), or that a larger line of text centered over other text probably indicates a heading followed by a paragraph.

PDF Alchemist analyzes the layout of text in a PDF the way a person would read it, linking related text and paragraphs together, with an option to skip page-specific artifacts such as running headers and footers. The output from PDF Alchemist can be saved as reflowable content in an export file, and from there the content can be used in a variety of ways.
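
The layout cues described above (position near the page edge, larger and centered text) can be illustrated with a toy classifier. This is only a minimal sketch and is not PDF Alchemist's algorithm or API: the Line type, the classify and reflow functions, and every threshold below are hypothetical, chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One line of text with its position on the page (hypothetical type)."""
    text: str
    y: float          # distance from the top of the page, in points
    x_left: float     # left edge of the line, in points
    x_right: float    # right edge of the line, in points
    font_size: float  # font size, in points


def classify(line, page_width, page_height, body_font_size,
             margin=0.08, center_tol=0.05, heading_scale=1.2):
    """Label a line as 'header', 'footer', 'heading', or 'body'.

    All thresholds are illustrative assumptions, not product defaults.
    """
    # Text inside the top or bottom margin band is treated as a page artifact.
    if line.y < page_height * margin:
        return "header"
    if line.y > page_height * (1 - margin):
        return "footer"
    # Text that is clearly larger than body text and roughly centered
    # on the page is treated as a heading.
    line_center = (line.x_left + line.x_right) / 2
    centered = abs(line_center - page_width / 2) <= page_width * center_tol
    if line.font_size > body_font_size * heading_scale and centered:
        return "heading"
    return "body"


def reflow(lines, page_width, page_height, body_font_size, skip_artifacts=True):
    """Return line texts in top-to-bottom order, optionally dropping artifacts."""
    result = []
    for line in sorted(lines, key=lambda l: l.y):
        label = classify(line, page_width, page_height, body_font_size)
        if skip_artifacts and label in ("header", "footer"):
            continue
        result.append(line.text)
    return result


# A US Letter page (612 x 792 points) with 11-point body text.
page = [
    Line("Running header", 30, 72, 200, 9),
    Line("Section heading", 120, 256, 356, 18),
    Line("First paragraph of body text.", 150, 72, 540, 11),
    Line("Page 3", 760, 290, 322, 9),
]
labels = [classify(l, 612, 792, 11) for l in page]
print(labels)  # ['header', 'heading', 'body', 'footer']
print(reflow(page, 612, 792, 11))
```

A real converter would weigh far more evidence than this, such as text that repeats across pages, column boundaries, and reading order. The sketch only shows how positional cues can be turned into structural labels and how skipping artifacts changes the reflowed output.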

Sometimes, such as when you want to review the page layout of a PDF flyer or brochure, page fidelity is important. In other situations, accessing the content of a PDF is what matters instead. Consider:

  • An account manager on the road needs to review the updated terms and conditions paragraph in a contract to send to a prospective customer. She really just needs to check the language of that one specific paragraph, and she’s viewing it on her phone. Reflowable, resizable text would make it easier for her to find and approve that paragraph than paging through a PDF file and zooming and panning to reach the text in question.
  • An executive receives a financial report as a PDF document on his phone, while on a flight to New York from Los Angeles. He wants to scroll down to the “bottom line” numbers on the last page. Doing this in HTML, where text is reflowable and resizable, is much easier than in PDF.
  • A workshop wants to distribute learning materials, in PDF format, to an audience of learners who expect to read this material on their phones. Converting the PDF into EPUB eBooks allows each student to read the content comfortably, without the panning and zooming typically required to view PDFs on mobile devices.