PDF Alchemist

Datalogics PDF Alchemist is a Software Development Kit (SDK) and Command Line Interface (CLI) application for intelligently extracting text and images from PDF files and exporting that content to HTML 5, Extensible Markup Language (XML), EPUB 3.0, CSV, plain text, or JSON files. JSON, or JavaScript Object Notation, is an open standard, human-readable text format for structured data that is often used as an alternative to XML.

PDF Alchemist employs sophisticated techniques to identify and reconstruct how text flows within the PDF so that this structure can be preserved in the output file. You can also use the library file provided with the software installation package to build your own executable file for PDF Alchemist.

One of the advantages of the popular PDF format is that the appearance of a PDF page remains the same regardless of the hardware or software used to display it. This was the original goal of the PDF format. But when a PDF is converted to another file format, the original document structure is sometimes lost. PDF Alchemist helps you keep this document structure information when converting the document. For example, PDF Alchemist can recognize that a line of text near the top or bottom of a page is a header or footer, or that a line of text that is larger and centered over other text could be a heading followed by a paragraph.

PDF Alchemist analyzes the layout of text in a PDF as a person reading the content would read it, linking related text and paragraphs together with an option to skip page-specific artifacts like running headers and footers. The output from PDF Alchemist can be saved as reflowable content in an export file, and from there the content can be used in a variety of ways.
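
To make the idea of skipping running headers and footers concrete, here is a minimal Python sketch of one simple heuristic: treat the first or last line of a page as a page artifact when the same text recurs on most pages. This is an illustration only and is not PDF Alchemist's actual algorithm; the function name strip_running_lines and the min_share threshold are hypothetical.

```python
import math
from collections import Counter


def strip_running_lines(pages, min_share=0.6):
    """Drop first/last lines that repeat on at least min_share of the pages.

    pages: list of pages, each page a list of text lines.
    Hypothetical helper for illustration; not part of PDF Alchemist.
    """
    if not pages:
        return []
    # Count how often each text appears as the first or last line of a page.
    first = Counter(page[0] for page in pages if page)
    last = Counter(page[-1] for page in pages if page)
    # A line must recur on at least two pages to count as "running".
    needed = max(2, math.ceil(min_share * len(pages)))
    headers = {text for text, n in first.items() if n >= needed}
    footers = {text for text, n in last.items() if n >= needed}

    cleaned = []
    for page in pages:
        body = list(page)
        if body and body[0] in headers:
            body = body[1:]
        if body and body[-1] in footers:
            body = body[:-1]
        cleaned.append(body)
    return cleaned
```

A heuristic this simple misses footers that change from page to page, such as page numbers, which is one reason layout analysis in a real product needs to be more sophisticated.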

What you get when you buy PDF Alchemist

  • A command line tool, allowing you to convert PDF documents into other file formats
  • An API that matches the features provided with the command line tool
  • The capacity to:
    • Cleanly convert PDF documents to HTML, XML, EPUB, CSV, text, or JSON output files
    • Preserve tables, indents, lists, hyperlinks, tables of contents, and art in source PDF documents when exporting them to other file formats
    • Scan images in a PDF document and extract text from those images using Optical Character Recognition technology
    • Export images and fonts found in a PDF document and save them as separate files
  • Technical support from our team of digital document specialists and professionals. You can contact your Datalogics Support representative directly by email or visit our support site for this product.

To learn more about past improvements, please look at our release notes.

How you might use PDF Alchemist

Sometimes, such as when you want to review the page layout of a PDF flyer or brochure, what matters is fidelity to how the original page appears. In other situations, all you need is to export the content of a PDF document so that you can work with it. You can choose either option: export content from a PDF in a way that looks as much like the original as possible, or simply extract text for storage and analysis.

For example, you can use PDF Alchemist to convert a PDF document to HTML or EPUB files, to make the content reflowable and thus easier to read on a mobile device. You can also convert PDF to HTML so that the content can be easily repurposed as a series of web site pages, while preserving the layout and format of the original text and tables.

You can also use PDF Alchemist to convert a PDF document to HTML so that you can extract the text in that document and use it to build an index, making it easier to search for the original PDF later. Or you can simply pull the text out of a PDF document so that it can be edited, reviewed, or stored in a database.
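
As a rough sketch of the indexing use case, the Python example below assumes the HTML output has already been produced and loaded as strings, and builds a small inverted index that maps each word to the source PDFs containing it. It uses only the Python standard library. The names TextExtractor, html_to_text, and build_index are hypothetical and are not part of the PDF Alchemist API.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def html_to_text(html):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def build_index(docs):
    """Map each lowercase word to the set of documents containing it.

    docs: mapping of source PDF name -> HTML string (assumed to be the
    converted output for that PDF).
    """
    index = defaultdict(set)
    for name, html in docs.items():
        for word in re.findall(r"[a-z0-9]+", html_to_text(html).lower()):
            index[word].add(name)
    return index
```

Looking up a word in the returned index gives the set of PDF names to retrieve. A production search index would typically add stemming, stop words, and ranking.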

PDF Alchemist also includes an Optical Character Recognition (OCR) tool. With this OCR tool the software can identify and extract text found within images (such as PNG or BMP files) embedded in a PDF document. This means that information held in images within a PDF, such as photographs, screenshots, or images taken of spreadsheet pages, can also be rendered as text in the output file.

Complying with GDPR

Use PDF Alchemist to help your organization satisfy requirements related to the General Data Protection Regulation (GDPR). The GDPR is a regulation adopted by the European Parliament and the Council of the European Union in 2016 to strengthen data protection for individuals living in the member states of the European Union (28 at the time). The goal of GDPR was to give citizens control over their personal data and to simplify the regulatory environment for businesses and organizations offering citizens goods and services or monitoring their behavior.

Under the GDPR, supervisory authorities can levy fines for non-compliance, as well as issue warnings, reprimands, and corrective orders. Beyond avoiding these penalties and the bad publicity that comes with them, observing the GDPR standards also serves to protect your own employees, business partners, and suppliers. GDPR seeks to make sure that the protection of personal data becomes a core part of how any institution that works within the European Union conducts its business. Your organization must be able to demonstrate that a privacy culture is a central part of the design of your information systems environment and business practices. So conforming to GDPR standards benefits your organization by making it much more secure in a world of global data services that is increasingly dominated by stories of massive (and humiliating) data breaches.

The OCR tool provided with PDF Alchemist could be particularly useful in helping you comply with the GDPR. With PDF Alchemist you can find and extract the private information embedded in your PDF documents, including text found within images in the PDF files, and save those records to HTML files for tracking and safe storage.
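
Once text has been extracted, a follow-up step might scan it for personal data. The Python sketch below assumes the extracted text is already available as strings keyed by file name, and looks only for email-like strings using a deliberately simple pattern. The names EMAIL_RE and find_emails are hypothetical, and a scan like this is a starting point for review, not a compliance check.

```python
import re

# Simplified email pattern for illustration; it will not match every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_emails(text_by_file):
    """Return {file name: sorted unique email-like strings found in its text}.

    text_by_file: mapping of file name -> extracted text (assumed input).
    Files with no matches are omitted from the result.
    """
    hits = {}
    for name, text in text_by_file.items():
        found = sorted(set(EMAIL_RE.findall(text)))
        if found:
            hits[name] = found
    return hits
```

The same structure extends to other patterns, such as phone numbers or national identifiers, by adding further expressions alongside EMAIL_RE.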