Adobe® PDF Library

Exporting Text from a Set of PDF Files

Consider an example. Your firm recently bought a subsidiary that has maintained detailed monthly reports on product performance, sales, customer retention, and customer support and satisfaction for nearly 25 years. All of these reports were preserved as PDF files, and they are quite thorough. Now that you are managing the firm, you would like to convert the content of these PDF files to text so that you can load it into an internal database. That would allow you to search the content and generate statistics for historical analysis and data mining.

The problem is that the original spreadsheet files used to generate the PDF files for the first eight years of reports have been lost. They were created by a consultant, who provided the PDF files, but the subsidiary neglected to ask for the source content. About ten years after the consultant retired and the reporting process was moved in house, an employee of the subsidiary tried to find the consultant to ask for copies of the original spreadsheet files. The employee learned that the consultant had died two years earlier. The source content could not be found.

So you have over 400 of these reports, written between 1988 and 1997 and converted to PDF between 1994 and 1997. Before you can make use of the data in these files, you need to convert them to text. You can adapt TextExtract, one of the sample programs provided with the Adobe PDF Library (APDFL), to that end. TextExtract will be faster and more accurate than Optical Character Recognition (OCR) software, which is designed to turn scanned page images into text; you already have the electronic content. Further, the content in the PDF reports is formatted in such a way that you can transfer it to a spreadsheet in table form and import it from there into a database. What you need is a quick and easy way to convert these PDF files to text files and save the text files to a server directory.

Your own program, based on TextExtract, would need to find all of the PDF files in a server directory that you designate and then loop through them, converting each file to text in turn.
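The directory scan described above could be sketched in Java roughly as follows. This is a minimal illustration that uses only the standard java.nio.file API; the class name PdfFileFinder and the method findPdfFiles are hypothetical names invented for this sketch and are not part of TextExtract or the Adobe PDF Library. The conversion step itself is left as a placeholder.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: not part of the TextExtract sample.
public class PdfFileFinder {

    // Returns every regular file directly inside inputDir whose name ends
    // in ".pdf" (case-insensitive), sorted by path.
    public static List<Path> findPdfFiles(Path inputDir) throws IOException {
        List<Path> pdfFiles = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir)) {
            for (Path entry : stream) {
                String name = entry.getFileName().toString().toLowerCase();
                if (Files.isRegularFile(entry) && name.endsWith(".pdf")) {
                    pdfFiles.add(entry);
                }
            }
        }
        Collections.sort(pdfFiles);
        return pdfFiles;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory to scan is passed as the first argument.
        Path inputDir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path pdf : findPdfFiles(inputDir)) {
            // Placeholder: the adapted TextExtract logic would open and
            // convert each file here.
            System.out.println("Would convert: " + pdf);
        }
    }
}
```

Sorting the list keeps the processing order predictable from run to run, which helps if the output files are numbered in the same order.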

The program should export the content of each PDF file to a text file. TextExtract writes its text file to the same directory as the sample program itself. This excerpt from the C# sample opens the output file just before the loop over the pages:

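// Number of pages in the open document
int nPages = doc.NumPages;
// Receives the words found on each page
IList<Word> pageWords = null;

// Output text file, opened with a relative path
System.IO.StreamWriter logfile = new System.IO.StreamWriter("TextExtract-untagged-out.txt");

// Visit every page in turn (the loop body is not shown in this excerpt)
for (int i = 0; i < nPages; i++)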

The Java sample program sets up its output in the same way, and it also specifies UTF-8 encoding for the writer:

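// Number of pages in the open document
int nPages = doc.getNumPages();
// Receives the words found on each page
List<Word> pageWords = null;

// Output text file, opened with a relative path
FileOutputStream logfile = new FileOutputStream("TextExtract-tagged-out.txt");
// Wrap the stream so that the text is written as UTF-8
OutputStreamWriter logwriter = new OutputStreamWriter(logfile, "UTF-8");

// Visit every page in turn (the loop body is not shown in this excerpt)
for (int i = 0; i < nPages; i++)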

Your adapted program would need to write the text files to a separate server directory instead, saving each one individually under an incremented file name.
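One way to generate the incremented file names is sketched below in Java. The class name OutputNaming, the method numberedTextFile, the "report" prefix, the four-digit zero padding, and the default "text-out" directory are all illustrative assumptions, not names taken from TextExtract. The sketch writes placeholder text where the extracted words would go.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper: not part of the TextExtract sample.
public class OutputNaming {

    // Builds a path such as <outputDir>/report-0007.txt for index 7.
    // Zero padding keeps the files in numeric order when sorted by name.
    public static Path numberedTextFile(Path outputDir, String prefix, int index) {
        return outputDir.resolve(String.format(Locale.ROOT, "%s-%04d.txt", prefix, index));
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the output directory is passed as the first argument.
        Path outputDir = Paths.get(args.length > 0 ? args[0] : "text-out");
        Files.createDirectories(outputDir);

        for (int index = 1; index <= 3; index++) {
            Path target = numberedTextFile(outputDir, "report", index);
            // One writer per output file; try-with-resources closes it.
            try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                // Placeholder: the extracted text would be written here.
                writer.write("placeholder for extracted text\n");
            }
        }
    }
}
```

Opening a new writer inside the loop, once per source PDF, is what keeps each report in its own text file instead of appending everything to a single output file.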

From there, a separate utility could pull the data from these text files and load it into a database table.