Exporting images from a set of PDF files
Suppose you have a series of very long PDF files with dozens of embedded photographs and diagrams. You would like to quickly organize and catalog the images in each of these PDF files. You have the original JPG and PNG files that you added to these PDF files, but the contents of the individual PDF files vary widely. Without some automation, the only way to find out which graphics are in each file would be to open each PDF file one by one, scroll through it, and record the images that appear on each page.
The Adobe PDF Library provides a faster way to complete that process. The sample program ImageExport is designed to review a PDF file, identify the images that appear in it, and export them to a series of external graphics files. You can select the format you like for these files: TIF, JPG, PNG, GIF, or BMP. After extracting the images to a separate directory, you could browse through them quickly using an image viewer such as Microsoft Office Picture Manager.
The ImageExport program defines the types of graphics files that can be used for the export:
In Java, ImageExport.java calls code in ExportDocumentImages.java, where the values are defined:
You can also select the file type to use for export. This sample code would export images to both JPG and PNG files, but you could write your program to use only one file type:
The Java code is in ExportDocumentImages.java:
The program is designed to ask a user to enter a file name at a command prompt:
The same prompt appears in ImageExport.java:
You could use this sort of prompt to allow a user to run your custom program manually. Or you might want to write a program that finds a set of PDF files in a server directory and cycles through them one by one. In that case you would probably want the export step to write the exported graphics files for each PDF to its own server directory, so that images from different documents do not get mixed together.
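The batch approach described above can be sketched in plain Java. This is only an illustration under stated assumptions, not part of the ImageExport sample: the class name BatchImageExport, the PdfImageExporter interface, and the "exported-images" default folder are all hypothetical, and the exporter is a placeholder where a real program would call its own Adobe PDF Library export code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class BatchImageExport {

    /** Hypothetical hook: a real program would run its image export here. */
    interface PdfImageExporter {
        void export(Path pdfFile, Path outputDir) throws IOException;
    }

    /** Returns the PDF files directly inside inputDir, sorted by path. */
    static List<Path> findPdfFiles(Path inputDir) throws IOException {
        List<Path> pdfs = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(inputDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString().toLowerCase(Locale.ROOT);
                if (Files.isRegularFile(entry) && name.endsWith(".pdf")) {
                    pdfs.add(entry);
                }
            }
        }
        Collections.sort(pdfs);
        return pdfs;
    }

    /** Maps "report.pdf" to outputRoot/report so each PDF gets its own folder. */
    static Path outputDirFor(Path pdfFile, Path outputRoot) {
        String name = pdfFile.getFileName().toString();
        String baseName = name.substring(0, name.length() - ".pdf".length());
        return outputRoot.resolve(baseName);
    }

    /** Creates one output folder per PDF and hands both paths to the exporter. */
    static int exportAll(Path inputDir, Path outputRoot, PdfImageExporter exporter)
            throws IOException {
        int count = 0;
        for (Path pdf : findPdfFiles(inputDir)) {
            Path outputDir = outputDirFor(pdf, outputRoot);
            Files.createDirectories(outputDir);
            exporter.export(pdf, outputDir);
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        Path inputDir = Paths.get(args.length > 0 ? args[0] : ".");
        Path outputRoot = Paths.get(args.length > 1 ? args[1] : "exported-images");
        // Placeholder exporter: only reports what a real export would do.
        int processed = exportAll(inputDir, outputRoot,
                (pdf, dir) -> System.out.println("Would export images from " + pdf + " to " + dir));
        System.out.println("Processed " + processed + " PDF file(s).");
    }
}
```

Keeping the export step behind a small interface means the directory handling can be tried out on its own before any PDF processing is wired in.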
Using Optical Character Recognition (OCR) technology to convert a PDF file to text
You receive a fax from a customer or vendor, and you would like to be able to turn the pages into text. You have an Optical Character Recognition (OCR) tool that can review PNG image files, convert the images into text, and export that text to a separate TXT file. Before you can run the OCR utility, though, you need PNG files to give it. You can save the fax as a PDF file rather than printing it, but you then need to convert that PDF file to a series of PNG files, one PNG for each page.
The DoctoImages sample program can convert each page of a PDF file to the graphics image file type that you select. This program does not find and extract images from the PDF; rather, each page is converted, as it appears, into a single graphics file.
The program sets the output format to use, in this case, JPG. This is the C# code:
Note that in the code snippets above, the C# program refers to the compression code (DCT, the discrete cosine transform compression used by JPEG), and in Java, the code refers to the format option. These are both options that a user could enter with the DoctoImages command when running the program at a command prompt.
You could use the DoctoImages program code as it is to convert a single PDF file to a series of JPG pages. You could also write a new program, based on DoctoImages, that selects a PDF file from a server directory automatically and by default converts the pages of that file to JPG or PNG images.
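As a rough sketch of the "default format" idea, the plain Java below picks an output format from an optional command-line argument, falls back to JPG, and builds one output file name per page. It does not use the Adobe PDF Library and is not taken from DoctoImages: the class name PageImageNames, the two-value Format enum, the argument order, and the "-page-001" naming pattern are all assumptions made for illustration, and the page count is hard-coded where a real program would read it from the document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PageImageNames {

    /** Formats this sketch accepts; an assumption, not the sample's full list. */
    enum Format { JPG, PNG }

    /**
     * Picks the format from the optional second argument, defaulting to JPG.
     * An unrecognized value makes Format.valueOf throw IllegalArgumentException.
     */
    static Format chooseFormat(String[] args) {
        if (args.length < 2) {
            return Format.JPG;
        }
        return Format.valueOf(args[1].trim().toUpperCase(Locale.ROOT));
    }

    /** Builds names such as "fax-page-001.jpg", one per page (hypothetical pattern). */
    static List<String> pageFileNames(String pdfFileName, int pageCount, Format format) {
        // Drop a trailing ".pdf" in any letter case to get the base name.
        String base = pdfFileName.replaceFirst("(?i)\\.pdf$", "");
        String ext = format.name().toLowerCase(Locale.ROOT);
        List<String> names = new ArrayList<>();
        for (int page = 1; page <= pageCount; page++) {
            names.add(String.format(Locale.ROOT, "%s-page-%03d.%s", base, page, ext));
        }
        return names;
    }

    public static void main(String[] args) {
        String pdf = args.length > 0 ? args[0] : "fax.pdf";
        int pageCount = 3; // placeholder; a real program would ask the PDF for this
        for (String name : pageFileNames(pdf, pageCount, chooseFormat(args))) {
            System.out.println(name);
        }
    }
}
```

Zero-padding the page number keeps the files in page order when a directory listing sorts them by name, which is convenient if an OCR tool processes them in that order.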
The sample program saves each exported page from the PDF file to a new JPG file, and assigns it a name, in C#:
And in Java: