Optical Character Recognition (OCR) technology converts images of letters or characters found in a document into machine-readable text. The Java and .NET interfaces for the Adobe PDF Library include an OCR utility that can read a PDF document, recognize text found within images in that document, and then save that text to a new PDF output file. The text recognized in each image stays associated with that image.
When using OCR to extract text from images in a PDF document, the text found in each image is added to the PDF output file, underneath the images themselves. The original images are preserved in the PDF output document. This process is called text underlayment. If a user opens the PDF output document in a viewer like Adobe Acrobat or Reader, the PDF looks the same as before, but the user can select the text in each image. The text can also be extracted from the PDF document.
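The layering behind text underlayment can be sketched in a few lines of Java. This is only a conceptual model, not the Adobe PDF Library API: the `PageElement` and `UnderlaySketch` types are hypothetical, and the `recognizer` function is a stand-in for a real OCR engine. The sketch assumes that elements earlier in a page's content list are drawn first and therefore lie beneath later ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical page model for illustration only; not an Adobe PDF Library type.
final class PageElement {
    final String kind;    // "image" or "text"
    final String payload; // image name, or the recognized text

    PageElement(String kind, String payload) {
        this.kind = kind;
        this.payload = payload;
    }
}

final class UnderlaySketch {
    // Elements earlier in the list are drawn first, so they lie beneath later ones.
    // The recognizer argument is a stand-in for a real OCR engine.
    static List<PageElement> addTextUnderlay(List<PageElement> content,
                                             Function<String, String> recognizer) {
        List<PageElement> result = new ArrayList<>();
        for (PageElement element : content) {
            if (element.kind.equals("image")) {
                // Insert the recognized text just before the image so it sits underneath.
                result.add(new PageElement("text", recognizer.apply(element.payload)));
            }
            result.add(element); // the original element is always preserved
        }
        return result;
    }

    public static void main(String[] args) {
        List<PageElement> page = List.of(new PageElement("image", "scan-1"));
        Function<String, String> fakeOcr = name -> "recognized text of " + name;
        for (PageElement e : addTextUnderlay(page, fakeOcr)) {
            System.out.println(e.kind + " | " + e.payload);
        }
    }
}
```

Because the image is kept and still drawn on top, the page looks unchanged, while the text element beneath it is what a viewer would select or extract.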
One sample uses a JPG image called “text_as_image.jpg” as the input file and processes it with the OCR engine. The image contains about 450 words from the text of an ancient speech. After extracting the text from the image, the program places both the image and the recognized text in a PDF output document.
A second sample uses a PDF document called “scanned_images.pdf” as the input file for OCR processing. This PDF has two pages, each containing a single image. The sample first identifies the images contained in the PDF document and then processes each one with the OCR engine. The program places each image in the PDF output file and adds the text recognized in that image underneath it.
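The two-step flow described here (find the images on each page, then run OCR on each one) might look roughly like the following. This sketch does not use the Adobe PDF Library. `ImageRef`, `ScannedPdfSketch`, and the `recognizer` function are hypothetical names introduced for illustration, and each page is modeled simply as a list of image names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for an image located in a document; not an Adobe PDF Library type.
final class ImageRef {
    final int pageNumber;   // 1-based page the image was found on
    final String imageName; // placeholder for the actual image data

    ImageRef(int pageNumber, String imageName) {
        this.pageNumber = pageNumber;
        this.imageName = imageName;
    }
}

final class ScannedPdfSketch {
    // Step 1: walk every page and record each image found on it.
    static List<ImageRef> findImages(List<List<String>> pages) {
        List<ImageRef> found = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            for (String imageName : pages.get(i)) {
                found.add(new ImageRef(i + 1, imageName));
            }
        }
        return found;
    }

    // Step 2: run the recognizer (a stand-in for an OCR engine) on each image
    // and label the resulting text with the page it belongs to.
    static List<String> recognizeAll(List<ImageRef> images,
                                     Function<String, String> recognizer) {
        List<String> lines = new ArrayList<>();
        for (ImageRef ref : images) {
            lines.add("page " + ref.pageNumber + ": " + recognizer.apply(ref.imageName));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Two pages with one image each, mirroring the shape of the sample input.
        List<List<String>> pages = List.of(List.of("scan-page-1"), List.of("scan-page-2"));
        Function<String, String> fakeOcr = name -> "text from " + name;
        recognizeAll(findImages(pages), fakeOcr).forEach(System.out::println);
    }
}
```

Keeping the page number with each image reference is what would let a real implementation write the recognized text back to the correct page of the output file.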