The PDF Library offers a series of sample programs that find information in a PDF file and either list that information at the command prompt or save it to a new PDF file.
Use the code provided in these sample programs to guide you in writing your own systems that can extract information from one or more PDF files and export that data to a command prompt, an external file, a spreadsheet, or a database. This information can be very useful for research and statistical analysis.
For example, suppose a client generates several thousand PDF files for printing advertising inserts. The inserts are distributed broadly, but to different markets and in many different local and regional newspapers or advertising circulars, so the flyers vary with the products available in each market, the prices quoted, special local sales and offers, and so on. You could use one of these list programs to identify the text and graphics that vary from one PDF file to another and copy them to a database for tracking. The location of each text or graphic element in the file can be identified by x and y coordinates, which makes it easier to compare one PDF file to another.
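The comparison by coordinates can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the PDF Library: the `Placement` record, the `diff_placements` function, the 0.5 point tolerance, and the sample flyer data are all assumptions made for the example, and the records would in practice come from the output of one of the list programs.

```python
from typing import NamedTuple


class Placement(NamedTuple):
    """One piece of text and the x/y position of its lower-left corner, in points."""
    text: str
    x: float
    y: float


def diff_placements(base, other, tolerance=0.5):
    """Return (only_in_base, only_in_other, changed) for two lists of placements.

    Two placements occupy the same spot when both coordinates agree within
    `tolerance` points. A spot whose text differs is reported as changed.
    """
    def same_spot(a, b):
        return abs(a.x - b.x) <= tolerance and abs(a.y - b.y) <= tolerance

    changed, only_in_base = [], []
    unmatched = list(other)
    for a in base:
        match = next((b for b in unmatched if same_spot(a, b)), None)
        if match is None:
            only_in_base.append(a)
            continue
        unmatched.remove(match)
        if match.text != a.text:
            changed.append((a, match))
    return only_in_base, unmatched, changed


# Hypothetical records for the same flyer printed for two markets.
boston = [Placement("Apples", 72.0, 700.0),
          Placement("$1.99/lb", 200.0, 700.0),
          Placement("Store hours: 8-9", 72.0, 90.0)]
denver = [Placement("Apples", 72.0, 700.2),
          Placement("$2.49/lb", 200.0, 700.0)]

only_boston, only_denver, changed = diff_placements(boston, denver)
for a, b in changed:
    print(f"({a.x}, {a.y}): {a.text!r} -> {b.text!r}")
```

The three result lists map naturally onto database rows: text that exists in only one flyer, and text that sits in the same place but reads differently.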
Several other sample programs allow you to look at information about PDF files:
- Metadata, a program that allows you to work with information about a PDF file, such as the author name and subject, or a time and date stamp.
- PageLabels, a program that allows you to work with the page label, the data structure in a PDF file that defines the page numbering for each section.
- PDFObject and PDFObjectExplorer, two programs that allow you to look at information about objects embedded in a PDF file.
The ListBookmarks program walks through the PDF file that you select and identifies any bookmarks in that file. It then describes each one. After the command runs, the text describing each bookmark looks like this:
[ Title=Detailed 7-day Forecast, Color=[ space = DeviceRGB, Red=1, Green=0, Blue =0 ], Flags=[BOLD, ITALIC], Count=0, Indent=0, Open=false, ViewDestination=[ Page=0, Fit=XYZ, DestRect=[ left=-2, bottom=NullValue, right=NullValue, top=494 ],Zoom=1.36 ] ]: Detailed 7-day Forecast, page 0, fit XYZ, dest rect [ LLx=-2, LLy=-1.#INF, URx=-1.#INF, URy=494 ], zoom 1.36000061035156
Each bookmark description includes the name of the bookmark, colors, typeface changes, sizing and placement.
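If you want to load these descriptions into a spreadsheet or database, the bracketed text can be picked apart with regular expressions. The sketch below works against the sample line shown above; the function name `parse_bookmark` and the choice of fields are assumptions for illustration, and the patterns may need adjusting if the output of your version of the program differs.

```python
import re

# The bracketed part of the sample output shown above.
SAMPLE = ("[ Title=Detailed 7-day Forecast, Color=[ space = DeviceRGB, Red=1, "
          "Green=0, Blue =0 ], Flags=[BOLD, ITALIC], Count=0, Indent=0, Open=false, "
          "ViewDestination=[ Page=0, Fit=XYZ, DestRect=[ left=-2, bottom=NullValue, "
          "right=NullValue, top=494 ],Zoom=1.36 ] ]")


def parse_bookmark(line):
    """Pull a few fields out of one bookmark description (hypothetical helper)."""
    def grab(pattern, convert=str):
        match = re.search(pattern, line)
        return convert(match.group(1)) if match else None

    flags = grab(r"Flags=\[([^\]]*)\]") or ""
    return {
        "title": grab(r"Title=(.*?), Color="),
        "flags": [f.strip() for f in flags.split(",") if f.strip()],
        "open": grab(r"Open=(\w+)", lambda s: s == "true"),
        "page": grab(r"Page=(\d+)", int),
        "fit": grab(r"Fit=(\w+)"),
        "zoom": grab(r"Zoom=([\d.]+)", float),
    }


bookmark = parse_bookmark(SAMPLE)
print(bookmark["title"], bookmark["flags"], bookmark["page"], bookmark["zoom"])
```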
The ListInfo program lists the metadata information found in a PDF file. You can also change these values and copy your new information to a new document. The program names a default input PDF document. You can enter the name of your own file in the program if you like, or name the file at the command prompt. You can also edit the program so that it prompts a user to enter new values, including the document Title, Subject, Author, Keywords, Creator, and Producer. The Creator and Producer refer to the software used to produce the original document that was then saved as the PDF file, such as Microsoft Word or Excel. You could use this code as the basis for a program that automatically lists the authors and subject lines for a large group of PDF files, or that automatically changes these values.
When the program finishes, it exports the values you entered to a new PDF file. If you open the PDF file, right-click, and select Document Properties, the values that you entered at the prompts for the ListInfo program appear.
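For the batch listing idea, once the values have been collected for each file they can be written out as a spreadsheet-friendly CSV table. The following sketch covers only that last step and makes no calls to the PDF Library; the `info_to_csv` function and the sample records are assumptions for illustration.

```python
import csv
import io

# The document information fields named above.
INFO_FIELDS = ["Title", "Subject", "Author", "Keywords", "Creator", "Producer"]


def info_to_csv(records):
    """Turn {file name: {field: value}} into CSV text, one row per file."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["File"] + INFO_FIELDS,
                            lineterminator="\n")
    writer.writeheader()
    for name in sorted(records):
        row = {"File": name}
        # Fields missing from a file are written as empty cells.
        row.update({f: records[name].get(f, "") for f in INFO_FIELDS})
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical values, as they might be gathered from two PDF files.
records = {
    "b.pdf": {"Title": "Spring Flyer", "Author": "J. Doe"},
    "a.pdf": {"Title": "Fall Flyer", "Author": "A. Roe", "Creator": "Microsoft Word"},
}
print(info_to_csv(records))
```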
The ListLayers program lists the color layers in an existing PDF file. The program provides the name of a default PDF input file. You can enter the name of your own input document in the program code, or enter the file name in a command line prompt.
Optional Content Groups (OCG) are referred to as layers within Adobe Acrobat and Reader, and can be used to separate and manage content or graphics on a single page. Layers are a very useful way to present information when opening a PDF file. For example, you could create a brochure with multiple layers offering the same content but in different languages. The first layer would be the blank background page. The resulting PDF file could be set up, with some extra program code, to select the appropriate layer with French or Spanish or English, depending on the language of the reader, and then display that language in the PDF file.
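The language selection described here comes down to computing one on/off state per layer. A minimal sketch of that logic follows, using plain Python lists rather than the PDF Library; the layer names and the `states_for_language` function are assumptions for illustration.

```python
def states_for_language(layer_names, language, always_on=("Background",)):
    """Return one visibility flag per layer: background layers plus the chosen language."""
    return [name in always_on or name == language for name in layer_names]


# Hypothetical brochure with a background layer and three language layers.
layers = ["Background", "English", "French", "Spanish"]
print(states_for_language(layers, "French"))  # [True, False, True, False]
```

The result has the same shape as the list of optional content states that ListLayers prints: one Boolean value per layer, in layer order.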
If the PDF file does not include any layers, the program will show that none are identified:
Initialized the library.
Input file: testexport.PDF
Optional content states: 
If the file does include layers, the program lists each layer with its intent, followed by the optional content states:
Initialized the library.
Input file: datalayers.PDF
Guides and Grids
Intent: [View, Design]
Layer 1
Intent: [View, Design]
Layer 2
Intent: [View, Design]
Layer 3
Intent: [View, Design]
Optional content states: [False, True, True, True]
C:\Datalogics\APDFL18.0.0\DotNET\Sample_Source\InformationExtraction\
The ListPaths program lists the path names found in an existing PDF file. The program defines the name of a default PDF input file. You can change the code to provide your own PDF document, or enter the file name in a command prompt.
In a PDF file, paths (including clipping paths) define shapes, lines, boundaries for clip art or graphics, and filled areas within graphics. You can use a clipping path to edit a graphic design by removing part of the art, so that only the shapes that you want appear. You could use a clipping path to remove the background in a photo, for example, so that a person or object is highlighted and the background is white. Or you could superimpose text over an image. With a clipping path, only the part of an image that falls inside the shape or shapes you create appears.
When you run the ListPaths program it may respond with multiple lines of x and y coordinates for each line in the PDF file.
Use ListWords to list and describe the text of the words found in a PDF file.
The program provides the name of a default PDF input document. You can enter the name of your own PDF input file in the program, or enter a file name at the command prompt. If the file you select contains text, the program displays detailed information about each individual word. The description includes the placement of the word on the page (the coordinates of its top left, top right, bottom left, and bottom right corners), spaces, the font size and name, and attribute flags such as:
- HasNonalphanum, has non-alphanumeric character
This is a sample of the output at the command line:
[ TopLeft=( 72.024,481.75 ), TopRight=( 90.0523,481.75 ), BottomLeft=( 72.024,470.71 ), BottomRight=( 90.0523,470.71 ) ] [charIndex=0, style=[color=[ space = DeviceGray, Gray=0 ], fontsize=11.04, fontname=Calibri]] [charIndex=0, style=[color=[ space = DeviceGray, Gray=0 ], fontsize=11.04, fontname=Calibri]] HasLetter, HasUppercase, AdjacentToSpace, WordIsUnicode, ExtCharOffsets
[ TopLeft=( 92.5473,481.75 ), TopRight=( 99.4031,481.75 ), BottomLeft=( 92.5473,470.71 ), BottomRight=( 99.4031,470.71 ) ] [charIndex=0, style=[color=[ space = DeviceGray, Gray=0 ], fontsize=11.04, fontname=Calibri]] [charIndex=0, style=[color=[ space = DeviceGray, Gray=0 ], fontsize=11.04, fontname=Calibri]] HasLetter, AdjacentToSpace, WordIsUnicode, ExtCharOffsets
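The corner coordinates in this output are easy to recover with a regular expression, which is a useful first step toward storing word positions in a database. The sketch below parses the corners from the first sample line and derives the width and height of the word; `parse_quad` and `quad_size` are hypothetical helper names, and the size calculation assumes the word is not rotated.

```python
import re

# Matches entries such as "TopLeft=( 72.024,481.75 )".
CORNER = re.compile(
    r"(TopLeft|TopRight|BottomLeft|BottomRight)=\( ([-\d.]+),([-\d.]+) \)")


def parse_quad(line):
    """Return {corner name: (x, y)} for one line of ListWords output."""
    return {name: (float(x), float(y)) for name, x, y in CORNER.findall(line)}


def quad_size(quad):
    """Width and height in points, assuming an unrotated (axis-aligned) word."""
    width = quad["TopRight"][0] - quad["TopLeft"][0]
    height = quad["TopLeft"][1] - quad["BottomLeft"][1]
    return width, height


line = ("[ TopLeft=( 72.024,481.75 ), TopRight=( 90.0523,481.75 ), "
        "BottomLeft=( 72.024,470.71 ), BottomRight=( 90.0523,470.71 ) ]")
quad = parse_quad(line)
width, height = quad_size(quad)
print(f"width={width:.4f} height={height:.2f}")
```

For this sample the computed height equals the reported font size of 11.04 points.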
You can add metadata to a PDF file to store background information about the document in the file itself.
This sample program adds metadata to a file called sample.PDF, and lists out those changes at the command prompt:
Title: National Weather Service Zone Forecast
CreatorTool: PScript5.dll Version 5.2.2
format: application/PDF
Number of authors: 1
Author: kam
Ducky CreatorTool:
These metadata values, such as the title and author, appear in the Properties window when you open sample.PDF in Adobe Reader.
You could use this program as the basis for writing code to add metadata to a single PDF file or a group of PDF files, applying the same author name to several hundred PDF files at once, for example. Or you could add a search tag to identify the file later, or add a time and date stamp for when the original content was created.
Each PDF file features a data structure that describes the page numbering for that file. It defines the type of numeral used for page numbers in headers and footers (Arabic or Roman) and sets the page numbering for each section. The page number label that appears on each page in the PDF file, showing the current page number and the total number of pages in the document (“page 3 of 14”), is in fact stored in this data structure. That way, you can insert or remove pages in a PDF file and the file will adjust the page numbering structure automatically.
The PageLabels program shows how to edit this page number data structure, and displays information about the structure at the command prompt. The program does not generate an output file. The program completes a series of steps to edit the Page Label for a PDF file, such as adding prefixes to page labels.
C:\Datalogics\APDFL18.0.0\DotNET\Sample_Source\ContentModification\PageLabels
Page Labels Sample:
Initialized the library.
Input file: Resources/Sample_Input/pagelabels.PDF
Last page in the document is labeled A-C
A-C has an index of 12 in the document.
Added page range starting on page 5.
Changed the prefix for the third range.
Label range starts on page 0, ends on page 1 The prefix is '' and begins with number 1
Label range starts on page 2, ends on page 4 The prefix is 'Body-' and begins with number 3
Label range starts on page 5, ends on page 9 The prefix is 'Section 3-' and begins with number 2
Label range starts on page 10, ends on page 12 The prefix is 'A-' and begins with number 1
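To see how a label such as A-C follows from these ranges, the lookup can be modeled in a few lines. The sketch below is an independent Python model, not the PageLabels sample itself: the `LabelRange` class and helper functions are hypothetical, the style codes follow the PDF page label conventions (D for decimal, R and r for Roman numerals, A and a for letters), and the decimal style for the first three ranges is an assumption, since the output above does not state it.

```python
from dataclasses import dataclass


@dataclass
class LabelRange:
    first_page: int   # zero-based index of the first page in the range
    prefix: str = ""
    style: str = "D"  # D decimal, R/r Roman numerals, A/a letters
    start: int = 1    # number given to the first page of the range


def to_roman(n):
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, numeral in pairs:
        count, n = divmod(n, value)
        out.append(numeral * count)
    return "".join(out)


def to_letters(n):
    # A..Z for 1..26, then AA..ZZ for 27..52, and so on.
    return chr(ord("A") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def page_label(ranges, page_index):
    """Label for a zero-based page index, using the last range that starts at or before it."""
    current = max((r for r in ranges if r.first_page <= page_index),
                  key=lambda r: r.first_page)
    number = current.start + page_index - current.first_page
    if current.style == "D":
        numeral = str(number)
    elif current.style in ("R", "r"):
        numeral = to_roman(number)
    elif current.style in ("A", "a"):
        numeral = to_letters(number)
    else:
        raise ValueError(f"unsupported style {current.style!r}")
    if current.style.islower():
        numeral = numeral.lower()
    return current.prefix + numeral


# The four ranges reported in the sample output (styles partly assumed).
ranges = [LabelRange(0), LabelRange(2, "Body-", "D", 3),
          LabelRange(5, "Section 3-", "D", 2), LabelRange(10, "A-", "A", 1)]
print(page_label(ranges, 12))  # A-C
```

Because each label is computed from the range start and an offset, inserting or removing pages only shifts the offsets; no stored label has to be rewritten.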
This sample program demonstrates how to review a PDF file for objects, examine them in detail, and then display information about those objects at the command prompt. It is a simpler version of the program PDF Object Explorer. You can use this sample code to provide ideas on how to work with any type of PDF object.
The PDFObject sample program specifically extracts data for an object known as a URIAction. A Uniform Resource Identifier (URI) is a string of characters that identifies a name or a resource available on the Internet. Typically a URI is a hyperlink to a web page address, and in that case people usually call it a URL.
The output for this program at a command line prompt will look something like this:
C:\Datalogics\APDFL18.0.0\DotNET\Sample_Source\ContentModification\PDFobject
PDFObject Sample:
Resources/Sample_Input/sample_links.PDF
Initial URL: https://www.datalogics.com
Is Map property: False
Does this dictionary have an IsMap entry? False
Modified URL: http://www.google.com
Is Map property (if not present, defaults to false): False
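The output distinguishes between the value of the IsMap property and whether the entry is actually present in the action dictionary. That distinction can be illustrated with an ordinary Python dictionary standing in for the PDF dictionary. This is a model of the data only, not the PDF Library API; the `describe_uri_action` function is hypothetical, while the key names S, URI, and IsMap and the default of false are those defined for URI actions in the PDF specification.

```python
def describe_uri_action(action):
    """Summarize a URI action dictionary (modeled here as a plain dict)."""
    if action.get("S") != "URI":
        raise ValueError("not a URI action")
    return {
        "url": action["URI"],
        "has_is_map": "IsMap" in action,         # is the entry present at all?
        "is_map": action.get("IsMap", False),    # absent entries default to false
    }


action = {"S": "URI", "URI": "https://www.datalogics.com"}
before = describe_uri_action(action)

action["URI"] = "http://www.google.com"          # modify the link target
after = describe_uri_action(action)

print(before)
print(after)
```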
The program names a default PDF input file, sample_annotations.PDF. Instead, you need to enter the file sample_links.PDF, found in the Resources\Sample_Input subdirectory.
If you enter the name of a different PDF file, even one with an embedded hyperlink, the program will fail, and you will see an Unhandled Exception error message.
PDF files are made up of individual elements called objects. PDF Objects can include:
- The null object, used to indicate an absence of value
- Boolean values
- Integer values
- Floating-point real numbers
- Strings of characters
- Arrays, one-dimensional collections of objects
- Dictionaries, containers of matching keys and values
- Streams, sequences of bytes of any length
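One way to get a feel for these object types is to map them onto the built-in types of a general-purpose language. The sketch below does this in Python; the mapping, the `Stream` class, and the `pdf_type` function are illustrative assumptions and not how the PDF Library represents objects.

```python
from dataclasses import dataclass, field


@dataclass
class Stream:
    """A stream pairs a dictionary of properties with a sequence of bytes."""
    dictionary: dict = field(default_factory=dict)
    data: bytes = b""


def pdf_type(obj):
    """Name the PDF object type that a Python value stands in for."""
    if obj is None:
        return "Null"
    if isinstance(obj, bool):      # checked before int: bool is a subclass of int
        return "Boolean"
    if isinstance(obj, int):
        return "Integer"
    if isinstance(obj, float):
        return "Real"
    if isinstance(obj, str):
        return "String"
    if isinstance(obj, list):
        return "Array"
    if isinstance(obj, dict):
        return "Dictionary"
    if isinstance(obj, Stream):
        return "Stream"
    raise TypeError(f"no PDF equivalent for {type(obj).__name__}")


samples = [None, True, 12, 1.36, "kam", [0, 0, 612, 792],
           {"Count": 13}, Stream({"Length": 5}, b"hello")]
print([pdf_type(s) for s in samples])
```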
The PDF Object Explorer is a viewing tool that provides a way for you to open a PDF file and look at the objects associated with that PDF file in a tree view pane. So you can run this utility to look at the internal structure of a given PDF file, and you can open as many individual PDF Object Explorer sessions as you like at one time on your desktop.
The PDF Object Explorer has two sections, the Information Dictionary, where you can find the author, creation date, title, and other basic information about the file, and the Root. The Root allows you to view descriptions of the objects within the PDF file structure, such as names, metadata, pages, outlines, and page labels. The structure of the Root in a PDF file is complex and will vary from one PDF file to another.
The left side of the window is a tree view, with two primary sections, the Information Dictionary (InfoDict) and Root. Click on the plus signs to expand the sections under the InfoDict and Root. An icon identifies the type of each object in the tree:
- Dictionary, a small red book
- Array, a zero in brackets
- Stream, a faucet
- Boolean, an inverted U
- Integer, a negative 1
- Real Numbers, the Pi symbol
The right panel on the screen shows information about the item selected, such as the type and value.
If you select a stream, the lower right panel will show the content of that stream, either Unfiltered raw data, as it would appear in a text editor, or Filtered.
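The tree view is essentially a recursive walk over nested dictionaries and arrays. A rough text-only approximation is sketched below, with nested Python dicts and lists standing in for PDF objects; the `tree_lines` function and the sample catalog contents are assumptions for illustration, and a real Root is far larger.

```python
def tree_lines(node, label="Root", depth=0):
    """Yield one indented line per object, in the order a tree view would show them."""
    indent = "  " * depth
    if isinstance(node, dict):
        yield f"{indent}{label} (Dictionary)"
        for key, value in node.items():
            yield from tree_lines(value, key, depth + 1)
    elif isinstance(node, list):
        yield f"{indent}{label} (Array)"
        for index, value in enumerate(node):
            yield from tree_lines(value, f"[{index}]", depth + 1)
    else:
        yield f"{indent}{label} = {node!r}"


# Hypothetical, heavily simplified Root structure.
catalog = {
    "Type": "Catalog",
    "Pages": {"Type": "Pages", "Count": 2, "Kids": ["page 1", "page 2"]},
}
for line in tree_lines(catalog):
    print(line)
```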
For a list of the Keys, Types, and Values related to root structures in a PDF file, see the table in the ISO 32000-1:2008, Document Management—Portable Document Format—Part 1: PDF 1.7, Table 28, “Entries in the catalog dictionary,” page 73.