Searching for Content in a PDF Document

The PDDocTextFinder feature lets you search an entire PDF document for any phrase or pattern using a regular expression. For example, you can find every telephone number in a PDF file by searching for the pattern ###-###-####, where each # stands for a digit. Or you could look for every instance of the phrase “limited liability partnership” in a 250-page PDF document. You might search for phrases to build a search index for the document, or simply to find key information. You can also use PDDocTextFinder to redact or highlight the content it finds.
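
The telephone number search described above can be sketched with the C++ Standard Library regex classes. This is a minimal illustration, not the PDDocTextFinder API itself: the findAll helper and the kPhonePattern constant are hypothetical names, and the input string stands in for text that has already been extracted from a PDF document.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of Adobe PDF Library): returns every
// substring of `text` that matches `pattern`, in order of appearance.
// `text` stands in for text already extracted from a PDF document.
std::vector<std::string> findAll(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);  // std::regex uses the ECMAScript grammar by default
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}

// Pattern for the ###-###-#### telephone number format:
// three digits, a hyphen, three digits, a hyphen, four digits.
const std::string kPhonePattern = "[0-9]{3}-[0-9]{3}-[0-9]{4}";
```

Calling findAll("Call 312-555-0142 or 800-555-0199 today.", kPhonePattern) returns the two telephone numbers in the order they appear. Note that this simple pattern can also match inside a longer run of digits and hyphens, so a production search may need a stricter expression.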

The PDDocTextFinder API improves on the WordFinder utility offered in Adobe PDF Library. WordFinder is limited to searching for exact text and individual words, and only one page at a time. PDDocTextFinder creates a text finder that can extract words or phrases from the PDF file.

We provide three sets of sample programs for use with PDDocTextFinder, each available for C++, .NET, .NET Core, and Java. The first set searches for content in PDF documents. The second searches for content in PDF documents and exports it to a JSON output file. The third set redacts content: the redaction process removes your search content from a PDF document and replaces it with a solid rectangle.

The PDDocTextFinder API and the sample programs work on Windows, Linux, and macOS platforms.

PDDocTextFinder and the sample programs use regular expressions, or regex, to build searches. A regex is a sequence of characters that specifies a search pattern for finding text in strings. For example, in regex syntax an asterisk (*) means match the preceding character zero or more times, so a search for ab*c could yield "ac", "abc", and "abbbc". A set of brackets [] indicates that the search should match any one of the characters found within those brackets. For example, the regular expression [a-z] will match any single lowercase letter, "a" through "z".
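
The two patterns above can be tried directly with std::regex from the C++ Standard Library. The sketch below is only an illustration of the regex syntax on plain strings; matchesWhole and matchesAnywhere are hypothetical helper names and do not belong to Adobe PDF Library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole of `text` matches `pattern`.
bool matchesWhole(const std::string& text, const std::string& pattern) {
    return std::regex_match(text, std::regex(pattern));
}

// Hypothetical helper: true when `pattern` matches somewhere inside `text`.
bool matchesAnywhere(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}
```

With these helpers, matchesWhole("abbbc", "ab*c") and matchesWhole("ac", "ab*c") are both true, because the asterisk allows zero or more "b" characters. matchesWhole("Q", "[a-z]") is false, because the bracket expression only covers lowercase letters, while matchesAnywhere("PDF2go", "[a-z]") is true because the string contains at least one lowercase letter.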

The PDDocTextFinder utility in Adobe PDF Library uses the ECMAScript syntax, which is the default regex syntax of the C++ Standard Library, for all searches in the C++, .NET, .NET Core, and Java interfaces.

To learn more, visit https://www.cplusplus.com/reference/regex/ECMAScript.
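
As a rough illustration of the ECMAScript syntax, the following sketch uses std::regex, whose default grammar is ECMAScript, to count matches in a plain string. It does not call PDDocTextFinder; countMatches is a hypothetical helper, and the sketch assumes that patterns passed to PDDocTextFinder behave the same way as they do in std::regex.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper: counts the non-overlapping matches of `pattern`
// in `text`. The grammar defaults to ECMAScript, which is also the
// default grammar of std::regex when no flag is given.
std::size_t countMatches(const std::string& text, const std::string& pattern,
                         std::regex::flag_type grammar = std::regex::ECMAScript) {
    const std::regex re(pattern, grammar);
    const std::sregex_iterator first(text.begin(), text.end(), re);
    // A default-constructed iterator marks the end of the match sequence.
    return static_cast<std::size_t>(std::distance(first, std::sregex_iterator()));
}
```

ECMAScript shorthand classes keep patterns compact. For example, \d{3}-\d{3}-\d{4} is equivalent to the ###-###-#### telephone pattern, \bLLP\b matches "LLP" only as a whole word, and limited\s+liability\s+partnership matches the phrase even when its words are separated by several spaces or a tab. In C++ source code, raw string literals such as R"(\bLLP\b)" avoid having to double every backslash.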