The PDDocTextFinder feature lets you search an entire PDF document for any phrase or pattern using a regular expression. For example, you can find every telephone number in a PDF file by searching for a telephone number pattern, that is, any digits that appear in the format ###-###-####. Or you could look for every instance of the phrase “limited liability partnership” in a 250-page PDF document. You might search for phrases in a PDF document to build a search index for that document, or simply to find key information. You can also use PDDocTextFinder to redact or highlight content found in a PDF document.
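As an illustration of the telephone number pattern described above, here is a minimal sketch that uses only the C++ Standard Library. It does not call PDDocTextFinder: the plain string stands in for text already extracted from a PDF, and the names find_all and kPhonePattern are hypothetical, chosen for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Collect every substring of `text` that matches `pattern`.
// `text` stands in for text already extracted from a PDF; this helper
// is hypothetical and is not part of the PDDocTextFinder API.
std::vector<std::string> find_all(const std::string& text,
                                  const std::string& pattern) {
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::vector<std::string> hits;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        hits.push_back(it->str());
    }
    return hits;
}

// ###-###-####: three digits, three digits, four digits, separated by
// hyphens. \b keeps a match from starting or ending inside a longer
// run of digits.
const std::string kPhonePattern = R"(\b\d{3}-\d{3}-\d{4}\b)";
```

Calling find_all("Call 312-555-0142 or 800-555-0199.", kPhonePattern) returns the two numbers in the order they appear. The word boundaries are a design choice for this sketch; drop them if numbers embedded in longer digit strings should also match.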
The PDDocTextFinder API improves on the WordFinder utility offered in Adobe PDF Library, which is limited to searching exact text and individual words, and only one page at a time. PDDocTextFinder creates a text finder that can extract words or phrases from the PDF file.
We provide three sets of sample programs for use with PDDocTextFinder, each available for C++, .NET, .NET Core, and Java. The first set searches for content in PDF documents. The second searches for content in PDF documents and exports it to a JSON output file. The third redacts content: the redaction process removes the content that matches your search from a PDF document and replaces it with a solid rectangle.
The PDDocTextFinder API and the sample programs work on Windows, Linux, and macOS platforms.
PDDocTextFinder and the sample programs use regular expressions, or regex, to build searches. A regex is a sequence of characters that specifies a search pattern, used by algorithms that find text in strings. For example, in regex syntax an asterisk (*) means match the preceding character zero or more times, so a search for ab*c matches "ac", "abc", and "abbbc". A set of brackets [] matches any single character listed within the brackets, so the regular expression [a-z] matches any one lowercase letter, "a" through "z".
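The two constructs described above can be checked directly with std::regex from the C++ Standard Library. This is a small sketch rather than PDDocTextFinder code, and the helper name matches_fully is an assumption made for the example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// True when the whole of `candidate` matches `pattern`.
// Hypothetical helper for illustration; not part of PDDocTextFinder.
bool matches_fully(const std::string& candidate, const std::string& pattern) {
    // std::regex_match succeeds only if the pattern covers the entire
    // string, unlike std::regex_search, which accepts a partial hit.
    return std::regex_match(candidate, std::regex(pattern));
}
```

With this helper, matches_fully("abbbc", "ab*c") and matches_fully("ac", "ab*c") are both true, while matches_fully("abd", "ab*c") is false. Likewise, matches_fully("m", "[a-z]") is true, and matches_fully("M", "[a-z]") is false because the bracket range covers lowercase letters only.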
The PDDocTextFinder utility in Adobe PDF Library uses the ECMAScript syntax, which is the default regex syntax of the C++ Standard Library, for all searches in the C++, .NET, .NET Core, and Java interfaces.
To learn more, visit https://www.cplusplus.com/reference/regex/ECMAScript.
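To show what the ECMAScript syntax looks like in the C++ Standard Library itself, the sketch below names the grammar explicitly and combines it with case-insensitive matching. It searches a plain string rather than a PDF, and the function name count_phrase is hypothetical. Allowing any whitespace between words is an assumption about how extracted text may be broken across lines, not documented PDDocTextFinder behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Count occurrences of a phrase, ignoring case and allowing any run of
// whitespace (including line breaks) between its words.
// Hypothetical helper for illustration; not part of PDDocTextFinder.
std::size_t count_phrase(const std::string& text) {
    // Naming the grammar is optional: ECMAScript is what std::regex
    // uses when no grammar flag is given.
    const std::regex re(R"(limited\s+liability\s+partnership)",
                        std::regex::ECMAScript | std::regex::icase);
    const std::sregex_iterator first(text.begin(), text.end(), re);
    const std::sregex_iterator last;  // default-constructed end marker
    return static_cast<std::size_t>(std::distance(first, last));
}
```

Given text that contains "Limited Liability Partnership" once in title case and once in lowercase split across a line break, count_phrase reports both occurrences.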