Adobe PDF Library

Working with Unicode

Creating a font from a name (Font constructor)

The Adobe PDF Library Java and .NET Interface uses Identity-H or Identity-V encoding when rendering text in Unicode or with glyph ID codes in Type 0 fonts.

If the font is not a Type 0 font, the Java and .NET Interface creates it with the platform's default font encoding:

  • MacRomanEncoding for Apple machines
  • WinAnsiEncoding for Windows, Linux, and Unix
  • Custom encoding for symbol fonts

When the Java and .NET Interface sets text, it checks whether the text can be represented in the font's encoding. If it cannot, the Interface creates a Unicode font with Identity-H encoding and uses that font to set the text. For fonts that are not Type 0, this may result in two versions of the same font in the output document.
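The representability check described above can be illustrated with the standard Java charset API. This is a minimal sketch, not the Interface's own logic: the class and method names are hypothetical, and ISO-8859-1 is used only as a rough stand-in for WinAnsiEncoding because every Java runtime provides it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingCheck {
    // Assumption: ISO-8859-1 approximates the font's single-byte encoding.
    // It is not identical to WinAnsiEncoding or MacRomanEncoding.
    static final Charset SIMPLE_ENCODING = StandardCharsets.ISO_8859_1;

    /** Returns true if every character of the text can be encoded in SIMPLE_ENCODING. */
    static boolean isRepresentable(String text) {
        return SIMPLE_ENCODING.newEncoder().canEncode(text);
    }

    /** Mirrors the rule in the text: fall back to Identity-H when the simple encoding fails. */
    static String chooseEncoding(String text) {
        return isRepresentable(text) ? "simple" : "Identity-H";
    }

    public static void main(String[] args) {
        System.out.println(chooseEncoding("R\u00e9sum\u00e9"));       // Latin text
        System.out.println(chooseEncoding("\u65e5\u672c\u8a9e"));     // Japanese text
    }
}
```

A document that mixes both kinds of text with one non-Type 0 font would take both branches, which is how two versions of the font can end up in the output.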

The Interface requires all fonts created for setting text in glyph IDs or Unicode to be embedded and subset.  This is done to ensure reliable PDF file processing across different systems.

You can embed a font in a PDF file as you create it, so that the font travels with the file. Another user on a different platform can then open the file and see it exactly as you do. You may prefer to embed only a subset of the font, that is, only the characters actually used in the text. Subsetting reduces the size of the PDF file considerably, and it is practical for files that you do not expect readers to edit in Adobe Acrobat.

The Java and .NET Interface may automatically create a Unicode font for the document, with the embedding flag turned on, so write your program to call Document.EmbedFonts before saving the PDF file. Remember that even if a TrueType font is created without embed flags, the Interface must turn embedding on if the font is used for Unicode output.

When saving the PDF file, the Interface enumerates all the fonts in the document.  It also makes sure to create any required ToUnicode and Widths tables.

Object encoding

A PDFString object is stored within a PDF file encoded either in UTF-16BE or in PDFDocEncoding.  PDFName objects use UTF-8 encoding.

UTF-16BE (16-bit Unicode Transformation Format, big-endian) is a Unicode character encoding that maps each Unicode code point to one or two 16-bit code units. Code points in the Basic Multilingual Plane take a single code unit (two bytes); all other code points take a surrogate pair (four bytes). UTF-16BE serializes these code units into a byte stream, most significant byte first, so that the characters can be stored or distributed. A reader divides the stream into blocks of two bytes and converts each block to a 16-bit integer. PDFDocEncoding is an 8-bit encoding scheme that can encode all of ISO Latin-1. UTF-8 is a variable-width encoding that can also represent every Unicode character. UTF-8 was introduced in January of 1993, UTF-16 in July of 1996.
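To make the byte layouts concrete, the following sketch uses only the standard Java charset classes to print the UTF-16BE and UTF-8 bytes of a string. The class and helper names are hypothetical and are not part of the Interface.

```java
import java.nio.charset.StandardCharsets;

final class Utf16Demo {
    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf16be(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    static String utf8(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(utf16be("A"));             // 00 41
        System.out.println(utf16be("\u20AC"));        // 20 AC (euro sign, one code unit)
        System.out.println(utf16be("\uD83D\uDE00"));  // D8 3D DE 00 (U+1F600, surrogate pair)
        System.out.println(utf8("\u20AC"));           // E2 82 AC
    }
}
```

The third line shows why UTF-16 is not strictly two bytes per character: a code point outside the Basic Multilingual Plane occupies four bytes.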

See “Text String Type” on page 86 of the ISO 32000-1:2008, Document Management-Portable Document Format-Part 1: PDF 1.7 for more details.

PDFString objects can be created with an input string in either encoding. Most PDFString constructors will attempt to convert the data string to PDFDocEncoding if it can be converted without a loss of information.  Otherwise the PDFString constructor will save the data string in UTF-16BE. If you want the string to be encoded in UTF-16BE even if it could be converted to PDFDocEncoding, use a PDFString constructor that takes the storedAsUTF16 argument, with that flag set to True.
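The selection rule above can be sketched in plain Java. This is an illustration under stated assumptions, not the PDFString implementation: the class and method names are hypothetical, the PDFDocEncoding test covers only a conservative subset in which the byte value equals the code point (printable ASCII and U+00A1 to U+00FF except U+00AD), and the FE FF prefix is the byte order marker that ISO 32000-1 requires on UTF-16BE text strings.

```java
import java.nio.charset.StandardCharsets;

final class TextStringEncoder {
    /** True if every character falls in a subset where PDFDocEncoding matches Latin-1. */
    static boolean fitsSimpleSubset(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean ascii = c >= 0x20 && c <= 0x7E;
            boolean latin = c >= 0xA1 && c <= 0xFF && c != 0xAD;
            if (!ascii && !latin) {
                return false;
            }
        }
        return true;
    }

    /** Encodes as single bytes when lossless and not forced; otherwise as UTF-16BE with a BOM. */
    static byte[] encode(String s, boolean storedAsUTF16) {
        if (!storedAsUTF16 && fitsSimpleSubset(s)) {
            return s.getBytes(StandardCharsets.ISO_8859_1);
        }
        byte[] body = s.getBytes(StandardCharsets.UTF_16BE);
        byte[] out = new byte[body.length + 2];
        out[0] = (byte) 0xFE; // byte order marker, first byte
        out[1] = (byte) 0xFF; // byte order marker, second byte
        System.arraycopy(body, 0, out, 2, body.length);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(encode("Hi", false).length);      // single-byte form
        System.out.println(encode("Hi", true).length);       // forced UTF-16BE form
        System.out.println(encode("\u65e5", false).length);  // not representable, so UTF-16BE
    }
}
```

A full implementation would need the complete PDFDocEncoding table, which assigns code points such as the euro sign and typographic quotes to byte values that differ from Latin-1.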

PDFString objects offer two properties for accessing their contents:

  • The Value property converts the string from its internal encoding to a Unicode string object appropriate to the platform (encoded as UTF-16 for both .NET and Java). This is useful when the PDFString object represents human-readable text.
  • The Bytes property returns the contents of the PDFString object as they are stored in the file, without respect to an encoding. This is useful for cases where the PDFString object contains binary data.
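The difference between the two views can be sketched in plain Java. The class and method names below are hypothetical and only loosely model the Value and Bytes properties: the decoder treats a leading FE FF byte order marker as the signal for UTF-16BE, as ISO 32000-1 specifies for text strings, and uses ISO-8859-1 as a stand-in for PDFDocEncoding, which is exact only for printable ASCII and most of the range above 0xA0.

```java
import java.nio.charset.StandardCharsets;

final class TextStringDecoder {
    /** Decoded view: interprets the stored bytes as text. */
    static String value(byte[] stored) {
        boolean hasBom = stored.length >= 2
                && (stored[0] & 0xFF) == 0xFE
                && (stored[1] & 0xFF) == 0xFF;
        if (hasBom) {
            // Skip the two-byte marker and decode the rest as UTF-16BE.
            return new String(stored, 2, stored.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 approximates PDFDocEncoding here.
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    /** Raw view: returns a copy of the stored bytes with no interpretation. */
    static byte[] bytes(byte[] stored) {
        return stored.clone();
    }

    public static void main(String[] args) {
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x41};
        byte[] simple = {0x48, 0x69};
        System.out.println(value(utf16));         // decoded text
        System.out.println(value(simple));        // decoded text
        System.out.println(bytes(utf16).length);  // raw length, marker included
    }
}
```

For binary data, the raw view is the safe choice, because a payload that happens to begin with FE FF would otherwise be decoded as text.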