debian-pdfrw/pdfrw/objects/pdfstring.py

# A part of pdfrw (https://github.com/pmaupin/pdfrw)
# Copyright (C) 2006-2017 Patrick Maupin, Austin, Texas
#                    2016 James Laird-Wah, Sydney, Australia
# MIT license -- See LICENSE.txt for details

"""

================================
PdfString encoding and decoding
================================

Introduction
=============


This module handles encoding and decoding of PDF strings.  PDF strings
are described in the PDF 1.7 reference manual, mostly in chapter 3
(sections 3.2 and 3.8) and chapter 5.

PDF strings are used in the document structure itself, and also inside
the stream of page contents dictionaries.

A PDF string can represent pure binary data (e.g. for a font or an
image), or text, or glyph indices.  For Western fonts, the glyph indices
usually correspond to ASCII, but that is not guaranteed.  (When it does
happen, it makes examination of raw PDF data a lot easier.)

The specification defines PDF string encoding at two different levels.
At the bottom, it defines ways to encode arbitrary bytes so that a PDF
tokenizer can understand they are a string of some sort, and can figure
out where the string begins and ends.  (That is all the tokenizer itself
cares about.)  Above that level, if the string represents text, the
specification defines ways to encode Unicode text into raw bytes, before
the byte encoding is performed.

There are two ways to do the byte encoding, and two ways to do the text
(Unicode) encoding.

Encoding bytes into PDF strings
================================

Adobe calls the two ways to encode bytes into strings "Literal strings"
and "Hexadecimal strings."

Literal strings
------------------

A literal string is delimited by ASCII parentheses ("(" and ")"), and a
hexadecimal string is delimited by ASCII less-than and greater-than
signs ("<" and ">").

A literal string may encode bytes almost unmolested.  The caveat is
that if a byte has the same value as a parenthesis, it must be escaped
so that the tokenizer knows the string is not finished.  This is accomplished
by using the ASCII backslash ("\") as an escape character.  Of course,
now any backslash appearing in the data must likewise be escaped.

Hexadecimal strings
---------------------

A hexadecimal string requires twice as much space as the source data
it represents (plus two bytes for the delimiter), simply storing each
byte as two hexadecimal digits, most significant digit first.  The spec
allows for lower or upper case hex digits, but most PDF encoders seem
to use upper case.

Special cases -- Legacy systems and readability
-----------------------------------------------

It is possible to create a PDF document that uses 7 bit ASCII encoding,
and it is desirable in many cases to create PDFs that are reasonably
readable when opened in a text editor.  For these reasons, the syntax
for both literal strings and hexadecimal strings is slightly more
complicated that the initial description above.  In general, the additional
syntax allows the following features:

  - Making the delineation between characters, or between sections of
    a string, apparent, and easy to see in an editor.
  - Keeping output lines from getting too wide for some editors
  - Keeping output lines from being so narrow that you can only see the
    small fraction of a string at a time in an editor.
  - Suppressing unprintable characters
  - Restricting the output string to 7 bit ASCII

Hexadecimal readability
~~~~~~~~~~~~~~~~~~~~~~~

For hexadecimal strings, only the first two bullets are relevant.  The syntax
to accomplish this is simple, allowing any ASCII whitespace to be inserted
anywhere in the encoded hex string.

Literal readability
~~~~~~~~~~~~~~~~~~~

For literal strings, all of the bullets except the first are relevant.
The syntax has two methods to help with these goals.  The first method
is to overload the escape operator to be able to do different functions,
and the second method can reduce the number of escapes required for
parentheses in the normal case.

The escape function works differently, depending on what byte follows
the backslash.  In all cases, the escaping backslash is discarded,
and then the next character is examined:

  - For parentheses and backslashes (and, in fact, for all characters
    not described otherwise in this list), the character after the
    backslash is preserved in the output.
  - A letter from the set of "nrtbf" following a backslash is interpreted as
    a line feed, carriage return, tab, backspace, or form-feed, respectively.
  - One to three octal digits following the backslash indicate the
    numeric value of the encoded byte.
  - A carriage return, carriage return/line feed, or line feed following
    the backslash indicates a line break that was put in for readability,
    and that is not part of the actual data, so this is discarded.

The second method that can be used to improve readability (and reduce space)
in literal strings is to not escape parentheses.  This only works, and is
only allowed, when the parentheses are properly balanced.  For example,
"((Hello))" is a valid encoding for a literal string, but "((Hello)" is not;
the latter case should be encoded "(\(Hello)"

Encoding text into strings
==========================

Section 3.8.1 of the PDF specification describes text strings.

The individual characters of a text string can all be considered to
be Unicode; Adobe specifies two different ways to encode these characters
into a string of bytes before further encoding the byte string as a
literal string or a hexadecimal string.

The first way to encode these strings is called PDFDocEncoding.  This
is mostly a one-for-one mapping of bytes into single bytes, similar to
Latin-1.  The representable character set is limited to the number of
characters that can fit in a byte, and this encoding cannot be used
with Unicode strings that start with the two characters making up the
UTF-16-BE BOM.

The second way to encode these strings is with UTF-16-BE.  Text strings
encoded with this method must start with the BOM, and although the spec
does not appear to mandate that the resultant bytes be encoded into a
hexadecimal string, that seems to be the canonical way to do it.

When encoding a string into UTF-16-BE, this module always adds the BOM,
and when decoding a string from UTF-16-BE, this module always strips
the BOM.  If a source string contains a BOM, that will remain in the
final string after a round-trip through the encoder and decoder, as
the goal of the encoding/decoding process is transparency.


PDF string handling in pdfrw
=============================

Responsibility for handling PDF strings in the pdfrw library is shared
between this module, the tokenizer, and the pdfwriter.

tokenizer string handling
--------------------------

As far as the tokenizer and its clients such as the pdfreader are concerned,
the PdfString class must simply be something that it can instantiate by
passing a string, that doesn't compare equal (or throw an exception when
compared) to other possible token strings.  The tokenizer must understand
enough about the syntax of the string to successfully find its beginning
and end in a stream of tokens, but doesn't otherwise know or care about
the data represented by the string.

pdfwriter string handling
--------------------------

The pdfwriter knows and cares about two attributes of PdfString instances:

  - First, PdfString objects have an 'indirect' attribute, which pdfwriter
    uses as an indication that the object knows how to represent itself
    correctly when output to a new PDF.  (In the case of a PdfString object,
    no work is really required, because it is already a string.)
  - Second, the PdfString.encode() method is used as a convenience to
    automatically convert any user-supplied strings (that didn't come
    from PDFs) when a PDF is written out to a file.

pdfstring handling
-------------------

The code in this module is designed to support those uses by the
tokenizer and the pdfwriter, and to additionally support encoding
and decoding of PdfString objects as a convenience for the user.

Most users of the pdfrw library never encode or decode a PdfString,
so it is imperative that (a) merely importing this module does not
take a significant amount of CPU time; and (b) it is cheap for the
tokenizer to produce a PdfString, and cheap for the pdfwriter to
consume a PdfString -- if the tokenizer finds a string that conforms
to the PDF specification, it will be wrapped in a PdfString object,
and if the pdfwriter finds an object with an indirect attribute, it
simply calls str() to ask it to format itself.

Encoding and decoding are not actually performed very often at all,
compared to how often tokenization and then subsequent concatenation
by the pdfwriter are performed.  In fact, versions of pdfrw prior to
0.4 did not even support Unicode for this function.  Encoding and
decoding can also easily be performed by the user, outside of the
library, and this might still be recommended, at least for encoding,
if the visual appeal of encodings generated by this module is found
lacking.


Decoding strings
~~~~~~~~~~~~~~~~~~~

Decoding strings can be tricky, but is a bounded process.  Each
properly-encoded encoded string represents exactly one output string,
with the caveat that is up to the caller of the function to know whether
he expects a Unicode string, or just bytes.

The caller can call PdfString.to_bytes() to get a byte string (which may
or may not represent encoded Unicode), or may call PdfString.to_unicode()
to get a Unicode string.  Byte strings will be regular strings in Python 2,
and b'' bytes in Python 3; Unicode strings will be regular strings in
Python 3, and u'' unicode strings in Python 2.

To maintain application compatibility with earlier versions of pdfrw,
PdfString.decode() is an alias for PdfString.to_unicode().

Encoding strings
~~~~~~~~~~~~~~~~~~

PdfString has three factory functions that will encode strings into
PdfString objects:

  -  PdfString.from_bytes() accepts a byte string (regular string in Python 2
     or b'' bytes string in Python 3) and returns a PdfString object.
  -  PdfString.from_unicode() accepts a Unicode string (u'' Unicode string in
     Python 2 or regular string in Python 3) and returns a PdfString object.
  -  PdfString.encode() examines the type of object passed, and either
     calls from_bytes() or from_unicode() to do the real work.

Unlike decoding(), encoding is not (mathematically) a function.
There are (literally) an infinite number of ways to encode any given
source string.  (Of course, most of them would be stupid, unless
the intent is some sort of denial-of-service attack.)

So encoding strings is either simpler than decoding, or can be made to
be an open-ended science fair project (to create the best looking
encoded strings).

There are parameters to the encoding functions that allow control over
the final encoded string, but the intention is to make the default values
produce a reasonable encoding.

As mentioned previously, if encoding does not do what a particular
user needs, that user is free to write his own encoder, and then
simply instantiate a PdfString object by passing a string to the
default constructor, the same way that the tokenizer does it.

However, if desirable, encoding may gradually become more capable
over time, adding the ability to generate more aesthetically pleasing
encoded strings.

PDFDocString encoding and decoding
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To handle this encoding in a fairly standard way, this module registers
an encoder and decoder for PDFDocEncoding with the codecs module.

"""

import re
import codecs
import binascii
import itertools
from ..py23_diffs import convert_load, convert_store

def find_pdfdocencoding(encoding):
    """ This function conforms to the codec module registration
        protocol.  It defers calculating data structures until
        a pdfdocencoding encode or decode is required.

        PDFDocEncoding is described in the PDF 1.7 reference manual.
    """

    if encoding != 'pdfdocencoding':
        return

    # Create the decoding map based on the table in section D.2 of the
    # PDF 1.7 manual

    # Start off with the characters with 1:1 correspondence
    decoding_map = set(range(0x20, 0x7F)) | set(range(0xA1, 0x100))
    decoding_map.update((0x09, 0x0A, 0x0D))
    decoding_map.remove(0xAD)
    decoding_map = dict((x, x) for x in decoding_map)

    # Add in the special Unicode characters
    decoding_map.update(zip(range(0x18, 0x20), (
            0x02D8, 0x02C7, 0x02C6, 0x02D9, 0x02DD, 0x02DB, 0x02DA, 0x02DC)))
    decoding_map.update(zip(range(0x80, 0x9F), (
            0x2022, 0x2020, 0x2021, 0x2026, 0x2014, 0x2013, 0x0192, 0x2044,
            0x2039, 0x203A, 0x2212, 0x2030, 0x201E, 0x201C, 0x201D, 0x2018,
            0x2019, 0x201A, 0x2122, 0xFB01, 0xFB02, 0x0141, 0x0152, 0x0160,
            0x0178, 0x017D, 0x0131, 0x0142, 0x0153, 0x0161, 0x017E)))
    decoding_map[0xA0] = 0x20AC

    # Make the encoding map from the decoding map
    encoding_map = codecs.make_encoding_map(decoding_map)

    # Not every PDF producer follows the spec, so conform to Postel's law
    # and interpret encoded strings if at all possible.  In particular, they
    # might have nulls and form-feeds, judging by random code snippets
    # floating around the internet.
    decoding_map.update(((x, x) for x in range(0x18)))

    def encode(input, errors='strict'):
        return codecs.charmap_encode(input, errors, encoding_map)

    def decode(input, errors='strict'):
        return codecs.charmap_decode(input, errors, decoding_map)

    return codecs.CodecInfo(encode, decode, name='pdfdocencoding')

codecs.register(find_pdfdocencoding)

class PdfString(str):
    """ A PdfString is an encoded string.  It has a decode
        method to get the actual string data out, and there
        is an encode class method to create such a string.
        Like any PDF object, it could be indirect, but it
        defaults to being a direct object.
    """
    indirect = False


    # The byte order mark, and unicode that could be
    # wrongly encoded into the byte order mark by the
    # pdfdocencoding codec.

    bytes_bom = codecs.BOM_UTF16_BE
    bad_pdfdoc_prefix = bytes_bom.decode('latin-1')

    # Used by decode_literal; filled in on first use

    unescape_dict = None
    unescape_func = None

    @classmethod
    def init_unescapes(cls):
        """ Sets up the unescape attributes for decode_literal
        """
        unescape_pattern = r'\\([0-7]{1,3}|\r\n|.)'
        unescape_func = re.compile(unescape_pattern, re.DOTALL).split
        cls.unescape_func = unescape_func

        unescape_dict = dict(((chr(x), chr(x)) for x in range(0x100)))
        unescape_dict.update(zip('nrtbf', '\n\r\t\b\f'))
        unescape_dict['\r'] = ''
        unescape_dict['\n'] = ''
        unescape_dict['\r\n'] = ''
        for i in range(0o10):
            unescape_dict['%01o' % i] = chr(i)
        for i in range(0o100):
            unescape_dict['%02o' % i] = chr(i)
        for i in range(0o400):
            unescape_dict['%03o' % i] = chr(i)
        cls.unescape_dict = unescape_dict
        return unescape_func

    def decode_literal(self):
        """ Decode a PDF literal string, which is enclosed in parentheses ()

            Many pdfrw users never decode strings, so defer creating
            data structures to do so until the first string is decoded.

            Possible string escapes from the spec:
            (PDF 1.7 Reference, section 3.2.3, page 53)

                1. \[nrtbf\()]: simple escapes
                2. \\d{1,3}: octal. Must be zero-padded to 3 digits
                    if followed by digit
                3. \<end of line>: line continuation. We don't know the EOL
                    marker used in the PDF, so accept \r, \n, and \r\n.
                4. Any other character following \ escape -- the backslash
                    is swallowed.
        """
        result = (self.unescape_func or self.init_unescapes())(self[1:-1])
        if len(result) == 1:
            return convert_store(result[0])
        unescape_dict = self.unescape_dict
        result[1::2] = [unescape_dict[x] for x in result[1::2]]
        return convert_store(''.join(result))


    def decode_hex(self):
        """ Decode a PDF hexadecimal-encoded string, which is enclosed
            in angle brackets <>.
        """
        hexstr = convert_store(''.join(self[1:-1].split()))
        if len(hexstr) % 1: # odd number of chars indicates a truncated 0
            hexstr += '0'
        return binascii.unhexlify(hexstr)


    def to_bytes(self):
        """ Decode a PDF string to bytes.  This is a convenience function
            for user code, in that (as of pdfrw 0.3) it is never
            actually used inside pdfrw.
        """
        if self.startswith('(') and self.endswith(')'):
            return self.decode_literal()

        elif self.startswith('<') and self.endswith('>'):
            return self.decode_hex()

        else:
            raise ValueError('Invalid PDF string "%s"' % repr(self))

    def to_unicode(self):
        """ Decode a PDF string to a unicode string.  This is a
            convenience function for user code, in that (as of
            pdfrw 0.3) it is never actually used inside pdfrw.

            There are two Unicode storage methods used -- either
            UTF16_BE, or something called PDFDocEncoding, which
            is defined in the PDF spec.  The determination of
            which decoding method to use is done by examining the
            first two bytes for the byte order marker.
        """
        raw = self.to_bytes()

        if raw[:2] == self.bytes_bom:
            return raw[2:].decode('utf-16-be')
        else:
            return raw.decode('pdfdocencoding')

    # Legacy-compatible interface
    decode = to_unicode

    # Internal value used by encoding

    escape_splitter = None  # Calculated on first use

    @classmethod
    def init_escapes(cls):
        """ Initialize the escape_splitter for the encode method
        """
        cls.escape_splitter = re.compile(br'(\(|\\|\))').split
        return cls.escape_splitter

    @classmethod
    def from_bytes(cls, raw, bytes_encoding='auto'):
        """ The from_bytes() constructor is called to encode a source raw
            byte string into a PdfString that is suitable for inclusion
            in a PDF.

            NOTE:  There is no magic in the encoding process.  A user
            can certainly do his own encoding, and simply initialize a
            PdfString() instance with his encoded string.  That may be
            useful, for example, to add line breaks to make it easier
            to load PDFs into editors, or to not bother to escape balanced
            parentheses, or to escape additional characters to make a PDF
            more readable in a file editor.  Those are features not
            currently supported by this method.

            from_bytes() can use a heuristic to figure out the best
            encoding for the string, or the user can control the process
            by changing the bytes_encoding parameter to 'literal' or 'hex'
            to force a particular conversion method.
        """

        # If hexadecimal is not being forced, then figure out how long
        # the escaped literal string will be, and fall back to hex if
        # it is too long.

        force_hex = bytes_encoding == 'hex'
        if not force_hex:
            if bytes_encoding not in ('literal', 'auto'):
                raise ValueError('Invalid bytes_encoding value: %s'
                                 % bytes_encoding)
            splitlist = (cls.escape_splitter or cls.init_escapes())(raw)
            if bytes_encoding == 'auto' and len(splitlist) // 2 >= len(raw):
                force_hex = True

        if force_hex:
            # The spec does not mandate uppercase,
            # but it seems to be the convention.
            fmt = '<%s>'
            result = binascii.hexlify(raw).upper()
        else:
            fmt = '(%s)'
            splitlist[1::2] = [(b'\\' + x) for x in splitlist[1::2]]
            result = b''.join(splitlist)

        return cls(fmt % convert_load(result))

    @classmethod
    def from_unicode(cls, source, text_encoding='auto',
                     bytes_encoding='auto'):
        """ The from_unicode() constructor is called to encode a source
            string into a PdfString that is suitable for inclusion in a PDF.

            NOTE:  There is no magic in the encoding process.  A user
            can certainly do his own encoding, and simply initialize a
            PdfString() instance with his encoded string.  That may be
            useful, for example, to add line breaks to make it easier
            to load PDFs into editors, or to not bother to escape balanced
            parentheses, or to escape additional characters to make a PDF
            more readable in a file editor.  Those are features not
            supported by this method.

            from_unicode() can use a heuristic to figure out the best
            encoding for the string, or the user can control the process
            by changing the text_encoding parameter to 'pdfdocencoding'
            or 'utf16', and/or by changing the bytes_encoding parameter
            to 'literal' or 'hex' to force particular conversion methods.

            The function will raise an exception if it cannot perform
            the conversion as requested by the user.
        """

        # Give preference to pdfdocencoding, since it only
        # requires one raw byte per character, rather than two.
        if text_encoding != 'utf16':
            force_pdfdoc = text_encoding == 'pdfdocencoding'
            if text_encoding != 'auto' and not force_pdfdoc:
                raise ValueError('Invalid text_encoding value: %s'
                                 % text_encoding)

            if source.startswith(cls.bad_pdfdoc_prefix):
                if force_pdfdoc:
                    raise UnicodeError('Prefix of string %r cannot be encoded '
                                       'in pdfdocencoding' % source[:20])
            else:
                try:
                    raw = source.encode('pdfdocencoding')
                except UnicodeError:
                    if force_pdfdoc:
                        raise
                else:
                    return cls.from_bytes(raw, bytes_encoding)

        # If the user is not forcing literal strings,
        # it makes much more sense to use hexadecimal with 2-byte chars
        raw = cls.bytes_bom + source.encode('utf-16-be')
        encoding = 'hex' if bytes_encoding == 'auto' else bytes_encoding
        return cls.from_bytes(raw, encoding)

    @classmethod
    def encode(cls, source, uni_type = type(u''), isinstance=isinstance):
        """ The encode() constructor is a legacy function that is
            also a convenience for the PdfWriter.
        """
        if isinstance(source, uni_type):
            return cls.from_unicode(source)
        else:
            return cls.from_bytes(source)