First commit

Original PyPDF code. Updates should be coming from Noah soon.
This commit is contained in:
Adam Coleman 2011-12-30 09:04:56 -06:00
commit c59a212a4c
13 changed files with 4511 additions and 0 deletions

2
.gitignore vendored Normal file
View File

@ -0,0 +1,2 @@
*.pyc
*.swp

205
CHANGELOG Normal file
View File

@ -0,0 +1,205 @@
Version 1.12, 2008-09-02
------------------------
- Added support for XMP metadata.
- Fix reading files with xref streams with multiple /Index values.
- Fix extracting content streams that use graphics operators longer than 2
characters. Affects merging PDF files.
Version 1.11, 2008-05-09
------------------------
- Patch from Hartmut Goebel to permit RectangleObjects to accept NumberObject
or FloatObject values.
- PDF compatibility fixes.
- Fix to read object xref stream in correct order.
- Fix for comments inside content streams.
Version 1.10, 2007-10-04
------------------------
- Text strings from PDF files are returned as Unicode string objects when
pyPdf determines that they can be decoded (as UTF-16 strings, or as
PDFDocEncoding strings). Unicode objects are also written out when
necessary. This means that string objects in pyPdf can be either
generic.ByteStringObject instances, or generic.TextStringObject instances.
- The extractText method now returns a unicode string object.
- All document information properties now return unicode string objects. In
the event that a document provides docinfo properties that are not decoded by
pyPdf, the raw byte strings can be accessed with an "_raw" property (ie.
title_raw rather than title)
- generic.DictionaryObject instances have been enhanced to be easier to use.
Values coming out of dictionary objects will automatically be de-referenced
(.getObject will be called on them), unless accessed by the new "raw_get"
method. DictionaryObjects can now only contain PdfObject instances (as keys
and values), making it easier to debug where non-PdfObject values (which
cannot be written out) are entering dictionaries.
- Support for reading named destinations and outlines in PDF files. Original
patch by Ashish Kulkarni.
- Stream compatibility reading enhancements for malformed PDF files.
- Cross reference table reading enhancements for malformed PDF files.
- Encryption documentation.
- Replace some "assert" statements with error raising.
- Minor optimizations to FlateDecode algorithm increase speed when using PNG
predictors.
Version 1.9, 2006-12-15
-----------------------
- Fix several serious bugs introduced in version 1.8, caused by a failure to
run through our PDF test suite before releasing that version.
- Fix bug in NullObject reading and writing.
Version 1.8, 2006-12-14
-----------------------
- Add support for decryption with the standard PDF security handler. This
allows for decrypting PDF files given the proper user or owner password.
- Add support for encryption with the standard PDF security handler.
- Add new pythondoc documentation.
- Fix bug in ASCII85 decode that occurs when whitespace exists inside the
two terminating characters of the stream.
Version 1.7, 2006-12-10
-----------------------
- Fix a bug when using a single page object in two PdfFileWriter objects.
- Adjust PyPDF to be tolerant of whitespace characters that don't belong
during a stream object.
- Add documentInfo property to PdfFileReader.
- Add numPages property to PdfFileReader.
- Add pages property to PdfFileReader.
- Add extractText function to PdfFileReader.
Version 1.6, 2006-06-06
-----------------------
- Add basic support for comments in PDF files. This allows us to read some
ReportLab PDFs that could not be read before.
- Add "auto-repair" for finding xref table at slightly bad locations.
- New StreamObject backend, cleaner and more powerful. Allows the use of
stream filters more easily, including compressed streams.
- Add a graphics state push/pop around page merges. Improves quality of
page merges when one page's content stream leaves the graphics
in an abnormal state.
- Add PageObject.compressContentStreams function, which filters all content
streams and compresses them. This will reduce the size of PDF pages,
especially after they could have been decompressed in a mergePage
operation.
- Support inline images in PDF content streams.
- Add support for using .NET framework compression when zlib is not
available. This does not make pyPdf compatible with IronPython, but it
is a first step.
- Add support for reading the document information dictionary, and extracting
title, author, subject, producer and creator tags.
- Add patch to support NullObject and multiple xref streams, from Bradley
Lawrence.
Version 1.5, 2006-01-28
-----------------------
- Fix a bug where merging pages did not work in "no-rename" cases when the
second page has an array of content streams.
- Remove some debugging output that should not have been present.
Version 1.4, 2006-01-27
-----------------------
- Add capability to merge pages from multiple PDF files into a single page
using the PageObject.mergePage function. See example code (README or web
site) for more information.
- Add ability to modify a page's MediaBox, CropBox, BleedBox, TrimBox, and
ArtBox properties through PageObject. See example code (README or web site)
for more information.
- Refactor pdf.py into multiple files: generic.py (contains objects like
NameObject, DictionaryObject), filters.py (contains filter code),
utils.py (various). This does not affect importing PdfFileReader
or PdfFileWriter.
- Add new decoding functions for standard PDF filters ASCIIHexDecode and
ASCII85Decode.
- Change url and download_url to refer to new pybrary.net web site.
Version 1.3, 2006-01-23
-----------------------
- Fix new bug introduced in 1.2 where PDF files with \r line endings did not
work properly anymore. A new test suite developed with various PDF files
should prevent regression bugs from now on.
- Fix a bug where inheriting attributes from page nodes did not work.
Version 1.2, 2006-01-23
-----------------------
- Improved support for files with CRLF-based line endings, fixing a common
reported problem stating "assertion error: assert line == "%%EOF"".
- Software author/maintainer is now officially a proud married person, which
is sure to result in better software... somehow.
Version 1.1, 2006-01-18
-----------------------
- Add capability to rotate pages.
- Improved PDF reading support to properly manage inherited attributes from
/Type=/Pages nodes. This means that page groups that are rotated or have
different media boxes or whatever will now work properly.
- Added PDF 1.5 support. Namely cross-reference streams and object streams.
This release can mangle Adobe's PDFReference16.pdf successfully.
Version 1.0, 2006-01-17
-----------------------
- First distutils-capable true public release. Supports a wide variety of PDF
files that I found sitting around on my system.
- Does not support some PDF 1.5 features, such as object streams,
cross-reference streams.

28
LICENSE Normal file
View File

@ -0,0 +1,28 @@
Copyright (c) 2006-2008, Mathieu Fenniak
Some contributions copyright (c) 2007, Ashish Kulkarni <kulkarni.ashish@gmail.com>
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
met:
* Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice,
this list of conditions and the following disclaimer in the documentation
and/or other materials provided with the distribution.
* The name of the author may not be used to endorse or promote products
derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

1
MANIFEST.in Normal file
View File

@ -0,0 +1 @@
include CHANGELOG

4
PyPDF2/__init__.py Normal file
View File

@ -0,0 +1,4 @@
from pdf import PdfFileReader, PdfFileWriter
from merger import PdfFileMerger
__all__ = ["pdf", "PdfFileMerger"]

252
PyPDF2/filters.py Normal file
View File

@ -0,0 +1,252 @@
# vim: sw=4:expandtab:foldmethod=marker
#
# Copyright (c) 2006, Mathieu Fenniak
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
#
# * Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
# * The name of the author may not be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
"""
Implementation of stream filters for PDF.
"""
__author__ = "Mathieu Fenniak"
__author_email__ = "biziqe@mathieu.fenniak.net"
from utils import PdfReadError
try:
from cStringIO import StringIO
except ImportError:
from StringIO import StringIO
try:
import zlib
def decompress(data):
return zlib.decompress(data)
def compress(data):
return zlib.compress(data)
except ImportError:
# Unable to import zlib. Attempt to use the System.IO.Compression
# library from the .NET framework. (IronPython only)
import System
from System import IO, Collections, Array
def _string_to_bytearr(buf):
retval = Array.CreateInstance(System.Byte, len(buf))
for i in range(len(buf)):
retval[i] = ord(buf[i])
return retval
def _bytearr_to_string(bytes):
retval = ""
for i in range(bytes.Length):
retval += chr(bytes[i])
return retval
def _read_bytes(stream):
ms = IO.MemoryStream()
buf = Array.CreateInstance(System.Byte, 2048)
while True:
bytes = stream.Read(buf, 0, buf.Length)
if bytes == 0:
break
else:
ms.Write(buf, 0, bytes)
retval = ms.ToArray()
ms.Close()
return retval
def decompress(data):
bytes = _string_to_bytearr(data)
ms = IO.MemoryStream()
ms.Write(bytes, 0, bytes.Length)
ms.Position = 0 # fseek 0
gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Decompress)
bytes = _read_bytes(gz)
retval = _bytearr_to_string(bytes)
gz.Close()
return retval
def compress(data):
bytes = _string_to_bytearr(data)
ms = IO.MemoryStream()
gz = IO.Compression.DeflateStream(ms, IO.Compression.CompressionMode.Compress, True)
gz.Write(bytes, 0, bytes.Length)
gz.Close()
ms.Position = 0 # fseek 0
bytes = ms.ToArray()
retval = _bytearr_to_string(bytes)
ms.Close()
return retval
class FlateDecode(object):
def decode(data, decodeParms):
data = decompress(data)
predictor = 1
if decodeParms:
predictor = decodeParms.get("/Predictor", 1)
# predictor 1 == no predictor
if predictor != 1:
columns = decodeParms["/Columns"]
# PNG prediction:
if predictor >= 10 and predictor <= 15:
output = StringIO()
# PNG prediction can vary from row to row
rowlength = columns + 1
assert len(data) % rowlength == 0
prev_rowdata = (0,) * rowlength
for row in xrange(len(data) / rowlength):
rowdata = [ord(x) for x in data[(row*rowlength):((row+1)*rowlength)]]
filterByte = rowdata[0]
if filterByte == 0:
pass
elif filterByte == 1:
for i in range(2, rowlength):
rowdata[i] = (rowdata[i] + rowdata[i-1]) % 256
elif filterByte == 2:
for i in range(1, rowlength):
rowdata[i] = (rowdata[i] + prev_rowdata[i]) % 256
else:
# unsupported PNG filter
raise PdfReadError("Unsupported PNG filter %r" % filterByte)
prev_rowdata = rowdata
output.write(''.join([chr(x) for x in rowdata[1:]]))
data = output.getvalue()
else:
# unsupported predictor
raise PdfReadError("Unsupported flatedecode predictor %r" % predictor)
return data
decode = staticmethod(decode)
def encode(data):
return compress(data)
encode = staticmethod(encode)
class ASCIIHexDecode(object):
def decode(data, decodeParms=None):
retval = ""
char = ""
x = 0
while True:
c = data[x]
if c == ">":
break
elif c.isspace():
x += 1
continue
char += c
if len(char) == 2:
retval += chr(int(char, base=16))
char = ""
x += 1
assert char == ""
return retval
decode = staticmethod(decode)
class ASCII85Decode(object):
def decode(data, decodeParms=None):
retval = ""
group = []
x = 0
hitEod = False
# remove all whitespace from data
data = [y for y in data if not (y in ' \n\r\t')]
while not hitEod:
c = data[x]
if len(retval) == 0 and c == "<" and data[x+1] == "~":
x += 2
continue
#elif c.isspace():
# x += 1
# continue
elif c == 'z':
assert len(group) == 0
retval += '\x00\x00\x00\x00'
continue
elif c == "~" and data[x+1] == ">":
if len(group) != 0:
# cannot have a final group of just 1 char
assert len(group) > 1
cnt = len(group) - 1
group += [ 85, 85, 85 ]
hitEod = cnt
else:
break
else:
c = ord(c) - 33
assert c >= 0 and c < 85
group += [ c ]
if len(group) >= 5:
b = group[0] * (85**4) + \
group[1] * (85**3) + \
group[2] * (85**2) + \
group[3] * 85 + \
group[4]
assert b < (2**32 - 1)
c4 = chr((b >> 0) % 256)
c3 = chr((b >> 8) % 256)
c2 = chr((b >> 16) % 256)
c1 = chr(b >> 24)
retval += (c1 + c2 + c3 + c4)
if hitEod:
retval = retval[:-4+hitEod]
group = []
x += 1
return retval
decode = staticmethod(decode)
def decodeStreamData(stream):
from generic import NameObject
filters = stream.get("/Filter", ())
if len(filters) and not isinstance(filters[0], NameObject):
# we have a single filter instance
filters = (filters,)
data = stream._data
for filterType in filters:
if filterType == "/FlateDecode":
data = FlateDecode.decode(data, stream.get("/DecodeParms"))
elif filterType == "/ASCIIHexDecode":
data = ASCIIHexDecode.decode(data)
elif filterType == "/ASCII85Decode":
data = ASCII85Decode.decode(data)
elif filterType == "/Crypt":
decodeParams = stream.get("/DecodeParams", {})
if "/Name" not in decodeParams and "/Type" not in decodeParams:
pass
else:
raise NotImplementedError("/Crypt filter with /Name or /Type not supported yet")
else:
# unsupported filter
raise NotImplementedError("unsupported filter %s" % filterType)
return data
if __name__ == "__main__":
assert "abc" == ASCIIHexDecode.decode('61\n626\n3>')
ascii85Test = """
<~9jqo^BlbD-BleB1DJ+*+F(f,q/0JhKF<GL>Cj@.4Gp$d7F!,L7@<6@)/0JDEF<G%<+EV:2F!,
O<DJ+*.@<*K0@<6L(Df-\\0Ec5e;DffZ(EZee.Bl.9pF"AGXBPCsi+DGm>@3BB/F*&OCAfu2/AKY
i(DIb:@FD,*)+C]U=@3BN#EcYf8ATD3s@q?d$AftVqCh[NqF<G:8+EV:.+Cf>-FD5W8ARlolDIa
l(DId<j@<?3r@:F%a+D58'ATD4$Bl@l3De:,-DJs`8ARoFb/0JMK@qB4^F!,R<AKZ&-DfTqBG%G
>uD.RTpAKYo'+CT/5+Cei#DII?(E,9)oF*2M7/c~>
"""
ascii85_originalText="Man is distinguished, not only by his reason, but by this singular passion from other animals, which is a lust of the mind, that by a perseverance of delight in the continued and indefatigable generation of knowledge, exceeds the short vehemence of any carnal pleasure."
assert ASCII85Decode.decode(ascii85Test) == ascii85_originalText

1047
PyPDF2/generic.py Normal file

File diff suppressed because it is too large Load Diff

401
PyPDF2/merger.py Normal file
View File

@ -0,0 +1,401 @@
# vim: sw=4:expandtab:foldmethod=marker
#
# Copyright (c) 2006, Mathieu Fenniak
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
#
# * Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
# * The name of the author may not be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
from generic import *
from pdf import PdfFileReader, PdfFileWriter, Destination
class _MergedPage(object):
"""
_MergedPage is used internally by PdfFileMerger to collect necessary information on each page that is being merged.
"""
def __init__(self, pagedata, src, id):
self.src = src
self.pagedata = pagedata
self.out_pagedata = None
self.id = id
class PdfFileMerger(object):
"""
PdfFileMerger merges multiple PDFs into a single PDF. It can concatenate,
slice, insert, or any combination of the above.
See the functions "merge" (or "append") and "write" (or "overwrite") for
usage information.
"""
def __init__(self):
"""
>>> PdfFileMerger()
Initializes a PdfFileMerger, no parameters required
"""
self.inputs = []
self.pages = []
self.output = PdfFileWriter()
self.bookmarks = []
self.named_dests = []
self.id_count = 0
def merge(self, position, fileobj, bookmark=None, pages=None, import_bookmarks=True):
"""
>>> merge(position, file, bookmark=None, pages=None, import_bookmarks=True)
Merges the pages from the source document specified by "file" into the output
file at the page number specified by "position".
Optionally, you may specify a bookmark to be applied at the beginning of the
included file by supplying the text of the bookmark in the "bookmark" parameter.
You may prevent the source document's bookmarks from being imported by
specifying "import_bookmarks" as False.
You may also use the "pages" parameter to merge only the specified range of
pages from the source document into the output document.
"""
my_file = False
if type(fileobj) in (str, unicode):
fileobj = file(fileobj, 'rb')
my_file = True
if type(fileobj) == PdfFileReader:
pdfr = fileobj
fileobj = pdfr.file
else:
pdfr = PdfFileReader(fileobj)
# Find the range of pages to merge
if pages == None:
pages = (0, pdfr.getNumPages())
elif type(pages) in (int, float, str, unicode):
raise TypeError('"pages" must be a tuple of (start, end)')
srcpages = []
if bookmark:
bookmark = Bookmark(TextStringObject(bookmark), NumberObject(self.id_count), NameObject('/Fit'))
outline = []
if import_bookmarks:
outline = pdfr.getOutlines()
outline = self._trim_outline(pdfr, outline, pages)
if bookmark:
self.bookmarks += [bookmark, outline]
else:
self.bookmarks += outline
dests = pdfr.namedDestinations
dests = self._trim_dests(pdfr, dests, pages)
self.named_dests += dests
# Gather all the pages that are going to be merged
for i in range(*pages):
pg = pdfr.getPage(i)
id = self.id_count
self.id_count += 1
mp = _MergedPage(pg, pdfr, id)
srcpages.append(mp)
self._associate_dests_to_pages(srcpages)
self._associate_bookmarks_to_pages(srcpages)
# Slice to insert the pages at the specified position
self.pages[position:position] = srcpages
# Keep track of our input files so we can close them later
self.inputs.append((fileobj, pdfr, my_file))
def append(self, fileobj, bookmark=None, pages=None, import_bookmarks=True):
"""
>>> append(file, bookmark=None, pages=None, import_bookmarks=True):
Identical to the "merge" function, but assumes you want to concatenate all pages
onto the end of the file instead of specifying a position.
"""
self.merge(len(self.pages), fileobj, bookmark, pages, import_bookmarks)
def write(self, fileobj):
"""
>>> write(file)
Writes all data that has been merged to "file" (which can be a filename or any
kind of file-like object)
"""
my_file = False
if type(fileobj) in (str, unicode):
fileobj = file(fileobj, 'wb')
my_file = True
# Add pages to the PdfFileWriter
for page in self.pages:
self.output.addPage(page.pagedata)
page.out_pagedata = self.output.getReference(self.output._pages.getObject()["/Kids"][-1].getObject())
# Once all pages are added, create bookmarks to point at those pages
self._write_dests()
self._write_bookmarks()
# Write the output to the file
self.output.write(fileobj)
if my_file:
fileobj.close()
def close(self):
"""
>>> close()
Shuts all file descriptors (input and output) and clears all memory usage
"""
self.pages = []
for fo, pdfr, mine in self.inputs:
if mine:
fo.close()
self.inputs = []
self.output = None
def _trim_dests(self, pdf, dests, pages):
"""
Removes any named destinations that are not a part of the specified page set
"""
new_dests = []
prev_header_added = True
for k, o in dests.items():
for j in range(*pages):
if pdf.getPage(j).getObject() == o['/Page'].getObject():
o[NameObject('/Page')] = o['/Page'].getObject()
assert str(k) == str(o['/Title'])
new_dests.append(o)
break
return new_dests
def _trim_outline(self, pdf, outline, pages):
"""
Removes any outline/bookmark entries that are not a part of the specified page set
"""
new_outline = []
prev_header_added = True
for i, o in enumerate(outline):
if type(o) == list:
sub = self._trim_outline(pdf, o, pages)
if sub:
if not prev_header_added:
new_outline.append(outline[i-1])
new_outline.append(sub)
else:
prev_header_added = False
for j in range(*pages):
if pdf.getPage(j).getObject() == o['/Page'].getObject():
o[NameObject('/Page')] = o['/Page'].getObject()
new_outline.append(o)
prev_header_added = True
break
return new_outline
def _write_dests(self):
dests = self.named_dests
for v in dests:
pageno = None
pdf = None
if v.has_key('/Page'):
for i, p in enumerate(self.pages):
if p.id == v['/Page']:
v[NameObject('/Page')] = p.out_pagedata
pageno = i
pdf = p.src
if pageno != None:
self.output.addNamedDestinationObject(v)
def _write_bookmarks(self, bookmarks=None, parent=None):
if bookmarks == None:
bookmarks = self.bookmarks
last_added = None
for b in bookmarks:
if type(b) == list:
self._write_bookmarks(b, last_added)
continue
pageno = None
pdf = None
if b.has_key('/Page'):
for i, p in enumerate(self.pages):
if p.id == b['/Page']:
b[NameObject('/Page')] = p.out_pagedata
pageno = i
pdf = p.src
if pageno != None:
last_added = self.output.addBookmarkDestination(b, parent)
def _associate_dests_to_pages(self, pages):
for nd in self.named_dests:
pageno = None
np = nd['/Page']
if type(np) == NumberObject:
continue
for p in pages:
if np.getObject() == p.pagedata.getObject():
pageno = p.id
if pageno != None:
nd[NameObject('/Page')] = NumberObject(pageno)
else:
raise ValueError, "Unresolved named destination '%s'" % (nd['/Title'],)
def _associate_bookmarks_to_pages(self, pages, bookmarks=None):
if bookmarks == None:
bookmarks = self.bookmarks
for b in bookmarks:
if type(b) == list:
self._associate_bookmarks_to_pages(pages, b)
continue
pageno = None
bp = b['/Page']
if type(bp) == NumberObject:
continue
for p in pages:
if bp.getObject() == p.pagedata.getObject():
pageno = p.id
if pageno != None:
b[NameObject('/Page')] = NumberObject(pageno)
else:
raise ValueError, "Unresolved bookmark '%s'" % (b['/Title'],)
def findBookmark(self, bookmark, root=None):
if root == None:
root = self.bookmarks
for i, b in enumerate(root):
if type(b) == list:
res = self.findBookmark(bookmark, b)
if res:
return [i] + res
if b == bookmark or b['/Title'] == bookmark:
return [i]
return None
def addBookmark(self, title, pagenum, parent=None):
"""
Add a bookmark to the pdf, using the specified title and pointing at
the specified page number. A parent can be specified to make this a
nested bookmark below the parent.
"""
if parent == None:
iloc = [len(self.bookmarks)-1]
elif type(parent) == list:
iloc = parent
else:
iloc = self.findBookmark(parent)
dest = Bookmark(TextStringObject(title), NumberObject(pagenum), NameObject('/FitH'), NumberObject(826))
if parent == None:
self.bookmarks.append(dest)
else:
bmparent = self.bookmarks
for i in iloc[:-1]:
bmparent = bmparent[i]
npos = iloc[-1]+1
if npos < len(bmparent) and type(bmparent[npos]) == list:
bmparent[npos].append(dest)
else:
bmparent.insert(npos, [dest])
def addNamedDestination(self, title, pagenum):
"""
Add a destination to the pdf, using the specified title and pointing
at the specified page number.
"""
dest = Destination(TextStringObject(title), NumberObject(pagenum), NameObject('/FitH'), NumberObject(826))
self.named_dests.append(dest)
class OutlinesObject(list):
def __init__(self, pdf, tree, parent=None):
list.__init__(self)
self.tree = tree
self.pdf = pdf
self.parent = parent
def remove(self, index):
obj = self[index]
del self[index]
self.tree.removeChild(obj)
def add(self, title, page):
pageRef = self.pdf.getObject(self.pdf._pages)['/Kids'][pagenum]
action = DictionaryObject()
action.update({
NameObject('/D') : ArrayObject([pageRef, NameObject('/FitH'), NumberObject(826)]),
NameObject('/S') : NameObject('/GoTo')
})
actionRef = self.pdf._addObject(action)
bookmark = TreeObject()
bookmark.update({
NameObject('/A') : actionRef,
NameObject('/Title') : createStringObject(title),
})
pdf._addObject(bookmark)
self.tree.addChild(bookmark)
def removeAll(self):
for child in [x for x in self.tree.children()]:
self.tree.removeChild(child)
self.pop()

2013
PyPDF2/pdf.py Normal file

File diff suppressed because it is too large Load Diff

125
PyPDF2/utils.py Normal file
View File

@ -0,0 +1,125 @@
# vim: sw=4:expandtab:foldmethod=marker
#
# Copyright (c) 2006, Mathieu Fenniak
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
#
# * Redistributions of source code must retain the above copyright notice,
# this list of conditions and the following disclaimer.
# * Redistributions in binary form must reproduce the above copyright notice,
# this list of conditions and the following disclaimer in the documentation
# and/or other materials provided with the distribution.
# * The name of the author may not be used to endorse or promote products
# derived from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
# AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
# LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
# CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
# SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
# INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
# CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
# ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
"""
Utility functions for PDF library.
"""
__author__ = "Mathieu Fenniak"
__author_email__ = "biziqe@mathieu.fenniak.net"
#ENABLE_PSYCO = False
#if ENABLE_PSYCO:
# try:
# import psyco
# except ImportError:
# ENABLE_PSYCO = False
#
#if not ENABLE_PSYCO:
# class psyco:
# def proxy(func):
# return func
# proxy = staticmethod(proxy)
def readUntilWhitespace(stream, maxchars=None):
txt = ""
while True:
tok = stream.read(1)
if tok.isspace() or not tok:
break
txt += tok
if len(txt) == maxchars:
break
return txt
def readNonWhitespace(stream):
tok = ' '
while tok == '\n' or tok == '\r' or tok == ' ' or tok == '\t':
tok = stream.read(1)
return tok
class ConvertFunctionsToVirtualList(object):
def __init__(self, lengthFunction, getFunction):
self.lengthFunction = lengthFunction
self.getFunction = getFunction
def __len__(self):
return self.lengthFunction()
def __getitem__(self, index):
if not isinstance(index, int):
raise TypeError, "sequence indices must be integers"
len_self = len(self)
if index < 0:
# support negative indexes
index = len_self + index
if index < 0 or index >= len_self:
raise IndexError, "sequence index out of range"
return self.getFunction(index)
def RC4_encrypt(key, plaintext):
S = [i for i in range(256)]
j = 0
for i in range(256):
j = (j + S[i] + ord(key[i % len(key)])) % 256
S[i], S[j] = S[j], S[i]
i, j = 0, 0
retval = ""
for x in range(len(plaintext)):
i = (i + 1) % 256
j = (j + S[i]) % 256
S[i], S[j] = S[j], S[i]
t = S[(S[i] + S[j]) % 256]
retval += chr(ord(plaintext[x]) ^ t)
return retval
def matrixMultiply(a, b):
return [[sum([float(i)*float(j)
for i, j in zip(row, col)]
) for col in zip(*b)]
for row in a]
class PyPdfError(Exception):
pass
class PdfReadError(PyPdfError):
pass
class PageSizeNotDefinedError(PyPdfError):
pass
class PdfReadWarning(UserWarning):
pass
if __name__ == "__main__":
# test RC4
out = RC4_encrypt("Key", "Plaintext")
print repr(out)
pt = RC4_encrypt("Key", out)
print repr(pt)

355
PyPDF2/xmp.py Normal file
View File

@ -0,0 +1,355 @@
import re
import datetime
import decimal
from generic import PdfObject
from xml.dom import getDOMImplementation
from xml.dom.minidom import parseString
RDF_NAMESPACE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NAMESPACE = "http://purl.org/dc/elements/1.1/"
XMP_NAMESPACE = "http://ns.adobe.com/xap/1.0/"
PDF_NAMESPACE = "http://ns.adobe.com/pdf/1.3/"
XMPMM_NAMESPACE = "http://ns.adobe.com/xap/1.0/mm/"
# What is the PDFX namespace, you might ask? I might ask that too. It's
# a completely undocumented namespace used to place "custom metadata"
# properties, which are arbitrary metadata properties with no semantic or
# documented meaning. Elements in the namespace are key/value-style storage,
# where the element name is the key and the content is the value. The keys
# are transformed into valid XML identifiers by substituting an invalid
# identifier character with \u2182 followed by the unicode hex ID of the
# original character. A key like "my car" is therefore "my\u21820020car".
#
# \u2182, in case you're wondering, is the unicode character
# \u{ROMAN NUMERAL TEN THOUSAND}, a straightforward and obvious choice for
# escaping characters.
#
# Intentional users of the pdfx namespace should be shot on sight. A
# custom data schema and sensical XML elements could be used instead, as is
# suggested by Adobe's own documentation on XMP (under "Extensibility of
# Schemas").
#
# Information presented here on the /pdfx/ schema is a result of limited
# reverse engineering, and does not constitute a full specification.
PDFX_NAMESPACE = "http://ns.adobe.com/pdfx/1.3/"
iso8601 = re.compile("""
(?P<year>[0-9]{4})
(-
(?P<month>[0-9]{2})
(-
(?P<day>[0-9]+)
(T
(?P<hour>[0-9]{2}):
(?P<minute>[0-9]{2})
(:(?P<second>[0-9]{2}(.[0-9]+)?))?
(?P<tzd>Z|[-+][0-9]{2}:[0-9]{2})
)?
)?
)?
""", re.VERBOSE)
##
# An object that represents Adobe XMP metadata.
class XmpInformation(PdfObject):
def __init__(self, stream):
self.stream = stream
docRoot = parseString(self.stream.getData())
self.rdfRoot = docRoot.getElementsByTagNameNS(RDF_NAMESPACE, "RDF")[0]
self.cache = {}
def writeToStream(self, stream, encryption_key):
self.stream.writeToStream(stream, encryption_key)
def getElement(self, aboutUri, namespace, name):
for desc in self.rdfRoot.getElementsByTagNameNS(RDF_NAMESPACE, "Description"):
if desc.getAttributeNS(RDF_NAMESPACE, "about") == aboutUri:
attr = desc.getAttributeNodeNS(namespace, name)
if attr != None:
yield attr
for element in desc.getElementsByTagNameNS(namespace, name):
yield element
def getNodesInNamespace(self, aboutUri, namespace):
for desc in self.rdfRoot.getElementsByTagNameNS(RDF_NAMESPACE, "Description"):
if desc.getAttributeNS(RDF_NAMESPACE, "about") == aboutUri:
for i in range(desc.attributes.length):
attr = desc.attributes.item(i)
if attr.namespaceURI == namespace:
yield attr
for child in desc.childNodes:
if child.namespaceURI == namespace:
yield child
def _getText(self, element):
text = ""
for child in element.childNodes:
if child.nodeType == child.TEXT_NODE:
text += child.data
return text
def _converter_string(value):
return value
def _converter_date(value):
m = iso8601.match(value)
year = int(m.group("year"))
month = int(m.group("month") or "1")
day = int(m.group("day") or "1")
hour = int(m.group("hour") or "0")
minute = int(m.group("minute") or "0")
second = decimal.Decimal(m.group("second") or "0")
seconds = second.to_integral(decimal.ROUND_FLOOR)
milliseconds = (second - seconds) * 1000000
tzd = m.group("tzd") or "Z"
dt = datetime.datetime(year, month, day, hour, minute, seconds, milliseconds)
if tzd != "Z":
tzd_hours, tzd_minutes = [int(x) for x in tzd.split(":")]
tzd_hours *= -1
if tzd_hours < 0:
tzd_minutes *= -1
dt = dt + datetime.timedelta(hours=tzd_hours, minutes=tzd_minutes)
return dt
_test_converter_date = staticmethod(_converter_date)
def _getter_bag(namespace, name, converter):
def get(self):
cached = self.cache.get(namespace, {}).get(name)
if cached:
return cached
retval = []
for element in self.getElement("", namespace, name):
bags = element.getElementsByTagNameNS(RDF_NAMESPACE, "Bag")
if len(bags):
for bag in bags:
for item in bag.getElementsByTagNameNS(RDF_NAMESPACE, "li"):
value = self._getText(item)
value = converter(value)
retval.append(value)
ns_cache = self.cache.setdefault(namespace, {})
ns_cache[name] = retval
return retval
return get
def _getter_seq(namespace, name, converter):
def get(self):
cached = self.cache.get(namespace, {}).get(name)
if cached:
return cached
retval = []
for element in self.getElement("", namespace, name):
seqs = element.getElementsByTagNameNS(RDF_NAMESPACE, "Seq")
if len(seqs):
for seq in seqs:
for item in seq.getElementsByTagNameNS(RDF_NAMESPACE, "li"):
value = self._getText(item)
value = converter(value)
retval.append(value)
else:
value = converter(self._getText(element))
retval.append(value)
ns_cache = self.cache.setdefault(namespace, {})
ns_cache[name] = retval
return retval
return get
def _getter_langalt(namespace, name, converter):
def get(self):
cached = self.cache.get(namespace, {}).get(name)
if cached:
return cached
retval = {}
for element in self.getElement("", namespace, name):
alts = element.getElementsByTagNameNS(RDF_NAMESPACE, "Alt")
if len(alts):
for alt in alts:
for item in alt.getElementsByTagNameNS(RDF_NAMESPACE, "li"):
value = self._getText(item)
value = converter(value)
retval[item.getAttribute("xml:lang")] = value
else:
retval["x-default"] = converter(self._getText(element))
ns_cache = self.cache.setdefault(namespace, {})
ns_cache[name] = retval
return retval
return get
def _getter_single(namespace, name, converter):
def get(self):
cached = self.cache.get(namespace, {}).get(name)
if cached:
return cached
value = None
for element in self.getElement("", namespace, name):
if element.nodeType == element.ATTRIBUTE_NODE:
value = element.nodeValue
else:
value = self._getText(element)
break
if value != None:
value = converter(value)
ns_cache = self.cache.setdefault(namespace, {})
ns_cache[name] = value
return value
return get
##
# Contributors to the resource (other than the authors). An unsorted
# array of names.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_contributor = property(_getter_bag(DC_NAMESPACE, "contributor", _converter_string))
##
# Text describing the extent or scope of the resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_coverage = property(_getter_single(DC_NAMESPACE, "coverage", _converter_string))
##
# A sorted array of names of the authors of the resource, listed in order
# of precedence.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_creator = property(_getter_seq(DC_NAMESPACE, "creator", _converter_string))
##
# A sorted array of dates (datetime.datetime instances) of signifigance to
# the resource. The dates and times are in UTC.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_date = property(_getter_seq(DC_NAMESPACE, "date", _converter_date))
##
# A language-keyed dictionary of textual descriptions of the content of the
# resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_description = property(_getter_langalt(DC_NAMESPACE, "description", _converter_string))
##
# The mime-type of the resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_format = property(_getter_single(DC_NAMESPACE, "format", _converter_string))
##
# Unique identifier of the resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_identifier = property(_getter_single(DC_NAMESPACE, "identifier", _converter_string))
##
# An unordered array specifying the languages used in the resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_language = property(_getter_bag(DC_NAMESPACE, "language", _converter_string))
##
# An unordered array of publisher names.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_publisher = property(_getter_bag(DC_NAMESPACE, "publisher", _converter_string))
##
# An unordered array of text descriptions of relationships to other
# documents.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_relation = property(_getter_bag(DC_NAMESPACE, "relation", _converter_string))
##
# A language-keyed dictionary of textual descriptions of the rights the
# user has to this resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_rights = property(_getter_langalt(DC_NAMESPACE, "rights", _converter_string))
##
# Unique identifier of the work from which this resource was derived.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_source = property(_getter_single(DC_NAMESPACE, "source", _converter_string))
##
# An unordered array of descriptive phrases or keywrods that specify the
# topic of the content of the resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_subject = property(_getter_bag(DC_NAMESPACE, "subject", _converter_string))
##
# A language-keyed dictionary of the title of the resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_title = property(_getter_langalt(DC_NAMESPACE, "title", _converter_string))
##
# An unordered array of textual descriptions of the document type.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
dc_type = property(_getter_bag(DC_NAMESPACE, "type", _converter_string))
##
# An unformatted text string representing document keywords.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
pdf_keywords = property(_getter_single(PDF_NAMESPACE, "Keywords", _converter_string))
##
# The PDF file version, for example 1.0, 1.3.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
pdf_pdfversion = property(_getter_single(PDF_NAMESPACE, "PDFVersion", _converter_string))
##
# The name of the tool that created the PDF document.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
pdf_producer = property(_getter_single(PDF_NAMESPACE, "Producer", _converter_string))
##
# The date and time the resource was originally created. The date and
# time are returned as a UTC datetime.datetime object.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
xmp_createDate = property(_getter_single(XMP_NAMESPACE, "CreateDate", _converter_date))
##
# The date and time the resource was last modified. The date and time
# are returned as a UTC datetime.datetime object.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
xmp_modifyDate = property(_getter_single(XMP_NAMESPACE, "ModifyDate", _converter_date))
##
# The date and time that any metadata for this resource was last
# changed. The date and time are returned as a UTC datetime.datetime
# object.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
xmp_metadataDate = property(_getter_single(XMP_NAMESPACE, "MetadataDate", _converter_date))
##
# The name of the first known tool used to create the resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
xmp_creatorTool = property(_getter_single(XMP_NAMESPACE, "CreatorTool", _converter_string))
##
# The common identifier for all versions and renditions of this resource.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
xmpmm_documentId = property(_getter_single(XMPMM_NAMESPACE, "DocumentID", _converter_string))
##
# An identifier for a specific incarnation of a document, updated each
# time a file is saved.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
xmpmm_instanceId = property(_getter_single(XMPMM_NAMESPACE, "InstanceID", _converter_string))
def custom_properties(self):
if not hasattr(self, "_custom_properties"):
self._custom_properties = {}
for node in self.getNodesInNamespace("", PDFX_NAMESPACE):
key = node.localName
while True:
# see documentation about PDFX_NAMESPACE earlier in file
idx = key.find(u"\u2182")
if idx == -1:
break
key = key[:idx] + chr(int(key[idx+1:idx+5], base=16)) + key[idx+5:]
if node.nodeType == node.ATTRIBUTE_NODE:
value = node.nodeValue
else:
value = self._getText(node)
self._custom_properties[key] = value
return self._custom_properties
##
# Retrieves custom metadata properties defined in the undocumented pdfx
# metadata schema.
# <p>Stability: Added in v1.12, will exist for all future v1.x releases.
# @return Returns a dictionary of key/value items for custom metadata
# properties.
custom_properties = property(custom_properties)

38
README Normal file
View File

@ -0,0 +1,38 @@
Example:
from pyPdf import PdfFileWriter, PdfFileReader
output = PdfFileWriter()
input1 = PdfFileReader(file("document1.pdf", "rb"))
# add page 1 from input1 to output document, unchanged
output.addPage(input1.getPage(0))
# add page 2 from input1, but rotated clockwise 90 degrees
output.addPage(input1.getPage(1).rotateClockwise(90))
# add page 3 from input1, rotated the other way:
output.addPage(input1.getPage(2).rotateCounterClockwise(90))
# alt: output.addPage(input1.getPage(2).rotateClockwise(270))
# add page 4 from input1, but first add a watermark from another pdf:
page4 = input1.getPage(3)
watermark = PdfFileReader(file("watermark.pdf", "rb"))
page4.mergePage(watermark.getPage(0))
# add page 5 from input1, but crop it to half size:
page5 = input1.getPage(4)
page5.mediaBox.upperRight = (
page5.mediaBox.getUpperRight_x() / 2,
page5.mediaBox.getUpperRight_y() / 2
)
output.addPage(page5)
# print how many pages input1 has:
print "document1.pdf has %s pages." % input1.getNumPages())
# finally, write "output" to document-output.pdf
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)

40
setup.py Normal file
View File

@ -0,0 +1,40 @@
#!/usr/bin/env python
from distutils.core import setup
long_description = """
A Pure-Python library built as a PDF toolkit. It is capable of:
- extracting document information (title, author, ...),
- splitting documents page by page,
- merging documents page by page,
- cropping pages,
- merging multiple pages into a single page,
- encrypting and decrypting PDF files.
By being Pure-Python, it should run on any Python platform without any
dependencies on external libraries. It can also work entirely on StringIO
objects rather than file streams, allowing for PDF manipulation in memory.
It is therefore a useful tool for websites that manage or manipulate PDFs.
"""
setup(
name="pyPdf",
version="1.12",
description="PDF toolkit",
long_description=long_description,
author="Mathieu Fenniak",
author_email="biziqe@mathieu.fenniak.net",
url="http://pybrary.net/pyPdf/",
download_url="http://pybrary.net/pyPdf/pyPdf-1.12.tar.gz",
classifiers = [
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"License :: OSI Approved :: BSD License",
"Programming Language :: Python",
"Operating System :: OS Independent",
"Topic :: Software Development :: Libraries :: Python Modules",
],
packages=["pyPdf"],
)