pdfrw (0.1-3) unstable; urgency=medium

* QA upload.
  * Build using dh_python2

# imported from the archive
Matthias Klose 2014-07-13 17:50:59 +02:00
commit a1959ba9c0
49 changed files with 3407 additions and 0 deletions

21
LICENSE.txt Normal file

@ -0,0 +1,21 @@
pdfrw (pdfrw.googlecode.com)
Copyright (c) 2006-2012 Patrick Maupin
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

3
README.txt Normal file

@ -0,0 +1,3 @@
pdfrw reads and writes PDF files.
More info at http://code.google.com/p/pdfrw

45
debian/changelog vendored Normal file

@ -0,0 +1,45 @@
pdfrw (0.1-3) unstable; urgency=medium
* QA upload.
* Build using dh_python2
-- Matthias Klose <doko@debian.org> Sun, 13 Jul 2014 15:50:59 +0000
pdfrw (0.1-2) unstable; urgency=medium
* Orphaning package.
-- Chris Lamb <lamby@debian.org> Sun, 09 Feb 2014 00:05:27 +0000
pdfrw (0.1-1) unstable; urgency=low
* New upstream release.
-- Chris Lamb <lamby@debian.org> Tue, 16 Oct 2012 07:54:53 +0100
pdfrw (0+svn136-4) unstable; urgency=low
* Correct Homepage field. (Closes: #683165)
* Specify a 'name' kwarg in call to setuptools.setup.
-- Chris Lamb <lamby@debian.org> Tue, 31 Jul 2012 02:41:14 -0700
pdfrw (0+svn136-3) unstable; urgency=low
* python-pdfrw should Replaces/Provides/Conflicts pdfrw. Thanks to intrigeri
<intrigeri@boum.org>. (Closes: #639273)
-- Chris Lamb <lamby@debian.org> Fri, 26 Aug 2011 10:48:38 +0100
pdfrw (0+svn136-2) unstable; urgency=low
* Rename binary package to "python-pdfrw".
* Change Section to "python".
-- Chris Lamb <lamby@debian.org> Tue, 23 Aug 2011 15:17:20 +0100
pdfrw (0+svn136-1) unstable; urgency=low
* Initial release. (Closes: #638862)
-- Chris Lamb <lamby@debian.org> Mon, 22 Aug 2011 16:09:03 +0100

1
debian/compat vendored Normal file

@ -0,0 +1 @@
7

32
debian/control vendored Normal file

@ -0,0 +1,32 @@
Source: pdfrw
Section: python
Priority: optional
Maintainer: Debian QA Group <packages@qa.debian.org>
Build-Depends: debhelper (>= 7.0.50~)
Build-Depends-Indep: python-setuptools
Standards-Version: 3.9.2
Homepage: http://code.google.com/p/pdfrw/
Vcs-Git: git://github.com/lamby/pkg-pdfrw.git
Vcs-Browser: https://github.com/lamby/pkg-pdfrw
Package: python-pdfrw
Architecture: all
Depends: ${misc:Depends}, ${python:Depends}, python-reportlab
Replaces: pdfrw
Provides: pdfrw
Conflicts: pdfrw
Description: PDF file manipulation library
pdfrw can read and write PDF files, and can also be used to read in PDFs which
can then be used inside reportlab.
.
pdfrw tries to be agnostic about the contents of PDF files, and support them
as containers, but to do useful work, something a little higher-level is
required. It supports the following:
.
* PDF pages. pdfrw knows enough to find the pages in PDF files you read in,
and to write a set of pages back out to a new PDF file.
* Form XObjects. pdfrw can take any page or rectangle on a page, and convert
it to a Form XObject, suitable for use inside another PDF file
* reportlab objects. pdfrw can recursively create a set of reportlab objects
from its internal object format. This allows, for example, Form XObjects to
be used inside reportlab.

44
debian/copyright vendored Normal file

@ -0,0 +1,44 @@
Author: Patrick Maupin
Download: http://code.google.com/p/pdfrw/
Files: *
Copyright: © 2006-2009 Patrick Maupin
License: MIT
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
.
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
Files: debian/*
Copyright: © 2011 Chris Lamb <chris@chris-lamb.co.uk>
License: MIT
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
.
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

1
debian/examples vendored Normal file

@ -0,0 +1 @@
examples/*

4
debian/rules vendored Executable file

@ -0,0 +1,4 @@
#!/usr/bin/make -f
%:
	dh $@ --with python2

1
debian/source/format vendored Normal file

@ -0,0 +1 @@
3.0 (quilt)

51
examples/4up.py Executable file

@ -0,0 +1,51 @@
#!/usr/bin/env python
'''
usage: 4up.py my.pdf
Creates 4up.my.pdf
'''
import sys
import os
import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfDict, PdfName, PdfArray
from pdfrw.buildxobj import pagexobj
def get4(allpages):
    # Pull a maximum of 4 pages off the list
    pages = [pagexobj(x) for x in allpages[:4]]
    del allpages[:4]

    x_max = max(page.BBox[2] for page in pages)
    y_max = max(page.BBox[3] for page in pages)

    stream = []
    xobjdict = PdfDict()
    for index, page in enumerate(pages):
        x = x_max * (index & 1) / 2.0
        y = y_max * (index <= 1) / 2.0
        index = '/P%s' % index
        stream.append('q 0.5 0 0 0.5 %s %s cm %s Do Q\n' % (x, y, index))
        xobjdict[index] = page

    return PdfDict(
        Type = PdfName.Page,
        Contents = PdfDict(stream=''.join(stream)),
        MediaBox = PdfArray([0, 0, x_max, y_max]),
        Resources = PdfDict(XObject = xobjdict),
    )

def go(inpfn, outfn):
    pages = PdfReader(inpfn).pages
    writer = PdfWriter()
    while pages:
        writer.addpage(get4(pages))
    writer.write(outfn)

if __name__ == '__main__':
    inpfn, = sys.argv[1:]
    outfn = '4up.' + os.path.basename(inpfn)
    go(inpfn, outfn)
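
The placement arithmetic used by get4() can be tried without pdfrw installed. What follows is only a sketch in Python 3: the helper name four_up_placements is hypothetical (it is not part of pdfrw), and it reproduces just the offsets and the content-stream line, not the Form XObject handling.

```python
def four_up_placements(x_max, y_max, count=4):
    """Return (x, y, stream_line) for each of up to four half-size pages."""
    placements = []
    for index in range(count):
        # Odd indexes land in the right-hand column.
        x = x_max * (index & 1) / 2.0
        # Indexes 0 and 1 land in the upper row (the PDF origin is bottom-left).
        y = y_max * (index <= 1) / 2.0
        name = '/P%s' % index
        # Scale by 0.5, translate to (x, y), then draw the named XObject.
        line = 'q 0.5 0 0 0.5 %s %s cm %s Do Q\n' % (x, y, name)
        placements.append((x, y, line))
    return placements


# US Letter in points, chosen only as an illustration.
for x, y, line in four_up_placements(612, 792):
    print(x, y, line, end='')
```

The boolean expression (index <= 1) is multiplied as 0 or 1, which is why the first two pages are lifted by half the page height and the last two are not.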

32
examples/README.txt Normal file

@ -0,0 +1,32 @@
Example programs:
4up.py -- Prints pages four-up
alter.py -- Simple example of making a very slight modification to a PDF.
booklet.py -- Converts a PDF into a booklet.
metadata.py -- Concatenates multiple PDFs, adds metadata.
poster.py -- Changes the size of a PDF to create a poster
print_two.py -- Prints two cut-down copies on a single sheet of paper (double-sided). Requires uncompressed PDF.
rotate.py -- This will rotate selected ranges of pages within a document.
subset.py -- This will retrieve a subset of pages from a document.
watermark.py -- Adds a watermark to a PDF
rl1/4up.py -- Same as 4up.py, using reportlab for output. Next simplest reportlab example.
rl1/booklet.py -- Version of booklet.py using reportlab for output.
rl1/platypus_pdf_template.py -- Example using a PDF page as a watermark background with reportlab.
rl1/subset.py -- Same as subset.py, using reportlab for output. Simplest reportlab example.
rl2/copy.py -- example of how you could parse a graphics stream and then use reportlab for output.
Works on a few different PDFs, probably not a suitable starting point for real
production work without a lot of work on the library functions.

25
examples/alter.py Executable file

@ -0,0 +1,25 @@
#!/usr/bin/env python
'''
usage: alter.py my.pdf
Creates alter.my.pdf
Demonstrates making a slight alteration to a preexisting PDF file.
'''
import sys
import os
import find_pdfrw
from pdfrw import PdfReader, PdfWriter
inpfn, = sys.argv[1:]
outfn = 'alter.' + os.path.basename(inpfn)
trailer = PdfReader(inpfn)
trailer.Info.Title = 'My New Title Goes Here'
writer = PdfWriter()
writer.trailer = trailer
writer.write(outfn)

65
examples/booklet.py Executable file

@ -0,0 +1,65 @@
#!/usr/bin/env python
'''
usage: booklet.py my.pdf
Creates booklet.my.pdf
Pages organized in a form suitable for booklet printing.
'''
import sys
import os
import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfDict, PdfArray, PdfName, IndirectPdfDict
from pdfrw.buildxobj import pagexobj
def fixpage(*pages):
    pages = [pagexobj(x) for x in pages]

    class PageStuff(tuple):
        pass

    x = y = 0
    for i, page in enumerate(pages):
        index = '/P%s' % i
        shift_right = x and '1 0 0 1 %s 0 cm ' % x or ''
        stuff = PageStuff((index, page))
        stuff.stream = 'q %s%s Do Q\n' % (shift_right, index)
        x += page.BBox[2]
        y = max(y, page.BBox[3])
        pages[i] = stuff

    # Multiple copies of first page used as a placeholder to
    # get blank page on back.
    for p1, p2 in zip(pages, pages[1:]):
        if p1[1] is p2[1]:
            pages.remove(p1)

    return IndirectPdfDict(
        Type = PdfName.Page,
        Contents = PdfDict(stream=''.join(page.stream for page in pages)),
        MediaBox = PdfArray([0, 0, x, y]),
        Resources = PdfDict(
            XObject = PdfDict(pages),
        ),
    )

inpfn, = sys.argv[1:]
outfn = 'booklet.' + os.path.basename(inpfn)
pages = PdfReader(inpfn).pages

# Use page1 as a marker to print a blank at the end
if len(pages) & 1:
    pages.append(pages[0])

bigpages = []
while len(pages) > 2:
    bigpages.append(fixpage(pages.pop(), pages.pop(0)))
    bigpages.append(fixpage(pages.pop(0), pages.pop()))

bigpages += pages

PdfWriter().addpages(bigpages).write(outfn)
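
The pairing loop above is easier to follow on plain page numbers. This is a sketch in Python 3 under stated assumptions: booklet_order is a hypothetical name, and None stands in for the blank back page, whereas the script itself reuses the first page as its marker.

```python
def booklet_order(page_numbers):
    """Pair pages from both ends of the list, as the while loop in booklet.py does."""
    pages = list(page_numbers)
    if len(pages) & 1:
        pages.append(None)      # assumption: None marks the blank back page
    sides = []
    while len(pages) > 2:
        # Tuple elements are evaluated left to right: last page, then first page.
        sides.append((pages.pop(), pages.pop(0)))
        sides.append((pages.pop(0), pages.pop()))
    # Whatever is left over is emitted as single pages, as in the script.
    sides += [(page,) for page in pages]
    return sides


print(booklet_order(range(1, 9)))
```

For an eight-page input, each tuple is one side of a sheet, read left to right.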

33
examples/find_pdfrw.py Normal file

@ -0,0 +1,33 @@
'''
find_xxx.py -- Find the place in the tree where xxx lives.
Ways to use:
1) Make a copy, renaming it so that 'xxx' is your package name; or
2) Under Linux, just ln -s to where this is in the right tree
Created by Pat Maupin, who doesn't consider it big enough to be worth copyrighting
'''
import sys
import os
myname = __name__[5:] # remove 'find_'
myname = os.path.join(myname, '__init__.py')
def trypath(newpath):
    path = None
    while path != newpath:
        path = newpath
        if os.path.exists(os.path.join(path, myname)):
            return path
        newpath = os.path.dirname(path)

root = trypath(__file__) or trypath(os.path.realpath(__file__))

if root is None:
    print
    print 'Warning: %s: Could not find path to development package %s' % (__file__, myname)
    print ' The import will either fail or will use system-installed libraries'
    print
elif root not in sys.path:
    sys.path.append(root)
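
The upward search in trypath() can be exercised in isolation. The sketch below is Python 3 and makes a few assumptions: find_package_root and demo are hypothetical names, the package name 'demo_pkg' is made up, and the search starts from a directory rather than from __file__.

```python
import os
import tempfile


def find_package_root(start, package):
    """Walk upward from start; return the first directory holding package/__init__.py."""
    marker = os.path.join(package, '__init__.py')
    path = None
    newpath = os.path.abspath(start)
    while path != newpath:      # dirname() of the root is the root, so the loop ends there
        path = newpath
        if os.path.exists(os.path.join(path, marker)):
            return path
        newpath = os.path.dirname(path)
    return None


def demo():
    """Build a throwaway tree and search it from two levels down."""
    with tempfile.TemporaryDirectory() as tmp:
        os.makedirs(os.path.join(tmp, 'demo_pkg'))
        open(os.path.join(tmp, 'demo_pkg', '__init__.py'), 'w').close()
        deep = os.path.join(tmp, 'examples', 'rl1')
        os.makedirs(deep)
        return find_package_root(deep, 'demo_pkg') == os.path.abspath(tmp)


print(demo())
```

Returning None when nothing is found mirrors the implicit None that trypath() returns when its loop runs out.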

39
examples/metadata.py Executable file

@ -0,0 +1,39 @@
#!/usr/bin/env python
'''
usage: metadata.py <first.pdf> [<next.pdf> ...]
Creates output.pdf
This file demonstrates two features:
1) Concatenating multiple input PDFs.
2) adding metadata to the PDF.
If you do not need to add metadata, look at subset.py, which
has a simpler interface to PdfWriter.
'''
import sys
import os
import find_pdfrw
from pdfrw import PdfReader, PdfWriter, IndirectPdfDict
inputs = sys.argv[1:]
assert inputs
outfn = 'output.pdf'
writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)

writer.trailer.Info = IndirectPdfDict(
    Title = 'your title goes here',
    Author = 'your name goes here',
    Subject = 'what is it all about?',
    Creator = 'some script goes here',
)
writer.write(outfn)

57
examples/poster.py Executable file

@ -0,0 +1,57 @@
#!/usr/bin/env python
'''
usage: poster.py my.pdf
Shows how to change the size on a PDF.
Motivation:
My daughter needed to create a 48" x 36" poster, but her Mac version of Powerpoint
only wanted to output 8.5" x 11" for some reason.
'''
import sys
import os
import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfDict, PdfName, PdfArray, IndirectPdfDict
from pdfrw.buildxobj import pagexobj
def adjust(page):
    page = pagexobj(page)
    assert page.BBox == [0, 0, 11 * 72, int(8.5 * 72)], page.BBox
    margin = 72 // 2
    old_x, old_y = page.BBox[2] - 2 * margin, page.BBox[3] - 2 * margin

    new_x, new_y = 48 * 72, 36 * 72
    ratio = 1.0 * new_x / old_x
    assert ratio == 1.0 * new_y / old_y

    index = '/BasePage'
    x = -margin * ratio
    y = -margin * ratio
    stream = 'q %0.2f 0 0 %0.2f %s %s cm %s Do Q\n' % (ratio, ratio, x, y, index)
    xobjdict = PdfDict()
    xobjdict[index] = page

    return PdfDict(
        Type = PdfName.Page,
        Contents = PdfDict(stream=stream),
        MediaBox = PdfArray([0, 0, new_x, new_y]),
        Resources = PdfDict(XObject = xobjdict),
    )

def go(inpfn, outfn):
    reader = PdfReader(inpfn)
    page, = reader.pages
    writer = PdfWriter()
    writer.addpage(adjust(page))
    writer.trailer.Info = IndirectPdfDict(reader.Info)
    writer.write(outfn)

if __name__ == '__main__':
    inpfn, = sys.argv[1:]
    outfn = 'poster.' + os.path.basename(inpfn)
    go(inpfn, outfn)
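
The scaling arithmetic in adjust() can be checked by hand. The following Python 3 sketch uses a hypothetical helper, poster_transform, that is not part of pdfrw, and it compares the two ratios with a small tolerance instead of the exact equality the script asserts.

```python
def poster_transform(old_size, new_size, margin):
    """Return (ratio, x, y) that maps the area inside the margins onto new_size."""
    old_x = old_size[0] - 2 * margin
    old_y = old_size[1] - 2 * margin
    ratio_x = new_size[0] / old_x
    ratio_y = new_size[1] / old_y
    # Assumption: a tolerance is used here; the script asserts exact equality.
    if abs(ratio_x - ratio_y) > 1e-9:
        raise ValueError('aspect ratios differ: %s vs %s' % (ratio_x, ratio_y))
    # A negative offset pushes the scaled margin off the new page.
    offset = -margin * ratio_x
    return ratio_x, offset, offset


# 11" x 8.5" source, 48" x 36" target, half-inch margin, all in points.
ratio, x, y = poster_transform((11 * 72, int(8.5 * 72)), (48 * 72, 36 * 72), 72 // 2)
print('q %0.2f 0 0 %0.2f %s %s cm /BasePage Do Q' % (ratio, ratio, x, y))
```

With these sizes the printable area is 720 by 540 points, so both axes scale by the same factor, which is the condition the script's second assert checks.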

58
examples/print_two.py Executable file

@ -0,0 +1,58 @@
#!/usr/bin/env python
'''
usage: print_two.py my.pdf
Creates print_two.my.pdf
This is only useful when you can cut down sheets of paper to make two
small documents. Works for double-sided only right now.
'''
import sys
import os
import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfArray, IndirectPdfDict
def fixpage(page, count=[0]):
    count[0] += 1
    evenpage = not (count[0] & 1)

    # For demo purposes, just go with the MediaBox and toast the others
    box = [float(x) for x in page.MediaBox]
    assert box[0] == box[1] == 0, "demo won't work on this PDF"

    for key, value in sorted(page.iteritems()):
        if 'box' in key.lower():
            del page[key]

    startsize = tuple(box[2:])
    finalsize = box[3], 2 * box[2]
    page.MediaBox = PdfArray((0, 0) + finalsize)
    page.Rotate = (int(page.Rotate or 0) + 90) % 360

    contents = page.Contents
    if contents is None:
        return page
    contents = isinstance(contents, dict) and [contents] or contents

    prefix = '0 1 -1 0 %s %s cm\n' % (finalsize[0], 0)
    if evenpage:
        prefix = '1 0 0 1 %s %s cm\n' % (0, finalsize[1]/2) + prefix
    first_prefix = 'q\n-1 0 0 -1 %s %s cm\n' % finalsize + prefix
    second_prefix = '\nQ\n' + prefix
    first_prefix = IndirectPdfDict(stream=first_prefix)
    second_prefix = IndirectPdfDict(stream=second_prefix)
    contents = PdfArray(([second_prefix] + contents) * 2)
    contents[0] = first_prefix
    page.Contents = contents
    return page

inpfn, = sys.argv[1:]
outfn = 'print_two.' + os.path.basename(inpfn)
pages = PdfReader(inpfn).pages
PdfWriter().addpages(fixpage(x) for x in pages).write(outfn)
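
The page-size and rotation bookkeeping in fixpage() can be looked at on its own. This Python 3 sketch assumes a hypothetical helper, doubled_layout, and covers only the MediaBox and Rotate values, not the content-stream prefixes.

```python
def doubled_layout(media_box, rotate=None):
    """Return the new MediaBox and Rotate value for a two-copies-per-sheet page."""
    box = [float(value) for value in media_box]
    if box[0] != 0 or box[1] != 0:
        raise ValueError("demo won't work on this PDF")
    # The height becomes the new width; the new height is twice the old width.
    finalsize = (box[3], 2 * box[2])
    # A missing Rotate entry is treated as 0 before the quarter turn is added.
    new_rotate = (int(rotate or 0) + 90) % 360
    return [0, 0] + list(finalsize), new_rotate


# A half-letter page (396 x 612 points), chosen only as an illustration.
print(doubled_layout([0, 0, 396, 612]))
```

The MediaBox values are passed through float() first because a parsed PDF may hand them over as strings rather than numbers.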

57
examples/rl1/4up.py Executable file

@ -0,0 +1,57 @@
#!/usr/bin/env python
'''
usage: 4up.py my.pdf
Uses Form XObjects and reportlab to create 4up.my.pdf.
Demonstrates use of pdfrw with reportlab.
'''
import sys
import os
from reportlab.pdfgen.canvas import Canvas
import find_pdfrw
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
def addpage(canvas, allpages):
    pages = allpages[:4]
    del allpages[:4]

    x_max = max(page.BBox[2] for page in pages)
    y_max = max(page.BBox[3] for page in pages)

    canvas.setPageSize((x_max, y_max))

    for index, page in enumerate(pages):
        x = x_max * (index & 1) / 2.0
        y = y_max * (index <= 1) / 2.0
        canvas.saveState()
        canvas.translate(x, y)
        canvas.scale(0.5, 0.5)
        canvas.doForm(makerl(canvas, page))
        canvas.restoreState()
    canvas.showPage()

def go(argv):
    inpfn, = argv
    outfn = '4up.' + os.path.basename(inpfn)
    pages = PdfReader(inpfn).pages
    pages = [pagexobj(x) for x in pages]
    canvas = Canvas(outfn)
    while pages:
        addpage(canvas, pages)
    canvas.save()

if __name__ == '__main__':
    go(sys.argv[1:])

9
examples/rl1/README.txt Normal file

@ -0,0 +1,9 @@
This directory contains example scripts which read in PDFs
and convert pages to PDF Form XObjects using pdfrw, and then
write out the PDFs using reportlab.
The examples, from easiest to hardest, are:
subset.py -- prints a subset of pages
4up.py -- prints pages 4-up
booklet.py -- creates a booklet out of the pages

69
examples/rl1/booklet.py Executable file

@ -0,0 +1,69 @@
#!/usr/bin/env python
'''
usage: booklet.py my.pdf
Uses Form XObjects and reportlab to create booklet.my.pdf.
Demonstrates use of pdfrw with reportlab.
'''
import sys
import os
from reportlab.pdfgen.canvas import Canvas
import find_pdfrw
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
def read_and_double(inpfn):
    pages = PdfReader(inpfn).pages
    pages = [pagexobj(x) for x in pages]
    if len(pages) & 1:
        pages.append(pages[0])  # Sentinel -- get same size for back as front

    xobjs = []
    while len(pages) > 2:
        xobjs.append((pages.pop(), pages.pop(0)))
        xobjs.append((pages.pop(0), pages.pop()))
    xobjs += [(x,) for x in pages]
    return xobjs

def make_pdf(outfn, xobjpairs):
    canvas = Canvas(outfn)
    for xobjlist in xobjpairs:
        x = y = 0
        for xobj in xobjlist:
            x += xobj.BBox[2]
            y = max(y, xobj.BBox[3])

        canvas.setPageSize((x,y))

        # Handle blank back page
        if len(xobjlist) > 1 and xobjlist[0] == xobjlist[-1]:
            xobjlist = xobjlist[:1]
            x = xobjlist[0].BBox[2]
        else:
            x = 0
        y = 0

        for xobj in xobjlist:
            canvas.saveState()
            canvas.translate(x, y)
            canvas.doForm(makerl(canvas, xobj))
            canvas.restoreState()
            x += xobj.BBox[2]
        canvas.showPage()
    canvas.save()

inpfn, = sys.argv[1:]
outfn = 'booklet.' + os.path.basename(inpfn)
make_pdf(outfn, read_and_double(inpfn))


@ -0,0 +1,33 @@
'''
find_xxx.py -- Find the place in the tree where xxx lives.
Ways to use:
1) Make a copy, renaming it so that 'xxx' is your package name; or
2) Under Linux, just ln -s to where this is in the right tree
Created by Pat Maupin, who doesn't consider it big enough to be worth copyrighting
'''
import sys
import os
myname = __name__[5:] # remove 'find_'
myname = os.path.join(myname, '__init__.py')
def trypath(newpath):
    path = None
    while path != newpath:
        path = newpath
        if os.path.exists(os.path.join(path, myname)):
            return path
        newpath = os.path.dirname(path)

root = trypath(__file__) or trypath(os.path.realpath(__file__))

if root is None:
    print
    print 'Warning: %s: Could not find path to development package %s' % (__file__, myname)
    print ' The import will either fail or will use system-installed libraries'
    print
elif root not in sys.path:
    sys.path.append(root)


@ -0,0 +1,106 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
usage: platypus_pdf_template.py output.pdf pdf_file_to_use_as_template.pdf
Example of using pdfrw to use a pdf (page one) as the background for all
other pages together with platypus.
There is a table of contents in this example for completeness' sake.
Contributed by user asannes
"""
import sys
from reportlab.platypus import PageTemplate, BaseDocTemplate, Frame
from reportlab.platypus import NextPageTemplate, Paragraph, PageBreak
from reportlab.platypus.tableofcontents import TableOfContents
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.rl_config import defaultPageSize
from reportlab.lib.units import inch
from reportlab.graphics import renderPDF
import find_pdfrw
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
PAGE_WIDTH = defaultPageSize[0]
PAGE_HEIGHT = defaultPageSize[1]
class MyTemplate(PageTemplate):
    """The kernel of this example, where we use pdfrw to fill in the
    background of a page before writing to it. This could be used to fill
    in a watermark or similar."""

    def __init__(self, pdf_template_filename, name=None):
        frames = [Frame(
            0.85 * inch,
            0.5 * inch,
            PAGE_WIDTH - 1.15 * inch,
            PAGE_HEIGHT - (1.5 * inch)
            )]
        PageTemplate.__init__(self, name, frames)
        # use first page as template
        page = PdfReader(pdf_template_filename).pages[0]
        self.page_template = pagexobj(page)
        # Scale it to fill the complete page
        self.page_xscale = PAGE_WIDTH/self.page_template.BBox[2]
        self.page_yscale = PAGE_HEIGHT/self.page_template.BBox[3]

    def beforeDrawPage(self, canvas, doc):
        """Draws the background before anything else"""
        canvas.saveState()
        rl_obj = makerl(canvas, self.page_template)
        canvas.scale(self.page_xscale, self.page_yscale)
        canvas.doForm(rl_obj)
        canvas.restoreState()

class MyDocTemplate(BaseDocTemplate):
    """Used to apply heading to table of contents."""

    def afterFlowable(self, flowable):
        """Adds Heading1 to table of contents"""
        if flowable.__class__.__name__ == 'Paragraph':
            style = flowable.style.name
            text = flowable.getPlainText()
            key = '%s' % self.seq.nextf('toc')
            if style == 'Heading1':
                self.canv.bookmarkPage(key)
                self.notify('TOCEntry', [1, text, self.page, key])

def create_toc():
    """Creates the table of contents"""
    table_of_contents = TableOfContents()
    table_of_contents.dotsMinLevel = 0
    header1 = ParagraphStyle(name = 'Heading1', fontSize = 16, leading = 16)
    header2 = ParagraphStyle(name = 'Heading2', fontSize = 14, leading = 14)
    table_of_contents.levelStyles = [header1, header2]
    return [table_of_contents, PageBreak()]

def create_pdf(filename, pdf_template_filename):
    """Create the pdf, with all the contents"""
    # Open in binary mode: PDF output must not go through newline translation
    pdf_report = open(filename, "wb")
    document = MyDocTemplate(pdf_report)
    templates = [ MyTemplate(pdf_template_filename, name='background') ]
    document.addPageTemplates(templates)

    styles = getSampleStyleSheet()
    elements = [NextPageTemplate('background')]
    elements.extend(create_toc())

    # Dummy content (hello world x 200)
    for i in range(200):
        elements.append(Paragraph("Hello World" + str(i), styles['Heading1']))

    document.multiBuild(elements)
    pdf_report.close()

if __name__ == '__main__':
    try:
        output, template = sys.argv[1:]
        create_pdf(output, template)
    except ValueError:
        print "Usage: %s <output> <template>" % (sys.argv[0])

43
examples/rl1/subset.py Executable file

@ -0,0 +1,43 @@
#!/usr/bin/env python
'''
usage: subset.py my.pdf firstpage lastpage
Creates subset_<pagenum>_to_<pagenum>.my.pdf
Uses Form XObjects and reportlab to create output file.
Demonstrates use of pdfrw with reportlab.
'''
import sys
import os
from reportlab.pdfgen.canvas import Canvas
import find_pdfrw
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl
def go(inpfn, firstpage, lastpage):
    firstpage, lastpage = int(firstpage), int(lastpage)
    outfn = 'subset_%s_to_%s.%s' % (firstpage, lastpage, os.path.basename(inpfn))

    pages = PdfReader(inpfn).pages
    pages = [pagexobj(x) for x in pages[firstpage-1:lastpage]]
    canvas = Canvas(outfn)

    for page in pages:
        canvas.setPageSize(tuple(page.BBox[2:]))
        canvas.doForm(makerl(canvas, page))
        canvas.showPage()

    canvas.save()

if __name__ == '__main__':
    inpfn, firstpage, lastpage = sys.argv[1:]
    go(inpfn, firstpage, lastpage)
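
The slice pages[firstpage-1:lastpage] turns a 1-based, inclusive page range into Python's 0-based, half-open indexing. Below is a small Python 3 sketch of that conversion; select_pages is a hypothetical name, and the range check is an addition that the script itself does not perform.

```python
def select_pages(pages, firstpage, lastpage):
    """Return pages firstpage..lastpage, counted from 1 and inclusive at both ends."""
    firstpage, lastpage = int(firstpage), int(lastpage)
    # Assumption: reject ranges the slice would silently misread (e.g. firstpage 0).
    if firstpage < 1 or lastpage < firstpage:
        raise ValueError('bad page range: %s to %s' % (firstpage, lastpage))
    return pages[firstpage - 1:lastpage]


# Command-line arguments arrive as strings, hence the int() conversion.
print(select_pages(['p1', 'p2', 'p3', 'p4', 'p5'], '2', '4'))
```

A lastpage beyond the end of the document is not an error here; slicing simply stops at the last available page.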

5
examples/rl2/README.txt Normal file

@ -0,0 +1,5 @@
The copy.py demo in this directory parses the graphics stream from the PDF and actually plays it back through reportlab.
Doesn't yet handle fonts or unicode very well.
For a more practical demo, look at the Form XObjects approach in the examples/rl1 directory.
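
Playing a graphics stream back amounts to collecting operands until an operator token appears and then dispatching to a handler. The Python 3 sketch below illustrates that idea only: play_stream and the handlers table are hypothetical, a plain list stands in for the reportlab canvas, and str.split() stands in for a real PDF tokenizer.

```python
def play_stream(tokens, handlers):
    """Collect operands until an operator appears, then call its handler."""
    operands = []
    for token in tokens:
        entry = handlers.get(token)
        if entry is None:
            operands.append(token)      # not a known operator, so keep it as an operand
            continue
        func, converters = entry
        count = len(converters)
        if len(operands) < count:
            operands[:] = []            # too few operands: drop them and move on
            continue
        raw = operands[len(operands) - count:]
        func(*[convert(value) for convert, value in zip(converters, raw)])
        operands[:] = []


calls = []  # stands in for a canvas; records what would have been drawn
handlers = {
    'q': (lambda: calls.append(('saveState',)), ()),
    'Q': (lambda: calls.append(('restoreState',)), ()),
    'cm': (lambda *args: calls.append(('transform',) + args), (float,) * 6),
    'w': (lambda width: calls.append(('setLineWidth', width)), (float,)),
}

play_stream('q 0.5 0 0 0.5 10 20 cm 2 w Q'.split(), handlers)
print(calls)
```

Keeping the operand converters next to each handler means a new operator can be supported by adding one table entry.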

32
examples/rl2/copy.py Executable file

@ -0,0 +1,32 @@
#!/usr/bin/env python
'''
usage: copy.py my.pdf
Creates copy.my.pdf
Uses somewhat-functional parser. For better results
for most things, see the Form XObject-based method.
'''
import sys
import os
from reportlab.pdfgen.canvas import Canvas
from decodegraphics import parsepage
from pdfrw import PdfReader, PdfWriter, PdfArray
inpfn, = sys.argv[1:]
outfn = 'copy.' + os.path.basename(inpfn)
pages = PdfReader(inpfn).pages
canvas = Canvas(outfn, pageCompression=0)
for page in pages:
    box = [float(x) for x in page.MediaBox]
    assert box[0] == box[1] == 0, "demo won't work on this PDF"
    canvas.setPageSize(box[2:])
    parsepage(page, canvas)
    canvas.showPage()
canvas.save()


@ -0,0 +1,378 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''
This file is an example parser that will parse a graphics stream
into a reportlab canvas.
Needs work on fonts and unicode, but works on a few PDFs.
Better to use Form XObjects for most things (see the example in rl1).
'''
from inspect import getargspec
import find_pdfrw
from pdfrw import PdfTokens
from pdfrw.pdfobjects import PdfString
#############################################################################
# Graphics parsing
def parse_array(self, token='[', params=None):
    mylist = []
    for token in self.tokens:
        if token == ']':
            break
        mylist.append(token)
    self.params.append(mylist)

def parse_savestate(self, token='q', params=''):
    self.canv.saveState()

def parse_restorestate(self, token='Q', params=''):
    self.canv.restoreState()

def parse_transform(self, token='cm', params='ffffff'):
    self.canv.transform(*params)

def parse_linewidth(self, token='w', params='f'):
    self.canv.setLineWidth(*params)

def parse_linecap(self, token='J', params='i'):
    self.canv.setLineCap(*params)

def parse_linejoin(self, token='j', params='i'):
    self.canv.setLineJoin(*params)

def parse_miterlimit(self, token='M', params='f'):
    self.canv.setMiterLimit(*params)

def parse_dash(self, token='d', params='as'): # Array, string
    self.canv.setDash(*params)

def parse_intent(self, token='ri', params='n'):
    # TODO: add logging
    pass

def parse_flatness(self, token='i', params='i'):
    # TODO: add logging
    pass

def parse_gstate(self, token='gs', params='n'):
    # TODO: add logging
    # Could parse stuff we care about from here later
    pass

def parse_move(self, token='m', params='ff'):
    if self.gpath is None:
        self.gpath = self.canv.beginPath()
    self.gpath.moveTo(*params)
    self.current_point = params

def parse_line(self, token='l', params='ff'):
    self.gpath.lineTo(*params)
    self.current_point = params

def parse_curve(self, token='c', params='ffffff'):
    self.gpath.curveTo(*params)
    self.current_point = params[-2:]

def parse_curve1(self, token='v', params='ffff'):
    parse_curve(self, token, tuple(self.current_point) + tuple(params))

def parse_curve2(self, token='y', params='ffff'):
    parse_curve(self, token, tuple(params) + tuple(params[-2:]))

def parse_close(self, token='h', params=''):
    self.gpath.close()

def parse_rect(self, token='re', params='ffff'):
    if self.gpath is None:
        self.gpath = self.canv.beginPath()
    self.gpath.rect(*params)
    self.current_point = params[-2:]

def parse_stroke(self, token='S', params=''):
    finish_path(self, 1, 0, 0)

def parse_close_stroke(self, token='s', params=''):
    self.gpath.close()
    finish_path(self, 1, 0, 0)

def parse_fill(self, token='f', params=''):
    finish_path(self, 0, 1, 1)

def parse_fill_compat(self, token='F', params=''):
    finish_path(self, 0, 1, 1)

def parse_fill_even_odd(self, token='f*', params=''):
    finish_path(self, 0, 1, 0)

def parse_fill_stroke_even_odd(self, token='B*', params=''):
    finish_path(self, 1, 1, 0)

def parse_fill_stroke(self, token='B', params=''):
    finish_path(self, 1, 1, 1)

def parse_close_fill_stroke_even_odd(self, token='b*', params=''):
    self.gpath.close()
    finish_path(self, 1, 1, 0)

def parse_close_fill_stroke(self, token='b', params=''):
    self.gpath.close()
    finish_path(self, 1, 1, 1)

def parse_nop(self, token='n', params=''):
    finish_path(self, 0, 0, 0)

def finish_path(self, stroke, fill, fillmode):
    if self.gpath is not None:
        canv = self.canv
        canv._fillMode, oldmode = fillmode, canv._fillMode
        canv.drawPath(self.gpath, stroke, fill)
        canv._fillMode = oldmode
        self.gpath = None

def parse_clip_path(self, token='W', params=''):
    # TODO: add logging
    pass

def parse_clip_path_even_odd(self, token='W*', params=''):
    # TODO: add logging
    pass

def parse_stroke_gray(self, token='G', params='f'):
    self.canv.setStrokeGray(*params)

def parse_fill_gray(self, token='g', params='f'):
    self.canv.setFillGray(*params)

def parse_stroke_rgb(self, token='RG', params='fff'):
    self.canv.setStrokeColorRGB(*params)

def parse_fill_rgb(self, token='rg', params='fff'):
    self.canv.setFillColorRGB(*params)

def parse_stroke_cmyk(self, token='K', params='ffff'):
    self.canv.setStrokeColorCMYK(*params)

def parse_fill_cmyk(self, token='k', params='ffff'):
    self.canv.setFillColorCMYK(*params)
#############################################################################
# Text parsing
def parse_begin_text(self, token='BT', params=''):
    assert self.tpath is None
    self.tpath = self.canv.beginText()

def parse_text_transform(self, token='Tm', params='ffffff'):
    path = self.tpath

    # Stoopid optimization to remove nop
    try:
        code = path._code
    except AttributeError:
        pass
    else:
        if code[-1] == '1 0 0 1 0 0 Tm':
            code.pop()

    path.setTextTransform(*params)

def parse_setfont(self, token='Tf', params='nf'):
    fontinfo = self.fontdict[params[0]]
    self.tpath._setFont(fontinfo.name, params[1])
    self.curfont = fontinfo

def parse_text_out(self, token='Tj', params='t'):
    text = params[0].decode(self.curfont.remap, self.curfont.twobyte)
    self.tpath.textOut(text)

def parse_TJ(self, token='TJ', params='a'):
    remap = self.curfont.remap
    twobyte = self.curfont.twobyte
    result = []
    for x in params[0]:
        if isinstance(x, PdfString):
            result.append(x.decode(remap, twobyte))
        else:
            # TODO: Adjust spacing between characters here
            int(x)
    text = ''.join(result)
    self.tpath.textOut(text)

def parse_end_text(self, token='ET', params=''):
    assert self.tpath is not None
    self.canv.drawText(self.tpath)
    self.tpath=None

def parse_move_cursor(self, token='Td', params='ff'):
    self.tpath.moveCursor(params[0], -params[1])

def parse_set_leading(self, token='TL', params='f'):
    self.tpath.setLeading(*params)

def parse_text_line(self, token='T*', params=''):
    self.tpath.textLine()

def parse_set_char_space(self, token='Tc', params='f'):
    self.tpath.setCharSpace(*params)

def parse_set_word_space(self, token='Tw', params='f'):
    self.tpath.setWordSpace(*params)

def parse_set_hscale(self, token='Tz', params='f'):
    self.tpath.setHorizScale(params[0] - 100)

def parse_set_rise(self, token='Ts', params='f'):
    self.tpath.setRise(*params)

def parse_xobject(self, token='Do', params='n'):
    # TODO: Need to do this
    pass
class FontInfo(object):
    ''' Pretty basic -- needs a lot of work to work right for all fonts
    '''
    lookup = {
        'BitstreamVeraSans' : 'Helvetica', # WRONG -- have to learn about font stuff...
    }

    def __init__(self, source):
        name = source.BaseFont[1:]
        self.name = self.lookup.get(name, name)
        self.remap = chr
        self.twobyte = False
        info = source.ToUnicode
        if not info:
            return
        info = info.stream.split('beginbfchar')[1].split('endbfchar')[0]
        info = list(PdfTokens(info))
        assert not len(info) & 1
        info2 = []
        for x in info:
            assert x[0] == '<' and x[-1] == '>' and len(x) in (4,6), x
            i = int(x[1:-1], 16)
            info2.append(i)
        self.remap = dict((x,chr(y)) for (x,y) in zip(info2[::2], info2[1::2])).get
        self.twobyte = len(info[0]) > 4
#############################################################################
# Control structures
def findparsefuncs():

    def checkname(n):
        assert n.startswith('/')
        return n

    def checkarray(a):
        assert isinstance(a, list), a
        return a

    def checktext(t):
        assert isinstance(t, PdfString)
        return t

    fixparam = dict(f=float, i=int, n=checkname, a=checkarray, s=str, t=checktext)
    fixcache = {}
    def fixlist(params):
        try:
            result = fixcache[params]
        except KeyError:
            result = tuple(fixparam[x] for x in params)
            fixcache[params] = result
        return result

    dispatch = {}
    expected_args = 'self token params'.split()
    for key, func in globals().iteritems():
        if key.startswith('parse_'):
            args, varargs, keywords, defaults = getargspec(func)
            assert args == expected_args and varargs is None \
                    and keywords is None and len(defaults) == 2, \
                    (key, args, varargs, keywords, defaults)
            token, params = defaults
            if params is not None:
                params = fixlist(params)
            value = func, params
            assert dispatch.setdefault(token, value) is value, repr(token)
    return dispatch
class _ParseClass(object):
    dispatch = findparsefuncs()

    @classmethod
    def parsepage(cls, page, canvas=None):
        self = cls()
        contents = page.Contents
        if contents.Filter is not None:
            raise SystemExit('Cannot parse graphics -- page encoded with %s' % contents.Filter)
        dispatch = cls.dispatch.get
        self.tokens = tokens = iter(PdfTokens(contents.stream))
        self.params = params = []
        self.canv = canvas
        self.gpath = None
        self.tpath = None
        self.fontdict = dict((x,FontInfo(y)) for (x, y) in page.Resources.Font.iteritems())

        for token in self.tokens:
            info = dispatch(token)
            if info is None:
                params.append(token)
                continue
            func, paraminfo = info
            if paraminfo is None:
                func(self, token, ())
                continue
            delta = len(params) - len(paraminfo)
            if delta:
                if delta < 0:
                    print 'Operator %s expected %s parameters, got %s' % (token, len(paraminfo), params)
                    params[:] = []
                    continue
                else:
                    print "Unparsed parameters/commands:", params[:delta]
                del params[:delta]
            paraminfo = zip(paraminfo, params)
            try:
                params[:] = [x(y) for (x,y) in paraminfo]
            except:
                for i, (x,y) in enumerate(paraminfo):
                    try:
                        x(y)
                    except:
                        raise # For now
                    continue
            func(self, token, params)
            params[:] = []
def debugparser(undisturbed = set('parse_array'.split())):
    def debugdispatch():
        def getvalue(oldval):
            name = oldval[0].__name__
            def myfunc(self, token, params):
                print '%s called %s(%s)' % (token, name, ', '.join(str(x) for x in params))
            if name in undisturbed:
                myfunc = oldval[0]
            return myfunc, oldval[1]
        return dict((x, getvalue(y)) for (x,y) in _ParseClass.dispatch.iteritems())

    class _DebugParse(_ParseClass):
        dispatch = debugdispatch()

    return _DebugParse.parsepage

parsepage = _ParseClass.parsepage

if __name__ == '__main__':
    import sys
    from pdfreader import PdfReader
    parse = debugparser()
    fname, = sys.argv[1:]
    pdf = PdfReader(fname)
    for i, page in enumerate(pdf.pages):
        print '\nPage %s ------------------------------------' % i
        parse(page)
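
The dispatch table in this file is derived from the default arguments of the parse_* functions: the default for token names the PDF operator, and the default for params spells out the expected operand types. The following is a minimal standalone sketch of that idea, written for Python 3 with inspect.signature standing in for getargspec. The two handlers and the build_dispatch helper are hypothetical names used only for illustration; they are not part of this module.

```python
import inspect

def parse_move_to(state, token='m', params='ff'):
    # Hypothetical handler: record a move with two float operands.
    state.append(('move', params))

def parse_close_path(state, token='h', params=''):
    # Hypothetical handler: takes no operands.
    state.append(('close', params))

def build_dispatch(funcs):
    """Map each operator token to (handler, converters), read from defaults."""
    converters = {'f': float, 'i': int}
    dispatch = {}
    for func in funcs:
        if not func.__name__.startswith('parse_'):
            continue
        sig = inspect.signature(func)
        token = sig.parameters['token'].default
        spec = sig.parameters['params'].default
        # Refuse two handlers that claim the same operator.
        assert token not in dispatch, token
        dispatch[token] = (func, tuple(converters[c] for c in spec))
    return dispatch

dispatch = build_dispatch([parse_move_to, parse_close_path])
handler, convs = dispatch['m']
state = []
# Convert the raw operand tokens, then call the handler for the operator.
handler(state, 'm', [conv(raw) for conv, raw in zip(convs, ['10', '20.5'])])
```

Keeping the operator name and operand signature next to the handler means a new operator only needs a new function; the table never has to be edited by hand.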


@ -0,0 +1,33 @@
'''
find_xxx.py -- Find the place in the tree where xxx lives.

Ways to use:
    1) Make a copy, change 'xxx' in package to be your name; or
    2) Under Linux, just ln -s to where this is in the right tree

Created by Pat Maupin, who doesn't consider it big enough to be worth copyrighting
'''

import sys
import os

myname = __name__[5:]   # remove 'find_'
myname = os.path.join(myname, '__init__.py')

def trypath(newpath):
    path = None
    while path != newpath:
        path = newpath
        if os.path.exists(os.path.join(path, myname)):
            return path
        newpath = os.path.dirname(path)

root = trypath(__file__) or trypath(os.path.realpath(__file__))

if root is None:
    print
    print 'Warning: %s: Could not find path to development package %s' % (__file__, myname)
    print ' The import will either fail or will use system-installed libraries'
    print
elif root not in sys.path:
    sys.path.append(root)

41
examples/rotate.py Executable file

@ -0,0 +1,41 @@
#!/usr/bin/env python

'''
usage:   rotate.py my.pdf rotation [page[range] ...]
         eg. rotate.py my.pdf 270 1-3 5 7-9

Rotation must be multiple of 90 degrees, clockwise.

Creates rotate.my.pdf with selected pages rotated.  Rotates all by default.
'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter

inpfn = sys.argv[1]
rotate = sys.argv[2]
ranges = sys.argv[3:]

rotate = int(rotate)
assert rotate % 90 == 0

ranges = [[int(y) for y in x.split('-')] for x in ranges]
outfn = 'rotate.%s' % os.path.basename(inpfn)
trailer = PdfReader(inpfn)
pages = trailer.pages

if not ranges:
    ranges = [[1, len(pages)]]

for onerange in ranges:
    # A single page number N is treated as the range N-N
    onerange = (onerange + onerange[-1:])[:2]
    for pagenum in range(onerange[0]-1, onerange[1]):
        pages[pagenum].Rotate = (int(pages[pagenum].inheritable.Rotate or 0) + rotate) % 360

outdata = PdfWriter()
outdata.trailer = trailer
outdata.write(outfn)
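
The range handling in this script leans on one small trick: appending the last element of a parsed range to itself and truncating to two items, so that a lone page number becomes a one-page range. A minimal standalone sketch of that logic, plus the modular /Rotate update, follows. The helper names expand_ranges and rotated are assumptions made for this illustration and do not exist in pdfrw.

```python
def expand_ranges(specs):
    """Turn ['1-3', '5'] into zero-based page indexes [0, 1, 2, 4]."""
    indexes = []
    for spec in specs:
        bounds = [int(part) for part in spec.split('-')]
        # Duplicate the last bound so that '5' behaves like '5-5'.
        first, last = (bounds + bounds[-1:])[:2]
        indexes.extend(range(first - 1, last))
    return indexes

def rotated(current, delta):
    """New /Rotate value: add the delta and wrap into 0..359."""
    # A missing /Rotate entry (None) counts as 0 degrees.
    return (int(current or 0) + delta) % 360

page_indexes = expand_ranges(['1-3', '5', '7-9'])
```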

30
examples/subset.py Executable file

@ -0,0 +1,30 @@
#!/usr/bin/env python

'''
usage:   subset.py my.pdf page[range] [page[range]] ...
         eg. subset.py my.pdf 1-3 5 7-9

Creates subset.my.pdf
'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter

inpfn = sys.argv[1]
ranges = sys.argv[2:]
assert ranges, "Expected at least one range"

ranges = ([int(y) for y in x.split('-')] for x in ranges)
outfn = 'subset.%s' % os.path.basename(inpfn)
pages = PdfReader(inpfn).pages
outdata = PdfWriter()

for onerange in ranges:
    # A single page number N is treated as the range N-N
    onerange = (onerange + onerange[-1:])[:2]
    for pagenum in range(onerange[0], onerange[1]+1):
        outdata.addpage(pages[pagenum-1])
outdata.write(outfn)

114
examples/watermark.py Executable file

@ -0,0 +1,114 @@
#!/usr/bin/env python

'''
Simple example of watermarking using form xobjects (pdfrw).

usage:   watermark.py -i my.pdf -w single_page.pdf
         watermark.py -d pdfdir -w single_page.pdf [-o outdir]

The first form creates watermark.my.pdf, with every page overlaid with
first page from single_page.pdf.  The second form watermarks every
PDF file in pdfdir and writes the results to outdir (default 'tmp').
'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfDict, PdfName, IndirectPdfDict, PdfArray
from pdfrw.buildxobj import pagexobj

def fixpage(page, watermark):

    # Find the page's resource dictionary. Create if none
    resources = page.inheritable.Resources
    if resources is None:
        resources = page.Resources = PdfDict()

    # Find or create the parent's xobject dictionary
    xobjdict = resources.XObject
    if xobjdict is None:
        xobjdict = resources.XObject = PdfDict()

    # Allow for an infinite number of cascaded watermarks
    index = 0
    while 1:
        watermark_name = '/Watermark.%d' % index
        if watermark_name not in xobjdict:
            break
        index += 1
    xobjdict[watermark_name] = watermark

    # Turn the contents into an array if it is not already one
    contents = page.Contents
    if not isinstance(contents, PdfArray):
        contents = page.Contents = PdfArray([contents])

    # Save initial state before executing page
    contents.insert(0, IndirectPdfDict(stream='q\n'))

    # Restore initial state and append the watermark
    contents.append(IndirectPdfDict(stream='Q %s Do\n' % watermark_name))
    return page

def watermark(input_fname, watermark_fname, output_fname=None):
    outfn = output_fname or ('watermark.' + os.path.basename(input_fname))
    w = pagexobj(PdfReader(watermark_fname).pages[0])
    pages = PdfReader(input_fname).pages
    PdfWriter().addpages([fixpage(x, w) for x in pages]).write(outfn)
    return outfn

def fix_pdf(fname, watermark_fname, indir, outdir):
    from os import mkdir, path
    if not path.exists(outdir):
        mkdir(outdir)
    watermark = pagexobj(PdfReader(watermark_fname).pages[0])
    trailer = PdfReader(path.join(indir, fname))
    for page in trailer.pages:
        fixpage(page, watermark)
    PdfWriter().write(path.join(outdir, fname), trailer)
    return len(trailer.pages)

def batch_watermark(pdfdir, watermark_fname, outputdir='tmp'):
    import traceback
    from glob import glob
    from os import path
    fnames=glob(pdfdir+"/*.pdf")
    total_pages = 0
    good_files = 0
    for fname in fnames:
        fname = fname.replace(pdfdir+'/','')
        try:
            total_pages += fix_pdf(fname, watermark_fname, pdfdir, outputdir)
            good_files += 1
            print "%s OK" %fname
        except Exception:
            print "%s Failed miserably" %fname
            print traceback.format_exc()[:2000]
            #raise
    print "success %.2f%% %s pages" %((float(good_files)/len(fnames))*100, total_pages)

if __name__ == "__main__":
    from optparse import OptionParser
    parser = OptionParser(description = __doc__)
    parser.add_option('-i', dest='input_fname', help='file name to be watermarked (pdf)')
    parser.add_option('-w', dest='watermark_fname', help='watermark file name (pdf)')
    parser.add_option('-d', dest='pdfdir', help='watermark all pdf files in this directory')
    parser.add_option('-o', dest='outdir', help='outputdir used with option -d', default='tmp')
    options, args = parser.parse_args()
    if options.input_fname and options.watermark_fname:
        watermark = pagexobj(PdfReader(options.watermark_fname).pages[0])
        outfn = 'watermark.' + os.path.basename(options.input_fname)
        pages = PdfReader(options.input_fname).pages
        PdfWriter().addpages([fixpage(x, watermark) for x in pages]).write(outfn)
    elif options.pdfdir and options.watermark_fname:
        batch_watermark(options.pdfdir, options.watermark_fname, options.outdir)
    else:
        parser.print_help()
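
Two steps in fixpage carry most of the logic: choosing an XObject name that is not yet taken, and bracketing the existing page content with q and Q so the watermark is drawn from a clean graphics state. A minimal sketch of both steps on plain Python containers follows. The helpers next_free_name and wrap_contents are hypothetical and operate on an ordinary dict and list of strings rather than on pdfrw objects.

```python
def next_free_name(xobjects, prefix='/Watermark'):
    """Return the first '<prefix>.<n>' key not already present in xobjects."""
    index = 0
    while '%s.%d' % (prefix, index) in xobjects:
        index += 1
    return '%s.%d' % (prefix, index)

def wrap_contents(streams, name):
    """Save state, run the original streams, restore state, draw the XObject."""
    return ['q\n'] + list(streams) + ['Q %s Do\n' % name]

existing = {'/Watermark.0': 'first', '/Watermark.1': 'second'}
name = next_free_name(existing)
wrapped = wrap_contents(['BT ET\n'], name)
```

Because the name search always starts from zero, running the script repeatedly on its own output stacks watermarks instead of overwriting the earlier ones.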

16
pdfrw/__init__.py Normal file

@ -0,0 +1,16 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
__version__ = '0.1'
from pdfrw.pdfwriter import PdfWriter
from pdfrw.pdfreader import PdfReader
from pdfrw.objects import PdfObject, PdfName, PdfArray, PdfDict, IndirectPdfDict, PdfString
from pdfrw.tokens import PdfTokens
from pdfrw.errors import PdfParseError
# Add a tiny bit of compatibility to pyPdf
PdfFileReader = PdfReader
PdfFileWriter = PdfWriter

249
pdfrw/buildxobj.py Normal file

@ -0,0 +1,249 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''

This module contains code to build PDF "Form XObjects".

A Form XObject allows a fragment from one PDF file to be cleanly
included in another PDF file.

Reference for syntax: "Parameters for opening PDF files" from SDK 8.1

        http://www.adobe.com/devnet/acrobat/pdfs/pdf_open_parameters.pdf

        supported 'page=xxx', 'viewrect=<left>,<top>,<width>,<height>'

        Also supported by this, but not by Adobe:
            'rotate=xxx' where xxx in [0, 90, 180, 270]

        Units are in points

Reference for content:   Adobe PDF reference, sixth edition, version 1.7

        http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

        Form xobjects discussed chapter 4.9, page 355
'''
from pdfrw.objects import PdfDict, PdfArray, PdfName
from pdfrw.pdfreader import PdfReader
from pdfrw.errors import log
class ViewInfo(object):
    ''' Instantiate ViewInfo with a uri, and it will parse out
        the filename, page, and viewrect into object attributes.
    '''
    doc = None
    docname = None
    page = None
    viewrect = None
    rotate = None

    def __init__(self, pageinfo='', **kw):
        pageinfo=pageinfo.split('#',1)
        if len(pageinfo) == 2:
            pageinfo[1:] = pageinfo[1].replace('&', '#').split('#')
        for key in 'page viewrect'.split():
            if pageinfo[0].startswith(key+'='):
                break
        else:
            self.docname = pageinfo.pop(0)
        for item in pageinfo:
            key, value = item.split('=')
            key = key.strip()
            value = value.replace(',', ' ').split()
            if key in ('page', 'rotate'):
                assert len(value) == 1
                setattr(self, key, int(value[0]))
            elif key == 'viewrect':
                assert len(value) == 4
                setattr(self, key, [float(x) for x in value])
            else:
                log.error('Unknown option: %s', key)
        for key, value in kw.iteritems():
            assert hasattr(self, key), key
            setattr(self, key, value)

def get_rotation(rotate):
    ''' Return clockwise rotation code:
          0 = unrotated
          1 = 90 degrees
          2 = 180 degrees
          3 = 270 degrees
    '''
    try:
        rotate = int(rotate)
    except (ValueError, TypeError):
        return 0
    if rotate % 90 != 0:
        return 0
    return rotate / 90

def rotate_point(point, rotation):
    ''' Rotate an (x,y) coordinate clockwise by a
        rotation code specifying a multiple of 90 degrees.
    '''
    if rotation & 1:
        point = point[1], -point[0]
    if rotation & 2:
        point = -point[0], -point[1]
    return point

def rotate_rect(rect, rotation):
    ''' Rotate both points within the rectangle, then normalize
        the rectangle by returning the new lower left, then new
        upper right.
    '''
    rect = rotate_point(rect[:2], rotation) + rotate_point(rect[2:], rotation)
    return (min(rect[0], rect[2]), min(rect[1], rect[3]),
            max(rect[0], rect[2]), max(rect[1], rect[3]))

def getrects(inheritable, pageinfo, rotation):
    ''' Given the inheritable attributes of a page and
        the desired pageinfo rectangle, return the page's
        media box and the calculated boundary (clip) box.
    '''
    mbox = tuple([float(x) for x in inheritable.MediaBox])
    vrect = pageinfo.viewrect
    if vrect is None:
        cbox = tuple([float(x) for x in (inheritable.CropBox or mbox)])
    else:
        # Rotate the media box to match what the user sees,
        # figure out the clipping box, then rotate back
        mleft, mbot, mright, mtop = rotate_rect(mbox, rotation)
        x, y, w, h = vrect
        cleft = mleft + x
        ctop = mtop - y
        cright = cleft + w
        cbot = ctop - h
        cbox = max(mleft, cleft), max(mbot, cbot), min(mright, cright), min(mtop, ctop)
        cbox = rotate_rect(cbox, -rotation)
    return mbox, cbox

def _cache_xobj(contents, resources, mbox, bbox, rotation):
    ''' Return a cached Form XObject, or create a new one and cache it.
        Adds private members x, y, w, h
    '''
    cachedict = contents.xobj_cachedict
    if cachedict is None:
        cachedict = contents.private.xobj_cachedict = {}
    cachekey = mbox, bbox, rotation
    result = cachedict.get(cachekey)
    if result is None:
        func = (_get_fullpage, _get_subpage)[mbox != bbox]
        result = PdfDict(
            func(contents, resources, mbox, bbox, rotation),
            Type = PdfName.XObject,
            Subtype = PdfName.Form,
            FormType = 1,
            BBox = PdfArray(bbox),
        )
        rect = bbox
        if rotation:
            matrix = rotate_point((1, 0), rotation) + rotate_point((0, 1), rotation)
            result.Matrix = PdfArray(matrix + (0, 0))
            rect = rotate_rect(rect, rotation)

        result.private.x = rect[0]
        result.private.y = rect[1]
        result.private.w = rect[2] - rect[0]
        result.private.h = rect[3] - rect[1]
        cachedict[cachekey] = result
    return result

def _get_fullpage(contents, resources, mbox, bbox, rotation):
    ''' fullpage is easy.  Just copy the contents,
        set up the resources, and let _cache_xobj handle the
        rest.
    '''
    return PdfDict(contents, Resources=resources)

def _get_subpage(contents, resources, mbox, bbox, rotation):
    ''' subpages *could* be as easy as full pages, but we
        choose to complicate life by creating a Form XObject
        for the page, and then one that references it for
        the subpage, on the off-chance that we want multiple
        items from the page.
    '''
    return PdfDict(
        stream = '/FullPage Do\n',
        Resources = PdfDict(
            XObject = PdfDict(
                FullPage = _cache_xobj(contents, resources, mbox, mbox, 0)
            )
        )
    )

def pagexobj(page, viewinfo=ViewInfo(), allow_compressed=True):
    ''' pagexobj creates and returns a Form XObject for
        a given view within a page (Defaults to entire page.)
    '''
    inheritable = page.inheritable
    resources = inheritable.Resources
    rotation = get_rotation(inheritable.Rotate)
    mbox, bbox = getrects(inheritable, viewinfo, rotation)
    rotation += get_rotation(viewinfo.rotate)
    contents = page.Contents
    # Make sure the only attribute is length
    # All the filters must have been executed
    assert int(contents.Length) == len(contents.stream)
    if not allow_compressed:
        assert len([x for x in contents.iteritems()]) == 1

    return _cache_xobj(contents, resources, mbox, bbox, rotation)

def docxobj(pageinfo, doc=None, allow_compressed=True):
    ''' docxobj creates and returns an actual Form XObject.
        Can work standalone, or in conjunction with
        the CacheXObj class (below).
    '''
    if not isinstance(pageinfo, ViewInfo):
        pageinfo = ViewInfo(pageinfo)

    # If we're explicitly passed a document,
    # make sure we don't have one implicitly as well.
    # If no implicit or explicit doc, then read one in
    # from the filename.
    if doc is not None:
        assert pageinfo.doc is None
        pageinfo.doc = doc
    elif pageinfo.doc is not None:
        doc = pageinfo.doc
    else:
        doc = pageinfo.doc = PdfReader(pageinfo.docname, decompress = not allow_compressed)
    assert isinstance(doc, PdfReader)

    sourcepage = doc.pages[(pageinfo.page or 1) - 1]
    return pagexobj(sourcepage, pageinfo, allow_compressed)

class CacheXObj(object):
    ''' Use to keep from reparsing files over and over,
        and to keep from making the output too much
        bigger than it ought to be by replicating
        unnecessary object copies.
    '''
    def __init__(self, decompress=False):
        ''' Set decompress true if you need
            the Form XObjects to be decompressed.
            Will decompress what it can and scream
            about the rest.
        '''
        self.cached_pdfs = {}
        self.decompress = decompress

    def load(self, sourcename):
        ''' Load a Form XObject from a uri
        '''
        info = ViewInfo(sourcename)
        fname = info.docname
        pcache = self.cached_pdfs
        doc = pcache.get(fname)
        if doc is None:
            doc = pcache[fname] = PdfReader(fname, decompress=self.decompress)
        return docxobj(info, doc, allow_compressed=not self.decompress)
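
The rotation helpers in this module encode a clockwise rotation as a quarter-turn count and apply it with two bit tests. The worked example below copies the two helper bodies so that it runs on its own, then rotates a US Letter media box by one quarter turn and back again. The variable names letter, turned and restored are introduced here purely for illustration.

```python
def rotate_point(point, rotation):
    """Rotate (x, y) clockwise by `rotation` quarter turns."""
    if rotation & 1:
        point = point[1], -point[0]
    if rotation & 2:
        point = -point[0], -point[1]
    return point

def rotate_rect(rect, rotation):
    """Rotate both corners, then return (left, bottom, right, top)."""
    rect = rotate_point(rect[:2], rotation) + rotate_point(rect[2:], rotation)
    return (min(rect[0], rect[2]), min(rect[1], rect[3]),
            max(rect[0], rect[2]), max(rect[1], rect[3]))

letter = (0, 0, 612, 792)
turned = rotate_rect(letter, 1)      # one quarter turn clockwise
# A negative code undoes the turn: -1 has both low bits set, which the
# bit tests treat as three quarter turns (270 degrees).
restored = rotate_rect(turned, -1)
```

This is why getrects can call rotate_rect(cbox, -rotation) to map a clip box back into the page's unrotated coordinate space.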

26
pdfrw/compress.py Normal file

@ -0,0 +1,26 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''
Currently, this sad little file only knows how to compress
using the flate (zlib) algorithm. Maybe more later, but it's
not a priority for me...
'''
import zlib
from pdfrw.objects import PdfDict, PdfName
from pdfrw.errors import log
from pdfrw.uncompress import streamobjects
def compress(mylist):
    flate = PdfName.FlateDecode
    for obj in streamobjects(mylist):
        ftype = obj.Filter
        if ftype is not None:
            continue
        oldstr = obj.stream
        newstr = zlib.compress(oldstr)
        if len(newstr) < len(oldstr) + 30:
            obj.stream = newstr
            obj.Filter = flate
            obj.DecodeParms = None
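
The compress function only swaps in the deflated stream when it is not much larger than the original, allowing 30 bytes of slack. A minimal sketch of that decision on raw bytes, using only the standard zlib module, is shown below. The function name maybe_deflate and its slack parameter are assumptions made for this illustration.

```python
import zlib

def maybe_deflate(data, slack=30):
    """Return (stream, filter_name); filter_name is None if left uncompressed."""
    packed = zlib.compress(data)
    # Accept the compressed form unless it grew by `slack` bytes or more.
    if len(packed) < len(data) + slack:
        return packed, '/FlateDecode'
    return data, None

content = b'BT /F1 12 Tf (Hello) Tj ET\n' * 200
stream, filter_name = maybe_deflate(content)
```

With the default slack nearly every stream ends up compressed, since zlib adds only a few bytes of overhead even to incompressible input.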

31
pdfrw/errors.py Normal file

@ -0,0 +1,31 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''
PDF Exceptions and error handling
'''
import logging
from exceptions import Exception
logging.basicConfig(
    format='[%(levelname)s] %(filename)s:%(lineno)d %(message)s',
    level=logging.WARNING)

log = logging.getLogger('pdfrw')

class PdfError(Exception):
    "Abstract base class of exceptions thrown by this module"
    def __init__(self, msg):
        self.msg = msg
    def __str__(self):
        return self.msg

class PdfParseError(PdfError):
    "Error thrown by parser/tokenizer"

class PdfOutputError(PdfError):
    "Error thrown by PDF writer"

16
pdfrw/objects/__init__.py Normal file

@ -0,0 +1,16 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''
Objects that can occur in PDF files. The most important
objects are arrays and dicts. Either of these can be
indirect or not, and dicts could have an associated
stream.
'''
from pdfrw.objects.pdfname import PdfName
from pdfrw.objects.pdfdict import PdfDict, IndirectPdfDict
from pdfrw.objects.pdfarray import PdfArray
from pdfrw.objects.pdfobject import PdfObject
from pdfrw.objects.pdfstring import PdfString
from pdfrw.objects.pdfindirect import PdfIndirect

59
pdfrw/objects/pdfarray.py Normal file

@ -0,0 +1,59 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
from pdfrw.objects.pdfindirect import PdfIndirect
from pdfrw.objects.pdfobject import PdfObject
def _resolved():
    pass

class PdfArray(list):
    ''' A PdfArray maps the PDF file array object into a Python list.
        It has an indirect attribute which defaults to False.
    '''
    indirect = False

    def __init__(self, source=[]):
        self._resolve = self._resolver
        self.extend(source)

    def _resolver(self, isinstance=isinstance, enumerate=enumerate,
                        listiter=list.__iter__,
                        PdfIndirect=PdfIndirect, resolved=_resolved,
                        PdfNull=PdfObject('null')):
        for index, value in enumerate(list.__iter__(self)):
            if isinstance(value, PdfIndirect):
                value = value.real_value()
                if value is None:
                    value = PdfNull
                self[index] = value
        self._resolve = resolved

    def __getitem__(self, index, listget=list.__getitem__):
        self._resolve()
        return listget(self, index)

    def __getslice__(self, i, j, listget=list.__getslice__):
        # Python 2 passes both slice bounds, so accept and forward both
        self._resolve()
        return listget(self, i, j)

    def __iter__(self, listiter=list.__iter__):
        self._resolve()
        return listiter(self)

    def count(self, item):
        self._resolve()
        return list.count(self, item)

    def index(self, item):
        self._resolve()
        return list.index(self, item)

    def remove(self, item):
        self._resolve()
        return list.remove(self, item)

    def sort(self, *args, **kw):
        self._resolve()
        return list.sort(self, *args, **kw)

    def pop(self, *args):
        self._resolve()
        return list.pop(self, *args)
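
PdfArray resolves its indirect members lazily: the first read walks the list once, replaces placeholders with real values, and then rebinds the resolver to a no-op so later reads pay almost nothing. A minimal standalone sketch of that pattern is given below. Placeholder and LazyList are hypothetical stand-ins for PdfIndirect and PdfArray, written for Python 3, where slicing also goes through __getitem__.

```python
class Placeholder(object):
    """Hypothetical stand-in for an unresolved indirect reference."""
    def __init__(self, loader):
        self.loader = loader

    def real_value(self):
        return self.loader()

class LazyList(list):
    def __init__(self, source=()):
        list.__init__(self, source)
        self._resolve = self._resolver

    def _resolver(self):
        # Replace every placeholder in place, exactly once.
        for index, value in enumerate(list.__iter__(self)):
            if isinstance(value, Placeholder):
                list.__setitem__(self, index, value.real_value())
        # Later reads call this cheap no-op instead.
        self._resolve = lambda: None

    def __getitem__(self, index):
        self._resolve()
        return list.__getitem__(self, index)

    def __iter__(self):
        self._resolve()
        return list.__iter__(self)

calls = []

def load_page():
    calls.append('loaded')
    return 'page object'

items = LazyList([1, Placeholder(load_page)])
calls_before_read = list(calls)
second = items[1]
everything = list(items)
```

Every read path that can expose a raw placeholder needs the same guard, which is why the real class overrides count, index, remove, sort and pop as well.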

205
pdfrw/objects/pdfdict.py Normal file

@ -0,0 +1,205 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
from pdfrw.objects.pdfname import PdfName
from pdfrw.objects.pdfindirect import PdfIndirect
from pdfrw.objects.pdfobject import PdfObject
class _DictSearch(object):
    ''' Used to search for inheritable attributes.
    '''
    def __init__(self, basedict):
        self.basedict = basedict

    def __getattr__(self, name, PdfName=PdfName):
        return self[PdfName(name)]

    def __getitem__(self, name, set=set, getattr=getattr, id=id):
        visited = set()
        mydict = self.basedict
        while 1:
            value = mydict[name]
            if value is not None:
                return value
            myid = id(mydict)
            assert myid not in visited
            visited.add(myid)
            mydict = mydict.Parent
            if mydict is None:
                return

class _Private(object):
    ''' Used to store private attributes (not output to PDF files)
        on PdfDict classes
    '''
    def __init__(self, pdfdict):
        vars(self)['pdfdict'] = pdfdict

    def __setattr__(self, name, value):
        vars(self.pdfdict)[name] = value

class PdfDict(dict):
    ''' PdfDict objects are subclassed dictionaries with the following features:

        - Every key in the dictionary starts with "/"

        - A dictionary item can be deleted by assigning it to None

        - Keys that (after the initial "/") conform to Python naming conventions
          can also be accessed (set and retrieved) as attributes of the dictionary.
          E.g.  mydict.Page is the same thing as mydict['/Page']

        - Private attributes (not in the PDF space) can be set on the dictionary
          object attribute dictionary by using the private attribute:

                mydict.private.foo = 3
                mydict.foo = 5
                x = mydict.foo       # x will now contain 3
                y = mydict['/foo']   # y will now contain 5

          Most standard Adobe dictionary keys start with an upper case letter,
          so to avoid conflicts, it is best to start private attributes with
          lower case letters.

        - PdfDicts have the following read-only properties:

            - private -- as discussed above, provides write access to dictionary's
                         attributes
            - inheritable -- this creates and returns a "view" attribute that
                         will search through the object hierarchy for any desired
                         attribute, such as /Rotate or /MediaBox

        - PdfDicts also have the following special attributes:

            - indirect is not stored in the PDF dictionary, but in the object's
              attribute dictionary
            - stream is also stored in the object's attribute dictionary
              and will also update the stream length.
            - _stream will store in the object's attribute dictionary without
              updating the stream length.

          It is possible, for example, to have a PDF name such as "/indirect"
          or "/stream", but you cannot access such a name as an attribute:

                mydict.indirect -- accesses object's attribute dictionary
                mydict["/indirect"] -- accesses actual PDF dictionary
    '''
    indirect = False
    stream = None

    _special = dict(indirect = ('indirect', False),
                    stream = ('stream', True),
                    _stream = ('stream', False),
                   )

    def __setitem__(self, name, value, setter=dict.__setitem__):
        assert name.startswith('/'), name
        if value is not None:
            setter(self, name, value)
        elif name in self:
            del self[name]

    def __init__(self, *args, **kw):
        if args:
            if len(args) == 1:
                args = args[0]
            self.update(args)
            if isinstance(args, PdfDict):
                self.indirect = args.indirect
                self._stream = args.stream
        for key, value in kw.iteritems():
            setattr(self, key, value)

    def __getattr__(self, name, PdfName=PdfName):
        ''' If the attribute doesn't exist on the dictionary object,
            try to slap a '/' in front of it and get it out
            of the actual dictionary itself.
        '''
        return self.get(PdfName(name))

    def get(self, key, dictget=dict.get, isinstance=isinstance, PdfIndirect=PdfIndirect):
        ''' Get a value out of the dictionary, after resolving any indirect objects.
        '''
        value = dictget(self, key)
        if isinstance(value, PdfIndirect):
            self[key] = value = value.real_value()
        return value

    def __getitem__(self, key):
        return self.get(key)

    def __setattr__(self, name, value, special=_special.get, PdfName=PdfName, vars=vars):
        ''' Set an attribute on the dictionary.  Handle the keywords
            indirect, stream, and _stream specially (for content objects)
        '''
        info = special(name)
        if info is None:
            self[PdfName(name)] = value
        else:
            name, setlen = info
            vars(self)[name] = value
            if setlen:
                notnone = value is not None
                self.Length = notnone and PdfObject(len(value)) or None

    def iteritems(self, dictiter=dict.iteritems, isinstance=isinstance, PdfIndirect=PdfIndirect):
        ''' Iterate over the dictionary, resolving any unresolved objects
        '''
        for key, value in list(dictiter(self)):
            if isinstance(value, PdfIndirect):
                self[key] = value = value.real_value()
            if value is not None:
                assert key.startswith('/'), (key, value)
                yield key, value

    def items(self):
        return list(self.iteritems())

    def itervalues(self):
        for key, value in self.iteritems():
            yield value

    def values(self):
        return list((value for key, value in self.iteritems()))

    def keys(self):
        return list((key for key, value in self.iteritems()))

    def __iter__(self):
        for key, value in self.iteritems():
            yield key

    def iterkeys(self):
        return iter(self)

    def copy(self):
        return type(self)(self)

    def pop(self, key):
        value = self.get(key)
        del self[key]
        return value

    def popitem(self):
        # dict.pop requires a key; dict.popitem removes an arbitrary item
        key, value = dict.popitem(self)
        if isinstance(value, PdfIndirect):
            value = value.real_value()
        return key, value

    def inheritable(self):
        ''' Search through ancestors as needed for inheritable
            dictionary items.
            NOTE:  You might think it would be a good idea
            to cache this class, but then you'd have to worry
            about it pointing to the wrong dictionary if you
            made a copy of the object...
        '''
        return _DictSearch(self)
    inheritable = property(inheritable)

    def private(self):
        ''' Allows setting private metadata for use in
            processing (not sent to PDF file).
            See note on inheritable
        '''
        return _Private(self)
    private = property(private)

class IndirectPdfDict(PdfDict):
    ''' IndirectPdfDict is a convenience class.  You could
        create a direct PdfDict and then set indirect = True on it,
        or you could just create an IndirectPdfDict.
    '''
    indirect = True
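
The inheritable property walks the /Parent chain until it finds the requested key, and guards against a malformed file whose parents form a loop. A minimal sketch of that search on plain dicts follows. The function find_inherited and the sample page tree are assumptions made for this illustration, and the sketch raises ValueError on a cycle where the original uses an assert.

```python
def find_inherited(node, key):
    """Return the nearest value for key on node or its /Parent ancestors."""
    visited = set()
    while node is not None:
        value = node.get(key)
        if value is not None:
            return value
        # A node seen twice means the /Parent chain loops back on itself.
        if id(node) in visited:
            raise ValueError('cycle in /Parent chain')
        visited.add(id(node))
        node = node.get('/Parent')
    return None

root = {'/MediaBox': [0, 0, 612, 792], '/Rotate': 90}
pages = {'/Parent': root}
page = {'/Parent': pages, '/Rotate': 0}

media_box = find_inherited(page, '/MediaBox')   # found two levels up
rotation = find_inherited(page, '/Rotate')      # the page's own 0 wins
```

The test is "is not None" rather than truthiness, so an explicit value such as a /Rotate of 0 on the page is not overridden by an ancestor.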


@ -0,0 +1,20 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
class _NotLoaded(object):
    pass

class PdfIndirect(tuple):
    ''' A placeholder for an object that hasn't been read in yet.
        The object itself is the (object number, generation number) tuple.
        The attributes include information about where the object is
        referenced from and the file object to retrieve the real object from.
    '''
    value = _NotLoaded

    def real_value(self, NotLoaded=_NotLoaded):
        value = self.value
        if value is NotLoaded:
            value = self.value = self._loader(self)
        return value

17
pdfrw/objects/pdfname.py Normal file

@ -0,0 +1,17 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
from pdfrw.objects.pdfobject import PdfObject
class PdfName(object):
    ''' PdfName is a simple way to get a PDF name from a string:

                PdfName.FooBar == PdfObject('/FooBar')
    '''
    def __getattr__(self, name):
        return self(name)
    def __call__(self, name, PdfObject=PdfObject):
        return PdfObject('/' + name)

PdfName = PdfName()


@ -0,0 +1,10 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
class PdfObject(str):
    ''' A PdfObject is a textual representation of any PDF file object
        other than an array, dict or string. It has an indirect attribute
        which defaults to False.
    '''
    indirect = False


@ -0,0 +1,73 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
import re
class PdfString(str):
    ''' A PdfString is an encoded string.  It has a decode
        method to get the actual string data out, and there
        is an encode class method to create such a string.
        Like any PDF object, it could be indirect, but it
        defaults to being a direct object.
    '''
    indirect = False
    unescape_dict = {'\\b':'\b', '\\f':'\f', '\\n':'\n',
                     '\\r':'\r', '\\t':'\t',
                     '\\\r\n': '', '\\\r':'', '\\\n':'',
                     '\\\\':'\\', '\\':'',
                    }
    unescape_pattern = r'(\\\\|\\b|\\f|\\n|\\r|\\t|\\\r\n|\\\r|\\\n|\\[0-9]+|\\)'
    unescape_func = re.compile(unescape_pattern).split

    hex_pattern = '([a-fA-F0-9][a-fA-F0-9]|[a-fA-F0-9])'
    hex_func = re.compile(hex_pattern).split

    hex_pattern2 = '([a-fA-F0-9][a-fA-F0-9][a-fA-F0-9][a-fA-F0-9]|[a-fA-F0-9][a-fA-F0-9]|[a-fA-F0-9])'
    hex_func2 = re.compile(hex_pattern2).split

    hex_funcs = hex_func, hex_func2

    def decode_regular(self, remap=chr):
        assert self[0] == '(' and self[-1] == ')'
        mylist = self.unescape_func(self[1:-1])
        result = []
        unescape = self.unescape_dict.get
        for chunk in mylist:
            chunk = unescape(chunk, chunk)
            if chunk.startswith('\\') and len(chunk) > 1:
                value = int(chunk[1:], 8)
                # FIXME: TODO: Handle unicode here
                if value > 127:
                    value = 127
                chunk = remap(value)
            if chunk:
                result.append(chunk)
        return ''.join(result)

    def decode_hex(self, remap=chr, twobytes=False):
        data = ''.join(self.split())
        data = self.hex_funcs[twobytes](data)
        chars = data[1::2]
        other = data[0::2]
        assert other[0] == '<' and other[-1] == '>' and ''.join(other) == '<>', self
        return ''.join([remap(int(x, 16)) for x in chars])

    def decode(self, remap=chr, twobytes=False):
        if self.startswith('('):
            return self.decode_regular(remap)
        else:
            return self.decode_hex(remap, twobytes)

    def encode(cls, source, usehex=False):
        assert not usehex, "Not supported yet"
        if isinstance(source, unicode):
            source = source.encode('utf-8')
        else:
            source = str(source)
        source = source.replace('\\', '\\\\')
        source = source.replace('(', '\\(')
        source = source.replace(')', '\\)')
        return cls('(' +source + ')')
    encode = classmethod(encode)
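
PdfString handles two surface forms: literal strings in parentheses, where backslashes and parentheses must be escaped, and hex strings in angle brackets, read in groups of two or four digits. A minimal sketch of both on plain Python strings is shown below. The helpers encode_literal and decode_hex are hypothetical simplifications that return plain values rather than PdfString objects, and decode_hex returns character codes instead of remapped characters.

```python
def encode_literal(text):
    """Wrap text as a PDF literal string, escaping backslash and parentheses."""
    # Escape the backslash first so the escapes added below stay intact.
    text = text.replace('\\', '\\\\')
    text = text.replace('(', '\\(').replace(')', '\\)')
    return '(' + text + ')'

def decode_hex(token, width=2):
    """Split '<48 65>' into integer codes, `width` hex digits per code."""
    digits = ''.join(token.split())[1:-1]   # drop whitespace and the < >
    return [int(digits[i:i + width], 16) for i in range(0, len(digits), width)]

literal = encode_literal('a(b)')
one_byte = decode_hex('<48 65 6C>')
two_byte = decode_hex('<00480065>', width=4)
```

The width of 4 corresponds to the twobytes flag, which FontInfo sets when the font's ToUnicode map uses four-digit codes.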

433
pdfrw/pdfreader.py Normal file

@ -0,0 +1,433 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''
The PdfReader class reads an entire PDF file into memory and
parses the top-level container objects. (It does not parse
into streams.) The object subclasses PdfDict, and the
document pages are stored in a list in the pages attribute
of the object.
'''
import gc
from pdfrw.errors import PdfParseError, log
from pdfrw.tokens import PdfTokens
from pdfrw.objects import PdfDict, PdfArray, PdfName, PdfObject, PdfIndirect
from pdfrw.uncompress import uncompress
class PdfReader(PdfDict):

    warned_bad_stream_start = False  # Use to keep from spewing warnings
    warned_bad_stream_end = False  # Use to keep from spewing warnings

    def findindirect(self, objnum, gennum, PdfIndirect=PdfIndirect, int=int):
        ''' Return a previously loaded indirect object, or create
            a placeholder for it.
        '''
        key = int(objnum), int(gennum)
        result = self.indirect_objects.get(key)
        if result is None:
            self.indirect_objects[key] = result = PdfIndirect(key)
            self.deferred_objects.add(key)
            result._loader = self.loadindirect
        return result

    def readarray(self, source, PdfArray=PdfArray):
        ''' Found a [ token.  Parse the tokens after that.
        '''
        specialget = self.special.get
        result = []
        pop = result.pop
        append = result.append

        for value in source:
            if value in ']R':
                if value == ']':
                    break
                generation = pop()
                value = self.findindirect(pop(), generation)
            else:
                func = specialget(value)
                if func is not None:
                    value = func(source)
            append(value)
        return PdfArray(result)

    def readdict(self, source, PdfDict=PdfDict):
        ''' Found a << token.  Parse the tokens after that.
        '''
        specialget = self.special.get
        result = PdfDict()
        next = source.next

        tok = next()
        while tok != '>>':
            if not tok.startswith('/'):
                source.exception('Expected PDF /name object')
            key = tok
            value = next()
            func = specialget(value)
            if func is not None:
                value = func(source)
                tok = next()
            else:
                tok = next()
                if value.isdigit() and tok.isdigit():
                    if next() != 'R':
                        source.exception('Expected "R" following two integers')
                    value = self.findindirect(value, tok)
                    tok = next()
            result[key] = value
        return result

    def empty_obj(self, source, PdfObject=PdfObject):
        ''' Some silly git put an empty object in the
            file.  Back up so the caller sees the endobj.
        '''
        source.floc = source.tokstart

    def badtoken(self, source):
        ''' Didn't see that coming.
        '''
        source.exception('Unexpected delimiter')

    def findstream(self, obj, tok, source, PdfDict=PdfDict, isinstance=isinstance, len=len):
        ''' Figure out if there is a content stream
            following an object, and return the start
            pointer to the content stream if so.

            (We can't read it yet, because we might not
            know how long it is, because Length might
            be an indirect object.)
        '''
        isdict = isinstance(obj, PdfDict)
        if not isdict or tok != 'stream':
            source.exception("Expected 'endobj'%s token", isdict and " or 'stream'" or '')
        fdata = source.fdata
        startstream = source.tokstart + len(tok)
        gotcr = fdata[startstream] == '\r'
        startstream += gotcr
        gotlf = fdata[startstream] == '\n'
        startstream += gotlf
        if not gotlf:
            if not gotcr:
                source.exception(r'stream keyword not followed by \n')
            if not self.warned_bad_stream_start:
                source.warning(r"stream keyword terminated by \r without \n")
                self.private.warned_bad_stream_start = True
        return startstream

    def readstream(self, obj, startstream, source,
                   streamending = 'endstream endobj'.split(), int=int):
        fdata = source.fdata
        length = int(obj.Length)
        source.floc = target_endstream = startstream + length
        endit = source.multiple(2)
        obj._stream = fdata[startstream:target_endstream]
        if endit == streamending:
            return

        # The length attribute does not match the distance between the
        # stream and endstream keywords.
        do_warn, self.warned_bad_stream_end = self.warned_bad_stream_end, False

        #TODO: Extract maxstream from dictionary of object offsets
        # and use rfind instead of find.
        maxstream = len(fdata) - 20
        endstream = fdata.find('endstream', startstream, maxstream)
        source.floc = startstream
        room = endstream - startstream
        if endstream < 0:
            source.error('Could not find endstream')
            return
        if length == room + 1 and fdata[startstream-2:startstream] == '\r\n':
            source.warning(r"stream keyword terminated by \r without \n")
            obj._stream = fdata[startstream-1:target_endstream-1]
            return
        source.floc = endstream
        if length > room:
            source.error('stream /Length attribute (%d) appears to be too big (size %d) -- adjusting',
                         length, room)
            obj.stream = fdata[startstream:endstream]
            return
        if fdata[target_endstream:endstream].rstrip():
            source.error('stream /Length attribute (%d) might be smaller than data size (%d)',
                         length, room)
            return
        endobj = fdata.find('endobj', endstream, maxstream)
        if endobj < 0:
            source.error('Could not find endobj after endstream')
            return
        if fdata[endstream:endobj].rstrip() != 'endstream':
            source.error('Unexpected data between endstream and endobj')
            return
        source.error('Illegal endstream/endobj combination')

    def loadindirect(self, key):
        result = self.indirect_objects.get(key)
        if not isinstance(result, PdfIndirect):
            return result
        source = self.source
        offset = int(self.source.obj_offsets.get(key, '0'))
        if not offset:
            log.warning("Did not find PDF object %s" % (key,))
            return None

        # Read the object header and validate it
        objnum, gennum = key
        source.floc = offset
        objid = source.multiple(3)
        ok = len(objid) == 3
        ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
        ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
        ok = ok and objid[2] == 'obj'
        if not ok:
            source.floc = offset
            source.next()
            objheader = '%d %d obj' % (objnum, gennum)
            fdata = source.fdata
            offset2 = fdata.find('\n' + objheader) + 1 or fdata.find('\r' + objheader) + 1
            if not offset2 or fdata.find(fdata[offset2-1] + objheader, offset2) > 0:
                source.warning("Expected indirect object '%s'" % objheader)
                return None
            source.warning("Indirect object %s found at incorrect offset %d (expected offset %d)" %
                           (objheader, offset2, offset))
            source.floc = offset2 + len(objheader)

        # Read the object, and call special code if it starts
        # an array or dictionary
        obj = source.next()
        func = self.special.get(obj)
        if func is not None:
            obj = func(source)

        self.indirect_objects[key] = obj
        self.deferred_objects.remove(key)

        # Mark the object as indirect, and
        # add it to the list of streams if it starts a stream
        obj.indirect = key
        tok = source.next()
        if tok != 'endobj':
            self.readstream(obj, self.findstream(obj, tok, source), source)
        return obj

    def findxref(fdata):
        ''' Find the cross reference section at the end of a file
        '''
        startloc = fdata.rfind('startxref')
        if startloc < 0:
            raise PdfParseError('Did not find "startxref" at end of file')
        source = PdfTokens(fdata, startloc, False)
        tok = source.next()
        assert tok == 'startxref'  # (We just checked this...)
        tableloc = source.next_default()
        if not tableloc.isdigit():
            source.exception('Expected table location')
        if source.next_default().rstrip().lstrip('%') != 'EOF':
            source.exception('Expected %%EOF')
        return startloc, PdfTokens(fdata, int(tableloc), True)
    findxref = staticmethod(findxref)

    def parsexref(self, source, int=int, range=range):
        ''' Parse (one of) the cross-reference file section(s)
        '''
        fdata = source.fdata
        setdefault = source.obj_offsets.setdefault
        add_offset = source.all_offsets.append
        next = source.next
        tok = next()
        if tok != 'xref':
            source.exception('Expected "xref" keyword')
        start = source.floc
        try:
            while 1:
                tok = next()
                if tok == 'trailer':
                    return
                startobj = int(tok)
                for objnum in range(startobj, startobj + int(next())):
                    offset = int(next())
                    generation = int(next())
                    inuse = next()
                    if inuse == 'n':
                        if offset != 0:
                            setdefault((objnum, generation), offset)
                            add_offset(offset)
                    elif inuse != 'f':
                        raise ValueError
        except:
            pass
        try:
            # Table formatted incorrectly.  See if we can figure it out anyway.
            end = source.fdata.rindex('trailer', start)
            table = source.fdata[start:end].splitlines()
            for line in table:
                tokens = line.split()
                if len(tokens) == 2:
                    objnum = int(tokens[0])
                elif len(tokens) == 3:
                    offset, generation, inuse = int(tokens[0]), int(tokens[1]), tokens[2]
                    if offset != 0 and inuse == 'n':
                        setdefault((objnum, generation), offset)
                        add_offset(offset)
                    objnum += 1
                elif tokens:
                    log.error('Invalid line in xref table: %s' % repr(line))
                    raise ValueError
            log.warning('Badly formatted xref table')
            source.floc = end
            source.next()
        except:
            source.floc = start
            source.exception('Invalid table format')

    def readpages(self, node):
        pagename=PdfName.Page
        pagesname=PdfName.Pages
        catalogname = PdfName.Catalog
        typename = PdfName.Type
        kidname = PdfName.Kids

        # PDFs can have arbitrarily nested Pages/Page
        # dictionary structures.
        def readnode(node):
            nodetype = node[typename]
            if nodetype == pagename:
                yield node
            elif nodetype == pagesname:
                for node in node[kidname]:
                    for node in readnode(node):
                        yield node
            elif nodetype == catalogname:
                for node in readnode(node[pagesname]):
                    yield node
            else:
                log.error('Expected /Page or /Pages dictionary, got %s' % repr(node))
        try:
            return list(readnode(node))
        except (AttributeError, TypeError), s:
            log.error('Invalid page tree: %s' % s)
            return []

    def __init__(self, fname=None, fdata=None, decompress=False, disable_gc=True):
        # Runs a lot faster with GC off.
        disable_gc = disable_gc and gc.isenabled()
        try:
            if disable_gc:
                gc.disable()
            if fname is not None:
                assert fdata is None
                # Allow reading preexisting streams like pyPdf
                if hasattr(fname, 'read'):
                    fdata = fname.read()
                else:
                    try:
                        f = open(fname, 'rb')
                        fdata = f.read()
                        f.close()
                    except IOError:
                        raise PdfParseError('Could not read PDF file %s' % fname)

            assert fdata is not None
            if not fdata.startswith('%PDF-'):
                startloc = fdata.find('%PDF-')
                if startloc >= 0:
                    log.warning('PDF header not at beginning of file')
                else:
                    lines = fdata.lstrip().splitlines()
                    if not lines:
                        raise PdfParseError('Empty PDF file!')
                    raise PdfParseError('Invalid PDF header: %s' % repr(lines[0]))

            endloc = fdata.rfind('%EOF')
            if endloc < 0:
                raise PdfParseError('EOF mark not found: %s' % repr(fdata[-20:]))
            endloc += 6
            junk = fdata[endloc:]
            fdata = fdata[:endloc]
            if junk.rstrip('\00').strip():
                log.warning('Extra data at end of file')

            private = self.private
            private.indirect_objects = {}
            private.deferred_objects = set()
            private.special = {'<<': self.readdict,
                               '[': self.readarray,
                               'endobj': self.empty_obj,
                               }
            for tok in r'\ ( ) < > { } ] >> %'.split():
                self.special[tok] = self.badtoken

            startloc, source = self.findxref(fdata)
            private.source = source
            xref_table_list = []
            source.all_offsets = []
            while 1:
                source.obj_offsets = {}
                # Loop through all the cross-reference tables
                self.parsexref(source)
                tok = source.next()
                if tok != '<<':
                    source.exception('Expected "<<" starting catalog')

                newdict = self.readdict(source)

                token = source.next()
                if token != 'startxref' and not xref_table_list:
                    source.warning('Expected "startxref" at end of xref table')

                # Loop if any previously-written tables.
                prev = newdict.Prev
                if prev is None:
                    break
                if not xref_table_list:
                    newdict.Prev = None
                    original_indirect = self.indirect_objects.copy()
                    original_newdict = newdict
                source.floc = int(prev)
                xref_table_list.append(source.obj_offsets)
                self.indirect_objects.clear()

            if xref_table_list:
                for update in reversed(xref_table_list):
                    source.obj_offsets.update(update)
                self.indirect_objects.clear()
                self.indirect_objects.update(original_indirect)
                newdict = original_newdict
            self.update(newdict)

            #self.read_all_indirect(source)
            private.pages = self.readpages(self.Root)
            if decompress:
                self.uncompress()

            # For compatibility with pyPdf
            private.numPages = len(self.pages)
        finally:
            if disable_gc:
                gc.enable()

    # For compatibility with pyPdf
    def getPage(self, pagenum):
        return self.pages[pagenum]

    def read_all(self):
        deferred = self.deferred_objects
        prev = set()
        while 1:
            new = deferred - prev
            if not new:
                break
            prev |= deferred
            for key in new:
                self.loadindirect(key)

    def uncompress(self):
        self.read_all()
        uncompress(self.indirect_objects.itervalues())
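
The line-based fallback in parsexref() above can be illustrated in isolation. The following is a minimal standalone sketch, not part of pdfrw: the name parse_xref_section and the sample table are hypothetical, it is written for Python 3, and it ignores incremental updates and cross-reference streams.

```python
def parse_xref_section(text):
    """Map (objnum, generation) -> byte offset for in-use entries.

    `text` is assumed to be the table body found between the
    'xref' keyword and the 'trailer' keyword.
    """
    offsets = {}
    objnum = None
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 2:
            # Subsection header: first object number, entry count.
            objnum = int(tokens[0])
        elif len(tokens) == 3:
            if objnum is None:
                raise ValueError('xref entry before subsection header')
            offset, generation, inuse = int(tokens[0]), int(tokens[1]), tokens[2]
            if inuse == 'n' and offset != 0:
                # Keep the first offset seen for a key, like setdefault above.
                offsets.setdefault((objnum, generation), offset)
            elif inuse not in ('n', 'f'):
                raise ValueError('invalid xref entry: %r' % line)
            objnum += 1
        elif tokens:
            raise ValueError('invalid xref line: %r' % line)
    return offsets


sample = (
    "0 3\n"
    "0000000000 65535 f \n"
    "0000000017 00000 n \n"
    "0000000081 00000 n \n"
)
print(parse_xref_section(sample))
```

Free ('f') entries only advance the object counter, so the sample yields offsets for objects 1 and 2 only.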

295
pdfrw/pdfwriter.py Executable file

@ -0,0 +1,295 @@
#!/usr/bin/env python
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''
The PdfWriter class writes an entire PDF file out to disk.

The writing process is not at all optimized or organized.

An instance of the PdfWriter class has two methods:
    addpage(page)
and
    write(fname)

addpage() assumes that the pages are part of a valid
tree/forest of PDF objects.
'''

try:
    set
except NameError:
    from sets import Set as set
from pdfrw.objects import PdfName, PdfArray, PdfDict, IndirectPdfDict, PdfObject, PdfString
from pdfrw.compress import compress as do_compress
from pdfrw.errors import PdfOutputError, log
NullObject = PdfObject('null')
NullObject.indirect = True
NullObject.Type = 'Null object'

def FormatObjects(f, trailer, version='1.3', compress=True, killobj=(),
                  id=id, isinstance=isinstance, getattr=getattr,len=len,
                  sum=sum, set=set, str=str, basestring=basestring,
                  hasattr=hasattr, repr=repr, enumerate=enumerate,
                  list=list, dict=dict, tuple=tuple,
                  do_compress=do_compress, PdfArray=PdfArray,
                  PdfDict=PdfDict, PdfObject=PdfObject, encode=PdfString.encode):
    ''' FormatObjects performs the actual formatting and disk write.
        Should be a class, was a class, turned into nested functions
        for performance (to reduce attribute lookups).
    '''

    def add(obj):
        ''' Add an object to our list, if it's an indirect
            object.  Just format it if not.
        '''
        # Can't hash dicts, so just hash the object ID
        objid = id(obj)

        # Automatically set stream objects to indirect
        if isinstance(obj, PdfDict):
            indirect = obj.indirect or (obj.stream is not None)
        else:
            indirect = getattr(obj, 'indirect', False)

        if not indirect:
            if objid in visited:
                log.warning('Replicating direct %s object, should be indirect for optimal file size' % type(obj))
                obj = type(obj)(obj)
                objid = id(obj)
            visiting(objid)
            result = format_obj(obj)
            leaving(objid)
            return result

        objnum = indirect_dict_get(objid)

        # If we haven't seen the object yet, we need to
        # add it to the indirect object list.
        if objnum is None:
            swapped = swapobj(objid)
            if swapped is not None:
                old_id = objid
                obj = swapped
                objid = id(obj)
                objnum = indirect_dict_get(objid)
                if objnum is not None:
                    indirect_dict[old_id] = objnum
                    return '%s 0 R' % objnum
            objnum = len(objlist) + 1
            objlist_append(None)
            indirect_dict[objid] = objnum
            deferred.append((objnum-1, obj))
        return '%s 0 R' % objnum

    def format_array(myarray, formatter):
        # Format array data into semi-readable ASCII
        if sum([len(x) for x in myarray]) <= 70:
            return formatter % space_join(myarray)
        return format_big(myarray, formatter)

    def format_big(myarray, formatter):
        bigarray = []
        count = 1000000
        for x in myarray:
            lenx = len(x) + 1
            count += lenx
            if count > 71:
                subarray = []
                bigarray.append(subarray)
                count = lenx
            subarray.append(x)
        return formatter % lf_join([space_join(x) for x in bigarray])

    def format_obj(obj):
        ''' format PDF object data into semi-readable ASCII.
            May mutually recurse with add() -- add() will
            return references for indirect objects, and add
            the indirect object to the list.
        '''
        while 1:
            if isinstance(obj, (list, dict, tuple)):
                if isinstance(obj, PdfArray):
                    myarray = [add(x) for x in obj]
                    return format_array(myarray, '[%s]')
                elif isinstance(obj, PdfDict):
                    if compress and obj.stream:
                        do_compress([obj])
                    myarray = []
                    dictkeys = [str(x) for x in obj.keys()]
                    dictkeys.sort()
                    for key in dictkeys:
                        myarray.append(key)
                        myarray.append(add(obj[key]))
                    result = format_array(myarray, '<<%s>>')
                    stream = obj.stream
                    if stream is not None:
                        result = '%s\nstream\n%s\nendstream' % (result, stream)
                    return result
                obj = (PdfArray, PdfDict)[isinstance(obj, dict)](obj)
                continue
            if not hasattr(obj, 'indirect') and isinstance(obj, basestring):
                return encode(obj)
            return str(getattr(obj, 'encoded', obj))

    def format_deferred():
        while deferred:
            index, obj = deferred.pop()
            objlist[index] = format_obj(obj)

    indirect_dict = {}
    indirect_dict_get = indirect_dict.get
    objlist = []
    objlist_append = objlist.append
    visited = set()
    visiting = visited.add
    leaving = visited.remove
    space_join = ' '.join
    lf_join = '\n '.join
    f_write = f.write

    deferred = []

    # Don't reference old catalog or pages objects -- swap references to new ones.
    swapobj = {PdfName.Catalog:trailer.Root, PdfName.Pages:trailer.Root.Pages, None:trailer}.get
    swapobj = [(objid, swapobj(obj.Type)) for objid, obj in killobj.iteritems()]
    swapobj = dict((objid, obj is None and NullObject or obj) for objid, obj in swapobj).get

    for objid in killobj:
        assert swapobj(objid) is not None

    # The first format of trailer gets all the information,
    # but we throw away the actual trailer formatting.
    format_obj(trailer)
    # Keep formatting until we're done.
    # (Used to recurse inside format_obj for this, but
    # hit system limit.)
    format_deferred()
    # Now we know the size, so we update the trailer dict
    # and get the formatted data.
    trailer.Size = PdfObject(len(objlist) + 1)
    trailer = format_obj(trailer)

    # Now we have all the pieces to write out to the file.
    # Keep careful track of the counts while we do it so
    # we can correctly build the cross-reference.

    header = '%%PDF-%s\n%%\xe2\xe3\xcf\xd3\n' % version
    f_write(header)
    offset = len(header)
    offsets = [(0, 65535, 'f')]
    offsets_append = offsets.append

    for i, x in enumerate(objlist):
        objstr = '%s 0 obj\n%s\nendobj\n' % (i + 1, x)
        offsets_append((offset, 0, 'n'))
        offset += len(objstr)
        f_write(objstr)

    f_write('xref\n0 %s\n' % len(offsets))
    for x in offsets:
        f_write('%010d %05d %s\r\n' % x)
    f_write('trailer\n\n%s\nstartxref\n%s\n%%%%EOF\n' % (trailer, offset))

class PdfWriter(object):

    _trailer = None

    def __init__(self, version='1.3', compress=False):
        self.pagearray = PdfArray()
        self.compress = compress
        self.version = version
        self.killobj = {}

    def addpage(self, page):
        self._trailer = None
        if page.Type != PdfName.Page:
            raise PdfOutputError('Bad /Type: Expected %s, found %s'
                                 % (PdfName.Page, page.Type))
        inheritable = page.inheritable  # searches for resources
        self.pagearray.append(
            IndirectPdfDict(
                page,
                Resources = inheritable.Resources,
                MediaBox = inheritable.MediaBox,
                CropBox = inheritable.CropBox,
                Rotate = inheritable.Rotate,
            )
        )

        # Add parents in the hierarchy to objects we
        # don't want to output
        killobj = self.killobj
        obj = page.Parent
        while obj is not None:
            objid = id(obj)
            if objid in killobj:
                break
            killobj[objid] = obj
            obj = obj.Parent
        return self

    addPage = addpage  # for compatibility with pyPdf

    def addpages(self, pagelist):
        for page in pagelist:
            self.addpage(page)
        return self

    def _get_trailer(self):
        trailer = self._trailer
        if trailer is not None:
            return trailer

        # Create the basic object structure of the PDF file
        trailer = PdfDict(
            Root = IndirectPdfDict(
                Type = PdfName.Catalog,
                Pages = IndirectPdfDict(
                    Type = PdfName.Pages,
                    Count = PdfObject(len(self.pagearray)),
                    Kids = self.pagearray
                )
            )
        )
        # Make all the pages point back to the page dictionary
        pagedict = trailer.Root.Pages
        for page in pagedict.Kids:
            page.Parent = pagedict
        self._trailer = trailer
        return trailer

    def _set_trailer(self, trailer):
        self._trailer = trailer

    trailer = property(_get_trailer, _set_trailer)

    def write(self, fname, trailer=None):
        trailer = trailer or self.trailer

        # Dump the data.  We either have a filename or a preexisting
        # file object.
        preexisting = hasattr(fname, 'write')
        f = preexisting and fname or open(fname, 'wb')
        FormatObjects(f, trailer, self.version, self.compress, self.killobj)
        if not preexisting:
            f.close()

if __name__ == '__main__':
    import logging
    log.setLevel(logging.DEBUG)
    import pdfreader
    x = pdfreader.PdfReader('source.pdf')
    y = PdfWriter()
    for i, page in enumerate(x.pages):
        print ' Adding page', i+1, '\r',
        y.addpage(page)
    print
    y.write('result.pdf')
    print
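
The offset bookkeeping at the end of FormatObjects() above can be sketched on its own. This is a simplified illustration under stated assumptions, not pdfrw's code: build_pdf_body is a hypothetical helper, the object bodies are assumed to be pre-formatted ASCII strings (so character counts equal byte counts), the binary comment line after the header is omitted, and no trailer is emitted. It is written for Python 3.

```python
def build_pdf_body(objects, version='1.3'):
    """Return (data, xref_offset) for a list of pre-formatted object bodies."""
    header = '%%PDF-%s\n' % version
    chunks = [header]
    offset = len(header)
    # Entry 0 is the head of the free list, with generation 65535.
    entries = [(0, 65535, 'f')]
    for number, body in enumerate(objects, 1):
        objstr = '%d 0 obj\n%s\nendobj\n' % (number, body)
        # Record where this object starts before advancing the offset.
        entries.append((offset, 0, 'n'))
        offset += len(objstr)
        chunks.append(objstr)
    chunks.append('xref\n0 %d\n' % len(entries))
    for entry in entries:
        # 10-digit offset, 5-digit generation, flag, CRLF: 20 characters.
        chunks.append('%010d %05d %s\r\n' % entry)
    return ''.join(chunks), offset


data, xref_at = build_pdf_body(['<< /Type /Catalog >>', 'null'])
print(data[xref_at:xref_at + 4])
```

After the loop, offset points at the xref keyword, which is the value the writer above places after startxref.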

228
pdfrw/tokens.py Normal file

@ -0,0 +1,228 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''
A tokenizer for PDF streams.
In general, documentation used was "PDF reference",
sixth edition, for PDF version 1.7, dated November 2006.
'''
from __future__ import generators
import re
import itertools
from pdfrw.objects import PdfString, PdfObject
from pdfrw.errors import log, PdfParseError

def linepos(fdata, loc):
    line = fdata.count('\n', 0, loc) + 1
    line += fdata.count('\r', 0, loc) - fdata.count('\r\n', 0, loc)
    col = loc - max(fdata.rfind('\n', 0, loc), fdata.rfind('\r', 0, loc))
    return line, col

class PdfTokens(object):

    # Table 3.1, page 50 of reference, defines whitespace
    eol = '\n\r'
    whitespace = '\x00 \t\f' + eol

    # Text on page 50 defines delimiter characters
    # Escape the ]
    delimiters = r'()<>{}[\]/%'

    # "normal" stuff is all but delimiters or whitespace.
    p_normal = r'(?:[^\\%s%s]+|\\[^%s])+' % (whitespace, delimiters, whitespace)

    p_comment = r'\%%[^%s]*' % eol

    # This will get the bulk of literal strings.
    p_literal_string = r'\((?:[^\\()]+|\\.)*[()]?'

    # This will get more pieces of literal strings
    # (Don't ask me why, but it hangs without the trailing ?.)
    p_literal_string_extend = r'(?:[^\\()]+|\\.)*[()]?'

    # A hex string.  This one's easy.
    p_hex_string = r'\<[%s0-9A-Fa-f]*\>' % whitespace

    p_dictdelim = r'\<\<|\>\>'
    p_name = r'/[^%s%s]*' % (delimiters, whitespace)

    p_catchall = '[^%s]' % whitespace

    pattern = '|'.join([p_normal, p_name, p_hex_string, p_dictdelim, p_literal_string, p_comment, p_catchall])
    findtok = re.compile('(%s)[%s]*' % (pattern, whitespace), re.DOTALL).finditer
    findparen = re.compile('(%s)[%s]*' % (p_literal_string_extend, whitespace), re.DOTALL).finditer
    splitname = re.compile(r'\#([0-9A-Fa-f]{2})').split

    def _cacheobj(cache, obj, constructor):
        ''' This caching relies on the constructors
            returning something that will compare as
            equal to the original obj.  This works
            fine with our PDF objects.
        '''
        result = cache.get(obj)
        if result is None:
            result = constructor(obj)
            cache[result] = result
        return result

    def fixname(self, cache, token, constructor, splitname=splitname, join=''.join, cacheobj=_cacheobj):
        ''' Inside name tokens, a '#' character indicates that
            the next two bytes are hex characters to be used
            to form the 'real' character.
        '''
        substrs = splitname(token)
        if '#' in join(substrs[::2]):
            self.warning('Invalid /Name token')
            return token
        substrs[1::2] = (chr(int(x, 16)) for x in substrs[1::2])
        result = cacheobj(cache, join(substrs), constructor)
        result.encoded = token
        return result

    def _gettoks(self, startloc, cacheobj=_cacheobj,
                 delimiters=delimiters, findtok=findtok, findparen=findparen,
                 PdfString=PdfString, PdfObject=PdfObject):
        ''' Given a source data string and a location inside it,
            gettoks generates tokens.  Each token is a tuple of the form:
             <starting file loc>, <ending file loc>, <token string>
            The ending file loc is past any trailing whitespace.

            The main complication here is the literal strings, which
            can contain nested parentheses.  In order to cope with these
            we can discard the current iterator and loop back to the
            top to get a fresh one.

            We could use re.search instead of re.finditer, but that's slower.
        '''
        fdata = self.fdata
        current = self.current = [(startloc, startloc)]
        namehandler = (cacheobj, self.fixname)
        cache = {}
        while 1:
            for match in findtok(fdata, current[0][1]):
                current[0] = tokspan = match.span()
                token = match.group(1)
                firstch = token[0]
                if firstch not in delimiters:
                    token = cacheobj(cache, token, PdfObject)
                elif firstch in '/<(%':
                    if firstch == '/':
                        # PDF Name
                        token = namehandler['#' in token](cache, token, PdfObject)
                    elif firstch == '<':
                        # << dict delim, or < hex string >
                        if token[1:2] != '<':
                            token = cacheobj(cache, token, PdfString)
                    elif firstch == '(':
                        # Literal string
                        # It's probably simple, but maybe not
                        # Nested parentheses are a bear, and if
                        # they are present, we exit the for loop
                        # and get back in with a new starting location.
                        ends = None  # For broken strings
                        if fdata[match.end(1)-1] != ')':
                            nest = 2
                            m_start, loc = tokspan
                            for match in findparen(fdata, loc):
                                loc = match.end(1)
                                ending = fdata[loc-1] == ')'
                                nest += 1 - ending * 2
                                if not nest:
                                    break
                                if ending and ends is None:
                                    ends = loc, match.end(), nest
                            token = fdata[m_start:loc]
                            current[0] = m_start, match.end()
                            if nest:
                                # There is one possible recoverable error seen in
                                # the wild -- some stupid generators don't escape (.
                                # If this happens, just terminate on first unescaped ).
                                # The string won't be quite right, but that's a science
                                # fair project for another time.
                                (self.error, self.exception)[not ends]('Unterminated literal string')
                                loc, ends, nest = ends
                                token = fdata[m_start:loc] + ')' * nest
                                current[0] = m_start, ends
                        token = cacheobj(cache, token, PdfString)
                    elif firstch == '%':
                        # Comment
                        if self.strip_comments:
                            continue
                    else:
                        self.exception('Tokenizer logic incorrect -- should never get here')

                yield token
                if current[0] is not tokspan:
                    break
            else:
                if self.strip_comments:
                    break
                raise StopIteration

    def __init__(self, fdata, startloc=0, strip_comments=True):
        self.fdata = fdata
        self.strip_comments = strip_comments
        self.iterator = iterator = self._gettoks(startloc)
        self.next = iterator.next

    def setstart(self, startloc):
        ''' Change the starting location.
        '''
        current = self.current
        if startloc != current[0][1]:
            current[0] = startloc, startloc

    def floc(self):
        ''' Return the current file position
            (where the next token will be retrieved)
        '''
        return self.current[0][1]
    floc = property(floc, setstart)

    def tokstart(self):
        ''' Return the file position of the most
            recently retrieved token.
        '''
        return self.current[0][0]
    tokstart = property(tokstart, setstart)

    def __iter__(self):
        return self.iterator

    def multiple(self, count, islice=itertools.islice, list=list):
        ''' Retrieve multiple tokens
        '''
        return list(islice(self, count))

    def next_default(self, default='nope'):
        for result in self:
            return result
        return default

    def msg(self, msg, *arg):
        if arg:
            msg %= arg
        fdata = self.fdata
        begin, end = self.current[0]
        line, col = linepos(fdata, begin)
        if end > begin:
            tok = fdata[begin:end].rstrip()
            if len(tok) > 30:
                tok = tok[:26] + ' ...'
            return '%s (line=%d, col=%d, token=%s)' % (msg, line, col, repr(tok))
        return '%s (line=%d, col=%d)' % (msg, line, col)

    def warning(self, *arg):
        log.warning(self.msg(*arg))

    def error(self, *arg):
        log.error(self.msg(*arg))

    def exception(self, *arg):
        raise PdfParseError(self.msg(*arg))
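
The '#xx' handling in fixname() above relies on re.split returning the captured hex pairs at the odd indices of the result. A minimal standalone sketch of that idea follows; decode_name is a hypothetical helper rather than a pdfrw function, it raises instead of logging a warning, and it is written for Python 3.

```python
import re

# Splitting on a pattern with one capture group interleaves the
# literal pieces (even indices) with the captured hex pairs (odd indices).
_split_hex = re.compile(r'#([0-9A-Fa-f]{2})').split


def decode_name(token):
    """Decode '#xx' escapes in a PDF name token such as '/A#20B'."""
    parts = _split_hex(token)
    if '#' in ''.join(parts[::2]):
        # A '#' that survived the split was not followed by two hex digits.
        raise ValueError('invalid /Name token: %r' % token)
    parts[1::2] = [chr(int(pair, 16)) for pair in parts[1::2]]
    return ''.join(parts)


print(repr(decode_name('/A#20B')))
```

A token without any escape splits into a single piece and is returned unchanged.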

139
pdfrw/toreportlab.py Normal file

@ -0,0 +1,139 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''
Converts pdfrw objects into reportlab objects.

Designed for and tested with rl 2.3.

Knows too much about reportlab internals.
What can you do?

The interface to this function is through the makerl() function.

Parameters:
        canv   - a reportlab "canvas" (also accepts a "document")
        pdfobj - a pdfrw PDF object

Returns:
        A corresponding reportlab object, or if the
        object is a PDF Form XObject, the name to
        use with reportlab for the object.

        Will recursively convert all necessary objects.
        Be careful when converting a page -- if /Parent is set,
        will recursively convert all pages!

Notes:
    1) Original objects are annotated with a
       derived_rl_obj attribute which points to the
       reportlab object.  This keeps multiple reportlab
       objects from being generated for the same pdfobj
       via repeated calls to makerl.  This is great for
       not putting too many objects into the
       new PDF, but not so good if you are modifying
       objects for different pages.  Then you
       need to do your own deep copying (of circular
       structures).  You're on your own.

    2) ReportLab seems weird about FormXObjects.
       They pass around a partial name instead of the
       object or a reference to it.  So we have to
       reach into reportlab and get a number for
       a unique name.  I guess this is to make it
       where you can combine page streams with
       impunity, but that's just a guess.

    3) Updated 1/23/2010 to handle multipass documents
       (e.g. with a table of contents).  These have
       a different doc object on every pass.
'''
from reportlab.pdfbase import pdfdoc as rldocmodule
from pdfrw.objects import PdfDict, PdfArray, PdfName
RLStream = rldocmodule.PDFStream
RLDict = rldocmodule.PDFDictionary
RLArray = rldocmodule.PDFArray

def _makedict(rldoc, pdfobj):
    rlobj = rldict = RLDict()
    if pdfobj.indirect:
        rlobj.__RefOnly__ = 1
        rlobj = rldoc.Reference(rlobj)
    pdfobj.derived_rl_obj[rldoc] = rlobj, None

    for key, value in pdfobj.iteritems():
        rldict[key[1:]] = makerl_recurse(rldoc, value)

    return rlobj

def _makestream(rldoc, pdfobj, xobjtype=PdfName.XObject):
    rldict = RLDict()
    rlobj = RLStream(rldict, pdfobj.stream)

    if pdfobj.Type == xobjtype:
        shortname = 'pdfrw_%s' % (rldoc.objectcounter+1)
        fullname = rldoc.getXObjectName(shortname)
    else:
        shortname = fullname = None
    result = rldoc.Reference(rlobj, fullname)
    pdfobj.derived_rl_obj[rldoc] = result, shortname

    for key, value in pdfobj.iteritems():
        rldict[key[1:]] = makerl_recurse(rldoc, value)

    return result

def _makearray(rldoc, pdfobj):
    rlobj = rlarray = RLArray([])
    if pdfobj.indirect:
        rlobj.__RefOnly__ = 1
        rlobj = rldoc.Reference(rlobj)
    pdfobj.derived_rl_obj[rldoc] = rlobj, None

    mylist = rlarray.sequence
    for value in pdfobj:
        mylist.append(makerl_recurse(rldoc, value))

    return rlobj

def _makestr(rldoc, pdfobj):
    assert isinstance(pdfobj, (float, int, str)), repr(pdfobj)
    return pdfobj

def makerl_recurse(rldoc, pdfobj):
    docdict = getattr(pdfobj, 'derived_rl_obj', None)
    if docdict is not None:
        value = docdict.get(rldoc)
        if value is not None:
            return value[0]
    if isinstance(pdfobj, PdfDict):
        if pdfobj.stream is not None:
            func = _makestream
        else:
            func = _makedict
        if docdict is None:
            pdfobj.private.derived_rl_obj = {}
    elif isinstance(pdfobj, PdfArray):
        func = _makearray
        if docdict is None:
            pdfobj.derived_rl_obj = {}
    else:
        func = _makestr
    return func(rldoc, pdfobj)

def makerl(canv, pdfobj):
    try:
        rldoc = canv._doc
    except AttributeError:
        rldoc = canv
    rlobj = makerl_recurse(rldoc, pdfobj)
    try:
        name = pdfobj.derived_rl_obj[rldoc][1]
    except AttributeError:
        name = None
    return name or rlobj

52
pdfrw/uncompress.py Normal file

@ -0,0 +1,52 @@
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details
'''
Currently, this sad little file only knows how to decompress
using the flate (zlib) algorithm. Maybe more later, but it's
not a priority for me...
'''
import zlib
from pdfrw.objects import PdfDict, PdfName
from pdfrw.errors import log

def streamobjects(mylist, isinstance=isinstance, PdfDict=PdfDict):
    for obj in mylist:
        if isinstance(obj, PdfDict) and obj.stream is not None:
            yield obj

def uncompress(mylist, warnings=set(), flate = PdfName.FlateDecode,
               decompress=zlib.decompressobj, isinstance=isinstance, list=list, len=len):
    ok = True
    for obj in streamobjects(mylist):
        ftype = obj.Filter
        if ftype is None:
            continue
        if isinstance(ftype, list) and len(ftype) == 1:
            # todo: multiple filters
            ftype = ftype[0]
        parms = obj.DecodeParms
        if ftype != flate or parms is not None:
            msg = 'Not decompressing: cannot use filter %s with parameters %s' % (repr(ftype), repr(parms))
            if msg not in warnings:
                warnings.add(msg)
                log.warning(msg)
            ok = False
        else:
            dco = decompress()
            error = None
            try:
                data = dco.decompress(obj.stream)
            except Exception, s:
                error = str(s)
            if error is None:
                assert not dco.unconsumed_tail
                if dco.unused_data.strip():
                    error = 'Unconsumed compression data: %s' % repr(dco.unused_data[:20])
            if error is None:
                obj.Filter = None
                obj.stream = data
            else:
                log.error('%s %s' % (error, repr(obj.indirect)))
    return ok
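
The flate handling in uncompress() above boils down to feeding the stream to a zlib decompression object and then checking for leftover input. A minimal standalone sketch of that check follows; inflate is a hypothetical helper (not a pdfrw function) that returns a (data, error) pair instead of logging, and it is written for Python 3 with bytes input.

```python
import zlib


def inflate(data):
    """Return (decompressed_bytes, None) on success or (None, error_text)."""
    dco = zlib.decompressobj()
    try:
        result = dco.decompress(data)
    except zlib.error as exc:
        return None, str(exc)
    # Bytes found after the end of the compressed stream end up in unused_data.
    if dco.unused_data.strip():
        return None, 'Unconsumed compression data: %r' % dco.unused_data[:20]
    return result, None


packed = zlib.compress(b'BT /F1 12 Tf (Hello) Tj ET')
print(inflate(packed))
print(inflate(packed + b'junk')[1])
```

Trailing whitespace after the compressed data is tolerated, as in the code above, because unused_data is stripped before the check.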

38
setup.py Normal file

@ -0,0 +1,38 @@
#!/usr/bin/env python
from distutils.core import setup

setup(
    name='pdfrw',
    version='0.1',
    description='PDF file reader/writer library',
    long_description='''
pdfrw lets you read and write PDF files, including
compositing multiple pages together (e.g. to do watermarking,
or to copy an image or diagram from one PDF to another),
and can output by itself, or in conjunction with reportlab.

pdfrw will faithfully reproduce vector formats without
rasterization, so the rst2pdf package has used pdfrw
by default for PDF and SVG images since
March 2010.  Several small examples are provided.
''',
    author='Patrick Maupin',
    author_email='pmaupin@gmail.com',
    platforms="Independent",
    url='http://code.google.com/p/pdfrw/',
    packages=['pdfrw', 'pdfrw.objects'],
    license="MIT",
    classifiers=[
        'Development Status :: 4 - Beta',
        'Environment :: Console',
        'Intended Audience :: Developers',
        'License :: OSI Approved :: MIT License',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Topic :: Multimedia :: Graphics :: Graphics Conversion',
        'Topic :: Software Development :: Libraries',
        'Topic :: Utilities'
    ],
    keywords='pdf vector graphics',
)

1
tests/__init__.py Normal file

@ -0,0 +1 @@
# This file intentionally left blank.

37
tests/test_pdfstring.py Normal file

@ -0,0 +1,37 @@
'''
Run from the directory above like so:

    python -m tests.test_pdfstring
'''

import unittest

# PdfString lives in the pdfrw.objects package (see the imports in
# pdfwriter.py and tokens.py), not in a pdfrw.pdfobjects module.
from pdfrw.objects import PdfString


class TestEncoding(unittest.TestCase):

    @staticmethod
    def decode(value):
        return PdfString(value).decode()

    @staticmethod
    def encode(value):
        return str(PdfString.encode(value))

    @classmethod
    def encode_decode(cls, value):
        return cls.decode(cls.encode(value))

    def roundtrip(self, value):
        self.assertEqual(value, self.encode_decode(value))

    def test_doubleslash(self):
        self.roundtrip('\\')


def main():
    unittest.main()

if __name__ == '__main__':
    main()