pdfrw (0.1-3) unstable; urgency=medium

  * QA upload.
  * Build using dh_python2

# imported from the archive

commit a1959ba9c0
@ -0,0 +1,21 @@
pdfrw (pdfrw.googlecode.com)

Copyright (c) 2006-2012 Patrick Maupin

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
@ -0,0 +1,3 @@
pdfrw reads and writes PDF files.

More info at http://code.google.com/p/pdfrw
@ -0,0 +1,45 @@
pdfrw (0.1-3) unstable; urgency=medium

  * QA upload.
  * Build using dh_python2

 -- Matthias Klose <doko@debian.org>  Sun, 13 Jul 2014 15:50:59 +0000

pdfrw (0.1-2) unstable; urgency=medium

  * Orphaning package.

 -- Chris Lamb <lamby@debian.org>  Sun, 09 Feb 2014 00:05:27 +0000

pdfrw (0.1-1) unstable; urgency=low

  * New upstream release.

 -- Chris Lamb <lamby@debian.org>  Tue, 16 Oct 2012 07:54:53 +0100

pdfrw (0+svn136-4) unstable; urgency=low

  * Correct Homepage field. (Closes: #683165)
  * Specify a 'name' kwarg in call to setuptools.setup.

 -- Chris Lamb <lamby@debian.org>  Tue, 31 Jul 2012 02:41:14 -0700

pdfrw (0+svn136-3) unstable; urgency=low

  * python-pdfrw should Replaces/Provides/Conflicts pdfrw. Thanks to intrigeri
    <intrigeri@boum.org>. (Closes: #639273)

 -- Chris Lamb <lamby@debian.org>  Fri, 26 Aug 2011 10:48:38 +0100

pdfrw (0+svn136-2) unstable; urgency=low

  * Rename binary package to "python-pdfrw".
  * Change Section to "python".

 -- Chris Lamb <lamby@debian.org>  Tue, 23 Aug 2011 15:17:20 +0100

pdfrw (0+svn136-1) unstable; urgency=low

  * Initial release. (Closes: #638862)

 -- Chris Lamb <lamby@debian.org>  Mon, 22 Aug 2011 16:09:03 +0100
@ -0,0 +1 @@
7
@ -0,0 +1,32 @@
Source: pdfrw
Section: python
Priority: optional
Maintainer: Debian QA Group <packages@qa.debian.org>
Build-Depends: debhelper (>= 7.0.50~)
Build-Depends-Indep: python-setuptools
Standards-Version: 3.9.2
Homepage: http://code.google.com/p/pdfrw/
Vcs-Git: git://github.com/lamby/pkg-pdfrw.git
Vcs-Browser: https://github.com/lamby/pkg-pdfrw

Package: python-pdfrw
Architecture: all
Depends: ${misc:Depends}, ${python:Depends}, python-reportlab
Replaces: pdfrw
Provides: pdfrw
Conflicts: pdfrw
Description: PDF file manipulation library
 pdfrw can read and write PDF files, and can also be used to read in PDFs which
 can then be used inside reportlab.
 .
 pdfrw tries to be agnostic about the contents of PDF files, and support them
 as containers, but to do useful work, something a little higher-level is
 required. It supports the following:
 .
  * PDF pages. pdfrw knows enough to find the pages in PDF files you read in,
    and to write a set of pages back out to a new PDF file.
  * Form XObjects. pdfrw can take any page or rectangle on a page, and convert
    it to a Form XObject, suitable for use inside another PDF file
  * reportlab objects. pdfrw can recursively create a set of reportlab objects
    from its internal object format. This allows, for example, Form XObjects to
    be used inside reportlab.
@ -0,0 +1,44 @@
Author: Patrick Maupin
Download: http://code.google.com/p/pdfrw/

Files: *
Copyright: © 2006-2009 Patrick Maupin
License: MIT
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 .
 The above copyright notice and this permission notice shall be included in
 all copies or substantial portions of the Software.
 .
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 THE SOFTWARE.

Files: debian/*
Copyright: © 2011 Chris Lamb <chris@chris-lamb.co.uk>
License: MIT
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 .
 The above copyright notice and this permission notice shall be included in
 all copies or substantial portions of the Software.
 .
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 THE SOFTWARE.
@ -0,0 +1 @@
examples/*
@ -0,0 +1,4 @@
#!/usr/bin/make -f

%:
	dh $@ --with python2
@ -0,0 +1 @@
3.0 (quilt)
@ -0,0 +1,51 @@
#!/usr/bin/env python

'''
usage:   4up.py my.pdf

Creates 4up.my.pdf

'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfDict, PdfName, PdfArray
from pdfrw.buildxobj import pagexobj

def get4(allpages):
    # Pull a maximum of 4 pages off the list
    pages = [pagexobj(x) for x in allpages[:4]]
    del allpages[:4]

    x_max = max(page.BBox[2] for page in pages)
    y_max = max(page.BBox[3] for page in pages)

    stream = []
    xobjdict = PdfDict()
    for index, page in enumerate(pages):
        x = x_max * (index & 1) / 2.0
        y = y_max * (index <= 1) / 2.0
        index = '/P%s' % index
        stream.append('q 0.5 0 0 0.5 %s %s cm %s Do Q\n' % (x, y, index))
        xobjdict[index] = page

    return PdfDict(
        Type = PdfName.Page,
        Contents = PdfDict(stream=''.join(stream)),
        MediaBox = PdfArray([0, 0, x_max, y_max]),
        Resources = PdfDict(XObject = xobjdict),
    )

def go(inpfn, outfn):
    pages = PdfReader(inpfn).pages
    writer = PdfWriter()
    while pages:
        writer.addpage(get4(pages))
    writer.write(outfn)

if __name__ == '__main__':
    inpfn, = sys.argv[1:]
    outfn = '4up.' + os.path.basename(inpfn)
    go(inpfn, outfn)
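The placement arithmetic in `get4` can be tried on its own, without pdfrw installed. The following is only a sketch under stated assumptions: `four_up_offsets` and `content_stream` are hypothetical helper names that do not exist in pdfrw, and the page size is passed in as plain numbers rather than read from a `BBox`.

```python
def four_up_offsets(x_max, y_max, count=4):
    """Return the (x, y) offset of each half-scale page, as in get4()."""
    offsets = []
    for index in range(min(count, 4)):
        x = x_max * (index & 1) / 2.0    # odd indexes land in the right column
        y = y_max * (index <= 1) / 2.0   # indexes 0 and 1 land in the top row
        offsets.append((x, y))
    return offsets


def content_stream(x_max, y_max, count=4):
    """Build the same 'q ... cm /Pn Do Q' operators that get4() emits."""
    parts = []
    for index, (x, y) in enumerate(four_up_offsets(x_max, y_max, count)):
        parts.append('q 0.5 0 0 0.5 %s %s cm /P%s Do Q\n' % (x, y, index))
    return ''.join(parts)
```

For a US Letter portrait sheet (612 x 792 points) the first page lands at the top left and the fourth at the bottom right, because PDF coordinates start at the bottom-left corner.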
@ -0,0 +1,32 @@
Example programs:

4up.py -- Prints pages four-up.

alter.py -- Simple example of making a very slight modification to a PDF.

booklet.py -- Converts a PDF into a booklet.

metadata.py -- Concatenates multiple PDFs and adds metadata.

poster.py -- Changes the size of a PDF to create a poster.

print_two.py -- Prints two cut-down copies on a single sheet of paper (double-sided). Requires an uncompressed PDF.

rotate.py -- Rotates selected ranges of pages within a document.

subset.py -- Retrieves a subset of pages from a document.

watermark.py -- Adds a watermark to a PDF.

rl1/4up.py -- Same as 4up.py, using reportlab for output. Next simplest reportlab example.

rl1/booklet.py -- Version of booklet.py using reportlab for output.

rl1/platypus_pdf_template.py -- Example using a PDF page as a watermark background with reportlab.

rl1/subset.py -- Same as subset.py, using reportlab for output. Simplest reportlab example.

rl2/copy.py -- Example of how you could parse a graphics stream and then use reportlab for output.
    Works on a few different PDFs, but is probably not a suitable starting point for real
    production work without a lot of work on the library functions.
@ -0,0 +1,25 @@
#!/usr/bin/env python

'''
usage:   alter.py my.pdf

Creates alter.my.pdf

Demonstrates making a slight alteration to a preexisting PDF file.

'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter

inpfn, = sys.argv[1:]
outfn = 'alter.' + os.path.basename(inpfn)

trailer = PdfReader(inpfn)
trailer.Info.Title = 'My New Title Goes Here'
writer = PdfWriter()
writer.trailer = trailer
writer.write(outfn)
@ -0,0 +1,65 @@
#!/usr/bin/env python

'''
usage:   booklet.py my.pdf

Creates booklet.my.pdf

Pages organized in a form suitable for booklet printing.

'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfDict, PdfArray, PdfName, IndirectPdfDict
from pdfrw.buildxobj import pagexobj

def fixpage(*pages):
    pages = [pagexobj(x) for x in pages]

    class PageStuff(tuple):
        pass

    x = y = 0
    for i, page in enumerate(pages):
        index = '/P%s' % i
        shift_right = x and '1 0 0 1 %s 0 cm ' % x or ''
        stuff = PageStuff((index, page))
        stuff.stream = 'q %s%s Do Q\n' % (shift_right, index)
        x += page.BBox[2]
        y = max(y, page.BBox[3])
        pages[i] = stuff

    # Multiple copies of first page used as a placeholder to
    # get blank page on back.
    for p1, p2 in zip(pages, pages[1:]):
        if p1[1] is p2[1]:
            pages.remove(p1)

    return IndirectPdfDict(
        Type = PdfName.Page,
        Contents = PdfDict(stream=''.join(page.stream for page in pages)),
        MediaBox = PdfArray([0, 0, x, y]),
        Resources = PdfDict(
            XObject = PdfDict(pages),
        ),
    )

inpfn, = sys.argv[1:]
outfn = 'booklet.' + os.path.basename(inpfn)
pages = PdfReader(inpfn).pages

# Use page1 as a marker to print a blank at the end
if len(pages) & 1:
    pages.append(pages[0])

bigpages = []
while len(pages) > 2:
    bigpages.append(fixpage(pages.pop(), pages.pop(0)))
    bigpages.append(fixpage(pages.pop(0), pages.pop()))

bigpages += pages

PdfWriter().addpages(bigpages).write(outfn)
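The page pairing done by the `while len(pages) > 2` loop is easier to follow with plain page numbers in place of PDF page objects. This is a minimal sketch, assuming a hypothetical helper `booklet_order` (not part of pdfrw) that mirrors the pop order of the script and returns the leftover pages as one-element tuples.

```python
def booklet_order(pages):
    """Pair pages the way booklet.py does, using any list of page labels."""
    pages = list(pages)
    if len(pages) & 1:
        # Odd page count: repeat the first page as a marker for a blank back.
        pages.append(pages[0])

    sheets = []
    while len(pages) > 2:
        # Outer pair: last page on the left, first page on the right.
        sheets.append((pages.pop(), pages.pop(0)))
        # Inner pair: next page on the left, next-to-last on the right.
        sheets.append((pages.pop(0), pages.pop()))
    # Whatever remains (zero or two pages) is emitted unpaired.
    sheets += [(page,) for page in pages]
    return sheets
```

With eight pages every page ends up in a pair; with five pages the marker produces the pair `(1, 1)`, which the script later collapses to a single page beside a blank.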
@ -0,0 +1,33 @@
'''
find_xxx.py -- Find the place in the tree where xxx lives.

Ways to use:
    1) Make a copy, change 'xxx' in package to be your name; or
    2) Under Linux, just ln -s to where this is in the right tree

Created by Pat Maupin, who doesn't consider it big enough to be worth copyrighting
'''

import sys
import os

myname = __name__[5:]   # remove 'find_'
myname = os.path.join(myname, '__init__.py')

def trypath(newpath):
    path = None
    while path != newpath:
        path = newpath
        if os.path.exists(os.path.join(path, myname)):
            return path
        newpath = os.path.dirname(path)

root = trypath(__file__) or trypath(os.path.realpath(__file__))

if root is None:
    print
    print 'Warning: %s: Could not find path to development package %s' % (__file__, myname)
    print '         The import will either fail or will use system-installed libraries'
    print
elif root not in sys.path:
    sys.path.append(root)
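The upward search in `trypath` stops when `os.path.dirname` no longer changes the path, which happens at the filesystem root. The sketch below restates that loop as a standalone function; `find_package_root` and `make_demo_tree` are hypothetical names introduced only for illustration, and the temporary directory layout is an assumption, not the layout of the real source tree.

```python
import os
import tempfile


def find_package_root(start, package):
    """Walk upward from start until a directory contains package/__init__.py."""
    marker = os.path.join(package, '__init__.py')
    path = None
    newpath = start
    while path != newpath:            # dirname() of the root returns the root
        path = newpath
        if os.path.exists(os.path.join(path, marker)):
            return path
        newpath = os.path.dirname(path)
    return None


def make_demo_tree():
    """Create root/pdfrw/__init__.py and root/examples/rl1 in a temp dir."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, 'pdfrw'))
    open(os.path.join(root, 'pdfrw', '__init__.py'), 'w').close()
    deep = os.path.join(root, 'examples', 'rl1')
    os.makedirs(deep)
    return root, deep
```

Calling `find_package_root(deep, 'pdfrw')` on the demo tree returns `root`, which is the directory the script would append to `sys.path`.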
@ -0,0 +1,39 @@
#!/usr/bin/env python

'''
usage:   metadata.py <first.pdf> [<next.pdf> ...]

Creates output.pdf

This file demonstrates two features:

1) Concatenating multiple input PDFs.

2) Adding metadata to the PDF.

If you do not need to add metadata, look at subset.py, which
has a simpler interface to PdfWriter.

'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter, IndirectPdfDict

inputs = sys.argv[1:]
assert inputs
outfn = 'output.pdf'

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)

writer.trailer.Info = IndirectPdfDict(
    Title = 'your title goes here',
    Author = 'your name goes here',
    Subject = 'what is it all about?',
    Creator = 'some script goes here',
)
writer.write(outfn)
@ -0,0 +1,57 @@
#!/usr/bin/env python

'''
usage:   poster.py my.pdf

Shows how to change the size on a PDF.

Motivation:

My daughter needed to create a 48" x 36" poster, but her Mac version of Powerpoint
only wanted to output 8.5" x 11" for some reason.

'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfDict, PdfName, PdfArray, IndirectPdfDict
from pdfrw.buildxobj import pagexobj

def adjust(page):
    page = pagexobj(page)
    assert page.BBox == [0, 0, 11 * 72, int(8.5 * 72)], page.BBox
    margin = 72 // 2
    old_x, old_y = page.BBox[2] - 2 * margin, page.BBox[3] - 2 * margin

    new_x, new_y = 48 * 72, 36 * 72
    ratio = 1.0 * new_x / old_x
    assert ratio == 1.0 * new_y / old_y

    index = '/BasePage'
    x = -margin * ratio
    y = -margin * ratio
    stream = 'q %0.2f 0 0 %0.2f %s %s cm %s Do Q\n' % (ratio, ratio, x, y, index)
    xobjdict = PdfDict()
    xobjdict[index] = page

    return PdfDict(
        Type = PdfName.Page,
        Contents = PdfDict(stream=stream),
        MediaBox = PdfArray([0, 0, new_x, new_y]),
        Resources = PdfDict(XObject = xobjdict),
    )

def go(inpfn, outfn):
    reader = PdfReader(inpfn)
    page, = reader.pages
    writer = PdfWriter()
    writer.addpage(adjust(page))
    writer.trailer.Info = IndirectPdfDict(reader.Info)
    writer.write(outfn)

if __name__ == '__main__':
    inpfn, = sys.argv[1:]
    outfn = 'poster.' + os.path.basename(inpfn)
    go(inpfn, outfn)
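The scaling in `adjust` works because the letter-size page minus a half-inch margin on every side (720 x 540 points) has the same 4:3 aspect ratio as the 48" x 36" poster. Here is a small sketch of that arithmetic, assuming a hypothetical helper `poster_transform` (not part of pdfrw) and replacing the exact-equality assert of the script with a tolerance check.

```python
def poster_transform(old_size, new_size, margin):
    """Return (ratio, x, y) for scaling the margin-cropped page to new_size.

    Sizes are (width, height) in PDF points; margin is trimmed from every side.
    """
    old_x = old_size[0] - 2 * margin
    old_y = old_size[1] - 2 * margin
    ratio = 1.0 * new_size[0] / old_x
    if abs(ratio - 1.0 * new_size[1] / old_y) > 1e-9:
        raise ValueError('source and target aspect ratios differ')
    # Shift left and down so the scaled margin falls outside the new page.
    offset = -margin * ratio
    return ratio, offset, offset
```

With `old_size=(792, 612)`, `new_size=(48 * 72, 36 * 72)` and `margin=36`, the ratio comes out at 4.8 and both offsets at about -172.8 points.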
@ -0,0 +1,58 @@
#!/usr/bin/env python

'''
usage:   print_two.py my.pdf

Creates print_two.my.pdf

This is only useful when you can cut down sheets of paper to make two
small documents.  Works for double-sided only right now.

'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfArray, IndirectPdfDict

def fixpage(page, count=[0]):
    count[0] += 1
    evenpage = not (count[0] & 1)

    # For demo purposes, just go with the MediaBox and toast the others
    box = [float(x) for x in page.MediaBox]
    assert box[0] == box[1] == 0, "demo won't work on this PDF"

    for key, value in sorted(page.iteritems()):
        if 'box' in key.lower():
            del page[key]

    startsize = tuple(box[2:])
    finalsize = box[3], 2 * box[2]
    page.MediaBox = PdfArray((0, 0) + finalsize)
    page.Rotate = (int(page.Rotate or 0) + 90) % 360

    contents = page.Contents
    if contents is None:
        return page
    contents = isinstance(contents, dict) and [contents] or contents

    prefix = '0 1 -1 0 %s %s cm\n' % (finalsize[0], 0)
    if evenpage:
        prefix = '1 0 0 1 %s %s cm\n' % (0, finalsize[1]/2) + prefix
    first_prefix = 'q\n-1 0 0 -1 %s %s cm\n' % finalsize + prefix
    second_prefix = '\nQ\n' + prefix
    first_prefix = IndirectPdfDict(stream=first_prefix)
    second_prefix = IndirectPdfDict(stream=second_prefix)
    contents = PdfArray(([second_prefix] + contents) * 2)
    contents[0] = first_prefix
    page.Contents = contents
    return page


inpfn, = sys.argv[1:]
outfn = 'print_two.' + os.path.basename(inpfn)
pages = PdfReader(inpfn).pages

PdfWriter().addpages(fixpage(x) for x in pages).write(outfn)
@ -0,0 +1,57 @@
#!/usr/bin/env python

'''
usage:   4up.py my.pdf


Uses Form XObjects and reportlab to create 4up.my.pdf.

Demonstrates use of pdfrw with reportlab.

'''

import sys
import os

from reportlab.pdfgen.canvas import Canvas

import find_pdfrw
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl


def addpage(canvas, allpages):
    pages = allpages[:4]
    del allpages[:4]

    x_max = max(page.BBox[2] for page in pages)
    y_max = max(page.BBox[3] for page in pages)

    canvas.setPageSize((x_max, y_max))

    for index, page in enumerate(pages):
        x = x_max * (index & 1) / 2.0
        y = y_max * (index <= 1) / 2.0
        canvas.saveState()
        canvas.translate(x, y)
        canvas.scale(0.5, 0.5)
        canvas.doForm(makerl(canvas, page))
        canvas.restoreState()
    canvas.showPage()


def go(argv):
    inpfn, = argv
    outfn = '4up.' + os.path.basename(inpfn)

    pages = PdfReader(inpfn).pages
    pages = [pagexobj(x) for x in pages]
    canvas = Canvas(outfn)

    while pages:
        addpage(canvas, pages)
    canvas.save()

if __name__ == '__main__':
    go(sys.argv[1:])
@ -0,0 +1,9 @@
This directory contains example scripts which read in PDFs
and convert pages to PDF Form XObjects using pdfrw, and then
write out the PDFs using reportlab.

The examples, from easiest to hardest, are:

   subset.py  -- prints a subset of pages
   4up.py     -- prints pages 4-up
   booklet.py -- creates a booklet out of the pages
@ -0,0 +1,69 @@
#!/usr/bin/env python

'''
usage:   booklet.py my.pdf


Uses Form XObjects and reportlab to create booklet.my.pdf.

Demonstrates use of pdfrw with reportlab.

'''

import sys
import os

from reportlab.pdfgen.canvas import Canvas

import find_pdfrw
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl


def read_and_double(inpfn):
    pages = PdfReader(inpfn).pages
    pages = [pagexobj(x) for x in pages]
    if len(pages) & 1:
        pages.append(pages[0])  # Sentinel -- get same size for back as front

    xobjs = []
    while len(pages) > 2:
        xobjs.append((pages.pop(), pages.pop(0)))
        xobjs.append((pages.pop(0), pages.pop()))
    xobjs += [(x,) for x in pages]
    return xobjs


def make_pdf(outfn, xobjpairs):
    canvas = Canvas(outfn)
    for xobjlist in xobjpairs:
        x = y = 0
        for xobj in xobjlist:
            x += xobj.BBox[2]
            y = max(y, xobj.BBox[3])

        canvas.setPageSize((x,y))

        # Handle blank back page
        if len(xobjlist) > 1 and xobjlist[0] == xobjlist[-1]:
            xobjlist = xobjlist[:1]
            x = xobjlist[0].BBox[2]
        else:
            x = 0
        y = 0

        for xobj in xobjlist:
            canvas.saveState()
            canvas.translate(x, y)
            canvas.doForm(makerl(canvas, xobj))
            canvas.restoreState()
            x += xobj.BBox[2]
        canvas.showPage()
    canvas.save()


inpfn, = sys.argv[1:]
outfn = 'booklet.' + os.path.basename(inpfn)

make_pdf(outfn, read_and_double(inpfn))
@ -0,0 +1,33 @@
'''
find_xxx.py -- Find the place in the tree where xxx lives.

Ways to use:
    1) Make a copy, change 'xxx' in package to be your name; or
    2) Under Linux, just ln -s to where this is in the right tree

Created by Pat Maupin, who doesn't consider it big enough to be worth copyrighting
'''

import sys
import os

myname = __name__[5:]   # remove 'find_'
myname = os.path.join(myname, '__init__.py')

def trypath(newpath):
    path = None
    while path != newpath:
        path = newpath
        if os.path.exists(os.path.join(path, myname)):
            return path
        newpath = os.path.dirname(path)

root = trypath(__file__) or trypath(os.path.realpath(__file__))

if root is None:
    print
    print 'Warning: %s: Could not find path to development package %s' % (__file__, myname)
    print '         The import will either fail or will use system-installed libraries'
    print
elif root not in sys.path:
    sys.path.append(root)
@ -0,0 +1,106 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
usage:   platypus_pdf_template.py output.pdf pdf_file_to_use_as_template.pdf

Example of using pdfrw to use a pdf (page one) as the background for all
other pages together with platypus.

There is a table of contents in this example for completeness' sake.

Contributed by user asannes

"""
import sys

from reportlab.platypus import PageTemplate, BaseDocTemplate, Frame
from reportlab.platypus import NextPageTemplate, Paragraph, PageBreak
from reportlab.platypus.tableofcontents import TableOfContents
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.rl_config import defaultPageSize
from reportlab.lib.units import inch
from reportlab.graphics import renderPDF

import find_pdfrw
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl

PAGE_WIDTH = defaultPageSize[0]
PAGE_HEIGHT = defaultPageSize[1]

class MyTemplate(PageTemplate):
    """The kernel of this example, where we use pdfrw to fill in the
    background of a page before writing to it.  This could be used to fill
    in a water mark or similar."""

    def __init__(self, pdf_template_filename, name=None):
        frames = [Frame(
            0.85 * inch,
            0.5 * inch,
            PAGE_WIDTH - 1.15 * inch,
            PAGE_HEIGHT - (1.5 * inch)
            )]
        PageTemplate.__init__(self, name, frames)
        # use first page as template
        page = PdfReader(pdf_template_filename).pages[0]
        self.page_template = pagexobj(page)
        # Scale it to fill the complete page
        self.page_xscale = PAGE_WIDTH/self.page_template.BBox[2]
        self.page_yscale = PAGE_HEIGHT/self.page_template.BBox[3]

    def beforeDrawPage(self, canvas, doc):
        """Draws the background before anything else"""
        canvas.saveState()
        rl_obj = makerl(canvas, self.page_template)
        canvas.scale(self.page_xscale, self.page_yscale)
        canvas.doForm(rl_obj)
        canvas.restoreState()

class MyDocTemplate(BaseDocTemplate):
    """Used to apply heading to table of contents."""

    def afterFlowable(self, flowable):
        """Adds Heading1 to table of contents"""
        if flowable.__class__.__name__ == 'Paragraph':
            style = flowable.style.name
            text = flowable.getPlainText()
            key = '%s' % self.seq.nextf('toc')
            if style == 'Heading1':
                self.canv.bookmarkPage(key)
                self.notify('TOCEntry', [1, text, self.page, key])

def create_toc():
    """Creates the table of contents"""
    table_of_contents = TableOfContents()
    table_of_contents.dotsMinLevel = 0
    header1 = ParagraphStyle(name = 'Heading1', fontSize = 16, leading = 16)
    header2 = ParagraphStyle(name = 'Heading2', fontSize = 14, leading = 14)
    table_of_contents.levelStyles = [header1, header2]
    return [table_of_contents, PageBreak()]

def create_pdf(filename, pdf_template_filename):
    """Create the pdf, with all the contents"""
    pdf_report = open(filename, "wb")  # PDF output is binary data
    document = MyDocTemplate(pdf_report)
    templates = [ MyTemplate(pdf_template_filename, name='background') ]
    document.addPageTemplates(templates)

    styles = getSampleStyleSheet()
    elements = [NextPageTemplate('background')]
    elements.extend(create_toc())

    # Dummy content (hello world x 200)
    for i in range(200):
        elements.append(Paragraph("Hello World" + str(i), styles['Heading1']))

    document.multiBuild(elements)
    pdf_report.close()


if __name__ == '__main__':
    try:
        output, template = sys.argv[1:]
        create_pdf(output, template)
    except ValueError:
        print "Usage: %s <output> <template>" % (sys.argv[0])
@ -0,0 +1,43 @@
#!/usr/bin/env python

'''
usage:   subset.py my.pdf firstpage lastpage

Creates subset_<pagenum>_to_<pagenum>.my.pdf


Uses Form XObjects and reportlab to create output file.

Demonstrates use of pdfrw with reportlab.

'''

import sys
import os

from reportlab.pdfgen.canvas import Canvas

import find_pdfrw
from pdfrw import PdfReader
from pdfrw.buildxobj import pagexobj
from pdfrw.toreportlab import makerl


def go(inpfn, firstpage, lastpage):
    firstpage, lastpage = int(firstpage), int(lastpage)
    outfn = 'subset_%s_to_%s.%s' % (firstpage, lastpage, os.path.basename(inpfn))

    pages = PdfReader(inpfn).pages
    pages = [pagexobj(x) for x in pages[firstpage-1:lastpage]]
    canvas = Canvas(outfn)

    for page in pages:
        canvas.setPageSize(tuple(page.BBox[2:]))
        canvas.doForm(makerl(canvas, page))
        canvas.showPage()

    canvas.save()

if __name__ == '__main__':
    inpfn, firstpage, lastpage = sys.argv[1:]
    go(inpfn, firstpage, lastpage)
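The page selection in `go` treats `firstpage` and `lastpage` as 1-based and inclusive, which maps to the Python slice `[firstpage-1:lastpage]`. A minimal sketch of that mapping and of the output file name follows; `select_pages` and `output_name` are hypothetical helpers introduced for illustration, and a plain list stands in for the pages returned by `PdfReader`.

```python
import os


def select_pages(pages, firstpage, lastpage):
    """Return pages firstpage..lastpage (1-based, inclusive) from a list."""
    firstpage, lastpage = int(firstpage), int(lastpage)
    return pages[firstpage - 1:lastpage]


def output_name(inpfn, firstpage, lastpage):
    """Build the output file name the same way subset.py does."""
    return 'subset_%s_to_%s.%s' % (int(firstpage), int(lastpage),
                                   os.path.basename(inpfn))
```

Because slicing never raises on an out-of-range end, asking for pages 2 to 10 of a three-page document simply yields pages 2 and 3.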
@ -0,0 +1,5 @@
The copy.py demo in this directory parses the graphics stream from the PDF and plays it back through reportlab.

It doesn't yet handle fonts or Unicode very well.

For a more practical demo, look at the Form XObjects approach in the examples/rl1 directory.
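To give a feel for what "parsing the graphics stream" involves, here is a deliberately naive sketch: operands accumulate until an operator token arrives, and the operator's type codes say how to convert them. Everything in it is an assumption made for illustration. The `OPERATORS` table and `parse_stream` are hypothetical, the whitespace tokenizer ignores strings, arrays and comments, and the real demo relies on pdfrw's `PdfTokens` and on the `parse_*` functions in decodegraphics.py.

```python
# Hypothetical operator table: operator -> (action name, operand type codes).
# 'f' means float, echoing the params strings used in decodegraphics.py.
OPERATORS = {
    'q': ('savestate', ''),
    'Q': ('restorestate', ''),
    'cm': ('transform', 'ffffff'),
    'w': ('linewidth', 'f'),
    'm': ('move', 'ff'),
    'l': ('line', 'ff'),
    'S': ('stroke', ''),
}
CONVERT = {'f': float, 'i': int, 'n': str}


def parse_stream(text):
    """Turn a simple content stream into a list of (action, operands) calls."""
    calls = []
    operands = []
    for token in text.split():
        if token not in OPERATORS:
            operands.append(token)          # not an operator: keep as operand
            continue
        name, codes = OPERATORS[token]
        if len(operands) != len(codes):
            raise ValueError('%s expects %d operands, got %d'
                             % (token, len(codes), len(operands)))
        converted = tuple(CONVERT[c](v) for c, v in zip(codes, operands))
        calls.append((name, converted))
        operands = []                       # operands are consumed per operator
    return calls
```

A playback step would then map each action name onto the matching canvas method, which is roughly the role the `parse_*` functions play in the demo.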
@ -0,0 +1,32 @@
#!/usr/bin/env python

'''
usage:   copy.py my.pdf

Creates copy.my.pdf

Uses somewhat-functional parser.  For better results
for most things, see the Form XObject-based method.

'''

import sys
import os

from reportlab.pdfgen.canvas import Canvas

from decodegraphics import parsepage
from pdfrw import PdfReader, PdfWriter, PdfArray

inpfn, = sys.argv[1:]
outfn = 'copy.' + os.path.basename(inpfn)
pages = PdfReader(inpfn).pages
canvas = Canvas(outfn, pageCompression=0)

for page in pages:
    box = [float(x) for x in page.MediaBox]
    assert box[0] == box[1] == 0, "demo won't work on this PDF"
    canvas.setPageSize(box[2:])
    parsepage(page, canvas)
    canvas.showPage()
canvas.save()
@ -0,0 +1,378 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
'''
|
||||
This file is an example parser that will parse a graphics stream
|
||||
into a reportlab canvas.
|
||||
|
||||
Needs work on fonts and unicode, but works on a few PDFs.
|
||||
|
||||
Better to use Form XObjects for most things (see the example in rl1).
|
||||
|
||||
'''
|
||||
from inspect import getargspec
|
||||
|
||||
import find_pdfrw
|
||||
from pdfrw import PdfTokens
|
||||
from pdfrw.pdfobjects import PdfString
|
||||
|
||||
#############################################################################
# Graphics parsing

def parse_array(self, token='[', params=None):
    mylist = []
    for token in self.tokens:
        if token == ']':
            break
        mylist.append(token)
    self.params.append(mylist)

def parse_savestate(self, token='q', params=''):
    self.canv.saveState()

def parse_restorestate(self, token='Q', params=''):
    self.canv.restoreState()

def parse_transform(self, token='cm', params='ffffff'):
    self.canv.transform(*params)

def parse_linewidth(self, token='w', params='f'):
    self.canv.setLineWidth(*params)

def parse_linecap(self, token='J', params='i'):
    self.canv.setLineCap(*params)

def parse_linejoin(self, token='j', params='i'):
    self.canv.setLineJoin(*params)

def parse_miterlimit(self, token='M', params='f'):
    self.canv.setMiterLimit(*params)

def parse_dash(self, token='d', params='as'): # Array, string
    self.canv.setDash(*params)

def parse_intent(self, token='ri', params='n'):
    # TODO: add logging
    pass

def parse_flatness(self, token='i', params='i'):
    # TODO: add logging
    pass

def parse_gstate(self, token='gs', params='n'):
    # TODO: add logging
    # Could parse stuff we care about from here later
    pass

def parse_move(self, token='m', params='ff'):
    if self.gpath is None:
        self.gpath = self.canv.beginPath()
    self.gpath.moveTo(*params)
    self.current_point = params

def parse_line(self, token='l', params='ff'):
    self.gpath.lineTo(*params)
    self.current_point = params

def parse_curve(self, token='c', params='ffffff'):
    self.gpath.curveTo(*params)
    self.current_point = params[-2:]

def parse_curve1(self, token='v', params='ffff'):
    parse_curve(self, token, tuple(self.current_point) + tuple(params))

def parse_curve2(self, token='y', params='ffff'):
    parse_curve(self, token, tuple(params) + tuple(params[-2:]))

def parse_close(self, token='h', params=''):
    self.gpath.close()

def parse_rect(self, token='re', params='ffff'):
    if self.gpath is None:
        self.gpath = self.canv.beginPath()
    self.gpath.rect(*params)
    # 're' ends with an implicit close, so the current point is (x, y),
    # not (width, height)
    self.current_point = params[:2]

def parse_stroke(self, token='S', params=''):
    finish_path(self, 1, 0, 0)

def parse_close_stroke(self, token='s', params=''):
    self.gpath.close()
    finish_path(self, 1, 0, 0)

def parse_fill(self, token='f', params=''):
    finish_path(self, 0, 1, 1)

def parse_fill_compat(self, token='F', params=''):
    finish_path(self, 0, 1, 1)

def parse_fill_even_odd(self, token='f*', params=''):
    finish_path(self, 0, 1, 0)

def parse_fill_stroke_even_odd(self, token='B*', params=''):
    finish_path(self, 1, 1, 0)

def parse_fill_stroke(self, token='B', params=''):
    finish_path(self, 1, 1, 1)

def parse_close_fill_stroke_even_odd(self, token='b*', params=''):
    self.gpath.close()
    finish_path(self, 1, 1, 0)

def parse_close_fill_stroke(self, token='b', params=''):
    self.gpath.close()
    finish_path(self, 1, 1, 1)

def parse_nop(self, token='n', params=''):
    finish_path(self, 0, 0, 0)

def finish_path(self, stroke, fill, fillmode):
    if self.gpath is not None:
        canv = self.canv
        canv._fillMode, oldmode = fillmode, canv._fillMode
        canv.drawPath(self.gpath, stroke, fill)
        canv._fillMode = oldmode
        self.gpath = None

def parse_clip_path(self, token='W', params=''):
    # TODO: add logging
    pass

def parse_clip_path_even_odd(self, token='W*', params=''):
    # TODO: add logging
    pass

def parse_stroke_gray(self, token='G', params='f'):
    self.canv.setStrokeGray(*params)

def parse_fill_gray(self, token='g', params='f'):
    self.canv.setFillGray(*params)

def parse_stroke_rgb(self, token='RG', params='fff'):
    self.canv.setStrokeColorRGB(*params)

def parse_fill_rgb(self, token='rg', params='fff'):
    self.canv.setFillColorRGB(*params)

def parse_stroke_cmyk(self, token='K', params='ffff'):
    self.canv.setStrokeColorCMYK(*params)

def parse_fill_cmyk(self, token='k', params='ffff'):
    self.canv.setFillColorCMYK(*params)

#############################################################################
# Text parsing

def parse_begin_text(self, token='BT', params=''):
    assert self.tpath is None
    self.tpath = self.canv.beginText()

def parse_text_transform(self, token='Tm', params='ffffff'):
    path = self.tpath

    # Stoopid optimization to remove nop
    try:
        code = path._code
    except AttributeError:
        pass
    else:
        if code[-1] == '1 0 0 1 0 0 Tm':
            code.pop()

    path.setTextTransform(*params)

def parse_setfont(self, token='Tf', params='nf'):
    fontinfo = self.fontdict[params[0]]
    self.tpath._setFont(fontinfo.name, params[1])
    self.curfont = fontinfo

def parse_text_out(self, token='Tj', params='t'):
    text = params[0].decode(self.curfont.remap, self.curfont.twobyte)
    self.tpath.textOut(text)

def parse_TJ(self, token='TJ', params='a'):
    remap = self.curfont.remap
    twobyte = self.curfont.twobyte
    result = []
    for x in params[0]:
        if isinstance(x, PdfString):
            result.append(x.decode(remap, twobyte))
        else:
            # TODO: Adjust spacing between characters here
            int(x)
    text = ''.join(result)
    self.tpath.textOut(text)

def parse_end_text(self, token='ET', params=''):
    assert self.tpath is not None
    self.canv.drawText(self.tpath)
    self.tpath=None

def parse_move_cursor(self, token='Td', params='ff'):
    self.tpath.moveCursor(params[0], -params[1])

def parse_set_leading(self, token='TL', params='f'):
    self.tpath.setLeading(*params)

def parse_text_line(self, token='T*', params=''):
    self.tpath.textLine()

def parse_set_char_space(self, token='Tc', params='f'):
    self.tpath.setCharSpace(*params)

def parse_set_word_space(self, token='Tw', params='f'):
    self.tpath.setWordSpace(*params)

def parse_set_hscale(self, token='Tz', params='f'):
    self.tpath.setHorizScale(params[0] - 100)

def parse_set_rise(self, token='Ts', params='f'):
    self.tpath.setRise(*params)

def parse_xobject(self, token='Do', params='n'):
    # TODO: Need to do this
    pass

class FontInfo(object):
    ''' Pretty basic -- needs a lot of work to work right for all fonts
    '''
    lookup = {
        'BitstreamVeraSans' : 'Helvetica', # WRONG -- have to learn about font stuff...
    }

    def __init__(self, source):
        name = source.BaseFont[1:]
        self.name = self.lookup.get(name, name)
        self.remap = chr
        self.twobyte = False
        info = source.ToUnicode
        if not info:
            return
        info = info.stream.split('beginbfchar')[1].split('endbfchar')[0]
        info = list(PdfTokens(info))
        assert not len(info) & 1
        info2 = []
        for x in info:
            assert x[0] == '<' and x[-1] == '>' and len(x) in (4,6), x
            i = int(x[1:-1], 16)
            info2.append(i)
        self.remap = dict((x,chr(y)) for (x,y) in zip(info2[::2], info2[1::2])).get
        self.twobyte = len(info[0]) > 4

#############################################################################
# Control structures

def findparsefuncs():
    def checkname(n):
        assert n.startswith('/')
        return n

    def checkarray(a):
        assert isinstance(a, list), a
        return a

    def checktext(t):
        assert isinstance(t, PdfString)
        return t

    fixparam = dict(f=float, i=int, n=checkname, a=checkarray, s=str, t=checktext)
    fixcache = {}
    def fixlist(params):
        try:
            result = fixcache[params]
        except KeyError:
            result = tuple(fixparam[x] for x in params)
            fixcache[params] = result
        return result

    dispatch = {}
    expected_args = 'self token params'.split()
    for key, func in globals().iteritems():
        if key.startswith('parse_'):
            args, varargs, keywords, defaults = getargspec(func)
            assert args == expected_args and varargs is None \
                    and keywords is None and len(defaults) == 2, \
                    (key, args, varargs, keywords, defaults)
            token, params = defaults
            if params is not None:
                params = fixlist(params)
            value = func, params
            assert dispatch.setdefault(token, value) is value, repr(token)
    return dispatch
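
# A minimal sketch of the dispatch-table idea used by findparsefuncs() above:
# each handler declares its operator token and parameter spec as default
# arguments, and the table is built by reading those defaults back.
# Assumptions: op_save, op_width, CONVERTERS, build_dispatch and run are
# illustrative names that are not part of pdfrw, and func.__defaults__ stands
# in for the getargspec() call used in the real code.

```python
def op_save(state, token='q', params=''):
    # Handler for an operator that takes no operands.
    state.append('save')

def op_width(state, token='w', params='f'):
    # Handler for an operator that takes one float operand.
    state.append(('width', params[0]))

# One converter per spec character (hypothetical subset of the real table).
CONVERTERS = {'f': float, 'i': int, 's': str}

def build_dispatch(funcs):
    """Map each operator token to (handler, operand converters)."""
    table = {}
    for func in funcs:
        token, spec = func.__defaults__
        assert token not in table, token   # each operator may appear only once
        table[token] = (func, tuple(CONVERTERS[c] for c in spec))
    return table

def run(tokens, table):
    """Collect operands until an operator token appears, then call its handler."""
    state, pending = [], []
    for tok in tokens:
        entry = table.get(tok)
        if entry is None:
            pending.append(tok)            # not an operator, so it is an operand
            continue
        func, converters = entry
        args = [conv(raw) for conv, raw in zip(converters, pending)]
        func(state, tok, args)
        pending = []
    return state

DISPATCH = build_dispatch([op_save, op_width])
RESULT = run(['q', '2.5', 'w'], DISPATCH)
```

# Keeping the token and spec in the signature puts everything about an
# operator in one place, at the cost of relying on introspection.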

class _ParseClass(object):
    dispatch = findparsefuncs()

    @classmethod
    def parsepage(cls, page, canvas=None):
        self = cls()
        contents = page.Contents
        if contents.Filter is not None:
            raise SystemExit('Cannot parse graphics -- page encoded with %s' % contents.Filter)
        dispatch = cls.dispatch.get
        self.tokens = tokens = iter(PdfTokens(contents.stream))
        self.params = params = []
        self.canv = canvas
        self.gpath = None
        self.tpath = None
        self.fontdict = dict((x,FontInfo(y)) for (x, y) in page.Resources.Font.iteritems())

        for token in self.tokens:
            info = dispatch(token)
            if info is None:
                params.append(token)
                continue
            func, paraminfo = info
            if paraminfo is None:
                func(self, token, ())
                continue
            delta = len(params) - len(paraminfo)
            if delta:
                if delta < 0:
                    print 'Operator %s expected %s parameters, got %s' % (token, len(paraminfo), params)
                    params[:] = []
                    continue
                else:
                    print "Unparsed parameters/commands:", params[:delta]
                del params[:delta]
            paraminfo = zip(paraminfo, params)
            try:
                params[:] = [x(y) for (x,y) in paraminfo]
            except:
                for i, (x,y) in enumerate(paraminfo):
                    try:
                        x(y)
                    except:
                        raise # For now
                continue
            func(self, token, params)
            params[:] = []

def debugparser(undisturbed = set('parse_array'.split())):
    def debugdispatch():
        def getvalue(oldval):
            name = oldval[0].__name__
            def myfunc(self, token, params):
                print '%s called %s(%s)' % (token, name, ', '.join(str(x) for x in params))
            if name in undisturbed:
                myfunc = oldval[0]
            return myfunc, oldval[1]
        return dict((x, getvalue(y)) for (x,y) in _ParseClass.dispatch.iteritems())

    class _DebugParse(_ParseClass):
        dispatch = debugdispatch()

    return _DebugParse.parsepage

parsepage = _ParseClass.parsepage

if __name__ == '__main__':
    import sys
    from pdfrw import PdfReader
    parse = debugparser()
    fname, = sys.argv[1:]
    pdf = PdfReader(fname)
    for i, page in enumerate(pdf.pages):
        print '\nPage %s ------------------------------------' % i
        parse(page)
|
|
@ -0,0 +1,33 @@
|
|||
'''
find_xxx.py -- Find the place in the tree where xxx lives.

Ways to use:
    1) Make a copy, change 'xxx' in package to be your name; or
    2) Under Linux, just ln -s to where this is in the right tree

Created by Pat Maupin, who doesn't consider it big enough to be worth copyrighting
'''

import sys
import os

myname = __name__[5:] # remove 'find_'
myname = os.path.join(myname, '__init__.py')

def trypath(newpath):
    path = None
    while path != newpath:
        path = newpath
        if os.path.exists(os.path.join(path, myname)):
            return path
        newpath = os.path.dirname(path)

root = trypath(__file__) or trypath(os.path.realpath(__file__))

if root is None:
    print
    print 'Warning: %s: Could not find path to development package %s' % (__file__, myname)
    print ' The import will either fail or will use system-installed libraries'
    print
elif root not in sys.path:
    sys.path.append(root)
|
|
@ -0,0 +1,41 @@
|
|||
#!/usr/bin/env python

'''
usage:   rotate.py my.pdf rotation [page[range] ...]
         eg. rotate.py my.pdf 270 1-3 5 7-9

         Rotation must be a multiple of 90 degrees, clockwise.

Creates rotate.my.pdf with selected pages rotated. Rotates all by default.

'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter

inpfn = sys.argv[1]
rotate = sys.argv[2]
ranges = sys.argv[3:]

rotate = int(rotate)
assert rotate % 90 == 0

ranges = [[int(y) for y in x.split('-')] for x in ranges]
outfn = 'rotate.%s' % os.path.basename(inpfn)
trailer = PdfReader(inpfn)
pages = trailer.pages

if not ranges:
    ranges = [[1, len(pages)]]

for onerange in ranges:
    onerange = (onerange + onerange[-1:])[:2]
    for pagenum in range(onerange[0]-1, onerange[1]):
        pages[pagenum].Rotate = (int(pages[pagenum].inheritable.Rotate or 0) + rotate) % 360

outdata = PdfWriter()
outdata.trailer = trailer
outdata.write(outfn)
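
# A small, pdfrw-free sketch of the two calculations in the script above:
# expanding 1-based page specs into 0-based indices, and accumulating a
# /Rotate value modulo 360. The helper names expand_ranges and rotated are
# assumptions made for illustration; the script itself does this inline.

```python
def expand_ranges(specs):
    """Expand 1-based specs such as ['1-3', '5'] into 0-based page indices."""
    indices = []
    for spec in specs:
        bounds = [int(part) for part in spec.split('-')]
        # A bare page number such as '5' becomes the one-page range [5, 5].
        first, last = (bounds + bounds[-1:])[:2]
        indices.extend(range(first - 1, last))
    return indices

def rotated(current, delta):
    """Return the new /Rotate value after adding delta to the current one."""
    if delta % 90 != 0:
        raise ValueError('rotation must be a multiple of 90 degrees')
    # A page without an inherited /Rotate entry counts as unrotated.
    return (int(current or 0) + delta) % 360

PAGES = expand_ranges(['1-3', '5', '7-9'])
```

# Raising ValueError instead of asserting is a choice of this sketch, so the
# check would survive running under python -O.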
|
|
@ -0,0 +1,30 @@
|
|||
#!/usr/bin/env python

'''
usage:   subset.py my.pdf page[range] [page[range]] ...
         eg. subset.py my.pdf 1-3 5 7-9

Creates subset.my.pdf

'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter

inpfn = sys.argv[1]
ranges = sys.argv[2:]
assert ranges, "Expected at least one range"

ranges = ([int(y) for y in x.split('-')] for x in ranges)
outfn = 'subset.%s' % os.path.basename(inpfn)
pages = PdfReader(inpfn).pages
outdata = PdfWriter()

for onerange in ranges:
    onerange = (onerange + onerange[-1:])[:2]
    for pagenum in range(onerange[0], onerange[1]+1):
        outdata.addpage(pages[pagenum-1])
outdata.write(outfn)
|
|
@ -0,0 +1,114 @@
|
|||
#!/usr/bin/env python

'''
Simple example of watermarking using form xobjects (pdfrw).

usage:   watermark.py -i my.pdf -w single_page.pdf
         watermark.py -d pdfdir -w single_page.pdf [-o outdir]

With -i, creates watermark.my.pdf, with every page overlaid with
first page from single_page.pdf. With -d, watermarks every PDF
in pdfdir and writes the results to outdir (default: tmp).
'''

import sys
import os

import find_pdfrw
from pdfrw import PdfReader, PdfWriter, PdfDict, PdfName, IndirectPdfDict, PdfArray
from pdfrw.buildxobj import pagexobj

def fixpage(page, watermark):

    # Find the page's resource dictionary. Create if none
    resources = page.inheritable.Resources
    if resources is None:
        resources = page.Resources = PdfDict()

    # Find or create the parent's xobject dictionary
    xobjdict = resources.XObject
    if xobjdict is None:
        xobjdict = resources.XObject = PdfDict()

    # Allow for an infinite number of cascaded watermarks
    index = 0
    while 1:
        watermark_name = '/Watermark.%d' % index
        if watermark_name not in xobjdict:
            break
        index += 1
    xobjdict[watermark_name] = watermark

    # Turn the contents into an array if it is not already one
    contents = page.Contents
    if not isinstance(contents, PdfArray):
        contents = page.Contents = PdfArray([contents])

    # Save initial state before executing page
    contents.insert(0, IndirectPdfDict(stream='q\n'))

    # Restore initial state and append the watermark
    contents.append(IndirectPdfDict(stream='Q %s Do\n' % watermark_name))
    return page
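
# A sketch of what fixpage() above does to a page, using a plain dict and
# list in place of PdfDict and PdfArray. The helper names next_free_name and
# wrap_contents, and the sample stream, are assumptions for illustration only.

```python
def next_free_name(xobjects, prefix='/Watermark.'):
    """Return the first '/Watermark.N' key that is not yet in use."""
    index = 0
    while prefix + str(index) in xobjects:
        index += 1
    return prefix + str(index)

def wrap_contents(streams, name):
    """Bracket the page's own streams with q/Q, then draw the named XObject."""
    # 'q' saves the graphics state before the page runs and 'Q' restores it,
    # so the watermark is drawn with the page's initial state.
    return ['q\n'] + list(streams) + ['Q %s Do\n' % name]

XOBJECTS = {'/Watermark.0': 'an earlier overlay'}
NAME = next_free_name(XOBJECTS)
STREAMS = wrap_contents(['0 0 100 100 re f\n'], NAME)
```

# Because the name search skips keys already present, the same page can be
# watermarked repeatedly without one overlay replacing another.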

def watermark(input_fname, watermark_fname, output_fname=None):
    outfn = output_fname or ('watermark.' + os.path.basename(input_fname))
    w = pagexobj(PdfReader(watermark_fname).pages[0])
    pages = PdfReader(input_fname).pages
    PdfWriter().addpages([fixpage(x, w) for x in pages]).write(outfn)
    return outfn

def fix_pdf(fname, watermark_fname, indir, outdir):
    from os import mkdir, path
    if not path.exists(outdir):
        mkdir(outdir)
    watermark = pagexobj(PdfReader(watermark_fname).pages[0])
    trailer = PdfReader(path.join(indir, fname))
    for page in trailer.pages:
        fixpage(page, watermark)
    PdfWriter().write(path.join(outdir, fname), trailer)
    return len(trailer.pages)

def batch_watermark(pdfdir, watermark_fname, outputdir='tmp'):
    import traceback
    from glob import glob
    from os import path
    fnames=glob(pdfdir+"/*.pdf")
    total_pages = 0
    good_files = 0

    for fname in fnames:
        fname = fname.replace(pdfdir+'/','')
        try:
            total_pages += fix_pdf(fname, watermark_fname, pdfdir, outputdir)
            good_files += 1
            print "%s OK" %fname
        except Exception:
            print "%s Failed miserably" %fname
            print traceback.format_exc()[:2000]
            #raise

    print "success %.2f%% %s pages" %((float(good_files)/len(fnames))*100, total_pages)

if __name__ == "__main__":

    from optparse import OptionParser
    parser = OptionParser(description = __doc__)
    parser.add_option('-i', dest='input_fname', help='file name to be watermarked (pdf)')
    parser.add_option('-w', dest='watermark_fname', help='watermark file name (pdf)')
    parser.add_option('-d', dest='pdfdir', help='watermark all pdf files in this directory')
    parser.add_option('-o', dest='outdir', help='outputdir used with option -d', default='tmp')
    options, args = parser.parse_args()

    if options.input_fname and options.watermark_fname:
        watermark = pagexobj(PdfReader(options.watermark_fname).pages[0])
        outfn = 'watermark.' + os.path.basename(options.input_fname)
        pages = PdfReader(options.input_fname).pages

        PdfWriter().addpages([fixpage(x, watermark) for x in pages]).write(outfn)

    elif options.pdfdir and options.watermark_fname:
        batch_watermark(options.pdfdir, options.watermark_fname, options.outdir)

    else:
        parser.print_help()

|
|
@ -0,0 +1,16 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details

__version__ = '0.1'

from pdfrw.pdfwriter import PdfWriter
from pdfrw.pdfreader import PdfReader
from pdfrw.objects import PdfObject, PdfName, PdfArray, PdfDict, IndirectPdfDict, PdfString
from pdfrw.tokens import PdfTokens
from pdfrw.errors import PdfParseError

# Add a tiny bit of compatibility to pyPdf

PdfFileReader = PdfReader
PdfFileWriter = PdfWriter
|
|
@ -0,0 +1,249 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details

'''

This module contains code to build PDF "Form XObjects".

A Form XObject allows a fragment from one PDF file to be cleanly
included in another PDF file.

Reference for syntax: "Parameters for opening PDF files" from SDK 8.1

        http://www.adobe.com/devnet/acrobat/pdfs/pdf_open_parameters.pdf

        supported 'page=xxx', 'viewrect=<left>,<top>,<width>,<height>'

        Also supported by this, but not by Adobe:
            'rotate=xxx' where xxx in [0, 90, 180, 270]

        Units are in points


Reference for content: Adobe PDF reference, sixth edition, version 1.7

        http://www.adobe.com/devnet/acrobat/pdfs/pdf_reference_1-7.pdf

        Form xobjects are discussed in chapter 4.9, page 355
'''

from pdfrw.objects import PdfDict, PdfArray, PdfName
from pdfrw.pdfreader import PdfReader
from pdfrw.errors import log

class ViewInfo(object):
    ''' Instantiate ViewInfo with a uri, and it will parse out
        the filename, page, and viewrect into object attributes.
    '''
    doc = None
    docname = None
    page = None
    viewrect = None
    rotate = None

    def __init__(self, pageinfo='', **kw):
        pageinfo=pageinfo.split('#',1)
        if len(pageinfo) == 2:
            pageinfo[1:] = pageinfo[1].replace('&', '#').split('#')
        for key in 'page viewrect'.split():
            if pageinfo[0].startswith(key+'='):
                break
        else:
            self.docname = pageinfo.pop(0)
        for item in pageinfo:
            key, value = item.split('=')
            key = key.strip()
            value = value.replace(',', ' ').split()
            if key in ('page', 'rotate'):
                assert len(value) == 1
                setattr(self, key, int(value[0]))
            elif key == 'viewrect':
                assert len(value) == 4
                setattr(self, key, [float(x) for x in value])
            else:
                log.error('Unknown option: %s', key)
        for key, value in kw.iteritems():
            assert hasattr(self, key), key
            setattr(self, key, value)
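
# A standalone sketch of the URI syntax that ViewInfo accepts, for example
# 'doc.pdf#page=2&viewrect=10,20,100,50'. The function parse_view_uri is a
# hypothetical helper, not part of pdfrw; unlike ViewInfo it always treats the
# text before '#' as the document name and silently ignores unknown keys.

```python
def parse_view_uri(uri):
    """Split a view URI into (docname, options)."""
    docname, _, fragment = uri.partition('#')
    options = {}
    # Options may be separated by either '&' or '#'.
    for item in fragment.replace('&', '#').split('#'):
        if not item:
            continue
        key, value = item.split('=')
        key = key.strip()
        # Values may be separated by commas or whitespace.
        values = value.replace(',', ' ').split()
        if key in ('page', 'rotate'):
            options[key] = int(values[0])
        elif key == 'viewrect':
            # left, top, width, height, in points
            options[key] = [float(v) for v in values]
    return docname, options

PARSED = parse_view_uri('doc.pdf#page=2&viewrect=10,20,100,50')
```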

def get_rotation(rotate):
    ''' Return clockwise rotation code:
          0 = unrotated
          1 = 90 degrees
          2 = 180 degrees
          3 = 270 degrees
    '''
    try:
        rotate = int(rotate)
    except (ValueError, TypeError):
        return 0
    if rotate % 90 != 0:
        return 0
    # Floor division keeps the code an integer on any Python version
    return rotate // 90

def rotate_point(point, rotation):
    ''' Rotate an (x,y) coordinate clockwise by a
        rotation code specifying a multiple of 90 degrees.
    '''
    if rotation & 1:
        point = point[1], -point[0]
    if rotation & 2:
        point = -point[0], -point[1]
    return point

def rotate_rect(rect, rotation):
    ''' Rotate both points within the rectangle, then normalize
        the rectangle by returning the new lower left, then new
        upper right.
    '''
    rect = rotate_point(rect[:2], rotation) + rotate_point(rect[2:], rotation)
    return (min(rect[0], rect[2]), min(rect[1], rect[3]),
            max(rect[0], rect[2]), max(rect[1], rect[3]))
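
# A worked example of the rotation codes, with the two helpers above repeated
# so that the block runs on its own. The LETTER rectangle is an assumed sample
# value (a 612 x 792 point page). A negative code undoes a rotation because
# only the low two bits of the code are inspected.

```python
def rotate_point(point, rotation):
    # Bit 0 of the code means a 90 degree clockwise turn, bit 1 means 180.
    if rotation & 1:
        point = point[1], -point[0]
    if rotation & 2:
        point = -point[0], -point[1]
    return point

def rotate_rect(rect, rotation):
    # Rotate both corners, then reorder so lower left comes first.
    rect = rotate_point(rect[:2], rotation) + rotate_point(rect[2:], rotation)
    return (min(rect[0], rect[2]), min(rect[1], rect[3]),
            max(rect[0], rect[2]), max(rect[1], rect[3]))

LETTER = (0, 0, 612, 792)
QUARTER_TURN = rotate_rect(LETTER, 1)        # (0, -612, 792, 0)
RESTORED = rotate_rect(QUARTER_TURN, -1)     # back to LETTER
```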

def getrects(inheritable, pageinfo, rotation):
    ''' Given the inheritable attributes of a page and
        the desired pageinfo rectangle, return the page's
        media box and the calculated boundary (clip) box.
    '''
    mbox = tuple([float(x) for x in inheritable.MediaBox])
    vrect = pageinfo.viewrect
    if vrect is None:
        cbox = tuple([float(x) for x in (inheritable.CropBox or mbox)])
    else:
        # Rotate the media box to match what the user sees,
        # figure out the clipping box, then rotate back
        mleft, mbot, mright, mtop = rotate_rect(mbox, rotation)
        x, y, w, h = vrect
        cleft = mleft + x
        ctop = mtop - y
        cright = cleft + w
        cbot = ctop - h
        cbox = max(mleft, cleft), max(mbot, cbot), min(mright, cright), min(mtop, ctop)
        cbox = rotate_rect(cbox, -rotation)
    return mbox, cbox


def _cache_xobj(contents, resources, mbox, bbox, rotation):
    ''' Return a cached Form XObject, or create a new one and cache it.
        Adds private members x, y, w, h
    '''
    cachedict = contents.xobj_cachedict
    if cachedict is None:
        cachedict = contents.private.xobj_cachedict = {}
    cachekey = mbox, bbox, rotation
    result = cachedict.get(cachekey)
    if result is None:
        func = (_get_fullpage, _get_subpage)[mbox != bbox]
        result = PdfDict(
            func(contents, resources, mbox, bbox, rotation),
            Type = PdfName.XObject,
            Subtype = PdfName.Form,
            FormType = 1,
            BBox = PdfArray(bbox),
        )
        rect = bbox
        if rotation:
            matrix = rotate_point((1, 0), rotation) + rotate_point((0, 1), rotation)
            result.Matrix = PdfArray(matrix + (0, 0))
            rect = rotate_rect(rect, rotation)

        result.private.x = rect[0]
        result.private.y = rect[1]
        result.private.w = rect[2] - rect[0]
        result.private.h = rect[3] - rect[1]
        cachedict[cachekey] = result
    return result

def _get_fullpage(contents, resources, mbox, bbox, rotation):
    ''' fullpage is easy. Just copy the contents,
        set up the resources, and let _cache_xobj handle the
        rest.
    '''
    return PdfDict(contents, Resources=resources)

def _get_subpage(contents, resources, mbox, bbox, rotation):
    ''' subpages *could* be as easy as full pages, but we
        choose to complicate life by creating a Form XObject
        for the page, and then one that references it for
        the subpage, on the off-chance that we want multiple
        items from the page.
    '''
    return PdfDict(
        stream = '/FullPage Do\n',
        Resources = PdfDict(
            XObject = PdfDict(
                FullPage = _cache_xobj(contents, resources, mbox, mbox, 0)
            )
        )
    )

def pagexobj(page, viewinfo=ViewInfo(), allow_compressed=True):
    ''' pagexobj creates and returns a Form XObject for
        a given view within a page (Defaults to entire page.)
    '''
    inheritable = page.inheritable
    resources = inheritable.Resources
    rotation = get_rotation(inheritable.Rotate)
    mbox, bbox = getrects(inheritable, viewinfo, rotation)
    rotation += get_rotation(viewinfo.rotate)
    contents = page.Contents
    # Make sure the only attribute is length
    # All the filters must have been executed
    assert int(contents.Length) == len(contents.stream)
    if not allow_compressed:
        assert len([x for x in contents.iteritems()]) == 1
    return _cache_xobj(contents, resources, mbox, bbox, rotation)


def docxobj(pageinfo, doc=None, allow_compressed=True):
    ''' docxobj creates and returns an actual Form XObject.
        Can work standalone, or in conjunction with
        the CacheXObj class (below).
    '''
    if not isinstance(pageinfo, ViewInfo):
        pageinfo = ViewInfo(pageinfo)

    # If we're explicitly passed a document,
    # make sure we don't have one implicitly as well.
    # If no implicit or explicit doc, then read one in
    # from the filename.
    if doc is not None:
        assert pageinfo.doc is None
        pageinfo.doc = doc
    elif pageinfo.doc is not None:
        doc = pageinfo.doc
    else:
        doc = pageinfo.doc = PdfReader(pageinfo.docname, decompress = not allow_compressed)
    assert isinstance(doc, PdfReader)

    sourcepage = doc.pages[(pageinfo.page or 1) - 1]
    return pagexobj(sourcepage, pageinfo, allow_compressed)


class CacheXObj(object):
    ''' Use to keep from reparsing files over and over,
        and to keep from making the output too much
        bigger than it ought to be by replicating
        unnecessary object copies.
    '''
    def __init__(self, decompress=False):
        ''' Set decompress true if you need
            the Form XObjects to be decompressed.
            Will decompress what it can and scream
            about the rest.
        '''
        self.cached_pdfs = {}
        self.decompress = decompress

    def load(self, sourcename):
        ''' Load a Form XObject from a uri
        '''
        info = ViewInfo(sourcename)
        fname = info.docname
        pcache = self.cached_pdfs
        doc = pcache.get(fname)
        if doc is None:
            doc = pcache[fname] = PdfReader(fname, decompress=self.decompress)
        return docxobj(info, doc, allow_compressed=not self.decompress)
|
|
@ -0,0 +1,26 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details

'''
Currently, this sad little file only knows how to compress
using the flate (zlib) algorithm. Maybe more later, but it's
not a priority for me...
'''
import zlib
from pdfrw.objects import PdfDict, PdfName
from pdfrw.errors import log
from pdfrw.uncompress import streamobjects

def compress(mylist):
    flate = PdfName.FlateDecode
    for obj in streamobjects(mylist):
        ftype = obj.Filter
        if ftype is not None:
            continue
        oldstr = obj.stream
        newstr = zlib.compress(oldstr)
        if len(newstr) < len(oldstr) + 30:
            obj.stream = newstr
            obj.Filter = flate
            obj.DecodeParms = None
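
# A pdfrw-free sketch of the size test in compress() above, applied to a raw
# byte string. The function name flate_if_worthwhile, the slack parameter and
# the sample DATA are assumptions for illustration; only zlib.compress and
# zlib.decompress are real library calls.

```python
import zlib

def flate_if_worthwhile(data, slack=30):
    """Return (payload, compressed_flag) using the same comparison as compress()."""
    packed = zlib.compress(data)
    # The packed form is kept unless it is at least `slack` bytes larger
    # than the original.
    if len(packed) < len(data) + slack:
        return packed, True
    return data, False

DATA = b'0 0 m 10 10 l S\n' * 200
PACKED, USED = flate_if_worthwhile(DATA)
```

# A repetitive content stream like DATA shrinks a great deal, so the
# compressed payload is the one returned here.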
|
|
@ -0,0 +1,31 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details

'''
PDF Exceptions and error handling
'''

import logging
from exceptions import Exception


logging.basicConfig(
    format='[%(levelname)s] %(filename)s:%(lineno)d %(message)s',
    level=logging.WARNING)

log = logging.getLogger('pdfrw')


class PdfError(Exception):
    "Abstract base class of exceptions thrown by this module"
    def __init__(self, msg):
        self.msg = msg
    def __str__(self):
        return self.msg

class PdfParseError(PdfError):
    "Error thrown by parser/tokenizer"

class PdfOutputError(PdfError):
    "Error thrown by PDF writer"
|
|
@ -0,0 +1,16 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details

'''
Objects that can occur in PDF files. The most important
objects are arrays and dicts. Either of these can be
indirect or not, and dicts could have an associated
stream.
'''
from pdfrw.objects.pdfname import PdfName
from pdfrw.objects.pdfdict import PdfDict, IndirectPdfDict
from pdfrw.objects.pdfarray import PdfArray
from pdfrw.objects.pdfobject import PdfObject
from pdfrw.objects.pdfstring import PdfString
from pdfrw.objects.pdfindirect import PdfIndirect
|
|
@ -0,0 +1,59 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
# MIT license -- See LICENSE.txt for details

from pdfrw.objects.pdfindirect import PdfIndirect
from pdfrw.objects.pdfobject import PdfObject

def _resolved():
    pass

class PdfArray(list):
    ''' A PdfArray maps the PDF file array object into a Python list.
        It has an indirect attribute which defaults to False.
    '''
    indirect = False

    def __init__(self, source=[]):
        self._resolve = self._resolver
        self.extend(source)

    def _resolver(self, isinstance=isinstance, enumerate=enumerate,
                        listiter=list.__iter__,
                        PdfIndirect=PdfIndirect, resolved=_resolved,
                        PdfNull=PdfObject('null')):
        for index, value in enumerate(list.__iter__(self)):
            if isinstance(value, PdfIndirect):
                value = value.real_value()
                if value is None:
                    value = PdfNull
                self[index] = value
        self._resolve = resolved

    def __getitem__(self, index, listget=list.__getitem__):
        self._resolve()
        return listget(self, index)

    # Python 2 calls __getslice__ with both slice bounds (i, j)
    def __getslice__(self, i, j, listget=list.__getslice__):
        self._resolve()
        return listget(self, i, j)

    def __iter__(self, listiter=list.__iter__):
        self._resolve()
        return listiter(self)

    def count(self, item):
        self._resolve()
        return list.count(self, item)
    def index(self, item):
        self._resolve()
        return list.index(self, item)
    def remove(self, item):
        self._resolve()
        return list.remove(self, item)
    def sort(self, *args, **kw):
        self._resolve()
        return list.sort(self, *args, **kw)
    def pop(self, *args):
        self._resolve()
        return list.pop(self, *args)
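
# A sketch of the resolve-once pattern PdfArray uses above: the first read
# replaces deferred items with their real values, then swaps the instance's
# _resolve attribute for a no-op so later reads skip the scan. Deferred and
# LazyList are hypothetical stand-ins written for this example, not pdfrw
# classes.

```python
class Deferred(object):
    """Stand-in for an indirect reference that can produce its real value."""
    def __init__(self, value):
        self.value = value

    def real_value(self):
        return self.value

class LazyList(list):
    def __init__(self, source=()):
        list.__init__(self, source)
        self._resolve = self._resolve_once

    def _resolve_once(self):
        # Use the plain list methods here so the scan does not call itself.
        for index, value in enumerate(list.__iter__(self)):
            if isinstance(value, Deferred):
                list.__setitem__(self, index, value.real_value())
        # Later calls hit this no-op instead of rescanning the list.
        self._resolve = lambda: None

    def __getitem__(self, index):
        self._resolve()
        return list.__getitem__(self, index)

    def __iter__(self):
        self._resolve()
        return list.__iter__(self)

ITEMS = LazyList([1, Deferred(2), Deferred(3)])
```

# Every public read path has to call _resolve() first, which is why PdfArray
# overrides count, index, remove, sort and pop as well.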
|
|
@ -0,0 +1,205 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
from pdfrw.objects.pdfname import PdfName
|
||||
from pdfrw.objects.pdfindirect import PdfIndirect
|
||||
from pdfrw.objects.pdfobject import PdfObject
|
||||
|
||||
class _DictSearch(object):
|
||||
''' Used to search for inheritable attributes.
|
||||
'''
|
||||
def __init__(self, basedict):
|
||||
self.basedict = basedict
|
||||
def __getattr__(self, name, PdfName=PdfName):
|
||||
return self[PdfName(name)]
|
||||
def __getitem__(self, name, set=set, getattr=getattr, id=id):
|
||||
visited = set()
|
||||
mydict = self.basedict
|
||||
while 1:
|
||||
value = mydict[name]
|
||||
if value is not None:
|
||||
return value
|
||||
myid = id(mydict)
|
||||
assert myid not in visited
|
||||
visited.add(myid)
|
||||
mydict = mydict.Parent
|
||||
if mydict is None:
|
||||
return
|
||||
|
||||
class _Private(object):
|
||||
''' Used to store private attributes (not output to PDF files)
|
||||
on PdfDict classes
|
||||
'''
|
||||
def __init__(self, pdfdict):
|
||||
vars(self)['pdfdict'] = pdfdict
|
||||
def __setattr__(self, name, value):
|
||||
vars(self.pdfdict)[name] = value
|
||||
|
||||
class PdfDict(dict):
|
||||
''' PdfDict objects are subclassed dictionaries with the following features:
|
||||
|
||||
- Every key in the dictionary starts with "/"
|
||||
|
||||
- A dictionary item can be deleted by assigning None to it
|
||||
|
||||
- Keys that (after the initial "/") conform to Python naming conventions
|
||||
can also be accessed (set and retrieved) as attributes of the dictionary.
|
||||
E.g. mydict.Page is the same thing as mydict['/Page']
|
||||
|
||||
- Private attributes (not in the PDF space) can be set on the dictionary
|
||||
object attribute dictionary by using the private attribute:
|
||||
|
||||
mydict.private.foo = 3
|
||||
mydict.foo = 5
|
||||
x = mydict.foo # x will now contain 3
|
||||
y = mydict['/foo'] # y will now contain 5
|
||||
|
||||
Most standard Adobe dictionary keys start with an upper case letter,
|
||||
so to avoid conflicts, it is best to start private attributes with
|
||||
lower case letters.
|
||||
|
||||
- PdfDicts have the following read-only properties:
|
||||
|
||||
- private -- as discussed above, provides write access to dictionary's
|
||||
attributes
|
||||
- inheritable -- this creates and returns a "view" attribute that
|
||||
will search through the object hierarchy for any desired
|
||||
attribute, such as /Rotate or /MediaBox
|
||||
|
||||
- PdfDicts also have the following special attributes:
|
||||
- indirect is not stored in the PDF dictionary, but in the object's
|
||||
attribute dictionary
|
||||
- stream is also stored in the object's attribute dictionary
|
||||
and will also update the stream length.
|
||||
- _stream will store in the object's attribute dictionary without
|
||||
updating the stream length.
|
||||
|
||||
It is possible, for example, to have a PDF name such as "/indirect"
|
||||
or "/stream", but you cannot access such a name as an attribute:
|
||||
|
||||
mydict.indirect -- accesses object's attribute dictionary
|
||||
mydict["/indirect"] -- accesses actual PDF dictionary
|
||||
'''
|
||||
indirect = False
|
||||
stream = None
|
||||
|
||||
_special = dict(indirect = ('indirect', False),
|
||||
stream = ('stream', True),
|
||||
_stream = ('stream', False),
|
||||
)
|
||||
|
||||
def __setitem__(self, name, value, setter=dict.__setitem__):
|
||||
assert name.startswith('/'), name
|
||||
if value is not None:
|
||||
setter(self, name, value)
|
||||
elif name in self:
|
||||
del self[name]
|
||||
|
||||
def __init__(self, *args, **kw):
|
||||
if args:
|
||||
if len(args) == 1:
|
||||
args = args[0]
|
||||
self.update(args)
|
||||
if isinstance(args, PdfDict):
|
||||
self.indirect = args.indirect
|
||||
self._stream = args.stream
|
||||
for key, value in kw.iteritems():
|
||||
setattr(self, key, value)
|
||||
|
||||
def __getattr__(self, name, PdfName=PdfName):
|
||||
''' If the attribute doesn't exist on the dictionary object,
|
||||
try to slap a '/' in front of it and get it out
|
||||
of the actual dictionary itself.
|
||||
'''
|
||||
return self.get(PdfName(name))
|
||||
|
||||
def get(self, key, dictget=dict.get, isinstance=isinstance, PdfIndirect=PdfIndirect):
|
||||
''' Get a value out of the dictionary, after resolving any indirect objects.
|
||||
'''
|
||||
value = dictget(self, key)
|
||||
if isinstance(value, PdfIndirect):
|
||||
self[key] = value = value.real_value()
|
||||
return value
|
||||
|
||||
def __getitem__(self, key):
|
||||
return self.get(key)
|
||||
|
||||
def __setattr__(self, name, value, special=_special.get, PdfName=PdfName, vars=vars):
|
||||
''' Set an attribute on the dictionary. Handle the keywords
|
||||
indirect, stream, and _stream specially (for content objects)
|
||||
'''
|
||||
info = special(name)
|
||||
if info is None:
|
||||
self[PdfName(name)] = value
|
||||
else:
|
||||
name, setlen = info
|
||||
vars(self)[name] = value
|
||||
if setlen:
|
||||
notnone = value is not None
|
||||
self.Length = notnone and PdfObject(len(value)) or None
|
||||
|
||||
def iteritems(self, dictiter=dict.iteritems, isinstance=isinstance, PdfIndirect=PdfIndirect):
|
||||
''' Iterate over the dictionary, resolving any unresolved objects
|
||||
'''
|
||||
for key, value in list(dictiter(self)):
|
||||
if isinstance(value, PdfIndirect):
|
||||
self[key] = value = value.real_value()
|
||||
if value is not None:
|
||||
assert key.startswith('/'), (key, value)
|
||||
yield key, value
|
||||
|
||||
def items(self):
|
||||
return list(self.iteritems())
|
||||
def itervalues(self):
|
||||
for key, value in self.iteritems():
|
||||
yield value
|
||||
def values(self):
|
||||
return list((value for key, value in self.iteritems()))
|
||||
def keys(self):
|
||||
return list((key for key, value in self.iteritems()))
|
||||
def __iter__(self):
|
||||
for key, value in self.iteritems():
|
||||
yield key
|
||||
def iterkeys(self):
|
||||
return iter(self)
|
||||
|
||||
def copy(self):
|
||||
return type(self)(self)
|
||||
|
||||
def pop(self, key):
|
||||
value = self.get(key)
|
||||
del self[key]
|
||||
return value
|
||||
|
||||
def popitem(self):
|
||||
key, value = dict.popitem(self)
|
||||
if isinstance(value, PdfIndirect):
|
||||
value = value.real_value()
|
||||
return key, value
|
||||
|
||||
def inheritable(self):
|
||||
''' Search through ancestors as needed for inheritable
|
||||
dictionary items.
|
||||
NOTE: You might think it would be a good idea
|
||||
to cache this class, but then you'd have to worry
|
||||
about it pointing to the wrong dictionary if you
|
||||
made a copy of the object...
|
||||
'''
|
||||
return _DictSearch(self)
|
||||
inheritable = property(inheritable)
|
||||
|
||||
def private(self):
|
||||
''' Allows setting private metadata for use in
|
||||
processing (not sent to PDF file).
|
||||
See note on inheritable
|
||||
'''
|
||||
return _Private(self)
|
||||
private = property(private)
|
||||
|
||||
class IndirectPdfDict(PdfDict):
|
||||
''' IndirectPdfDict is a convenience class. You could
|
||||
create a direct PdfDict and then set indirect = True on it,
|
||||
or you could just create an IndirectPdfDict.
|
||||
'''
|
||||
indirect = True
|
|
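The PdfDict conventions described in the docstring above (keys start with "/", assigning None deletes an item, attribute access maps to "/Name" keys) can be illustrated with a minimal standalone sketch. The class name `SlashDict` is hypothetical and is not part of pdfrw; it leaves out indirect-object resolution, the `private` view and stream handling.

```python
class SlashDict(dict):
    """Hypothetical stand-in for PdfDict; not the pdfrw implementation."""

    def __setitem__(self, name, value):
        # Every key must look like a PDF name.
        assert name.startswith('/'), name
        if value is not None:
            dict.__setitem__(self, name, value)
        elif name in self:
            # Assigning None removes the item instead of storing it.
            del self[name]

    def __getitem__(self, name):
        # Missing keys read back as None rather than raising KeyError.
        return dict.get(self, name)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        return self['/' + name]

    def __setattr__(self, name, value):
        self['/' + name] = value


d = SlashDict()
d.Rotate = 90
d.MediaBox = [0, 0, 612, 792]
d.Rotate = None          # deletes '/Rotate'
```

After these statements `d` holds only the `/MediaBox` entry, and `d.Rotate` reads back as None.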
@ -0,0 +1,20 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
class _NotLoaded(object):
|
||||
pass
|
||||
|
||||
class PdfIndirect(tuple):
|
||||
''' A placeholder for an object that hasn't been read in yet.
|
||||
The object itself is the (object number, generation number) tuple.
|
||||
The attributes include information about where the object is
|
||||
referenced from and the file object to retrieve the real object from.
|
||||
'''
|
||||
value = _NotLoaded
|
||||
|
||||
def real_value(self, NotLoaded=_NotLoaded):
|
||||
value = self.value
|
||||
if value is NotLoaded:
|
||||
value = self.value = self._loader(self)
|
||||
return value
|
|
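The placeholder pattern used by PdfIndirect (a tuple that loads its real value on first use and caches it) can be sketched on its own as follows. `LazyRef` and `fake_loader` are hypothetical names used only for illustration; in the source above the loader is attached by the reader.

```python
class _NotLoaded(object):
    """Sentinel meaning 'nothing has been loaded yet'."""


class LazyRef(tuple):
    """Hypothetical stand-in for PdfIndirect: an (objnum, gennum) tuple."""
    value = _NotLoaded

    def real_value(self):
        # Load once, then reuse the cached value on later calls.
        if self.value is _NotLoaded:
            self.value = self._loader(self)
        return self.value


calls = []

def fake_loader(ref):
    # Assumed loader for the sketch; a real one would parse the file.
    calls.append(tuple(ref))
    return 'object %d %d' % tuple(ref)


ref = LazyRef((12, 0))
ref._loader = fake_loader
first = ref.real_value()
second = ref.real_value()   # served from the cache, loader not called again
```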
@ -0,0 +1,17 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
from pdfrw.objects.pdfobject import PdfObject
|
||||
|
||||
class PdfName(object):
|
||||
''' PdfName is a simple way to get a PDF name from a string:
|
||||
|
||||
PdfName.FooBar == PdfObject('/FooBar')
|
||||
'''
|
||||
def __getattr__(self, name):
|
||||
return self(name)
|
||||
def __call__(self, name, PdfObject=PdfObject):
|
||||
return PdfObject('/' + name)
|
||||
PdfName = PdfName()
|
||||
|
|
@ -0,0 +1,10 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
class PdfObject(str):
|
||||
''' A PdfObject is a textual representation of any PDF file object
|
||||
other than an array, dict or string. It has an indirect attribute
|
||||
which defaults to False.
|
||||
'''
|
||||
indirect = False
|
|
@ -0,0 +1,73 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
import re
|
||||
|
||||
class PdfString(str):
|
||||
''' A PdfString is an encoded string. It has a decode
|
||||
method to get the actual string data out, and there
|
||||
is an encode class method to create such a string.
|
||||
Like any PDF object, it could be indirect, but it
|
||||
defaults to being a direct object.
|
||||
'''
|
||||
indirect = False
|
||||
unescape_dict = {'\\b':'\b', '\\f':'\f', '\\n':'\n',
|
||||
'\\r':'\r', '\\t':'\t',
|
||||
'\\\r\n': '', '\\\r':'', '\\\n':'',
|
||||
'\\\\':'\\', '\\':'',
|
||||
}
|
||||
unescape_pattern = r'(\\\\|\\b|\\f|\\n|\\r|\\t|\\\r\n|\\\r|\\\n|\\[0-7]{1,3}|\\)'
|
||||
unescape_func = re.compile(unescape_pattern).split
|
||||
|
||||
hex_pattern = '([a-fA-F0-9][a-fA-F0-9]|[a-fA-F0-9])'
|
||||
hex_func = re.compile(hex_pattern).split
|
||||
|
||||
hex_pattern2 = '([a-fA-F0-9][a-fA-F0-9][a-fA-F0-9][a-fA-F0-9]|[a-fA-F0-9][a-fA-F0-9]|[a-fA-F0-9])'
|
||||
hex_func2 = re.compile(hex_pattern2).split
|
||||
|
||||
hex_funcs = hex_func, hex_func2
|
||||
|
||||
def decode_regular(self, remap=chr):
|
||||
assert self[0] == '(' and self[-1] == ')'
|
||||
mylist = self.unescape_func(self[1:-1])
|
||||
result = []
|
||||
unescape = self.unescape_dict.get
|
||||
for chunk in mylist:
|
||||
chunk = unescape(chunk, chunk)
|
||||
if chunk.startswith('\\') and len(chunk) > 1:
|
||||
value = int(chunk[1:], 8)
|
||||
# FIXME: TODO: Handle unicode here
|
||||
if value > 127:
|
||||
value = 127
|
||||
chunk = remap(value)
|
||||
if chunk:
|
||||
result.append(chunk)
|
||||
return ''.join(result)
|
||||
|
||||
def decode_hex(self, remap=chr, twobytes=False):
|
||||
data = ''.join(self.split())
|
||||
data = self.hex_funcs[twobytes](data)
|
||||
chars = data[1::2]
|
||||
other = data[0::2]
|
||||
assert other[0] == '<' and other[-1] == '>' and ''.join(other) == '<>', self
|
||||
return ''.join([remap(int(x, 16)) for x in chars])
|
||||
|
||||
def decode(self, remap=chr, twobytes=False):
|
||||
if self.startswith('('):
|
||||
return self.decode_regular(remap)
|
||||
|
||||
else:
|
||||
return self.decode_hex(remap, twobytes)
|
||||
|
||||
def encode(cls, source, usehex=False):
|
||||
assert not usehex, "Not supported yet"
|
||||
if isinstance(source, unicode):
|
||||
source = source.encode('utf-8')
|
||||
else:
|
||||
source = str(source)
|
||||
source = source.replace('\\', '\\\\')
|
||||
source = source.replace('(', '\\(')
|
||||
source = source.replace(')', '\\)')
|
||||
return cls('(' + source + ')')
|
||||
encode = classmethod(encode)
|
|
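The split-and-lookup approach of decode_regular can be tried in isolation with the sketch below. `decode_literal` is a hypothetical helper, not a pdfrw function. It assumes octal escapes of one to three octal digits, and unlike the method above it does not clamp values above 127.

```python
import re

# Escape sequences with a fixed replacement; a lone backslash is dropped.
UNESCAPES = {'\\b': '\b', '\\f': '\f', '\\n': '\n', '\\r': '\r', '\\t': '\t',
             '\\\r\n': '', '\\\r': '', '\\\n': '', '\\\\': '\\', '\\': ''}

# The capturing group makes split() keep the escapes as separate chunks.
SPLIT = re.compile(
    r'(\\\\|\\b|\\f|\\n|\\r|\\t|\\\r\n|\\\r|\\\n|\\[0-7]{1,3}|\\)').split


def decode_literal(text):
    """Decode a parenthesised PDF literal string (illustrative only)."""
    assert text[0] == '(' and text[-1] == ')'
    out = []
    for chunk in SPLIT(text[1:-1]):
        chunk = UNESCAPES.get(chunk, chunk)
        if chunk.startswith('\\') and len(chunk) > 1:
            # Whatever still starts with a backslash is an octal escape.
            chunk = chr(int(chunk[1:], 8))
        out.append(chunk)
    return ''.join(out)


decoded = decode_literal('(a\\(b\\)\\101\\n)')
```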
@ -0,0 +1,433 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
'''
|
||||
The PdfReader class reads an entire PDF file into memory and
|
||||
parses the top-level container objects. (It does not parse
|
||||
into streams.) The object subclasses PdfDict, and the
|
||||
document pages are stored in a list in the pages attribute
|
||||
of the object.
|
||||
'''
|
||||
import gc
|
||||
|
||||
from pdfrw.errors import PdfParseError, log
|
||||
from pdfrw.tokens import PdfTokens
|
||||
from pdfrw.objects import PdfDict, PdfArray, PdfName, PdfObject, PdfIndirect
|
||||
from pdfrw.uncompress import uncompress
|
||||
|
||||
class PdfReader(PdfDict):
|
||||
|
||||
warned_bad_stream_start = False # Use to keep from spewing warnings
|
||||
warned_bad_stream_end = False # Use to keep from spewing warnings
|
||||
|
||||
def findindirect(self, objnum, gennum, PdfIndirect=PdfIndirect, int=int):
|
||||
''' Return a previously loaded indirect object, or create
|
||||
a placeholder for it.
|
||||
'''
|
||||
key = int(objnum), int(gennum)
|
||||
result = self.indirect_objects.get(key)
|
||||
if result is None:
|
||||
self.indirect_objects[key] = result = PdfIndirect(key)
|
||||
self.deferred_objects.add(key)
|
||||
result._loader = self.loadindirect
|
||||
return result
|
||||
|
||||
def readarray(self, source, PdfArray=PdfArray):
|
||||
''' Found a [ token. Parse the tokens after that.
|
||||
'''
|
||||
specialget = self.special.get
|
||||
result = []
|
||||
pop = result.pop
|
||||
append = result.append
|
||||
|
||||
for value in source:
|
||||
if value in ']R':
|
||||
if value == ']':
|
||||
break
|
||||
generation = pop()
|
||||
value = self.findindirect(pop(), generation)
|
||||
else:
|
||||
func = specialget(value)
|
||||
if func is not None:
|
||||
value = func(source)
|
||||
append(value)
|
||||
return PdfArray(result)
|
||||
|
||||
def readdict(self, source, PdfDict=PdfDict):
|
||||
''' Found a << token. Parse the tokens after that.
|
||||
'''
|
||||
specialget = self.special.get
|
||||
result = PdfDict()
|
||||
next = source.next
|
||||
|
||||
tok = next()
|
||||
while tok != '>>':
|
||||
if not tok.startswith('/'):
|
||||
source.exception('Expected PDF /name object')
|
||||
key = tok
|
||||
value = next()
|
||||
func = specialget(value)
|
||||
if func is not None:
|
||||
value = func(source)
|
||||
tok = next()
|
||||
else:
|
||||
tok = next()
|
||||
if value.isdigit() and tok.isdigit():
|
||||
if next() != 'R':
|
||||
source.exception('Expected "R" following two integers')
|
||||
value = self.findindirect(value, tok)
|
||||
tok = next()
|
||||
result[key] = value
|
||||
return result
|
||||
|
||||
def empty_obj(self, source, PdfObject=PdfObject):
|
||||
''' Some silly git put an empty object in the
|
||||
file. Back up so the caller sees the endobj.
|
||||
'''
|
||||
source.floc = source.tokstart
|
||||
|
||||
def badtoken(self, source):
|
||||
''' Didn't see that coming.
|
||||
'''
|
||||
source.exception('Unexpected delimiter')
|
||||
|
||||
def findstream(self, obj, tok, source, PdfDict=PdfDict, isinstance=isinstance, len=len):
|
||||
''' Figure out if there is a content stream
|
||||
following an object, and return the start
|
||||
pointer to the content stream if so.
|
||||
|
||||
(We can't read it yet, because we might not
|
||||
know how long it is, because Length might
|
||||
be an indirect object.)
|
||||
'''
|
||||
|
||||
isdict = isinstance(obj, PdfDict)
|
||||
if not isdict or tok != 'stream':
|
||||
source.exception("Expected 'endobj'%s token", isdict and " or 'stream'" or '')
|
||||
fdata = source.fdata
|
||||
startstream = source.tokstart + len(tok)
|
||||
gotcr = fdata[startstream] == '\r'
|
||||
startstream += gotcr
|
||||
gotlf = fdata[startstream] == '\n'
|
||||
startstream += gotlf
|
||||
if not gotlf:
|
||||
if not gotcr:
|
||||
source.exception(r'stream keyword not followed by \n')
|
||||
if not self.warned_bad_stream_start:
|
||||
source.warning(r"stream keyword terminated by \r without \n")
|
||||
self.private.warned_bad_stream_start = True
|
||||
return startstream
|
||||
|
||||
def readstream(self, obj, startstream, source,
|
||||
streamending = 'endstream endobj'.split(), int=int):
|
||||
fdata = source.fdata
|
||||
length = int(obj.Length)
|
||||
source.floc = target_endstream = startstream + length
|
||||
endit = source.multiple(2)
|
||||
obj._stream = fdata[startstream:target_endstream]
|
||||
if endit == streamending:
|
||||
return
|
||||
|
||||
# The length attribute does not match the distance between the
|
||||
# stream and endstream keywords.
|
||||
|
||||
do_warn, self.warned_bad_stream_end = self.warned_bad_stream_end, False
|
||||
|
||||
#TODO: Extract maxstream from dictionary of object offsets
|
||||
# and use rfind instead of find.
|
||||
maxstream = len(fdata) - 20
|
||||
endstream = fdata.find('endstream', startstream, maxstream)
|
||||
source.floc = startstream
|
||||
room = endstream - startstream
|
||||
if endstream < 0:
|
||||
source.error('Could not find endstream')
|
||||
return
|
||||
if length == room + 1 and fdata[startstream-2:startstream] == '\r\n':
|
||||
source.warning(r"stream keyword terminated by \r without \n")
|
||||
obj._stream = fdata[startstream-1:target_endstream-1]
|
||||
return
|
||||
source.floc = endstream
|
||||
if length > room:
|
||||
source.error('stream /Length attribute (%d) appears to be too big (size %d) -- adjusting',
|
||||
length, room)
|
||||
obj.stream = fdata[startstream:endstream]
|
||||
return
|
||||
if fdata[target_endstream:endstream].rstrip():
|
||||
source.error('stream /Length attribute (%d) might be smaller than data size (%d)',
|
||||
length, room)
|
||||
return
|
||||
endobj = fdata.find('endobj', endstream, maxstream)
|
||||
if endobj < 0:
|
||||
source.error('Could not find endobj after endstream')
|
||||
return
|
||||
if fdata[endstream:endobj].rstrip() != 'endstream':
|
||||
source.error('Unexpected data between endstream and endobj')
|
||||
return
|
||||
source.error('Illegal endstream/endobj combination')
|
||||
|
||||
def loadindirect(self, key):
|
||||
result = self.indirect_objects.get(key)
|
||||
if not isinstance(result, PdfIndirect):
|
||||
return result
|
||||
source = self.source
|
||||
offset = int(self.source.obj_offsets.get(key, '0'))
|
||||
if not offset:
|
||||
log.warning("Did not find PDF object %s" % (key,))
|
||||
return None
|
||||
|
||||
# Read the object header and validate it
|
||||
objnum, gennum = key
|
||||
source.floc = offset
|
||||
objid = source.multiple(3)
|
||||
ok = len(objid) == 3
|
||||
ok = ok and objid[0].isdigit() and int(objid[0]) == objnum
|
||||
ok = ok and objid[1].isdigit() and int(objid[1]) == gennum
|
||||
ok = ok and objid[2] == 'obj'
|
||||
if not ok:
|
||||
source.floc = offset
|
||||
source.next()
|
||||
objheader = '%d %d obj' % (objnum, gennum)
|
||||
fdata = source.fdata
|
||||
offset2 = fdata.find('\n' + objheader) + 1 or fdata.find('\r' + objheader) + 1
|
||||
if not offset2 or fdata.find(fdata[offset2-1] + objheader, offset2) > 0:
|
||||
source.warning("Expected indirect object '%s'" % objheader)
|
||||
return None
|
||||
source.warning("Indirect object %s found at incorrect offset %d (expected offset %d)" %
|
||||
(objheader, offset2, offset))
|
||||
source.floc = offset2 + len(objheader)
|
||||
|
||||
# Read the object, and call special code if it starts
|
||||
# an array or dictionary
|
||||
obj = source.next()
|
||||
func = self.special.get(obj)
|
||||
if func is not None:
|
||||
obj = func(source)
|
||||
|
||||
self.indirect_objects[key] = obj
|
||||
self.deferred_objects.remove(key)
|
||||
|
||||
# Mark the object as indirect, and
|
||||
# add it to the list of streams if it starts a stream
|
||||
obj.indirect = key
|
||||
tok = source.next()
|
||||
if tok != 'endobj':
|
||||
self.readstream(obj, self.findstream(obj, tok, source), source)
|
||||
return obj
|
||||
|
||||
def findxref(fdata):
|
||||
''' Find the cross reference section at the end of a file
|
||||
'''
|
||||
startloc = fdata.rfind('startxref')
|
||||
if startloc < 0:
|
||||
raise PdfParseError('Did not find "startxref" at end of file')
|
||||
source = PdfTokens(fdata, startloc, False)
|
||||
tok = source.next()
|
||||
assert tok == 'startxref' # (We just checked this...)
|
||||
tableloc = source.next_default()
|
||||
if not tableloc.isdigit():
|
||||
source.exception('Expected table location')
|
||||
if source.next_default().rstrip().lstrip('%') != 'EOF':
|
||||
source.exception('Expected %%EOF')
|
||||
return startloc, PdfTokens(fdata, int(tableloc), True)
|
||||
findxref = staticmethod(findxref)
|
||||
|
||||
def parsexref(self, source, int=int, range=range):
|
||||
''' Parse (one of) the cross-reference file section(s)
|
||||
'''
|
||||
fdata = source.fdata
|
||||
setdefault = source.obj_offsets.setdefault
|
||||
add_offset = source.all_offsets.append
|
||||
next = source.next
|
||||
tok = next()
|
||||
if tok != 'xref':
|
||||
source.exception('Expected "xref" keyword')
|
||||
start = source.floc
|
||||
try:
|
||||
while 1:
|
||||
tok = next()
|
||||
if tok == 'trailer':
|
||||
return
|
||||
startobj = int(tok)
|
||||
for objnum in range(startobj, startobj + int(next())):
|
||||
offset = int(next())
|
||||
generation = int(next())
|
||||
inuse = next()
|
||||
if inuse == 'n':
|
||||
if offset != 0:
|
||||
setdefault((objnum, generation), offset)
|
||||
add_offset(offset)
|
||||
elif inuse != 'f':
|
||||
raise ValueError
|
||||
except:
|
||||
pass
|
||||
try:
|
||||
# Table formatted incorrectly. See if we can figure it out anyway.
|
||||
end = source.fdata.rindex('trailer', start)
|
||||
table = source.fdata[start:end].splitlines()
|
||||
for line in table:
|
||||
tokens = line.split()
|
||||
if len(tokens) == 2:
|
||||
objnum = int(tokens[0])
|
||||
elif len(tokens) == 3:
|
||||
offset, generation, inuse = int(tokens[0]), int(tokens[1]), tokens[2]
|
||||
if offset != 0 and inuse == 'n':
|
||||
setdefault((objnum, generation), offset)
|
||||
add_offset(offset)
|
||||
objnum += 1
|
||||
elif tokens:
|
||||
log.error('Invalid line in xref table: %s' % repr(line))
|
||||
raise ValueError
|
||||
log.warning('Badly formatted xref table')
|
||||
source.floc = end
|
||||
source.next()
|
||||
except:
|
||||
source.floc = start
|
||||
source.exception('Invalid table format')
|
||||
|
||||
def readpages(self, node):
|
||||
pagename=PdfName.Page
|
||||
pagesname=PdfName.Pages
|
||||
catalogname = PdfName.Catalog
|
||||
typename = PdfName.Type
|
||||
kidname = PdfName.Kids
|
||||
|
||||
# PDFs can have arbitrarily nested Pages/Page
|
||||
# dictionary structures.
|
||||
def readnode(node):
|
||||
nodetype = node[typename]
|
||||
if nodetype == pagename:
|
||||
yield node
|
||||
elif nodetype == pagesname:
|
||||
for node in node[kidname]:
|
||||
for node in readnode(node):
|
||||
yield node
|
||||
elif nodetype == catalogname:
|
||||
for node in readnode(node[pagesname]):
|
||||
yield node
|
||||
else:
|
||||
log.error('Expected /Page or /Pages dictionary, got %s' % repr(node))
|
||||
try:
|
||||
return list(readnode(node))
|
||||
except (AttributeError, TypeError), s:
|
||||
log.error('Invalid page tree: %s' % s)
|
||||
return []
|
||||
|
||||
def __init__(self, fname=None, fdata=None, decompress=False, disable_gc=True):
|
||||
|
||||
# Runs a lot faster with GC off.
|
||||
disable_gc = disable_gc and gc.isenabled()
|
||||
try:
|
||||
if disable_gc:
|
||||
gc.disable()
|
||||
if fname is not None:
|
||||
assert fdata is None
|
||||
# Allow reading preexisting streams like pyPdf
|
||||
if hasattr(fname, 'read'):
|
||||
fdata = fname.read()
|
||||
else:
|
||||
try:
|
||||
f = open(fname, 'rb')
|
||||
fdata = f.read()
|
||||
f.close()
|
||||
except IOError:
|
||||
raise PdfParseError('Could not read PDF file %s' % fname)
|
||||
|
||||
assert fdata is not None
|
||||
if not fdata.startswith('%PDF-'):
|
||||
startloc = fdata.find('%PDF-')
|
||||
if startloc >= 0:
|
||||
log.warning('PDF header not at beginning of file')
|
||||
else:
|
||||
lines = fdata.lstrip().splitlines()
|
||||
if not lines:
|
||||
raise PdfParseError('Empty PDF file!')
|
||||
raise PdfParseError('Invalid PDF header: %s' % repr(lines[0]))
|
||||
|
||||
endloc = fdata.rfind('%EOF')
|
||||
if endloc < 0:
|
||||
raise PdfParseError('EOF mark not found: %s' % repr(fdata[-20:]))
|
||||
endloc += 6
|
||||
junk = fdata[endloc:]
|
||||
fdata = fdata[:endloc]
|
||||
if junk.rstrip('\00').strip():
|
||||
log.warning('Extra data at end of file')
|
||||
|
||||
private = self.private
|
||||
private.indirect_objects = {}
|
||||
private.deferred_objects = set()
|
||||
private.special = {'<<': self.readdict,
|
||||
'[': self.readarray,
|
||||
'endobj': self.empty_obj,
|
||||
}
|
||||
for tok in r'\ ( ) < > { } ] >> %'.split():
|
||||
self.special[tok] = self.badtoken
|
||||
|
||||
|
||||
startloc, source = self.findxref(fdata)
|
||||
private.source = source
|
||||
xref_table_list = []
|
||||
source.all_offsets = []
|
||||
while 1:
|
||||
source.obj_offsets = {}
|
||||
# Loop through all the cross-reference tables
|
||||
self.parsexref(source)
|
||||
tok = source.next()
|
||||
if tok != '<<':
|
||||
source.exception('Expected "<<" starting catalog')
|
||||
|
||||
newdict = self.readdict(source)
|
||||
|
||||
token = source.next()
|
||||
if token != 'startxref' and not xref_table_list:
|
||||
source.warning('Expected "startxref" at end of xref table')
|
||||
|
||||
# Loop if any previously-written tables.
|
||||
prev = newdict.Prev
|
||||
if prev is None:
|
||||
break
|
||||
if not xref_table_list:
|
||||
newdict.Prev = None
|
||||
original_indirect = self.indirect_objects.copy()
|
||||
original_newdict = newdict
|
||||
source.floc = int(prev)
|
||||
xref_table_list.append(source.obj_offsets)
|
||||
self.indirect_objects.clear()
|
||||
|
||||
if xref_table_list:
|
||||
for update in reversed(xref_table_list):
|
||||
source.obj_offsets.update(update)
|
||||
self.indirect_objects.clear()
|
||||
self.indirect_objects.update(original_indirect)
|
||||
newdict = original_newdict
|
||||
self.update(newdict)
|
||||
|
||||
#self.read_all_indirect(source)
|
||||
private.pages = self.readpages(self.Root)
|
||||
if decompress:
|
||||
self.uncompress()
|
||||
|
||||
# For compatibility with pyPdf
|
||||
private.numPages = len(self.pages)
|
||||
finally:
|
||||
if disable_gc:
|
||||
gc.enable()
|
||||
|
||||
# For compatibility with pyPdf
|
||||
def getPage(self, pagenum):
|
||||
return self.pages[pagenum]
|
||||
|
||||
def read_all(self):
|
||||
deferred = self.deferred_objects
|
||||
prev = set()
|
||||
while 1:
|
||||
new = deferred - prev
|
||||
if not new:
|
||||
break
|
||||
prev |= deferred
|
||||
for key in new:
|
||||
self.loadindirect(key)
|
||||
|
||||
def uncompress(self):
|
||||
self.read_all()
|
||||
uncompress(self.indirect_objects.itervalues())
|
|
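The recursive walk performed by readpages over nested /Pages and /Page dictionaries can be sketched with plain Python dicts. `iter_pages` and the `/Name` entries are hypothetical and exist only to make the example checkable; the reader itself works on PdfDict objects and logs errors rather than raising.

```python
def iter_pages(node):
    """Yield leaf /Page dictionaries in document order (illustrative only)."""
    kind = node.get('/Type')
    if kind == '/Page':
        yield node
    elif kind == '/Pages':
        # Intermediate nodes may nest to any depth.
        for kid in node.get('/Kids', []):
            for page in iter_pages(kid):
                yield page
    elif kind == '/Catalog':
        for page in iter_pages(node['/Pages']):
            yield page
    else:
        raise ValueError('Expected /Page or /Pages dictionary, got %r' % (node,))


catalog = {
    '/Type': '/Catalog',
    '/Pages': {
        '/Type': '/Pages',
        '/Kids': [
            {'/Type': '/Page', '/Name': 'p1'},
            {'/Type': '/Pages', '/Kids': [
                {'/Type': '/Page', '/Name': 'p2'},
                {'/Type': '/Page', '/Name': 'p3'},
            ]},
        ],
    },
}
names = [page['/Name'] for page in iter_pages(catalog)]
```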
@ -0,0 +1,295 @@
|
|||
#!/usr/bin/env python
|
||||
|
||||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
'''
|
||||
The PdfWriter class writes an entire PDF file out to disk.
|
||||
|
||||
The writing process is not at all optimized or organized.
|
||||
|
||||
An instance of the PdfWriter class has two methods:
|
||||
addpage(page)
|
||||
and
|
||||
write(fname)
|
||||
|
||||
addpage() assumes that the pages are part of a valid
|
||||
tree/forest of PDF objects.
|
||||
'''
|
||||
|
||||
try:
|
||||
set
|
||||
except NameError:
|
||||
from sets import Set as set
|
||||
|
||||
from pdfrw.objects import PdfName, PdfArray, PdfDict, IndirectPdfDict, PdfObject, PdfString
|
||||
from pdfrw.compress import compress as do_compress
|
||||
from pdfrw.errors import PdfOutputError, log
|
||||
|
||||
NullObject = PdfObject('null')
|
||||
NullObject.indirect = True
|
||||
NullObject.Type = 'Null object'
|
||||
|
||||
def FormatObjects(f, trailer, version='1.3', compress=True, killobj=(),
|
||||
id=id, isinstance=isinstance, getattr=getattr,len=len,
|
||||
sum=sum, set=set, str=str, basestring=basestring,
|
||||
hasattr=hasattr, repr=repr, enumerate=enumerate,
|
||||
list=list, dict=dict, tuple=tuple,
|
||||
do_compress=do_compress, PdfArray=PdfArray,
|
||||
PdfDict=PdfDict, PdfObject=PdfObject, encode=PdfString.encode):
|
||||
''' FormatObjects performs the actual formatting and disk write.
|
||||
Should be a class, was a class, turned into nested functions
|
||||
for performance (to reduce attribute lookups).
|
||||
'''
|
||||
|
||||
def add(obj):
|
||||
''' Add an object to our list, if it's an indirect
|
||||
object. Just format it if not.
|
||||
'''
|
||||
# Can't hash dicts, so just hash the object ID
|
||||
objid = id(obj)
|
||||
|
||||
# Automatically set stream objects to indirect
|
||||
if isinstance(obj, PdfDict):
|
||||
indirect = obj.indirect or (obj.stream is not None)
|
||||
else:
|
||||
indirect = getattr(obj, 'indirect', False)
|
||||
|
||||
if not indirect:
|
||||
if objid in visited:
|
||||
log.warning('Replicating direct %s object, should be indirect for optimal file size' % type(obj))
|
||||
obj = type(obj)(obj)
|
||||
objid = id(obj)
|
||||
visiting(objid)
|
||||
result = format_obj(obj)
|
||||
leaving(objid)
|
||||
return result
|
||||
|
||||
objnum = indirect_dict_get(objid)
|
||||
|
||||
# If we haven't seen the object yet, we need to
|
||||
# add it to the indirect object list.
|
||||
if objnum is None:
|
||||
swapped = swapobj(objid)
|
||||
if swapped is not None:
|
||||
old_id = objid
|
||||
obj = swapped
|
||||
objid = id(obj)
|
||||
objnum = indirect_dict_get(objid)
|
||||
if objnum is not None:
|
||||
indirect_dict[old_id] = objnum
|
||||
return '%s 0 R' % objnum
|
||||
objnum = len(objlist) + 1
|
||||
objlist_append(None)
|
||||
indirect_dict[objid] = objnum
|
||||
deferred.append((objnum-1, obj))
|
||||
return '%s 0 R' % objnum
|
||||
|
||||
def format_array(myarray, formatter):
|
||||
# Format array data into semi-readable ASCII
|
||||
if sum([len(x) for x in myarray]) <= 70:
|
||||
return formatter % space_join(myarray)
|
||||
return format_big(myarray, formatter)
|
||||
|
||||
def format_big(myarray, formatter):
|
||||
bigarray = []
|
||||
count = 1000000
|
||||
for x in myarray:
|
||||
lenx = len(x) + 1
|
||||
count += lenx
|
||||
if count > 71:
|
||||
subarray = []
|
||||
bigarray.append(subarray)
|
||||
count = lenx
|
||||
subarray.append(x)
|
||||
return formatter % lf_join([space_join(x) for x in bigarray])
|
||||
|
||||
def format_obj(obj):
|
||||
''' Format PDF object data into semi-readable ASCII.
|
||||
May mutually recurse with add() -- add() will
|
||||
return references for indirect objects, and add
|
||||
the indirect object to the list.
|
||||
'''
|
||||
while 1:
|
||||
if isinstance(obj, (list, dict, tuple)):
|
||||
if isinstance(obj, PdfArray):
|
||||
myarray = [add(x) for x in obj]
|
||||
return format_array(myarray, '[%s]')
|
||||
elif isinstance(obj, PdfDict):
|
||||
if compress and obj.stream:
|
||||
do_compress([obj])
|
||||
myarray = []
|
||||
dictkeys = [str(x) for x in obj.keys()]
|
||||
dictkeys.sort()
|
||||
for key in dictkeys:
|
||||
myarray.append(key)
|
||||
myarray.append(add(obj[key]))
|
||||
result = format_array(myarray, '<<%s>>')
|
||||
stream = obj.stream
|
||||
if stream is not None:
|
||||
result = '%s\nstream\n%s\nendstream' % (result, stream)
|
||||
return result
|
||||
obj = (PdfArray, PdfDict)[isinstance(obj, dict)](obj)
|
||||
continue
|
||||
|
||||
if not hasattr(obj, 'indirect') and isinstance(obj, basestring):
|
||||
return encode(obj)
|
||||
return str(getattr(obj, 'encoded', obj))
|
||||
|
||||
def format_deferred():
|
||||
while deferred:
|
||||
index, obj = deferred.pop()
|
||||
objlist[index] = format_obj(obj)
|
||||
|
||||
|
||||
indirect_dict = {}
|
||||
indirect_dict_get = indirect_dict.get
|
||||
objlist = []
|
||||
objlist_append = objlist.append
|
||||
visited = set()
|
||||
visiting = visited.add
|
||||
leaving = visited.remove
|
||||
space_join = ' '.join
|
||||
lf_join = '\n '.join
|
||||
f_write = f.write
|
||||
|
||||
deferred = []
|
||||
|
||||
# Don't reference old catalog or pages objects -- swap references to new ones.
|
||||
swapobj = {PdfName.Catalog:trailer.Root, PdfName.Pages:trailer.Root.Pages, None:trailer}.get
|
||||
swapobj = [(objid, swapobj(obj.Type)) for objid, obj in killobj.iteritems()]
|
||||
swapobj = dict((objid, obj is None and NullObject or obj) for objid, obj in swapobj).get
|
||||
|
||||
for objid in killobj:
|
||||
assert swapobj(objid) is not None
|
||||
|
||||
# The first format of trailer gets all the information,
|
||||
# but we throw away the actual trailer formatting.
|
||||
format_obj(trailer)
|
||||
# Keep formatting until we're done.
|
||||
# (Used to recurse inside format_obj for this, but
|
||||
# hit system limit.)
|
||||
format_deferred()
|
||||
# Now we know the size, so we update the trailer dict
|
||||
# and get the formatted data.
|
||||
trailer.Size = PdfObject(len(objlist) + 1)
|
||||
trailer = format_obj(trailer)
|
||||
|
||||
# Now we have all the pieces to write out to the file.
|
||||
# Keep careful track of the counts while we do it so
|
||||
# we can correctly build the cross-reference.
|
||||
|
||||
header = '%%PDF-%s\n%%\xe2\xe3\xcf\xd3\n' % version
|
||||
f_write(header)
|
||||
offset = len(header)
|
||||
offsets = [(0, 65535, 'f')]
|
||||
offsets_append = offsets.append
|
||||
|
||||
for i, x in enumerate(objlist):
|
||||
objstr = '%s 0 obj\n%s\nendobj\n' % (i + 1, x)
|
||||
offsets_append((offset, 0, 'n'))
|
||||
offset += len(objstr)
|
||||
f_write(objstr)
|
||||
|
||||
f_write('xref\n0 %s\n' % len(offsets))
|
||||
for x in offsets:
|
||||
f_write('%010d %05d %s\r\n' % x)
|
||||
f_write('trailer\n\n%s\nstartxref\n%s\n%%%%EOF\n' % (trailer, offset))
|
||||
|
||||
class PdfWriter(object):
|
||||
|
||||
_trailer = None
|
||||
|
||||
def __init__(self, version='1.3', compress=False):
|
||||
self.pagearray = PdfArray()
|
||||
self.compress = compress
|
||||
self.version = version
|
||||
self.killobj = {}
|
||||
|
||||
def addpage(self, page):
|
||||
self._trailer = None
|
||||
if page.Type != PdfName.Page:
|
||||
raise PdfOutputError('Bad /Type: Expected %s, found %s'
|
||||
% (PdfName.Page, page.Type))
|
||||
inheritable = page.inheritable # searches for resources
|
||||
self.pagearray.append(
|
||||
IndirectPdfDict(
|
||||
page,
|
||||
Resources = inheritable.Resources,
|
||||
MediaBox = inheritable.MediaBox,
|
||||
CropBox = inheritable.CropBox,
|
||||
Rotate = inheritable.Rotate,
|
||||
)
|
||||
)
|
||||
|
||||
# Add parents in the hierarchy to objects we
|
||||
# don't want to output
|
||||
killobj = self.killobj
|
||||
obj = page.Parent
|
||||
while obj is not None:
|
||||
objid = id(obj)
|
||||
if objid in killobj:
|
||||
break
|
||||
killobj[objid] = obj
|
||||
obj = obj.Parent
|
||||
return self
|
||||
|
||||
addPage = addpage # for compatibility with pyPdf
|
||||
|
||||
def addpages(self, pagelist):
|
||||
for page in pagelist:
|
||||
self.addpage(page)
|
||||
return self
|
||||
|
||||
def _get_trailer(self):
|
||||
trailer = self._trailer
|
||||
if trailer is not None:
|
||||
return trailer
|
||||
|
||||
# Create the basic object structure of the PDF file
|
||||
trailer = PdfDict(
|
||||
Root = IndirectPdfDict(
|
||||
Type = PdfName.Catalog,
|
||||
Pages = IndirectPdfDict(
|
||||
Type = PdfName.Pages,
|
||||
Count = PdfObject(len(self.pagearray)),
|
||||
Kids = self.pagearray
|
||||
)
|
||||
)
|
||||
)
|
||||
# Make all the pages point back to the page dictionary
|
||||
pagedict = trailer.Root.Pages
|
||||
for page in pagedict.Kids:
|
||||
page.Parent = pagedict
|
||||
self._trailer = trailer
|
||||
return trailer
|
||||
|
||||
def _set_trailer(self, trailer):
|
||||
self._trailer = trailer
|
||||
|
||||
trailer = property(_get_trailer, _set_trailer)
|
||||
|
||||
def write(self, fname, trailer=None):
|
||||
trailer = trailer or self.trailer
|
||||
|
||||
# Dump the data. We either have a filename or a preexisting
|
||||
# file object.
|
||||
preexisting = hasattr(fname, 'write')
|
||||
f = preexisting and fname or open(fname, 'wb')
|
||||
FormatObjects(f, trailer, self.version, self.compress, self.killobj)
|
||||
if not preexisting:
|
||||
f.close()
|
||||
|
||||
if __name__ == '__main__':
|
||||
import logging
|
||||
log.setLevel(logging.DEBUG)
|
||||
import pdfreader
|
||||
x = pdfreader.PdfReader('source.pdf')
|
||||
y = PdfWriter()
|
||||
for i, page in enumerate(x.pages):
|
||||
print ' Adding page', i+1, '\r',
|
||||
y.addpage(page)
|
||||
print
|
||||
y.write('result.pdf')
|
||||
print
|
|
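The offset bookkeeping at the end of FormatObjects can be illustrated in isolation. The sketch below is not part of pdfrw: `build_xref` is a hypothetical helper, and it assumes plain ASCII strings so that `len()` equals the byte count on disk. It shows how each cross-reference entry is a fixed 20-byte record and how the running offset ends up as the `startxref` value.

```python
def build_xref(header, bodies):
    """Return (xref_entries, startxref) for objects written right after header.

    Hypothetical helper for illustration; assumes ASCII-only strings.
    """
    offset = len(header)
    # Entry 0 is always the head of the free list, generation 65535.
    entries = ['%010d %05d %s\r\n' % (0, 65535, 'f')]
    for body in bodies:
        # Record where this object starts, then advance past it.
        entries.append('%010d %05d %s\r\n' % (offset, 0, 'n'))
        offset += len(body)
    # After the loop, offset is the position where 'xref' itself begins.
    return entries, offset


header = '%PDF-1.3\n'
bodies = ['%s 0 obj\n%s\nendobj\n' % (1, 'null')]
entries, startxref = build_xref(header, bodies)
```

Every entry is exactly 20 bytes long, which is why the writer above ends each line with `\r\n` rather than a bare `\n`.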
@ -0,0 +1,228 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2012 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
'''
|
||||
A tokenizer for PDF streams.
|
||||
|
||||
In general, documentation used was "PDF reference",
|
||||
sixth edition, for PDF version 1.7, dated November 2006.
|
||||
|
||||
'''
|
||||
|
||||
from __future__ import generators
|
||||
|
||||
import re
|
||||
import itertools
|
||||
from pdfrw.objects import PdfString, PdfObject
|
||||
from pdfrw.errors import log, PdfParseError
|
||||
|
||||
def linepos(fdata, loc):
|
||||
line = fdata.count('\n', 0, loc) + 1
|
||||
line += fdata.count('\r', 0, loc) - fdata.count('\r\n', 0, loc)
|
||||
col = loc - max(fdata.rfind('\n', 0, loc), fdata.rfind('\r', 0, loc))
|
||||
return line, col
|
||||
|
||||
class PdfTokens(object):
|
||||
|
||||
# Table 3.1, page 50 of reference, defines whitespace
|
||||
eol = '\n\r'
|
||||
whitespace = '\x00 \t\f' + eol
|
||||
|
||||
# Text on page 50 defines delimiter characters
|
||||
# Escape the ]
|
||||
delimiters = r'()<>{}[\]/%'
|
||||
|
||||
# "normal" stuff is all but delimiters or whitespace.
|
||||
|
||||
p_normal = r'(?:[^\\%s%s]+|\\[^%s])+' % (whitespace, delimiters, whitespace)
|
||||
|
||||
p_comment = r'\%%[^%s]*' % eol
|
||||
|
||||
# This will get the bulk of literal strings.
|
||||
p_literal_string = r'\((?:[^\\()]+|\\.)*[()]?'
|
||||
|
||||
# This will get more pieces of literal strings
|
||||
# (Don't ask me why, but it hangs without the trailing ?.)
|
||||
p_literal_string_extend = r'(?:[^\\()]+|\\.)*[()]?'
|
||||
|
||||
# A hex string. This one's easy.
|
||||
p_hex_string = r'\<[%s0-9A-Fa-f]*\>' % whitespace
|
||||
|
||||
p_dictdelim = r'\<\<|\>\>'
|
||||
p_name = r'/[^%s%s]*' % (delimiters, whitespace)
|
||||
|
||||
p_catchall = '[^%s]' % whitespace
|
||||
|
||||
pattern = '|'.join([p_normal, p_name, p_hex_string, p_dictdelim, p_literal_string, p_comment, p_catchall])
|
||||
findtok = re.compile('(%s)[%s]*' % (pattern, whitespace), re.DOTALL).finditer
|
||||
findparen = re.compile('(%s)[%s]*' % (p_literal_string_extend, whitespace), re.DOTALL).finditer
|
||||
splitname = re.compile(r'\#([0-9A-Fa-f]{2})').split
|
||||
|
||||
def _cacheobj(cache, obj, constructor):
|
||||
''' This caching relies on the constructors
|
||||
returning something that will compare as
|
||||
equal to the original obj. This works
|
||||
fine with our PDF objects.
|
||||
'''
|
||||
result = cache.get(obj)
|
||||
if result is None:
|
||||
result = constructor(obj)
|
||||
cache[result] = result
|
||||
return result
|
||||
|
||||
def fixname(self, cache, token, constructor, splitname=splitname, join=''.join, cacheobj=_cacheobj):
|
||||
''' Inside name tokens, a '#' character indicates that
|
||||
the next two bytes are hex characters to be used
|
||||
to form the 'real' character.
|
||||
'''
|
||||
substrs = splitname(token)
|
||||
if '#' in join(substrs[::2]):
|
||||
self.warning('Invalid /Name token')
|
||||
return token
|
||||
substrs[1::2] = (chr(int(x, 16)) for x in substrs[1::2])
|
||||
result = cacheobj(cache, join(substrs), constructor)
|
||||
result.encoded = token
|
||||
return result
|
||||
|
||||
def _gettoks(self, startloc, cacheobj=_cacheobj,
|
||||
delimiters=delimiters, findtok=findtok, findparen=findparen,
|
||||
PdfString=PdfString, PdfObject=PdfObject):
|
||||
''' Given a source data string and a location inside it,
|
||||
gettoks generates tokens. Each token is a tuple of the form:
|
||||
<starting file loc>, <ending file loc>, <token string>
|
||||
The ending file loc is past any trailing whitespace.
|
||||
|
||||
The main complication here is the literal strings, which
|
||||
can contain nested parentheses. In order to cope with these
|
||||
we can discard the current iterator and loop back to the
|
||||
top to get a fresh one.
|
||||
|
||||
We could use re.search instead of re.finditer, but that's slower.
|
||||
'''
|
||||
fdata = self.fdata
|
||||
current = self.current = [(startloc, startloc)]
|
||||
namehandler = (cacheobj, self.fixname)
|
||||
cache = {}
|
||||
while 1:
|
||||
for match in findtok(fdata, current[0][1]):
|
||||
current[0] = tokspan = match.span()
|
||||
token = match.group(1)
|
||||
firstch = token[0]
|
||||
if firstch not in delimiters:
|
||||
token = cacheobj(cache, token, PdfObject)
|
||||
elif firstch in '/<(%':
|
||||
if firstch == '/':
|
||||
# PDF Name
|
||||
token = namehandler['#' in token](cache, token, PdfObject)
|
||||
elif firstch == '<':
|
||||
# << dict delim, or < hex string >
|
||||
if token[1:2] != '<':
|
||||
token = cacheobj(cache, token, PdfString)
|
||||
elif firstch == '(':
|
||||
# Literal string
|
||||
# It's probably simple, but maybe not
|
||||
# Nested parentheses are a bear, and if
|
||||
# they are present, we exit the for loop
|
||||
# and get back in with a new starting location.
|
||||
ends = None # For broken strings
|
||||
if fdata[match.end(1)-1] != ')':
|
||||
nest = 2
|
||||
m_start, loc = tokspan
|
||||
for match in findparen(fdata, loc):
|
||||
loc = match.end(1)
|
||||
ending = fdata[loc-1] == ')'
|
||||
nest += 1 - ending * 2
|
||||
if not nest:
|
||||
break
|
||||
if ending and ends is None:
|
||||
ends = loc, match.end(), nest
|
||||
token = fdata[m_start:loc]
|
||||
current[0] = m_start, match.end()
|
||||
if nest:
|
||||
# There is one possible recoverable error seen in
|
||||
# the wild -- some stupid generators don't escape (.
|
||||
# If this happens, just terminate on first unescaped ).
|
||||
# The string won't be quite right, but that's a science
|
||||
# fair project for another time.
|
||||
(self.error, self.exception)[not ends]('Unterminated literal string')
|
||||
loc, ends, nest = ends
|
||||
token = fdata[m_start:loc] + ')' * nest
|
||||
current[0] = m_start, ends
|
||||
token = cacheobj(cache, token, PdfString)
|
||||
elif firstch == '%':
|
||||
# Comment
|
||||
if self.strip_comments:
|
||||
continue
|
||||
else:
|
||||
self.exception('Tokenizer logic incorrect -- should never get here')
|
||||
|
||||
yield token
|
||||
if current[0] is not tokspan:
|
||||
break
|
||||
else:
|
||||
if self.strip_comments:
|
||||
break
|
||||
raise StopIteration
|
||||
|
||||
def __init__(self, fdata, startloc=0, strip_comments=True):
|
||||
self.fdata = fdata
|
||||
self.strip_comments = strip_comments
|
||||
self.iterator = iterator = self._gettoks(startloc)
|
||||
self.next = iterator.next
|
||||
|
||||
def setstart(self, startloc):
|
||||
''' Change the starting location.
|
||||
'''
|
||||
current = self.current
|
||||
if startloc != current[0][1]:
|
||||
current[0] = startloc, startloc
|
||||
|
||||
def floc(self):
|
||||
''' Return the current file position
|
||||
(where the next token will be retrieved)
|
||||
'''
|
||||
return self.current[0][1]
|
||||
floc = property(floc, setstart)
|
||||
|
||||
def tokstart(self):
|
||||
''' Return the file position of the most
|
||||
recently retrieved token.
|
||||
'''
|
||||
return self.current[0][0]
|
||||
tokstart = property(tokstart, setstart)
|
||||
|
||||
def __iter__(self):
|
||||
return self.iterator
|
||||
|
||||
def multiple(self, count, islice=itertools.islice, list=list):
|
||||
''' Retrieve multiple tokens
|
||||
'''
|
||||
return list(islice(self, count))
|
||||
|
||||
def next_default(self, default='nope'):
|
||||
for result in self:
|
||||
return result
|
||||
return default
|
||||
|
||||
def msg(self, msg, *arg):
|
||||
if arg:
|
||||
msg %= arg
|
||||
fdata = self.fdata
|
||||
begin, end = self.current[0]
|
||||
line, col = linepos(fdata, begin)
|
||||
if end > begin:
|
||||
tok = fdata[begin:end].rstrip()
|
||||
if len(tok) > 30:
|
||||
tok = tok[:26] + ' ...'
|
||||
return '%s (line=%d, col=%d, token=%s)' % (msg, line, col, repr(tok))
|
||||
return '%s (line=%d, col=%d)' % (msg, line, col)
|
||||
|
||||
def warning(self, *arg):
|
||||
log.warning(self.msg(*arg))
|
||||
|
||||
def error(self, *arg):
|
||||
log.error(self.msg(*arg))
|
||||
|
||||
def exception(self, *arg):
|
||||
raise PdfParseError(self.msg(*arg))
|
|
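The `#xx` handling in fixname relies on `re.split` with a capturing group, which places the captured hex digits at the odd indexes of the result. The following standalone sketch shows that trick without the caching or the warning path; `decode_name` is a hypothetical name, not a pdfrw function.

```python
import re

# Splitting on a pattern with one capturing group yields
# [text, hex, text, hex, ..., text]: the hex pairs sit at odd indexes.
split_name = re.compile(r'#([0-9A-Fa-f]{2})').split


def decode_name(token):
    """Replace each #xx escape in a PDF name token with its character."""
    parts = split_name(token)
    parts[1::2] = [chr(int(pair, 16)) for pair in parts[1::2]]
    return ''.join(parts)


decoded = decode_name('/A#20B')
```

A token without any `#` comes back unchanged, because the split then returns a single-element list and the odd-index slice is empty.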
@ -0,0 +1,139 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
'''
|
||||
Converts pdfrw objects into reportlab objects.
|
||||
|
||||
Designed for and tested with rl 2.3.
|
||||
|
||||
Knows too much about reportlab internals.
|
||||
What can you do?
|
||||
|
||||
The interface to this module is through the makerl() function.
|
||||
|
||||
Parameters:
|
||||
canv - a reportlab "canvas" (also accepts a "document")
|
||||
pdfobj - a pdfrw PDF object
|
||||
|
||||
Returns:
|
||||
A corresponding reportlab object, or if the
|
||||
object is a PDF Form XObject, the name to
|
||||
use with reportlab for the object.
|
||||
|
||||
Will recursively convert all necessary objects.
|
||||
Be careful when converting a page -- if /Parent is set,
|
||||
will recursively convert all pages!
|
||||
|
||||
Notes:
|
||||
1) Original objects are annotated with a
|
||||
derived_rl_obj attribute which points to the
|
||||
reportlab object. This keeps multiple reportlab
|
||||
objects from being generated for the same pdfobj
|
||||
via repeated calls to makerl. This is great for
|
||||
not putting too many objects into the
|
||||
new PDF, but not so good if you are modifying
|
||||
objects for different pages. Then you
|
||||
need to do your own deep copying (of circular
|
||||
structures). You're on your own.
|
||||
|
||||
2) ReportLab seems weird about FormXObjects.
|
||||
They pass around a partial name instead of the
|
||||
object or a reference to it. So we have to
|
||||
reach into reportlab and get a number for
|
||||
a unique name. I guess this is to make it
|
||||
where you can combine page streams with
|
||||
impunity, but that's just a guess.
|
||||
|
||||
3) Updated 1/23/2010 to handle multipass documents
|
||||
(e.g. with a table of contents). These have
|
||||
a different doc object on every pass.
|
||||
|
||||
'''
|
||||
|
||||
from reportlab.pdfbase import pdfdoc as rldocmodule
|
||||
from pdfrw.objects import PdfDict, PdfArray, PdfName
|
||||
|
||||
RLStream = rldocmodule.PDFStream
|
||||
RLDict = rldocmodule.PDFDictionary
|
||||
RLArray = rldocmodule.PDFArray
|
||||
|
||||
|
||||
def _makedict(rldoc, pdfobj):
|
||||
rlobj = rldict = RLDict()
|
||||
if pdfobj.indirect:
|
||||
rlobj.__RefOnly__ = 1
|
||||
rlobj = rldoc.Reference(rlobj)
|
||||
pdfobj.derived_rl_obj[rldoc] = rlobj, None
|
||||
|
||||
for key, value in pdfobj.iteritems():
|
||||
rldict[key[1:]] = makerl_recurse(rldoc, value)
|
||||
|
||||
return rlobj
|
||||
|
||||
def _makestream(rldoc, pdfobj, xobjtype=PdfName.XObject):
|
||||
rldict = RLDict()
|
||||
rlobj = RLStream(rldict, pdfobj.stream)
|
||||
|
||||
if pdfobj.Type == xobjtype:
|
||||
shortname = 'pdfrw_%s' % (rldoc.objectcounter+1)
|
||||
fullname = rldoc.getXObjectName(shortname)
|
||||
else:
|
||||
shortname = fullname = None
|
||||
result = rldoc.Reference(rlobj, fullname)
|
||||
pdfobj.derived_rl_obj[rldoc] = result, shortname
|
||||
|
||||
for key, value in pdfobj.iteritems():
|
||||
rldict[key[1:]] = makerl_recurse(rldoc, value)
|
||||
|
||||
return result
|
||||
|
||||
def _makearray(rldoc, pdfobj):
|
||||
rlobj = rlarray = RLArray([])
|
||||
if pdfobj.indirect:
|
||||
rlobj.__RefOnly__ = 1
|
||||
rlobj = rldoc.Reference(rlobj)
|
||||
pdfobj.derived_rl_obj[rldoc] = rlobj, None
|
||||
|
||||
mylist = rlarray.sequence
|
||||
for value in pdfobj:
|
||||
mylist.append(makerl_recurse(rldoc, value))
|
||||
|
||||
return rlobj
|
||||
|
||||
def _makestr(rldoc, pdfobj):
|
||||
assert isinstance(pdfobj, (float, int, str)), repr(pdfobj)
|
||||
return pdfobj
|
||||
|
||||
def makerl_recurse(rldoc, pdfobj):
|
||||
docdict = getattr(pdfobj, 'derived_rl_obj', None)
|
||||
if docdict is not None:
|
||||
value = docdict.get(rldoc)
|
||||
if value is not None:
|
||||
return value[0]
|
||||
if isinstance(pdfobj, PdfDict):
|
||||
if pdfobj.stream is not None:
|
||||
func = _makestream
|
||||
else:
|
||||
func = _makedict
|
||||
if docdict is None:
|
||||
pdfobj.private.derived_rl_obj = {}
|
||||
elif isinstance(pdfobj, PdfArray):
|
||||
func = _makearray
|
||||
if docdict is None:
|
||||
pdfobj.derived_rl_obj = {}
|
||||
else:
|
||||
func = _makestr
|
||||
return func(rldoc, pdfobj)
|
||||
|
||||
def makerl(canv, pdfobj):
|
||||
try:
|
||||
rldoc = canv._doc
|
||||
except AttributeError:
|
||||
rldoc = canv
|
||||
rlobj = makerl_recurse(rldoc, pdfobj)
|
||||
try:
|
||||
name = pdfobj.derived_rl_obj[rldoc][1]
|
||||
except AttributeError:
|
||||
name = None
|
||||
return name or rlobj
|
|
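The converter above registers each result in derived_rl_obj before it recurses into the children, which is what lets circular structures such as /Parent and /Kids terminate. A minimal sketch of that pattern, using plain dicts and lists instead of pdfrw or reportlab objects, might look like this. The `convert` function and the external `cache` dict are assumptions made for illustration; the real code stores the cache on the object itself.

```python
def convert(doc, obj, cache):
    """Copy a possibly circular dict/list structure, once per (doc, obj)."""
    key = (id(doc), id(obj))
    if key in cache:
        return cache[key]
    if isinstance(obj, dict):
        result = {}
        cache[key] = result  # register before recursing so cycles terminate
        for name, value in obj.items():
            result[name] = convert(doc, value, cache)
        return result
    if isinstance(obj, list):
        result = []
        cache[key] = result
        for value in obj:
            result.append(convert(doc, value, cache))
        return result
    return obj  # numbers and strings pass through unchanged


doc = object()
page = {'Type': 'Page'}
pages = {'Kids': [page]}
page['Parent'] = pages  # circular reference, as in a real page tree
out = convert(doc, pages, {})
```

Keying the cache by document as well as by object mirrors note 3 in the docstring: a multipass build sees a different doc on every pass and therefore needs a fresh conversion each time.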
@ -0,0 +1,52 @@
|
|||
# A part of pdfrw (pdfrw.googlecode.com)
|
||||
# Copyright (C) 2006-2009 Patrick Maupin, Austin, Texas
|
||||
# MIT license -- See LICENSE.txt for details
|
||||
|
||||
'''
|
||||
Currently, this sad little file only knows how to decompress
|
||||
using the flate (zlib) algorithm. Maybe more later, but it's
|
||||
not a priority for me...
|
||||
'''
|
||||
import zlib
|
||||
from pdfrw.objects import PdfDict, PdfName
|
||||
from pdfrw.errors import log
|
||||
|
||||
def streamobjects(mylist, isinstance=isinstance, PdfDict=PdfDict):
|
||||
for obj in mylist:
|
||||
if isinstance(obj, PdfDict) and obj.stream is not None:
|
||||
yield obj
|
||||
|
||||
def uncompress(mylist, warnings=set(), flate = PdfName.FlateDecode,
|
||||
decompress=zlib.decompressobj, isinstance=isinstance, list=list, len=len):
|
||||
ok = True
|
||||
for obj in streamobjects(mylist):
|
||||
ftype = obj.Filter
|
||||
if ftype is None:
|
||||
continue
|
||||
if isinstance(ftype, list) and len(ftype) == 1:
|
||||
# todo: multiple filters
|
||||
ftype = ftype[0]
|
||||
parms = obj.DecodeParms
|
||||
if ftype != flate or parms is not None:
|
||||
msg = 'Not decompressing: cannot use filter %s with parameters %s' % (repr(ftype), repr(parms))
|
||||
if msg not in warnings:
|
||||
warnings.add(msg)
|
||||
log.warning(msg)
|
||||
ok = False
|
||||
else:
|
||||
dco = decompress()
|
||||
error = None
|
||||
try:
|
||||
data = dco.decompress(obj.stream)
|
||||
except Exception as s:
|
||||
error = str(s)
|
||||
if error is None:
|
||||
assert not dco.unconsumed_tail
|
||||
if dco.unused_data.strip():
|
||||
error = 'Unconsumed compression data: %s' % repr(dco.unused_data[:20])
|
||||
if error is None:
|
||||
obj.Filter = None
|
||||
obj.stream = data
|
||||
else:
|
||||
log.error('%s %s' % (error, repr(obj.indirect)))
|
||||
return ok
|
|
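The error handling in uncompress can be tried on its own with nothing but the standard library. This sketch is a simplified stand-in, not the pdfrw function: `inflate` is a hypothetical name, it works on raw bytes rather than PdfDict streams, and it returns a `(data, error)` pair instead of logging.

```python
import zlib


def inflate(data):
    """Return (decompressed, None) on success or (None, message) on failure."""
    dco = zlib.decompressobj()
    try:
        result = dco.decompress(data)
    except zlib.error as exc:
        return None, str(exc)
    # Trailing whitespace after the compressed data is tolerated;
    # anything else is reported, as in the loop above.
    if dco.unused_data.strip():
        return None, 'Unconsumed compression data: %r' % dco.unused_data[:20]
    return result, None


packed = zlib.compress(b'hello')
good = inflate(packed)
padded = inflate(packed + b'\n')
junk = inflate(packed + b'junk')
broken = inflate(b'not zlib')
```

Using a decompression object rather than `zlib.decompress` is what makes `unused_data` available, so trailing bytes can be inspected instead of raising an error outright.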
@ -0,0 +1,38 @@
|
|||
#!/usr/bin/env python
|
||||
|
||||
from distutils.core import setup
|
||||
|
||||
setup(
|
||||
name='pdfrw',
|
||||
version='0.1',
|
||||
description='PDF file reader/writer library',
|
||||
long_description='''
|
||||
pdfrw lets you read and write PDF files, including
|
||||
compositing multiple pages together (e.g. to do watermarking,
|
||||
or to copy an image or diagram from one PDF to another),
|
||||
and can output by itself, or in conjunction with reportlab.
|
||||
|
||||
pdfrw will faithfully reproduce vector formats without
|
||||
rasterization, so the rst2pdf package has used pdfrw
|
||||
by default for PDF and SVG images since
|
||||
March 2010. Several small examples are provided.
|
||||
''',
|
||||
author='Patrick Maupin',
|
||||
author_email='pmaupin@gmail.com',
|
||||
platforms="Independent",
|
||||
url='http://code.google.com/p/pdfrw/',
|
||||
packages=['pdfrw', 'pdfrw.objects'],
|
||||
license="MIT",
|
||||
classifiers=[
|
||||
'Development Status :: 4 - Beta',
|
||||
'Environment :: Console',
|
||||
'Intended Audience :: Developers',
|
||||
'License :: OSI Approved :: MIT License',
|
||||
'Operating System :: OS Independent',
|
||||
'Programming Language :: Python',
|
||||
'Topic :: Multimedia :: Graphics :: Graphics Conversion',
|
||||
'Topic :: Software Development :: Libraries',
|
||||
'Topic :: Utilities'
|
||||
],
|
||||
keywords='pdf vector graphics',
|
||||
)
|
|
@ -0,0 +1 @@
|
|||
# This file intentionally left blank.
|
|
@ -0,0 +1,37 @@
|
|||
'''
|
||||
Run from the directory above like so:
|
||||
python -m tests.test_pdfstring
|
||||
'''
|
||||
|
||||
|
||||
import pdfrw
|
||||
import unittest
|
||||
|
||||
|
||||
class TestEncoding(unittest.TestCase):
|
||||
|
||||
@staticmethod
|
||||
def decode(value):
|
||||
return pdfrw.pdfobjects.PdfString(value).decode()
|
||||
|
||||
@staticmethod
|
||||
def encode(value):
|
||||
return str(pdfrw.pdfobjects.PdfString.encode(value))
|
||||
|
||||
@classmethod
|
||||
def encode_decode(cls, value):
|
||||
return cls.decode(cls.encode(value))
|
||||
|
||||
def roundtrip(self, value):
|
||||
self.assertEqual(value, self.encode_decode(value))
|
||||
|
||||
def test_doubleslash(self):
|
||||
self.roundtrip('\\')
|
||||
|
||||
|
||||
def main():
|
||||
unittest.main()
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|