Commit Graph

364 Commits

Author SHA1 Message Date
Sylvain Pelissier 7bc62cd896 PDF extraction error handling 2016-01-13 09:16:56 +01:00
Sylvain Pelissier c83cbd87e7 Merge pull request #1 from maphew/master
Image extractor script with sample failing pdf
2016-01-07 08:29:57 +01:00
Matt Wilkie eeb2b659aa Fails with "ValueError: not enough image data"
```
> python pdf-image-extractor.py ..\PDF_Samples\GeoBase_NHNC1_Data_Model_UML_EN.pdf
Traceback (most recent call last):
  File "pdf-image-extractor.py", line 33, in <module>
    img = Image.frombytes(mode, size, data)
  File "C:\Python27\ArcGIS10.3\lib\site-packages\PIL\Image.py", line 2047, in frombytes
    im.frombytes(data, decoder_name, args)
  File "C:\Python27\ArcGIS10.3\lib\site-packages\PIL\Image.py", line 731, in frombytes
    raise ValueError("not enough image data")
ValueError: not enough image data
```

Source:
http://ftp2.cits.rncan.gc.ca/pub/geobase/official/nhn_rhn/doc/

"""
All distributed data are subject to the Open Government Licence – Canada.

Canada grants to the licensee a non-exclusive, fully paid, royalty-free
right and licence to exercise all intellectual property rights in the
data. This includes the right to use, incorporate, sublicense (with
further right of sublicensing), modify, improve, further develop, and
distribute the Data; and to manufacture or distribute derivative
products.

-- http://www.nrcan.gc.ca/earth-sciences/geography/topographic-information/free-data-geogratis/licence/17285
"""
2016-01-06 11:40:14 -08:00
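The `ValueError` in the traceback above is raised when the byte string passed to `Image.frombytes` is shorter than the mode and size imply. Below is a minimal sketch of a pre-check, assuming uncompressed 8-bit samples. The helper names `expected_raw_size` and `has_enough_data` and the bytes-per-pixel table are illustrative; they are not part of the script or of PIL.

```python
# Bytes per pixel for a few common PIL modes with 8-bit samples
# (an illustrative subset, not a complete list).
BYTES_PER_PIXEL = {"L": 1, "RGB": 3, "RGBA": 4, "CMYK": 4}


def expected_raw_size(mode, size):
    """Return how many raw bytes a width x height image in `mode` needs."""
    width, height = size
    if mode == "1":
        # 1-bit images pack 8 pixels per byte; each row is padded to a whole byte.
        return ((width + 7) // 8) * height
    return width * height * BYTES_PER_PIXEL[mode]


def has_enough_data(mode, size, data):
    """True when `data` holds at least as many bytes as the image needs."""
    return len(data) >= expected_raw_size(mode, size)
```

When the check fails, one possible cause is that the stream is still encoded by a filter or uses a colour space the script does not handle. The commit message does not say which applies to this sample.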
Matt Wilkie ba3da42d68 Extract images from PDF without resampling or altering.
Adapted from work by Sylvain Pelissier (@sylvainpelissier)
http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python

Script works but has limited range of image types it is successful with.
Future commits will have sample PDFs and notes about what works/fails.
2016-01-06 11:26:40 -08:00
Sylvain Pelissier 39de327cd9 JPEG 2000 filter added 2015-12-10 10:18:47 +01:00
Sylvain Pelissier 7b591a285d JPEG sample 2015-12-05 11:44:08 +01:00
Sylvain Pelissier 098394a3b3 /DCTDecode stream data 2015-12-05 11:17:12 +01:00
Matthew Stamy 0900101f83 Merge pull request #221 from louib/parameterized_js
Parameterized JavaScript.
2015-09-23 14:31:53 -05:00
Louis-Bertrand Varin ab9395cc5b Adding unit tests for addJS. 2015-08-24 21:50:53 -04:00
Louis-Bertrand Varin 5052688261 Parameterized JavaScript.
According to the PDF doc, an entry in the document's name dictionary
listing the JavaScript actions is required to execute a parameterized
function call.

Also note that:
> The names are arbitrary and need not bear any relation to the
> JavaScript name space.

In this case, the name is _0000000000_.
2015-08-15 00:20:35 -04:00
Matthew Stamy 7456f0acea Stronger equality test for resource values Fixes #182 2015-07-23 16:25:37 -05:00
Matthew Stamy cf269ddfa9 update changelog for patch 2015-07-20 15:11:09 -05:00
Matthew Stamy d0e08b90f5 Conform to semantic versioning. Patch number added 2015-07-20 14:23:51 -05:00
Matthew Stamy fc05b046c0 Smarter inline image parsing 2015-07-15 11:50:44 -05:00
Matthew Stamy 736dc27453 Replace usage of Str with isString 2015-07-09 14:26:39 -05:00
Matthew Stamy e87538baf1 Version 1.25 2015-07-07 16:05:22 -05:00
Matthew Stamy 80551fa094 Merge pull request #172 from jerickbixly/master
Python3 support for ASCII85Decode
2015-07-06 14:09:36 -05:00
Matthew Stamy 9022c7db14 Merge pull request #211 from speedplane/master
Fix "Stream has ended unexpectedly" for Name Objects
2015-06-30 15:40:29 -05:00
speedplane bf7339863e Also, fix up this regex, which appeared to be totally broken for all but the simplest cases. 2015-06-30 09:06:31 -04:00
speedplane 431ba70920 Fix a bug which could result in a "Stream has ended unexpectedly" error being raised unnecessarily if a Name object runs right up against the end of a file stream. 2015-06-30 08:36:55 -04:00
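The end-of-stream case described above can be sketched as a reader that treats EOF as a valid end of a name token rather than as an error. This is an assumption-laden illustration, not the code from the commit: `read_name` and `NAME_TERMINATORS` are hypothetical names.

```python
import io

# Whitespace and delimiter bytes that end a PDF name token.
NAME_TERMINATORS = b"\x00\t\n\x0c\r ()<>[]{}/%"


def read_name(stream):
    """Read a name such as b"/Type" from a binary stream."""
    first = stream.read(1)
    if first != b"/":
        raise ValueError("a name object must start with '/'")
    name = first
    while True:
        byte = stream.read(1)
        if not byte:
            # EOF directly after the name: the token is complete, not truncated.
            break
        if byte in NAME_TERMINATORS:
            # Leave the terminator in the stream for the next reader.
            stream.seek(-1, io.SEEK_CUR)
            break
        name += byte
    return name
```

With this shape, `read_name(io.BytesIO(b"/Type"))` returns the whole name even though nothing follows it in the stream.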
Matthew Stamy ee0ace64b1 Merge pull request #210 from underdogio/dev/copy.encryption.sqwished
Added decryption key copying for PdfFileMerger
2015-06-26 14:42:46 -05:00
Todd Wolfson 541963c54b Added decryption key copying for PdfFileMerger 2015-06-26 14:09:12 -05:00
Matthew Stamy 7ea13fcbea Merge branch 'master' of https://github.com/mstamy2/PyPDF2 2015-06-18 12:50:29 -05:00
Matthew Stamy 8a144a3e2f Provide exception instead of assert false 2015-06-18 12:49:22 -05:00
Matthew Stamy 0b7f9a7d66 Merge pull request #209 from AlmightyOatmeal/patch-1
Add abbreviated short-names for filters
2015-06-18 11:59:16 -05:00
Jamie Ivanov afccc8fc94 Add abbreviated short-names for filters
After investigating an odd error:

# NotImplementedError: unsupported filter /Fl

I saw the PDF had this line:

# <</Filter/Fl/First 12/Length 828/N 2/Type/ObjStm>>stream

But most objects had:

# <</Size 306/Filter/FlateDecode/Length 947/Type/XRef/W[1 3 1]

After looking at the filters.py file, I saw the short names had not been added. After modifying filters.py, the code is up and running again.

A thanks to Matthew Weiss for helping with this as well.
2015-06-18 10:00:23 -05:00
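The idea in the commit above, accepting `/Fl` wherever `/FlateDecode` is accepted, can be sketched as a lookup from abbreviated to full filter names. The abbreviations are the ones the PDF specification defines; the names `FILTER_ABBREVIATIONS` and `normalize_filter_name` are illustrative and do not reflect how filters.py was actually changed.

```python
# Abbreviated filter names from the PDF specification, mapped to full names.
FILTER_ABBREVIATIONS = {
    "/AHx": "/ASCIIHexDecode",
    "/A85": "/ASCII85Decode",
    "/LZW": "/LZWDecode",
    "/Fl": "/FlateDecode",
    "/RL": "/RunLengthDecode",
    "/CCF": "/CCITTFaxDecode",
    "/DCT": "/DCTDecode",
}


def normalize_filter_name(name):
    """Expand an abbreviated filter name; return other names unchanged."""
    return FILTER_ABBREVIATIONS.get(name, name)
```

Normalizing once, before dispatching on the filter name, lets the decoding code compare against the full names only.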
Matthew Stamy 969d6ef94c Merge pull request #122 from mozbugbox/get-page-number
Add method to get page number from Page/Outline objects
2015-06-17 14:42:24 -05:00
Matthew Stamy 6f1c5284df Read extra initial whitespace when reading object from stream resolves #204 2015-06-17 14:15:37 -05:00
Matthew Stamy 33d7f71ac4 Merge pull request #208 from peircej/master
Separate extracted text fields with EOLs
2015-06-17 13:16:03 -05:00
Jon Peirce 8271888434 Separate extracted text fields with EOLs
Each "TJ" entry is a separate piece of text, so provide some way for the user to separate them in the extracted text. This way the user can call txt.split("\n") to get a list of all text blocks.
2015-06-17 17:51:17 +01:00
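A small illustration of the split described above. The string is a made-up stand-in for extracted page text, not real library output:

```python
# Made-up stand-in for extracted page text, one text block per line.
extracted_text = "Invoice 42\nTotal due\n19.99\n"

# Split on the EOL separators and drop the empty trailing entry.
blocks = [block for block in extracted_text.split("\n") if block]
```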
Matthew Stamy 1cdcf7ebee Merge branch 'GuruLabs-roakes/guru_enhancements' 2015-06-16 15:55:21 -05:00
Matthew Stamy ac67ab6251 resolved merge conflict 2015-06-16 15:54:26 -05:00
Matthew Stamy 894b8d1916 Merge branch 'bamrhein-utils_fixes' 2015-06-15 15:21:16 -05:00
Matthew Stamy eb93deb3cd sys.maxint does not exist in Py 3 2015-06-15 15:19:32 -05:00
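A common way to handle the missing `sys.maxint` on both interpreter lines is to fall back to `sys.maxsize`. This is a sketch only; the commit's actual change is not shown here, and `MAX_INT` is a hypothetical name.

```python
import sys

# sys.maxint exists only on Python 2; sys.maxsize exists on Python 2.6+ and 3.
MAX_INT = getattr(sys, "maxint", sys.maxsize)
```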
Matthew Stamy c2af8a0c6c Utilize isString 2015-06-15 15:01:21 -05:00
Matthew Stamy 56a4b9a04f Merge branch 'utils_fixes' of https://github.com/bamrhein/PyPDF2 into bamrhein-utils_fixes 2015-06-15 14:45:13 -05:00
Matthew Stamy 11bb9721b5 Merge pull request #148 from moshekaplan/patch-1
Add support for Embedded Files in the PDF
2015-06-15 14:41:16 -05:00
Matthew Stamy 1a2fc537b0 Merge branch 'master' of https://github.com/mstamy2/PyPDF2 2015-06-11 17:52:58 -05:00
Matthew Stamy 203f5510a0 merging 2015-06-11 17:47:11 -05:00
Matthew Stamy 9c105eb13b Merge branch 'linuxexp-Overflow-Error' 2015-06-11 16:55:44 -05:00
Matthew Stamy 2376a5ddb6 Merge branch 'Overflow-Error' of https://github.com/linuxexp/PyPDF2 into linuxexp-Overflow-Error 2015-06-11 16:51:10 -05:00
Matthew Stamy e3cf7c7207 Merge pull request #202 from vladir/feature/fixing_decodeStreamData_issue
Fix decode stream data issue
2015-06-11 16:40:06 -05:00
Rob Oakes 02de326fc3 Added methods which make it possible to create a copy of a document from a PDF reader instance
- Added a convenience method to merge a dictionary of form field values onto a page
2015-06-04 06:54:34 -04:00
Rob Oakes bca8a754e3 Merge branch 'upstream/merge' 2015-06-04 05:50:39 -04:00
Rob Oakes aa69bc95d7 Started work on a test suite, added a test for loading and decoding a PDF file
- Added resources for the test
2015-06-04 05:49:25 -04:00
Rob Oakes 4abded43ad Added instructions for running test suite 2015-06-04 05:48:05 -04:00
Rob Oakes 0a7b72d135 Merge branch 'gurulabs2' of http://dev.oak-tree.us/publishing/pypdf2 into roakes/gurulabs
Conflicts:
	.gitignore
	PyPDF2/generic.py
2015-06-04 01:18:57 -04:00
Vladir Parrado Cruz a87a394e05 If there is no data to decode, we should not try to decode it. 2015-05-31 15:09:19 +02:00
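The guard described above can be sketched with the standard `zlib` module, which raises an error when asked to decompress an empty byte string. The function name `decode_flate` is hypothetical and this is not the code from the commit:

```python
import zlib


def decode_flate(data):
    """Decompress Flate-encoded bytes, passing empty input through untouched."""
    if not data:
        # Nothing to decode; zlib.decompress(b"") would raise zlib.error.
        return data
    return zlib.decompress(data)
```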
Matthew Stamy 646fd168cf Merge pull request #197 from elena/master
Fix "file has not been decrypted" error #51.
2015-05-26 14:06:49 -05:00
Elena Williams 15bd71bd1f Fix "file has not been decrypted" error #51.
Workaround for PDFs which behave as if decrypted, though they were
encrypted without a password.
2015-05-01 06:45:10 +08:00