```
> python pdf-image-extractor.py ..\PDF_Samples\GeoBase_NHNC1_Data_Model_UML_EN.pdf
Traceback (most recent call last):
  File "pdf-image-extractor.py", line 33, in <module>
    img = Image.frombytes(mode, size, data)
  File "C:\Python27\ArcGIS10.3\lib\site-packages\PIL\Image.py", line 2047, in frombytes
    im.frombytes(data, decoder_name, args)
  File "C:\Python27\ArcGIS10.3\lib\site-packages\PIL\Image.py", line 731, in frombytes
    raise ValueError("not enough image data")
ValueError: not enough image data
```
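A rough way to diagnose this before calling `Image.frombytes` is to compare the stream length against the number of raw bytes the given mode and size would need. The sketch below is only an assumption about the likely cause (stream data that is still encoded, or a mode that does not match the image's colour space); the helper names `expected_raw_length` and `has_enough_data` are made up for illustration and are not part of PIL.

```python
# Bytes per pixel for a few common PIL modes when data is raw, uncompressed pixels.
BYTES_PER_PIXEL = {"L": 1, "P": 1, "RGB": 3, "RGBA": 4, "CMYK": 4}


def expected_raw_length(mode, size):
    """Return how many raw bytes an image of this mode and size needs."""
    width, height = size
    if mode == "1":
        # 1-bit images pack 8 pixels per byte, each row padded to a whole byte.
        return ((width + 7) // 8) * height
    return BYTES_PER_PIXEL[mode] * width * height


def has_enough_data(mode, size, data):
    """True if `data` is long enough to be raw pixels for this mode and size."""
    return len(data) >= expected_raw_length(mode, size)


if __name__ == "__main__":
    # A 2x2 RGB image needs 12 bytes; 5 bytes would be rejected up front.
    print(has_enough_data("RGB", (2, 2), b"\x00" * 5))
```

If the check fails, the stream is probably not raw pixel data for that mode, and decoding it some other way is likely the better route than `frombytes`.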
Source:
http://ftp2.cits.rncan.gc.ca/pub/geobase/official/nhn_rhn/doc/
"""
All distributed data are subject to the Open Government Licence – Canada.
Canada grants to the licensee a non-exclusive, fully paid, royalty-free
right and licence to exercise all intellectual property rights in the
data. This includes the right to use, incorporate, sublicense (with
further right of sublicensing), modify, improve, further develop, and
distribute the Data; and to manufacture or distribute derivative
products.
-- http://www.nrcan.gc.ca/earth-sciences/geography/topographic-information/free-data-geogratis/licence/17285
"""
According to the PDF documentation, executing a parameterized function
call requires an entry in the document's name dictionary that lists
the JavaScript actions.
Also note that:
> The names are arbitrary and need not bear any relation to the
> JavaScript name space.
In this case, the name is _0000000000_.
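As a minimal sketch of what that entry looks like in PDF syntax, the snippet below renders a JavaScript action and the `/Names` dictionary that points at it. It only builds strings and does not use any PDF library. The helper names `javascript_action` and `names_dictionary`, the object number 5, and the sample script are all assumptions for illustration, and the name is assumed to contain no parentheses.

```python
def javascript_action(js_code):
    """Return PDF syntax for a JavaScript action dictionary."""
    # Backslashes and parentheses are escaped inside a PDF literal string.
    escaped = (
        js_code.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    )
    return "<< /Type /Action /S /JavaScript /JS (%s) >>" % escaped


def names_dictionary(entries):
    """Return PDF syntax for a /Names dictionary with a JavaScript name tree.

    `entries` is a list of (name, object_number) pairs. Name tree keys must be
    sorted, so the pairs are sorted by name before being written out.
    """
    pairs = " ".join("(%s) %d 0 R" % (name, num) for name, num in sorted(entries))
    return "<< /JavaScript << /Names [ %s ] >> >>" % pairs


if __name__ == "__main__":
    print(javascript_action("app.alert('hello');"))
    # The name is arbitrary; object 5 is a hypothetical indirect object number.
    print(names_dictionary([("0000000000", 5)]))
```

The catalog's `/Names` entry would then reference this dictionary, and the action object itself would be stored as the indirect object the name points to.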
After investigating an odd error:
# NotImplementedError: unsupported filter /Fl
I saw the PDF had this line:
# <</Filter/Fl/First 12/Length 828/N 2/Type/ObjStm>>stream
But most objects had:
# <</Size 306/Filter/FlateDecode/Length 947/Type/XRef/W[1 3 1]
After looking at filters.py, I saw that the abbreviated filter names (such as /Fl for /FlateDecode) had not been added. After adding them to filters.py, the code is up and running again.
Thanks to Matthew Weiss for helping with this as well.
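The short-name fix can be sketched roughly as follows: map each abbreviated filter name to its full name before dispatching to a decoder. This is not the actual filters.py code; `FILTER_ABBREVIATIONS`, `normalize_filter_name`, and `decode_stream` are names made up for this illustration, and only /FlateDecode is decoded here.

```python
import zlib

# Abbreviated filter names defined by the PDF specification and their full names.
FILTER_ABBREVIATIONS = {
    "/AHx": "/ASCIIHexDecode",
    "/A85": "/ASCII85Decode",
    "/LZW": "/LZWDecode",
    "/Fl": "/FlateDecode",
    "/RL": "/RunLengthDecode",
    "/CCF": "/CCITTFaxDecode",
    "/DCT": "/DCTDecode",
}


def normalize_filter_name(name):
    """Return the full filter name for an abbreviation, or the name unchanged."""
    return FILTER_ABBREVIATIONS.get(name, name)


def decode_stream(data, filter_name):
    """Decode stream bytes; this sketch only supports /FlateDecode."""
    full_name = normalize_filter_name(filter_name)
    if full_name == "/FlateDecode":
        return zlib.decompress(data)
    raise NotImplementedError("unsupported filter %s" % filter_name)


if __name__ == "__main__":
    compressed = zlib.compress(b"object stream contents")
    # Both spellings decode the same way once the name is normalized.
    print(decode_stream(compressed, "/Fl") == decode_stream(compressed, "/FlateDecode"))
```

Normalizing in one place keeps the decoders themselves unaware of the abbreviations, so a stream marked /Fl takes the same path as one marked /FlateDecode.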
Each "TJ" entry is a separate piece of text, so provide some way for the user to separate them in the extracted text. This way the user can call txt.split("\n") to get a list of all text blocks.
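A minimal sketch of the idea, assuming the content stream has already been parsed into `(operands, operator)` pairs and that text strings arrive as plain Python strings. The function name `extract_text` and the sample operations are hypothetical; a real extractor would handle more operators and string encodings.

```python
def extract_text(operations):
    """Join the text of each TJ entry with newlines.

    `operations` is a list of (operands, operator) pairs. For "TJ" the first
    operand is an array mixing strings with numeric kerning adjustments; only
    the strings are kept.
    """
    blocks = []
    for operands, operator in operations:
        if operator == "TJ":
            pieces = [item for item in operands[0] if isinstance(item, str)]
            blocks.append("".join(pieces))
    return "\n".join(blocks)


if __name__ == "__main__":
    ops = [
        ([["Hel", -20, "lo"]], "TJ"),
        ([1, 0, 0, 1, 72, 700], "Tm"),
        ([["World"]], "TJ"),
    ]
    txt = extract_text(ops)
    print(txt.split("\n"))
```

Joining with "\n" rather than appending it after every entry avoids a trailing empty string when the caller splits the result.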