68 lines
2.6 KiB
ReStructuredText
68 lines
2.6 KiB
ReStructuredText
.. _ref-rich_content_extraction:
|
|
|
|
=======================
|
|
Rich Content Extraction
|
|
=======================
|
|
|
|
For some projects it is desirable to index text content which is stored in
|
|
structured files such as PDFs, Microsoft Office documents, images, etc.
|
|
Currently only Solr's `ExtractingRequestHandler`_ is directly supported by
|
|
Haystack but the approach below could be used with any backend which supports
|
|
this feature.
|
|
|
|
.. _`ExtractingRequestHandler`: http://wiki.apache.org/solr/ExtractingRequestHandler
|
|
|
|
Extracting Content
|
|
==================
|
|
|
|
:meth:`SearchBackend.extract_file_contents` accepts a file or file-like object
|
|
and returns a dictionary containing two keys: ``metadata`` and ``contents``. The
|
|
``contents`` value will be a string containing all of the text which the backend
|
|
managed to extract from the file contents. ``metadata`` will always be a
|
|
dictionary but the keys and values will vary based on the underlying extraction
|
|
engine and the type of file provided.
|
|
|
|
Indexing Extracted Content
|
|
==========================
|
|
|
|
Generally you will want to include the extracted text in your main document
|
|
field along with everything else specified in your search template. This example
|
|
shows how to override a hypothetical ``FileIndex``'s ``prepare`` method to
|
|
include the extract content along with information retrieved from the database::
|
|
|
|
def prepare(self, obj):
|
|
data = super(FileIndex, self).prepare(obj)
|
|
|
|
# This could also be a regular Python open() call, a StringIO instance
|
|
# or the result of opening a URL. Note that due to a library limitation
|
|
# file_obj must have a .name attribute even if you need to set one
|
|
# manually before calling extract_file_contents:
|
|
file_obj = obj.the_file.open()
|
|
|
|
extracted_data = self.backend.extract_file_contents(file_obj)
|
|
|
|
# Now we'll finally perform the template processing to render the
|
|
# text field with *all* of our metadata visible for templating:
|
|
t = loader.select_template(('search/indexes/myapp/file_text.txt', ))
|
|
data['text'] = t.render(Context({'object': obj,
|
|
'extracted': extracted_data}))
|
|
|
|
return data
|
|
|
|
This allows you to insert the extracted text at the appropriate place in your
|
|
template, modified or intermixed with database content as appropriate:
|
|
|
|
.. code-block:: html+django
|
|
|
|
{{ object.title }}
|
|
{{ object.owner.name }}
|
|
|
|
…
|
|
|
|
{% for k, v in extracted.metadata.items %}
|
|
{% for val in v %}
|
|
{{ k }}: {{ val|safe }}
|
|
{% endfor %}
|
|
{% endfor %}
|
|
|
|
{{ extracted.contents|striptags|safe }} |