debian-django-haystack/docs/rich_content_extraction.rst

68 lines
2.6 KiB
ReStructuredText

.. _ref-rich_content_extraction:
=======================
Rich Content Extraction
=======================
For some projects it is desirable to index text content which is stored in
structured files such as PDFs, Microsoft Office documents, images, etc.
Currently only Solr's `ExtractingRequestHandler`_ is directly supported by
Haystack but the approach below could be used with any backend which supports
this feature.
.. _`ExtractingRequestHandler`: http://wiki.apache.org/solr/ExtractingRequestHandler
Extracting Content
==================
:meth:`SearchBackend.extract_file_contents` accepts a file or file-like object
and returns a dictionary containing two keys: ``metadata`` and ``contents``. The
``contents`` value will be a string containing all of the text which the backend
managed to extract from the file contents. ``metadata`` will always be a
dictionary but the keys and values will vary based on the underlying extraction
engine and the type of file provided.
Indexing Extracted Content
==========================
Generally you will want to include the extracted text in your main document
field along with everything else specified in your search template. This example
shows how to override a hypothetical ``FileIndex``'s ``prepare`` method to
include the extract content along with information retrieved from the database::
def prepare(self, obj):
data = super(FileIndex, self).prepare(obj)
# This could also be a regular Python open() call, a StringIO instance
# or the result of opening a URL. Note that due to a library limitation
# file_obj must have a .name attribute even if you need to set one
# manually before calling extract_file_contents:
file_obj = obj.the_file.open()
extracted_data = self.backend.extract_file_contents(file_obj)
# Now we'll finally perform the template processing to render the
# text field with *all* of our metadata visible for templating:
t = loader.select_template(('search/indexes/myapp/file_text.txt', ))
data['text'] = t.render(Context({'object': obj,
'extracted': extracted_data}))
return data
This allows you to insert the extracted text at the appropriate place in your
template, modified or intermixed with database content as appropriate:
.. code-block:: html+django
{{ object.title }}
{{ object.owner.name }}
{% for k, v in extracted.metadata.items %}
{% for val in v %}
{{ k }}: {{ val|safe }}
{% endfor %}
{% endfor %}
{{ extracted.contents|striptags|safe }}