debian-django-haystack/docs/glossary.rst

77 lines
3.2 KiB
ReStructuredText

.. _ref-glossary:
========
Glossary
========
Search is a domain full of its own jargon and definitions. As this may be an
unfamiliar territory to many developers, what follows are some commonly used
terms and what they mean.
Engine
An engine, for the purposes of Haystack, is a third-party search solution.
It might be a full service (i.e. Solr_) or a library to build an
engine with (i.e. Whoosh_)
.. _Solr: http://lucene.apache.org/solr/
.. _Whoosh: https://bitbucket.org/mchaput/whoosh/
Index
The datastore used by the engine is called an index. Its structure can vary
wildly between engines but commonly they resemble a document store. This is
the source of all information in Haystack.
Document
A document is essentially a record within the index. It usually contains at
least one blob of text that serves as the primary content the engine searches
and may have additional data hung off it.
Corpus
A term for a collection of documents. When talking about the documents stored
by the engine (rather than the technical implementation of the storage), this
term is commonly used.
Field
Within the index, each document may store extra data with the main content as
a field. Also sometimes called an attribute, this usually represents metadata
or extra content about the document. Haystack can use these fields for
filtering and display.
Term
A term is generally a single word (or word-like) string of characters used
in a search query.
Stemming
A means of determining if a word has any root words. This varies by language,
but in English, this generally consists of removing plurals, an action form of
the word, et cetera. For instance, in English, 'giraffes' would stem to
'giraffe'. Similarly, 'exclamation' would stem to 'exclaim'. This is useful
for finding variants of the word that may appear in other documents.
Boost
Boost provides a means to take a term or phrase from a search query and alter
the relevance of a result based on if that term is found in the result, a form
of weighting. For instance, if you wanted to more heavily weight results that
included the word 'zebra', you'd specify a boost for that term within the
query.
More Like This
Incorporating techniques from information retrieval and artificial
intelligence, More Like This is a technique for finding other documents within
the index that closely resemble the document in question. This is useful for
programmatically generating a list of similar content for a user to browse
based on the current document they are viewing.
Faceting
Faceting is a way to provide insight to the user into the contents of your
corpus. In its simplest form, it is a set of document counts returned with
results when performing a query. These counts can be used as feedback for
the user, allowing the user to choose interesting aspects of their search
results and "drill down" into those results.
An example might be providing a facet on an ``author`` field, providing back a
list of authors and the number of documents in the index they wrote. This
could be presented to the user with a link, allowing the user to click and
narrow their original search to all results by that author.