debian-python-whoosh/docs/source/nested.rst

239 lines
9.8 KiB
ReStructuredText

===========================================
Indexing and searching document hierarchies
===========================================
Overview
========
Whoosh's full-text index is essentially a flat database of documents. However,
Whoosh supports two techniques for simulating the indexing and querying of
hierarchical documents, that is, sets of documents that form a parent-child
hierarchy, such as "Chapter - Section - Paragraph" or
"Module - Class - Method".
You can specify parent-child relationships *at indexing time*, by grouping
documents in the same hierarchy, and then use the
:class:`whoosh.query.NestedParent` and/or :class:`whoosh.query.NestedChildren`
to find parents based on their children or vice-versa.
Alternatively, you can use *query time joins*, essentially like external key
joins in a database, where you perform one search to find a relevant document,
then use a stored value on that document (for example, a ``parent`` field) to
look up another document.
Both methods have pros and cons.
Using nested document indexing
==============================
Indexing
--------
This method works by indexing a "parent" document and all its "child" documents
*as a "group"* so they are guaranteed to end up in the same segment. You can
use the context manager returned by ``IndexWriter.group()`` to group
documents::
with ix.writer() as w:
with w.group():
w.add_document(kind="class", name="Index")
w.add_document(kind="method", name="add document")
w.add_document(kind="method", name="add reader")
w.add_document(kind="method", name="close")
with w.group():
w.add_document(kind="class", name="Accumulator")
w.add_document(kind="method", name="add")
w.add_document(kind="method", name="get result")
with w.group():
w.add_document(kind="class", name="Calculator")
w.add_document(kind="method", name="add")
w.add_document(kind="method", name="add all")
w.add_document(kind="method", name="add some")
w.add_document(kind="method", name="multiply")
w.add_document(kind="method", name="close")
with w.group():
w.add_document(kind="class", name="Deleter")
w.add_document(kind="method", name="add")
w.add_document(kind="method", name="delete")
Alternatively you can use the ``start_group()`` and ``end_group()`` methods::
with ix.writer() as w:
w.start_group()
w.add_document(kind="class", name="Index")
w.add_document(kind="method", name="add document")
w.add_document(kind="method", name="add reader")
w.add_document(kind="method", name="close")
w.end_group()
Each level of the hierarchy should have a query that distinguishes it from
other levels (for example, in the above index, you can use ``kind:class`` or
``kind:method`` to match different levels of the hierarchy).
Once you've indexed the hierarchy of documents, you can use two query types to
find parents based on children or vice-versa.
(There is currently no support in the default query parser for nested queries.)
NestedParent query
------------------
The :class:`whoosh.query.NestedParent` query type lets you specify a query for
child documents, but have the query return an "ancestor" document from higher
in the hierarchy::
# First, we need a query that matches all the documents in the "parent"
# level we want of the hierarchy
all_parents = query.Term("kind", "class")
# Then, we need a query that matches the children we want to find
wanted_kids = query.Term("name", "close")
# Now we can make a query that will match documents where "name" is
# "close", but the query will return the "parent" documents of the matching
# children
q = query.NestedParent(all_parents, wanted_kids)
# results = Index, Calculator
Note that in a hierarchy with more than two levels, you can specify a "parents"
query that matches any level of the hierarchy, so you can return the top-level
ancestors of the matching children, or the second level, third level, etc.
The query works by first building a bit vector representing which documents are
"parents"::
Index
| Calculator
| |
1000100100000100
| |
| Deleter
Accumulator
Then for each match of the "child" query, it calculates the previous parent
from the bit vector and returns it as a match (it only returns each parent once
no matter how many children match). This parent lookup is very efficient::
1000100100000100
|
|<-+ close
NestedChildren query
--------------------
The opposite of ``NestedParent`` is :class:`whoosh.query.NestedChildren`. This
query lets you match parents but return their children. This is useful, for
example, to search for an album title and return the songs in the album::
# Query that matches all documents in the "parent" level we want to match
# at
all_parents = query.Term("kind", "album")
# Parent documents we want to match
wanted_parents = query.Term("album_title", "heaven")
# Now we can make a query that will match parent documents where "album_title"
# contains "heaven", but the query will return the "child" documents of the
# matching parents
q1 = query.NestedChildren(all_parents, wanted_parents)
You can then combine that query with an ``AND`` clause, for example to find
songs with "hell" in the song title that occur on albums with "heaven" in the
album title::
q2 = query.And([q1, query.Term("song_title", "hell")])
Deleting and updating hierarchical documents
--------------------------------------------
The drawback of the index-time method is *updating and deleting*. Because the
implementation of the queries depends on the parent and child documents being
contiguous in the segment, you can't update/delete just one child document.
You can only update/delete an entire top-level document at once (for example,
if your hierarchy is "Chapter - Section - Paragraph", you can only update or
delete entire chapters, not a section or paragraph). If the top-level of the
hierarchy represents very large blocks of text, this can involve a lot of
deleting and reindexing.
Currently ``Writer.update_document()`` does not automatically work with nested
documents. You must manually delete and re-add document groups to update them.
To delete nested document groups, use the ``Writer.delete_by_query()``
method with a ``NestedParent`` query::
# Delete the "Accumulator" class
all_parents = query.Term("kind", "class")
to_delete = query.Term("name", "Accumulator")
q = query.NestedParent(all_parents, to_delete)
with myindex.writer() as w:
w.delete_by_query(q)
Using query-time joins
======================
A second technique for simulating hierarchical documents in Whoosh involves
using a stored field on each document to point to its parent, and then using
the value of that field at query time to find parents and children.
For example, if we index a hierarchy of classes and methods using pointers
to parents instead of nesting::
# Store a pointer to the parent on each "method" document
with ix.writer() as w:
w.add_document(kind="class", c_name="Index", docstring="...")
w.add_document(kind="method", m_name="add document", parent="Index")
w.add_document(kind="method", m_name="add reader", parent="Index")
w.add_document(kind="method", m_name="close", parent="Index")
w.add_document(kind="class", c_name="Accumulator", docstring="...")
w.add_document(kind="method", m_name="add", parent="Accumulator")
w.add_document(kind="method", m_name="get result", parent="Accumulator")
w.add_document(kind="class", c_name="Calculator", docstring="...")
w.add_document(kind="method", m_name="add", parent="Calculator")
w.add_document(kind="method", m_name="add all", parent="Calculator")
w.add_document(kind="method", m_name="add some", parent="Calculator")
w.add_document(kind="method", m_name="multiply", parent="Calculator")
w.add_document(kind="method", m_name="close", parent="Calculator")
w.add_document(kind="class", c_name="Deleter", docstring="...")
w.add_document(kind="method", m_name="add", parent="Deleter")
w.add_document(kind="method", m_name="delete", parent="Deleter")
# Now do manual joins at query time
with ix.searcher() as s:
# Tip: Searcher.document() and Searcher.documents() let you look up
# documents by field values more easily than using Searcher.search()
# Children to parents:
# Print the docstrings of classes on which "close" methods occur
for child_doc in s.documents(m_name="close"):
# Use the stored value of the "parent" field to look up the parent
# document
parent_doc = s.document(c_name=child_doc["parent"])
# Print the parent document's stored docstring field
print(parent_doc["docstring"])
# Parents to children:
# Find classes with "big" in the docstring and print their methods
q = query.Term("kind", "class") & query.Term("docstring", "big")
for hit in s.search(q, limit=None):
print("Class name=", hit["c_name"], "methods:")
for child_doc in s.documents(parent=hit["c_name"]):
print(" Method name=", child_doc["m_name"])
This technique is more flexible than index-time nesting in that you can
delete/update individual documents in the hierarchy piece by piece, although it
doesn't support finding different parent levels as easily. It is also slower
than index-time nesting (potentially much slower), since you must perform
additional searches for each found document.
Future versions of Whoosh may include "join" queries to make this process more
efficient (or at least more automatic).