334 lines
13 KiB
ReStructuredText
334 lines
13 KiB
ReStructuredText
========================
|
|
Whoosh 2.x release notes
|
|
========================
|
|
|
|
Whoosh 2.7
|
|
==========
|
|
|
|
* Removed on-disk word graph implementation of spell checking in favor of much
|
|
simpler and faster FSA implementation over the term file.
|
|
|
|
* Many bug fixes.
|
|
|
|
* Removed backwards compatibility with indexes created by versions prior to
|
|
2.5. You may need to re-index if you are using an old index that hasn't been
|
|
updated.
|
|
|
|
* This is the last 2.x release before a major overhaul that will break backwards
|
|
compatibility.
|
|
|
|
|
|
Whoosh 2.5
|
|
==========
|
|
|
|
* Whoosh 2.5 will read existing indexes, but segments created by 2.5 will not
|
|
be readable by older versions of Whoosh.
|
|
|
|
* As a replacement for field caches to speed up sorting, Whoosh now supports
|
|
adding a ``sortable=True`` keyword argument to fields. This makes Whoosh store
|
|
a sortable representation of the field's values in a "column" format
|
|
(which associates a "key" value with each document). This is more robust,
|
|
efficient, and customizable than the old behavior.
|
|
You should now specify ``sortable=True`` on fields that you plan on using to
|
|
sort or group search results.
|
|
|
|
(You can still sort/group on fields that don't have ``sortable=True``,
|
|
however it will use more RAM and be slower as Whoosh caches the field values
|
|
in memory.)
|
|
|
|
Fields that use ``sortable=True`` can avoid specifying ``stored=True``. The
|
|
field's value will still be available on ``Hit`` objects (the value will be
|
|
retrieved from the column instead of from the stored fields). This may
|
|
actually be faster for certain types of values.
|
|
|
|
* Whoosh will now detect common types of OR queries and use optimized read-ahead
|
|
matchers to speed them up by several times.
|
|
|
|
* Whoosh now includes pure-Python implementations of the Snowball stemmers and
|
|
stop word lists for various languages adapted from NLTK. These are available
|
|
through the :class:`whoosh.analysis.LanguageAnalyzer` analyzer or through the
|
|
``lang=`` keyword argument to the
|
|
:class:`~whoosh.fields.TEXT` field.
|
|
|
|
* You can now use the
|
|
:meth:`whoosh.filedb.filestore.Storage.create()` and
|
|
:meth:`whoosh.filedb.filestore.Storage.destory()`
|
|
methods as a consistent API to set up and tear down different types of
|
|
storage.
|
|
|
|
* Many bug fixes and speed improvements.
|
|
|
|
* Switched unit tests to use ``py.test`` instead of ``nose``.
|
|
|
|
* Removed obsolete ``SpellChecker`` class.
|
|
|
|
|
|
Whoosh 2.4
|
|
==========
|
|
|
|
* By default, Whoosh now assembles the individual files of a segment into a
|
|
single file when committing. This has a small performance penalty but solves
|
|
a problem where Whoosh can keep too many files open. Whoosh is also now
|
|
smarter about using mmap.
|
|
|
|
* Added functionality to index and search hierarchical documents. See
|
|
:doc:`/nested`.
|
|
|
|
* Rewrote the Directed Acyclic Word Graph implementation (used in spell
|
|
checking) to be faster and more space-efficient. Word graph files created by
|
|
previous versions will be ignored, meaning that spell checking may become
|
|
slower unless/until you replace the old segments (for example, by
|
|
optimizing).
|
|
|
|
* Rewrote multiprocessing indexing to be faster and simpler. You can now
|
|
do ``myindex.writer(procs=n)`` to get a multiprocessing writer, or
|
|
``myindex.writer(procs=n, multisegment=True)`` to get a multiprocessing
|
|
writer that leaves behind multiple segments, like the old MultiSegmentWriter.
|
|
(``MultiSegmentWriter`` is still available as a function that returns the
|
|
new class.)
|
|
|
|
* When creating ``Term`` query objects for special fields (e.g. NUMERIC or
|
|
BOOLEAN), you can now use the field's literal type instead of a string as the
|
|
second argument, for example ``Term("num", 20)`` or ``Term("bool", True)``.
|
|
(This change may cause problems interacting with functions that expect
|
|
query objects to be pure textual, such as spell checking.)
|
|
|
|
* All writing to and reading from on-disk indexes is now done through "codec"
|
|
objects. This architecture should make it easier to add optional or
|
|
experimental features, and maintain backwards compatibility.
|
|
|
|
* Fixes issues #75, #137, #206, #213, #215, #219, #223, #226, #230, #233, #238,
|
|
#239, #240, #241, #243, #244, #245, #252, #253, and other bugs. Thanks to
|
|
Thomas Waldmann and Alexei Gousev for the help!
|
|
|
|
|
|
Whoosh 2.3.2
|
|
============
|
|
|
|
* Fixes bug in BM25F scoring function, leading to increased precision in search
|
|
results.
|
|
|
|
* Fixes issues #203, #205, #206, #208, #209, #212.
|
|
|
|
|
|
Whoosh 2.3.1
|
|
============
|
|
|
|
* Fixes issue #200.
|
|
|
|
|
|
Whoosh 2.3
|
|
==========
|
|
|
|
* Added a :class:`whoosh.query.Regex` term query type, similar to
|
|
:class:`whoosh.query.Wildcard`. The parser does not allow regex term queries
|
|
by default. You need to add the :class:`whoosh.qparser.RegexPlugin` plugin.
|
|
After you add the plugin, you can use ``r"expression"`` query syntax for
|
|
regular expression term queries. For example, ``r"foo.*bar"``.
|
|
|
|
* Added the :class:`whoosh.qparser.PseudoFieldPlugin` parser plugin. This
|
|
plugin lets you create "pseudo-fields" that run a transform function on
|
|
whatever query syntax the user applies the field to. This is fairly advanced
|
|
functionality right now; I'm trying to think of ways to make its power easier
|
|
to access.
|
|
|
|
* The documents in the lists in the dictionary returned by ``Results.groups()``
|
|
by default are now in the same relative order as in the results. This makes
|
|
it much easier to display the "top N" results in each category, for example.
|
|
|
|
* The ``groupids`` keyword argument to ``Searcher.search`` has been removed.
|
|
Instead you can now pass a :class:`whoosh.sorting.FacetMap` object to the
|
|
``Searcher.search`` method's ``maptype`` argument to control how faceted
|
|
documents are grouped, and/or set the ``maptype`` argument on individual
|
|
:class:`whoosh.sorting.FacetType`` objects to set custom grouping per facet.
|
|
See :doc:`../facets` for more information.
|
|
|
|
* Calling ``Searcher.documents()`` or ``Searcher.document_numbers()`` with no
|
|
arguments now yields all documents/numbers.
|
|
|
|
* Calling ``Writer.update_document()`` with no unique fields is now equivalent
|
|
to calling ``Writer.add_document()`` with the same arguments.
|
|
|
|
* Fixed a problem with keyword expansion where the code was building a cache
|
|
that was fast on small indexes, but unacceptably slow on large indexes.
|
|
|
|
* Added the hyphen (``-``) to the list of characters that match a "wildcard"
|
|
token, to make parsing slightly more predictable. A true fix will have to
|
|
wait for another parser rewrite.
|
|
|
|
* Fixed an unused ``__future__`` import and use of ``float("nan")`` which were
|
|
breaking under Python 2.5.
|
|
|
|
* Fixed a bug where vectored fields with only one term stored an empty term
|
|
vector.
|
|
|
|
* Various other bug fixes.
|
|
|
|
Whoosh 2.2
|
|
==========
|
|
|
|
* Fixes several bugs, including a bad bug in BM25F scoring.
|
|
|
|
* Added ``allow_overlap`` option to :class:`whoosh.sorting.StoredFieldFacet`.
|
|
|
|
* In :meth:`~whoosh.writing.IndexWriter.add_document`, You can now pass
|
|
query-like strings for BOOLEAN and DATETIME fields (e.g ``boolfield="true"``
|
|
and ``dtfield="20101131-16:01"``) as an alternative to actual ``bool`` or
|
|
``datetime`` objects. The implementation of this is incomplete: it only works
|
|
in the default ``filedb`` backend, and if the field is stored, the stored
|
|
value will be the string, not the parsed object.
|
|
|
|
* Added :class:`whoosh.analysis.CompoundWordFilter` and
|
|
:class:`whoosh.analysis.TeeFilter`.
|
|
|
|
|
|
Whoosh 2.1
|
|
==========
|
|
|
|
This release fixes several bugs, and contains speed improvments to highlighting.
|
|
See :doc:`/highlight` for more information.
|
|
|
|
|
|
Whoosh 2.0
|
|
==========
|
|
|
|
Improvements
|
|
------------
|
|
|
|
* Whoosh is now compatible with Python 3 (tested with Python 3.2). Special
|
|
thanks to Vinay Sajip who did the work, and also Jordan Sherer who helped
|
|
fix later issues.
|
|
|
|
* Sorting and grouping (faceting) now use a new system of "facet" objects which
|
|
are much more flexible than the previous field-based system.
|
|
|
|
For example, to sort by first name and then score::
|
|
|
|
from whoosh import sorting
|
|
|
|
mf = sorting.MultiFacet([sorting.FieldFacet("firstname"),
|
|
sorting.ScoreFacet()])
|
|
results = searcher.search(myquery, sortedby=mf)
|
|
|
|
In addition to the previously supported sorting/grouping by field contents
|
|
and/or query results, you can now use numeric ranges, date ranges, score, and
|
|
more. The new faceting system also supports overlapping groups.
|
|
|
|
(The old "Sorter" API still works but is deprecated and may be removed in a
|
|
future version.)
|
|
|
|
See :doc:`/facets` for more information.
|
|
|
|
* Completely revamped spell-checking to make it much faster, easier, and more
|
|
flexible. You can enable generation of the graph files use by spell checking
|
|
using the ``spelling=True`` argument to a field type::
|
|
|
|
schema = fields.Schema(text=fields.TEXT(spelling=True))
|
|
|
|
(Spelling suggestion methods will work on fields without ``spelling=True``
|
|
but will slower.) The spelling graph will be updated automatically as new
|
|
documents are added -- it is no longer necessary to maintain a separate
|
|
"spelling index".
|
|
|
|
You can get suggestions for individual words using
|
|
:meth:`whoosh.searching.Searcher.suggest`::
|
|
|
|
suglist = searcher.suggest("content", "werd", limit=3)
|
|
|
|
Whoosh now includes convenience methods to spell-check and correct user
|
|
queries, with optional highlighting of corrections using the
|
|
``whoosh.highlight`` module::
|
|
|
|
from whoosh import highlight, qparser
|
|
|
|
# User query string
|
|
qstring = request.get("q")
|
|
|
|
# Parse into query object
|
|
parser = qparser.QueryParser("content", myindex.schema)
|
|
qobject = parser.parse(qstring)
|
|
|
|
results = searcher.search(qobject)
|
|
|
|
if not results:
|
|
correction = searcher.correct_query(gobject, gstring)
|
|
# correction.query = corrected query object
|
|
# correction.string = corrected query string
|
|
|
|
# Format the corrected query string with HTML highlighting
|
|
cstring = correction.format_string(highlight.HtmlFormatter())
|
|
|
|
Spelling suggestions can come from field contents and/or lists of words.
|
|
For stemmed fields the spelling suggestions automatically use the unstemmed
|
|
forms of the words.
|
|
|
|
There are APIs for spelling suggestions and query correction, so highly
|
|
motivated users could conceivably replace the defaults with more
|
|
sophisticated behaviors (for example, to take context into account).
|
|
|
|
See :doc:`/spelling` for more information.
|
|
|
|
* :class:`whoosh.query.FuzzyTerm` now uses the new word graph feature as well
|
|
and so is much faster.
|
|
|
|
* You can now set a boost factor for individual documents as you index them,
|
|
to increase the score of terms in those documents in searches. See the
|
|
documentation for the :meth:`~whoosh.writing.IndexWriter.add_document` for
|
|
more information.
|
|
|
|
* Added built-in recording of which terms matched in which documents. Use the
|
|
``terms=True`` argument to :meth:`whoosh.searching.Searcher.search` and use
|
|
:meth:`whoosh.searching.Hit.matched_terms` and
|
|
:meth:`whoosh.searching.Hit.contains_term` to check matched terms.
|
|
|
|
* Whoosh now supports whole-term quality optimizations, so for example if the
|
|
system knows that a UnionMatcher cannot possibly contribute to the "top N"
|
|
results unless both sub-matchers match, it will replace the UnionMatcher with
|
|
an IntersectionMatcher which is faster to compute. The performance improvement
|
|
is not as dramatic as from block quality optimizations, but it can be
|
|
noticeable.
|
|
|
|
* Fixed a bug that prevented block quality optimizations in queries with words
|
|
not in the index, which could severely degrade performance.
|
|
|
|
* Block quality optimizations now use the actual scoring algorithm to calculate
|
|
block quality instead of an approximation, which fixes issues where ordering
|
|
of results could be different for searches with and without the optimizations.
|
|
|
|
* the BOOLEAN field type now supports field boosts.
|
|
|
|
* Re-architected the query parser to make the code easier to understand. Custom
|
|
parser plugins from previous versions will probably break in Whoosh 2.0.
|
|
|
|
* Various bug-fixes and performance improvements.
|
|
|
|
* Removed the "read lock", which caused more problems than it solved. Now when
|
|
opening a reader, if segments are deleted out from under the reader as it
|
|
is opened, the code simply retries.
|
|
|
|
|
|
Compatibility
|
|
-------------
|
|
|
|
* The term quality optimizations required changes to the on-disk formats.
|
|
Whoosh 2.0 if backwards-compatible with the old format. As you rewrite an
|
|
index using Whoosh 2.0, by default it will use the new formats for new
|
|
segments, making the index incompatible with older versions.
|
|
|
|
To upgrade an existing index to use the new formats immediately, use
|
|
``Index.optimize()``.
|
|
|
|
* Removed the experimental ``TermTrackingCollector`` since it is replaced by
|
|
the new built-in term recording functionality.
|
|
|
|
* Removed the experimental ``Searcher.define_facets`` feature until a future
|
|
release when it will be replaced by a more robust and useful feature.
|
|
|
|
* Reader iteration methods (``__iter__``, ``iter_from``, ``iter_field``, etc.)
|
|
now yield :class:`whoosh.reading.TermInfo` objects.
|
|
|
|
* The arguments to :class:`whoosh.query.FuzzyTerm` changed.
|
|
|
|
|
|
|