debian-python-whoosh/docs/source/releases/2_0.rst

========================
Whoosh 2.x release notes
========================

Whoosh 2.7
==========

* Removed on-disk word graph implementation of spell checking in favor of much
  simpler and faster FSA implementation over the term file.

* Many bug fixes.

* Removed backwards compatibility with indexes created by versions prior to
  2.5. You may need to re-index if you are using an old index that hasn't been
  updated.

* This is the last 2.x release before a major overhaul that will break backwards
  compatibility.


Whoosh 2.5
==========

* Whoosh 2.5 will read existing indexes, but segments created by 2.5 will not
  be readable by older versions of Whoosh.

* As a replacement for field caches to speed up sorting, Whoosh now supports
  adding a ``sortable=True`` keyword argument to fields. This makes Whoosh store
  a sortable representation of the field's values in a "column" format
  (which associates a "key" value with each document). This is more robust,
  efficient, and customizable than the old behavior.
  You should now specify ``sortable=True`` on fields that you plan on using to
  sort or group search results.

  (You can still sort/group on fields that don't have ``sortable=True``,
  however it will use more RAM and be slower as Whoosh caches the field values
  in memory.)

  Fields that use ``sortable=True`` can avoid specifying ``stored=True``. The
  field's value will still be available on ``Hit`` objects (the value will be
  retrieved from the column instead of from the stored fields). This may
  actually be faster for certain types of values.

* Whoosh will now detect common types of OR queries and use optimized read-ahead
  matchers to speed them up by several times.

* Whoosh now includes pure-Python implementations of the Snowball stemmers and
  stop word lists for various languages adapted from NLTK. These are available
  through the :class:`whoosh.analysis.LanguageAnalyzer` analyzer or through the
  ``lang=`` keyword argument to the
  :class:`~whoosh.fields.TEXT` field.

* You can now use the
  :meth:`whoosh.filedb.filestore.Storage.create()` and
  :meth:`whoosh.filedb.filestore.Storage.destory()`
  methods as a consistent API to set up and tear down different types of
  storage.

* Many bug fixes and speed improvements.

* Switched unit tests to use ``py.test`` instead of ``nose``.

* Removed obsolete ``SpellChecker`` class.


Whoosh 2.4
==========

* By default, Whoosh now assembles the individual files of a segment into a
  single file when committing. This has a small performance penalty but solves
  a problem where Whoosh can keep too many files open. Whoosh is also now
  smarter about using mmap.

* Added functionality to index and search hierarchical documents. See
  :doc:`/nested`.

* Rewrote the Directed Acyclic Word Graph implementation (used in spell
  checking) to be faster and more space-efficient. Word graph files created by
  previous versions will be ignored, meaning that spell checking may become
  slower unless/until you replace the old segments (for example, by
  optimizing).

* Rewrote multiprocessing indexing to be faster and simpler. You can now
  do ``myindex.writer(procs=n)`` to get a multiprocessing writer, or
  ``myindex.writer(procs=n, multisegment=True)`` to get a multiprocessing
  writer that leaves behind multiple segments, like the old MultiSegmentWriter.
  (``MultiSegmentWriter`` is still available as a function that returns the
  new class.)

* When creating ``Term`` query objects for special fields (e.g. NUMERIC or
  BOOLEAN), you can now use the field's literal type instead of a string as the
  second argument, for example ``Term("num", 20)`` or ``Term("bool", True)``.
  (This change may cause problems interacting with functions that expect
  query objects to be pure textual, such as spell checking.)

* All writing to and reading from on-disk indexes is now done through "codec"
  objects. This architecture should make it easier to add optional or
  experimental features, and maintain backwards compatibility.

* Fixes issues #75, #137, #206, #213, #215, #219, #223, #226, #230, #233, #238,
  #239, #240, #241, #243, #244, #245, #252, #253, and other bugs. Thanks to
  Thomas Waldmann and Alexei Gousev for the help!


Whoosh 2.3.2
============

* Fixes bug in BM25F scoring function, leading to increased precision in search
  results.

* Fixes issues #203, #205, #206, #208, #209, #212.


Whoosh 2.3.1
============

* Fixes issue #200.


Whoosh 2.3
==========

* Added a :class:`whoosh.query.Regex` term query type, similar to
  :class:`whoosh.query.Wildcard`. The parser does not allow regex term queries
  by default. You need to add the :class:`whoosh.qparser.RegexPlugin` plugin.
  After you add the plugin, you can use ``r"expression"`` query syntax for
  regular expression term queries. For example, ``r"foo.*bar"``.

* Added the :class:`whoosh.qparser.PseudoFieldPlugin` parser plugin. This
  plugin lets you create "pseudo-fields" that run a transform function on
  whatever query syntax the user applies the field to. This is fairly advanced
  functionality right now; I'm trying to think of ways to make its power easier
  to access.

* The documents in the lists in the dictionary returned by ``Results.groups()``
  by default are now in the same relative order as in the results. This makes
  it much easier to display the "top N" results in each category, for example.

* The ``groupids`` keyword argument to ``Searcher.search`` has been removed.
  Instead you can now pass a :class:`whoosh.sorting.FacetMap` object to the
  ``Searcher.search`` method's ``maptype`` argument to control how faceted
  documents are grouped, and/or set the ``maptype`` argument on individual
  :class:`whoosh.sorting.FacetType`` objects to set custom grouping per facet.
  See :doc:`../facets` for more information.

* Calling ``Searcher.documents()`` or ``Searcher.document_numbers()`` with no
  arguments now yields all documents/numbers.

* Calling ``Writer.update_document()`` with no unique fields is now equivalent
  to calling ``Writer.add_document()`` with the same arguments.

* Fixed a problem with keyword expansion where the code was building a cache
  that was fast on small indexes, but unacceptably slow on large indexes.

* Added the hyphen (``-``) to the list of characters that match a "wildcard"
  token, to make parsing slightly more predictable. A true fix will have to
  wait for another parser rewrite.

* Fixed an unused ``__future__`` import and use of ``float("nan")`` which were
  breaking under Python 2.5.

* Fixed a bug where vectored fields with only one term stored an empty term
  vector.

* Various other bug fixes.

Whoosh 2.2
==========

* Fixes several bugs, including a bad bug in BM25F scoring.

* Added ``allow_overlap`` option to :class:`whoosh.sorting.StoredFieldFacet`.

* In :meth:`~whoosh.writing.IndexWriter.add_document`, You can now pass
  query-like strings for BOOLEAN and DATETIME fields (e.g ``boolfield="true"``
  and ``dtfield="20101131-16:01"``) as an alternative to actual ``bool`` or
  ``datetime`` objects. The implementation of this is incomplete: it only works
  in the default ``filedb`` backend, and if the field is stored, the stored
  value will be the string, not the parsed object.

* Added :class:`whoosh.analysis.CompoundWordFilter` and
  :class:`whoosh.analysis.TeeFilter`.


Whoosh 2.1
==========

This release fixes several bugs, and contains speed improvments to highlighting.
See :doc:`/highlight` for more information.


Whoosh 2.0
==========

Improvements
------------

* Whoosh is now compatible with Python 3 (tested with Python 3.2). Special
  thanks to Vinay Sajip who did the work, and also Jordan Sherer who helped
  fix later issues.

* Sorting and grouping (faceting) now use a new system of "facet" objects which
  are much more flexible than the previous field-based system.

  For example, to sort by first name and then score::

      from whoosh import sorting

      mf = sorting.MultiFacet([sorting.FieldFacet("firstname"),
                               sorting.ScoreFacet()])
      results = searcher.search(myquery, sortedby=mf)

  In addition to the previously supported sorting/grouping by field contents
  and/or query results, you can now use numeric ranges, date ranges, score, and
  more. The new faceting system also supports overlapping groups.

  (The old "Sorter" API still works but is deprecated and may be removed in a
  future version.)

  See :doc:`/facets` for more information.

* Completely revamped spell-checking to make it much faster, easier, and more
  flexible. You can enable generation of the graph files use by spell checking
  using the ``spelling=True`` argument to a field type::

      schema = fields.Schema(text=fields.TEXT(spelling=True))

  (Spelling suggestion methods will work on fields without ``spelling=True``
  but will slower.) The spelling graph will be updated automatically as new
  documents are added -- it is no longer necessary to maintain a separate
  "spelling index".

  You can get suggestions for individual words using
  :meth:`whoosh.searching.Searcher.suggest`::

      suglist = searcher.suggest("content", "werd", limit=3)

  Whoosh now includes convenience methods to spell-check and correct user
  queries, with optional highlighting of corrections using the
  ``whoosh.highlight`` module::

      from whoosh import highlight, qparser

      # User query string
      qstring = request.get("q")

      # Parse into query object
      parser = qparser.QueryParser("content", myindex.schema)
      qobject = parser.parse(qstring)

      results = searcher.search(qobject)

      if not results:
        correction = searcher.correct_query(gobject, gstring)
        # correction.query = corrected query object
        # correction.string = corrected query string

        # Format the corrected query string with HTML highlighting
        cstring = correction.format_string(highlight.HtmlFormatter())

  Spelling suggestions can come from field contents and/or lists of words.
  For stemmed fields the spelling suggestions automatically use the unstemmed
  forms of the words.

  There are APIs for spelling suggestions and query correction, so highly
  motivated users could conceivably replace the defaults with more
  sophisticated behaviors (for example, to take context into account).

  See :doc:`/spelling` for more information.

* :class:`whoosh.query.FuzzyTerm` now uses the new word graph feature as well
  and so is much faster.

* You can now set a boost factor for individual documents as you index them,
  to increase the score of terms in those documents in searches. See the
  documentation for the :meth:`~whoosh.writing.IndexWriter.add_document` for
  more information.

* Added built-in recording of which terms matched in which documents. Use the
  ``terms=True`` argument to :meth:`whoosh.searching.Searcher.search` and use
  :meth:`whoosh.searching.Hit.matched_terms` and
  :meth:`whoosh.searching.Hit.contains_term` to check matched terms.

* Whoosh now supports whole-term quality optimizations, so for example if the
  system knows that a UnionMatcher cannot possibly contribute to the "top N"
  results unless both sub-matchers match, it will replace the UnionMatcher with
  an IntersectionMatcher which is faster to compute. The performance improvement
  is not as dramatic as from block quality optimizations, but it can be
  noticeable.

* Fixed a bug that prevented block quality optimizations in queries with words
  not in the index, which could severely degrade performance.

* Block quality optimizations now use the actual scoring algorithm to calculate
  block quality instead of an approximation, which fixes issues where ordering
  of results could be different for searches with and without the optimizations.

* the BOOLEAN field type now supports field boosts.

* Re-architected the query parser to make the code easier to understand. Custom
  parser plugins from previous versions will probably break in Whoosh 2.0.

* Various bug-fixes and performance improvements.

* Removed the "read lock", which caused more problems than it solved. Now when
  opening a reader, if segments are deleted out from under the reader as it
  is opened, the code simply retries.


Compatibility
-------------

* The term quality optimizations required changes to the on-disk formats.
  Whoosh 2.0 if backwards-compatible with the old format. As you rewrite an
  index using Whoosh 2.0, by default it will use the new formats for new
  segments, making the index incompatible with older versions.

  To upgrade an existing index to use the new formats immediately, use
  ``Index.optimize()``.

* Removed the experimental ``TermTrackingCollector`` since it is replaced by
  the new built-in term recording functionality.

* Removed the experimental ``Searcher.define_facets`` feature until a future
  release when it will be replaced by a more robust and useful feature.

* Reader iteration methods (``__iter__``, ``iter_from``, ``iter_field``, etc.)
  now yield :class:`whoosh.reading.TermInfo` objects.

* The arguments to :class:`whoosh.query.FuzzyTerm` changed.