115 lines
3.8 KiB
ReStructuredText
115 lines
3.8 KiB
ReStructuredText
|
===================================
|
||
|
Tips for speeding up batch indexing
|
||
|
===================================
|
||
|
|
||
|
|
||
|
Overview
|
||
|
========
|
||
|
|
||
|
Indexing documents tends to fall into two general patterns: adding documents
|
||
|
one at a time as they are created (as in a web application), and adding a bunch
|
||
|
of documents at once (batch indexing).
|
||
|
|
||
|
The following settings and alternate workflows can make batch indexing faster.
|
||
|
|
||
|
|
||
|
StemmingAnalyzer cache
|
||
|
======================
|
||
|
|
||
|
The stemming analyzer by default uses a least-recently-used (LRU) cache to limit
|
||
|
the amount of memory it uses, to prevent the cache from growing very large if
|
||
|
the analyzer is reused for a long period of time. However, the LRU cache can
|
||
|
slow down indexing by almost 200% compared to a stemming analyzer with an
|
||
|
"unbounded" cache.
|
||
|
|
||
|
When you're indexing in large batches with a one-shot instance of the
|
||
|
analyzer, consider using an unbounded cache::
|
||
|
|
||
|
w = myindex.writer()
|
||
|
# Get the analyzer object from a text field
|
||
|
stem_ana = w.schema["content"].format.analyzer
|
||
|
# Set the cachesize to -1 to indicate unbounded caching
|
||
|
stem_ana.cachesize = -1
|
||
|
# Reset the analyzer to pick up the changed attribute
|
||
|
stem_ana.clear()
|
||
|
|
||
|
# Use the writer to index documents...
|
||
|
|
||
|
|
||
|
The ``limitmb`` parameter
|
||
|
=========================
|
||
|
|
||
|
The ``limitmb`` parameter to :meth:`whoosh.index.Index.writer` controls the
|
||
|
*maximum* memory (in megabytes) the writer will use for the indexing pool. The
|
||
|
higher the number, the faster indexing will be.
|
||
|
|
||
|
The default value of ``128`` is actually somewhat low, considering many people
|
||
|
have multiple gigabytes of RAM these days. Setting it higher can speed up
|
||
|
indexing considerably::
|
||
|
|
||
|
from whoosh import index
|
||
|
|
||
|
ix = index.open_dir("indexdir")
|
||
|
writer = ix.writer(limitmb=256)
|
||
|
|
||
|
.. note::
|
||
|
The actual memory used will be higher than this value because of interpreter
|
||
|
overhead (up to twice as much!). It is very useful as a tuning parameter,
|
||
|
but not for trying to exactly control the memory usage of Whoosh.
|
||
|
|
||
|
|
||
|
The ``procs`` parameter
|
||
|
=======================
|
||
|
|
||
|
The ``procs`` parameter to :meth:`whoosh.index.Index.writer` controls the
|
||
|
number of processors the writer will use for indexing (via the
|
||
|
``multiprocessing`` module)::
|
||
|
|
||
|
from whoosh import index
|
||
|
|
||
|
ix = index.open_dir("indexdir")
|
||
|
writer = ix.writer(procs=4)
|
||
|
|
||
|
Note that when you use multiprocessing, the ``limitmb`` parameter controls the
|
||
|
amount of memory used by *each process*, so the actual memory used will be
|
||
|
``limitmb * procs``::
|
||
|
|
||
|
# Each process will use a limit of 128, for a total of 512
|
||
|
writer = ix.writer(procs=4, limitmb=128)
|
||
|
|
||
|
|
||
|
The ``multisegment`` parameter
|
||
|
==============================
|
||
|
|
||
|
The ``procs`` parameter causes the default writer to use multiple processors to
|
||
|
do much of the indexing, but then still uses a single process to merge the pool
|
||
|
of each sub-writer into a single segment.
|
||
|
|
||
|
You can get much better indexing speed by also using the ``multisegment=True``
|
||
|
keyword argument, which instead of merging the results of each sub-writer,
|
||
|
simply has them each just write out a new segment::
|
||
|
|
||
|
from whoosh import index
|
||
|
|
||
|
ix = index.open_dir("indexdir")
|
||
|
writer = ix.writer(procs=4, multisegment=True)
|
||
|
|
||
|
The drawback is that instead
|
||
|
of creating a single new segment, this option creates a number of new segments
|
||
|
**at least** equal to the number of processes you use.
|
||
|
|
||
|
For example, if you use ``procs=4``, the writer will create four new segments.
|
||
|
(If you merge old segments or call ``add_reader`` on the parent writer, the
|
||
|
parent writer will also write a segment, meaning you'll get five new segments.)
|
||
|
|
||
|
So, while ``multisegment=True`` is much faster than a normal writer, you should
|
||
|
only use it for large batch indexing jobs (or perhaps only for indexing from
|
||
|
scratch). It should not be the only method you use for indexing, because
|
||
|
otherwise the number of segments will tend to increase forever!
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|
||
|
|