Introduction
============

.. image:: https://secure.travis-ci.org/collective/collective.solr.png

collective.solr integrates the `Solr`_ search engine with `Plone`_.

Apache Solr is based on Lucene and is *the* enterprise open source search
engine. It powers the search of sites like Twitter, the Apple and iTunes Stores,
Wikipedia, Netflix and many more.

Solr not only scales to any level of content, but also provides rich search
functionality, like faceting, geospatial search, suggestions, spelling
corrections, indexing of binary formats and a whole variety of powerful tools to
configure custom search solutions. It has integrated clustering and
load-balancing to provide a high level of robustness.

collective.solr comes with a default configuration and setup of Solr that makes
it extremely easy to get started, yet provides a vastly superior search quality
compared to Plone's integrated text search based on ZCTextIndex.

Current Status
==============

The code is used in production in many sites and considered stable. This
add-on can be installed in a `Plone`_ 4.1 (or later) site to enable indexing
operations as well as searching (site and live search) using `Solr`_. Doing so
will not only significantly improve search quality and performance,
especially for a large number of indexed objects, but also reduce the memory
footprint of your `Plone`_ instance by allowing you to remove the
``SearchableText``, ``Description`` and ``Title`` indexes from the catalog. In
large sites with 100,000 content objects and more, searches using ZCTextIndex
often take 10 seconds or more and require a good deal of memory from ZODB
caches. Solr will typically answer these requests in 10ms to 50ms, at which
point network latency and the rendering speed of Plone's page templates are a
more dominant factor.

Installation
============

The following buildout configuration may be used to get started quickly::

  [buildout]
  extends =
    buildout.cfg
    https://github.com/Jarn/collective.solr/raw/master/buildout/solr.cfg

  [instance]
  eggs += collective.solr

After saving this to, say, ``solr.cfg``, the buildout can be run and the
`Solr`_ server and `Plone`_ instance started::

  $ python bootstrap.py
  $ bin/buildout -c solr.cfg
  ...
  $ bin/solr-instance start
  $ bin/instance start

Next you should activate the ``collective.solr (site search)`` add-on in the
add-on control panel of Plone. After activation you should review the settings
in the new ``Solr Settings`` control panel. To index all your content in Solr
you can call the provided maintenance view::

  http://localhost:8080/plone/@@solr-maintenance/reindex

Creating the initial index can take a considerable amount of time. A typical
indexing rate for a Plone site running off a local disk is 20 index operations
per second. While Solr scales to orders of magnitude more than that, the
limiting factor is database access time in Plone.

If you have an existing site with a large volume of content, you can create an
initial Solr index on a staging server or development machine, then rsync it
over to the live machine, enable Solr and call `@@solr-maintenance/sync`. The
sync will usually take just a couple of minutes to catch up with changes in
the live database. You can also use this approach when making changes to the
index structure or changing the settings of existing fields.

Note that the example solr.cfg is bound to change. Always copy the file to your
local buildout. In general you should never rely on extending buildout config
files from servers that aren't under your control.

Features
========

Once installed and configured, this add-on introduces a number of end-user
features.

Supported scripts and languages
-------------------------------

In the default configuration all languages and scripts should be supported.
This broad support comes at the expense of avoiding any language-specific
configuration.

The default text analysis uses libraries based on ICU standards to fold and
normalize any text as well as find token boundaries, which in most languages
are word boundaries.

Accented characters are folded into their unaccented base form and many other
characters are normalized. This normalization is similar to what Plone does when
generating URL identifiers from titles. These changes are applied both to the
indexed text and the user provided search query, so in general there's a large
number of matches at the expense of specificity.

Non-alphabetic characters like hyphens, dots and colons are interpreted as word
boundaries, while case changes and alphanumeric combinations are left intact;
for example `WiFi` or `IPv4` will only be lower-cased but not split.

For any specific site, you likely know the supported content languages and could
further tune the text analysis. A common example is the use of stemming, to
generate base words for terms. This helps to avoid distinctions between singular
and plural forms of a word or it being used as an adjective. Stemming broadens
the found results even more, at a greater expense of specificity, and needs to
be used carefully.

There's a plethora of text analysis options available in Solr if you are
interested in the subject or have specific needs.

Exclude from search and elevation
---------------------------------

By default this add-on introduces two new fields to the default content types
or any custom type derived from ATContentTypes.

The `showinsearch` boolean field lets you hide specific content items from the
search results, by setting the value to `false`.

The `searchwords` lines field allows you to specify multiple phrases per content
item. A phrase is specified per line. User searches containing any of these
phrases will show the content item as the first result for the search. This
technique is also known as `elevation`.

Both of these features depend on the default `search-pattern` to include the
required parts as included in the default configuration. The `searchwords`
approach to elevation doesn't depend on the Solr elevation feature, as that
would require maintaining an XML file as part of the Solr server configuration.

Facets
------

Plone's default search form is overridden to provide faceting support. The
available facets can be configured in the control panel. The provided search
form is currently more of an example and not used in many real-world projects.
You likely want to override it with a custom implementation for your specific
site.

Starting with Plone 4.2, Plone will contain a modernized search form whose UI
supports faceting more naturally. At some point `c.solr` will extend this new
search form rather than providing its own.

Indexing binary documents
-------------------------

At this point collective.solr uses Plone's default capabilities to index binary
documents via `portal_transforms` and installing command line tools like `wv2`
or `pdftotext`. Work is under way to expose and use the `Apache Tika`_ Solr
integration available via the `update/extract` handler.

Once finished, this will speed up indexing of binary documents considerably, as
the extraction will happen out-of-process on the Solr server side. Apache Tika
also supports a much larger list of formats than can be supported by adding
external command line tools.

There is room for more improvements in this area, as c.solr will still send the
binary data to Solr as part of the end-user request/transaction. To further
optimize this, Solr index operations can be stored in a task queue as provided
by `plone.app.async` or solutions built on top of `Celery`. This is currently
outside the scope of `collective.solr`.

.. _`Apache Tika`: http://tika.apache.org/

Spelling checking / suggestions
-------------------------------

Solr supports spell checking - or rather suggestions, as it doesn't contain a
formal dictionary but bases suggestions on the indexed corpus. The idea is to
present the user with alternative search terms for any query that are likely to
produce more or better results.

Currently this is not yet exposed in the `collective.solr` APIs, even though
the Solr server as set up by the buildout recipe already contains the required
configuration for this.

Wildcard searches
-----------------

Wildcard search support in Solr is rather poor. Unfortunately Plone's live
search uses this by default, so we have to support it. When doing wildcard
searches, Solr ignores all of the tokenizer and analyzer settings of the field
at query time. This often leads to a mismatch between the indexed data, as
modified by those settings, and the query term. In order to work around this,
we try to reproduce the essential parts of these analyzers on the
`collective.solr` side. The most common changes are lower-casing characters and
folding non-ascii characters to ascii as done by the `ICUFoldingFilterFactory`.
Currently these two changes are hard-wired and applied to all fields of type
`solr.TextField`. If you have different field settings you might need to
override `collective.solr.utils.prepare_wildcard`.

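The two hard-wired changes can be approximated in plain Python as follows. This
is a minimal sketch, not the actual `prepare_wildcard` code: the helper name
`fold_wildcard_term` is made up for illustration, and dropping combining marks
after Unicode NFKD decomposition only approximates what
`ICUFoldingFilterFactory` does (characters without a decomposition, such as
`ø`, are left untouched here).

```python
import unicodedata


def fold_wildcard_term(term):
    """Lower-case a term and strip accents, leaving wildcards untouched.

    Hypothetical helper for illustration; not part of collective.solr.
    """
    # NFKD splits an accented character into its base character followed
    # by combining marks; the marks are then dropped.
    decomposed = unicodedata.normalize("NFKD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()


print(fold_wildcard_term("Café*"))   # cafe*
print(fold_wildcard_term("ÜBER?"))   # uber?
```

The wildcard characters `*` and `?` pass through unchanged, so the folded term
can still be used in a wildcard query against the folded index data.
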

Architecture
============

When working with Solr it's good to keep some things about it in mind. This
information is targeted at developers and integrators trying to use and extend
Solr in their Plone projects.

Dependencies
------------

Currently we depend on `collective.indexing` as a means to hook into the normal
catalog machinery of Plone to detect content changes. `c.indexing` before
version two had some persistent data structures that frequently caused problems
when removing the add-on. These problems have been fixed in version two.
Unfortunately `c.indexing` still has to hook the catalog machinery in various
evil ways, as the machinery lacks the required hooks for its use-case. Going
forward, `c.indexing` is expected to be merged into the underlying
`ZCatalog` implementation, at which point `collective.solr` can use those hooks
directly.

Indexing
--------

Solr is not transaction aware and does not support any kind of rollback or
undo. We therefore only send data to Solr at the end of any successful request.
This is done via collective.indexing, a transaction manager and an end request
transaction hook. This means you won't see any changes done to content inside a
request when doing Solr searches later on in the same request. Inside tests you
need to either commit real transactions or otherwise flush the Solr connection.
There's no transaction concept, so one request doing a search might get some
results in its beginning, then a different request might add new information to
Solr. If the first request is still running and does the same search again, it
might get different results taking the changes from the second request into
account.

Solr is not a real time search engine. While there's work under way to make Solr
capable of delivering real time results, there's currently always a certain
delay of up to some minutes from the time data is sent to Solr to when it is
available in searches.

Search results are returned in Solr by distinct search threads. These search
threads hold a great number of caches which are crucial for Solr to perform.
When index or unindex operations are sent to Solr, it will keep those in memory
until a commit is executed on its own search index. When a commit occurs, all
search threads and thus all caches are thrown away and new threads are created
reflecting the data after the commit. While there's a certain amount of cache
data that is copied to the new search threads, this data has to be validated
against the new index, which takes some time. The `useColdSearcher` and
`maxWarmingSearchers` options of the Solr recipe relate to this aspect. While
cache data is copied over and validated for a new search thread, the searcher
is `warming up`. If the warming up is not yet completed the searcher is
considered to be `cold`.

In order to get really good performance out of Solr, we need to minimize the
number of commits against the Solr index. We can achieve this by turning off
`auto-commit` and using `commitWithin` instead. So we don't send a `commit`
to Solr at the end of each index/unindex request on the Plone side. Instead we
tell Solr to commit the data to its index at most after a certain time interval.
Values of 15 minutes to 1 minute work well for this interval. The larger you
can make this interval, the better the performance of Solr will be, at the cost
of search results lagging behind a bit. In this setup we also need to configure
the `autoCommitMaxTime` option of the Solr server, as `commitWithin` only works
for index but not unindex operations. Otherwise a large number of unindex
operations without any index operations occurring could not be reflected in the
index for a long time.

As a result of all the above, the Solr index and the Plone site will always have
slightly diverging contents. If you use Solr to do searches you need to be aware
of this, as you might get results for objects that no longer exist. So any
`brain/getObject` call on the Plone side needs to have error handling code
around it, as the object might not be there anymore and traversing to it can
throw an exception.

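Such error handling could look like the sketch below. The helper
`existing_objects` and the `FakeBrain` stand-in are hypothetical names, used
only to make the example runnable outside of Plone; it assumes that real search
results expose a `getObject` method as mentioned above.

```python
def existing_objects(brains):
    """Yield the objects of all brains whose content can still be loaded."""
    for brain in brains:
        try:
            obj = brain.getObject()
        except Exception:
            # Deliberately broad: which exception is raised depends on how
            # traversal to the missing object fails.
            continue
        if obj is not None:
            yield obj


class FakeBrain(object):
    """Stand-in for a search result; not part of collective.solr."""

    def __init__(self, obj=None, missing=False):
        self._obj = obj
        self._missing = missing

    def getObject(self):
        if self._missing:
            raise KeyError("object was deleted after it was indexed")
        return self._obj


brains = [FakeBrain("page-1"), FakeBrain(missing=True), FakeBrain("page-2")]
print(list(existing_objects(brains)))  # ['page-1', 'page-2']
```

Stale results are skipped silently here; depending on the site you may prefer
to log them so that a later sync can clean up the index.
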

When adding new or deleting old content or changing the workflow state of it,
you will also not see those actions reflected in searches right away, but only
after a delay of at most the `commitWithin` interval. After a `commitWithin`
operation is sent to Solr, any other operations happening during that time
window will be executed after the first interval is over. So with a 15 minute
interval, if document A is indexed at 5:15, B at 5:20 and C at 5:35, both A and
B will be committed at 5:30 and C at 5:50.

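The timing in this example can be reproduced with a few lines of Python. This
is only a model of the behaviour described above, not Solr code; the function
names are made up, and an operation arriving exactly when a window ends is
assumed to open a new window.

```python
def commit_times(index_minutes, interval):
    """Map each indexing time (in minutes) to the time of its commit."""
    result = {}
    deadline = None
    for t in sorted(index_minutes):
        if deadline is None or t >= deadline:
            # No commit is pending, so this operation opens a new window.
            deadline = t + interval
        # Operations inside an open window share its commit time.
        result[t] = deadline
    return result


def hhmm(minutes):
    """Format minutes since midnight as h:mm."""
    return "%d:%02d" % divmod(minutes, 60)


docs = {"A": 5 * 60 + 15, "B": 5 * 60 + 20, "C": 5 * 60 + 35}
schedule = commit_times(docs.values(), interval=15)
for name, t in sorted(docs.items()):
    print(name, hhmm(t), "->", hhmm(schedule[t]))
# A 5:15 -> 5:30
# B 5:20 -> 5:30
# C 5:35 -> 5:50
```
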

Searching
---------

Information retrieval is a complex science. We try to give a very brief
explanation here; refer to the literature and documentation of Lucene/Solr for
much more detailed information.

If you do searches in normal Plone, you have a search term and query the
SearchableText index with it. The SearchableText is a simple concatenation of
all searchable fields, by default title, description and the body text.

The default ZCTextIndex in Plone uses a simplified version of the Okapi BM25
algorithm described in papers in 1998. It uses two metrics to score documents:

- Term frequency: How often a search term occurs in a document.
- Inverse document frequency: The inverse of the number of documents in which
  a term occurs. Terms only occurring in a few documents are scored higher than
  those occurring in many documents.

It calculates the sum of all scores, for every term common to the query and any
document. So for a query with two terms, a document is likely to score higher
if it contains both terms, except if one of them is a very common term and
another document contains the non-common term more often.

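The effect of the two metrics, including the exception just mentioned, can be
seen in a toy example. The sketch below uses a plain `tf * idf` sum with a
made-up four-document corpus; it is not the formula used by ZCTextIndex, which
implements a BM25 variant, and only illustrates how the two metrics interact.

```python
import math


def score(query_terms, doc_terms, corpus):
    """Sum tf * idf over the terms shared by the query and the document."""
    total = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        idf = math.log(1.0 + len(corpus) / df)  # rare terms weigh more
        total += tf * idf
    return total


both_terms = ["the", "solr", "server"]
rare_term_often = ["solr", "solr", "solr"]
common_term_only = ["the", "plone", "site"]
corpus = [both_terms, rare_term_often, common_term_only,
          ["the", "catalog", "index"]]

query = ["the", "solr"]
for doc in (both_terms, rare_term_often, common_term_only):
    print(doc, round(score(query, doc, corpus), 3))
```

The document containing both terms beats the one containing only the common
term `the`, but it loses against the document that repeats the rarer term
`solr` three times.
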

The similarity function used in Solr/Lucene uses a different algorithm, based on
a combination of a boolean and vector space model, but taking the same
underlying metrics into account. In addition to the term frequency and inverse
document frequency, Solr respects some more metrics:

- length normalization: The number of all terms in a field. Shorter fields
  contribute higher scores compared to long fields.
- boost values: There's a variety of boost values that can be applied, both
  index-time document boost values as well as boost values per search field or
  search term.

In its pre 2.0 versions, collective.solr used a naive approach and mirrored the
approach taken by ZCTextIndex. So it sent each search query as one query and
matched it against the full SearchableText field inside Solr. By doing that Solr
basically used the same algorithm as ZCTextIndex as it only had one field to
match with the entire text in it. The only difference was the use of the length
normalization, so shorter documents ranked higher than those with longer texts.
This actually caused search quality to be worse, as you'd frequently find
folders, links or otherwise rather empty documents. The Okapi BM25
implementation in ZCTextIndex deliberately ignores the document length for that
reason.

In order to get good or better search quality from Solr, we have to query it in
a different way. Instead of concatenating all fields into one big text, we need
to preserve the individual fields and use their intrinsic importance. We get the
main benefit by realizing that matches on the title and description are more
important than matches on the body text or other fields in a document.
collective.solr 2.0+ does exactly that by introducing a `search-pattern` to be
used for text searches. In its default form it causes each query to work against
the title, description and full searchable text fields and boosts the title by
a high and the description by a medium value. The length normalization already
provides an improvement for these fields, as the title is likely short, the
description a bit longer and the full text even longer. By using explicit boost
values the effect becomes more pronounced.

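To make the idea of a search pattern more concrete, the following sketch
expands a user query into a boosted multi-field query using the standard
Lucene `field:"phrase"^boost` syntax. The pattern string, the boost values `5`
and `2`, the `{value}` placeholder and the helper name `build_query` are
assumptions for illustration and do not necessarily match the default
`search-pattern` shipped with collective.solr.

```python
# Illustrative pattern: field names and boost values are assumptions.
PATTERN = (
    "+(Title:{value}^5 OR Description:{value}^2 OR SearchableText:{value})"
)


def build_query(user_input, pattern=PATTERN):
    """Expand the user input into a boosted query over several fields."""
    # Quote the input as a phrase, escaping backslashes and double quotes.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return pattern.format(value='"%s"' % escaped)


print(build_query("solr search"))
```

A match in the title then contributes five times and a match in the
description twice as much as the same match in the full text, on top of the
effect of length normalization.
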

If you do custom searches or want to include more fields in the full text
search you need to keep the above in mind. Simply setting the `searchable`
attribute on the schema of a field to `True` will only include it in the big
searchable text stream. If you for example include a field containing tags, the
simple tag names will likely 'drown' in the full body text. You might want to
instead change the search pattern to include the field and potentially put a
boost value on it - though it will already be more important as it's likely to
be extremely short. Similarly, extracting the full text of binary files and
simply appending it to the search stream might not be the best approach. You
should rather index those in a separate field and then maybe use a boost value
of less than one to make the field less important. Given two documents with the
same content, one as a normal page and one as a binary file, you'll likely want
to find the page first, as it's faster to access and read than the file.

There's a good number of other improvements you can make using query time and
index time boost values. To provide index time boost values, you can provide
a skin script called `solr_boost_index_values` which gets the object to be
indexed and the data sent to Solr as arguments and returns a dictionary of field
names to boost values for each document. The safest is to return a boost value
for the empty string, which results in a document boost value. Field level boost
values don't work with all searches, especially wildcard searches as done by
most simple web searches. The index time boost allows you to implement policies
like boosting certain content types over others, taking into account ratings or
number of comments as a measure of user feedback or anything else that can be
derived from each content item.

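The logic of such a script might look like the sketch below. It is written as a
plain Python function so it can be run on its own, whereas a real skin script
lives in a skin layer and has its own calling conventions. The policy, the
boost factors and the `comment_count` key are hypothetical; only the script
name, its two arguments and the empty-string key come from the description
above.

```python
def solr_boost_index_values(obj, data):
    """Return a mapping of field name to index-time boost for one document."""
    boost = 1.0
    # Example policy: rank news items above other content types.
    if data.get("portal_type") == "News Item":
        boost *= 2.0
    # Example policy: up to ten comments raise the boost by up to a factor
    # of two. "comment_count" is a hypothetical key in the indexed data.
    comments = data.get("comment_count", 0)
    boost *= 1.0 + min(comments, 10) / 10.0
    # The empty-string key stands for the document as a whole.
    return {"": boost}


print(solr_boost_index_values(None, {"portal_type": "News Item",
                                     "comment_count": 5}))
# {'': 3.0}
```

Returning only the document-level boost keeps the script compatible with
wildcard searches, for which field-level boosts may be ignored.
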

Production
==========

Java settings
-------------

Make sure you are using a `server` version of Java in production. The output
of::

  $ java -version

should include `Java HotSpot(TM) Server VM` or
`Java HotSpot(TM) 64-Bit Server VM`. You can force the Java VM into server mode
by calling it with the `-server` option. Do not try to run Solr with versions
of OpenJDK or other non-official Java versions. They tend to not work well or
at all.

Depending on the size of your Solr index, you need to configure the Java VM to
have enough memory. Good starting values are `-Xms128M -Xmx256M`; as a rule of
thumb, keep `Xmx` double the size of `Xms`.

You can configure these settings via the `java_opts` value in the
`collective.recipe.solrinstance` recipe section like::

  java_opts =
    -server
    -Xms128M
    -Xmx256M

Monitoring
----------

Java has a general monitoring framework called JMX. You can use this to get
a huge number of details about the Java process in general and Solr in
particular. Some hints are at http://wiki.apache.org/solr/SolrJmx. The default
`collective.recipe.solrinstance` config uses `<jmx />`, so we can use command
line arguments to configure it. Our example `buildout/solr.cfg` includes all
the relevant values in its `java_opts` variable.

To view all the available metrics, start Solr and then the `jconsole` command
included in the Java SDK and connect to the local process named `start.jar`.
Solr specific information is available from the MBeans tab under the `solr`
section. For example you'll find `avgTimePerRequest` within
`search/org.apache.solr.handler.component.SearchHandler` under `Attributes`.

If you want to integrate with munin, you can install the JMX plugin at:
http://exchange.munin-monitoring.org/plugins/jmx/details

Follow its install instructions and tweak the included examples to query the
information you want to track. To track the average time per search request,
add a file called `solr_avg_query_time.conf` into `/usr/share/munin/plugins`
with the following contents::

  graph_title Average Query Time
  graph_vlabel ms
  graph_category Solr

  solr_average_query_time.label time per request
  solr_average_query_time.jmxObjectName solr/:type=search,id=org.apache.solr.handler.component.SearchHandler
  solr_average_query_time.jmxAttributeName avgTimePerRequest

Then add a symlink to add the plugin::

  $ ln -s /usr/share/munin/plugins/jmx_ /etc/munin/plugins/jmx_solr_avg_query_time

Point the jmx plugin to the Solr process, by
opening `/etc/munin/plugin-conf.d/munin-node.conf` and adding something like::

  [jmx_*]
  env.jmxurl service:jmx:rmi:///jndi/rmi://127.0.0.1:8984/jmxrmi

The host and port need to match those passed via `java_opts` to Solr. To check
whether the plugins are working, run::

  $ export jmxurl="service:jmx:rmi:///jndi/rmi://127.0.0.1:8984/jmxrmi"
  $ cd /etc/munin/plugins

Then call the plugin you configured directly, using the name of the symlink
created above, for example::

  $ ./jmx_solr_avg_query_time
  solr_average_query_time.value NaN

We include a number of useful configurations inside the package, in the
`collective/solr/munin_config` directory. You can copy all of them into the
`/usr/share/munin/plugins` directory and create the symlinks for all of them.

Replication
-----------

At this point Solr doesn't yet allow for a full fault tolerance setup. You can
read more about the `Solr Cloud`__ effort which aims to provide this.

But we can set up a simple master/slave replication using Solr's built-in
`Solr Replication`__ support, which is a first step in the right direction.

.. __: http://wiki.apache.org/solr/SolrCloud
.. __: http://wiki.apache.org/solr/SolrReplication

In order to use this, you can set up a Solr master server and give it some
extra config::

  [solr-instance]
  additional-solrconfig =
    <requestHandler name="/replication" class="solr.ReplicationHandler" >
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="replicateAfter">startup</str>
        <str name="replicateAfter">optimize</str>
      </lst>
    </requestHandler>

Then you can point one or multiple slave servers to the master. Assuming the
master runs on `solr-master.domain.com` at port `8983`, we could write::

  [solr-instance]
  additional-solrconfig =
    <requestHandler name="/replication" class="solr.ReplicationHandler" >
      <lst name="slave">
        <str name="masterUrl">http://solr-master.domain.com:8983/solr/replication</str>
        <str name="pollInterval">00:00:30</str>
      </lst>
    </requestHandler>

A poll interval of 30 seconds should be fast enough without creating too much
overhead.

At this point `collective.solr` does not yet have support for connecting to
multiple servers and using the slaves as a fallback for querying. As there's no
master-master setup yet, fault tolerance for index changes cannot be provided.

Development
===========

Releases can be found on the Python Package Index at
http://pypi.python.org/pypi/collective.solr. The code and issue tracker can be
found on GitHub at https://github.com/Jarn/collective.solr.

For outstanding issues and features remaining to be implemented please see the
`to-do list`__ included in the package as well as its `issue tracker`__.

.. __: https://github.com/Jarn/collective.solr/blob/master/TODO.txt
.. __: https://github.com/Jarn/collective.solr/issues

Credits
=======

This code was inspired by `enfold.solr`_ by `Enfold Systems`_ as well as `work
done at the snowsprint'08`__. The `solr.py` module is based on the original
python integration package from `Solr`_ itself.

Development was kindly sponsored by `Elkjop`_ and the
`Nordic Council and Nordic Council of Ministers`_.

.. _`enfold.solr`: https://svn.enfoldsystems.com/trac/public/browser/enfold.solr/branches/snowsprint08-buildout/enfold.solr
.. _`Enfold Systems`: http://www.enfoldsystems.com/
.. __: http://tarekziade.wordpress.com/2008/01/20/snow-sprint-report-1-indexing/
.. _`Elkjop`: http://www.elkjop.no/
.. _`Nordic Council and Nordic Council of Ministers`: http://www.norden.org/en/
.. _`Solr`: http://lucene.apache.org/solr/
.. _`Plone`: http://www.plone.org/