The boosting dismax query parser (bmax)
♻️ This is the official, maintained fork of the original @shopping24 repository. It is maintained by solr.cool.
A synonym-aware edismax query parser for Apache Solr. The bmax query parser relies on field types and tokenizer chains to parse the user query and to discover synonyms, subtopics, boost terms and penalize terms at query time. This makes it highly configurable. It is well suited for e-commerce search because it eliminates the usage of term and document frequency.
It does not accept any Lucene query syntax (~-+()). The composed query is a dismax query with a minimum must match of 100%.
This document covers Version 1.5.x and onwards. For the old 0.9.9 version, take a look at the release branch.
Fundamentals
Terminology
Synonym - a (bidirectional) syntactic or semantic equivalent to an origin term. It expands recall, and in ranking, matches on synonyms are scored almost as high as the origin term (default 0.9). Example: tv -> television.
Subtopic - a unidirectional specification of an origin term that expands recall and scores lower than the origin term. Example: bicycle -> mountainbike or laptop -> macbook.
Penalize term - a term that semantically describes what should rank lower in a search result matching the origin term. These terms do not increase recall; documents matching penalize terms rank lower. Example: mountainbike -> isbn
Boost term - a term that semantically describes what should rank higher in a search result matching the origin term. These terms do not increase recall; documents matching boost terms rank higher. Example: television -> hdmi
document and term frequency handling
The bmax query parser eliminates the usage of term and document frequency for document ranking. With subtopics, synonyms, boost and penalize terms disabled and query fields set to a single field, all returned documents are scored 1.0.
synonym and subtopic handling
Query expansions that increase recall (synonyms and subtopics) are bound to the origin term. Given the synonym example violet to blue, the query blue bike would be rewritten by the bmax parser to (violet OR blue) AND bike. If you add the subtopics mountainbike and ebike to bike, the query would be rewritten to (violet OR blue) AND (bike OR mountainbike OR ebike).
Out-of-the-box synonym handling in Solr (dismax, edismax) loses these relationships during query analysis. For example, given the synonym violet to blue, regular Solr synonym handling would rewrite the query blue bike to blue violet bike. Depending on your query parser (and the mm setting in dismax), this could lead to higher recall with far less precision.
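The difference between bound and flat expansion can be illustrated with a small Python sketch. It is not the parser's implementation: documents are modelled as plain sets of terms, and the helper names (matches_bound, matches_flat_all, matches_flat_any) are made up for this illustration.

```python
from typing import List, Set


def matches_bound(doc_terms: Set[str], groups: List[List[str]]) -> bool:
    """bmax-style: every group must match, any alternative inside a group suffices."""
    return all(any(term in doc_terms for term in group) for group in groups)


def matches_flat_all(doc_terms: Set[str], terms: List[str]) -> bool:
    """Flat expansion where every term is required (mm=100%)."""
    return all(term in doc_terms for term in terms)


def matches_flat_any(doc_terms: Set[str], terms: List[str]) -> bool:
    """Flat expansion where a single matching term suffices (mm=1)."""
    return any(term in doc_terms for term in terms)


# (violet OR blue) AND bike: the synonym stays bound to its origin term.
bound_groups = [["blue", "violet"], ["bike"]]
# blue violet bike: the relationship between blue and violet is lost.
flat_terms = ["blue", "violet", "bike"]

violet_bike = {"violet", "bike"}  # a relevant document
violet_sofa = {"violet", "sofa"}  # an irrelevant document

print(matches_bound(violet_bike, bound_groups))   # True
print(matches_bound(violet_sofa, bound_groups))   # False
print(matches_flat_all(violet_bike, flat_terms))  # False
print(matches_flat_any(violet_sofa, flat_terms))  # True
```

In this toy model the bound form finds the violet bike and rejects the violet sofa, while the flat form either misses the relevant document (all terms required) or admits the irrelevant one (any term suffices).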
Using the Bmax query parser
To take advantage of the bmax query parser, install and configure it as described under "Installing the Bmax query parser". The query parser utilizes two components, the booster and the query parser. The booster enriches the query with boost and penalize terms; the query parser transforms a given user query into a Lucene search query.
Use the following URL parameters to fine-tune your installation.
Boost component parameters
The Bmax boost component enriches the query with boost and penalize terms.
- bmax.booster (boolean) - enable/disable boost term component. Default is false.
- bmax.booster.boost (boolean) - enable/disable boost term resolution. Default is true.
- bmax.booster.boost.factor (float) - boost factor that is multiplied with the boosts given in the qf or bmax.booster.boost.qf parameter for each query field respectively. Default is 1.0.
- bmax.booster.boost.strategy (String) - strategy for combining boost terms with the main query: rq - rerank query, bq - boost query (additive), boost - boost function (multiplicative). Default is rq.
- bmax.booster.boost.docs (int) - the number of documents to boost from the beginning of the result set (rerank query strategy only). Default is 400.
- bmax.booster.boost.extra (String) - comma-separated extra boost terms. Great for checking new boost term ideas.
- bmax.booster.penalize (boolean) - enable/disable penalize term resolution. Default is true.
- bmax.booster.penalize.factor (float) - penalize factor that is used as negative weight in the penalize query. Default is 100.0.
- bmax.booster.penalize.strategy (String) - strategy for combining penalize terms with the main query: rq - rerank query, bq - boost query (additive), boost - boost function (multiplicative). Default is rq.
- bmax.booster.penalize.docs (int) - the number of documents to penalize from the beginning of the result set (rerank query strategy only). Default is 400.
- bmax.booster.penalize.extra (String) - comma-separated extra penalize terms. Great for checking new ideas.
Query parser params
The Bmax query parser utilizes a Solr edismax query parser and the following standard url parameters can be used:
- q (string) - the user query. Lucene query syntax is not supported.
- qf (string) - the query fields with their weights.
- bq (string) - additive boost query
- bf (string) - additive boost functions
- tie (string) - the dismax tie breaker, default is 0.0.
- boost (string) - multiplicative boost functions
- pf (string) - the phrase fields
- ps (string) - the phrase slop for pf (default for pf2/pf3)
- pf2 (string) - the bigram phrase fields
- ps2 (string) - the phrase slop for pf2
- pf3 (string) - the trigram phrase fields
- ps3 (string) - the phrase slop for pf3
- phrase.tie (float) - a tie breaker that is used when aggregating pf,pf2,pf3 queries. Defaults to the value of tie.
To fine tune or debug your query, use the following extra arguments:
- bmax.synonym (boolean) - enable/disable synonym lookup, default is true.
- bmax.synonym.boost (float) - the term boost for synonym terms, multiplied with the boost defined in the qf parameter for each query field respectively, default is 0.1.
- bmax.subtopic (boolean) - enable/disable subtopic lookup, default is true.
- bmax.subtopic.boost (float) - the term boost for subtopic terms, multiplied with the boost defined in the qf parameter for each query field respectively, default is 0.01.
- bmax.subtopic.qf (string) - the query fields in which to search for subtopics, defaults to the ones given in the qf parameter.
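As a rough illustration of how these parameters fit together in one request, the following Python sketch assembles a select URL. The host, the core name (products) and the field names in qf are assumptions made up for the example; only the parameter names come from the lists above.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and core to your installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"

params = {
    "defType": "bmax",                     # enable the bmax query parser
    "q": "blue bike",                      # plain user query, no Lucene syntax
    "qf": "title^2.0 description^0.5",     # hypothetical fields and weights
    "tie": "0.0",
    "bmax.booster": "true",                # switch the boost component on
    "bmax.booster.boost.strategy": "bq",   # combine boost terms additively
    "bmax.booster.penalize.factor": "100.0",
    "bmax.synonym": "true",
    "bmax.subtopic.boost": "0.01",
}

# urlencode takes care of escaping spaces and the ^ in the field weights.
request_url = SOLR_SELECT_URL + "?" + urlencode(params)
print(request_url)
```

Parameters that are left out fall back to the defaults documented above.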
Query clause reduction / term inspection
Before a term query clause is added to the main query or the boost query, a term inspection cache can be consulted to check whether the term exists in the field's term values. If it does not, the term query clause is omitted. With many query fields, this can reduce the overall query clause count dramatically and speed up query computation.
- bmax.inspect (boolean) - use the local term inspection cache to validate term query clauses. Default is false. Set this to true in your main query configuration to look up each term in the local term inspection cache.
- bmax.inspect.build (boolean) - build a local term inspection cache using the given qf. Default is false. Configure a new/first searcher listener in your solrconfig.xml and query all documents (*:*) once with this parameter set to true. Supply the fields to inspect in the qf parameter.
The term inspection cache is stored in a custom Solr cache named bmax.fieldTermCache. Configure and size this cache in your solrconfig.xml. The cache entries are saved as Dictomaton FSTs in order to consume as little heap as possible.
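The clause reduction can be pictured with the following Python sketch. It is only an illustration under simplifying assumptions: plain Python sets stand in for the FST-backed cache entries, the function name build_term_clauses and the field names are hypothetical, and a field without a cache entry is treated as containing no terms.

```python
from typing import Dict, List, Set, Tuple


def build_term_clauses(terms: List[str],
                       query_fields: List[str],
                       field_terms: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return (field, term) clauses only for terms that exist in the field."""
    clauses = []
    for term in terms:
        for field in query_fields:
            # Assumption: no cache entry for a field means no known terms.
            if term in field_terms.get(field, set()):
                clauses.append((field, term))
    return clauses


# Stand-in for the term inspection cache: field name -> terms present in it.
field_terms = {
    "title": {"blue", "bike", "bicycle"},
    "brand": {"acme"},
    "color": {"blue", "red"},
}

clauses = build_term_clauses(["blue", "bike"], ["title", "brand", "color"], field_terms)
print(clauses)
print(len(clauses), "of", 2 * 3, "possible clauses kept")
```

With two terms and three fields, only three of the six possible clauses survive in this example; the saving grows with the number of query fields.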
Bmax query processing
Query processing in the bmax query parser is split into 2 steps:
- First is retrieving and supplying boost and penalize terms. This is done in the BmaxBoostTermComponent.
- Second is parsing the incoming query and building an appropriate Lucene query. This is done in the BmaxQueryParser.
1. Retrieving boost and penalize terms
The incoming user query (q) is analyzed and boost terms are supplied in the bq parameter. Penalize terms are added in the rq and rqq parameters to form a negative rerank query. Boost and penalize term retrieval is done in 3 steps:
- Run the incoming query in q through the configured queryParsingFieldType.
- Expand synonyms for each query token through synonymFieldType.
- Retrieve boost and penalize terms for each token through boostTermFieldType and penalizeTermFieldType respectively.
Given the example above with q=blue bike cheap, the query parsing field type would remove noise and leave blue bike. The synonym lookup would retrieve bicycle as a synonym for bike and append it: blue bike bicycle. This would be the input for penalize and boost term discovery.
The discovered boost terms crossbike bmx pedelec are appended to the incoming query as a boost query bq={!dismax qf='...' mm=1 bq=''} crossbike bmx pedelec. The discovered penalize terms are appended as rerank query rq={!rerank reRankQuery=$rqq reRankDocs=... reRankWeight=...}&rqq=...book OR toys ... . The rerank query formulated is a boolean OR query.
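The following Python sketch shows roughly how such parameters could be assembled from discovered terms. It is an approximation, not the component's code: the function name booster_params is hypothetical, the rqq value is reduced to bare terms joined with OR, and passing the penalize factor as a negative reRankWeight is an assumption based on the description of bmax.booster.penalize.factor.

```python
from typing import Dict, List


def booster_params(boost_terms: List[str],
                   penalize_terms: List[str],
                   qf: str,
                   rerank_docs: int = 400,
                   penalize_factor: float = 100.0) -> Dict[str, str]:
    """Build bq, rq and rqq request parameters from boost and penalize terms."""
    params = {}
    if boost_terms:
        # Boost terms go into an additive boost query; mm=1 lets any term match.
        params["bq"] = "{!dismax qf='%s' mm=1 bq=''} %s" % (qf, " ".join(boost_terms))
    if penalize_terms:
        # Assumption: the penalize factor is applied as a negative rerank weight.
        params["rq"] = "{!rerank reRankQuery=$rqq reRankDocs=%d reRankWeight=%s}" % (
            rerank_docs, -penalize_factor)
        # The rerank query is a boolean OR over the penalize terms.
        params["rqq"] = " OR ".join(penalize_terms)
    return params


params = booster_params(
    boost_terms=["crossbike", "bmx", "pedelec"],
    penalize_terms=["book", "toys"],
    qf="title",  # hypothetical query field
)
for name, value in params.items():
    print(name, "=", value)
```

The defaults for rerank_docs and penalize_factor mirror the documented defaults of bmax.booster.penalize.docs and bmax.booster.penalize.factor.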
2. Parsing the user query
The bmax query parser utilizes the edismax query parser to build its query. It recognizes the well-known edismax parameters:
- q - the main query
- qf - query fields (weighted)
- bq - the boost query (additive)
- bf - boost functions (additive)
- boost - boost functions (multiplicative)
- pf,ps,pf2,ps2,pf3,ps3 - phrase boosts (additive). Note that the scores from these boosts are added up per type (pf,pf2,pf3) and field but dismax'ed between types and fields. Set phrase.tie=1.0 if you want the standard edismax behaviour and also add up the scores between fields and types.
Rerank queries are realized through the default Solr rerank postfilter. Query parsing is done in 3 steps:
- Run the incoming query in q through the configured queryParsingFieldType.
- Expand synonyms for each query token through synonymFieldType. Synonyms are treated as semantically equal to the source token.
- Retrieve subtopic terms for each token and synonym through subtopicFieldType. Subtopics are bound to the source token in the main query.
Given the example above with q=blue bike cheap, the query parsing field type would remove noise and leave the tokens blue,bike. The synonym lookup would retrieve bicycle as a synonym for bike: blue,[bike,bicycle]. Subtopic retrieval for each token creates: [blue,lavendel],[bike,bicycle,bmx,crossbike,roadbike].
The query constructed is always a dismax query with a minimum must match of 100%. The example above would create the following query:
BooleanQuery(MUST) of
  DismaxQuery(MUST) of blue,lavendel
  DismaxQuery(MUST) of bike,bicycle,bmx,crossbike,roadbike
The boost query (if given) is appended.
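The three parsing steps can be sketched in Python as follows. This is a simplified model rather than the parser's code: dictionaries stand in for the synonym and subtopic field types, the function name expand_tokens is hypothetical, and which subtopic belongs to which term is assumed for the sake of the example.

```python
from typing import Dict, List


def expand_tokens(tokens: List[str],
                  synonyms: Dict[str, List[str]],
                  subtopics: Dict[str, List[str]]) -> List[List[str]]:
    """Return one group per query token: token, its synonyms, then subtopics."""
    groups = []
    for token in tokens:
        # Step 2: synonyms are treated as equal to the source token.
        equivalents = [token] + synonyms.get(token, [])
        group = list(equivalents)
        # Step 3: subtopics are looked up for the token and each synonym,
        # and stay bound to the source token's group.
        for term in equivalents:
            for subtopic in subtopics.get(term, []):
                if subtopic not in group:
                    group.append(subtopic)
        groups.append(group)
    return groups


# Step 1 (tokenizing and stopword removal) is assumed to have produced these tokens.
tokens = ["blue", "bike"]
synonyms = {"bike": ["bicycle"]}
subtopics = {
    "blue": ["lavendel"],
    "bike": ["bmx", "crossbike"],
    "bicycle": ["roadbike"],
}

groups = expand_tokens(tokens, synonyms, subtopics)
# Every group becomes one mandatory dismax clause of the boolean query.
print(" AND ".join("(" + " OR ".join(group) + ")" for group in groups))
```

Each inner list corresponds to one DismaxQuery(MUST) clause in the structure shown above.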
Installing the Bmax query parser
- Place the solr-bmax-queryparser-<VERSION>-jar-with-dependencies.jar in the /lib directory of your Solr installation.
- Configure at least one field type in your schema.xml that can be used for query parsing and tokenizing.
- Configure the bmax query parser in your solrconfig.xml (see below).
- Configure the bmax.booster search component in your solrconfig.xml (see below).
- Enable the bmax query parser using the defType=bmax parameter in your query.
This project is also available from Maven Central:
<dependency>
  <groupId>cool.solr</groupId>
  <artifactId>solr-bmax-queryparser</artifactId>
  <version>2.7.0</version>
  <classifier>jar-with-dependencies</classifier>
</dependency>
Configuring the query parser
Add the BmaxQParserPlugin to the list of query parsers configured in your solrconfig.xml. It takes the following configuration parameters:
<queryParser name="bmax" class="com.s24.search.solr.query.bmax.BmaxQParserPlugin">
  <!-- use this field type's query analyzer to tokenize the query -->
  <str name="queryParsingFieldType">bmax_query</str>
  <!-- further field types for synonyms and subtopics -->
  <str name="synonymFieldType">bmax_synonyms</str>
  <str name="subtopicFieldType">bmax_subtopics</str>
</queryParser>
Configure the boost term component as follows:
<searchComponent name="bmax.booster" class="com.s24.search.solr.component.BmaxBoostTermComponent">
  <!-- use the same as in query parser -->
  <str name="queryParsingFieldType">bmax_query</str>
  <str name="synonymFieldType">bmax_synonyms</str>
  <!-- boost and penalize term retrieval -->
  <str name="boostTermFieldType">bmax_boostterms</str>
  <str name="penalizeTermFieldType">bmax_penalizeterms</str>
</searchComponent>
and add it to the components of your search handler in front of the query component:
<requestHandler name="/select" class="solr.SearchHandler" default="true">
  <arr name="components">
    ...
    <str>bmax.booster</str>
    ...
  </arr>
</requestHandler>
Configuring the fieldTypes needed
A simple example of a field type in your schema.xml that tokenizes an incoming query and removes stopwords might be this:
<fieldType name="bmax_query" class="solr.TextField" indexed="false" stored="false">
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="[+;:,\s©®℗℠™&amp;()/\p{Punct}&lt;&gt;»«]+" />
    <!-- lower case -->
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- Removes stopwords from the query. -->
    <filter class="solr.StopFilterFactory"
            words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
This is an example of a synonym parser. The input is each token of the query analyzer above, one at a time, so there is no need for any fancy tokenizing; the keyword tokenizer will do. This analyzer chain utilizes the SynonymFilter and, as a last step, removes all non-synonyms. With this little trick, no unneeded synonyms get added to your query.
<fieldType name="bmax_synonyms" class="solr.TextField" indexed="false" stored="false">
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <!-- synonyms -->
    <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
    <!-- remove all non-synonyms -->
    <filter class="solr.TypeTokenFilterFactory" types="list_tokentype_synonym.txt" useWhitelist="true"/>
  </analyzer>
</fieldType>
For the boostterm field type, the SynonymFilter might be handy as well.
Building the project
This should install the current version into your local Maven repository:
$ mvn clean install
Releasing the project to maven central
Define new versions
$ export NEXT_VERSION=<version>
$ export NEXT_DEVELOPMENT_VERSION=<version>-SNAPSHOT
Then execute the release chain
$ mvn org.codehaus.mojo:versions-maven-plugin:2.8.1:set -DgenerateBackupPoms=false -DnewVersion=$NEXT_VERSION
$ git commit -a -m "pushes to release version $NEXT_VERSION"
$ mvn -P release
Then tag the release, increment to the next development version and push:
$ git tag -a v$NEXT_VERSION -m "`curl -s http://whatthecommit.com/index.txt`"
$ mvn org.codehaus.mojo:versions-maven-plugin:2.0:set -DgenerateBackupPoms=false -DnewVersion=$NEXT_DEVELOPMENT_VERSION
$ git commit -a -m "pushes to development version $NEXT_DEVELOPMENT_VERSION"
$ git push origin tag v$NEXT_VERSION && git push origin
Contributing
We're looking forward to your comments, issues and pull requests!
License
This project is licensed under the Apache License, Version 2.0.