Kind of Chinese Analysis for Elasticsearch
Requirements
- Java 7 update 55 or later
Structure of es-ik
-
ik-analysis-core
The algorithm of this module is coming from ik-analyzer. In principle, you can use this module to implement a Solor analyzer plugin or a Elasticsearch plugin.
You just need implement
DictionaryConfiguration
interface to provide dictionary content which is used by analysing content process. -
ik-analysis-es-plugin:
Integrate with ik-analyzer-core module and Elasticsearch. Define a kind of SPI which is
Configuration
extendsDictionaryConfiguration
-
es-ik-sqlite3
Persist dictionary's content into Sqlite3 database. This module is a kind of
service provider
to SPI Configuration defined in ik-analysis-es-plugin.
How to use es-ik
Actually, ik-analysis-es-plugin expose a interface DictionaryConfiguration
a kind of SPI. es-ik-sqlite3 implement it so that ik-analysis-es-plugin can get dictionary's content from Sqlite. In other words, you can get your implementation like persisting dictionary's content into Redis.
SPI is just a kind of concept. In java, I use ServiceLoader to implement that. As soon as your implementation conforms with ServiceLoader's usage, don't need to change ik-analysis-es-plugin module, you'll get a new ik-analysis-es-plugin's plugin. :P
How to use es-ik-sqlite3(currently version 1.0.1)
-
tell elasticsearch where is you sqlite3 db, add a configuration into your elasticsearch.yml, like:
ik_analysis_db_path: /opt/ik/dictionary.db
PS: you can download my dictionary.db from https://github.com/zacker330/es-ik-sqlite3-dictionary
-
get in you elasticsearch folder then install plugin:
./bin/plugin -i ik-analysis -u https://github.com/zacker330/es-ik-plugin-sqlite3-release/raw/master/es-ik-sqlite3-1.0.1.zip
-
test your configuration:
-
create songs index
curl -X PUT -H "Cache-Control: no-cache" -d '{ "settings":{ "index":{ "number_of_shards":1, "number_of_replicas": 1 } } }' 'http://localhost:9200/songs/'
-
create map for songs/song
curl -X PUT -H "Cache-Control: no-cache" -d '{ "song": { "_source": {"enabled": true}, "_all": { "indexAnalyzer": "ik_analysis", "searchAnalyzer": "ik_analysis", "term_vector": "no", "store": "true" }, "properties":{ "title":{ "type": "string", "store": "yes", "indexAnalyzer": "ik_analysis", "searchAnalyzer": "ik_analysis", "include_in_all": "true" } } } } ' 'http://localhost:9200/songs/_mapping/song'
-
test it
curl -X POST -d '林夕为我们作词' 'http://localhost:9200/songs/_analyze?analyzer=ik_analysis' response: {"tokens":[{"token":"林夕","start_offset":0,"end_offset":2,"type":"CN_WORD","position":1},{"token":"作词","start_offset":5,"end_offset":7,"type":"CN_WORD","position":2}]}
Create a empty sqlite3 db for es-ik-sqlite3
-
create database
sqlite3 dictionary.db
-
create tables
CREATE TABLE main_dictionary(term TEXT NOT NULL,unique(term)); CREATE TABLE quantifier_dictionary(term TEXT NOT NULL,unique(term)); CREATE TABLE stopword_dictionary(term TEXT NOT NULL,unique(term));
617052 records ~= 30MB db file