elasticsearch-analysis-pinyin

Pinyin Analysis for Elasticsearch

License	The Apache Software License, Version 2.0
Categories	Search Business Logic Libraries Elasticsearch
GroupId	org.elasticsearch
ArtifactId	elasticsearch-analysis-pinyin
Last Version	5.2.2
Release Date	Mar 2, 2017
Type	jar
Description	elasticsearch-analysis-pinyin Pinyin Analysis for Elasticsearch
Source Code Management	http://github.com/medcl/elasticsearch-analysis-pinyin

Download elasticsearch-analysis-pinyin

Filename	Size
elasticsearch-analysis-pinyin-5.2.2.pom
elasticsearch-analysis-pinyin-5.2.2.zip	195 KB
elasticsearch-analysis-pinyin-5.2.2-sources.jar	13 KB
elasticsearch-analysis-pinyin-5.2.2-javadoc.jar	103 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.elasticsearch/elasticsearch-analysis-pinyin/ -->
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-analysis-pinyin</artifactId>
    <version>5.2.2</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.elasticsearch/elasticsearch-analysis-pinyin/
implementation 'org.elasticsearch:elasticsearch-analysis-pinyin:5.2.2'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.elasticsearch/elasticsearch-analysis-pinyin/
implementation ("org.elasticsearch:elasticsearch-analysis-pinyin:5.2.2")

Apache Buildr

'org.elasticsearch:elasticsearch-analysis-pinyin:jar:5.2.2'

Apache Ivy

<dependency org="org.elasticsearch" name="elasticsearch-analysis-pinyin" rev="5.2.2">
  <artifact name="elasticsearch-analysis-pinyin" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.elasticsearch', module='elasticsearch-analysis-pinyin', version='5.2.2')
)

Scala SBT

libraryDependencies += "org.elasticsearch" % "elasticsearch-analysis-pinyin" % "5.2.2"

Leiningen

[org.elasticsearch/elasticsearch-analysis-pinyin "5.2.2"]

Dependencies

compile (1)

Group / Artifact	Type	Version
org.elasticsearch : elasticsearch	jar	5.2.2

runtime (1)

Group / Artifact	Type	Version
log4j : log4j	jar	1.2.16

test (7)

Group / Artifact	Type	Version
org.hamcrest : hamcrest-core	jar	1.3.RC2
org.hamcrest : hamcrest-library	jar	1.3.RC2
org.powermock : powermock-module-junit4	jar	1.6.2
org.powermock : powermock-api-mockito	jar	1.6.2
nl.jqno.equalsverifier : equalsverifier	jar	1.7.5
com.openpojo : openpojo	jar	0.8.1
junit : junit	jar	4.9

system (1)

Group / Artifact	Type	Version
org.nlpcn : nlp-lang	jar	1.7

Project Modules

There are no modules declared in this project.

Pinyin Analysis for Elasticsearch

This Pinyin Analysis plugin converts between Chinese characters and Pinyin. It integrates the NLP tools library (https://github.com/NLPchina/nlp-lang).

--------------------------------------------------
| Pinyin Analysis Plugin        | Elasticsearch  |
--------------------------------------------------
| master                        | 7.x -> master  |
--------------------------------------------------
| 6.x                           | 6.x            |
--------------------------------------------------  
| 5.x                           | 5.x            |
--------------------------------------------------  
| 1.8.1                         | 2.4.1          |
--------------------------------------------------  
| 1.7.5                         | 2.3.5          |
--------------------------------------------------  
| 1.6.1                         | 2.2.1          |
--------------------------------------------------
| 1.5.0                         | 2.1.0          |
--------------------------------------------------
| 1.4.0                         | 2.0.x          |
--------------------------------------------------
| 1.3.0                         | 1.6.x          |
--------------------------------------------------
| 1.2.2                         | 1.0.x          |
--------------------------------------------------

The plugin includes an analyzer (pinyin), a tokenizer (pinyin), and a token filter (pinyin).

Optional Parameters

keep_first_letter when enabled, keeps the joined first letters of the pinyin, e.g. 刘德华 > ldh, default: true
keep_separate_first_letter when enabled, keeps the first letters as separate terms, e.g. 刘德华 > l,d,h, default: false. NOTE: query results may be too fuzzy because these single-letter terms are very frequent
limit_first_letter_length sets the maximum length of the first-letter result, default: 16
keep_full_pinyin when enabled, keeps the full pinyin of each character, e.g. 刘德华 > [liu,de,hua], default: true
keep_joined_full_pinyin when enabled, keeps the full pinyin joined into one term, e.g. 刘德华 > [liudehua], default: false
keep_none_chinese keeps non-Chinese letters and numbers in the result, default: true
keep_none_chinese_together keeps non-Chinese letters together, default: true, e.g. DJ音乐家 -> DJ,yin,yue,jia; when set to false, DJ音乐家 -> D,J,yin,yue,jia. NOTE: keep_none_chinese must be enabled first
keep_none_chinese_in_first_letter keeps non-Chinese letters in the first-letter term, e.g. 刘德华AT2016 -> ldhat2016, default: true
keep_none_chinese_in_joined_full_pinyin keeps non-Chinese letters in the joined full pinyin, e.g. 刘德华2016 -> liudehua2016, default: false
none_chinese_pinyin_tokenize breaks non-Chinese letters into separate pinyin terms if they form pinyin, default: true, e.g. liudehuaalibaba13zhuanghan -> liu,de,hua,a,li,ba,ba,13,zhuang,han. NOTE: keep_none_chinese and keep_none_chinese_together must be enabled first
keep_original when enabled, also keeps the original input, default: false
lowercase lowercases non-Chinese letters, default: true
trim_whitespace default: true
remove_duplicated_term when enabled, duplicated terms are removed to save index space, e.g. de的 > de, default: false. NOTE: position-related queries may be affected
ignore_pinyin_offset since 6.0, offsets are strictly constrained and overlapping tokens are not allowed. With this parameter enabled, overlapping tokens are allowed by ignoring offsets. Note that all position-related queries and highlighting then become incorrect, so use multi-fields with different settings for different query purposes. If you need offsets, set it to false. default: true
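
To make a few of the options above more concrete, here is a toy Python sketch of the terms they would yield for 刘德华. It is not the plugin's implementation: the function name pinyin_terms is hypothetical, the syllables are supplied by hand instead of being derived from the Chinese text, the term order is illustrative only, and treating limit_first_letter_length as a plain truncation is an assumption.

```python
def pinyin_terms(original, syllables, *,
                 keep_first_letter=True,
                 keep_separate_first_letter=False,
                 limit_first_letter_length=16,
                 keep_full_pinyin=True,
                 keep_joined_full_pinyin=False,
                 keep_original=False,
                 remove_duplicated_term=False):
    """Toy model of the terms some options produce (not the plugin's code)."""
    terms = []
    for syllable in syllables:
        if keep_separate_first_letter:
            terms.append(syllable[0])       # e.g. l, d, h
        if keep_full_pinyin:
            terms.append(syllable)          # e.g. liu, de, hua
    if keep_original:
        terms.append(original)              # the untouched input
    if keep_joined_full_pinyin:
        terms.append("".join(syllables))    # e.g. liudehua
    if keep_first_letter:
        first_letters = "".join(s[0] for s in syllables)
        # Assumption: the limit simply truncates the joined first letters.
        terms.append(first_letters[:limit_first_letter_length])
    if remove_duplicated_term:
        terms = list(dict.fromkeys(terms))  # drop repeats, keep first occurrence
    return terms


syllables = ["liu", "de", "hua"]  # hand-written pinyin for 刘德华
print(pinyin_terms("刘德华", syllables))
# ['liu', 'de', 'hua', 'ldh']
print(pinyin_terms("刘德华", syllables, keep_original=True))
# ['liu', 'de', 'hua', '刘德华', 'ldh']
```

The defaults in the signature mirror the documented defaults, so toggling one keyword at a time shows which terms each option adds or removes.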

1. Create an index with a custom pinyin analyzer

PUT /medcl/ 
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : true,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "remove_duplicated_term" : true
                }
            }
        }
    }
}
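
For readers who are not using a console, the same request can be sketched in Python with only the standard library. This is a minimal sketch under assumptions: the helper name build_create_index_request is hypothetical, the base URL http://localhost:9200 is taken from the curl examples in this document, and the request is only built here, not sent.

```python
import json
import urllib.request

# Same settings as the console request above, written as a Python dict.
INDEX_SETTINGS = {
    "settings": {
        "analysis": {
            "analyzer": {
                "pinyin_analyzer": {"tokenizer": "my_pinyin"}
            },
            "tokenizer": {
                "my_pinyin": {
                    "type": "pinyin",
                    "keep_separate_first_letter": False,
                    "keep_full_pinyin": True,
                    "keep_original": True,
                    "limit_first_letter_length": 16,
                    "lowercase": True,
                    "remove_duplicated_term": True,
                }
            },
        }
    }
}


def build_create_index_request(base_url="http://localhost:9200", index="medcl"):
    """Build (but do not send) a PUT request that creates the index."""
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(INDEX_SETTINGS).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_create_index_request()
print(request.get_method(), request.full_url)
# To actually create the index against a running cluster:
# urllib.request.urlopen(request)
```

Keeping the settings in a dict makes it easy to vary one tokenizer option and recreate the index under a different name for comparison.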

2. Test the analyzer by analyzing a Chinese name, such as 刘德华

GET /medcl/_analyze
{
  "text": ["刘德华"],
  "analyzer": "pinyin_analyzer"
}

{
  "tokens" : [
    {
      "token" : "liu",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "de",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "hua",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "刘德华",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 4
    }
  ]
}
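
When comparing analyzer settings it can help to reduce a response like the one above to one row per token. The following is a small sketch, not part of the plugin: the helper name token_summary is hypothetical and the sample string is a compact copy of the response shown above.

```python
import json

# Compact copy of the sample _analyze response shown above.
SAMPLE_RESPONSE = """
{"tokens": [
  {"token": "liu", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
  {"token": "de", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
  {"token": "hua", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
  {"token": "刘德华", "start_offset": 0, "end_offset": 3, "type": "word", "position": 3},
  {"token": "ldh", "start_offset": 0, "end_offset": 3, "type": "word", "position": 4}
]}
"""


def token_summary(response_text):
    """Return (token, start_offset, end_offset, position) for each token."""
    body = json.loads(response_text)
    return [
        (t["token"], t["start_offset"], t["end_offset"], t["position"])
        for t in body["tokens"]
    ]


for row in token_summary(SAMPLE_RESPONSE):
    print(row)
```

In this sample the last two tokens (刘德华 and ldh) span offsets 0 to 3 and therefore overlap the three per-character tokens.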

3. Create the mapping

POST /medcl/_mapping 
{
    "properties": {
        "name": {
            "type": "keyword",
            "fields": {
                "pinyin": {
                    "type": "text",
                    "store": false,
                    "term_vector": "with_offsets",
                    "analyzer": "pinyin_analyzer",
                    "boost": 10
                }
            }
        }
    }
}

4. Indexing

POST /medcl/_create/andy
{"name":"刘德华"}

5. Let's search

curl 'http://localhost:9200/medcl/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E'
curl 'http://localhost:9200/medcl/_search?q=name.pinyin:%e5%88%98%e5%be%b7'
curl 'http://localhost:9200/medcl/_search?q=name.pinyin:liu'
curl 'http://localhost:9200/medcl/_search?q=name.pinyin:ldh'
curl 'http://localhost:9200/medcl/_search?q=name.pinyin:de+hua'
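
The percent-encoded strings in the URLs above are simply the UTF-8 bytes of the Chinese query text. The sketch below shows one way to produce and decode them with Python's standard library; the helper name search_url is hypothetical and the host is the one used in the curl commands.

```python
from urllib.parse import quote_plus, unquote

BASE = "http://localhost:9200/medcl/_search"


def search_url(field, text):
    """Build a URI search URL, percent-encoding the query text as UTF-8."""
    # quote_plus also turns spaces into "+", as in the "de+hua" example.
    return f"{BASE}?q={field}:{quote_plus(text)}"


print(search_url("name", "刘德华"))
# http://localhost:9200/medcl/_search?q=name:%E5%88%98%E5%BE%B7%E5%8D%8E
print(search_url("name.pinyin", "de hua"))
# http://localhost:9200/medcl/_search?q=name.pinyin:de+hua
print(unquote("%e5%88%98%e5%be%b7"))
# 刘德
```

So the first two searches query for 刘德华 on the keyword field and for 刘德 on the pinyin sub-field, respectively.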

6. Using the Pinyin token filter

PUT /medcl1/ 
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "user_name_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : "pinyin_first_letter_and_full_pinyin_filter"
                }
            },
            "filter" : {
                "pinyin_first_letter_and_full_pinyin_filter" : {
                    "type" : "pinyin",
                    "keep_first_letter" : true,
                    "keep_full_pinyin" : false,
                    "keep_none_chinese" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true,
                    "trim_whitespace" : true,
                    "keep_none_chinese_in_first_letter" : true
                }
            }
        }
    }
}

Token test: 刘德华 张学友 郭富城 黎明 四大天王

GET /medcl1/_analyze
{
  "text": ["刘德华 张学友 郭富城 黎明 四大天王"],
  "analyzer": "user_name_analyzer"
}

{
  "tokens" : [
    {
      "token" : "ldh",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "zxy",
      "start_offset" : 4,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "gfc",
      "start_offset" : 8,
      "end_offset" : 11,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "lm",
      "start_offset" : 12,
      "end_offset" : 14,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "sdtw",
      "start_offset" : 15,
      "end_offset" : 19,
      "type" : "word",
      "position" : 4
    }
  ]
}
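
The offsets in this response follow from the whitespace tokenizer: each name is one token, and the filter replaces it with its first letters while keeping the token's offsets. The sketch below reproduces only the offset arithmetic in Python; it is an illustration, not the tokenizer's code, and the helper name whitespace_token_offsets is hypothetical. Python indexes strings by code point, which agrees with the offsets shown here because every character in the sample is in the Basic Multilingual Plane.

```python
import re


def whitespace_token_offsets(text):
    """Return (token, start_offset, end_offset) for each run of non-whitespace."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


for token, start, end in whitespace_token_offsets("刘德华 张学友 郭富城 黎明 四大天王"):
    print(token, start, end)
# 刘德华 0 3
# 张学友 4 7
# 郭富城 8 11
# 黎明 12 14
# 四大天王 15 19
```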

7. Use in phrase queries

option 1

PUT /medcl2/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "pinyin_analyzer" : {
                    "tokenizer" : "my_pinyin"
                }
            },
            "tokenizer" : {
                "my_pinyin" : {
                    "type" : "pinyin",
                    "keep_first_letter":false,
                    "keep_separate_first_letter" : false,
                    "keep_full_pinyin" : true,
                    "keep_original" : false,
                    "limit_first_letter_length" : 16,
                    "lowercase" : true
                }
            }
        }
    }
}

GET /medcl2/_search
{
  "query": {"match_phrase": {
    "name.pinyin": "刘德华"
  }}
}

option 2

 
PUT /medcl3/
{
   "settings" : {
       "analysis" : {
           "analyzer" : {
               "pinyin_analyzer" : {
                   "tokenizer" : "my_pinyin"
               }
           },
           "tokenizer" : {
               "my_pinyin" : {
                   "type" : "pinyin",
                   "keep_first_letter":true,
                   "keep_separate_first_letter" : true,
                   "keep_full_pinyin" : true,
                   "keep_original" : false,
                   "limit_first_letter_length" : 16,
                   "lowercase" : true
               }
           }
       }
   }
}
   
POST /medcl3/_mapping 
{
  "properties": {
      "name": {
          "type": "keyword",
          "fields": {
              "pinyin": {
                  "type": "text",
                  "store": false,
                  "term_vector": "with_offsets",
                  "analyzer": "pinyin_analyzer",
                  "boost": 10
              }
          }
      }
  }
}
  
   
GET /medcl3/_analyze
{
   "text": ["刘德华"],
   "analyzer": "pinyin_analyzer"
}
 
POST /medcl3/_create/andy
{"name":"刘德华"}

GET /medcl3/_search
{
 "query": {"match_phrase": {
   "name.pinyin": "刘德h"
 }}
}

GET /medcl3/_search
{
 "query": {"match_phrase": {
   "name.pinyin": "刘dh"
 }}
}

GET /medcl3/_search
{
 "query": {"match_phrase": {
   "name.pinyin": "liudh"
 }}
}

GET /medcl3/_search
{
 "query": {"match_phrase": {
   "name.pinyin": "liudeh"
 }}
}

GET /medcl3/_search
{
 "query": {"match_phrase": {
   "name.pinyin": "liude华"
 }}
}

8. That's all, have fun.

Versions

Version
5.2.2 Mar 2, 2017
5.2.1 Feb 15, 2017
5.2.0 Feb 15, 2017
5.1.2 Jan 24, 2017
5.1.1 Dec 13, 2016
5.0.2 Nov 30, 2016
5.0.1 Nov 16, 2016
5.0.0 Oct 27, 2016
5.0.0-rc1 Oct 13, 2016
5.0.0-beta1 Sep 28, 2016
5.0.0-alpha5 Aug 26, 2016
1.8.3 Dec 14, 2016
1.8.2 Nov 23, 2016
1.8.1 Oct 13, 2016
1.7.5 Aug 18, 2016
