elasticsearch-phone

Elasticsearch Plugin for Phone and SIP Analysis

Categories: Search Business Logic Libraries Elasticsearch
GroupId: com.inin.analytics
ArtifactId: elasticsearch-phone
Last Version: 1.0.2
Type: jar
Description: Elasticsearch Plugin for Phone and SIP Analysis
Project URL: https://github.com/MyPureCloud/elasticsearch-phone
Source Code Management: https://github.com/MyPureCloud/elasticsearch-phone.git

Download elasticsearch-phone

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.inin.analytics/elasticsearch-phone/ -->
<dependency>
    <groupId>com.inin.analytics</groupId>
    <artifactId>elasticsearch-phone</artifactId>
    <version>1.0.2</version>
</dependency>

Gradle (Groovy):

// https://jarcasting.com/artifacts/com.inin.analytics/elasticsearch-phone/
implementation 'com.inin.analytics:elasticsearch-phone:1.0.2'

Gradle (Kotlin):

// https://jarcasting.com/artifacts/com.inin.analytics/elasticsearch-phone/
implementation ("com.inin.analytics:elasticsearch-phone:1.0.2")

Buildr:

'com.inin.analytics:elasticsearch-phone:jar:1.0.2'

Ivy:

<dependency org="com.inin.analytics" name="elasticsearch-phone" rev="1.0.2">
  <artifact name="elasticsearch-phone" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='com.inin.analytics', module='elasticsearch-phone', version='1.0.2')
)

sbt:

libraryDependencies += "com.inin.analytics" % "elasticsearch-phone" % "1.0.2"

Leiningen:

[com.inin.analytics/elasticsearch-phone "1.0.2"]

Dependencies

compile (4)

Group / Artifact                                 Type   Version
com.googlecode.libphonenumber : libphonenumber   jar    7.0.7
org.elasticsearch : elasticsearch                jar    1.6.0
org.apache.commons : commons-lang3               jar    3.4
commons-io : commons-io                          jar    2.4

test (5)

Group / Artifact                                             Type       Version
org.apache.lucene : lucene-test-framework                    jar        4.10.4
com.carrotsearch.randomizedtesting : randomizedtesting-runner jar       2.1.11
org.elasticsearch : elasticsearch                            test-jar   1.6.0
org.hamcrest : hamcrest-all                                  jar        1.3
junit : junit                                                jar        4.11

Project Modules

There are no modules declared in this project.

Elasticsearch-Phone

Indexing phone numbers and SIP addresses in Lucene is complicated. Most people use ngram tokenizers. We did that for a while with ngram min=3 and max=35, but the result was often hundreds of tokens per SIP address. Working at a call-center-focused company, we quickly figured out how wasteful that is in terms of storage: 6/7ths of our indexes were wasted on useless SIP address tokens.
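To put a rough number on this, the sketch below counts the character n-grams produced for one short SIP address with the min=3, max=35 settings mentioned above. It is plain Python rather than Lucene's ngram tokenizer, and the function name char_ngrams is made up for illustration.

```python
def char_ngrams(text, min_gram=3, max_gram=35):
    """Return every character n-gram of text with length in [min_gram, max_gram]."""
    grams = []
    # Sizes longer than the text itself cannot produce any n-gram.
    for size in range(min_gram, min(max_gram, len(text)) + 1):
        for start in range(len(text) - size + 1):
            grams.append(text[start:start + size])
    return grams


address = "sip:abc@autosbcpc"  # 17 characters, one of the sample inputs
grams = char_ngrams(address)
print(len(grams))  # 120 n-grams for a 17-character address
```

Longer addresses (with an extension, host and port) grow roughly quadratically in token count until the max=35 cap is reached.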

It's a hard problem to regex your way out of. An international phone number often includes a country code, but that can be 1, 2, or 3+ digits. Many people have asked for Elasticsearch to integrate Google's libphonenumber library into a custom Lucene analyzer. It hasn't happened yet, so here's a plugin that attempts to do just that.

Note: This is a young project. We'll improve as time goes on, but use at your own risk.

Building and installing the plugin

mvn package
./bin/plugin --url file:///....elasticsearch-phone/target/releases/elasticsearch-phone-1.0.0.zip --install elasticsearch-phone

Analyzers

This project provides three analyzers that are intended for different contexts.

  • The phone analyzer supports SIP URIs and other phone numbers and is intended to be used when indexing. It strips common prefixes such as sip: and tel: (and indexes those as separate tokens) and tokenizes the phone number with various prefix lengths.
  • The phone-email analyzer extends the phone analyzer with additional tokenization for email addresses (e.g. generating tokens for the user part and the domain part of an email address).
  • The phone-search analyzer is intended to be used as a search_analyzer with one of the other two analyzers used for indexing. It does minimal tokenization: If a term starts with sip: or tel: it strips this part and generates a token for it. The analyzer also strips a leading + from phone numbers.
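The behaviour described for the phone-search analyzer can be approximated in a few lines. This is a rough Python sketch of the description above, not the plugin's Java implementation; the function name sketch_search_tokens is hypothetical, and stripping the + after a sip: or tel: prefix is an assumption.

```python
def sketch_search_tokens(term):
    """Approximate the minimal tokenization described for phone-search."""
    tokens = []
    # A leading sip: or tel: prefix becomes its own token.
    for prefix in ("sip:", "tel:"):
        if term.startswith(prefix):
            tokens.append(prefix)
            term = term[len(prefix):]
            break
    # Assumption: a leading + is dropped whether or not a prefix was present.
    if term.startswith("+"):
        term = term[1:]
    if term:
        tokens.append(term)
    return tokens


print(sketch_search_tokens("tel:8177"))        # ['tel:', '8177']
print(sketch_search_tokens("+441344840400"))   # ['441344840400']
```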

Example inputs

Provide a telephone number or SIP address prefixed by tel: or sip:, with no spaces or symbols.

Your indexing template will need to specify the analyzer for the field. For example:

            "field": {
              "type": "string",
              "analyzer": "phone",
              "search_analyzer": "phone-search"
            }
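
For context, the field definition above normally sits inside a larger mapping. The sketch below builds one possible create-index request body as a Python dict, assuming the Elasticsearch 1.x layout (mappings, then a type name, then properties). The type name "contact" and field name "phone_number" are hypothetical placeholders.

```python
import json


def build_index_body(type_name, field_name):
    """Wrap the phone field definition in an assumed ES 1.x mapping layout."""
    return {
        "mappings": {
            type_name: {
                "properties": {
                    field_name: {
                        "type": "string",
                        "analyzer": "phone",               # used at index time
                        "search_analyzer": "phone-search"  # used at query time
                    }
                }
            }
        }
    }


# "contact" and "phone_number" are illustrative names only.
print(json.dumps(build_index_body("contact", "phone_number"), indent=2))
```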

Sample allowed inputs (see PhoneTokenizerIntegrationTest and PhoneSearchIntegrationTest for more):

  • tel:+441344840400
  • tel:+498362930830
  • sip:abc@autosbcpc
  • sip:+13119310462;[email protected]:8060

Example tokenization

SIP URI

Input (with country code): sip:+13169410766;[email protected]:8060

Tokens:

sip:+13169410766;[email protected]:8060
sip:
13169410766;[email protected]:8060
13169410766;ext=2233
1
2233
3169410766
3
13
31
131
316
1316
3169
13169
31694
131694
316941
1316941
3169410
13169410
31694107
131694107
316941076
1316941076
13169410766

Phone number

Input (without a country code): tel:8177148350

Tokens:

tel:8177148350
tel:
8177148350
8
81
817
8177
81771
817714
8177148
81771483
817714835
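
The token list above for a number without a country code follows a simple pattern: the full input, the prefix, the bare number, then every shorter leading prefix of the number. The Python sketch below reproduces that pattern for illustration only. It is not the plugin's code, the name sketch_index_tokens is hypothetical, and it does not attempt the country-code handling shown in the SIP URI example.

```python
def sketch_index_tokens(value):
    """Reproduce the token pattern shown for tel:8177148350."""
    tokens = [value]
    number = value
    for prefix in ("sip:", "tel:"):
        if value.startswith(prefix):
            tokens.append(prefix)            # the prefix is its own token
            number = value[len(prefix):]
            tokens.append(number)            # the bare number
            break
    # Leading prefixes of the number, from 1 digit up to one short of the full length.
    tokens.extend(number[:length] for length in range(1, len(number)))
    return tokens


for token in sketch_index_tokens("tel:8177148350"):
    print(token)
```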

Email address

Input: user.name@domain.com

Tokens:

user.name@domain.com
user.name
user *
name *
domain.com *
domain *
com *

Tokens marked with * are only generated by the phone-email tokenizer.
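
The email tokens listed above can be approximated by splitting on @ and then on dots. The sketch below is an illustration in Python, not the plugin's implementation; sketch_email_tokens is a hypothetical name, and user.name@domain.com is an assumed example address consistent with the tokens listed above.

```python
def sketch_email_tokens(address):
    """Approximate the token list shown for an email address."""
    local, _, domain = address.partition("@")
    tokens = [address, local]
    # Parts of the user name, only when it actually contains a dot.
    if "." in local:
        tokens.extend(local.split("."))
    # The domain as a whole, then its dot-separated parts.
    tokens.append(domain)
    tokens.extend(domain.split("."))
    return tokens


print(sketch_email_tokens("user.name@domain.com"))
```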

Search examples

Term

Term queries are not analyzed, so they return only exact matches against indexed tokens (no normalization such as lowercasing is applied).

"query": {
  "term" : { "field" : "8177" }
}
"query": {
  "term" : { "field" : "domain" }
}

Match

Match queries use the configured analyzer (or search_analyzer). In this example, the query is translated into a boolean AND of two term queries, one for tel: and one for 8177.

"query": {
  "match" : {
      "field" : {
          "query" : "tel:8177",
          "operator" : "and"
      }
  }
}
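
To make the translation concrete, the sketch below builds the bool query that the match query above is described as being equivalent to, given the two tokens tel: and 8177. It is a Python illustration of the query shape only; the helper name match_and_as_bool is hypothetical, and Elasticsearch performs this rewrite internally.

```python
import json


def match_and_as_bool(field, tokens):
    """Build a bool query that requires a term match for every token."""
    return {"bool": {"must": [{"term": {field: token}} for token in tokens]}}


# Tokens as described for the search analyzer applied to "tel:8177".
query = {"query": match_and_as_bool("field", ["tel:", "8177"])}
print(json.dumps(query, indent=2))
```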
com.inin.analytics

Genesys Cloud

Versions

Version
1.0.2
1.0.1
1.0.0