Aho-Corasick search

Aho-Corasick search with generic alphabets

License

License

GroupId

GroupId

eu.crydee
ArtifactId

ArtifactId

aho-corasick
Last Version

Last Version

2.0.1
Release Date

Release Date

Type

Type

jar
Description

Description

Aho-Corasick search
Aho-Corasick search with generic alphabets
Project URL

Project URL

https://github.com/m09/aho-corasick
Source Code Management

Source Code Management

https://github.com/m09/aho-corasick

Download aho-corasick

How to add to project

<!-- https://jarcasting.com/artifacts/eu.crydee/aho-corasick/ -->
<dependency>
    <groupId>eu.crydee</groupId>
    <artifactId>aho-corasick</artifactId>
    <version>2.0.1</version>
</dependency>
// https://jarcasting.com/artifacts/eu.crydee/aho-corasick/
implementation 'eu.crydee:aho-corasick:2.0.1'
// https://jarcasting.com/artifacts/eu.crydee/aho-corasick/
implementation ("eu.crydee:aho-corasick:2.0.1")
'eu.crydee:aho-corasick:jar:2.0.1'
<dependency org="eu.crydee" name="aho-corasick" rev="2.0.1">
  <artifact name="aho-corasick" type="jar" />
</dependency>
@Grapes(
@Grab(group='eu.crydee', module='aho-corasick', version='2.0.1')
)
libraryDependencies += "eu.crydee" % "aho-corasick" % "2.0.1"
[eu.crydee/aho-corasick "2.0.1"]

Dependencies

compile (2)

Group / Artifact Type Version
com.google.guava : guava jar 23.0
org.apache.commons : commons-lang3 jar 3.9

test (1)

Group / Artifact Type Version
junit : junit jar 4.12

Project Modules

There are no modules declared in this project.

Aho-Corasick Search

Build Status Deploy Status Maven Central version codecov

Generic implementation of the Aho-Corasick search algorithm. If you want a library to perform an Aho-Corasick search on strings only and not a generic alphabet, I'd recommend using Robert Bor's implementation instead of this library, it has many goodies for the string use-case.

A generic alphabet can be anything from pairs of token and part of speech in an NLP application to type of log entries in a security application…

See also the original paper by Aho and Corasick.

Requirements

To use this library, you need Java 8 and Maven 3.

Installation

Refer to the Maven Central page to find the installation instructions for your build tool or to download the jar directly.

Usage

To use the Aho-Corasick search algorithm, you first need to pass to the constructor of AhoCorasick the list of patterns you want to detect. A pattern is represented by an array of objects, so you need to pass a list of arrays to the constructor. For example, if our alphabet is the Character type and we want to be able to detect the patterns “hu”, “she” and “hers”, we can create the list of patterns as follows, using Java 8 functional idioms:

List<Character[]> patterns = Stream.of(
        "hu",
        "she",
        "hers")
        .map(s -> ArrayUtils.toObject(s.toCharArray()))
        .collect(Collectors.toList());

Once we have the list of patterns we can create the search object:

AhoCorasick<Character> ac = new AhoCorasick<>(patterns);

This builds the automaton required to efficiently search any text with all your patterns.

Now you have to prepare the texts you want to search on as arrays. For example, here, if we wanted to test the “hushers” text, we'd code (using Apache Commons Lang 3 to ease the arrays manipulation):

Character[] text = ArrayUtils.toObject("hushers".toCharArray());

You can now choose a search method to detect patterns in your text. The choice depends on your use-case:

  • searchPosToPattern will return a Guava multimap whose keys are positions in the text, and values are the patterns whose end matched at that position.

  • searchPatternToPos will return a Guava multimap whose keys are the patterns successfully matched and values are the end positions of the spans where they matched. If you do not want to manipulate either of those Guava multimaps, you can use the AhoCorasickHelpers class to produce a sorted set of Occurrences instead. An Occurrence is just a triple (begin, end, patternIndex). begin and end specify the span where patterns.get(patternIndex) was matched (end is inclusive). The occurrences are ordered by smallest begin, largest end and smallest patternIndex:

SortedSet<Occurrence> result
        = AhoCorasickHelpers.toOccurrenceList(
                ac.searchPosToPattern(text),
                patterns,
                false);
// result is {(0, 1, 0), (2, 4, 1), (3, 6, 2)}

Versions

Version
2.0.1
2.0.0
1.0.0