Collo

Collo is a lightweight library for String input partitioning and categorization.

License	The Apache License, Version 2.0
Categories	Ant Build Tools Data
GroupId	com.mantledillusion.data
ArtifactId	collo
Last Version	1.1.0
Release Date	May 17, 2020
Type	jar
Description	Collo is a lightweight library for String input partitioning and categorization.
Project URL	http://www.mantledillusion.com
Source Code Management	http://github.com/MantledIllusion/collo

Download collo

Filename	Size
collo-1.1.0.pom
collo-1.1.0.jar	13 KB
collo-1.1.0-sources.jar	8 KB
collo-1.1.0-javadoc.jar	56 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.mantledillusion.data/collo/ -->
<dependency>
    <groupId>com.mantledillusion.data</groupId>
    <artifactId>collo</artifactId>
    <version>1.1.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.mantledillusion.data/collo/
implementation 'com.mantledillusion.data:collo:1.1.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.mantledillusion.data/collo/
implementation ("com.mantledillusion.data:collo:1.1.0")

Apache Buildr

'com.mantledillusion.data:collo:jar:1.1.0'

Apache Ivy

<dependency org="com.mantledillusion.data" name="collo" rev="1.1.0">
  <artifact name="collo" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.mantledillusion.data', module='collo', version='1.1.0')
)

Scala SBT

libraryDependencies += "com.mantledillusion.data" % "collo" % "1.1.0"

Leiningen

[com.mantledillusion.data/collo "1.1.0"]

Dependencies

test (1)

Group / Artifact	Type	Version
junit : junit	jar	4.12

Project Modules

There are no modules declared in this project.

collo

Collo is a lightweight library for string input partitioning and grouping.

Collospermum hastatum, also known as perching lily, is a New Zealand plant that uses its arching, flax-like leaves to collect water from the air and channel it to its base for consumption.

1. Shaping String Input

Many forms of string data have a certain format; for example, names have the first name before the last name, and addresses usually start with the street, followed by the name of the city.

Such simple formats can easily be recognized using regular expressions. But once the format allows some variation, the complexity of the regular expression grows quickly:

If there are a lot of optional parts ("Harry Potter" is valid, as is "Harry James Potter")
If there are multiple overall formats for the data ("Harry Potter" is a possibility as well as "Potter, Harry")
If the same entity can be referred to completely differently ("Harry Potter" and "Undesirable No 1" both refer to the same person)
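To make the three variations above concrete, here is a minimal sketch that handles them with plain java.util.regex and no Collo at all. The class name NameRegexSketch, the group names, and the MIDDLENAME and ALIAS keys are illustrative assumptions, not part of the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch using only java.util.regex (not Collo).
class NameRegexSketch {

    private static final String WORD = "[A-Z][A-Za-z]*";

    // Every accepted shape has to be spelled out as its own alternative.
    // Java does not allow duplicate group names, hence "first2" and "last2".
    private static final Pattern NAME = Pattern.compile(
        "(?<first>" + WORD + ")(?: (?<middle>" + WORD + "))? (?<last>" + WORD + ")"
        + "|(?<last2>" + WORD + "), (?<first2>" + WORD + ")"
        + "|(?<alias>Undesirable No \\d+)");

    // Returns the recognized parts, or an empty map if nothing matched.
    static Map<String, String> parse(String input) {
        Map<String, String> parts = new LinkedHashMap<>();
        Matcher m = NAME.matcher(input);
        if (!m.matches()) {
            return parts;
        }
        if (m.group("alias") != null) {
            parts.put("ALIAS", m.group("alias"));
        } else if (m.group("last2") != null) {
            parts.put("FIRSTNAME", m.group("first2"));
            parts.put("LASTNAME", m.group("last2"));
        } else {
            parts.put("FIRSTNAME", m.group("first"));
            if (m.group("middle") != null) {
                parts.put("MIDDLENAME", m.group("middle"));
            }
            parts.put("LASTNAME", m.group("last"));
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("Harry Potter"));       // {FIRSTNAME=Harry, LASTNAME=Potter}
        System.out.println(parse("Potter, Harry"));      // {FIRSTNAME=Harry, LASTNAME=Potter}
        System.out.println(parse("Undesirable No 1"));   // {ALIAS=Undesirable No 1}
    }
}
```

Each new shape adds another alternative to the pattern and another branch to the extraction code, which is the growth in complexity described above.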

In addition, if the input cannot be pre-categorized at all (it is not known in advance that a string refers to a person; it might contain a name, an address, ...), the result is an endless if-else cascade of regular expression match checks.
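The cascade mentioned above might look like the following sketch, again written with plain java.util.regex rather than Collo. The class name CascadeSketch, the "UNKNOWN" result, and the comma-separated address format ("4 Privet Drive, Little Whinging") are assumptions made for illustration only.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a hand-written categorization cascade (not Collo).
class CascadeSketch {

    private static final String WORD = "[A-Z][A-Za-z]*";
    private static final String WORDS = WORD + "(?: " + WORD + ")*";

    private static final Pattern ALIAS = Pattern.compile("Undesirable No \\d+");
    private static final Pattern FULL_NAME = Pattern.compile(WORD + " " + WORD);
    // Assumed address format: optional house number, street, comma, city.
    private static final Pattern ADDRESS =
        Pattern.compile("(?:\\d+ )?" + WORDS + ", " + WORDS);

    // Checks one pattern after another; the first match decides the category.
    static String categorize(String input) {
        if (ALIAS.matcher(input).matches()) {
            return "FULLNAME";
        } else if (FULL_NAME.matcher(input).matches()) {
            return "FULLNAME";
        } else if (ADDRESS.matcher(input).matches()) {
            return "FULLADDRESS";
        }
        return "UNKNOWN";
    }

    public static void main(String[] args) {
        System.out.println(categorize("Harry Potter"));
        System.out.println(categorize("4 Privet Drive, Little Whinging"));
    }
}
```

Because the first matching branch wins, such a cascade reports a single category even when an input could plausibly fit several, and every new category means another pattern and another branch.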

Collo was made to counter all these problems at once.

1.1 The InputPart

The input part is a technical, regular-expression-based representation of a string segment.

Collo requires an Enum implementing the interface InputPart to represent the parts, so analysis results can be returned as PartEnum->Substring pairs.

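private enum InputParts implements InputPart {
    FIRSTNAME("[A-Z]{1}[A-Za-z]*"),
    LASTNAME("[A-Z]{1}[A-Za-z]*"),
    UNDESIRABLE_NUMBER("Undesirable No \\d+"),

    HOUSENR("\\d+"),
    STREET("[A-Z]{1}[A-Za-z]*( [A-Z]{1}[A-Za-z]*)*"),
    CITY("[A-Z]{1}[A-Za-z]*( [A-Z]{1}[A-Za-z]*)*");

    // Regular expression describing the substring this part matches.
    private final String regex;

    InputParts(String regex) {
        this.regex = regex;
    }

    // The method(s) required by InputPart are not shown in this excerpt.
}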

1.2 The InputGroup

The input group is a functional sequence of input parts that describe the same entity.

Collo requires an Enum to represent the groups, so analysis results can be returned as GroupEnum->Map<PartEnum, Substring> pairs.

private enum InputGroups {
    FULLNAME,
    FULLADDRESS;
}

2. The InputAnalyzer

The InputAnalyzer class offers a static builder which allows adding an arbitrary number of InputGroup instances the analyzer will be able to recognize.

The InputGroup class offers a static builder which allows adding an arbitrary number of InputPart instances that make up the group.

In combination, both builders can be used to set up an analyzer that can recognize completely different shapes of strings:

InputAnalyzer<InputGroups, InputParts> analyzer = InputAnalyzer
    .forGroup(InputGroups.FULLNAME, InputGroup.
        forPart(InputParts.UNDESIRABLE_NUMBER, PartOccurrenceMode.EXCLUSIVE).
        andPart(InputParts.FIRSTNAME).
        andPart(InputParts.LASTNAME).build())
    .andGroup(InputGroups.FULLADDRESS, InputGroup.
        forPart(InputParts.HOUSENR, PartOccurrenceMode.OPTIONAL).
        andPart(InputParts.STREET).
        andPart(InputParts.CITY).build())
    .build();

Setting a PartOccurrenceMode can help to cover cases when a part can occur but does not have to (OPTIONAL), or when such an optional part might be the only part when it occurs (EXCLUSIVE).
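One way to picture the two modes is to translate them into plain regular expression terms. The sketch below is an interpretation of the description above, not Collo's implementation: it reads EXCLUSIVE as "if this part occurs, it is the only part" and OPTIONAL as "this part may be missing from the sequence". The class OccurrenceModeSketch, its Mode enum (including REQUIRED), the Part class, and the single-space separator are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of the occurrence modes; not Collo's implementation.
class OccurrenceModeSketch {

    enum Mode { REQUIRED, OPTIONAL, EXCLUSIVE }

    static final class Part {
        final String name;
        final String regex;
        final Mode mode;

        Part(String name, String regex, Mode mode) {
            this.name = name;
            this.regex = regex;
            this.mode = mode;
        }
    }

    // Builds one pattern for a group: exclusive parts become stand-alone
    // alternatives, all other parts form a space-separated sequence.
    static Pattern compile(List<Part> parts) {
        List<String> alternatives = new ArrayList<>();
        StringBuilder sequence = new StringBuilder();
        for (Part part : parts) {
            // Each part is followed by a single space or the end of the input.
            String group = "(?<" + part.name + ">" + part.regex + ")(?: |$)";
            if (part.mode == Mode.EXCLUSIVE) {
                alternatives.add(group);
            } else if (part.mode == Mode.OPTIONAL) {
                sequence.append("(?:").append(group).append(")?");
            } else {
                sequence.append(group);
            }
        }
        if (sequence.length() > 0) {
            alternatives.add(sequence.toString());
        }
        return Pattern.compile(String.join("|", alternatives));
    }

    public static void main(String[] args) {
        String word = "[A-Z][A-Za-z]*";
        Pattern fullName = compile(Arrays.asList(
            new Part("alias", "Undesirable No \\d+", Mode.EXCLUSIVE),
            new Part("first", word, Mode.REQUIRED),
            new Part("last", word, Mode.REQUIRED)));
        System.out.println(fullName.matcher("Undesirable No 1").matches()); // true
        System.out.println(fullName.matcher("Harry Potter").matches());     // true
    }
}
```

Under this reading, "Undesirable No 1 Harry Potter" is rejected, because the exclusive part cannot be combined with the rest of the group.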

The analyzer has several methods to analyze input strings, but all of them are based on InputAnalyzer.analyze(String term), which splits a term into the possible groups it matches:

analyzer.analyze("Harry Potter") returns:

[
  {
    "key": "FULLNAME",
    "value": [
      {
        "key": "FIRSTNAME",
        "value": "Harry"
      },
      {
        "key": "LASTNAME",
        "value": "Potter"
      }
    ]
  }
]

Versions

Version
1.1.0 May 17, 2020
1.0.0 Mar 14, 2018
