Next Text

Experimental similarity measure for text documents.

License	License The Apache License, Version 2.0
GroupId	GroupId org.objecttrouve
ArtifactId	ArtifactId nexttext
Last Version	Last Version 0.3.0
Release Date	Release Date Oct 24, 2020
Type	Type pom
Description	Description Next Text Experimental similarity measure for text documents.
Project URL	Project URL https://github.com/objecttrouve/nexttext
Source Code Management	Source Code Management https://github.com/objecttrouve/nexttext

Download nexttext

Filename	Size
nexttext-0.3.0.pom
Browse

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.objecttrouve/nexttext/ -->
<dependency>
    <groupId>org.objecttrouve</groupId>
    <artifactId>nexttext</artifactId>
    <version>0.3.0</version>
    <type>pom</type>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.objecttrouve/nexttext/
implementation 'org.objecttrouve:nexttext:0.3.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.objecttrouve/nexttext/
implementation ("org.objecttrouve:nexttext:0.3.0")

Apache Buildr

'org.objecttrouve:nexttext:pom:0.3.0'

Apache Ivy

<dependency org="org.objecttrouve" name="nexttext" rev="0.3.0">
  <artifact name="nexttext" type="pom" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.objecttrouve', module='nexttext', version='0.3.0')
)

Scala SBT

libraryDependencies += "org.objecttrouve" % "nexttext" % "0.3.0"

Leiningen

[org.objecttrouve/nexttext "0.3.0"]

Dependencies

There are no dependencies for this project. It is a standalone project that does not depend on any other jars.

Project Modules

There are no modules declared in this project.

Next Text

Experimental library for a simple text similarity measure.

CRI Distance

The CRI distance is a measure for superficial text similarity. "CRI" stands for character repetition intervals. The idea behind it is to gauge a text by capturing the numbers of symbols it takes for each character to repeat.

The normalized CRI distance of two texts is a value between 0 and 1, where 0 means the texts are identical and 0 means they have nothing in common at all.

Theory

By Example

Given an alphabet consisting of the letters "a" and "b". Let's index them with 0 and 1. Let's define an "interval" simply as the difference between the character indexes in the string. The length of the first interval is the position of the respective character. Because it takes that many symbols until the character "repeats" for the first time.

So much for the interval.

Now, let's look at the string "abba". It takes 0 positions to increment for the "a" to appear for the first time and one for the "b". From the first "b" to the second "b", the position increments by one, so the interval is one. From the first "a" to the next, the position increments by 3.

Now, let's put this in a matrix where the first dimension is the alphabet indexes and the second are the frequencies (counts) of the respective repetition interval (CRI) lengths:

String "abba":

Position	0	1	2	3
Character	a	b	b	a

CRI counts of "abba":

⬐ Symbol / CRI length →	0	1	2	3
0 ("a")	1	0	0	1
1 ("b")	0	2	0	0

Let's compare this with the matrix we get for the string "baba":

String "baba":

0	1	2	3
b	a	b	a

CRI counts of "baba":

⬐ Symbol / CRI length →	0	1	2	3
0 ("a")	0	1	1	0
1 ("b")	1	0	1	0

Now, let's calculate a naïve distance between those two matrixes simply by adding up the absolute differences between the individual cell values.

The distance between "abba" and "baba" is 8.

Let's compare the strings "abbababa" and "babaabba" in the same way. The two strings have been concatenated and concatenated in reverse order. The distance is still 8.

The algorithm honors identical subsequences that can appear in different orders and in different locations.

Finally, it would be nice to have a normalized value which is easier to interpret than the raw distance. We divide the distance by the sum of the two text's lengths and we get a result between 0 and 1, 0 meaning identical and 1 meaning nothing in common at all.

⚠️ The metric can be tricky if the compared strings don't contain any repeating characters. It returns 1 when differences occur at the beginning.

More Formally

Let s_i be a unique symbol where 0 ≤ i ≤ n .

Let A be an alphabet of n symbols:
A = {s₀, ... s_n}

Let S be a sequence of length L of symbols from A.

Let j,k (0≤j<;k, k<L) be the positions of two occurrences of s_i in S where no other occurrence of s_i intervenes.

Then a CRI interval is just the range of positions from j to k, not including k:
CRI_i,j,k = [j..k)

For the first occurrence of s_i we define j=0 and set k to the position of the first occurrence of s_i (j≤k).

The length l_k,j of the CRI is k - j.

The CRI count c_i,l is the count of all repetition intervals of s_i of length l.

Let M₁^i⨯l be a matrix of CRI counts c of S₁ by symbol index i and CRI length l.

Let M₂^i⨯l be a matrix of CRI counts c of S₂ by symbol index i and CRI length l.

Let M_diff be the subtraction of the two matrixes: M_diff = M₁ - M₂

Then the absolute CRI distance D is just the sum over the absolute values of d of M_diff: D = ∑|d_i,l|

The normalized CRI distance is D/(L₁+L₂).

Disclaimer

I have no idea whether this algorithm has any academic or whatsoever discourse. Didn't find it among the top 10 on Google. (Or googled the wrong buzzwords.) And didn't bother to do any research on the field. (Actually, I was just thinking about a pragmatic solution for an imminent task.) Giving the algorithm is so trivial, I'm sure it already exists. Or otherwise it's just crap. I'm implementing it anyway as a nice playground to learn Kotlin.

So, forgive me and let me know if I (unintentionally) plagiarized.

🙈 And condone the amateurish Kotlin...

Usage

I have a weakness for the builder pattern, so the API looks like this:

        val nextText = NextText.Builder()
                .withMinCodePoint(0)
                .withMaxCodePoint(127)
                .build()

        val criDistance = nextText.criDistance("text 1", "text 2")

        println("The normalized CRI distance is $criDistance.")

Versions

Version
0.3.0 Oct 24, 2020

Next Text

License

GroupId

ArtifactId

Last Version

Release Date

Type

Description

Project URL

Source Code Management

Download nexttext

How to add to project

Dependencies

Project Modules

Next Text

CRI Distance

Theory

By Example

More Formally

Disclaimer

Usage

Versions