Next Text

Experimental similarity measure for text documents.

License

License

GroupId

GroupId

org.objecttrouve
ArtifactId

ArtifactId

nexttext
Last Version

Last Version

0.3.0
Release Date

Release Date

Type

Type

pom
Description

Description

Next Text
Experimental similarity measure for text documents.
Project URL

Project URL

https://github.com/objecttrouve/nexttext
Source Code Management

Source Code Management

https://github.com/objecttrouve/nexttext

Download nexttext

How to add to project

<!-- https://jarcasting.com/artifacts/org.objecttrouve/nexttext/ -->
<dependency>
    <groupId>org.objecttrouve</groupId>
    <artifactId>nexttext</artifactId>
    <version>0.3.0</version>
    <type>pom</type>
</dependency>
// https://jarcasting.com/artifacts/org.objecttrouve/nexttext/
implementation 'org.objecttrouve:nexttext:0.3.0'
// https://jarcasting.com/artifacts/org.objecttrouve/nexttext/
implementation ("org.objecttrouve:nexttext:0.3.0")
'org.objecttrouve:nexttext:pom:0.3.0'
<dependency org="org.objecttrouve" name="nexttext" rev="0.3.0">
  <artifact name="nexttext" type="pom" />
</dependency>
@Grapes(
@Grab(group='org.objecttrouve', module='nexttext', version='0.3.0')
)
libraryDependencies += "org.objecttrouve" % "nexttext" % "0.3.0"
[org.objecttrouve/nexttext "0.3.0"]

Dependencies

There are no dependencies for this project. It is a standalone project that does not depend on any other jars.

Project Modules

There are no modules declared in this project.

Travis build

Latest release in Maven Repository

Next Text

Experimental library for a simple text similarity measure.

CRI Distance

The CRI distance is a measure for superficial text similarity. "CRI" stands for character repetition intervals. The idea behind it is to gauge a text by capturing the numbers of symbols it takes for each character to repeat.

The normalized CRI distance of two texts is a value between 0 and 1, where 0 means the texts are identical and 0 means they have nothing in common at all.

Theory

By Example

Given an alphabet consisting of the letters "a" and "b". Let's index them with 0 and 1. Let's define an "interval" simply as the difference between the character indexes in the string. The length of the first interval is the position of the respective character. Because it takes that many symbols until the character "repeats" for the first time.

So much for the interval.

Now, let's look at the string "abba". It takes 0 positions to increment for the "a" to appear for the first time and one for the "b". From the first "b" to the second "b", the position increments by one, so the interval is one. From the first "a" to the next, the position increments by 3.

Now, let's put this in a matrix where the first dimension is the alphabet indexes and the second are the frequencies (counts) of the respective repetition interval (CRI) lengths:

String "abba":

Position 0 1 2 3
Character a b b a

CRI counts of "abba":

⬐ Symbol / CRI length → 0 1 2 3
0 ("a") 1 0 0 1
1 ("b") 0 2 0 0

Let's compare this with the matrix we get for the string "baba":

String "baba":

0 1 2 3
b a b a

CRI counts of "baba":

⬐ Symbol / CRI length → 0 1 2 3
0 ("a") 0 1 1 0
1 ("b") 1 0 1 0

Now, let's calculate a naïve distance between those two matrixes simply by adding up the absolute differences between the individual cell values.

The distance between "abba" and "baba" is 8.

Let's compare the strings "abbababa" and "babaabba" in the same way. The two strings have been concatenated and concatenated in reverse order. The distance is still 8.

The algorithm honors identical subsequences that can appear in different orders and in different locations.

Finally, it would be nice to have a normalized value which is easier to interpret than the raw distance. We divide the distance by the sum of the two text's lengths and we get a result between 0 and 1, 0 meaning identical and 1 meaning nothing in common at all.

⚠️ The metric can be tricky if the compared strings don't contain any repeating characters. It returns 1 when differences occur at the beginning.

More Formally

Let si be a unique symbol where 0 ≤ i ≤ n .

Let A be an alphabet of n symbols:
A = {s0, ... sn}

Let S be a sequence of length L of symbols from A.

Let j,k (0≤j<;k, k<L) be the positions of two occurrences of si in S where no other occurrence of si intervenes.

Then a CRI interval is just the range of positions from j to k, not including k:
CRIi,j,k = [j..k)

For the first occurrence of si we define j=0 and set k to the position of the first occurrence of si (j≤k).

The length lk,j of the CRI is k - j.

The CRI count ci,l is the count of all repetition intervals of si of length l.

Let M1i⨯l be a matrix of CRI counts c of S1 by symbol index i and CRI length l.

Let M2i⨯l be a matrix of CRI counts c of S2 by symbol index i and CRI length l.

Let Mdiff be the subtraction of the two matrixes: Mdiff = M1 - M2

Then the absolute CRI distance D is just the sum over the absolute values of d of Mdiff: D = ∑|di,l|

The normalized CRI distance is D/(L1+L2).

Disclaimer

I have no idea whether this algorithm has any academic or whatsoever discourse. Didn't find it among the top 10 on Google. (Or googled the wrong buzzwords.) And didn't bother to do any research on the field. (Actually, I was just thinking about a pragmatic solution for an imminent task.) Giving the algorithm is so trivial, I'm sure it already exists. Or otherwise it's just crap. I'm implementing it anyway as a nice playground to learn Kotlin.

So, forgive me and let me know if I (unintentionally) plagiarized.

🙈 And condone the amateurish Kotlin...

Usage

I have a weakness for the builder pattern, so the API looks like this:

        val nextText = NextText.Builder()
                .withMinCodePoint(0)
                .withMaxCodePoint(127)
                .build()

        val criDistance = nextText.criDistance("text 1", "text 2")

        println("The normalized CRI distance is $criDistance.")

Versions

Version
0.3.0