TextMiningUtil

Provides various utility classes for text mining.

License	Apache License version 2.0
GroupId	com.github.michael-rapp
ArtifactId	text-mining-util
Last Version	2.1.3
Release Date	Oct 23, 2019
Type	jar
Description	Provides various utility classes for text mining.
Project URL	https://github.com/michael-rapp/TextMiningUtil
Source Code Management	https://github.com/michael-rapp/TextMiningUtil.git

Download text-mining-util

Filename	Size
text-mining-util-2.1.3.pom
text-mining-util-2.1.3.jar	149 KB
text-mining-util-2.1.3-sources.jar	45 KB
text-mining-util-2.1.3-javadoc.jar	261 bytes

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.github.michael-rapp/text-mining-util/ -->
<dependency>
    <groupId>com.github.michael-rapp</groupId>
    <artifactId>text-mining-util</artifactId>
    <version>2.1.3</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.github.michael-rapp/text-mining-util/
implementation 'com.github.michael-rapp:text-mining-util:2.1.3'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.github.michael-rapp/text-mining-util/
implementation("com.github.michael-rapp:text-mining-util:2.1.3")

Apache Buildr

'com.github.michael-rapp:text-mining-util:jar:2.1.3'

Apache Ivy

<dependency org="com.github.michael-rapp" name="text-mining-util" rev="2.1.3">
  <artifact name="text-mining-util" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.github.michael-rapp', module='text-mining-util', version='2.1.3')
)

Scala SBT

libraryDependencies += "com.github.michael-rapp" % "text-mining-util" % "2.1.3"

Leiningen

[com.github.michael-rapp/text-mining-util "2.1.3"]

Dependencies

runtime (2)

Group / Artifact	Type	Version
org.jetbrains.kotlin : kotlin-stdlib	jar	1.3.50
com.github.michael-rapp : java-util	jar	[2.4.0,2.5.0)

test (1)

Group / Artifact	Type	Version
org.jetbrains.kotlin : kotlin-test-junit	jar	1.3.50

Project Modules

There are no modules declared in this project.

TextMiningUtil - README

"TextMiningUtil" is a Kotlin library that provides utility classes for text mining, such as text distance and similarity metrics. The library currently provides the following features:

Various metrics for measuring the similarity or dissimilarity of texts.
Tokenizers for splitting texts into shorter subtexts.

Note that versions of this library prior to 2.0.0 were implemented in Java 8.

License Agreement

This project is distributed under the Apache License version 2.0. The full text of the license is available at http://www.apache.org/licenses/LICENSE-2.0.txt.

Download

The latest release of this library can be downloaded as a zip archive from the download section of the project's GitHub page at https://github.com/michael-rapp/TextMiningUtil. The library's source code is also available as a Git repository, which can be cloned using the URL https://github.com/michael-rapp/TextMiningUtil.git.

Alternatively, the library can be added to your project as a Gradle dependency by adding the following to the build.gradle file:

dependencies {
    implementation 'com.github.michael-rapp:text-mining-util:2.1.3'
}

When using Maven, the following dependency can be added to the pom.xml:

<dependency>
    <groupId>com.github.michael-rapp</groupId>
    <artifactId>text-mining-util</artifactId>
    <version>2.1.3</version>
</dependency>

Features

The following sections give a brief overview of the features provided by the library.

Metrics

The library comes with various metrics for measuring the similarity or dissimilarity of texts. The following metrics are provided:

DiceCoefficient: Measures the similarity of texts by splitting them into n-grams and calculating the percentage of n-grams that occur in both texts.
HammingDistance: Measures the distance between texts by counting the positions at which the corresponding characters differ (can only be applied to texts of the same length). HammingLoss and HammingAccuracy measure the dissimilarity and similarity, respectively, as a percentage.
LevenshteinDistance: Measures the distance between texts by counting the single-character edits that are necessary to change one text into the other (can be applied to texts of different lengths). LevenshteinDissimilarity and LevenshteinSimilarity measure the dissimilarity and similarity, respectively, as a percentage.
OptimalStringAlignmentDistance: Measures the distance between texts by counting the single-character edits and transpositions of adjacent characters that are necessary to change one text into the other (only one edit is allowed per substring; can be applied to texts of different lengths). OptimalStringAlignmentDissimilarity and OptimalStringAlignmentSimilarity measure the dissimilarity and similarity, respectively, as a percentage.
DamerauLevenshteinDistance: Measures the distance between texts by counting the single-character edits and transpositions of adjacent characters that are necessary to change one text into the other (no restrictions; can be applied to texts of different lengths). DamerauLevenshteinDissimilarity and DamerauLevenshteinSimilarity measure the dissimilarity and similarity, respectively, as a percentage.
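
To make the definitions above more concrete, the following is a minimal sketch of the underlying algorithms in plain Kotlin. It does not use or reproduce the library's API: the function names (hammingDistance, editDistance, similarity, diceCoefficient), the allowTranspositions flag, the default n-gram length of 2, and the normalization by the length of the longer text are all assumptions made for illustration, and the library's own classes may behave differently.

```kotlin
import kotlin.math.max

/** Hamming distance: number of positions at which two equally long texts differ. */
fun hammingDistance(a: String, b: String): Int {
    require(a.length == b.length) { "Hamming distance needs texts of equal length" }
    return a.indices.count { a[it] != b[it] }
}

/**
 * Edit distance counting insertions, deletions and substitutions (Levenshtein).
 * With allowTranspositions = true, swapping two adjacent characters also costs 1,
 * which yields the restricted optimal string alignment variant.
 */
fun editDistance(a: String, b: String, allowTranspositions: Boolean = false): Int {
    // d[i][j] = distance between the first i chars of a and the first j chars of b
    val d = Array(a.length + 1) { IntArray(b.length + 1) }
    for (i in 0..a.length) d[i][0] = i
    for (j in 0..b.length) d[0][j] = j
    for (i in 1..a.length) {
        for (j in 1..b.length) {
            val cost = if (a[i - 1] == b[j - 1]) 0 else 1
            d[i][j] = minOf(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if (allowTranspositions && i > 1 && j > 1 &&
                a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1]
            ) {
                d[i][j] = minOf(d[i][j], d[i - 2][j - 2] + 1)
            }
        }
    }
    return d[a.length][b.length]
}

/** Similarity in [0, 1]; assumes normalization by the longer text's length. */
fun similarity(a: String, b: String, allowTranspositions: Boolean = false): Double {
    val longest = max(a.length, b.length)
    if (longest == 0) return 1.0 // two empty texts are treated as identical
    return 1.0 - editDistance(a, b, allowTranspositions).toDouble() / longest
}

/** Dice coefficient over the sets of n-grams of both texts (assumed default: bigrams). */
fun diceCoefficient(a: String, b: String, n: Int = 2): Double {
    val gramsA = a.windowed(n).toSet()
    val gramsB = b.windowed(n).toSet()
    if (gramsA.isEmpty() && gramsB.isEmpty()) return 1.0
    return 2.0 * (gramsA intersect gramsB).size / (gramsA.size + gramsB.size)
}
```

With these definitions, hammingDistance("karolin", "kathrin") is 3, editDistance("kitten", "sitting") is 3, and diceCoefficient("night", "nacht") is 0.25, since only the bigram "ht" is shared. For "ab" and "ba", editDistance gives 2 without transpositions and 1 with them. The sketch covers only the restricted variant: for "CA" and "ABC" it returns 3, whereas the unrestricted Damerau-Levenshtein distance is 2 (CA, AC, ABC), because that variant may edit a substring again after transposing it.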

Tokenizers

Tokenizers split texts into shorter subtexts. The library provides the following implementations:

SubstringTokenizer: Splits texts into all possible substrings.
FixedLengthTokenizer: Splits texts into substrings of a specific length.
RegexTokenizer: Splits texts based on regular expressions (e.g. at whitespace or at certain delimiters).
NGramTokenizer: Splits texts into n-grams of specific lengths.
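
The four splitting strategies listed above can be sketched with Kotlin standard library calls alone. This is an illustration of the ideas, not the library's tokenizer API: the function names are hypothetical, and details such as keeping a shorter final chunk, dropping duplicate substrings, skipping empty pieces, and using a single n-gram length are assumptions.

```kotlin
/** All n-grams (contiguous substrings of length n) of a text, in order. */
fun nGrams(text: String, n: Int): List<String> = text.windowed(n)

/** Consecutive chunks of the given length; the last chunk may be shorter. */
fun fixedLength(text: String, length: Int): List<String> = text.chunked(length)

/** Every distinct non-empty contiguous substring, in order of first occurrence. */
fun substrings(text: String): Set<String> {
    val result = LinkedHashSet<String>()
    for (start in text.indices) {
        for (end in start + 1..text.length) {
            result.add(text.substring(start, end))
        }
    }
    return result
}

/** Pieces between matches of a delimiter pattern, skipping empty pieces. */
fun splitByRegex(text: String, delimiter: Regex): List<String> =
    text.split(delimiter).filter { it.isNotEmpty() }
```

For example, nGrams("text", 2) yields [te, ex, xt], fixedLength("tokenizer", 4) yields [toke, nize, r], substrings("abc") yields [a, ab, abc, b, bc, c], and splitByRegex("a, b;c", Regex("[,;]\\s*")) yields [a, b, c].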

Contact information

For personal feedback or questions, feel free to contact me via the email address listed on my GitHub profile. If you have found a bug or want to request a feature, please report it using the bug tracker.

Versions

Version	Release Date
2.1.3	Oct 23, 2019
2.1.2	Feb 23, 2019
2.1.1	Jan 28, 2019
2.1.0	Jan 28, 2019
2.0.0	Aug 7, 2018
1.2.0	May 21, 2018
1.1.1	Apr 29, 2018
1.1.0	Nov 19, 2017
1.0.0	Nov 17, 2017
