Text Processing

Java implementation of text processing such as stemmers

License: MIT
Categories: Java Languages Data
GroupId: com.github.chen0040
ArtifactId: java-data-text
Last Version: 1.0.3
Release Date:
Type: jar
Description: Text Processing. Java implementation of text processing such as stemmers
Project URL: https://github.com/chen0040/java-data-text
Source Code Management: https://github.com/chen0040/java-data-text

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.github.chen0040/java-data-text/ -->
<dependency>
    <groupId>com.github.chen0040</groupId>
    <artifactId>java-data-text</artifactId>
    <version>1.0.3</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.github.chen0040/java-data-text/
implementation 'com.github.chen0040:java-data-text:1.0.3'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.github.chen0040/java-data-text/
implementation ("com.github.chen0040:java-data-text:1.0.3")

Buildr:

'com.github.chen0040:java-data-text:jar:1.0.3'

Ivy:

<dependency org="com.github.chen0040" name="java-data-text" rev="1.0.3">
  <artifact name="java-data-text" type="jar" />
</dependency>

Grape:

@Grapes(
    @Grab(group='com.github.chen0040', module='java-data-text', version='1.0.3')
)

SBT:

libraryDependencies += "com.github.chen0040" % "java-data-text" % "1.0.3"

Leiningen:

[com.github.chen0040/java-data-text "1.0.3"]

Dependencies

compile (3)

Group / Artifact                         Type  Version
org.slf4j : slf4j-api                    jar   1.7.20
org.slf4j : slf4j-log4j12                jar   1.7.20
org.apache.commons : commons-lang3       jar   3.4

provided (1)

Group / Artifact                         Type  Version
org.projectlombok : lombok               jar   1.16.6

test (10)

Group / Artifact                         Type  Version
org.testng : testng                      jar   6.9.10
org.hamcrest : hamcrest-core             jar   1.3
org.hamcrest : hamcrest-library          jar   1.3
org.assertj : assertj-core               jar   3.5.2
org.powermock : powermock-core           jar   1.6.5
org.powermock : powermock-api-mockito    jar   1.6.5
org.powermock : powermock-module-junit4  jar   1.6.5
org.powermock : powermock-module-testng  jar   1.6.5
org.mockito : mockito-core               jar   2.0.2-beta
org.mockito : mockito-all                jar   2.0.2-beta

Project Modules

There are no modules declared in this project.

java-data-text

This package provides Java implementations of various text preprocessing methods, such as tokenizers, vocabularies, text filters, and stemmers.

Install

Add the following dependency to your POM file:

<dependency>
  <groupId>com.github.chen0040</groupId>
  <artifactId>java-data-text</artifactId>
  <version>1.0.3</version>
</dependency>

Features

  • Porter Stemmer
  • Punctuation Filter
  • Stop Word Removal
    • XML Tag Removal
    • IP Address Removal
    • Number Removal
  • English Tokenizer

Usage

To use any text filter, create an instance of it and call its filter(...) method, which takes a list of words and returns the filtered list.
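
The calling pattern can be illustrated without the library. The sketch below is a minimal stand-in, not the library's code: FilterPatternSketch, SimpleFilter, and LowerCaseFilter are hypothetical names. The only thing taken from the usage examples is the shape of the call, a filter(...) method that takes a List<String> and returns a List<String>.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class FilterPatternSketch {
    // Hypothetical stand-in for a text filter: tokens in, tokens out.
    interface SimpleFilter {
        List<String> filter(List<String> words);
    }

    // Hypothetical example filter that lower-cases every token.
    static class LowerCaseFilter implements SimpleFilter {
        @Override
        public List<String> filter(List<String> words) {
            List<String> result = new ArrayList<>();
            for (String word : words) {
                result.add(word.toLowerCase(Locale.ROOT));
            }
            return result;
        }
    }

    public static void main(String[] args) {
        // Create the filter, then call filter(...) on a list of tokens.
        SimpleFilter filter = new LowerCaseFilter();
        List<String> after = filter.filter(Arrays.asList("Hello", "WORLD"));
        System.out.println(after); // prints [hello, world]
    }
}
```

Because every filter in this sketch shares one method signature, the output of one filter can be passed directly as the input of the next.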

Porter Stemmer

import java.util.Arrays;
import java.util.List;

import com.github.chen0040.data.text.TextFilter;
import com.github.chen0040.data.text.PorterStemmer;

TextFilter stemmer = new PorterStemmer();
List<String> words = Arrays.asList(
        "caresses",
        "ponies",
        "ties",
        "caress",
        "cats",
        "feed",
        "agreed",
        "disabled",
        "matting",
        "mating",
        "meeting",
        "milling",
        "messing",
        "meetings"
);

// Print each input word next to the entry at the same index in the result.
List<String> result = stemmer.filter(words);
for (int i = 0; i < words.size(); ++i) {
    System.out.printf("%s -> %s%n", words.get(i), result.get(i));
}
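
The first five words in the list above are the standard examples for step 1a of Porter's algorithm, which handles plural suffixes. As a rough illustration of what a stemmer does, the sketch below implements only that one step. PorterStep1aSketch and step1a are hypothetical names, and this is neither the full Porter stemmer nor the library's implementation.

```java
import java.util.Arrays;
import java.util.List;

class PorterStep1aSketch {
    // Step 1a of Porter's algorithm: the first matching suffix rule wins.
    static String step1a(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2); // sses -> ss
        }
        if (word.endsWith("ies")) {
            return word.substring(0, word.length() - 2); // ies -> i
        }
        if (word.endsWith("ss")) {
            return word; // ss -> ss (unchanged)
        }
        if (word.endsWith("s")) {
            return word.substring(0, word.length() - 1); // s -> (removed)
        }
        return word;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("caresses", "ponies", "ties", "caress", "cats");
        for (String word : words) {
            System.out.printf("%s -> %s%n", word, step1a(word));
        }
    }
}
```

Under these rules caresses becomes caress, ponies becomes poni, ties becomes ti, caress is unchanged, and cats becomes cat. The later steps of the algorithm, which handle suffixes such as -ed and -ing, are omitted here.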

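Stop Word Removal

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

import com.github.chen0040.data.text.TextFilter;
import com.github.chen0040.data.text.StopWordRemoval;
// FileUtils and BasicTokenizer must also be imported; their packages are not shown in this example.

StopWordRemoval filter = new StopWordRemoval();

// Keep numbers, IP addresses, and XML tags in the output.
filter.setRemoveNumbers(false);
filter.setRemoveIpAddress(false);
filter.setRemoveXmlTag(false);

// Read the whole document into one string; try-with-resources closes the reader.
String content;
InputStream inputStream = FileUtils.getResource("documents.txt");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {
    content = reader.lines().collect(Collectors.joining("\n"));
}

// Tokenize first, then filter the token list.
List<String> before = BasicTokenizer.doTokenize(content);
List<String> after = filter.filter(before);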
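
To make the idea concrete, here is a minimal, library-independent sketch of stop word removal with an optional number filter. Everything in it is an assumption made for illustration: the class StopWordSketch, the method removeStopWords, the tiny stop list, and the number pattern are hypothetical and do not reflect how StopWordRemoval is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class StopWordSketch {
    // Tiny illustrative stop list (an assumption; real lists are much longer).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "and", "to", "in"));

    // Drops stop words (case-insensitively) and, optionally, plain numbers.
    static List<String> removeStopWords(List<String> tokens, boolean removeNumbers) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (STOP_WORDS.contains(token.toLowerCase(Locale.ROOT))) {
                continue;
            }
            if (removeNumbers && token.matches("\\d+(\\.\\d+)?")) {
                continue;
            }
            kept.add(token);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("The", "cat", "is", "in", "room", "42");
        System.out.println(removeStopWords(tokens, false)); // prints [cat, room, 42]
        System.out.println(removeStopWords(tokens, true));  // prints [cat, room]
    }
}
```

The boolean parameter plays the same role as a switch such as setRemoveNumbers(...) in the example above: the stop list is always applied, while number removal is opt-in.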

Punctuation Filtering

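import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

import com.github.chen0040.data.text.TextFilter;
import com.github.chen0040.data.text.PunctuationFilter;
// FileUtils and BasicTokenizer must also be imported; their packages are not shown in this example.

TextFilter filter = new PunctuationFilter();

// Read the whole document into one string; try-with-resources closes the reader.
String content;
InputStream inputStream = FileUtils.getResource("documents.txt");
try (BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream))) {
    content = reader.lines().collect(Collectors.joining("\n"));
}

// Tokenize first, then filter the token list.
List<String> before = BasicTokenizer.doTokenize(content);
List<String> after = filter.filter(before);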
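
One plausible reading of punctuation filtering is to drop every token that consists only of punctuation characters. The sketch below shows that reading with a regular expression. It is an assumption for illustration, not the library's behavior: PunctuationFilter may instead strip punctuation from inside tokens, and the names PunctuationSketch and dropPunctuation are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PunctuationSketch {
    // Keeps a token unless it is made up entirely of ASCII punctuation.
    static List<String> dropPunctuation(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!token.matches("\\p{Punct}+")) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Hello", ",", "world", "!", "don't", "...");
        System.out.println(dropPunctuation(tokens)); // prints [Hello, world, don't]
    }
}
```

Note that don't survives because the pattern must match the whole token, so an apostrophe inside a word does not cause the word to be dropped.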

Versions

Version
1.0.3
1.0.2
1.0.1