# java-data-text

This package provides Java implementations of various text preprocessing methods, such as tokenizers, vocabulary building, text filters, stemmers, and so on.
# Install

Add the following dependency to your POM file:

```xml
<dependency>
  <groupId>com.github.chen0040</groupId>
  <artifactId>java-data-text</artifactId>
  <version>1.0.3</version>
</dependency>
```
# Features

- Porter Stemmer
- Punctuation Filter
- Stop Word Removal
- XML Tag Removal
- IP Address Removal
- Number Removal
- English Tokenizer
# Usage

To use any text filter, create a new instance of it and call its filter(...) method.
### Porter Stemmer

```java
import com.github.chen0040.data.text.PorterStemmer;
import com.github.chen0040.data.text.TextFilter;

import java.util.Arrays;
import java.util.List;

TextFilter stemmer = new PorterStemmer();

List<String> words = Arrays.asList(
        "caresses", "ponies", "ties", "caress", "cats", "feed", "agreed",
        "disabled", "matting", "mating", "meeting", "milling", "messing", "meetings");

List<String> result = stemmer.filter(words);
for (int i = 0; i < words.size(); ++i) {
    System.out.println(String.format("%s -> %s", words.get(i), result.get(i)));
}
```
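For intuition about what the stemmer does to plurals, the first rule group of the Porter algorithm rewrites `sses -> ss` and `ies -> i`, keeps a trailing `ss`, and otherwise drops a trailing `s`. A minimal self-contained sketch of just that step (illustrative only, not the library's implementation):

```java
public class PorterStep1a {
    // Apply only Porter step 1a: sses -> ss, ies -> i, ss -> ss, s -> (drop)
    public static String apply(String w) {
        if (w.endsWith("sses")) return w.substring(0, w.length() - 2);
        if (w.endsWith("ies")) return w.substring(0, w.length() - 2);
        if (w.endsWith("ss")) return w;
        if (w.endsWith("s")) return w.substring(0, w.length() - 1);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(apply("caresses")); // caress
        System.out.println(apply("ponies"));   // poni
        System.out.println(apply("caress"));   // caress
        System.out.println(apply("cats"));     // cat
    }
}
```

The full algorithm applies several more rule groups (past-tense endings, derivational suffixes), which is why `agreed` and `meetings` in the example above are also reduced.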
### Stop Word Removal

```java
import com.github.chen0040.data.text.BasicTokenizer;
import com.github.chen0040.data.text.StopWordRemoval;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

StopWordRemoval filter = new StopWordRemoval();
filter.setRemoveNumbers(false);
filter.setRemoveIpAddress(false);
filter.setRemoveXmlTag(false);

// FileUtils.getResource(...) is a project helper that loads a classpath resource
InputStream inputStream = FileUtils.getResource("documents.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
String content = reader.lines().collect(Collectors.joining("\n"));
reader.close();

List<String> before = BasicTokenizer.doTokenize(content);
List<String> after = filter.filter(before);
```
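Conceptually, stop-word removal is a set-membership filter over tokens. A minimal sketch of the idea using a hand-rolled stop list (the list here is hypothetical; the library ships its own, much larger list):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordSketch {
    // Tiny illustrative stop list, not the library's actual word list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "of", "and"));

    public static List<String> removeStopWords(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !STOP_WORDS.contains(t.toLowerCase()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "cat", "is", "on", "a", "mat");
        System.out.println(removeStopWords(tokens)); // [cat, on, mat]
    }
}
```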
### Punctuation Filtering

```java
import com.github.chen0040.data.text.BasicTokenizer;
import com.github.chen0040.data.text.PunctuationFilter;
import com.github.chen0040.data.text.TextFilter;

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;
import java.util.stream.Collectors;

TextFilter filter = new PunctuationFilter();

// FileUtils.getResource(...) is a project helper that loads a classpath resource
InputStream inputStream = FileUtils.getResource("documents.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
String content = reader.lines().collect(Collectors.joining("\n"));
reader.close();

List<String> before = BasicTokenizer.doTokenize(content);
List<String> after = filter.filter(before);
```
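The remaining filters in the feature list (XML tag, IP address, and number removal) can be thought of as pattern-based token filters. A regex sketch of what such filters conceptually match (the patterns below are illustrative assumptions; the library's actual rules may differ):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PatternFilterSketch {
    // Illustrative patterns only, not the library's actual implementation.
    private static final String XML_TAG = "</?[A-Za-z][^>]*>";        // e.g. <p>, </div>
    private static final String IP_ADDRESS = "\\d{1,3}(\\.\\d{1,3}){3}"; // e.g. 192.168.0.1
    private static final String NUMBER = "[-+]?\\d+(\\.\\d+)?";       // e.g. 42, -3.14

    public static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(t -> !t.matches(XML_TAG))
                .filter(t -> !t.matches(IP_ADDRESS))
                .filter(t -> !t.matches(NUMBER))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("<p>", "hello", "192.168.0.1", "42", "world");
        System.out.println(filter(tokens)); // [hello, world]
    }
}
```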