LDA in Java

Java implementation of latent Dirichlet allocation

License: MIT
Categories: Java Languages
GroupId: com.github.chen0040
ArtifactId: java-lda
Last Version: 1.0.4
Type: jar
Description: Java implementation of latent Dirichlet allocation
Project URL: https://github.com/chen0040/java-lda
Source Code Management: https://github.com/chen0040/java-lda

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.github.chen0040/java-lda/ -->
<dependency>
    <groupId>com.github.chen0040</groupId>
    <artifactId>java-lda</artifactId>
    <version>1.0.4</version>
</dependency>

Gradle (Groovy):

// https://jarcasting.com/artifacts/com.github.chen0040/java-lda/
implementation 'com.github.chen0040:java-lda:1.0.4'

Gradle (Kotlin):

// https://jarcasting.com/artifacts/com.github.chen0040/java-lda/
implementation ("com.github.chen0040:java-lda:1.0.4")

Buildr:

'com.github.chen0040:java-lda:jar:1.0.4'

Ivy:

<dependency org="com.github.chen0040" name="java-lda" rev="1.0.4">
  <artifact name="java-lda" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='com.github.chen0040', module='java-lda', version='1.0.4')
)

SBT:

libraryDependencies += "com.github.chen0040" % "java-lda" % "1.0.4"

Leiningen:

[com.github.chen0040/java-lda "1.0.4"]

Dependencies

compile (4)

Group / Artifact                         Type  Version
org.slf4j : slf4j-api                    jar   1.7.20
org.slf4j : slf4j-log4j12                jar   1.7.20
com.github.chen0040 : java-data-text     jar   1.0.3
com.github.chen0040 : java-data-frame    jar   1.0.2

provided (1)

Group / Artifact                         Type  Version
org.projectlombok : lombok               jar   1.16.6

test (10)

Group / Artifact                         Type  Version
org.testng : testng                      jar   6.9.10
org.hamcrest : hamcrest-core             jar   1.3
org.hamcrest : hamcrest-library          jar   1.3
org.assertj : assertj-core               jar   3.5.2
org.powermock : powermock-core           jar   1.6.5
org.powermock : powermock-api-mockito    jar   1.6.5
org.powermock : powermock-module-junit4  jar   1.6.5
org.powermock : powermock-module-testng  jar   1.6.5
org.mockito : mockito-core               jar   2.0.2-beta
org.mockito : mockito-all                jar   2.0.2-beta

Project Modules

There are no modules declared in this project.

java-lda

This package provides a Java implementation of latent Dirichlet allocation (LDA) for topic modelling.

Install

<dependency>
  <groupId>com.github.chen0040</groupId>
  <artifactId>java-lda</artifactId>
  <version>1.0.4</version>
</dependency>

Usage

The sample code below creates an LDA model, fits it to the texts stored in the "docs" list, and extracts 20 topics from them:

import com.github.chen0040.data.utils.TupleTwo;
import com.github.chen0040.lda.Doc;       // assumed to live in the same package as Lda
import com.github.chen0040.lda.Lda;
import com.github.chen0040.lda.LdaResult; // assumed to live in the same package as Lda

import java.util.Arrays;
import java.util.List;

// Placeholder input: supply your own texts, one string per document.
List<String> docs = Arrays.asList("[paragraph1]", "[paragraph2]", ..., "[paragraphN]");

Lda method = new Lda();
method.setTopicCount(20);
method.setMaxVocabularySize(20000);
// Optional preprocessing settings:
//method.setStemmerEnabled(true);
//method.setRemoveNumbers(true);
//method.setRemoveXmlTag(true);
//method.addStopWords(Arrays.asList("we", "they"));

LdaResult result = method.fit(docs);

int topicCount = result.topicCount();
System.out.println("Topic Count: " + topicCount);

for(int topicIndex = 0; topicIndex < topicCount; ++topicIndex){
 String topicSummary = result.topicSummary(topicIndex);
 // Top 10 keywords and top 5 documents for this topic
 List<TupleTwo<String, Integer>> topKeyWords = result.topKeyWords(topicIndex, 10);
 List<TupleTwo<Doc, Double>> topDocuments = result.topDocuments(topicIndex, 5);

 System.out.println("Topic #" + (topicIndex+1) + ": " + topicSummary);

 for(TupleTwo<String, Integer> entry : topKeyWords){
    String keyword = entry._1();
    int score = entry._2();
    System.out.println("Keyword: " + keyword + "(" + score + ")");
 }

 for(TupleTwo<Doc, Double> entry : topDocuments){
    double score = entry._2();
    int docIndex = entry._1().getDocIndex();
    String docContent = entry._1().getContent();
    System.out.println("Doc (" + docIndex + ", " + score + "): " + docContent);
 }
}
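
To make the shape of the keyword output above more concrete, here is a minimal, self-contained sketch of selecting the top k (keyword, score) pairs from a map of word counts. It is not the library's implementation: the class name TopKeywordsSketch, the method topK, and the use of raw counts as scores are all assumptions made for illustration, and it uses Map.Entry from the JDK in place of TupleTwo.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; java-lda may rank keywords differently.
public class TopKeywordsSketch {

    // Returns up to k (word, count) pairs, highest count first.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // Clamp k so that asking for more entries than exist is safe.
        int limit = Math.min(Math.max(k, 0), entries.size());
        return new ArrayList<>(entries.subList(0, limit));
    }

    public static void main(String[] args) {
        // Made-up counts for a single topic.
        Map<String, Integer> counts = new HashMap<>();
        counts.put("topic", 9);
        counts.put("model", 5);
        counts.put("word", 5);

        for (Map.Entry<String, Integer> entry : topK(counts, 2)) {
            System.out.println("Keyword: " + entry.getKey() + "(" + entry.getValue() + ")");
        }
    }
}
```

Clamping k mirrors a common convention for "top N" queries: requesting 10 keywords from a topic with fewer distinct words simply returns all of them.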

The sample code below takes the "result" variable from the code above and lists the top 3 most relevant topics for each document (each document is one of the items in the "docs" list). The "logger" variable is assumed to be an SLF4J Logger.

for(Doc doc : result.documents()){
 logger.info("Doc: {}", doc.getContent());
 List<TupleTwo<Integer, Double>> topTopics = doc.topTopics(3);

 logger.info("Top Topics: {} (score: {}), {} (score: {}), {} (score: {})",
         topTopics.get(0)._1(), topTopics.get(0)._2(),
         topTopics.get(1)._1(), topTopics.get(1)._2(),
         topTopics.get(2)._1(), topTopics.get(2)._2());
}
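
The snippet above reads topTopics.get(0) through topTopics.get(2) directly, which would throw if a list ever held fewer than three entries. Whether doc.topTopics(3) can return a shorter list is not stated here, so the following is only a defensive sketch: it formats however many (topic index, score) pairs it is given. The class TopTopicsFormatSketch and its format method are hypothetical, and Map.Entry stands in for TupleTwo so that the example runs with the JDK alone.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; not part of java-lda.
public class TopTopicsFormatSketch {

    // Joins each (topic index, score) pair as "index (score: value)", comma separated.
    public static String format(List<Map.Entry<Integer, Double>> topTopics) {
        StringJoiner joiner = new StringJoiner(", ");
        for (Map.Entry<Integer, Double> entry : topTopics) {
            joiner.add(entry.getKey() + " (score: " + entry.getValue() + ")");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Made-up scores for a document that matched only two topics.
        List<Map.Entry<Integer, Double>> topTopics = new ArrayList<>();
        topTopics.add(new AbstractMap.SimpleEntry<>(4, 0.5));
        topTopics.add(new AbstractMap.SimpleEntry<>(1, 0.25));

        System.out.println("Top Topics: " + format(topTopics));
    }
}
```

With the real API, the same loop could be written over the List<TupleTwo<Integer, Double>> from the example above, using _1() and _2() in place of getKey() and getValue().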

Versions

Version
1.0.4
1.0.3
1.0.2
1.0.1