LDA in Java

Java implementation of latent dirichlet allocation

License	MIT
Categories	Java Languages
GroupId	com.github.chen0040
ArtifactId	java-lda
Last Version	1.0.4
Release Date	May 18, 2017
Type	jar
Description	LDA in Java: Java implementation of latent Dirichlet allocation
Project URL	https://github.com/chen0040/java-lda
Source Code Management	https://github.com/chen0040/java-lda

Download java-lda

Filename	Size
java-lda-1.0.4.pom
java-lda-1.0.4.jar	21 KB
java-lda-1.0.4-sources.jar	8 KB
java-lda-1.0.4-javadoc.jar	49 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.github.chen0040/java-lda/ -->
<dependency>
    <groupId>com.github.chen0040</groupId>
    <artifactId>java-lda</artifactId>
    <version>1.0.4</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.github.chen0040/java-lda/
implementation 'com.github.chen0040:java-lda:1.0.4'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.github.chen0040/java-lda/
implementation ("com.github.chen0040:java-lda:1.0.4")

Apache Buildr

'com.github.chen0040:java-lda:jar:1.0.4'

Apache Ivy

<dependency org="com.github.chen0040" name="java-lda" rev="1.0.4">
  <artifact name="java-lda" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.github.chen0040', module='java-lda', version='1.0.4')
)

Scala SBT

libraryDependencies += "com.github.chen0040" % "java-lda" % "1.0.4"

Leiningen

[com.github.chen0040/java-lda "1.0.4"]

Dependencies

compile (4)

Group / Artifact	Type	Version
org.slf4j : slf4j-api	jar	1.7.20
org.slf4j : slf4j-log4j12	jar	1.7.20
com.github.chen0040 : java-data-text	jar	1.0.3
com.github.chen0040 : java-data-frame	jar	1.0.2

provided (1)

Group / Artifact	Type	Version
org.projectlombok : lombok	jar	1.16.6

test (10)

Group / Artifact	Type	Version
org.testng : testng	jar	6.9.10
org.hamcrest : hamcrest-core	jar	1.3
org.hamcrest : hamcrest-library	jar	1.3
org.assertj : assertj-core	jar	3.5.2
org.powermock : powermock-core	jar	1.6.5
org.powermock : powermock-api-mockito	jar	1.6.5
org.powermock : powermock-module-junit4	jar	1.6.5
org.powermock : powermock-module-testng	jar	1.6.5
org.mockito : mockito-core	jar	2.0.2-beta
org.mockito : mockito-all	jar	2.0.2-beta

Project Modules

There are no modules declared in this project.

java-lda

This package provides a Java implementation of latent Dirichlet allocation (LDA) for topic modelling.

Install

<dependency>
  <groupId>com.github.chen0040</groupId>
  <artifactId>java-lda</artifactId>
  <version>1.0.4</version>
</dependency>

Usage

The sample code below creates an LDA model that takes the texts stored in the "docs" list and extracts 20 topics from them:

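import com.github.chen0040.data.utils.TupleTwo;
import com.github.chen0040.lda.Lda;
// Doc and LdaResult are assumed to live in the same package as Lda
import com.github.chen0040.lda.Doc;
import com.github.chen0040.lda.LdaResult;

import java.util.Arrays;
import java.util.List;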

List<String> docs = Arrays.asList("[paragraph1]", "[paragraph2]", ..., "[paragraphN]");

Lda method = new Lda();
method.setTopicCount(20);
method.setMaxVocabularySize(20000);
//method.setStemmerEnabled(true);
//method.setRemoveNumbers(true);
//method.setRemoveXmlTag(true);
//method.addStopWords(Arrays.asList("we", "they"));

LdaResult result = method.fit(docs);

System.out.println("Topic Count: "+result.topicCount());

int topicCount = result.topicCount();

for(int topicIndex = 0; topicIndex < topicCount; ++topicIndex){
    String topicSummary = result.topicSummary(topicIndex);
    // top 10 keywords and top 5 documents for this topic
    List<TupleTwo<String, Integer>> topKeyWords = result.topKeyWords(topicIndex, 10);
    List<TupleTwo<Doc, Double>> topDocs = result.topDocuments(topicIndex, 5);

    System.out.println("Topic #" + (topicIndex+1) + ": " + topicSummary);

    for(TupleTwo<String, Integer> entry : topKeyWords){
        String keyword = entry._1();
        int score = entry._2();
        System.out.println("Keyword: " + keyword + "(" + score + ")");
    }

    for(TupleTwo<Doc, Double> entry : topDocs){
        double score = entry._2();
        int docIndex = entry._1().getDocIndex();
        String docContent = entry._1().getContent();
        System.out.println("Doc (" + docIndex + ", " + score + "): " + docContent);
    }
}
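
The "docs" list in the sample above is filled with placeholders. As a minimal sketch, assuming the raw text is available as a single string whose paragraphs are separated by blank lines, such a list could be built with the standard library alone. The names ParagraphSplitter and splitParagraphs are hypothetical, used only for illustration, and are not part of java-lda:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of java-lda.
final class ParagraphSplitter {

    // Splits raw text into paragraphs on one or more blank lines,
    // trimming each paragraph and dropping empty ones.
    static List<String> splitParagraphs(String text) {
        List<String> docs = new ArrayList<>();
        for (String block : text.split("\\R\\s*\\R")) {
            String paragraph = block.trim();
            if (!paragraph.isEmpty()) {
                docs.add(paragraph);
            }
        }
        return docs;
    }

    public static void main(String[] args) {
        String text = "First paragraph about cats.\n\n"
                + "Second paragraph about dogs.\n\n\n"
                + "Third one.";
        List<String> docs = splitParagraphs(text);
        System.out.println(docs.size() + " documents");
    }
}
```

The returned list has the same type as "docs" in the sample above, so it can be passed to method.fit(docs) unchanged.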

The sample code below takes the "result" variable from the code above and lists the top 3 most relevant topics for each document (each document is one of the items in the "docs" list).

// "logger" is assumed to be an SLF4J Logger declared elsewhere
for(Doc doc : result.documents()){
    logger.info("Doc: {}", doc.getContent());
    List<TupleTwo<Integer, Double>> topTopics = doc.topTopics(3);

    // assumes the list holds at least three (topic index, score) pairs
    logger.info("Top Topics: {} (score: {}), {} (score: {}), {} (score: {})",
            topTopics.get(0)._1(), topTopics.get(0)._2(),
            topTopics.get(1)._1(), topTopics.get(1)._2(),
            topTopics.get(2)._1(), topTopics.get(2)._2());
}
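
The logging call above indexes the first three entries of the list directly, which would fail if the list held fewer than three entries (for example, if the model were configured with fewer than three topics). A minimal, self-contained sketch of a more defensive way to format such a list follows. It uses java.util.Map.Entry as a stand-in for TupleTwo so that it runs without the library; TopTopicsFormatter and format are hypothetical names, not part of java-lda:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of java-lda.
final class TopTopicsFormatter {

    // Formats at most `limit` (topicIndex, score) pairs and never
    // reads past the end of the list.
    static String format(List<Map.Entry<Integer, Double>> topTopics, int limit) {
        StringJoiner joiner = new StringJoiner(", ");
        int n = Math.min(limit, topTopics.size());
        for (int i = 0; i < n; i++) {
            Map.Entry<Integer, Double> entry = topTopics.get(i);
            joiner.add(entry.getKey() + " (score: " + entry.getValue() + ")");
        }
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Only two entries, although three are requested below.
        List<Map.Entry<Integer, Double>> topTopics = new ArrayList<>();
        topTopics.add(new SimpleEntry<>(4, 0.5));
        topTopics.add(new SimpleEntry<>(1, 0.25));
        System.out.println("Top Topics: " + format(topTopics, 3));
    }
}
```

With the real library, the same loop would read entry._1() and entry._2() from each TupleTwo instead of getKey() and getValue().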

Versions

Version	Release Date
1.0.4	May 18, 2017
1.0.3	May 16, 2017
1.0.2	May 15, 2017
1.0.1	May 15, 2017
