Decision Tree and Ensemble Methods

Decision Tree and Ensemble methods implemented in Java

License: MIT

Categories: Java

GroupId: com.github.chen0040

ArtifactId: java-decision-forest

Last Version: 1.0.3

Release Date:

Type: jar

Description: Decision Tree and Ensemble Methods. Decision Tree and Ensemble methods implemented in Java.

Project URL: https://github.com/chen0040/java-decision-forest

Source Code Management: https://github.com/chen0040/java-decision-forest

Download java-decision-forest

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.github.chen0040/java-decision-forest/ -->
<dependency>
    <groupId>com.github.chen0040</groupId>
    <artifactId>java-decision-forest</artifactId>
    <version>1.0.3</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.github.chen0040/java-decision-forest/
implementation 'com.github.chen0040:java-decision-forest:1.0.3'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.github.chen0040/java-decision-forest/
implementation("com.github.chen0040:java-decision-forest:1.0.3")

Buildr:

'com.github.chen0040:java-decision-forest:jar:1.0.3'

Ivy:

<dependency org="com.github.chen0040" name="java-decision-forest" rev="1.0.3">
  <artifact name="java-decision-forest" type="jar" />
</dependency>

Groovy Grape:

@Grapes(
  @Grab(group='com.github.chen0040', module='java-decision-forest', version='1.0.3')
)

SBT:

libraryDependencies += "com.github.chen0040" % "java-decision-forest" % "1.0.3"

Leiningen:

[com.github.chen0040/java-decision-forest "1.0.3"]

Dependencies

compile (3)

Group : Artifact                           Type   Version
org.slf4j : slf4j-api                      jar    1.7.20
org.slf4j : slf4j-log4j12                  jar    1.7.20
com.github.chen0040 : java-data-frame      jar    1.0.9

provided (1)

Group : Artifact                           Type   Version
org.projectlombok : lombok                 jar    1.16.6

test (10)

Group : Artifact                           Type   Version
org.testng : testng                        jar    6.9.10
org.hamcrest : hamcrest-core               jar    1.3
org.hamcrest : hamcrest-library            jar    1.3
org.assertj : assertj-core                 jar    3.5.2
org.powermock : powermock-core             jar    1.6.5
org.powermock : powermock-api-mockito      jar    1.6.5
org.powermock : powermock-module-junit4    jar    1.6.5
org.powermock : powermock-module-testng    jar    1.6.5
org.mockito : mockito-core                 jar    2.0.2-beta
org.mockito : mockito-all                  jar    2.0.2-beta

Project Modules

There are no modules declared in this project.

java-decision-forest

This package implements decision trees and ensemble methods.


Features

  • ID3 Decision Tree with both numerical and categorical inputs
  • Isolation Forest for Anomaly Detection
  • Tree Ensembles such as Bagging and AdaBoost

Install

Add the following dependency to your POM file:

<dependency>
  <groupId>com.github.chen0040</groupId>
  <artifactId>java-decision-forest</artifactId>
  <version>1.0.3</version>
</dependency>

Usage

Classification

To create and train an ID3 classifier:

ID3 classifier = new ID3();
classifier.fit(trainingData);

The "trainingData" is a data frame which holds data rows with labeled output (Please refers to this link to find out how to store data into a data frame)

To predict using the trained ID3 classifier:

String predicted_label = classifier.transform(dataRow);

Details on how to use this can be found in the unit tests. Below is a complete example that classifies the libsvm-formatted heart-scale data:

import java.io.FileInputStream;
import java.io.InputStream;
// DataFrame, DataQuery and DataRow are provided by the java-data-frame dependency

InputStream inputStream = new FileInputStream("heart_scale");
DataFrame dataFrame = DataQuery.libsvm().from(inputStream).build();

// the dataFrame obtained thus far has a numeric output instead of a labeled
// categorical output, so the code below performs the categorical conversion
dataFrame.unlock();
for(int i = 0; i < dataFrame.rowCount(); ++i){
   DataRow row = dataFrame.row(i);
   row.setCategoricalTargetCell("category-label", "" + row.target());
}
dataFrame.lock();

ID3 classifier = new ID3();
classifier.fit(dataFrame);

for(int i = 0; i < dataFrame.rowCount(); ++i){
  DataRow tuple = dataFrame.row(i);
  String predicted_label = classifier.transform(tuple);
  System.out.println("predicted: "+predicted_label+"\tactual: "+tuple.categoricalTarget());
}

Classification via Ensemble (Bagging)

To create and train a Bagging ensemble classifier:

Bagging classifier = new Bagging();
classifier.fit(trainingData);

The "trainingData" is a data frame which holds data rows with labeled output (Please refers to this link to find out how to store data into a data frame)

To predict using the trained Bagging classifier:

String predicted_label = classifier.transform(dataRow);

Details on how to use this can be found in the unit tests. Below is a complete example that classifies the libsvm-formatted heart-scale data:

InputStream inputStream = new FileInputStream("heart_scale");
DataFrame dataFrame = DataQuery.libsvm().from(inputStream).build();

// the dataFrame obtained thus far has a numeric output instead of a labeled
// categorical output, so the code below performs the categorical conversion
dataFrame.unlock();
for(int i = 0; i < dataFrame.rowCount(); ++i){
   DataRow row = dataFrame.row(i);
   row.setCategoricalTargetCell("category-label", "" + row.target());
}
dataFrame.lock();

Bagging classifier = new Bagging();
classifier.fit(dataFrame);

for(int i = 0; i < dataFrame.rowCount(); ++i){
  DataRow tuple = dataFrame.row(i);
  String predicted_label = classifier.transform(tuple);
  System.out.println("predicted: "+predicted_label+"\tactual: "+tuple.categoricalTarget());
}

Classification via Ensemble (AdaBoost)
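
The example below trains a MultiClassAdaBoost classifier on 90% of the iris data, loaded from CSV, and evaluates it on the held-out 10%: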

InputStream irisStream = new FileInputStream("iris.data");
DataFrame irisData = DataQuery.csv(",")
      .from(irisStream)
      .selectColumn(0).asNumeric().asInput("Sepal Length")
      .selectColumn(1).asNumeric().asInput("Sepal Width")
      .selectColumn(2).asNumeric().asInput("Petal Length")
      .selectColumn(3).asNumeric().asInput("Petal Width")
      .selectColumn(4).asCategory().asOutput("Iris Type")
      .build();

TupleTwo<DataFrame, DataFrame> parts = irisData.shuffle().split(0.9);

DataFrame trainingData = parts._1();
DataFrame crossValidationData = parts._2();

System.out.println(crossValidationData.head(10));

MultiClassAdaBoost multiClassClassifier = new MultiClassAdaBoost();
multiClassClassifier.fit(trainingData);

ClassifierEvaluator evaluator = new ClassifierEvaluator();

for(int i = 0; i < crossValidationData.rowCount(); ++i) {
   String predicted = multiClassClassifier.classify(crossValidationData.row(i));
   String actual = crossValidationData.row(i).categoricalTarget();
   System.out.println("predicted: " + predicted + "\tactual: " + actual);
   evaluator.evaluate(actual, predicted);
}

evaluator.report();

Classification via Ensemble (SAMME)
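
SAMME is a multi-class generalization of AdaBoost; the workflow below is identical to the previous example except that the SAMME class is used in place of MultiClassAdaBoost: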

InputStream irisStream = new FileInputStream("iris.data");
DataFrame irisData = DataQuery.csv(",")
      .from(irisStream)
      .selectColumn(0).asNumeric().asInput("Sepal Length")
      .selectColumn(1).asNumeric().asInput("Sepal Width")
      .selectColumn(2).asNumeric().asInput("Petal Length")
      .selectColumn(3).asNumeric().asInput("Petal Width")
      .selectColumn(4).asCategory().asOutput("Iris Type")
      .build();

TupleTwo<DataFrame, DataFrame> parts = irisData.shuffle().split(0.9);

DataFrame trainingData = parts._1();
DataFrame crossValidationData = parts._2();

System.out.println(crossValidationData.head(10));

SAMME multiClassClassifier = new SAMME();
multiClassClassifier.fit(trainingData);

ClassifierEvaluator evaluator = new ClassifierEvaluator();

for(int i = 0; i < crossValidationData.rowCount(); ++i) {
   String predicted = multiClassClassifier.classify(crossValidationData.row(i));
   String actual = crossValidationData.row(i).categoricalTarget();
   System.out.println("predicted: " + predicted + "\tactual: " + actual);
   evaluator.evaluate(actual, predicted);
}

evaluator.report();


Anomaly Detection

The demo problem is the same anomaly-detection setup as the scikit-learn one-class example: normal points are drawn from two small Gaussian clusters, while anomalies are scattered uniformly across the plane.

Below is the sample code which illustrates how to use Isolation Forest to detect outliers in the above problem:

DataQuery.DataFrameQueryBuilder schema = DataQuery.blank()
      .newInput("c1")
      .newInput("c2")
      .newOutput("anomaly")
      .end();

Sampler.DataSampleBuilder negativeSampler = new Sampler()
      .forColumn("c1").generate((name, index) -> randn() * 0.3 + (index % 2 == 0 ? -2 : 2))
      .forColumn("c2").generate((name, index) -> randn() * 0.3 + (index % 2 == 0 ? -2 : 2))
      .forColumn("anomaly").generate((name, index) -> 0.0)
      .end();

Sampler.DataSampleBuilder positiveSampler = new Sampler()
      .forColumn("c1").generate((name, index) -> rand(-4, 4))
      .forColumn("c2").generate((name, index) -> rand(-4, 4))
      .forColumn("anomaly").generate((name, index) -> 1.0)
      .end();

DataFrame data = schema.build();

data = negativeSampler.sample(data, 20);
data = positiveSampler.sample(data, 20);

System.out.println(data.head(10));

IsolationForest method = new IsolationForest();
method.setThreshold(0.38);
DataFrame learnedData = method.fitAndTransform(data);

BinaryClassifierEvaluator evaluator = new BinaryClassifierEvaluator();

for(int i = 0; i < learnedData.rowCount(); ++i){
   boolean predicted = learnedData.row(i).categoricalTarget().equals("1");
   boolean actual = data.row(i).target() == 1.0;
   evaluator.evaluate(actual, predicted);
   logger.info("predicted: {}\texpected: {}", predicted, actual);
}

logger.info("summary: {}", evaluator.getSummary());

Versions

  • 1.0.3
  • 1.0.2
  • 1.0.1