DataFrame in Java

Some common patterns of data frame in Java

License: MIT
Categories: Java Languages Data
GroupId: com.github.chen0040
ArtifactId: java-data-frame
Last Version: 1.0.11
Release Date:
Type: jar
Description: DataFrame in Java. Some common patterns of data frame in Java
Project URL: https://github.com/chen0040/java-data-frame
Source Code Management: https://github.com/chen0040/java-data-frame

Download java-data-frame

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.github.chen0040/java-data-frame/ -->
<dependency>
    <groupId>com.github.chen0040</groupId>
    <artifactId>java-data-frame</artifactId>
    <version>1.0.11</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.github.chen0040/java-data-frame/
implementation 'com.github.chen0040:java-data-frame:1.0.11'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.github.chen0040/java-data-frame/
implementation ("com.github.chen0040:java-data-frame:1.0.11")

Buildr:

'com.github.chen0040:java-data-frame:jar:1.0.11'

Ivy:

<dependency org="com.github.chen0040" name="java-data-frame" rev="1.0.11">
  <artifact name="java-data-frame" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='com.github.chen0040', module='java-data-frame', version='1.0.11')
)

SBT:

libraryDependencies += "com.github.chen0040" % "java-data-frame" % "1.0.11"

Leiningen:

[com.github.chen0040/java-data-frame "1.0.11"]

Dependencies

compile (2)

Group / Artifact             Type  Version
org.slf4j : slf4j-api        jar   1.7.20
org.slf4j : slf4j-log4j12    jar   1.7.20

provided (1)

Group / Artifact             Type  Version
org.projectlombok : lombok   jar   1.16.6

test (10)

Group / Artifact                          Type  Version
org.testng : testng                       jar   6.9.10
org.hamcrest : hamcrest-core              jar   1.3
org.hamcrest : hamcrest-library           jar   1.3
org.assertj : assertj-core                jar   3.5.2
org.powermock : powermock-core            jar   1.6.5
org.powermock : powermock-api-mockito     jar   1.6.5
org.powermock : powermock-module-junit4   jar   1.6.5
org.powermock : powermock-module-testng   jar   1.6.5
org.mockito : mockito-core                jar   2.0.2-beta
org.mockito : mockito-all                 jar   2.0.2-beta

Project Modules

There are no modules declared in this project.

java-data-frame

This package provides the core data frame implementation for numerical computation.

Features

  • Load data frame from CSV file
  • Load libsvm format files
  • Create data frame using data sampling

Support for more file formats will be added in the future.

Install

Add the following dependency to your POM file:

<dependency>
  <groupId>com.github.chen0040</groupId>
  <artifactId>java-data-frame</artifactId>
  <version>1.0.11</version>
</dependency>

Usage

Create a data frame manually

The sample code below shows how to create a data frame manually:

DataFrame dataFrame = new BasicDataFrame();

DataRow row = dataFrame.newRow();
row.setCell("inputColumn1", 0.1);
row.setCategoricalCell("inputColumn2", "Hello");
row.setTargetCell("numericOutput", 0.2);
row.setCategoricalTargetCell("categoricalOutput", "YES");

dataFrame.addRow(row);

// add more rows here

// call lock to perform aggregation and prevent further addition of new rows
dataFrame.lock();

Note that you need to call "dataFrame.lock()" once you have finished adding rows so that aggregation can be performed. After this call, the data frame prevents further addition of rows. To add rows again, call "dataFrame.unlock()" first.
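The lock/unlock contract described above can be pictured with a toy class. This is only a sketch and is not the library's BasicDataFrame: the class name LockableRows is hypothetical, "aggregation" is assumed here to mean collecting the distinct values seen so far, and throwing an exception on a locked frame is also an assumption, since the text only says that further additions are prevented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy illustration of a lock-then-aggregate contract (hypothetical, not library code). */
final class LockableRows {
    private final List<String> rows = new ArrayList<>();
    private Set<String> levels = new TreeSet<>();
    private boolean locked = false;

    /** Adds a row unless the container is locked. */
    void addRow(String value) {
        if (locked) {
            throw new IllegalStateException("locked: call unlock() before adding rows");
        }
        rows.add(value);
    }

    /** Performs the (assumed) aggregation and blocks further additions. */
    void lock() {
        levels = new TreeSet<>(rows); // collect the distinct values seen so far
        locked = true;
    }

    /** Allows rows to be added again; the aggregate is refreshed on the next lock(). */
    void unlock() {
        locked = false;
    }

    Set<String> levels() {
        return levels;
    }

    int rowCount() {
        return rows.size();
    }

    public static void main(String[] args) {
        LockableRows frame = new LockableRows();
        frame.addRow("YES");
        frame.addRow("NO");
        frame.lock();
        System.out.println("levels after lock: " + frame.levels());
    }
}
```

The point of the sketch is the ordering: values derived from all rows are only trustworthy after lock(), and adding rows requires unlock() followed by another lock().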

Create a data frame using Sampler

The sample code below shows how to create a data frame using the Sampler class:

DataQuery.DataFrameQueryBuilder schema = DataQuery.blank()
      .newInput("x1")
      .newInput("x2")
      .newOutput("y")
      .end();

// y = 4 + 0.5 * x1 + 0.2 * x2
Sampler.DataSampleBuilder sampler = new Sampler()
      .forColumn("x1").generate((name, index) -> randn() * 0.3 + index)
      .forColumn("x2").generate((name, index) -> randn() * 0.3 + index * index)
      .forColumn("y").generate((name, index) -> 4 + 0.5 * index + 0.2 * index * index + randn() * 0.3)
      .end();

DataFrame dataFrame = schema.build();

dataFrame = sampler.sample(dataFrame, 200);

The sample code above creates a data frame consisting of 200 rows and 3 columns ("x1", "x2", "y").
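To make the generator formulas concrete, here is a library-free sketch that produces the same kind of rows with plain Java. It assumes that randn() draws from a standard normal distribution (modelled with Random.nextGaussian()) and that the row index starts at 0; neither is stated above, and the class name SamplerSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Library-free sketch of the sampling formulas (hypothetical helper, not the Sampler class). */
final class SamplerSketch {

    /** Returns `count` rows of {x1, x2, y}, using one generator formula per column. */
    static List<double[]> sample(int count, long seed) {
        Random random = new Random(seed); // fixed seed makes the sample reproducible
        List<double[]> rows = new ArrayList<>(count);
        for (int index = 0; index < count; index++) {
            double x1 = random.nextGaussian() * 0.3 + index;
            double x2 = random.nextGaussian() * 0.3 + index * index;
            // y follows 4 + 0.5 * x1 + 0.2 * x2 up to Gaussian noise
            double y = 4 + 0.5 * index + 0.2 * index * index + random.nextGaussian() * 0.3;
            rows.add(new double[] {x1, x2, y});
        }
        return rows;
    }

    public static void main(String[] args) {
        List<double[]> rows = sample(200, 42L);
        System.out.println("rows: " + rows.size() + ", columns: " + rows.get(0).length);
    }
}
```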

 

Print contents in a data frame

The sample code below shows how to print the content in the data frame:

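// print the head of the data frame (here with argument 2)
System.out.println(dataFrame.head(2));

// iterate over the rows with a stream
dataFrame.stream().forEach(r -> System.out.println("row: " + r));

// or with a for-each loop
for (DataRow r : dataFrame) {
    System.out.println("row: " + r);
}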

Filtering

The sample code below creates a new data frame from an existing one, keeping only the rows that match the filter predicate:

DataFrame filtered = oldDataFrame.filter(row -> { ... });

Clone

The sample code below creates a copy of an existing data frame:

DataFrame clone = oldDataFrame.makeCopy();

Sample and split

To shuffle the content of a data frame:

dataFrame.shuffle();

To split a data frame into two data frames:

TupleTwo<DataFrame, DataFrame> miniFrames = dataFrame.split(0.9);
DataFrame frame1 = miniFrames._1();
DataFrame frame2 = miniFrames._2();

Here frame1 contains 90% of the rows in the original data frame, while frame2 contains the remaining 10%.
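The arithmetic behind a ratio split can be sketched on a plain list. This is an illustration only: the class name SplitSketch is hypothetical, and rounding the cut point to the nearest row is an assumption, since how the library rounds is not stated here.

```java
import java.util.ArrayList;
import java.util.List;

/** Library-free sketch of a ratio split (hypothetical helper, not DataFrame.split). */
final class SplitSketch {

    /** Splits rows into two lists: the first holds about `ratio` of the rows, the second the rest. */
    static <T> List<List<T>> split(List<T> rows, double ratio) {
        if (ratio < 0.0 || ratio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]");
        }
        int cut = (int) Math.round(rows.size() * ratio); // assumed rounding rule
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(rows.subList(0, cut)));
        parts.add(new ArrayList<>(rows.subList(cut, rows.size())));
        return parts;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            rows.add(i);
        }
        List<List<Integer>> parts = split(rows, 0.9);
        System.out.println(parts.get(0).size() + " / " + parts.get(1).size());
    }
}
```

Shuffling before splitting matters when the rows are ordered (for example, sorted by class), because otherwise the second part would contain only the tail of the data.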

Convert numerical columns to categorical columns

Some algorithms need to treat numerical columns as categorical columns. For these, the library provides the KMeansDiscretizer to do the conversion.

The following lines transform a data frame that has a number of numerical columns into a data frame in which those numerical columns have been converted to categorical columns:

KMeansDiscretizer discretizer = new KMeansDiscretizer();
discretizer.setMaxLevelCount(12); // set number of discrete values for each numerical column
// discretizer.setMaxIters(500); // specifies the number of iterations to run k-means

DataFrame newFrame = discretizer.fitAndTransform(dataFrame);
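For intuition, k-means discretization of a single numerical column can be sketched without the library: cluster the values in one dimension and use the cluster index as the categorical level. The code below is a minimal sketch, not the KMeansDiscretizer implementation; the class name KMeans1DSketch and the evenly spaced initial centroids are assumptions made to keep the example deterministic.

```java
import java.util.Arrays;

/** Library-free sketch of 1-D k-means discretization (hypothetical, not KMeansDiscretizer). */
final class KMeans1DSketch {

    /** Returns, for each value, the index of the cluster (level) it ends up in. */
    static int[] discretize(double[] values, int levels, int maxIters) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double[] centroids = new double[levels];
        for (int k = 0; k < levels; k++) {
            // deterministic start: spread centroids evenly over [min, max]
            centroids[k] = levels == 1 ? (min + max) / 2 : min + (max - min) * k / (levels - 1);
        }
        int[] assignment = new int[values.length];
        for (int iter = 0; iter < maxIters; iter++) {
            boolean changed = false;
            for (int i = 0; i < values.length; i++) {
                int best = nearest(centroids, values[i]);
                if (best != assignment[i]) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            // move each non-empty centroid to the mean of its members
            double[] sum = new double[levels];
            int[] count = new int[levels];
            for (int i = 0; i < values.length; i++) {
                sum[assignment[i]] += values[i];
                count[assignment[i]]++;
            }
            for (int k = 0; k < levels; k++) {
                if (count[k] > 0) {
                    centroids[k] = sum[k] / count[k];
                }
            }
            if (!changed && iter > 0) {
                break; // assignments are stable, so the clustering has converged
            }
        }
        return assignment;
    }

    /** Index of the centroid closest to the value; ties go to the lower index. */
    private static int nearest(double[] centroids, double value) {
        int best = 0;
        for (int k = 1; k < centroids.length; k++) {
            if (Math.abs(value - centroids[k]) < Math.abs(value - centroids[best])) {
                best = k;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] column = {1.0, 1.2, 0.8, 10.0, 10.5, 9.5, 20.0, 21.0, 19.0};
        System.out.println(Arrays.toString(discretize(column, 3, 500)));
    }
}
```

In this sketch the `levels` argument plays the role that setMaxLevelCount plays above, and `maxIters` the role of setMaxIters.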

The sample code below is a complete example of this operation:

InputStream inputStream = FileUtils.getResource("carmileage.dat");

DataQuery.DataFrameQueryBuilder schema = DataQuery.csv().from(inputStream)
      .skipRows(29)
      .selectColumn(0).asCategory().asInput("MAKE/MODEL")
      .selectColumn(1).asNumeric().asInput("VOL")
      .selectColumn(2).asNumeric().asInput("HP")
      .selectColumn(3).asNumeric().asOutput("MPG")
      .selectColumn(4).asNumeric().asInput("SP")
      .selectColumn(5).asNumeric().asInput("WT");

DataFrame dataFrame = schema.build();
System.out.println(dataFrame.head(10));
System.out.println("categorical column count: " + dataFrame.getAllColumns().stream().filter(DataColumn::isCategorical).count());
System.out.println("numerical column count: " + dataFrame.getAllColumns().stream().filter(DataColumn::isNumerical).count());

KMeansDiscretizer discretizer = new KMeansDiscretizer();
discretizer.setMaxLevelCount(12); // set number of discrete values for each numerical column

DataFrame newFrame = discretizer.fitAndTransform(dataFrame);

System.out.println(newFrame.head(10));
System.out.println("categorical column count: " + newFrame.getAllColumns().stream().filter(DataColumn::isCategorical).count());
System.out.println("numerical column count: " + newFrame.getAllColumns().stream().filter(DataColumn::isNumerical).count());

Load from CSV file

Suppose you have a csv file named contraception.csv that has the following file format:

"","woman","district","use","livch","age","urban"
"1","1","1","N","3+",18.44,"Y"
"2","2","1","N","0",-5.5599,"Y"
"3","3","1","N","2",1.44,"Y"
"4","4","1","N","3+",8.44,"Y"
"5","5","1","N","0",-13.559,"Y"
"6","6","1","N","0",-11.56,"Y"

An example of Java code to create a data frame from the above CSV file:

import com.github.chen0040.data.frame.DataFrame;
import com.github.chen0040.data.frame.DataQuery;
import com.github.chen0040.data.utils.StringUtils;

import java.io.FileInputStream;
import java.io.InputStream;

int column_use = 3;
int column_livch = 4;
int column_age = 5;
int column_urban = 6;
boolean skipFirstLine = true;
String columnSplitter = ",";
InputStream inputStream = new FileInputStream("contraception.csv");
DataFrame frame = DataQuery.csv(columnSplitter, skipFirstLine)
        .from(inputStream)
        .selectColumn(column_livch).asCategory().asInput("livch")
        .selectColumn(column_age).asNumeric().asInput("age")
        .selectColumn(column_age).transform(cell -> Math.pow(StringUtils.parseDouble(cell), 2)).asInput("age^2")
        .selectColumn(column_urban).asCategory().asInput("urban")
        .selectColumn(column_use).transform(cell -> cell.equals("Y") ? 1.0 : 0.0).asOutput("use")
        .build();

The code above creates a data frame which has the following columns:

  • livch1 (input): value = 1 if the "livch" column of the CSV contains value 1; 0 otherwise
  • livch2 (input): value = 1 if the "livch" column of the CSV contains value 2; 0 otherwise
  • livch3 (input): value = 1 if the "livch" column of the CSV contains value 3+; 0 otherwise
  • age (input): value = numeric value in the "age" column of the CSV
  • age^2 (input): value = square of the numeric value in the "age" column of the CSV
  • urban (input): value = 1 if the "urban" column of the CSV has value "Y"; 0 otherwise
  • use (output): value = 1 if the "use" column of the CSV has value "Y"; 0 otherwise
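The column mapping listed above can be traced by hand for a single CSV line. The sketch below does this in plain Java, without the library; the class name CsvRowSketch is hypothetical, and the naive split on "," only works because no cell in this file contains a comma.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Library-free sketch of the column mapping for one CSV line (hypothetical helper). */
final class CsvRowSketch {

    /** Maps one line of contraception.csv to the numeric columns listed above. */
    static Map<String, Double> toRow(String line) {
        String[] cells = line.split(","); // naive split: assumes no commas inside cells
        for (int i = 0; i < cells.length; i++) {
            cells[i] = cells[i].replace("\"", "").trim(); // strip the surrounding quotes
        }
        String use = cells[3];
        String livch = cells[4];
        double age = Double.parseDouble(cells[5]);
        String urban = cells[6];

        Map<String, Double> row = new LinkedHashMap<>();
        // one indicator column per listed level of the categorical "livch" column
        row.put("livch1", livch.equals("1") ? 1.0 : 0.0);
        row.put("livch2", livch.equals("2") ? 1.0 : 0.0);
        row.put("livch3", livch.equals("3+") ? 1.0 : 0.0);
        row.put("age", age);
        row.put("age^2", Math.pow(age, 2));
        row.put("urban", urban.equals("Y") ? 1.0 : 0.0);
        row.put("use", use.equals("Y") ? 1.0 : 0.0);
        return row;
    }

    public static void main(String[] args) {
        System.out.println(toRow("\"3\",\"3\",\"1\",\"N\",\"2\",1.44,\"Y\""));
    }
}
```

A row whose "livch" value is "0" gets 0 in all three indicator columns, which matches the list above, where no separate column is given for that level.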

In the above case, the output of the data frame is numerical. The code sample below shows how to load a data frame whose output is categorical:

InputStream irisStream = new FileInputStream("iris.data");
DataFrame irisData = DataQuery.csv(",", false)
      .from(irisStream)
      .selectColumn(0).asNumeric().asInput("Sepal Length")
      .selectColumn(1).asNumeric().asInput("Sepal Width")
      .selectColumn(2).asNumeric().asInput("Petal Length")
      .selectColumn(3).asNumeric().asInput("Petal Width")
      .selectColumn(4).asCategory().asOutput("Iris Type")
      .build();

Load libsvm formatted file

The sample code below shows how a data frame can be created from "heart_scale.txt", which is in libsvm format:

DataFrame frame = DataQuery.libsvm().from(new FileInputStream("heart_scale.txt")).build();
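For readers unfamiliar with the libsvm format: each line holds a label followed by sparse "index:value" pairs separated by whitespace, and features that are not listed are taken to be zero. The sketch below parses one such line in plain Java. It is not the library's parser; the class name LibsvmLineSketch is hypothetical and the sample line is made up for illustration rather than taken from heart_scale.txt.

```java
import java.util.Map;
import java.util.TreeMap;

/** Library-free sketch of parsing one libsvm-formatted line (hypothetical helper). */
final class LibsvmLineSketch {

    /** Returns the label, which is the first whitespace-separated token. */
    static double label(String line) {
        return Double.parseDouble(line.trim().split("\\s+")[0]);
    }

    /** Returns the sparse features as a map from feature index to value. */
    static Map<Integer, Double> features(String line) {
        String[] tokens = line.trim().split("\\s+");
        Map<Integer, Double> features = new TreeMap<>();
        for (int i = 1; i < tokens.length; i++) {
            int colon = tokens[i].indexOf(':');
            int index = Integer.parseInt(tokens[i].substring(0, colon));
            double value = Double.parseDouble(tokens[i].substring(colon + 1));
            features.put(index, value);
        }
        return features;
    }

    public static void main(String[] args) {
        String line = "+1 1:0.5 3:-0.25 7:1"; // made-up sample line
        System.out.println(label(line) + " " + features(line));
    }
}
```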

Versions

Version
1.0.11
1.0.10
1.0.9
1.0.8
1.0.7
1.0.6
1.0.5
1.0.4
1.0.3
1.0.2
1.0.1