Data Analytics and Statistical Inference

Statistical inference package with fluent API

License	MIT
Categories	Java Languages Infer Application Testing & Monitoring Code Analysis
GroupId	com.github.chen0040
ArtifactId	java-statistical-inference
Last Version	1.0.4
Release Date	May 25, 2017
Type	jar
Description	Data Analytics and Statistical Inference: Statistical inference package with fluent API
Project URL	https://github.com/chen0040/java-statistical-inference
Source Code Management	https://github.com/chen0040/java-statistical-inference

Download java-statistical-inference

Filename	Size
java-statistical-inference-1.0.4.pom
java-statistical-inference-1.0.4.jar	72 KB
java-statistical-inference-1.0.4-sources.jar	40 KB
java-statistical-inference-1.0.4-javadoc.jar	264 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.github.chen0040/java-statistical-inference/ -->
<dependency>
    <groupId>com.github.chen0040</groupId>
    <artifactId>java-statistical-inference</artifactId>
    <version>1.0.4</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.github.chen0040/java-statistical-inference/
implementation 'com.github.chen0040:java-statistical-inference:1.0.4'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.github.chen0040/java-statistical-inference/
implementation ("com.github.chen0040:java-statistical-inference:1.0.4")

Apache Buildr

'com.github.chen0040:java-statistical-inference:jar:1.0.4'

Apache Ivy

<dependency org="com.github.chen0040" name="java-statistical-inference" rev="1.0.4">
  <artifact name="java-statistical-inference" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.github.chen0040', module='java-statistical-inference', version='1.0.4')
)

Scala SBT

libraryDependencies += "com.github.chen0040" % "java-statistical-inference" % "1.0.4"

Leiningen

[com.github.chen0040/java-statistical-inference "1.0.4"]

Dependencies

compile (4)

Group / Artifact	Type	Version
org.slf4j : slf4j-api	jar	1.7.20
org.slf4j : slf4j-log4j12	jar	1.7.20
org.apache.commons : commons-math3	jar	3.2
com.github.chen0040 : java-data-frame	jar	1.0.9

provided (1)

Group / Artifact	Type	Version
org.projectlombok : lombok	jar	1.16.6

test (10)

Group / Artifact	Type	Version
org.testng : testng	jar	6.9.10
org.hamcrest : hamcrest-core	jar	1.3
org.hamcrest : hamcrest-library	jar	1.3
org.assertj : assertj-core	jar	3.5.2
org.powermock : powermock-core	jar	1.6.5
org.powermock : powermock-api-mockito	jar	1.6.5
org.powermock : powermock-module-junit4	jar	1.6.5
org.powermock : powermock-module-testng	jar	1.6.5
org.mockito : mockito-core	jar	2.0.2-beta
org.mockito : mockito-all	jar	2.0.2-beta

Project Modules

There are no modules declared in this project.

java-statistical-inference

This package is a Java implementation of an opinionated statistical inference engine with a fluent API. It aims to make statistical inference easier to conduct, even with little or no knowledge of the underlying statistical principles.

Features

Confidence Interval for numerical variable and proportions (one group or two groups)
Hypothesis Testing for Single Numerical Variable
Hypothesis Testing for Single Categorical Variable (Proportion)
Hypothesis Testing for Two Group Numerical Variable
Hypothesis Testing for Two Group Categorical Variable (Proportion)
ANOVA: Independence Test between a Numerical Variable and a Categorical Variable
Chi-Square Test: Independence Test between a Categorical Variable and another Categorical Variable
ANOVA for Regression: Independence Test between a Numerical Variable and another Numerical Variable
Automatic change of sampling distribution based on sample size:
- Normal distribution for large sample on categorical variable (one or two groups)
- Bootstrap simulation for small sample on categorical variable (one or two groups)
- Normal distribution for large sample on numerical variable (one or two groups)
- Student-T distribution for small sample on numerical variable (one or two groups)
Central Limit Theorem Conditions Check
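
The automatic choice of sampling distribution listed above can be pictured with a small rule-of-thumb selector. This is only a sketch: the class name, the cutoff of 30 observations, and the "at least 10 expected successes and failures" condition are textbook conventions assumed here for illustration, not thresholds confirmed from the library's source.

```java
// Illustrative selector; names and thresholds are assumptions, not the library's API.
final class SamplingDistributionChooser {
    enum Family { NORMAL, STUDENT_T, SIMULATION }

    /** Numerical variable: normal for n >= 30 (assumed cutoff), Student's t otherwise. */
    static Family forNumerical(int sampleSize) {
        return sampleSize >= 30 ? Family.NORMAL : Family.STUDENT_T;
    }

    /** Categorical variable: normal only if at least 10 expected successes and failures. */
    static Family forCategorical(int sampleSize, double proportion) {
        double successes = sampleSize * proportion;
        double failures = sampleSize * (1.0 - proportion);
        return (successes >= 10 && failures >= 10) ? Family.NORMAL : Family.SIMULATION;
    }

    public static void main(String[] args) {
        System.out.println(forNumerical(9));              // small numerical sample
        System.out.println(forCategorical(11, 3.0 / 11)); // small categorical sample
    }
}
```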

Install

Add the following dependency to your POM file:

<dependency>
  <groupId>com.github.chen0040</groupId>
  <artifactId>java-statistical-inference</artifactId>
  <version>1.0.4</version>
</dependency>

Usage

Single Numerical Variable

The code below declares a kie (knowledge inference engine) for a single numerical variable:

Variable variable = new Variable("Amount");
NumericalSampleKie kie = variable.numericalSample();

The code below shows how to load observed data about the variable "Amount" into the kie:

kie.addObservations(new double[] { 0.2, 0.4, 0.6, 0.12, 0.9, 0.13, -0.12, -0.55, 0.5});

Alternatively, the observed data can be loaded from a data frame (see the java-data-frame project for more examples of how to create a data frame):

DataFrame dataFrame = DataQuery.csv().from(new FileInputStream("amount.csv"))
              .selectColumn(0).asNumeric().asInput("Amount").build();
kie.addObservations(dataFrame);

The code below shows the various statistics that can be obtained from the kie about the variable "Amount":

Mean mean = kie.mean();
double confidenceLevel = 0.95;
ConfidenceInterval confidenceInterval = mean.confidenceInterval(confidenceLevel);

System.out.println("sample.mean: " + kie.getSampleMean());
System.out.println("sample.sd: " + kie.getSampleSd());
System.out.println("sample.size: " + kie.getSampleSize());
System.out.println("sample.median: " + kie.getSampleMedian());
System.out.println("sample.max: " + kie.getSampleMax());
System.out.println("sample.min: " + kie.getSampleMin());
System.out.println("sample.1st.quartile: " + kie.getSampleFirstQuartile());
System.out.println("sample.3rd.quartile: " + kie.getSampleThirdQuartile());

System.out.println("sampling distribution: " + kie.getSamplingDistribution());

System.out.println("confidence interval for Amount: " + confidenceInterval);

The kie also provides a user-friendly summary of the confidence interval:

System.out.println(kie.mean().confidenceInterval(0.95).getSummary());
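
For readers who want to see what such an interval contains, here is a minimal stand-alone sketch of the usual "mean plus or minus critical value times standard error" formula. It does not use the library and is not its implementation; the class and method names are made up for illustration, and the critical value is passed in by the caller.

```java
// Stand-alone sketch of a confidence interval for a mean (not the library's code).
final class MeanConfidenceInterval {
    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Sample standard deviation with the n - 1 denominator. */
    static double sampleSd(double[] x) {
        double m = mean(x), ss = 0;
        for (double v : x) ss += (v - m) * (v - m);
        return Math.sqrt(ss / (x.length - 1));
    }

    /** Returns {lower, upper} = mean -/+ criticalValue * sd / sqrt(n). */
    static double[] interval(double[] x, double criticalValue) {
        double m = mean(x);
        double margin = criticalValue * sampleSd(x) / Math.sqrt(x.length);
        return new double[] { m - margin, m + margin };
    }

    public static void main(String[] args) {
        double[] amount = { 0.2, 0.4, 0.6, 0.12, 0.9, 0.13, -0.12, -0.55, 0.5 };
        // 2.306 is the tabulated two-sided 95% critical value of Student's t with 8 degrees of freedom.
        double[] ci = interval(amount, 2.306);
        System.out.println("[" + ci[0] + ", " + ci[1] + "]");
    }
}
```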

The code below shows how to test the null hypothesis that "the population mean of Amount is 0.5"; at a significance level of 0.05, the null hypothesis is rejected when the reported p-value falls below 0.05:

double expected_mean = 0.5;
TestingOnValue test = kie.test4MeanEqualTo(expected_mean);

System.out.println("sampling distribution: " + test.getDistributionFamily());
System.out.println("test statistic: " + test.getTestStatistic());
System.out.println("p-value (one-tail): " + test.getPValueOneTail());
System.out.println("p-value (two-tails): " + test.getPValueTwoTails());

The kie also provides a user-friendly summary of the null hypothesis test:

TestingOnValue test = kie.test4MeanEqualTo(0.5);
System.out.println(test.getSummary());
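
The test statistic reported for a small numerical sample is conventionally the one-sample t statistic, t = (sample mean - hypothesized mean) / (sd / sqrt(n)). The sketch below computes it with plain Java; the class name is hypothetical and this is an illustration of the standard formula, not the library's code. Turning t into a p-value additionally needs the Student's t cumulative distribution, which is left out here.

```java
// Stand-alone sketch of the one-sample t statistic (illustrative, not the library's code).
final class OneSampleTStatistic {
    static double tStatistic(double[] x, double hypothesizedMean) {
        int n = x.length;
        double sum = 0;
        for (double v : x) sum += v;
        double mean = sum / n;

        double ss = 0; // sum of squared deviations from the sample mean
        for (double v : x) ss += (v - mean) * (v - mean);

        double standardError = Math.sqrt(ss / (n - 1)) / Math.sqrt(n);
        return (mean - hypothesizedMean) / standardError;
    }

    public static void main(String[] args) {
        double[] amount = { 0.2, 0.4, 0.6, 0.12, 0.9, 0.13, -0.12, -0.55, 0.5 };
        System.out.println("t = " + tStatistic(amount, 0.5) + " with " + (amount.length - 1) + " degrees of freedom");
    }
}
```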

Single Categorical Variable

The code below declares a kie (knowledge inference engine) for a single categorical variable:

Variable variable = new Variable("Type");
CategoricalSampleKie kie = variable.categoricalSample();

The code below shows how to load observed data about the variable "Type" into the kie:

kie.addObservations(new String[] { "Asset", "Liability", "Equity", "Revenue", "Expense", "Liability", "Equity", "Revenue", "Asset", "Liability", "Equity" });

Alternatively, the observed data can be loaded from a data frame:

InputStream inputStream = new FileInputStream("iris.data");
DataFrame dataFrame = DataQuery.csv(",").from(inputStream)
      .selectColumn(4).asCategory().asInput("Type").build();
kie.addObservations(dataFrame);

The code below shows the various statistics that can be obtained from the kie about the variable "Type":

Proportion proportion = kie.proportion("Liability");
double confidenceLevel = 0.95;
ConfidenceInterval confidenceInterval = proportion.confidenceInterval(confidenceLevel);

System.out.println("sample.mean: " + kie.getSampleMean("Liability"));
System.out.println("sample.proportion: " + kie.getSampleProportion("Liability"));
System.out.println("sample.sd: " + kie.getSampleSd("Liability"));
System.out.println("sample.size: " + kie.getSampleSize());

System.out.println("sampling distribution: " + kie.getSamplingDistribution());

System.out.println("confidence interval for Type == Liability: " + confidenceInterval);

The kie also provides a user-friendly summary of the confidence interval:

System.out.println(kie.proportion("Liability").confidenceInterval(0.95).getSummary());

The code below shows how to test the null hypothesis that "the population proportion of Type==Liability is 0.5"; at a significance level of 0.05, the null hypothesis is rejected when the reported p-value falls below 0.05:

double expected_proportion = 0.5;
TestingOnValue test = kie.test4ProportionEqualTo(expected_proportion);

System.out.println("sampling distribution: " + test.getDistributionFamily());
System.out.println("test statistic: " + test.getTestStatistic());
System.out.println("p-value (one-tail): " + test.getPValueOneTail());
System.out.println("p-value (two-tails): " + test.getPValueTwoTails());

The kie also provides a user-friendly summary of the null hypothesis test:

TestingOnValue test = kie.test4ProportionEqualTo(0.5);
System.out.println(test.getSummary());
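
With only 11 observations, a normal approximation for the proportion is questionable, which is why a simulated sampling distribution is used for small categorical samples. The sketch below shows one simple way to do this: simulate many samples of the same size under the null hypothesis and count how often the simulated count is at least as extreme as the observed one. It is a Monte Carlo simulation under the null, written with java.util.Random only; the class name, the number of runs, and the seed are assumptions, and the library's own resampling scheme may differ.

```java
import java.util.Random;

// Monte Carlo sketch of a one-tail p-value for a proportion (not the library's code).
final class ProportionSimulation {
    /** Estimates P(count <= observed) when each of n draws succeeds with probability p0. */
    static double lowerTailPValue(int n, int observed, double p0, int runs, long seed) {
        Random rng = new Random(seed);
        int asExtreme = 0;
        for (int r = 0; r < runs; r++) {
            int successes = 0;
            for (int i = 0; i < n; i++) {
                if (rng.nextDouble() < p0) successes++;
            }
            if (successes <= observed) asExtreme++;
        }
        return (double) asExtreme / runs;
    }

    public static void main(String[] args) {
        // "Liability" appears 3 times among the 11 observations in the example above.
        System.out.println(lowerTailPValue(11, 3, 0.5, 200_000, 42L));
    }
}
```

For comparison, the exact binomial lower-tail probability for 3 or fewer successes out of 11 at p = 0.5 is 232/2048, about 0.113, so the simulated value should land close to that.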

Paired Sample for a Numerical Variable

The sample code below shows how to run statistical inference on a sample of paired observations (e.g. before and after) for a numerical variable:

Variable variable1 = new Variable("Begin");
Variable variable2 = new Variable("End");

InputStream inputStream = new FileInputStream("calcium-paired.dat");
DataFrame dataFrame = DataQuery.csv().from(inputStream)
      .selectColumn(1).asNumeric().asInput("Begin")
      .selectColumn(2).asNumeric().asInput("End")
      .build();

PairedSampleKie kie = variable2.pair(variable1).numericalSample();
kie.addObservations(dataFrame);

Mean mean = kie.difference();

ConfidenceInterval confidenceInterval = mean.confidenceInterval(0.95);
TestingOnValue test = kie.testDifferenceEqualTo(0.5);

System.out.println("sample.difference-mean: " + kie.getSampleDifferenceMean());
System.out.println("sample.difference-sd: " + kie.getSampleDifferenceSd());
System.out.println("sample.size: " + kie.getSampleSize());
System.out.println("sample.median: " + kie.getSampleMedian());
System.out.println("sample.max: " + kie.getSampleMax());
System.out.println("sample.min: " + kie.getSampleMin());
System.out.println("sample.1st.quartile: " + kie.getSampleFirstQuartile());
System.out.println("sample.3rd.quartile: " + kie.getSampleThirdQuartile());

System.out.println("sampling distribution (difference): " + kie.getSamplingDistribution());

System.out.println("95% confidence interval: " + confidenceInterval);

System.out.println("========================================================");

System.out.println(confidenceInterval.getSummary());
System.out.println(test.getSummary());

In the code above, "calcium-paired.dat" contains the results of a randomized comparative experiment investigating the effect of calcium on blood pressure in African-American men. A treatment group of 10 men received a calcium supplement for 12 weeks. All subjects had their blood pressure measured before and after the 12-week period.
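
A paired design reduces to a single-sample problem on the per-subject differences, which is why the statistics above are reported for "the difference". The sketch below makes that reduction explicit. The readings are made-up numbers rather than values from calcium-paired.dat, the class name is hypothetical, and the sign convention (after minus before) is an assumption.

```java
// Stand-alone sketch: paired observations become one sample of differences.
final class PairedDifferences {
    /** Element-wise after[i] - before[i]; both arrays must have the same length. */
    static double[] differences(double[] before, double[] after) {
        if (before.length != after.length) {
            throw new IllegalArgumentException("paired samples must have equal length");
        }
        double[] d = new double[before.length];
        for (int i = 0; i < d.length; i++) d[i] = after[i] - before[i];
        return d;
    }

    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    public static void main(String[] args) {
        double[] begin = { 120, 115, 130, 128 }; // hypothetical readings
        double[] end   = { 116, 117, 124, 121 }; // hypothetical readings
        System.out.println("mean difference: " + mean(differences(begin, end)));
    }
}
```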

Compare Two Groups for a Numerical Variable

The sample below shows statistical inference on samples from two different groups (e.g. from two different experimental setups) for a numerical variable:

Variable variable = new Variable("Decrease");
TwoGroupNumericalSampleKie kie = variable.twoGroupNumericalSample(new Variable("Treatment"), "Calcium", "Placebo");

InputStream inputStream = new FileInputStream("calcium.dat");
DataFrame dataFrame = DataQuery.csv().from(inputStream)
      .skipRows(33)
      .selectColumn(0).asCategory().asInput("Treatment")
      .selectColumn(3).asNumeric().asInput("Decrease")
      .build();

kie.addObservations(dataFrame);

MeanDifference difference = kie.difference();
ConfidenceInterval confidenceInterval = difference.confidenceInterval(0.95);

TestingOnValueDifference test = kie.test4GroupDifference();

System.out.println("sample1.mean: " + kie.getGroup1SampleMean());
System.out.println("sample1.sd: " + kie.getGroup1SampleSd());
System.out.println("sample1.size: " + kie.getGroup1SampleSize());
System.out.println("sample1.median: " + kie.getGroup1SampleMedian());
System.out.println("sample1.max: " + kie.getGroup1SampleMax());
System.out.println("sample1.min: " + kie.getGroup1SampleMin());
System.out.println("sample1.1st.quartile: " + kie.getGroup1SampleFirstQuartile());
System.out.println("sample1.3rd.quartile: " + kie.getGroup1SampleThirdQuartile());

System.out.println("sample2.mean: " + kie.getGroup2SampleMean());
System.out.println("sample2.sd: " + kie.getGroup2SampleSd());
System.out.println("sample2.size: " + kie.getGroup2SampleSize());
System.out.println("sample2.median: " + kie.getGroup2SampleMedian());
System.out.println("sample2.max: " + kie.getGroup2SampleMax());
System.out.println("sample2.min: " + kie.getGroup2SampleMin());
System.out.println("sample2.1st.quartile: " + kie.getGroup2SampleFirstQuartile());
System.out.println("sample2.3rd.quartile: " + kie.getGroup2SampleThirdQuartile());

System.out.println("sampling distribution: " + kie.getSamplingDistribution());

System.out.println("95% confidence interval: " + confidenceInterval);

System.out.println("========================================================");

System.out.println(confidenceInterval.getSummary());
System.out.println(test.getSummary());

In the code above, "calcium.dat" contains the results of a randomized comparative experiment investigating the effect of calcium on blood pressure in African-American men. A treatment group of 10 men received a calcium supplement for 12 weeks, and a control group of 11 men received a placebo during the same period. All subjects had their blood pressure measured before and after the 12-week period.

The call kie.test4GroupDifference() can be used to test whether the numerical variable is independent of a categorical variable that has two levels (i.e. the "group" variable).
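
A common test statistic for comparing two group means is the difference of the sample means divided by the unpooled standard error sqrt(s1^2/n1 + s2^2/n2), often called Welch's t statistic. The sketch below computes it on made-up numbers; whether the library pools the variances or not is not stated above, so treat this as an illustration of the general idea rather than a reproduction of its output.

```java
// Stand-alone sketch of a two-sample (unpooled) t statistic; names are hypothetical.
final class WelchTStatistic {
    static double mean(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    /** Sample variance with the n - 1 denominator. */
    static double variance(double[] x) {
        double m = mean(x), ss = 0;
        for (double v : x) ss += (v - m) * (v - m);
        return ss / (x.length - 1);
    }

    static double tStatistic(double[] group1, double[] group2) {
        double standardError = Math.sqrt(
                variance(group1) / group1.length + variance(group2) / group2.length);
        return (mean(group1) - mean(group2)) / standardError;
    }

    public static void main(String[] args) {
        double[] calcium = { 7, -4, 18, 17, -3 };  // hypothetical decreases
        double[] placebo = { -1, 12, -1, -3, 3 };  // hypothetical decreases
        System.out.println("t = " + tStatistic(calcium, placebo));
    }
}
```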

Compare Two Groups for a Categorical Variable

The sample below shows statistical inference on samples from two different groups (e.g. from two different experimental setups) for a categorical variable:

Variable variable_use = new Variable("UseContraceptive");
Variable variable_urban = new Variable("IsUrban");

InputStream inputStream = new FileInputStream("contraception.csv");
DataFrame dataFrame = DataQuery.csv(",")
      .from(inputStream)
      .selectColumn(3).asCategory().asInput("UseContraceptive")
      .selectColumn(6).asCategory().asInput("IsUrban")
      .build();

TwoGroupCategoricalSampleKie kie = variable_use.twoGroupCategoricalSampleKie(variable_urban, "Y", "N");

kie.addObservations(dataFrame);

ProportionDifference difference = kie.proportionDifference("Y");
ConfidenceInterval confidenceInterval = difference.confidenceInterval(0.95);

TestingOnProportionDifference test = kie.test4GroupDifference("Y");

System.out.println("sample1.mean: " + kie.getGroup1SampleMean("Y"));
System.out.println("sample1.proportion: " + kie.getGroup1SampleProportion("Y"));
System.out.println("sample1.sd: " + kie.getGroup1SampleSd("Y"));
System.out.println("sample1.size: " + kie.getGroup1SampleSize());

System.out.println("sample2.mean: " + kie.getGroup2SampleMean("Y"));
System.out.println("sample2.proportion: " + kie.getGroup2SampleProportion("Y"));
System.out.println("sample2.sd: " + kie.getGroup2SampleSd("Y"));
System.out.println("sample2.size: " + kie.getGroup2SampleSize());

System.out.println("sampling distribution: " + kie.getSamplingDistribution("Y"));

System.out.println("95% confidence interval: " + confidenceInterval);

System.out.println("========================================================");

System.out.println(confidenceInterval.getSummary());
System.out.println(test.getSummary());

In the code above, "contraception.csv" records whether each person lives in an urban area and whether he/she uses contraception.

The call kie.test4GroupDifference("Y") can be used to test whether the categorical variable is independent of another categorical variable that has two levels (i.e. the "group" variable).
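
For large samples, the difference between two group proportions is commonly tested with a z statistic that uses the pooled proportion in the standard error. The sketch below shows that textbook calculation on made-up counts; the class name is hypothetical and the formula is an assumption about the general approach, not a transcription of the library's code.

```java
// Stand-alone sketch of the pooled two-proportion z statistic (illustrative only).
final class TwoProportionZ {
    /** x1 of n1 successes in group 1, x2 of n2 successes in group 2. */
    static double zStatistic(int x1, int n1, int x2, int n2) {
        double p1 = (double) x1 / n1;
        double p2 = (double) x2 / n2;
        double pooled = (double) (x1 + x2) / (n1 + n2);
        double standardError = Math.sqrt(pooled * (1 - pooled) * (1.0 / n1 + 1.0 / n2));
        return (p1 - p2) / standardError;
    }

    public static void main(String[] args) {
        // Hypothetical counts: 60 of 100 urban and 40 of 100 non-urban respondents answer "Y".
        System.out.println("z = " + zStatistic(60, 100, 40, 100));
    }
}
```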

ANOVA: Independence Test for a Numerical variable and a Categorical Variable

The sample code below shows how to test for independence between a categorical variable (the explanatory variable) and a numerical variable (the response variable):

Variable variable1 = new Variable("Age");
Variable variable2 = new Variable("LiveChannel");

CategoricalToNumericalSampleKie kie = variable1.multipleGroupNumericalSample(variable2);

InputStream inputStream = new FileInputStream("contraception.csv");
DataFrame dataFrame = DataQuery.csv(",")
      .from(inputStream)
      .skipRows(1)
      .selectColumn(5).asNumeric().asInput("Age")
      .selectColumn(4).asCategory().asInput("LiveChannel")
      .build();

kie.addObservations(dataFrame);

Anova test = kie.test4Independence();

System.out.println(test.getSummary());

In the code above, "contraception.csv" records which live channel each person watches (categorical) and his/her age (numerical).
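
One-way ANOVA compares the variability between group means with the variability within groups through the F statistic, F = (SSG / (k - 1)) / (SSE / (N - k)), where k is the number of groups and N the total number of observations. The sketch below computes F for made-up groups; the class name is hypothetical and the code is an illustration of the standard formula rather than the library's implementation.

```java
// Stand-alone sketch of the one-way ANOVA F statistic (illustrative only).
final class OneWayAnova {
    static double fStatistic(double[][] groups) {
        int k = groups.length;
        int total = 0;
        double grandSum = 0;
        for (double[] g : groups) {
            total += g.length;
            for (double v : g) grandSum += v;
        }
        double grandMean = grandSum / total;

        double ssBetween = 0; // SSG: variability of group means around the grand mean
        double ssWithin = 0;  // SSE: variability of observations around their group mean
        for (double[] g : groups) {
            double sum = 0;
            for (double v : g) sum += v;
            double groupMean = sum / g.length;
            ssBetween += g.length * (groupMean - grandMean) * (groupMean - grandMean);
            for (double v : g) ssWithin += (v - groupMean) * (v - groupMean);
        }
        return (ssBetween / (k - 1)) / (ssWithin / (total - k));
    }

    public static void main(String[] args) {
        double[][] agesByChannel = { { 21, 25, 30 }, { 35, 32, 40 }, { 28, 27, 33 } }; // hypothetical
        System.out.println("F = " + fStatistic(agesByChannel));
    }
}
```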

Chi-Square: Independence Test for two Categorical Variables

The sample code below shows how to test for independence between two categorical variables:

Variable variable1 = new Variable("UseContraceptive");
Variable variable2 = new Variable("LiveChannel");

CategoricalToCategoricalSampleKie kie = variable1.multipleGroupCategoricalSample(variable2);

InputStream inputStream = new FileInputStream("contraception.csv");
DataFrame dataFrame = DataQuery.csv(",")
      .from(inputStream)
      .skipRows(1)
      .selectColumn(3).transform(text -> text.equals("Y") ? "Use" : "DontUse").asInput("UseContraceptive")
      .selectColumn(4).asCategory().asInput("LiveChannel")
      .build();

kie.addObservations(dataFrame);

ChiSquareTest test = kie.test4Independence();

ContingencyTable contingencyTable = kie.getOrCreateContingencyTable();

System.out.println(contingencyTable.getSummary());

System.out.println(test.getSummary());

In the code above, "contraception.csv" records which live channel each person watches (a categorical variable) and whether he/she uses contraception (another categorical variable).
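
The chi-square statistic behind such an independence test is computed from a contingency table: each cell's expected count is (row total * column total) / grand total, and the statistic sums (observed - expected)^2 / expected over all cells, with (rows - 1) * (columns - 1) degrees of freedom. The sketch below shows this on a made-up 2x2 table; the class name is hypothetical and the code is independent of the library's ContingencyTable.

```java
// Stand-alone sketch of the chi-square statistic for a contingency table (illustrative only).
final class ChiSquareStatistic {
    static double statistic(int[][] observed) {
        int rows = observed.length, cols = observed[0].length;
        double[] rowTotals = new double[rows];
        double[] colTotals = new double[cols];
        double grandTotal = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                rowTotals[i] += observed[i][j];
                colTotals[j] += observed[i][j];
                grandTotal += observed[i][j];
            }
        }
        double chiSquare = 0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                double expected = rowTotals[i] * colTotals[j] / grandTotal;
                double diff = observed[i][j] - expected;
                chiSquare += diff * diff / expected;
            }
        }
        return chiSquare;
    }

    static int degreesOfFreedom(int[][] observed) {
        return (observed.length - 1) * (observed[0].length - 1);
    }

    public static void main(String[] args) {
        int[][] table = { { 10, 20 }, { 20, 10 } }; // hypothetical counts
        System.out.println("chi-square = " + statistic(table) + ", df = " + degreesOfFreedom(table));
    }
}
```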

Linear dependency between two numerical variables X and Y

The sample code below shows how to analyze the linear dependency between two numerical variables X and Y:

final Random random = new Random(System.currentTimeMillis());

// regression: y is strongly correlated to x by y = 25 + 5 * x (plus random noise)
Sampler.DataSampleBuilder builder = new Sampler()
      .forColumn("x").generate((name, index) -> (double)index)
      .forColumn("y").generate((name, index) -> 25 + (index + random.nextDouble()) * 5 + random.nextDouble())
      .end();

DataFrame dataFrame = DataQuery.blank()
      .newInput("x")
      .newOutput("y")
      .end().build();
dataFrame = builder.sample(dataFrame, 100);

System.out.println(dataFrame.head(10));

Variable x = new Variable("x");
XYSampleKie kie = x.regression(new Variable("y"));

kie.addObservations(dataFrame);

SampleLinearRegression model = kie.model();

System.out.println("correlation between x and y: " + model.getCorrelation());
System.out.println("y-intercept: " + model.getIntercept());
System.out.println("slope: " + model.getSlope());
System.out.println("R^2: " + model.getR2()); // explained variability
System.out.println("SD(X): " + model.getSX());
System.out.println("SD(Y): " + model.getSY());
System.out.println("Mean(X): " + model.getXBar());
System.out.println("Mean(Y): " + model.getYBar());

Anova4Regression anova = kie.test4Independence();

System.out.println(anova.getSummary());
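
The slope, intercept, and R^2 printed above follow from the ordinary least-squares formulas: slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x), and R^2 = Sxy^2 / (Sxx * Syy). The sketch below applies them to a tiny noise-free data set generated from y = 25 + 5 * x. It is a plain-Java illustration with a hypothetical class name, not the library's SampleLinearRegression.

```java
// Stand-alone least-squares sketch for one predictor (illustrative only).
final class LeastSquaresSketch {
    /** Returns {slope, intercept, rSquared} for the line fitted to (x[i], y[i]). */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double xBar = 0, yBar = 0;
        for (int i = 0; i < n; i++) {
            xBar += x[i];
            yBar += y[i];
        }
        xBar /= n;
        yBar /= n;

        double sxx = 0, syy = 0, sxy = 0; // sums of squares and cross-products
        for (int i = 0; i < n; i++) {
            sxx += (x[i] - xBar) * (x[i] - xBar);
            syy += (y[i] - yBar) * (y[i] - yBar);
            sxy += (x[i] - xBar) * (y[i] - yBar);
        }
        double slope = sxy / sxx;
        double intercept = yBar - slope * xBar;
        double rSquared = (sxy * sxy) / (sxx * syy);
        return new double[] { slope, intercept, rSquared };
    }

    public static void main(String[] args) {
        double[] x = { 0, 1, 2, 3 };
        double[] y = { 25, 30, 35, 40 }; // exactly y = 25 + 5 * x
        double[] model = fit(x, y);
        System.out.println("slope=" + model[0] + " intercept=" + model[1] + " R^2=" + model[2]);
    }
}
```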

Versions

Version
1.0.4 May 25, 2017
1.0.1 May 16, 2017
