discretizer4j
This project provides a Java implementation of several discretization (also known as binning) algorithms.
When working with numerical data, discretization is often a useful step to counter overfitting in machine learning models or overly specific explanations from XAI algorithms such as Anchors.
We concentrate on univariate algorithms, both supervised and unsupervised, to keep things simple and to stay clear of decision tree algorithms. We chose Java to achieve reasonable performance and easy integration with AnchorsJ, and because we did not find any other suitable open-source Java package.
Current implementations:
- Unsupervised:
  - Equal Frequency in PercentileMedianDiscretizer
  - Equal Size in EqualSizeDiscretizer
  - Proportional k-Interval Discretizer in EqualSizeDiscretizer
  - Manual Discretization in ManualDiscretizer
  - Random Discretization in RandomDiscretizer
- Supervised:
  - FUSINTER Discretizer in FUSINTERDiscretizer
  - Minimum Description Length Principle Discretizer in MDLPDiscretizer
  - Ameva Discretizer in AmevaDiscretizer
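To give a rough idea of what unsupervised binning computes, the following sketch derives cut points for two textbook strategies: equal-width and equal-frequency binning. It is not the code used by any of the discretizers listed above. The class BinningSketch and its method names are hypothetical, equal-width binning is shown only for contrast, and placing equal-frequency cuts halfway between neighbouring values is an assumption made for this example.

```java
import java.util.Arrays;

// Illustrative sketch only; NOT the discretizer4j implementation.
final class BinningSketch {

    // Equal width: split [min, max] into `bins` intervals of identical length.
    static double[] equalWidthCuts(double[] values, int bins) {
        checkArguments(values, bins);
        double min = Arrays.stream(values).min().getAsDouble();
        double max = Arrays.stream(values).max().getAsDouble();
        double width = (max - min) / bins;
        double[] cuts = new double[bins - 1];
        for (int i = 1; i < bins; i++) {
            cuts[i - 1] = min + i * width;
        }
        return cuts;
    }

    // Equal frequency: every bin receives (roughly) the same number of values.
    // Assumption of this sketch: a cut lies halfway between the last value of
    // one bin and the first value of the next.
    static double[] equalFrequencyCuts(double[] values, int bins) {
        checkArguments(values, bins);
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] cuts = new double[bins - 1];
        for (int i = 1; i < bins; i++) {
            int first = i * sorted.length / bins; // index of the first value in bin i
            cuts[i - 1] = (sorted[first - 1] + sorted[first]) / 2.0;
        }
        return cuts;
    }

    private static void checkArguments(double[] values, int bins) {
        if (bins < 2 || values.length < bins) {
            throw new IllegalArgumentException("Need at least 2 bins and one value per bin");
        }
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4, 10, 20, 30, 100};
        System.out.println(Arrays.toString(equalWidthCuts(data, 4)));     // [25.75, 50.5, 75.25]
        System.out.println(Arrays.toString(equalFrequencyCuts(data, 4))); // [2.5, 7.0, 25.0]
    }
}
```

On skewed data such as this, equal-width binning puts almost every value into the first interval, while equal-frequency binning keeps two values per bin.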
Getting Started
Prerequisites and Installation
To use the core project, no installation other than Java (version 8+) is required. The algorithms are intended to be used as a Maven dependency. Our Maven coordinates are as follows:
<dependency>
  <groupId>de.viadee</groupId>
  <artifactId>discretizer4j</artifactId>
  <version>1.0.0</version>
</dependency>
There are no transitive dependencies.
Using the Algorithm
To discretize a continuous feature, create a Discretizer (a class extending AbstractDiscretizer) and fit it to the data. This may look as follows:
Discretizer discretizer = new Discretizer();
discretizer.fit(values, labels);
The fitted discretizer can then return all DiscretizerTransitions determined by the algorithm. Alternatively, values can be passed to the discretizer's apply function, which returns their discretized labels.
discretizer.getTransitions();
// returns:
// DiscretizationTransition From ]1, 14.5) to class 0.0
// DiscretizationTransition From [14.5, 19.5) to class 1.0
// DiscretizationTransition From [19.5, 22.5) to class 2.0
// DiscretizationTransition From [22.5, 36.5) to class 3.0
// DiscretizationTransition From [36.5, 40[ to class 4.0
discretizer.apply(new Double[]{1.5, 17.0, 10.0});
// returns:
// Double[0.0, 1.0, 0.0]
Fitting creates DiscretizerTransitions. Each consists of a discretizedLabel (Double) and a discretizedOrigin. The origin is either a unique value, if the UniqueValueDiscretizer was used, or a pair of minValue and maxValue, which define the interval limits of the transition.
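The mapping from interval transitions to labels can be sketched as follows. This is a minimal stand-in written for illustration, not the library's own transition class: TransitionSketch, Interval, and the apply helper are hypothetical names, all intervals are treated as half-open [minValue, maxValue), and throwing on values outside every interval is a choice of this sketch rather than documented library behaviour. The interval limits are taken from the example output above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; NOT the discretizer4j classes.
final class TransitionSketch {

    /** One interval [minValue, maxValue) mapped to a discretized label. */
    static final class Interval {
        final double minValue;
        final double maxValue;
        final double label;

        Interval(double minValue, double maxValue, double label) {
            this.minValue = minValue;
            this.maxValue = maxValue;
            this.label = label;
        }

        boolean contains(double value) {
            return value >= minValue && value < maxValue;
        }
    }

    /** Replaces every value by the label of the interval that contains it. */
    static double[] apply(List<Interval> transitions, double[] values) {
        double[] labels = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            labels[i] = labelOf(transitions, values[i]);
        }
        return labels;
    }

    private static double labelOf(List<Interval> transitions, double value) {
        for (Interval transition : transitions) {
            if (transition.contains(value)) {
                return transition.label;
            }
        }
        // Sketch-specific choice: reject values not covered by any interval.
        throw new IllegalArgumentException("No interval contains " + value);
    }

    public static void main(String[] args) {
        List<Interval> transitions = Arrays.asList(
                new Interval(1.0, 14.5, 0.0),
                new Interval(14.5, 19.5, 1.0),
                new Interval(19.5, 22.5, 2.0),
                new Interval(22.5, 36.5, 3.0),
                new Interval(36.5, 40.0, 4.0));
        double[] labels = apply(transitions, new double[]{1.5, 17.0, 10.0});
        System.out.println(Arrays.toString(labels)); // prints [0.0, 1.0, 0.0]
    }
}
```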
Tutorials and Examples
Small examples for all implemented discretizers can be found in the unit tests.
To see these discretizers in a more complex project, please refer to the XAI Examples, where discretization is used in the context of explainable artificial intelligence.
Collaboration
The project is maintained and further developed by viadee Consulting AG in Münster, Westphalia. Results from theses at the WWU Münster and the FH Münster have been incorporated. The contact person is Dr. Frank Köhne from viadee.
- Implementation of additional discretizers is planned.
- Community contributions to the project are welcome: please open GitHub issues with suggestions (or pull requests), which we will then work on within the team.
Authors
- Marvin Gronhorst
- Tobias Goerke
- Colin Juers
- Dr. Frank Köhne
License
BSD 3-Clause License
Acknowledgments
Garcia et al. for their extensive research on discretization techniques.