Machine Learning

julielab-ranklib-mallet

de.julielab : julielab-ranklib-mallet

The Parent POM for all JULIE Lab projects.

Last Version: 1.0.0

Release Date:

DMNBtext

nz.ac.waikato.cms.weka : DMNBtext

Class for building and using a Discriminative Multinomial Naive Bayes classifier. For more information see: Jiang Su, Harry Zhang, Charles X. Ling, Stan Matwin: Discriminative Parameter Learning for Bayesian Networks. In: ICML 2008, 2008.

Last Version: 1.0.2

Release Date:

iterativeAbsoluteErrorRegression

nz.ac.waikato.cms.weka : iterativeAbsoluteErrorRegression

Provides a regression scheme that uses Schlossmacher's iteratively reweighted least squares method to fit a model that minimizes absolute error. The scheme can be used with any base learner in WEKA that performs least-squares regression.

Last Version: 1.0.0

Release Date:

racedIncrementalLogitBoost

nz.ac.waikato.cms.weka : racedIncrementalLogitBoost

Classifier for incremental learning of large datasets by way of racing logit-boosted committees. For more information see: Eibe Frank, Geoffrey Holmes, Richard Kirkby, Mark Hall: Racing committees for large datasets. In: Proceedings of the 5th International Conference on Discovery Science, 153-164, 2002.

Last Version: 1.0.2

Release Date:

leastMedSquared

nz.ac.waikato.cms.weka : leastMedSquared

Implements a least median squared linear regression utilizing the existing WEKA LinearRegression class to form predictions. Least squared regression functions are generated from random subsamples of the data. The least squared regression with the lowest median squared error is chosen as the final model. The basis of the algorithm is Peter J. Rousseeuw, Annick M. Leroy (1987). Robust regression and outlier detection.
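
A minimal usage sketch, not taken from the package itself: it assumes the classifier class is weka.classifiers.functions.LeastMedSq, as in the standard WEKA distribution, and "housing.arff" is a placeholder path to a dataset with a numeric class attribute.

    import weka.classifiers.functions.LeastMedSq;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LeastMedSqDemo {
        public static void main(String[] args) throws Exception {
            // Load a regression dataset; the last attribute is the numeric target.
            Instances data = DataSource.read("housing.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Fits least-squares models to random subsamples and keeps the one
            // with the lowest median squared error, as described above.
            LeastMedSq lms = new LeastMedSq();
            lms.buildClassifier(data);
            System.out.println(lms.classifyInstance(data.instance(0)));
        }
    }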

Last Version: 1.0.2

Release Date:

simpleCART

nz.ac.waikato.cms.weka : simpleCART

Class implementing minimal cost-complexity pruning. Note: when dealing with missing values, the "fractional instances" method is used instead of the surrogate split method. For more information, see: Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, California.

Last Version: 1.0.2

Release Date:

thresholdSelector

nz.ac.waikato.cms.weka : thresholdSelector

A metaclassifier that selects a mid-point threshold on the probability output by a classifier. The midpoint threshold is set so that a given performance measure is optimized. Currently this is the F-measure. Performance is measured either on the training data, a hold-out set or using cross-validation. In addition, the probabilities returned by the base learner can have their range expanded so that the output probabilities will reside between 0 and 1 (this is useful if the scheme normally produces probabilities in a very narrow range).
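
A hedged usage sketch. It assumes the class weka.classifiers.meta.ThresholdSelector with the evaluation-mode tags (EVAL_CROSS_VALIDATION, TAGS_EVALUATION) of the standard implementation; the Logistic base learner and dataset path are placeholders.

    import weka.classifiers.functions.Logistic;
    import weka.classifiers.meta.ThresholdSelector;
    import weka.core.Instances;
    import weka.core.SelectedTag;
    import weka.core.converters.ConverterUtils.DataSource;

    public class ThresholdSelectorDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff"); // placeholder two-class dataset
            data.setClassIndex(data.numAttributes() - 1);

            ThresholdSelector ts = new ThresholdSelector();
            ts.setClassifier(new Logistic()); // base probability estimator
            // Measure performance via cross-validation rather than on the training data.
            ts.setEvaluationMode(new SelectedTag(ThresholdSelector.EVAL_CROSS_VALIDATION,
                                                 ThresholdSelector.TAGS_EVALUATION));
            ts.buildClassifier(data);
            System.out.println(ts);
        }
    }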

Last Version: 1.0.3

Release Date:

streamingUnivariateStats

nz.ac.waikato.cms.weka : streamingUnivariateStats

This package provides a Knowledge Flow step to compute summary statistics incrementally.

Last Version: 1.0.1

Release Date:

clojureClassifier

nz.ac.waikato.cms.weka : clojureClassifier

Wrapper classifier for classifiers written in the Clojure language.

Last Version: 1.0.1

Release Date:

niftiLoader

nz.ac.waikato.cms.weka : niftiLoader

Package for loading a directory containing MRI data in NIfTI format. The directory to be loaded must contain as many subdirectories as there are classes of MRI data. Each subdirectory name will be used as the class label for the corresponding .nii files in that subdirectory. (This is the same strategy as the one used by WEKA's TextDirectoryLoader.) Currently, the package only reads volume information for the first time slot from each .nii file. The readDoubleVol(short ttt) method from the Nifti1Dataset class (http://niftilib.sourceforge.net/java_api_html/Nifti1Dataset.html) is used to read the data for each volume into a sparse WEKA instance (with ttt=0). For an LxMxN volume (the dimensions must be the same for each .nii file in the directory!), the order of values in the generated instance is [(z_1, y_1, x_1), ..., (z_1, y_1, x_L), (z_1, y_2, x_1), ..., (z_1, y_M, x_L), (z_2, y_1, x_1), ..., (z_N, y_M, x_L)]. If the volume is an image, then only x and y coordinates are used.

Last Version: 1.0.1

Release Date:

stackingC

nz.ac.waikato.cms.weka : stackingC

Implements StackingC (a more efficient version of stacking). For more information, see A.K. Seewald: How to Make Stacking Better and Faster While Also Taking Care of an Unknown Weakness. In: Nineteenth International Conference on Machine Learning, 554-561, 2002. Note: requires the meta classifier to be a numeric prediction scheme.
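
A minimal sketch of the setup described above, assuming the standard weka.classifiers.meta.StackingC API (setClassifiers/setMetaClassifier); the base learners and dataset path are placeholders.

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.LinearRegression;
    import weka.classifiers.meta.StackingC;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class StackingCDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            StackingC sc = new StackingC();
            sc.setClassifiers(new Classifier[] { new J48(), new NaiveBayes() });
            // The meta classifier must be a numeric prediction scheme, per the note above.
            sc.setMetaClassifier(new LinearRegression());
            sc.buildClassifier(data);
            System.out.println(sc);
        }
    }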

Last Version: 1.0.4

Release Date:

distributedWekaHadoop

nz.ac.waikato.cms.weka : distributedWekaHadoop

This package provides loaders and savers for HDFS, plus Hadoop jobs and tasks that wrap the tasks provided in distributedWekaBase. Includes libraries for Hadoop 1.1.2.

Last Version: 1.0.15

Release Date:

tertius

nz.ac.waikato.cms.weka : tertius

Finds rules according to confirmation measure (Tertius-type algorithm). For more information see: P. A. Flach, N. Lachiche (1999). Confirmation-Guided Discovery of first-order rules with Tertius. Machine Learning. 42:61-95.

Last Version: 1.0.2

Release Date:

wekaPython

nz.ac.waikato.cms.weka : wekaPython

Integration with CPython for Weka. Python version 2.7.x or higher is required. Also requires the following packages to be installed in Python: numpy, pandas, matplotlib and scikit-learn. This package provides a wrapper classifier and clusterer that, between them, cover 60+ scikit-learn algorithms. It also provides a general scripting step for the Knowledge Flow along with scripting plugin environments for the Explorer and Knowledge Flow.

Last Version: 1.0.13

Release Date:

scatterPlot3D

nz.ac.waikato.cms.weka : scatterPlot3D

A visualization component for displaying a 3D scatter plot of the data using Java 3D. Requires Java 3D to be installed. This version adds built-in sampling controls to the GUI. The default sampling percentage is set so that a maximum of 5000 instances are plotted. The user can adjust this higher or lower to suit their available processing speed and memory.

Last Version: 1.0.7

Release Date:

linearForwardSelection

nz.ac.waikato.cms.weka : linearForwardSelection

Extension of BestFirst. Takes a restricted number of k attributes into account. Fixed-set selects a fixed number k of attributes, whereas k is increased in each step when fixed-width is selected. The search uses either the initial ordering to select the top k attributes, or performs a ranking (with the same evaluator the search uses later on). The search direction can be forward, or floating forward selection (with optional backward search steps). For more information see: Martin Guetlein (2006). Large Scale Attribute Selection Using Wrappers. Freiburg, Germany.
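
A hedged sketch of running this search through WEKA's generic attribute-selection API; it assumes the search class is weka.attributeSelection.LinearForwardSelection, and the CfsSubsetEval evaluator and dataset path are placeholder choices.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.attributeSelection.LinearForwardSelection;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LinearForwardSelectionDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("sonar.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            AttributeSelection sel = new AttributeSelection();
            sel.setEvaluator(new CfsSubsetEval());
            sel.setSearch(new LinearForwardSelection()); // restricted-k forward search
            sel.SelectAttributes(data);
            System.out.println(java.util.Arrays.toString(sel.selectedAttributes()));
        }
    }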

Last Version: 1.0.2

Release Date:

scriptingClassifiers

nz.ac.waikato.cms.weka : scriptingClassifiers

Wrapper classifiers for Jython and Groovy code. Even though the classifier is serializable, a trained classifier cannot be stored persistently; i.e., one cannot save a model file and re-load it later to make predictions.

Last Version: 1.0.2

Release Date:

bestFirstTree

nz.ac.waikato.cms.weka : bestFirstTree

Class for building a best-first decision tree classifier. This class uses binary split for both nominal and numeric attributes. For missing values, the method of 'fractional' instances is used. For more information, see: Haijian Shi (2007). Best-first decision tree learning. Hamilton, NZ. Jerome Friedman, Trevor Hastie, Robert Tibshirani (2000). Additive logistic regression: A statistical view of boosting. Annals of statistics. 28(2):337-407.

Last Version: 1.0.4

Release Date:

latentSemanticAnalysis

nz.ac.waikato.cms.weka : latentSemanticAnalysis

Performs latent semantic analysis and transformation of the data. Use in conjunction with a Ranker search. A low-rank approximation of the full data is found by specifying the number of singular values to use. The dataset may be transformed to give the relation of either the attributes or the instances (default) to the concept space created by the transformation.

Last Version: 1.0.5

Release Date:

gridSearch

nz.ac.waikato.cms.weka : gridSearch

Performs a grid search of parameter pairs for a classifier (Y-axis, default is LinearRegression with the "Ridge" parameter) and the PLSFilter (X-axis, "# of Components") and chooses the best pair found for the actual prediction. The initial grid is worked on with 2-fold CV to determine the values of the parameter pairs for the selected type of evaluation (e.g., accuracy). The best point in the grid is then taken and a 10-fold CV is performed with the adjacent parameter pairs. If a better pair is found, then this will act as the new center and another 10-fold CV will be performed (a kind of hill-climbing). This process is repeated until no better pair is found or the best pair is on the border of the grid. In case the best pair is on the border, one can let GridSearch automatically extend the grid and continue the search. Check out the properties 'gridIsExtendable' (option '-extend-grid') and 'maxGridExtensions' (option '-max-grid-extensions <num>'). GridSearch can handle doubles, integers (values are just cast to int) and booleans (0 is false, otherwise true). float, char and long are supported as well. The best filter/classifier setup can be accessed after the buildClassifier call via the getBestFilter/getBestClassifier methods. Note on the implementation: after the data has been passed through the filter, a default NumericCleaner filter is applied to the data in order to avoid numbers that get too small and might produce NaNs in other schemes.
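
A minimal sketch of the workflow described above; the getBestFilter/getBestClassifier calls are the ones named in the description, while the dataset path is a placeholder.

    import weka.classifiers.meta.GridSearch;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class GridSearchDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("cpu.arff"); // placeholder numeric-class dataset
            data.setClassIndex(data.numAttributes() - 1);

            // Defaults: LinearRegression ridge on the Y axis, PLSFilter components on X.
            GridSearch gs = new GridSearch();
            gs.buildClassifier(data);

            // The winning pair is accessible after buildClassifier, as noted above.
            System.out.println(gs.getBestFilter());
            System.out.println(gs.getBestClassifier());
        }
    }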

Last Version: 1.0.12

Release Date:

kernelLogisticRegression

nz.ac.waikato.cms.weka : kernelLogisticRegression

This package contains a classifier that can be used to train a two-class kernel logistic regression model with the kernel functions that are available in WEKA. It optimises the negative log-likelihood with a quadratic penalty. Both BFGS and conjugate gradient descent are available as optimisation methods, but the former is normally faster. It is possible to use multiple threads, but the speed-up is generally very marginal when used with BFGS optimisation. With conjugate gradient descent optimisation, greater speed-ups can be achieved when using multiple threads. With the default kernel, the dot product kernel, this method produces results that are close to identical to those obtained using standard logistic regression in WEKA, provided a sufficiently large value for the parameter determining the size of the quadratic penalty is used in both cases.

Last Version: 1.0.0

Release Date:

DilcaDistance

nz.ac.waikato.cms.weka : DilcaDistance

This package implements the parameter-free version of the DILCA distance. This approach learns value-to-value distances between each pair of values for each attribute of the dataset. The distance between two values is computed indirectly, based on their distribution w.r.t. a carefully chosen set of related attributes (the context).

Last Version: 1.0.1

Release Date:

raceSearch

nz.ac.waikato.cms.weka : raceSearch

Races the cross validation error of competing attribute subsets. Use in conjunction with a ClassifierSubsetEval. RaceSearch has four modes: forward selection races all single attribute additions to a base set (initially no attributes), selects the winner to become the new base set and then iterates until there is no improvement over the base set. Backward elimination is similar but the initial base set has all attributes included and races all single attribute deletions. Schemata search is a bit different. In each iteration a series of races is run in parallel. Each race in a set determines whether a particular attribute should be included or not, i.e., the race is between the attribute being "in" or "out". The other attributes for this race are included or excluded randomly at each point in the evaluation. As soon as one race has a clear winner (i.e., it has been decided whether a particular attribute should be in or not) the next set of races begins, using the result of the winning race from the previous iteration as the new base set. Rank race first ranks the attributes using an attribute evaluator and then races the ranking. The race includes no attributes, the top ranked attribute, the top two attributes, the top three attributes, etc. It is also possible to generate a ranked list of attributes through the forward racing process: if generateRanking is set to true then a complete forward race will be run, that is, racing continues until all attributes have been selected. The order in which they are added determines a complete ranking of all the attributes. Racing uses paired and unpaired t-tests on cross-validation errors of competing subsets. When there is a significant difference between the means of the errors of two competing subsets, the poorer of the two can be eliminated from the race. Similarly, if there is no significant difference between the mean errors of two competing subsets and they are within some threshold of each other, then one can be eliminated from the race.

Last Version: 1.0.2

Release Date:

isotonicRegression

nz.ac.waikato.cms.weka : isotonicRegression

Learns an isotonic regression model. Picks the attribute that results in the lowest squared error. Missing values are not allowed. Can only deal with numeric attributes. Considers the monotonically increasing case as well as the monotonically decreasing case.
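
A minimal usage sketch, assuming the classifier class is weka.classifiers.functions.IsotonicRegression as in the standard WEKA package; "bodyfat.arff" is a placeholder all-numeric dataset with no missing values.

    import weka.classifiers.functions.IsotonicRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class IsotonicRegressionDemo {
        public static void main(String[] args) throws Exception {
            // Numeric attributes, numeric class, no missing values (see description).
            Instances data = DataSource.read("bodyfat.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            IsotonicRegression ir = new IsotonicRegression();
            ir.buildClassifier(data); // picks the attribute with the lowest squared error
            System.out.println(ir);
        }
    }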

Last Version: 1.0.2

Release Date:

classAssociationRules

nz.ac.waikato.cms.weka : classAssociationRules

Class association rules algorithms (including an implementation of the CBA algorithm). For more information see: W. Li, J. Han, J. Pei: CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. In: ICDM'01, 369-376, 2001. B. Liu, W. Hsu, Y. Ma: Integrating Classification and Association Rule Mining. In: KDD'98, 80-86, 1998.

Last Version: 1.0.3

Release Date:

ridor

nz.ac.waikato.cms.weka : ridor

An implementation of a RIpple-DOwn Rule learner. It generates a default rule first and then the exceptions for the default rule with the least (weighted) error rate. Then it generates the "best" exceptions for each exception and iterates until pure. Thus it performs a tree-like expansion of exceptions. The exceptions are a set of rules that predict classes other than the default. IREP is used to generate the exceptions. For more information about Ripple-Down Rules, see: Brian R. Gaines, Paul Compton (1995). Induction of Ripple-Down Rules Applied to Modeling Large Databases. J. Intell. Inf. Syst. 5(3):211-228.

Last Version: 1.0.2

Release Date:

cascadeKMeans

nz.ac.waikato.cms.weka : cascadeKMeans

k-means clustering with automatic selection of k. Restarts k-means and selects the best k using the Calinski and Harabasz criterion, without cross-validation.
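
A hedged sketch: the clusterer class name CascadeSimpleKMeans is an assumption about this package, and the dataset path is a placeholder. buildClusterer and numberOfClusters are the standard WEKA Clusterer methods.

    import weka.clusterers.CascadeSimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CascadeKMeansDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("unlabeled.arff"); // placeholder, no class attribute

            CascadeSimpleKMeans km = new CascadeSimpleKMeans();
            km.buildClusterer(data); // restarts k-means over a range of k internally
            System.out.println("chosen k: " + km.numberOfClusters());
        }
    }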

Last Version: 1.0.4

Release Date:

probabilisticSignificanceAE

nz.ac.waikato.cms.weka : probabilisticSignificanceAE

Evaluates the worth of an attribute by computing the Probabilistic Significance as a two-way function (attribute-classes and classes-attribute association). For more information see: Amir Ahmad, Lipika Dey (2004). A feature selection technique for classificatory analysis.

Last Version: 1.0.2

Release Date:

levenshteinEditDistance

nz.ac.waikato.cms.weka : levenshteinEditDistance

Computes the Levenshtein edit distance between two strings.
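
The underlying algorithm is compact enough to sketch. This is a generic textbook dynamic-programming implementation, not the package's own code:

    // Classic Levenshtein distance in O(|a|*|b|) time and space.
    public class Levenshtein {
        static int distance(String a, String b) {
            int[][] d = new int[a.length() + 1][b.length() + 1];
            for (int i = 0; i <= a.length(); i++) d[i][0] = i; // i deletions
            for (int j = 0; j <= b.length(); j++) d[0][j] = j; // j insertions
            for (int i = 1; i <= a.length(); i++) {
                for (int j = 1; j <= b.length(); j++) {
                    int sub = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                                d[i][j - 1] + 1),  // insertion
                                       d[i - 1][j - 1] + sub);     // substitution
                }
            }
            return d[a.length()][b.length()];
        }

        public static void main(String[] args) {
            System.out.println(distance("kitten", "sitting")); // prints 3
        }
    }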

Last Version: 1.0.2

Release Date:

grading

nz.ac.waikato.cms.weka : grading

Implements Grading. The base classifiers are "graded". For more information, see A.K. Seewald, J. Fuernkranz: An Evaluation of Grading Classifiers. In: Advances in Intelligent Data Analysis: 4th International Conference, Berlin/Heidelberg/New York/Tokyo, 115-124, 2001.

Last Version: 1.0.2

Release Date:

timeSeriesFilters

nz.ac.waikato.cms.weka : timeSeriesFilters

Provides a set of filters for time series. Currently contains PAA and SAX transformation filters and a filter that converts symbolic time series to string attribute values. The time series need to be given as values of a relation-valued attribute in the ARFF file. For example data in ARFF format, check the data directory of this package.

Last Version: 1.0.0

Release Date:

bayesianLogisticRegression

nz.ac.waikato.cms.weka : bayesianLogisticRegression

Implements Bayesian Logistic Regression for both Gaussian and Laplace priors. For more information, see Alexander Genkin, David D. Lewis, David Madigan (2004). Large-scale Bayesian logistic regression for text categorization.

Last Version: 1.0.5

Release Date:

largeScaleKernelLearning

nz.ac.waikato.cms.weka : largeScaleKernelLearning

This package provides filters to enable kernel-based learning from large datasets. It currently only contains the Nystroem method.

Last Version: 1.0.1

Release Date:

multilayerPerceptronCS

nz.ac.waikato.cms.weka : multilayerPerceptronCS

An extension of the standard MultilayerPerceptron classifier in Weka that adds context-sensitive Multiple Task Learning (csMTL).

Last Version: 1.0.2

Release Date:

distributedWekaHadoopCore

nz.ac.waikato.cms.weka : distributedWekaHadoopCore

This package provides loaders and savers for HDFS, plus Hadoop jobs and tasks that wrap the tasks provided in distributedWekaBase.

Last Version: 1.0.21

Release Date:

prefuseGraphViewer

nz.ac.waikato.cms.weka : prefuseGraphViewer

Knowledge Flow visualization component for displaying tree and graph structures from those schemes that can output them. This component is an alternative to the Knowledge Flow's built-in GraphViewer and uses the PrefuseTree and PrefuseGraph packages which, in turn, use the prefuse visualization library.

Last Version: 1.0.4

Release Date:

multiBoostAB

nz.ac.waikato.cms.weka : multiBoostAB

Class for boosting a classifier using the MultiBoosting method. MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees. MultiBoosting can be viewed as combining AdaBoost with wagging. It is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction. Using C4.5 as the base learning algorithm, Multi-boosting is demonstrated to produce decision committees with lower error than either AdaBoost or wagging significantly more often than the reverse over a large representative cross-section of UCI data sets. It offers the further advantage over AdaBoost of suiting parallel execution. For more information, see Geoffrey I. Webb (2000). MultiBoosting: A Technique for Combining Boosting and Wagging. Machine Learning. Vol.40(No.2).
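
A minimal usage sketch, assuming the standard weka.classifiers.meta.MultiBoostAB API inherited from WEKA's boosting classes (setClassifier, setNumIterations); the dataset path and committee size are placeholders.

    import weka.classifiers.meta.MultiBoostAB;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MultiBoostDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("soybean.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            MultiBoostAB mb = new MultiBoostAB();
            mb.setClassifier(new J48());  // C4.5-style base learner, as in the paper
            mb.setNumIterations(10);      // committee size (placeholder value)
            mb.buildClassifier(data);
            System.out.println(mb);
        }
    }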

Last Version: 1.0.2

Release Date:

percentageErrorMetrics

nz.ac.waikato.cms.weka : percentageErrorMetrics

Provides root mean square percentage error and mean absolute percentage error for evaluating regression schemes.

Last Version: 1.0.1

Release Date:

tabuAndScatterSearch

nz.ac.waikato.cms.weka : tabuAndScatterSearch

Search methods contributed by Adrian Pino (ScatterSearchV1, TabuSearch). ScatterSearch: Performs a Scatter Search through the space of attribute subsets. Starts with a population of many significant and diverse subsets and stops when the result is higher than a given threshold or there is no further improvement. For more information see: Felix Garcia Lopez (2004). Solving feature subset selection problem by a Parallel Scatter Search. Elsevier. TabuSearch: Abdel-Rahman Hedar, Jue Wang, Masao Fukushima (2006). Tabu Search for Attribute Reduction in Rough Set Theory.

Last Version: 1.0.2

Release Date:

ordinalClassClassifier

nz.ac.waikato.cms.weka : ordinalClassClassifier

Meta classifier that allows standard classification algorithms to be applied to ordinal class problems. For more information see: Eibe Frank, Mark Hall: A Simple Approach to Ordinal Classification. In: 12th European Conference on Machine Learning, 145-156, 2001. Robert E. Schapire, Peter Stone, David A. McAllester, Michael L. Littman, Janos A. Csirik: Modeling Auction Price Uncertainty Using Boosting-based Conditional Density Estimation. In: Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), 546-553, 2002.
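
A minimal sketch, assuming the class weka.classifiers.meta.OrdinalClassClassifier; the dataset is a placeholder whose class attribute lists its values in ordinal order.

    import weka.classifiers.meta.OrdinalClassClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class OrdinalDemo {
        public static void main(String[] args) throws Exception {
            // The order of the class attribute's values is treated as the ordinal scale.
            Instances data = DataSource.read("grades.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            OrdinalClassClassifier occ = new OrdinalClassClassifier();
            occ.setClassifier(new J48()); // standard learner applied to the derived binary problems
            occ.buildClassifier(data);
            System.out.println(occ);
        }
    }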

Last Version: 1.0.5

Release Date:

votingFeatureIntervals

nz.ac.waikato.cms.weka : votingFeatureIntervals

Classification by voting feature intervals. Intervals are constructed around each class for each attribute (basically discretization). Class counts are recorded for each interval on each attribute. Classification is by voting. For more info see: G. Demiroz, A. Guvenir: Classification by voting feature intervals. In: 9th European Conference on Machine Learning, 85-92, 1997.

Last Version: 1.0.2

Release Date:

multiLayerPerceptrons

nz.ac.waikato.cms.weka : multiLayerPerceptrons

This package currently contains classes for training multilayer perceptrons with one hidden layer, where the number of hidden units is user specified. MLPClassifier can be used for classification problems and MLPRegressor is the corresponding class for numeric prediction tasks. The former has as many output units as there are classes, the latter only one output unit. Both minimise a penalised squared error with a quadratic penalty on the (non-bias) weights, i.e., they implement "weight decay", where this penalised error is averaged over all training instances. The size of the penalty can be determined by the user by modifying the "ridge" parameter to control overfitting. The sum of squared weights is multiplied by this parameter before being added to the squared error. Both classes use BFGS optimisation by default to find parameters that correspond to a local minimum of the error function, but optionally conjugate gradient descent is available, which can be faster for problems with many parameters. Logistic functions are used as the activation functions for all units apart from the output unit in MLPRegressor, which employs the identity function. Input attributes are standardised to zero mean and unit variance. MLPRegressor also rescales the target attribute (i.e., "class") using standardisation. All network parameters are initialised with small normally distributed random values.
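
A hedged usage sketch for MLPClassifier: the numFunctions (number of hidden units) and ridge property names are assumptions about this package's API, and the dataset path and values are placeholders.

    import weka.classifiers.functions.MLPClassifier;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class MLPDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff"); // placeholder path
            data.setClassIndex(data.numAttributes() - 1);

            MLPClassifier mlp = new MLPClassifier();
            mlp.setNumFunctions(10); // hidden units (assumed property name)
            mlp.setRidge(0.01);      // weight-decay penalty described above
            mlp.buildClassifier(data);
            System.out.println(mlp);
        }
    }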

Last Version: 1.0.10

Release Date:

userClassifier

nz.ac.waikato.cms.weka : userClassifier

Interactively classify through visual means. You are presented with a scatter graph of the data against two user-selectable attributes, as well as a view of the decision tree. You can create binary splits by creating polygons around data plotted on the scatter graph, as well as by allowing another classifier to take over at points in the decision tree should you see fit. For more information see: Malcolm Ware, Eibe Frank, Geoffrey Holmes, Mark Hall, Ian H. Witten (2001). Interactive machine learning: letting users build classifiers. Int. J. Hum.-Comput. Stud. 55(3):281-292.

Last Version: 1.0.3

Release Date:

J48graft

nz.ac.waikato.cms.weka : J48graft

Class for generating a grafted (pruned or unpruned) C4.5 decision tree. For more information, see Geoff Webb: Decision Tree Grafting From the All-Tests-But-One Partition.

Last Version: 1.0.3

Release Date:

SPegasos

nz.ac.waikato.cms.weka : SPegasos

Implements the stochastic variant of the Pegasos (Primal Estimated sub-GrAdient SOlver for SVM) method of Shalev-Shwartz et al. (2007). This implementation globally replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes, so the coefficients in the output are based on the normalized data. Can either minimize the hinge loss (SVM) or log loss (logistic regression). For more information, see S. Shalev-Shwartz, Y. Singer, N. Srebro: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. In: 24th International Conference on Machine Learning, 807-814, 2007.
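
A hedged usage sketch: the HINGE and TAGS_SELECTION constants for choosing the loss are assumptions about the standard SPegasos implementation, and the dataset path is a placeholder two-class problem.

    import weka.classifiers.functions.SPegasos;
    import weka.core.Instances;
    import weka.core.SelectedTag;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SPegasosDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("breast-cancer.arff"); // placeholder two-class dataset
            data.setClassIndex(data.numAttributes() - 1);

            SPegasos sp = new SPegasos();
            // The hinge loss yields an SVM; the log-loss tag would yield logistic regression.
            sp.setLossFunction(new SelectedTag(SPegasos.HINGE, SPegasos.TAGS_SELECTION));
            sp.buildClassifier(data);
            System.out.println(sp);
        }
    }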

Last Version: 1.0.2

Release Date:

lazyBayesianRules

nz.ac.waikato.cms.weka : lazyBayesianRules

Lazy Bayesian Rules Classifier. The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. Lazy Bayesian Rules selectively relaxes the independence assumption, achieving lower error rates over a range of learning tasks. LBR defers processing to classification time, making it a highly efficient and accurate classification algorithm when small numbers of objects are to be classified. For more information, see: Zijian Zheng, G. Webb (2000). Lazy Learning of Bayesian Rules. Machine Learning. 4(1):53-84.

Last Version: 1.0.2

Release Date:

functionalTrees

nz.ac.waikato.cms.weka : functionalTrees

Functional trees (decision trees with oblique splits and functions at the leaves).

Last Version: 1.0.4

Release Date: