Weka

denormalize

nz.ac.waikato.cms.weka : denormalize

An instance filter that collapses instances with a common grouping ID value into a single instance. Useful for converting transactional data into a format that Weka's association rule learners can handle. IMPORTANT: assumes that the incoming batch of instances has been sorted on the grouping attribute. The values of nominal attributes are converted to indicator attributes. These can be either binary (with f and t values) or unary with missing values used to indicate absence. The latter is Weka's old market basket format, which is useful for Apriori. Numeric attributes can be aggregated within groups by computing the average, sum, minimum or maximum.
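
A minimal usage sketch in Java. The filter class name below is assumed from the package name (check the package documentation for the exact class); as noted above, the input must already be sorted on the grouping attribute.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Denormalize; // assumed class name

public class DenormalizeExample {
    public static void main(String[] args) throws Exception {
        // Transactional data, pre-sorted on the grouping attribute.
        Instances transactions = DataSource.read("transactions.arff");
        Denormalize f = new Denormalize(); // assumed class name
        f.setInputFormat(transactions);
        // One output instance per group; nominal values become indicators.
        Instances basket = Filter.useFilter(transactions, f);
        System.out.println(basket);
    }
}
```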

Last Version: 1.0.3

Release Date:

distributedWekaHadoop

nz.ac.waikato.cms.weka : distributedWekaHadoop

This package provides loaders and savers for HDFS, plus Hadoop jobs and tasks that wrap the tasks provided in distributedWekaBase. Includes libraries for Hadoop 1.1.2.

Last Version: 1.0.15

Release Date:

distributedWekaHadoopCore

nz.ac.waikato.cms.weka : distributedWekaHadoopCore

This package provides loaders and savers for HDFS, plus Hadoop jobs and tasks that wrap the tasks provided in distributedWekaBase.

Last Version: 1.0.21

Release Date:

dualPerturbAndCombine

nz.ac.waikato.cms.weka : dualPerturbAndCombine

Class for building and using classification and regression trees based on the closed-form dual perturb and combine algorithm described in Pierre Geurts, Louis Wehenkel: Closed-form dual perturb and combine for tree-based models. In: Proceedings of the 22nd International Conference on Machine Learning, 233-240, 2005.

Last Version: 1.0.0

Release Date:

elasticNet

nz.ac.waikato.cms.weka : elasticNet

An implementation of the elastic net method for linear regression.

Last Version: 1.0.1

Release Date:

ensembleLibrary

nz.ac.waikato.cms.weka : ensembleLibrary

Manages a library of ensemble classifiers.

Last Version: 1.0.4

Release Date:

ensemblesOfNestedDichotomies

nz.ac.waikato.cms.weka : ensemblesOfNestedDichotomies

A meta classifier for handling multi-class datasets with 2-class classifiers by building an ensemble of nested dichotomies. For more info, check Lin Dong, Eibe Frank, Stefan Kramer: Ensembles of Balanced Nested Dichotomies for Multi-class Problems. In: PKDD, 84-95, 2005. Eibe Frank, Stefan Kramer: Ensembles of nested dichotomies for multi-class problems. In: Twenty-first International Conference on Machine Learning, 2004.

Last Version: 1.0.6

Release Date:

extraTrees

nz.ac.waikato.cms.weka : extraTrees

Package for generating a single Extra-Tree. Use with the RandomCommittee meta classifier to generate an Extra-Trees forest for classification or regression. This classifier requires all predictors to be numeric. Missing values are not allowed. Instance weights are taken into account. For more information, see Pierre Geurts, Damien Ernst, Louis Wehenkel (2006). Extremely randomized trees. Machine Learning. 63(1):3-42.

Last Version: 1.0.2

Release Date:

filteredAttributeSelection

nz.ac.waikato.cms.weka : filteredAttributeSelection

This package provides two meta attribute selection evaluators that can apply an arbitrary filter to the input data before executing the actual attribute selection scheme. One filters data and then passes it to an attribute evaluator (FilteredAttributeEval), and the other filters data and then passes it to a subset evaluator (FilteredSubsetEval).

Last Version: 1.0.2

Release Date:

functionalTrees

nz.ac.waikato.cms.weka : functionalTrees

Functional trees (decision trees with oblique splits and functions at the leaves).

Last Version: 1.0.4

Release Date:

fuzzyLaticeReasoning

nz.ac.waikato.cms.weka : fuzzyLaticeReasoning

The Fuzzy Lattice Reasoning Classifier uses the notion of Fuzzy Lattices for creating a Reasoning Environment. The current version can be used for classification using numeric predictors. For more information see: I. N. Athanasiadis, V. G. Kaburlasos, P. A. Mitkas, V. Petridis: Applying Machine Learning Techniques on Air Quality Data for Real-Time Decision Support. In: 1st Intl. NAISO Symposium on Information Technologies in Environmental Engineering (ITEE-2003), Gdansk, Poland, 2003; V. G. Kaburlasos, I. N. Athanasiadis, P. A. Mitkas, V. Petridis (2003). Fuzzy Lattice Reasoning (FLR) Classifier and its Application on Improved Estimation of Ambient Ozone Concentration.

Last Version: 1.0.2

Release Date:

fuzzyUnorderedRuleInduction

nz.ac.waikato.cms.weka : fuzzyUnorderedRuleInduction

FURIA: Fuzzy Unordered Rule Induction Algorithm. For details please see: Jens Christian Huehn, Eyke Huellermeier (2009). FURIA: An Algorithm for Unordered Fuzzy Rule Induction. Data Mining and Knowledge Discovery.

Last Version: 1.0.2

Release Date:

generalizedSequentialPatterns

nz.ac.waikato.cms.weka : generalizedSequentialPatterns

Class implementing a GSP algorithm for discovering sequential patterns in a sequential data set. The attribute identifying the distinct data sequences contained in the set can be determined by the respective option. Furthermore, the set of output results can be restricted by specifying one or more attributes that have to be contained in each element/itemset of a sequence. For further information see: Ramakrishnan Srikant, Rakesh Agrawal (1996). Mining Sequential Patterns: Generalizations and Performance Improvements.

Last Version: 1.0.2

Release Date:

grading

nz.ac.waikato.cms.weka : grading

Implements Grading. The base classifiers are "graded". For more information, see A.K. Seewald, J. Fuernkranz: An Evaluation of Grading Classifiers. In: Advances in Intelligent Data Analysis: 4th International Conference, Berlin/Heidelberg/New York/Tokyo, 115-124, 2001.

Last Version: 1.0.2

Release Date:

gridSearch

nz.ac.waikato.cms.weka : gridSearch

Performs a grid search of parameter pairs for a classifier (Y-axis, default is LinearRegression with the "Ridge" parameter) and the PLSFilter (X-axis, "# of Components") and chooses the best pair found for the actual prediction. The initial grid is worked on with 2-fold CV to determine the values of the parameter pairs for the selected type of evaluation (e.g., accuracy). The best point in the grid is then taken and a 10-fold CV is performed with the adjacent parameter pairs. If a better pair is found, then this will act as new center and another 10-fold CV will be performed (a kind of hill-climbing). This process is repeated until no better pair is found or the best pair is on the border of the grid. In case the best pair is on the border, one can let GridSearch automatically extend the grid and continue the search. Check out the properties 'gridIsExtendable' (option '-extend-grid') and 'maxGridExtensions' (option '-max-grid-extensions <num>'). GridSearch can handle doubles, integers (values are just cast to int) and booleans (0 is false, otherwise true). float, char and long are supported as well. The best filter/classifier setup can be accessed after the buildClassifier call via the getBestFilter/getBestClassifier methods. Note on the implementation: after the data has been passed through the filter, a default NumericCleaner filter is applied to the data in order to avoid numbers that get too small and might produce NaNs in other schemes.
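
Driving this from code is straightforward; a minimal sketch using the default grid and the getBestFilter/getBestClassifier hooks mentioned above (the import path for GridSearch is assumed here to be weka.classifiers.meta):

```java
import weka.classifiers.meta.GridSearch; // assumed import path
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GridSearchExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");
        data.setClassIndex(data.numAttributes() - 1);

        GridSearch gs = new GridSearch(); // default grid as described above
        gs.buildClassifier(data);         // coarse grid, then hill-climbing

        // Best setup found during the search.
        System.out.println(gs.getBestFilter());
        System.out.println(gs.getBestClassifier());
    }
}
```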

Last Version: 1.0.12

Release Date:

hiddenNaiveBayes

nz.ac.waikato.cms.weka : hiddenNaiveBayes

Constructs a Hidden Naive Bayes classification model with high classification accuracy and AUC. For more information refer to: H. Zhang, L. Jiang, J. Su: Hidden Naive Bayes. In: Twentieth National Conference on Artificial Intelligence, 919-924, 2005.

Last Version: 1.0.2

Release Date:

hotSpot

nz.ac.waikato.cms.weka : hotSpot

HotSpot learns a set of rules (displayed in a tree-like structure) that maximize/minimize a target variable/value of interest. With a nominal target, one might want to look for segments of the data where there is a high probability of a minority value occurring (given the constraint of a minimum support). For a numeric target, one might be interested in finding segments where the target is higher on average than in the whole data set. For example, in a health insurance scenario, find which health insurance groups are at the highest risk (have the highest claim ratio), or which groups have the highest average insurance payout.

Last Version: 1.0.14

Release Date:

hyperPipes

nz.ac.waikato.cms.weka : hyperPipes

Class implementing a HyperPipe classifier. For each category a HyperPipe is constructed that contains all points of that category (essentially records the attribute bounds observed for each category). Test instances are classified according to the category that "most contains the instance". Does not handle a numeric class or missing values in test cases. Extremely simple algorithm, but has the advantage of being extremely fast, and works quite well when you have "smegloads" of attributes.
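
A toy sketch of the idea for purely numeric attributes (the actual classifier also handles nominal attributes): record per-category bounds, then pick the category whose bounds contain the most attribute values of the test instance.

```java
import java.util.Arrays;

// Toy HyperPipe-style classifier for numeric attributes only.
public class HyperPipeSketch {
    private final double[][] min, max; // [class][attribute] bounds

    public HyperPipeSketch(double[][] x, int[] cls, int numClasses) {
        int d = x[0].length;
        min = new double[numClasses][d];
        max = new double[numClasses][d];
        for (double[] row : min) Arrays.fill(row, Double.POSITIVE_INFINITY);
        for (double[] row : max) Arrays.fill(row, Double.NEGATIVE_INFINITY);
        for (int i = 0; i < x.length; i++)
            for (int a = 0; a < d; a++) {
                min[cls[i]][a] = Math.min(min[cls[i]][a], x[i][a]);
                max[cls[i]][a] = Math.max(max[cls[i]][a], x[i][a]);
            }
    }

    // The category that "most contains" the instance wins.
    public int classify(double[] x) {
        int best = 0, bestCount = -1;
        for (int c = 0; c < min.length; c++) {
            int count = 0;
            for (int a = 0; a < x.length; a++)
                if (x[a] >= min[c][a] && x[a] <= max[c][a]) count++;
            if (count > bestCount) { bestCount = count; best = c; }
        }
        return best;
    }
}
```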

Last Version: 1.0.2

Release Date:

isolationForest

nz.ac.waikato.cms.weka : isolationForest

Class for building and using a classifier built on the Isolation Forest anomaly detection algorithm. For more information see Fei Tony Liu, Kai Ming Ting, Zhi-Hua Zhou: Isolation Forest. In: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 413-422, 2008.

Last Version: 1.0.2

Release Date:

isotonicRegression

nz.ac.waikato.cms.weka : isotonicRegression

Learns an isotonic regression model. Picks the attribute that results in the lowest squared error. Missing values are not allowed. Can only deal with numeric attributes. Considers the monotonically increasing case as well as the monotonically decreasing case.
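
The description doesn't name the fitting procedure; the standard algorithm for the monotonically increasing fit is pool adjacent violators (PAV), sketched here for equally weighted points already ordered by the chosen attribute (the decreasing case is the same on negated targets).

```java
// Pool Adjacent Violators: given targets y ordered by the predictor
// attribute, produce the monotonically increasing fit minimising
// squared error.
public class PavSketch {
    public static double[] fit(double[] y) {
        int n = y.length;
        double[] level = new double[n]; // mean value of each merged block
        int[] size = new int[n];        // number of points in each block
        int blocks = 0;
        for (int i = 0; i < n; i++) {
            level[blocks] = y[i];
            size[blocks++] = 1;
            // Merge while the last two blocks violate monotonicity.
            while (blocks > 1 && level[blocks - 2] > level[blocks - 1]) {
                level[blocks - 2] =
                    (level[blocks - 2] * size[blocks - 2]
                        + level[blocks - 1] * size[blocks - 1])
                    / (size[blocks - 2] + size[blocks - 1]);
                size[blocks - 2] += size[blocks - 1];
                blocks--;
            }
        }
        double[] out = new double[n];
        for (int b = 0, i = 0; b < blocks; b++)
            for (int j = 0; j < size[b]; j++) out[i++] = level[b];
        return out;
    }
}
```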

Last Version: 1.0.2

Release Date:

iterativeAbsoluteErrorRegression

nz.ac.waikato.cms.weka : iterativeAbsoluteErrorRegression

Provides a regression scheme that uses Schlossmacher's iteratively reweighted least squares method to fit a model that minimizes absolute error. The scheme can be used with any base learner in WEKA that performs least-squares regression.

Last Version: 1.0.0

Release Date:

javaFXScatter3D

nz.ac.waikato.cms.weka : javaFXScatter3D

A visualization component for displaying a 3D scatter plot of the data using JavaFX 3D. Requires JavaFX to be available at runtime. This version adds built-in sampling controls to the GUI. The default sampling percentage is set so that a maximum of 5000 instances are plotted. The user can adjust this higher or lower to suit their available processing speed and memory.

Last Version: 1.0.0

Release Date:

kernelLogisticRegression

nz.ac.waikato.cms.weka : kernelLogisticRegression

This package contains a classifier that can be used to train a two-class kernel logistic regression model with the kernel functions that are available in WEKA. It optimises the negative log-likelihood with a quadratic penalty. Both BFGS and conjugate gradient descent are available as optimisation methods, but the former is normally faster. It is possible to use multiple threads, but the speed-up is generally very marginal when used with BFGS optimisation. With conjugate gradient descent optimisation, greater speed-ups can be achieved when using multiple threads. With the default kernel, the dot product kernel, this method produces results that are close to identical to those obtained using standard logistic regression in WEKA, provided a sufficiently large value for the parameter determining the size of the quadratic penalty is used in both cases.
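
As a sketch of the objective being optimised (parameter names here are illustrative, and the package's exact parameterisation may differ): with a kernel expansion f(x_i) = sum_j alpha_j K(x_i, x_j) and labels y in {-1, +1}, the penalised negative log-likelihood is sum_i log(1 + exp(-y_i f(x_i))) + lambda * alpha' K alpha.

```java
// Penalised negative log-likelihood for two-class kernel logistic
// regression. K is the kernel matrix over the training data; alpha and
// lambda are illustrative names, not the package's API.
public class KlrObjectiveSketch {
    public static double objective(double[][] K, double[] alpha,
                                   int[] y, double lambda) {
        int n = y.length;
        double nll = 0, penalty = 0;
        for (int i = 0; i < n; i++) {
            double f = 0; // f(x_i) = (K alpha)_i
            for (int j = 0; j < n; j++) f += alpha[j] * K[i][j];
            nll += Math.log1p(Math.exp(-y[i] * f));
            penalty += alpha[i] * f; // accumulates alpha' K alpha
        }
        return nll + lambda * penalty;
    }
}
```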

Last Version: 1.0.0

Release Date:

kfGroovy

nz.ac.waikato.cms.weka : kfGroovy

Knowledge Flow plugin that provides a Knowledge Flow step that wraps around a Groovy script. The plugin generates a fully compilable template Groovy script that implements various Knowledge Flow interfaces. The user can fill in the methods that are necessary to accomplish the desired logic. The script is compiled at runtime and the Groovy component passes incoming events to the script and collects and passes on generated events.

Last Version: 1.0.12

Release Date:

kfKettle

nz.ac.waikato.cms.weka : kfKettle

Knowledge Flow step that provides an entry point for data coming from the Kettle ETL tool.

Last Version: 1.0.5

Release Date:

kfPMMLClassifierScoring

nz.ac.waikato.cms.weka : kfPMMLClassifierScoring

A Knowledge Flow plugin that provides a Knowledge Flow step for scoring test sets or instance streams using a PMML classifier.

Last Version: 1.0.3

Release Date:

largeScaleKernelLearning

nz.ac.waikato.cms.weka : largeScaleKernelLearning

This package provides filters to enable kernel-based learning from large datasets. It currently only contains the Nystroem method.

Last Version: 1.0.1

Release Date:

latentSemanticAnalysis

nz.ac.waikato.cms.weka : latentSemanticAnalysis

Performs latent semantic analysis and transformation of the data. Use in conjunction with a Ranker search. A low-rank approximation of the full data is found by specifying the number of singular values to use. The dataset may be transformed to give the relation of either the attributes or the instances (default) to the concept space created by the transformation.

Last Version: 1.0.5

Release Date:

lazyBayesianRules

nz.ac.waikato.cms.weka : lazyBayesianRules

Lazy Bayesian Rules Classifier. The naive Bayesian classifier provides a simple and effective approach to classifier learning, but its attribute independence assumption is often violated in the real world. Lazy Bayesian Rules selectively relaxes the independence assumption, achieving lower error rates over a range of learning tasks. LBR defers processing to classification time, making it a highly efficient and accurate classification algorithm when small numbers of objects are to be classified. For more information, see: Zijian Zheng, G. Webb (2000). Lazy Learning of Bayesian Rules. Machine Learning. 41(1):53-84.

Last Version: 1.0.2

Release Date:

leastMedSquared

nz.ac.waikato.cms.weka : leastMedSquared

Implements a least median squared linear regression utilizing the existing Weka LinearRegression class to form predictions. Least squared regression functions are generated from random subsamples of the data. The least squared regression with the lowest median squared error is chosen as the final model. The basis of the algorithm is Peter J. Rousseeuw, Annick M. Leroy (1987). Robust regression and outlier detection.
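
A sketch of that selection loop, written directly against Weka's LinearRegression (subsample size and count are left to the caller; this illustrates the idea, it is not the package's implementation):

```java
import java.util.Arrays;
import java.util.Random;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;

public class LeastMedSqSketch {
    // Keep the least-squares fit whose median squared residual over the
    // full dataset is smallest.
    public static LinearRegression fit(Instances data, int numSamples,
                                       int sampleSize, Random rand)
            throws Exception {
        LinearRegression best = null;
        double bestMedian = Double.POSITIVE_INFINITY;
        for (int s = 0; s < numSamples; s++) {
            Instances shuffled = new Instances(data);
            shuffled.randomize(rand);
            Instances sub = new Instances(shuffled, 0, sampleSize); // subsample
            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(sub);
            double[] sq = new double[data.numInstances()];
            for (int i = 0; i < sq.length; i++) {
                double r = data.instance(i).classValue()
                         - lr.classifyInstance(data.instance(i));
                sq[i] = r * r;
            }
            Arrays.sort(sq);
            double median = sq[sq.length / 2];
            if (median < bestMedian) { bestMedian = median; best = lr; }
        }
        return best;
    }
}
```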

Last Version: 1.0.2

Release Date:

levenshteinEditDistance

nz.ac.waikato.cms.weka : levenshteinEditDistance

Computes the Levenshtein edit distance between two strings.
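
For reference, the standard two-row dynamic program for the Levenshtein distance (unit costs for insertion, deletion and substitution):

```java
public class LevenshteinSketch {
    public static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j; // edits from ""
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int sub = prev[j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(sub, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```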

Last Version: 1.0.2

Release Date:

linearForwardSelection

nz.ac.waikato.cms.weka : linearForwardSelection

Extension of BestFirst. Takes a restricted number of k attributes into account. Fixed-set selects a fixed number k of attributes, whereas fixed-width increases k in each step. The search uses either the initial ordering to select the top k attributes, or performs a ranking (with the same evaluator the search uses later on). The search direction can be forward, or floating forward selection (with optional backward search steps). For more information see: Martin Guetlein (2006). Large Scale Attribute Selection Using Wrappers. Freiburg, Germany.

Last Version: 1.0.2

Release Date:

localOutlierFactor

nz.ac.waikato.cms.weka : localOutlierFactor

A filter that applies the LOF (Local Outlier Factor) algorithm to compute an outlier score for each instance in the data. Can use multiple cores/cpus to speed up the LOF computation for large datasets. Nearest neighbor search methods and distance functions are pluggable. For more information, see: Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jorg Sander (2000). LOF: Identifying Density-Based Local Outliers. ACM SIGMOD Record. 29(2):93-104.
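
A minimal usage sketch, assuming the filter class is weka.filters.unsupervised.attribute.LOF (check the package for the exact name); the filter appends an outlier-score attribute to each instance:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.LOF; // assumed class name

public class LofExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data.arff");
        LOF lof = new LOF();
        lof.setInputFormat(data);
        // Output keeps the original attributes and adds an LOF score.
        Instances scored = Filter.useFilter(data, lof);
        System.out.println(scored);
    }
}
```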

Last Version: 1.0.4

Release Date:

logarithmicErrorMetrics

nz.ac.waikato.cms.weka : logarithmicErrorMetrics

Provides root mean square logarithmic error and mean absolute logarithmic error for evaluating regression schemes.
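
The two metrics in their common log(1 + x) form, as a sketch (the package may define them slightly differently):

```java
// Root mean square logarithmic error (RMSLE) and mean absolute
// logarithmic error; assumes non-negative actual/predicted values.
public class LogErrorSketch {
    public static double rmsle(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++) {
            double d = Math.log1p(predicted[i]) - Math.log1p(actual[i]);
            sum += d * d;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static double male(double[] actual, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < actual.length; i++)
            sum += Math.abs(Math.log1p(predicted[i]) - Math.log1p(actual[i]));
        return sum / actual.length;
    }
}
```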

Last Version: 1.0.0

Release Date:

metaCost

nz.ac.waikato.cms.weka : metaCost

This metaclassifier makes its base classifier cost-sensitive using the method specified in Pedro Domingos: MetaCost: A general method for making classifiers cost-sensitive. In: Fifth International Conference on Knowledge Discovery and Data Mining, 155-164, 1999. This classifier should produce similar results to one created by passing the base learner to Bagging, which is in turn passed to a CostSensitiveClassifier operating on minimum expected cost. The difference is that MetaCost produces a single cost-sensitive classifier of the base learner, giving the benefits of fast classification and interpretable output (if the base learner itself is interpretable). This implementation uses all bagging iterations when reclassifying training data (the MetaCost paper reports a marginal improvement when only those iterations containing each training instance are used in reclassifying that instance).

Last Version: 1.0.3

Release Date:

multiBoostAB

nz.ac.waikato.cms.weka : multiBoostAB

Class for boosting a classifier using the MultiBoosting method. MultiBoosting is an extension to the highly successful AdaBoost technique for forming decision committees. MultiBoosting can be viewed as combining AdaBoost with wagging. It is able to harness both AdaBoost's high bias and variance reduction with wagging's superior variance reduction. Using C4.5 as the base learning algorithm, Multi-boosting is demonstrated to produce decision committees with lower error than either AdaBoost or wagging significantly more often than the reverse over a large representative cross-section of UCI data sets. It offers the further advantage over AdaBoost of suiting parallel execution. For more information, see Geoffrey I. Webb (2000). MultiBoosting: A Technique for Combining Boosting and Wagging. Machine Learning. Vol.40(No.2).

Last Version: 1.0.2

Release Date:

multiLayerPerceptrons

nz.ac.waikato.cms.weka : multiLayerPerceptrons

This package currently contains classes for training multilayer perceptrons with one hidden layer, where the number of hidden units is user specified. MLPClassifier can be used for classification problems and MLPRegressor is the corresponding class for numeric prediction tasks. The former has as many output units as there are classes, the latter only one output unit. Both minimise a penalised squared error with a quadratic penalty on the (non-bias) weights, i.e., they implement "weight decay", where this penalised error is averaged over all training instances. The size of the penalty can be determined by the user by modifying the "ridge" parameter to control overfitting. The sum of squared weights is multiplied by this parameter before being added to the squared error. Both classes use BFGS optimisation by default to find parameters that correspond to a local minimum of the error function, but conjugate gradient descent is optionally available, which can be faster for problems with many parameters. Logistic functions are used as the activation functions for all units apart from the output unit in MLPRegressor, which employs the identity function. Input attributes are standardised to zero mean and unit variance. MLPRegressor also rescales the target attribute (i.e., "class") using standardisation. All network parameters are initialised with small normally distributed random values.
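
The penalised error described above, as a sketch for the single-output case (the actual implementation may include additional scaling, e.g. a factor of 1/2):

```java
// Squared error averaged over training instances, plus the "ridge"
// weight-decay penalty on the non-bias weights.
public class WeightDecaySketch {
    public static double penalisedError(double[] predictions,
                                        double[] targets,
                                        double[] nonBiasWeights,
                                        double ridge) {
        double sse = 0;
        for (int i = 0; i < targets.length; i++) {
            double d = predictions[i] - targets[i];
            sse += d * d;
        }
        double penalty = 0;
        for (double w : nonBiasWeights) penalty += w * w;
        return sse / targets.length + ridge * penalty;
    }
}
```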

Last Version: 1.0.10

Release Date:

multilayerPerceptronCS

nz.ac.waikato.cms.weka : multilayerPerceptronCS

An extension of the standard MultilayerPerceptron classifier in Weka that adds context-sensitive Multiple Task Learning (csMTL).

Last Version: 1.0.2

Release Date:

naiveBayesTree

nz.ac.waikato.cms.weka : naiveBayesTree

Class for generating a decision tree with naive Bayes classifiers at the leaves. For more information, see Ron Kohavi: Scaling Up the Accuracy of Naive-Bayes Classifiers: A Decision-Tree Hybrid. In: Second International Conference on Knowledge Discovery and Data Mining, 202-207, 1996.

Last Version: 1.0.2

Release Date:

niftiLoader

nz.ac.waikato.cms.weka : niftiLoader

Package for loading a directory containing MRI data in NIfTI format. The directory to be loaded must contain as many subdirectories as there are classes of MRI data. Each subdirectory name will be used as the class label for the corresponding .nii files in that subdirectory. (This is the same strategy as the one used by WEKA's TextDirectoryLoader.) Currently, the package only reads volume information for the first time slot from each .nii file. The readDoubleVol(short ttt) method from the Nifti1Dataset class (http://niftilib.sourceforge.net/java_api_html/Nifti1Dataset.html) is used to read the data for each volume into a sparse WEKA instance (with ttt=0). For an LxMxN volume (the dimensions must be the same for each .nii file in the directory!), the order of values in the generated instance is [(z_1, y_1, x_1), ..., (z_1, y_1, x_L), (z_1, y_2, x_1), ..., (z_1, y_M, x_L), (z_2, y_1, x_1), ..., (z_N, y_M, x_L)]. If the volume is an image, then only x and y coordinates are used.

Last Version: 1.0.1

Release Date:

oneClassClassifier

nz.ac.waikato.cms.weka : oneClassClassifier

Performs one-class classification on a dataset. The classifier reduces the class being classified to just a single class, and learns the data without using any information from other classes. The testing stage will classify as 'target' or 'outlier' - so in order to calculate the outlier pass rate the dataset must contain information from more than one class. Also, the output varies depending on whether the label 'outlier' exists in the instances used to build the classifier. If so, then 'outlier' will be predicted; if not, then the label will be considered missing when the prediction does not favour the target class. The 'outlier' class will not be used to build the model if there are instances of this class in the dataset. It can simply be used as a flag; you do not need to relabel any classes. For more information, see: Kathryn Hempstalk, Eibe Frank, Ian H. Witten: One-Class Classification by Combining Density and Class Probability Estimation. In: Proceedings of the 12th European Conference on Principles and Practice of Knowledge Discovery in Databases and 19th European Conference on Machine Learning, ECMLPKDD2008, Berlin, 505-519, 2008.

Last Version: 1.0.4

Release Date:

ordinalClassClassifier

nz.ac.waikato.cms.weka : ordinalClassClassifier

Meta classifier that allows standard classification algorithms to be applied to ordinal class problems. For more information see: Eibe Frank, Mark Hall: A Simple Approach to Ordinal Classification. In: 12th European Conference on Machine Learning, 145-156, 2001. Robert E. Schapire, Peter Stone, David A. McAllester, Michael L. Littman, Janos A. Csirik: Modeling Auction Price Uncertainty Using Boosting-based Conditional Density Estimation. In: Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), 546-553, 2002.

Last Version: 1.0.5

Release Date:

ordinalLearningMethod

nz.ac.waikato.cms.weka : ordinalLearningMethod

An implementation of the Ordinal Learning Method (OLM). Further information regarding the algorithm and variants can be found in: Arie Ben-David (1992). Automatic Generation of Symbolic Multiattribute Ordinal Knowledge-Based DSSs: Methodology and Applications. Decision Sciences. 23:1357-1372.

Last Version: 1.0.2

Release Date:

ordinalStochasticDominance

nz.ac.waikato.cms.weka : ordinalStochasticDominance

An implementation of the Ordinal Stochastic Dominance Learner. Further information regarding the OSDL-algorithm can be found in: S. Lievens, B. De Baets, K. Cao-Van (2006). A Probabilistic Framework for the Design of Instance-Based Supervised Ranking Algorithms in an Ordinal Setting. Annals of Operations Research; Kim Cao-Van (2003). Supervised ranking: from semantics to algorithms; Stijn Lievens (2004). Studie en implementatie van instantie-gebaseerde algoritmen voor gesuperviseerd rangschikken (Study and implementation of instance-based algorithms for supervised ranking).

Last Version: 1.0.2

Release Date:

paceRegression

nz.ac.waikato.cms.weka : paceRegression

Class for building pace regression linear models and using them for prediction. Under regularity conditions, pace regression is provably optimal when the number of coefficients tends to infinity. It consists of a group of estimators that are either overall optimal or optimal under certain conditions. The current work on pace regression theory, and therefore also this implementation, does not handle: missing values, non-binary nominal attributes, or the case that n - k is small, where n is the number of instances and k is the number of coefficients (the threshold used in this implementation is 20). For more information see: Wang, Y. (2000). A new approach to fitting linear models in high dimensional spaces. Hamilton, New Zealand. Wang, Y., Witten, I. H.: Modeling for optimal probability prediction. In: Proceedings of the Nineteenth International Conference in Machine Learning, Sydney, Australia, 650-657, 2002.

Last Version: 1.0.2

Release Date:

percentageErrorMetrics

nz.ac.waikato.cms.weka : percentageErrorMetrics

Provides root mean square percentage error and mean absolute percentage error for evaluating regression schemes.

Last Version: 1.0.1

Release Date:

phmm4weka

nz.ac.waikato.cms.weka : phmm4weka

This Java software implements Profile Hidden Markov Models (PHMMs) for protein classification for the WEKA workbench. Standard PHMMs and newly introduced binary PHMMs are used. In addition, the software allows propositionalisation of PHMMs.

Last Version: 1.1.3

Release Date:

prefuseGraphViewer

nz.ac.waikato.cms.weka : prefuseGraphViewer

Knowledge Flow visualization component for displaying tree and graph structures from those schemes that can output them. This component is an alternative to the Knowledge Flow's built-in GraphViewer and uses the PrefuseTree and PrefuseGraph packages which, in turn, use the prefuse visualization library.

Last Version: 1.0.4

Release Date:

probabilisticSignificanceAE

nz.ac.waikato.cms.weka : probabilisticSignificanceAE

Evaluates the worth of an attribute by computing the Probabilistic Significance as a two-way function (attribute-classes and classes-attribute association). For more information see: Amir Ahmad, Lipika Dey (2004). A feature selection technique for classificatory analysis.

Last Version: 1.0.2

Release Date:

probabilityCalibrationTrees

nz.ac.waikato.cms.weka : probabilityCalibrationTrees

Provides probability calibration trees (PCTs) for local calibration of class probability estimates. To achieve calibration of a base learner, the PCT class must be used as the meta learner in the CascadeGeneralization class, which is also included in this package. The classifier to be calibrated must be used as the base learner in the CascadeGeneralization class. The CascadeGeneralization class can also be used independently to perform CascadeGeneralization for ensemble learning. The code for PCTs is largely the same as the LMT code for growing logistic model trees. For more details, see the ACML paper on probability calibration trees.

Last Version: 1.0.0

Release Date: