edu.usc.ir:age-predictor-api

Ensemble age classification from text using the PAN16, blogs, Fisher Callhome, and Cancer Forum datasets, built on Apache OpenNLP and Apache Spark.

GroupId: edu.usc.ir
ArtifactId: age-predictor-api
Last Version: 1.0
Type: jar

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/edu.usc.ir/age-predictor-api/ -->
<dependency>
    <groupId>edu.usc.ir</groupId>
    <artifactId>age-predictor-api</artifactId>
    <version>1.0</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/edu.usc.ir/age-predictor-api/
implementation 'edu.usc.ir:age-predictor-api:1.0'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/edu.usc.ir/age-predictor-api/
implementation("edu.usc.ir:age-predictor-api:1.0")

Buildr:

'edu.usc.ir:age-predictor-api:jar:1.0'

Ivy:

<dependency org="edu.usc.ir" name="age-predictor-api" rev="1.0">
  <artifact name="age-predictor-api" type="jar" />
</dependency>

Groovy Grape:

@Grapes(
@Grab(group='edu.usc.ir', module='age-predictor-api', version='1.0')
)

Scala SBT:

libraryDependencies += "edu.usc.ir" % "age-predictor-api" % "1.0"

Leiningen:

[edu.usc.ir/age-predictor-api "1.0"]

Dependencies

compile (1)

Group / Artifact                 Type   Version
edu.usc.ir : age-predictor-cli   jar    1.0

Project Modules

There are no modules declared in this project.

Author Age Prediction

This is an author age categorizer that uses the Apache OpenNLP Maximum Entropy Classifier. It takes a text sample and classifies it into one of the following age categories: xx-18|18-24|25-34|35-49|50-64|65-xx.
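To make the six categories concrete, here is a minimal Python sketch that maps a numeric age to one of the labels above. The helper name `age_category` is hypothetical and not part of this project, and the boundary handling (for example, that 18 falls into 18-24 and 65 into 65-xx) is an assumption read off the label names, not confirmed behavior of the classifier.

```python
# Hypothetical helper, not part of age-predictor-api.
# Each entry is (exclusive upper bound, label). Boundaries are assumed
# from the label names: "xx-18" is taken to mean ages below 18.
AGE_BUCKETS = [
    (18, "xx-18"),
    (25, "18-24"),
    (35, "25-34"),
    (50, "35-49"),
    (65, "50-64"),
]


def age_category(age: int) -> str:
    """Return the category label for a non-negative age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for upper_bound, label in AGE_BUCKETS:
        if age < upper_bound:
            return label
    # Anything at or above the last bound is assumed to be "65-xx".
    return "65-xx"
```

Under these assumptions, `age_category(12)` gives `xx-18` and `age_category(70)` gives `65-xx`.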

Usage

How to train an Age Classifier

Note: The training data should be line-by-line, with each line starting with the age, or age category, followed by a tab and the text associated with that age.

Usage: bin/authorage AgeClassifyTrainer [-factory factoryName] [-featureGenerators featuregens] [-tokenizer tokenizer] -model modelFile [-params paramsFile] -lang language -data sampleData [-encoding charsetName]

Arguments description:
	-factory factoryName
		a sub-class of DoccatFactory where to get implementation and resources.
	-featureGenerators featuregens
		comma separated feature generator classes. Bag of words default.
	-tokenizer tokenizer
		tokenizer implementation. WhitespaceTokenizer is used if not specified.
	-model modelFile
		output model file.
	-params paramsFile
		training parameters file.
	-lang language
		language which is being processed.
	-data sampleData
		data to be used, usually a file name.
	-encoding charsetName
		encoding for reading and writing text, if absent the system default is used.

Example Usage:

bin/authorage AgeClassifyTrainer -model model/en-ageClassify.bin -lang en -data data/train.txt -encoding UTF-8
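If the trainer is driven from a script, the argument list can be assembled programmatically. The following Python sketch builds the command in the order given by the usage line above. The function `build_train_command` and its parameter names are hypothetical; only the flags themselves come from the usage text, and the sketch does not run the tool.

```python
import shlex


def build_train_command(model, lang, data, *, factory=None,
                        feature_generators=None, tokenizer=None,
                        params=None, encoding=None,
                        launcher="bin/authorage"):
    """Build the AgeClassifyTrainer argument list (hypothetical helper)."""
    for name, value in (("model", model), ("lang", lang), ("data", data)):
        if not value:
            raise ValueError(f"{name} is required")
    # Flags appear in the same order as in the usage line.
    options = [
        ("-factory", factory),
        ("-featureGenerators", feature_generators),
        ("-tokenizer", tokenizer),
        ("-model", model),
        ("-params", params),
        ("-lang", lang),
        ("-data", data),
        ("-encoding", encoding),
    ]
    cmd = [launcher, "AgeClassifyTrainer"]
    for flag, value in options:
        if value is not None:  # optional flags are skipped when unset
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_train_command("model/en-ageClassify.bin", "en",
                              "data/train.txt", encoding="UTF-8")
    # Shows the command as a shell line; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list instead of a single string avoids shell quoting problems when paths contain spaces.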

Training data format: age and text separated by a tab on each line, as <AGE><Tab><TEXT>.
Sample training data:

12	I am just 12 year old
25	I am little bigger
35	I am mature
45	I am getting old
60	I am old like wine
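Because each sample must fit on one line with a single tab between the label and the text, stray tabs or newlines inside the text would corrupt the file. Below is a minimal Python sketch for writing and checking lines in this format. The helpers `format_sample` and `parse_sample` are hypothetical, and collapsing internal whitespace to single spaces is an assumption about acceptable preprocessing, not something the trainer is known to require.

```python
def format_sample(age, text):
    """Return one training line: <AGE><Tab><TEXT> (hypothetical helper)."""
    # Collapse tabs, newlines, and repeated spaces so the text stays on one line.
    clean = " ".join(str(text).split())
    if not clean:
        raise ValueError("text must not be empty")
    return f"{age}\t{clean}"


def parse_sample(line):
    """Split a training line into (label, text), validating the format."""
    label, sep, text = line.rstrip("\r\n").partition("\t")
    if not sep or not label.strip() or not text.strip():
        raise ValueError(f"expected <AGE><Tab><TEXT>, got: {line!r}")
    return label, text
```

The label is kept as a string because a line may start with either an age or an age category.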

How to evaluate an Age Classifier Model

Usage: bin/authorage AgeClassifyEvaluator -model model [-misclassified true|false] -data sampleData [-encoding charsetName]

Arguments description:
	-model model
		the model file to be evaluated.
	-misclassified true|false
		if true will print false negatives and false positives.
	-data sampleData
		data to be used, usually a file name.
	-encoding charsetName
		encoding for reading and writing text, if absent the system default is used.

Example Usage:

bin/authorage AgeClassifyEvaluator -model model/en-ageClassify.bin -data data/test.txt -encoding UTF-8

How to run the Age Classifier

Note: Each document must be followed by an empty line to be detected as a separate case from the others.

Usage: bin/authorage AgeClassify model < documents
Usage: bin/authorage AgePredict ./model/classify-unigram.bin ./model/regression-global.bin data/sample_test.txt
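The blank-line convention from the note above can be produced and checked with a few lines of Python. This is a sketch under the assumption that a line containing only whitespace counts as empty; `join_documents` and `split_documents` are hypothetical helpers, not part of the tool.

```python
def join_documents(docs):
    """Serialize documents so that each one is followed by an empty line."""
    parts = []
    for doc in docs:
        # Drop blank lines inside a document so it is not split by accident.
        lines = [line.strip() for line in doc.splitlines() if line.strip()]
        if lines:
            parts.append("\n".join(lines) + "\n\n")
    return "".join(parts)


def split_documents(text):
    """Split input text into documents separated by empty lines."""
    docs, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            docs.append("\n".join(current))
            current = []
    if current:  # last document without a trailing empty line
        docs.append("\n".join(current))
    return docs
```

The output of `join_documents` could then be written to a file and passed to the classifier on standard input, as in the first usage line.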

Downloads

For AgePredict to work, download en-pos-maxent.bin, en-sent.bin, and en-token.bin from http://opennlp.sourceforge.net/models-1.5/ into model/opennlp/.
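Before running AgePredict, it can help to check that the three model files are in place. The following Python sketch only inspects the local directory and does not download anything. The function `missing_models` is hypothetical, and the default path assumes the command is run from the project root.

```python
from pathlib import Path

# File names as listed in the download instructions above.
REQUIRED_MODELS = ["en-pos-maxent.bin", "en-sent.bin", "en-token.bin"]


def missing_models(model_dir="model/opennlp"):
    """Return the required model file names not found in model_dir."""
    directory = Path(model_dir)
    return [name for name in REQUIRED_MODELS
            if not (directory / name).is_file()]


if __name__ == "__main__":
    missing = missing_models()
    if missing:
        print("Missing OpenNLP models:", ", ".join(missing))
    else:
        print("All OpenNLP models found.")
```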

Citation:

If you use this work, please cite:

@inproceedings{hong2017ensemble,
  title={Ensemble Maximum Entropy Classification and Linear Regression for Author Age Prediction},
  author={Hong, Joey and Mattmann, Chris and Ramirez, Paul},
  booktitle={Information Reuse and Integration (IRI), 2017 IEEE 18th International Conference on},
  organization={IEEE},
  year={2017}
}

Contributors

  • Chris A. Mattmann, JPL & USC
  • Joey Hong, Caltech
  • Madhav Sharan, JPL & USC

License

Apache License, version 2

edu.usc.ir

USC Information Retrieval & Data Science

USC Information Retrieval and Data Science Group

Versions

Version
1.0