age-predictor

Ensemble age classification from text using the PAN16, blogs, Fisher Callhome, and Cancer Forum datasets, built with Apache OpenNLP and Apache Spark.

GroupId: edu.usc.ir
ArtifactId: age-predictor
Last Version: 1.0
Type: pom
Project URL: http://maven.apache.org
Source Code Management: https://github.com/USCDataScience/AgePredictor.git

Download age-predictor

Filename Size
age-predictor-1.0.pom 5 KB

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/edu.usc.ir/age-predictor/ -->
<dependency>
    <groupId>edu.usc.ir</groupId>
    <artifactId>age-predictor</artifactId>
    <version>1.0</version>
    <type>pom</type>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/edu.usc.ir/age-predictor/
implementation 'edu.usc.ir:age-predictor:1.0'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/edu.usc.ir/age-predictor/
implementation("edu.usc.ir:age-predictor:1.0")

Buildr:

'edu.usc.ir:age-predictor:pom:1.0'

Ivy:

<dependency org="edu.usc.ir" name="age-predictor" rev="1.0">
  <artifact name="age-predictor" type="pom" />
</dependency>

Grape:

@Grapes(
    @Grab(group='edu.usc.ir', module='age-predictor', version='1.0')
)

SBT:

libraryDependencies += "edu.usc.ir" % "age-predictor" % "1.0"

Leiningen:

[edu.usc.ir/age-predictor "1.0"]

Dependencies

There are no dependencies for this project. It is a standalone project that does not depend on any other jars.

Project Modules

  • age-predictor-opennlp
  • age-predictor-cli
  • age-predictor-api
  • age-predictor-assembly

Author Age Prediction

This is an author age categorizer that leverages the Apache OpenNLP Maximum Entropy Classifier. It takes a text sample and classifies it into one of the following age categories: xx-18|18-24|25-34|35-49|50-64|65-xx.
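To make the category scheme concrete, here is a minimal Python sketch that maps a numeric age onto the labels listed above. It is not part of the project; the function name `age_to_category` is hypothetical, and the boundary handling is an assumption, since the labels xx-18 and 18-24 both mention 18. The sketch treats xx-18 as "under 18".

```python
# Category labels copied from the list above. How the project itself
# resolves the shared boundary at 18 is not stated, so this is an assumption.
AGE_CATEGORIES = ["xx-18", "18-24", "25-34", "35-49", "50-64", "65-xx"]


def age_to_category(age: int) -> str:
    """Map a numeric age to one of the category labels.

    Assumption: "xx-18" means strictly under 18, so 18 falls in "18-24".
    """
    if age < 0:
        raise ValueError(f"age must be non-negative, got {age}")
    if age < 18:
        return "xx-18"
    if age <= 24:
        return "18-24"
    if age <= 34:
        return "25-34"
    if age <= 49:
        return "35-49"
    if age <= 64:
        return "50-64"
    return "65-xx"


# Ages taken from the sample training data in this README.
print([age_to_category(a) for a in (12, 25, 35, 45, 60)])
# prints ['xx-18', '25-34', '35-49', '35-49', '50-64']
```

A helper like this could be used to convert raw ages into category labels before training, if category labels are preferred over numeric ages.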

Usage

How to train an Age Classifier

Note: The training data should contain one sample per line, with each line starting with the age, or age category, followed by a tab and the text associated with that age.

Usage: bin/authorage AgeClassifyTrainer [-factory factoryName] [-featureGenerators featuregens] [-tokenizer tokenizer] -model modelFile [-params paramsFile] -lang language -data sampleData [-encoding charsetName]

Arguments description:
	-factory factoryName
		a sub-class of DoccatFactory where to get implementation and resources.
	-featureGenerators featuregens
		comma-separated feature generator classes. Bag of words is the default.
	-tokenizer tokenizer
		tokenizer implementation. WhitespaceTokenizer is used if not specified.
	-model modelFile
		output model file.
	-params paramsFile
		training parameters file.
	-lang language
		language which is being processed.
	-data sampleData
		data to be used, usually a file name.
	-encoding charsetName
		encoding for reading and writing text, if absent the system default is used.

Example Usage:

bin/authorage AgeClassifyTrainer -model model/en-ageClassify.bin -lang en -data data/train.txt -encoding UTF-8

Training data format: age and text separated by a tab on each line, as in <AGE><Tab><TEXT>.
Sample training data:

12	I am just 12 year old
25	I am little bigger
35	I am mature
45	I am getting old
60	I am old like wine
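The tab-separated format is easy to break when the text itself contains tabs or newlines. The following Python sketch shows one way to write and check lines in the `<AGE><Tab><TEXT>` format before training. The helper names `format_training_line` and `parse_training_line` are hypothetical and not part of the project, and collapsing internal whitespace is an assumption about what is acceptable for your data.

```python
from typing import Tuple


def format_training_line(age, text: str) -> str:
    """Build one '<AGE><Tab><TEXT>' line.

    Tabs and newlines inside the text would break the one-sample-per-line
    format, so runs of whitespace are collapsed to single spaces (assumption).
    """
    clean = " ".join(str(text).split())
    if not clean:
        raise ValueError("text must not be empty")
    return f"{age}\t{clean}"


def parse_training_line(line: str) -> Tuple[str, str]:
    """Split one training line into (age_or_category, text)."""
    label, sep, text = line.rstrip("\n").partition("\t")
    if not sep:
        raise ValueError("missing tab separator between age and text")
    label, text = label.strip(), text.strip()
    if not label or not text:
        raise ValueError("age and text must both be non-empty")
    return label, text


# Round trip using a sample from this README.
line = format_training_line(12, "I am just 12 year old")
print(parse_training_line(line))
# prints ('12', 'I am just 12 year old')
```

Running `parse_training_line` over every line of a file such as data/train.txt is a cheap way to catch malformed samples before invoking the trainer.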

How to evaluate an Age Classifier Model

Usage: bin/authorage AgeClassifyEvaluator -model model [-misclassified true|false] -data sampleData [-encoding charsetName]

Arguments description:
	-model model
		the model file to be evaluated.
	-misclassified true|false
		if true, prints false negatives and false positives.
	-data sampleData
		data to be used, usually a file name.
	-encoding charsetName
		encoding for reading and writing text, if absent the system default is used.

Example Usage:

bin/authorage AgeClassifyEvaluator -model model/en-ageClassify.bin -data data/test.txt -encoding UTF-8

How to run the Age Classifier

Note: Each document must be followed by an empty line to be detected as a separate case from the others.

Usage: bin/authorage AgeClassify model < documents
Usage: bin/authorage AgePredict ./model/classify-unigram.bin ./model/regression-global.bin data/sample_test.txt
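Since each document must be followed by an empty line, a small script can prepare the input for AgeClassify. This Python sketch is an illustration only; the function name `to_classifier_input` is hypothetical, and dropping blank lines inside a document is an assumption made so that they are not mistaken for document separators.

```python
from typing import Iterable


def to_classifier_input(documents: Iterable[str]) -> str:
    """Join documents so that each one is followed by an empty line."""
    blocks = []
    for doc in documents:
        # Remove blank lines inside a document; otherwise one document
        # could be read as several separate cases (assumption).
        lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
        if lines:
            blocks.append("\n".join(lines))
    # Each block ends with a newline plus one empty line.
    return "".join(block + "\n\n" for block in blocks)


docs = ["I am just 12 year old", "I am old\n\nlike wine"]
print(repr(to_classifier_input(docs)))
# prints 'I am just 12 year old\n\nI am old\nlike wine\n\n'
```

The returned string can be written to a file and passed to the classifier on standard input, as in the first usage line above.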

Downloads

For AgePredict to work, you need to download en-pos-maxent.bin, en-sent.bin, and en-token.bin from http://opennlp.sourceforge.net/models-1.5/ and place them in model/opennlp/.
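The download step can be scripted. The sketch below uses only the Python standard library and assumes that each model file is served directly under the URL given above (base URL plus file name), which this README does not state explicitly. The function names `planned_downloads` and `download_models` are hypothetical.

```python
import urllib.request
from pathlib import Path

# URL, file names, and target directory are taken from the text above.
BASE_URL = "http://opennlp.sourceforge.net/models-1.5/"
MODEL_FILES = ["en-pos-maxent.bin", "en-sent.bin", "en-token.bin"]
TARGET_DIR = Path("model/opennlp")


def planned_downloads(base_url: str = BASE_URL, target_dir: Path = TARGET_DIR):
    """Return (url, destination) pairs; assumes files sit directly under base_url."""
    return [(base_url + name, target_dir / name) for name in MODEL_FILES]


def download_models(overwrite: bool = False) -> None:
    """Fetch each model file into TARGET_DIR, skipping files already present."""
    TARGET_DIR.mkdir(parents=True, exist_ok=True)
    for url, dest in planned_downloads():
        if dest.exists() and not overwrite:
            continue
        urllib.request.urlretrieve(url, str(dest))
```

Calling `download_models()` once from the project root should leave the three files in model/opennlp/, provided the server layout matches the assumption above.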

Citation:

If you use this work, please cite:

@inproceedings{hong2017ensemble,
  title={Ensemble Maximum Entropy Classification and Linear Regression for Author Age Prediction},
  author={Hong, Joey and Mattmann, Chris and Ramirez, Paul},
  booktitle={Information Reuse and Integration (IRI), 2017 IEEE 18th International Conference on},
  organization={IEEE},
  year={2017}
}

Contributors

  • Chris A. Mattmann, JPL & USC
  • Joey Hong, Caltech
  • Madhav Sharan, JPL & USC

License

Apache License, version 2

edu.usc.ir

USC Information Retrieval and Data Science Group
