Corpus Statistics

Corpus Statistics plugin for GATE: provides processing resources for calculating word statistics like tf, df, tf*idf, and for calculating pairwise collocation statistics like PMI over a corpus. Can be used with GCP.

GroupId: uk.ac.gate.plugins
ArtifactId: corpusstats
Last Version: 1.3
Type: jar
Project URL: https://gatenlp.github.io/gateplugin-CorpusStats/
Project Organization: GATE Team
Source Code Management: https://github.com/GateNLP/gateplugin-CorpusStats

Download corpusstats

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/uk.ac.gate.plugins/corpusstats/ -->
<dependency>
    <groupId>uk.ac.gate.plugins</groupId>
    <artifactId>corpusstats</artifactId>
    <version>1.3</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/uk.ac.gate.plugins/corpusstats/
implementation 'uk.ac.gate.plugins:corpusstats:1.3'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/uk.ac.gate.plugins/corpusstats/
implementation("uk.ac.gate.plugins:corpusstats:1.3")

Buildr:

'uk.ac.gate.plugins:corpusstats:jar:1.3'

Ivy:

<dependency org="uk.ac.gate.plugins" name="corpusstats" rev="1.3">
  <artifact name="corpusstats" type="jar" />
</dependency>

Grape:

@Grapes(
    @Grab(group='uk.ac.gate.plugins', module='corpusstats', version='1.3')
)

SBT:

libraryDependencies += "uk.ac.gate.plugins" % "corpusstats" % "1.3"

Leiningen:

[uk.ac.gate.plugins/corpusstats "1.3"]

Dependencies

compile (1)

Group / Artifact                       Type  Version
org.apache.commons : commons-math3     jar   3.6.1

provided (1)

Group / Artifact                       Type  Version
uk.ac.gate : gate-core                 jar   8.5

test (1)

Group / Artifact                       Type  Version
uk.ac.gate : gate-plugin-test-utils    jar   8.5

Project Modules

There are no modules declared in this project.

gateplugin-CorpusStats

A plugin for the GATE language technology framework for calculating various term and term pair statistics over a corpus.

The plugin implements the following PRs:

  • CorpusStatsTfIdfPR for processing a whole corpus and creating files that contain corpus statistics like document frequency, term frequency, total number of documents, etc.
  • AssignStatsTfIdfPR for processing a corpus and using the corpus statistics file created with the CorpusStatsTfIdfPR to add features to terms in each document of the corpus. This can be used to create features for scores like tf (term frequency), wtf (weighted term frequency), ltfidf (logarithmic term frequency times inverse document frequency), and others.
  • CorpusStatsCollocationsPR for processing a corpus and creating TSV files that contain corpus statistics like PMI, Chi-Squared and others for all pairs of terms.
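
To make the statistics named above concrete, here is a minimal, self-contained Java sketch of term frequency, document frequency, one common tf*idf weighting, and PMI. It does not use the plugin or the GATE API and is not the plugin's implementation: the class name CorpusStatsSketch, the tf * ln(N / df) weighting, and the choice to count co-occurrence at the document level are all assumptions made for illustration. The plugin's own variants (wtf, ltfidf, and so on) may be defined differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; not part of the corpusstats plugin.
public class CorpusStatsSketch {

    /** Term frequency: how often each term occurs in a single document. */
    public static Map<String, Integer> termFrequency(List<String> doc) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    /** Document frequency: the number of documents containing each term at least once. */
    public static Map<String, Integer> documentFrequency(List<List<String>> corpus) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : corpus) {
            // De-duplicate so each document counts once per term.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Assumed weighting: tf * ln(nDocs / df). Other variants exist. */
    public static double tfIdf(int tf, int df, int nDocs) {
        return tf * Math.log((double) nDocs / df);
    }

    /**
     * PMI in bits, log2( P(a,b) / (P(a) * P(b)) ), with probabilities estimated
     * from document-level co-occurrence (an assumption of this sketch).
     * Returns -Infinity if the terms never co-occur, NaN if either never occurs.
     */
    public static double pmi(String a, String b, List<List<String>> corpus) {
        int n = corpus.size();
        int na = 0, nb = 0, nab = 0;
        for (List<String> doc : corpus) {
            boolean hasA = doc.contains(a);
            boolean hasB = doc.contains(b);
            if (hasA) na++;
            if (hasB) nb++;
            if (hasA && hasB) nab++;
        }
        // (nab/n) / ((na/n) * (nb/n)) simplifies to nab * n / (na * nb).
        double ratio = ((double) nab * n) / ((double) na * nb);
        return Math.log(ratio) / Math.log(2);
    }

    public static void main(String[] args) {
        // Toy corpus: each document is already tokenized into terms.
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("gate", "plugin", "corpus"),
            Arrays.asList("corpus", "statistics", "corpus"),
            Arrays.asList("gate", "corpus", "plugin", "plugin"));

        Map<String, Integer> df = documentFrequency(corpus);
        Map<String, Integer> tf = termFrequency(corpus.get(2));

        System.out.println("df = " + df);
        System.out.println("tf(doc 3) = " + tf);
        System.out.println("tf*idf(plugin, doc 3) = "
            + tfIdf(tf.get("plugin"), df.get("plugin"), corpus.size()));
        System.out.println("PMI(gate, plugin) = " + pmi("gate", "plugin", corpus));
    }
}
```

In this toy corpus "corpus" appears in every document, so its tf*idf under the assumed weighting is zero, while "gate" and "plugin" always appear together and get a positive PMI.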

More documentation: https://gatenlp.github.io/gateplugin-CorpusStats/

Versions

Version
1.3
1.2