ClearWSD CLI

Command line interfaces for non-programmatic training and experimentation.

Categories: CLI User Interface
GroupId: io.github.clearwsd
ArtifactId: clearwsd-cli
Last Version: 0.12.1
Type: jar

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/io.github.clearwsd/clearwsd-cli/ -->
<dependency>
    <groupId>io.github.clearwsd</groupId>
    <artifactId>clearwsd-cli</artifactId>
    <version>0.12.1</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/io.github.clearwsd/clearwsd-cli/
implementation 'io.github.clearwsd:clearwsd-cli:0.12.1'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/io.github.clearwsd/clearwsd-cli/
implementation ("io.github.clearwsd:clearwsd-cli:0.12.1")

Buildr:

'io.github.clearwsd:clearwsd-cli:jar:0.12.1'

Ivy:

<dependency org="io.github.clearwsd" name="clearwsd-cli" rev="0.12.1">
  <artifact name="clearwsd-cli" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='io.github.clearwsd', module='clearwsd-cli', version='0.12.1')
)

SBT:

libraryDependencies += "io.github.clearwsd" % "clearwsd-cli" % "0.12.1"

Leiningen:

[io.github.clearwsd/clearwsd-cli "0.12.1"]

Dependencies

compile (5)

Group / Artifact                        Type  Version
io.github.clearwsd : clearwsd-stanford  jar   0.12.1
org.mapdb : mapdb                       jar   3.0.7
com.google.guava : guava                jar   27.0-jre
org.slf4j : slf4j-api                   jar   1.7.25
ch.qos.logback : logback-classic        jar   1.2.3

provided (1)

Group / Artifact            Type  Version
org.projectlombok : lombok  jar   1.18.4

test (1)

Group / Artifact  Type  Version
junit : junit     jar   4.12

Project Modules

There are no modules declared in this project.

ClearWSD

ClearWSD is a word sense disambiguation tool for the JVM, with core modules available under an Apache 2.0 license. It provides simple APIs for integration with other libraries, as well as a command-line interface (CLI) for non-programmatic use. It is modular, allowing for alternative implementations of sub-components such as parsers or resources used for feature extraction.

It is meant for use in both research and production settings. Main features include:

  • State-of-the-art results in verb sense disambiguation over VerbNet classes
  • Automatic optimization of feature subsets and hyperparameters
  • Production-ready pre-trained models
  • Easy training of new models using CLI
  • 1000+ sense predictions per second on a 2014 MacBook Pro

API

The easiest way to use ClearWSD in your project is through Maven, by adding the corresponding ClearWSD dependencies to your project's pom.xml.

Releases are distributed through Maven Central.

To try out ClearWSD in your project, you will need to include three modules, the first being clearwsd-core:

<dependency>
  <groupId>io.github.clearwsd</groupId>
  <artifactId>clearwsd-core</artifactId>
  <version>0.12.1</version>
</dependency>

and the second being a parser module, used for pre-processing and feature extraction. A wrapper for the NLP4J dependency parser is provided:

<dependency>
  <groupId>io.github.clearwsd</groupId>
  <artifactId>clearwsd-nlp4j</artifactId>
  <version>0.12.1</version>
</dependency>

Finally, to use pre-trained word sense disambiguation models (compatible with NLP4J), just add the following:

<dependency>
  <groupId>io.github.clearwsd</groupId>
  <artifactId>clearwsd-models</artifactId>
  <version>0.12.1</version>
</dependency>

You can then try out a pre-trained model (from OntoNotes) with the following:

import java.util.List;

import io.github.clearwsd.DefaultSensePredictor;
import io.github.clearwsd.SensePrediction;
import io.github.clearwsd.corpus.ontonotes.OntoNotesSense;
import io.github.clearwsd.parser.Nlp4jDependencyParser;

public class Test {
    public static void main(String[] args) {
        Nlp4jDependencyParser parser = new Nlp4jDependencyParser(); // load dependency parser
        DefaultSensePredictor<OntoNotesSense> wsd = DefaultSensePredictor.loadFromResource(
                "models/nlp4j-ontonotes.bin", parser); // load WSD model

        String sentence = "Mary took the bus to school (which " // 8 --> travel by means of
                + "took about 30 minutes), and studiously "     // 3 --> require or necessitate
                + "took notes about the Bolsheviks "            // 2 --> light verb usage
                + "taking over the Winter Palace";              // 9 --> claim or conquer, become in control of

        List<String> tokens = parser.tokenize(sentence); // split sentence into tokens

        // display sense predictions and their definitions
        for (SensePrediction<OntoNotesSense> prediction : wsd.predict(tokens)) {
            System.out.println(prediction.sense().getNumber() + " --> " + prediction.sense().getName());
        }
    }
}

Command Line Interface

ClearWSD provides a command-line interface for training, evaluation, and application of word sense disambiguation models.

To build ClearWSD, you will need Java 8 or above and Apache Maven.

On OS X/Linux, you can then build the project for CLI use:

git clone https://github.com/clearwsd/clearwsd.git
cd clearwsd
mvn package -DskipTests -P build-nlp4j-cli

To use the Stanford Parser wrapper module (GPL licensed) instead, use build-stanford-cli:

mvn package -DskipTests -P build-stanford-cli

You can see a help message and available options with the following command (assuming you have already followed the CLI setup instructions):

java -jar clearwsd-cli-*.jar --help
Usage: WordSenseCLI [options]
  Options:
    -model, -m
      Path to classifier model (for loading or saving)
    -input, -i
      Path to unlabeled input file for new predictions
    -train, -t
      Path to training data (required for training)
    -valid, -dev, -v
      Path to validation data
    -cv, -folds
      Number of cross-validation folds
      Default: 0
    -test
      Path to test data
    --itl, --interactive, --loop
      Start an interactive test session on provided model (after training 
      and/or testing)
      Default: false
    --om
      Output misses on evaluation data in separate files
      Default: false
    --reparse
      Reparse, even if a parsed file of the same name already exists
      Default: false
    --help, --usage
      Display usage
    -corpus
      Training/evaluation corpus type
      Default: Semlink
      Possible Values: [Semeval, Semlink]
    -dataExt
      Extension for training data file (only needed for Semeval XML corpora)
      Default: .data.xml
    -ext
      Parse file extension, appended to input file names to save parses
      Default: .dep
    -inventory, -inv
      Sense inventory
      Possible Values: [VerbNet, WordNet, OntoNotes, Counting]
    -inventoryPath
      Sense inventory path (optional)
    -keyExt
      Extension for sense key file (only needed for Semeval XML corpora)
      Default: .gold.key.txt
    -output, -o
      Path to output file where predictions on the input file are stored

Training

To train a new model, you must specify the path to a training data file with -train, as well as a path for the resulting saved model, using -model:

java -jar clearwsd-cli-*.jar -train path/to/training/file.txt -model path/to/save/model.bin

The default corpus (Semlink) expects files with an instance per line in the following format:

document_id <space> sentence_id <space> token# <space> lemma <space> sense_label <tab> sentence_text

sentence_text should be a single sentence containing the instance, with tokens separated by spaces:

example.txt 25 3 get comprehend-87.2-1	Oh , I get it .
example.txt 57 2 get get-13.5.1-1	Did you get that part ?
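
The line format above can be read with a few lines of plain Java. The following is a minimal sketch, not ClearWSD's own reader: the class name SemlinkLine and its fields are hypothetical, and it assumes exactly five space-separated fields before the tab. In both example lines, token# points at the target verb when counting tokens from zero, so the sketch treats it as a zero-based index.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the ClearWSD API.
public class SemlinkLine {
    public final String documentId;
    public final int sentenceId;
    public final int tokenIndex;      // assumed zero-based, as the examples suggest
    public final String lemma;
    public final String senseLabel;
    public final List<String> tokens; // sentence_text split on spaces

    private SemlinkLine(String documentId, int sentenceId, int tokenIndex,
                        String lemma, String senseLabel, List<String> tokens) {
        this.documentId = documentId;
        this.sentenceId = sentenceId;
        this.tokenIndex = tokenIndex;
        this.lemma = lemma;
        this.senseLabel = senseLabel;
        this.tokens = tokens;
    }

    public static SemlinkLine parse(String line) {
        // The tab separates the five metadata fields from the sentence text.
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("No tab before sentence text: " + line);
        }
        String[] fields = line.substring(0, tab).split(" ");
        if (fields.length != 5) {
            throw new IllegalArgumentException(
                    "Expected 5 fields before the tab, found " + fields.length);
        }
        List<String> tokens = Arrays.asList(line.substring(tab + 1).trim().split(" +"));
        return new SemlinkLine(fields[0], Integer.parseInt(fields[1]),
                Integer.parseInt(fields[2]), fields[3], fields[4], tokens);
    }

    public static void main(String[] args) {
        SemlinkLine instance = parse("example.txt 25 3 get comprehend-87.2-1\tOh , I get it .");
        System.out.println(instance.lemma + " -> " + instance.senseLabel
                + " (token " + instance.tokenIndex + ": "
                + instance.tokens.get(instance.tokenIndex) + ")");
    }
}
```

A check like this can be useful for validating a training file before passing it to -train, since a missing tab or a misaligned token index is easy to overlook by eye.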

Evaluation

The CLI provides several modes of evaluation/application. You can perform cross-validation, test on a specific dataset, apply a trained model to raw text, or try out a model interactively by typing in test sentences.

Cross Validation

Specify the number of folds with -cv. For example, -cv 5 runs 5-fold cross-validation:

java -jar clearwsd-cli-*.jar -train path/to/training/file.txt -cv 5
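
For readers unfamiliar with k-fold cross-validation, the sketch below shows the general idea: the training instances are partitioned into k folds, and each fold is held out once for evaluation while the rest are used for training. This is a generic illustration with hypothetical names (KFoldSketch, split, trainingSet); it uses a simple round-robin assignment and is not necessarily how ClearWSD partitions its data, which may shuffle or stratify instances.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Generic k-fold partitioning sketch; not ClearWSD's implementation.
public class KFoldSketch {

    // Assigns instance i to fold (i % k), giving k folds of near-equal size.
    public static <T> List<List<T>> split(List<T> instances, int k) {
        if (k < 2 || k > instances.size()) {
            throw new IllegalArgumentException(
                    "k must be between 2 and the number of instances");
        }
        List<List<T>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            folds.add(new ArrayList<T>());
        }
        for (int i = 0; i < instances.size(); i++) {
            folds.get(i % k).add(instances.get(i));
        }
        return folds;
    }

    // The training set for a fold is every instance outside the held-out fold.
    public static <T> List<T> trainingSet(List<List<T>> folds, int heldOut) {
        List<T> train = new ArrayList<>();
        for (int f = 0; f < folds.size(); f++) {
            if (f != heldOut) {
                train.addAll(folds.get(f));
            }
        }
        return train;
    }

    public static void main(String[] args) {
        List<Integer> instances = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<List<Integer>> folds = split(instances, 5);
        for (int f = 0; f < folds.size(); f++) {
            System.out.println("fold " + f + ": test=" + folds.get(f)
                    + " train=" + trainingSet(folds, f));
        }
    }
}
```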

Test Dataset

Specify a test file with -test:

java -jar clearwsd-cli-*.jar -test path/to/test/file.txt -model path/to/trained/model.bin

Application

To apply a trained model to new (raw) data, specify a path with -input. Optionally specify an output path with -output:

java -jar clearwsd-cli-*.jar -input path/to/raw/data.txt -output path/to/predictions.txt \
-model clearwsd-models/src/main/resources/models/nlp4j-ontonotes.bin

Interactive Testing

--loop or --itl can be used to start an interactive command line test loop, where you can input sentences and see predictions.

java -jar clearwsd-cli-*.jar --loop -model clearwsd-models/src/main/resources/models/nlp4j-verbnet-3.3.bin

After the parser and model finish loading, you should then be able to enter test sentences and see predicted senses:

Enter test input ("EXIT" to quit).
> please take notes

Please
take[25.2]
notes

> Take the train home.

Take[51.4.3]
the
train
home

> Take on the government

Take[98]
on
the
government

> Take the money out of the vault

Take[13.5.1]
the
money
out
of
the
vault
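
If you capture this output and want to pull out only the disambiguated tokens, a small amount of Java is enough. The sketch below assumes the format shown in the session above, with one token per line and the predicted sense appended in square brackets; the class and method names are hypothetical and not part of ClearWSD.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical post-processing helper; assumes "token[sense]" lines as in the sample session.
public class SenseOutputSketch {

    // A token followed by a bracketed sense label, e.g. "take[25.2]".
    private static final Pattern ANNOTATED = Pattern.compile("(.+)\\[([^\\]]+)\\]");

    // Returns (token, sense) pairs for lines that carry a sense label, in input order.
    public static List<Map.Entry<String, String>> annotated(List<String> lines) {
        List<Map.Entry<String, String>> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ANNOTATED.matcher(line.trim());
            if (m.matches()) {
                result.add(new SimpleEntry<>(m.group(1), m.group(2)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> output = Arrays.asList(
                "Take[13.5.1]", "the", "money", "out", "of", "the", "vault");
        for (Map.Entry<String, String> entry : annotated(output)) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```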

License

Please refer to the LICENSE.txt in individual modules.

Versions

Version
0.12.1
0.12.0
0.10.0