concrete-stanford
Provides a library that takes Communication objects containing Sections and annotates them using the Stanford CoreNLP framework. This produces Communication objects with Tokenization objects and, optionally, EntityMention and Entity objects.
Maven dependency
<dependency>
<groupId>edu.jhu.hlt</groupId>
<artifactId>concrete-stanford</artifactId>
<version>x.y.z</version>
</dependency>
See pom.xml for latest version.
Sectioned Communications are required
All examples assume the input files contain Communication objects with, at minimum, Section objects underneath them and the text field set. This library will not produce useful output if the Communication objects it is run on have no Section objects. There are two primary drivers: one that processes tokenized Concrete files, and one that processes non-tokenized Concrete files. Each has its own requirements, described below.
Running over a directory of tar gz files with qsub
If you have a directory of .tar.gz files you want to run through Stanford, see this script on how to do that via qsub.
Make sure to build the project before running the script.
Quick start / API Usage
Load in a Communication with Sections that have TextSpans:
// Sections are required for useful output
Communication withSections = ...;
// You need to know what language the Communication is written in
String language = "en";
Then create an annotator object for the language of the Communication. The following example shows the AnnotateNonTokenizedConcrete tool.
PipelineLanguage lang = PipelineLanguage.getEnumeration(language);
AnnotateNonTokenizedConcrete analytic = new AnnotateNonTokenizedConcrete(lang);
Run over the Communication:
// Option 1: Wrap the Communication in an appropriate wrapper to ensure pre-reqs are handled
// Below throws a MiscommunicationException if there are no Sections or there are Sentences
// within the Sections.
NonSentencedSectionedCommunication wc = new NonSentencedSectionedCommunication(withSections);
StanfordPostNERCommunication annotated = analytic.annotate(wc);
// Call 'getRoot()' to get the root, unwrapped Communication.
Communication unwrapped = annotated.getRoot();
// Option 2: Do not wrap the Communication, and handle the possible exception.
// Below will throw if the passed in Communication 'withSections' is invalid
// for the analytic.
StanfordPostNERCommunication annotated = analytic.annotate(withSections);
Communication unwrapped = annotated.getRoot();
annotated is a Communication with the output of the system. This includes sentences and tokenizations and, depending on the annotator, entity mentions and entities as well.
StanfordPostNERCommunication is a utility wrapper that allows easier access to members; see here for the implementations.
Running as a command-line program
You can also run this tool as a command-line program: both AnnotateTokenizedConcrete and AnnotateNonTokenizedConcrete can be run via the command line.
- Argument 1: a path to a file on disk that is either a serialized Concrete Communication (ending with .concrete), a .tar file of serialized Concrete Communication objects, or a .tar.gz file with serialized Concrete Communication objects. Recall that each Communication must have Section objects and must have text fields set.
- Argument 2: a path that represents the desired output. The below are supported:
Input | Result |
---|---|
.concrete or .comm file | Produces a single new .concrete or .comm file |
.tar file with Communication objects | Produces a single .tar file with annotated Communications |
.tar.gz ... | Produces a single .tar.gz file with annotated Communications |
Alternatively, you can pass in a directory as output. If only a directory is used as output, the file name from the input will be used and extension mirrored (e.g., if .tar is input, .tar.gz will be output).
- Argument 3 (optional): The language to use. Currently supported are en and cn (for English and Chinese). The default is en.
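The argument rules above can be sketched in a few lines of plain Java. This is only an illustration of the documented behavior, not the tool's actual argument parser: the class name CliArgsSketch, the InputKind enum, and both method names are hypothetical, and the sketch assumes that extensions are matched case-sensitively.

```java
/** Hypothetical helper for illustration; NOT part of concrete-stanford. */
final class CliArgsSketch {

    /** The three input kinds listed in the table above. */
    enum InputKind { SINGLE_COMMUNICATION, TAR, TAR_GZ }

    /** Classifies an input path by its file extension (assumed case-sensitive). */
    static InputKind classify(String path) {
        if (path.endsWith(".tar.gz")) {
            return InputKind.TAR_GZ;
        }
        if (path.endsWith(".tar")) {
            return InputKind.TAR;
        }
        if (path.endsWith(".concrete") || path.endsWith(".comm")) {
            return InputKind.SINGLE_COMMUNICATION;
        }
        throw new IllegalArgumentException("Unsupported input extension: " + path);
    }

    /** Returns the language from optional argument 3, defaulting to "en". */
    static String language(String[] args) {
        if (args.length < 3) {
            return "en";
        }
        String lang = args[2];
        if (lang.equals("en") || lang.equals("cn")) {
            return lang;
        }
        throw new IllegalArgumentException("Unsupported language: " + lang);
    }

    public static void main(String[] args) {
        // Example invocation with only the two required arguments.
        String[] example = {"docs.tar.gz", "out/"};
        System.out.println(classify(example[0]) + " " + language(example));
    }
}
```

Rejecting unknown extensions and languages up front is a design choice of this sketch; the real tool may report such problems differently.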
Known Annotators
concrete-stanford can annotate both pre-tokenized text and text that has not been tokenized.
By default, all annotators add named entity recognition, part-of-speech, lemmatization, a constituency parse and three dependency parses (converted deterministically from the constituency parse).
Non-Tokenized Input
The main annotator for non-tokenized input is AnnotateNonTokenizedConcrete. It requires sectioned data, and each section must have valid textSpans set.
In addition to the above added annotations, AnnotateNonTokenizedConcrete will add entity mention identification and coreference.
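As a rough illustration of these prerequisites, the sketch below checks a document before it would be handed to the non-tokenized annotator. It deliberately uses small stand-in classes (Doc, Section, Span) rather than the real Concrete types, and all names in it are hypothetical. It also assumes that a "valid" textSpan means 0 <= start <= end <= text length, which this README does not define.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-in types for illustration only; these are NOT the Concrete classes. */
final class PrereqSketch {

    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    static final class Section {
        final Span textSpan;      // null when unset
        final int sentenceCount;  // Sentences already present in this Section
        Section(Span textSpan, int sentenceCount) {
            this.textSpan = textSpan;
            this.sentenceCount = sentenceCount;
        }
    }

    static final class Doc {
        final String text;        // null when unset
        final List<Section> sections;
        Doc(String text, List<Section> sections) {
            this.text = text;
            this.sections = sections;
        }
    }

    /** Returns a list of problems; an empty list means the checks passed. */
    static List<String> problems(Doc doc) {
        List<String> out = new ArrayList<>();
        if (doc.text == null || doc.text.isEmpty()) {
            out.add("text field is not set");
        }
        if (doc.sections.isEmpty()) {
            out.add("no Section objects");
        }
        int len = doc.text == null ? 0 : doc.text.length();
        for (int i = 0; i < doc.sections.size(); i++) {
            Section s = doc.sections.get(i);
            Span sp = s.textSpan;
            // Assumed validity rule: 0 <= start <= end <= text length.
            if (sp == null || sp.start < 0 || sp.end < sp.start || sp.end > len) {
                out.add("section " + i + " has no valid textSpan");
            }
            if (s.sentenceCount > 0) {
                out.add("section " + i + " already has Sentences");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello world.";
        List<Section> sections = new ArrayList<>();
        sections.add(new Section(new Span(0, text.length()), 0));
        System.out.println(problems(new Doc(text, sections)));
    }
}
```

The check for existing Sentences mirrors the Quick start note that the NonSentencedSectionedCommunication wrapper rejects Communications whose Sections already contain Sentences.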
Tokenized Input
The main annotator for tokenized input is AnnotateTokenizedConcrete. It requires fully tokenized data; each {Section, Sentence, Token} must have valid textSpans set.
Running the tool
Prepare
Replace the environment variables in the code below with directories that represent your input and output.
TLDR
The following should be compliant in any sh-like shell.
Be sure to change [en | cn] to either en or cn, depending on what language your documents are in.
export CONC_STAN_INPUT_FILE=/path/to/.concrete/or/.tar/or/.tar.gz
export CONC_STAN_OUTPUT_DIR=/path/to/output/dir
mvn clean compile assembly:single
java -cp target/*.jar edu.jhu.hlt.concrete.stanford.AnnotateNonTokenizedConcrete \
$CONC_STAN_INPUT_FILE \
$CONC_STAN_OUTPUT_DIR \
[en | cn]
Using Dockerized AnnotateCommunicationService
The Dockerfile stands up a server implementing Concrete's AnnotateCommunicationService. An image built from this Dockerfile is available on Docker Hub as hltcoe/concrete-stanford, and can be pulled using:
docker pull hltcoe/concrete-stanford
To see what command line flags are supported, run:
docker run hltcoe/concrete-stanford --help
At minimum, you must specify a language (currently, either en or cn) using the --language flag, e.g.:
docker run hltcoe/concrete-stanford --language en
The concrete-stanford AnnotateCommunicationService requires Communications that have been at least section-segmented. See the "Known Annotators" section above for more details about the type of data concrete-stanford expects.