Concrete - Agiga Converter

Library providing utilities for converting Annotated Gigaword documents to the Concrete NLP data schema.

GroupId: edu.jhu.hlt
ArtifactId: concrete-agiga
Last Version: 4.4.0
Type: jar
Project URL: https://github.com/hltcoe/concrete-agiga
Project Organization: Johns Hopkins University HLTCOE
Source Code Management: https://github.com/hltcoe/concrete-agiga

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/edu.jhu.hlt/concrete-agiga/ -->
<dependency>
    <groupId>edu.jhu.hlt</groupId>
    <artifactId>concrete-agiga</artifactId>
    <version>4.4.0</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/edu.jhu.hlt/concrete-agiga/
implementation 'edu.jhu.hlt:concrete-agiga:4.4.0'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/edu.jhu.hlt/concrete-agiga/
implementation ("edu.jhu.hlt:concrete-agiga:4.4.0")

Buildr:

'edu.jhu.hlt:concrete-agiga:jar:4.4.0'

Ivy:

<dependency org="edu.jhu.hlt" name="concrete-agiga" rev="4.4.0">
  <artifact name="concrete-agiga" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='edu.jhu.hlt', module='concrete-agiga', version='4.4.0')
)

SBT:

libraryDependencies += "edu.jhu.hlt" % "concrete-agiga" % "4.4.0"

Leiningen:

[edu.jhu.hlt/concrete-agiga "4.4.0"]

Dependencies

compile (4)

Group / Artifact                      Type  Version
edu.stanford.nlp : stanford-corenlp   jar   3.4
edu.jhu.agiga : agiga                 jar   1.4
edu.jhu.hlt : concrete-util           jar   4.4.3
edu.jhu.hlt : concrete-validation     jar   4.4.3

test (1)

Group / Artifact   Type  Version
junit : junit      jar   4.11

Project Modules

There are no modules declared in this project.

concrete-agiga

concrete-agiga is a Java library that maps Annotated Gigaword documents to Concrete.

Maven dependency

<dependency>
  <groupId>edu.jhu.hlt</groupId>
  <artifactId>concrete-agiga</artifactId>
  <version>4.4.0</version>
</dependency>

TLDR / Quick start

mvn clean compile assembly:single
java -cp target/concrete-agiga-4.4.0-jar-with-dependencies.jar \
    edu.jhu.hlt.concrete.agiga.AgigaConverter \
    path/to/output/dir \
    drop-annotations \
    path/to/xml/or/xml/gz/file

Arguments:

  • path/to/output/dir - directory where the output files are written
  • drop-annotations - boolean (true or false); whether to drop the annotations present in the .xml files
    • set to true for RAW files and to false for ANNOTATED files
  • path/to/xml/or/xml/gz/file - paths to one or more .xml or .xml.gz files to process
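
The argument order above can be sketched as a small Java helper. This is only an illustration of the positional layout (output directory, boolean flag, then one or more input files). The class QuickStartArgs and its fields are hypothetical names, it is not taken from AgigaConverter's source, and it assumes the flag is spelled exactly true or false.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: mirrors the quick-start argument order only.
// It is NOT the parsing code used by AgigaConverter.
public class QuickStartArgs {
    public final Path outputDir;          // args[0]
    public final boolean dropAnnotations; // args[1], assumed to be "true" or "false"
    public final List<Path> inputs;       // args[2..], .xml or .xml.gz files

    QuickStartArgs(Path outputDir, boolean dropAnnotations, List<Path> inputs) {
        this.outputDir = outputDir;
        this.dropAnnotations = dropAnnotations;
        this.inputs = inputs;
    }

    public static QuickStartArgs parse(String[] args) {
        if (args.length < 3) {
            throw new IllegalArgumentException(
                "usage: <output-dir> <drop-annotations> <input-file>...");
        }
        String flag = args[1];
        // Reject anything but the two literals instead of silently reading it as false.
        if (!flag.equals("true") && !flag.equals("false")) {
            throw new IllegalArgumentException("drop-annotations must be true or false");
        }
        List<Path> inputs = new ArrayList<>();
        for (int i = 2; i < args.length; i++) {
            inputs.add(Paths.get(args[i]));
        }
        return new QuickStartArgs(Paths.get(args[0]), Boolean.parseBoolean(flag), inputs);
    }

    public static void main(String[] args) {
        QuickStartArgs parsed = parse(new String[] {"out", "true", "a.xml", "b.xml.gz"});
        System.out.println(parsed.outputDir + " " + parsed.dropAnnotations + " " + parsed.inputs);
    }
}
```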

Requirements:

  • java >= 1.8
  • mvn >= 3.0.4

Notes

One implementation detail to be aware of: the anno-pipeline outputs tokens as strings, without character offsets, so the original document cannot be recreated exactly. The converter instead places one space between tokens and a newline after every sentence. This only affects you if you rely on character distances and use Concrete's TextSpan. For example, "Mike's house" => Token("Mike") Token("'s") Token("house") => "Mike 's house".
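
To make the effect on offsets concrete, here is a minimal Java sketch of the rule described above: one space between tokens and a newline after every sentence. The class TokenRejoin and its Span type are hypothetical names for illustration and are not part of concrete-agiga or Concrete. The end-exclusive offset convention is also an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the spacing rule; not the library's code.
public class TokenRejoin {
    /** Offsets of one token in the rebuilt text: start inclusive, end exclusive (assumed). */
    public static final class Span {
        public final int start;
        public final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    public final String text;      // rebuilt document text
    public final List<Span> spans; // one span per token, in document order

    TokenRejoin(String text, List<Span> spans) {
        this.text = text;
        this.spans = spans;
    }

    public static TokenRejoin rejoin(List<List<String>> sentences) {
        StringBuilder sb = new StringBuilder();
        List<Span> spans = new ArrayList<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i < sentence.size(); i++) {
                if (i > 0) {
                    sb.append(' ');  // exactly one space between tokens
                }
                int start = sb.length();
                sb.append(sentence.get(i));
                spans.add(new Span(start, sb.length()));
            }
            sb.append('\n');         // newline after every sentence
        }
        return new TokenRejoin(sb.toString(), spans);
    }

    public static void main(String[] args) {
        TokenRejoin r = rejoin(Arrays.asList(Arrays.asList("Mike", "'s", "house")));
        Span house = r.spans.get(2);
        System.out.println(r.text.trim() + " | house at " + house.start + ".." + house.end);
    }
}
```

In this sketch "house" starts at offset 8 in the rebuilt text, whereas it starts at offset 7 in the original "Mike's house". This one-character drift is the kind of difference that matters if you compare character distances against the source document.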

edu.jhu.hlt

JHU Human Language Technology Center of Excellence

Versions

Version
4.4.0
4.3.3
4.3.2
4.3.1
4.2.1