Document Normalizer

Tools for normalizing documents before processing

License

Categories

ORM Data
GroupId

uk.ac.gate.plugins
ArtifactId

document-normalizer
Last Version

8.5
Release Date

Type

jar
Description

Document Normalizer
Tools for normalizing documents before processing
Project Organization

GATE
Source Code Management

https://github.com/GateNLP/gateplugin-DocumentNormalizer

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/uk.ac.gate.plugins/document-normalizer/ -->
<dependency>
    <groupId>uk.ac.gate.plugins</groupId>
    <artifactId>document-normalizer</artifactId>
    <version>8.5</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/uk.ac.gate.plugins/document-normalizer/
implementation 'uk.ac.gate.plugins:document-normalizer:8.5'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/uk.ac.gate.plugins/document-normalizer/
implementation ("uk.ac.gate.plugins:document-normalizer:8.5")

Buildr:

'uk.ac.gate.plugins:document-normalizer:jar:8.5'

Ivy:

<dependency org="uk.ac.gate.plugins" name="document-normalizer" rev="8.5">
  <artifact name="document-normalizer" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='uk.ac.gate.plugins', module='document-normalizer', version='8.5')
)

SBT:

libraryDependencies += "uk.ac.gate.plugins" % "document-normalizer" % "8.5"

Leiningen:

[uk.ac.gate.plugins/document-normalizer "8.5"]

Dependencies

provided (1)

Group / Artifact                       Type  Version
uk.ac.gate : gate-core                 jar   8.5

test (1)

Group / Artifact                       Type  Version
uk.ac.gate : gate-plugin-test-utils    jar   8.5

Project Modules

There are no modules declared in this project.

A simple PR (processing resource) for basic document normalization. It should usually be run as the first PR in a pipeline after Document Reset. The PR edits the document content, so once it has been run over a document, later executions have no further effect, although they still cost processing time.

The PR works from a file of replacements, which consists of pairs of lines. The first line of each pair specifies the text to replace, and the second line gives the text that is substituted in its place. The first line can be a regular expression, but back references cannot be used in the second line.
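The pair-of-lines format can be illustrated with a minimal Java sketch. This is not the plugin's actual code: the class name `ReplacementSketch`, its methods, and the handling of a trailing unpaired line are assumptions made for illustration. It uses `Matcher.quoteReplacement` so that the second line of each pair is always treated as literal text, mirroring the rule that back references are not available there.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the replacements-file idea; not the plugin's implementation.
public class ReplacementSketch {

    /** One pair of lines: a regular expression and its literal substitution. */
    public static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(Pattern pattern, String substitution) {
            this.pattern = pattern;
            this.substitution = substitution;
        }
    }

    /** Reads rules from consecutive pairs of lines (pattern, then substitution). */
    public static List<Rule> parseRules(List<String> lines) {
        List<Rule> rules = new ArrayList<>();
        // Assumption: a trailing line without a partner is ignored.
        for (int i = 0; i + 1 < lines.size(); i += 2) {
            rules.add(new Rule(Pattern.compile(lines.get(i)), lines.get(i + 1)));
        }
        return rules;
    }

    /** Applies every rule in order to the text. */
    public static String normalize(String text, List<Rule> rules) {
        String result = text;
        for (Rule rule : rules) {
            // quoteReplacement makes "$1" and "\" literal: no back references.
            result = rule.pattern.matcher(result)
                    .replaceAll(Matcher.quoteReplacement(rule.substitution));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative file content: curly quotes -> apostrophe, en dash -> hyphen.
        List<String> lines = Arrays.asList("[\u2018\u2019]", "'", "\u2013", "-");
        List<Rule> rules = parseRules(lines);
        System.out.println(normalize("It\u2019s a test \u2013 really", rules));
        // prints: It's a test - really
    }
}
```

Running `normalize` a second time over its own output changes nothing here, which matches the note above that repeated executions have no further effect.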

The most common use for this PR is to normalise punctuation symbols, as WYSIWYG editors often automatically replace the standard apostrophe and hyphen symbols with fancier variants. This makes processing text difficult because gazetteer lists, JAPE grammars and other resources usually assume the standard symbols, i.e. the ones on the keyboard. The default config file is aimed at normalizing such cases.
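To show why such symbols cause trouble, here is a small Java sketch of an exact-match lookup that fails on a typographic apostrophe and succeeds after normalization. The class `GazetteerLookupSketch`, the sample entries, and the particular characters mapped are assumptions for illustration; they are not taken from the plugin's default config file, and a plain `Set` only stands in for a real gazetteer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration; a Set stands in for a gazetteer list.
public class GazetteerLookupSketch {

    /** Maps a few typographic symbols to their keyboard equivalents. */
    public static String normalizePunctuation(String text) {
        return text
                .replace('\u2018', '\'')  // left single quotation mark
                .replace('\u2019', '\'')  // right single quotation mark
                .replace('\u2010', '-')   // Unicode hyphen
                .replace('\u2013', '-');  // en dash
    }

    public static void main(String[] args) {
        // Entries typed with the standard keyboard symbols.
        Set<String> gazetteer = new HashSet<>(
                Arrays.asList("O'Brien", "Stratford-upon-Avon"));

        String raw = "O\u2019Brien"; // as a WYSIWYG editor might store it

        System.out.println(gazetteer.contains(raw));                       // false
        System.out.println(gazetteer.contains(normalizePunctuation(raw))); // true
    }
}
```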

uk.ac.gate.plugins

GateNLP

GATE - General Architecture for Text Engineering

Versions

Version
8.5
8.5-alpha1