Format: ALTO

Document Format plugin to support reading ALTO XML files

ORM Data

GroupId: uk.ac.gate.plugins
ArtifactId: format-alto
Last Version: 1.1
Type: jar
Description: Document Format plugin to support reading ALTO XML files
Project Organization: GATE
Source Code Management: https://github.com/GateNLP/gateplugin-Format_ALTO

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/uk.ac.gate.plugins/format-alto/ -->
<dependency>
    <groupId>uk.ac.gate.plugins</groupId>
    <artifactId>format-alto</artifactId>
    <version>1.1</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/uk.ac.gate.plugins/format-alto/
implementation 'uk.ac.gate.plugins:format-alto:1.1'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/uk.ac.gate.plugins/format-alto/
implementation ("uk.ac.gate.plugins:format-alto:1.1")

Buildr:

'uk.ac.gate.plugins:format-alto:jar:1.1'

Ivy:

<dependency org="uk.ac.gate.plugins" name="format-alto" rev="1.1">
  <artifact name="format-alto" type="jar" />
</dependency>

Grape:

@Grapes(
    @Grab(group='uk.ac.gate.plugins', module='format-alto', version='1.1')
)

SBT:

libraryDependencies += "uk.ac.gate.plugins" % "format-alto" % "1.1"

Leiningen:

[uk.ac.gate.plugins/format-alto "1.1"]

Dependencies

provided (1)

Group / Artifact                        Type    Version
uk.ac.gate : gate-core                  jar     8.6

test (1)

Group / Artifact                        Type    Version
uk.ac.gate : gate-plugin-test-utils     jar     8.6

Project Modules

There are no modules declared in this project.

GATE Support for ALTO XML documents

This plugin provides support for reading documents stored as ALTO XML. The format is usually used to store OCR-based transcriptions of documents, and hence contains information on the position of the text within the page as well as the text itself. It's popular among libraries and museums as a way of providing digital copies of scanned documents and manuscripts. For example, the British Library offers a number of collections of digitised books in this format.

The code provided by this plugin focuses purely on the text content of ALTO XML files and completely ignores the positional information. Specifically, it reads the String elements that appear within TextBlock elements inside the PrintSpace of each page. This means that text in the header, footer, and margins is ignored. This choice is based on previous experience with multi-page formats (such as PDF), where headers and footers make processing text that flows across pages exceptionally problematic. This may change in future versions.
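
The selection rule described above can be illustrated with a small standalone sketch. This is not the plugin's own code: it uses only the JDK's DOM parser, and the class name AltoTextSketch, the method extractWords, and the sample document are hypothetical names chosen for illustration. It also assumes the text of each String element lives in its CONTENT attribute, as in the ALTO schema.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Illustrative sketch only; the real plugin may be implemented quite differently.
public class AltoTextSketch {

    /** Collects the CONTENT of each String inside a TextBlock inside a PrintSpace. */
    public static List<String> extractWords(String altoXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the namespace-wildcard lookups below
            doc = factory.newDocumentBuilder().parse(
                    new ByteArrayInputStream(altoXml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse ALTO XML", e);
        }

        List<String> words = new ArrayList<>();
        // "*" matches any namespace, so the ALTO schema version does not matter here.
        NodeList printSpaces = doc.getElementsByTagNameNS("*", "PrintSpace");
        for (int p = 0; p < printSpaces.getLength(); p++) {
            NodeList blocks = ((Element) printSpaces.item(p))
                    .getElementsByTagNameNS("*", "TextBlock");
            for (int b = 0; b < blocks.getLength(); b++) {
                NodeList strings = ((Element) blocks.item(b))
                        .getElementsByTagNameNS("*", "String");
                for (int s = 0; s < strings.getLength(); s++) {
                    words.add(((Element) strings.item(s)).getAttribute("CONTENT"));
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Made-up sample: one String in the top margin, two in the print space.
        String sample =
              "<alto xmlns=\"http://www.loc.gov/standards/alto/ns-v3#\"><Layout><Page>"
            + "<TopMargin><TextBlock><TextLine><String CONTENT=\"Header\"/></TextLine></TextBlock></TopMargin>"
            + "<PrintSpace><TextBlock><TextLine>"
            + "<String CONTENT=\"Hello\"/><SP/><String CONTENT=\"world\"/>"
            + "</TextLine></TextBlock></PrintSpace>"
            + "</Page></Layout></alto>";
        System.out.println(String.join(" ", extractWords(sample))); // prints "Hello world"
    }
}
```

In the sample, the String inside TopMargin is skipped because it does not sit under a PrintSpace element, which mirrors the header and margin behaviour described above. Joining words with a single space is a simplification; how the plugin reconstructs spacing and line breaks is not specified here.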

To activate the plugin (once loaded), set the MIME type to application/xml+alto when loading documents.

uk.ac.gate.plugins

GateNLP

GATE - General Architecture for Text Engineering

Versions

Version
1.1
1.0