dap

Document Analysis Platform

GroupId: com.github.document-analysis
ArtifactId: dap
Last Version: 0.1.1
Type: jar
Description: dap, Document Analysis Platform
Project URL: https://github.com/document-analysis/dap
Source Code Management: https://github.com/document-analysis/dap

How to add to project

Maven:

<dependency>
    <groupId>com.github.document-analysis</groupId>
    <artifactId>dap</artifactId>
    <version>0.1.1</version>
</dependency>

Gradle (Groovy DSL):

implementation 'com.github.document-analysis:dap:0.1.1'

Gradle (Kotlin DSL):

implementation("com.github.document-analysis:dap:0.1.1")

Buildr:

'com.github.document-analysis:dap:jar:0.1.1'

Ivy:

<dependency org="com.github.document-analysis" name="dap" rev="0.1.1">
  <artifact name="dap" type="jar" />
</dependency>

Groovy Grape:

@Grapes(
    @Grab(group='com.github.document-analysis', module='dap', version='0.1.1')
)

SBT:

libraryDependencies += "com.github.document-analysis" % "dap" % "0.1.1"

Leiningen:

[com.github.document-analysis/dap "0.1.1"]

Dependencies

test (1)

Group / Artifact    Type    Version
junit : junit       jar     4.12

Project Modules

There are no modules declared in this project.

Document Analysis Platform

What it is:

The Document-Analysis Platform, or DAP, is a programming platform for integrating several NLP tools, making them:

  • interact with each other, and
  • conform to the same interface.

DAP is a lightweight, simple and easy-to-use alternative to UIMA. While UIMA is a revolutionary and powerful platform, it suffers from significant drawbacks, which have become high barriers for newcomers.

The motivation behind DAP's development is the need for a simple, easy-to-learn and easy-to-use alternative that preserves only the core ideas of UIMA.

The advantages of DAP over UIMA are:

  • UIMA takes several weeks to learn, and requires reading hundreds of pages of user manuals. Getting started with DAP takes no longer than 5-10 minutes. Learning DAP from A to Z takes only about 20 minutes.
  • UIMA requires long and hard-to-maintain XML files. DAP requires nothing but pure-Java programming.
  • UIMA employs unusual paradigms for exception throwing, logging, constructing objects, etc. DAP follows normal Java conventions.

The core idea

NLP tools tend to depend on each other. Part-of-speech taggers operate over tokenized text, syntactic parsers operate over part-of-speech annotations, coreference resolvers operate over syntactic analyses, and so on. In short, higher-level tools rely on the output of lower-level ones.

This brings up the challenge of integration. Both the syntactic-parser and the part-of-speech tagger should agree on the data-structures and the format of a POS-tagged text. In other words, the POS-tagger output should be what the syntactic-parser expects. This requirement applies to every set of tools with dependencies between them.

Moreover, if all POS-taggers conform to the same format, then replacing one tagger with another is transparent to the syntactic-parser. Similarly, if all parsers conform to the same format, then replacing one parser with another is transparent to the coreference-resolver.
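
The interchangeability argument above can be illustrated with a small Java sketch. Every name in it (TaggedToken, PosTagger, the two toy taggers, NounCounter) is hypothetical and invented for illustration; none of it is DAP's actual API. The downstream tool stands in for a syntactic parser: it sees only the shared interface, so either tagger can be plugged in without changing it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// All names below are hypothetical; this is a sketch, not DAP's API.

/** A token paired with its part-of-speech tag: the agreed-upon format. */
final class TaggedToken {
    final String text;
    final String pos;

    TaggedToken(String text, String pos) {
        this.text = text;
        this.pos = pos;
    }
}

/** The single interface that every POS tagger conforms to. */
interface PosTagger {
    List<TaggedToken> tag(List<String> tokens);
}

/** Toy tagger: capitalized tokens are nouns, everything else is a verb. */
final class CapitalizationTagger implements PosTagger {
    @Override
    public List<TaggedToken> tag(List<String> tokens) {
        List<TaggedToken> out = new ArrayList<>();
        for (String t : tokens) {
            boolean capitalized = !t.isEmpty() && Character.isUpperCase(t.charAt(0));
            out.add(new TaggedToken(t, capitalized ? "NOUN" : "VERB"));
        }
        return out;
    }
}

/** Toy tagger: tokens ending in "s" are verbs, everything else is a noun. */
final class SuffixTagger implements PosTagger {
    @Override
    public List<TaggedToken> tag(List<String> tokens) {
        List<TaggedToken> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(new TaggedToken(t, t.endsWith("s") ? "VERB" : "NOUN"));
        }
        return out;
    }
}

/** Downstream tool: depends on the shared format, never on a concrete tagger. */
final class NounCounter {
    static int countNouns(PosTagger tagger, List<String> tokens) {
        int count = 0;
        for (TaggedToken t : tagger.tag(tokens)) {
            if ("NOUN".equals(t.pos)) {
                count++;
            }
        }
        return count;
    }
}

final class SharedInterfaceDemo {
    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Alice", "sees", "Bob");
        // Swapping the tagger requires no change to NounCounter.
        System.out.println(NounCounter.countNouns(new CapitalizationTagger(), tokens));
        System.out.println(NounCounter.countNouns(new SuffixTagger(), tokens));
    }
}
```

The two taggers use deliberately naive rules; the point is only that the consumer compiles and runs unchanged against either one.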

DAP targets this integration challenge. It provides data structures with characteristics and utilities that make them fit for virtually every standard NLP tool. The two main data structures are the document and the annotation. The output of every NLP tool can be stored as annotations in documents, with features, attributes, and inter-annotation relations.
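
To make the document/annotation idea concrete, here is a minimal Java sketch of how such a pair of data structures might look. It is an assumption-laden illustration: the class names TextDocument and TextAnnotation, their fields, and their methods are all hypothetical and are not taken from DAP's code. Only features are modelled; inter-annotation relations are left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical classes for illustration; not DAP's actual data structures.

/** A typed span over the document text, carrying free-form features. */
final class TextAnnotation {
    final String type;
    final int begin; // inclusive character offset
    final int end;   // exclusive character offset
    final Map<String, String> features = new HashMap<>();

    TextAnnotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

/** The text plus every annotation that any tool has attached to it. */
final class TextDocument {
    final String text;
    private final List<TextAnnotation> annotations = new ArrayList<>();

    TextDocument(String text) {
        this.text = text;
    }

    /** Adds an annotation over [begin, end) and returns it. */
    TextAnnotation annotate(String type, int begin, int end) {
        if (begin < 0 || end > text.length() || begin > end) {
            throw new IllegalArgumentException("span out of range");
        }
        TextAnnotation a = new TextAnnotation(type, begin, end);
        annotations.add(a);
        return a;
    }

    /** Returns all annotations of the given type, in insertion order. */
    List<TextAnnotation> ofType(String type) {
        List<TextAnnotation> out = new ArrayList<>();
        for (TextAnnotation a : annotations) {
            if (a.type.equals(type)) {
                out.add(a);
            }
        }
        return out;
    }

    /** Returns the text covered by the annotation. */
    String coveredText(TextAnnotation a) {
        return text.substring(a.begin, a.end);
    }
}

final class DocumentDemo {
    public static void main(String[] args) {
        TextDocument doc = new TextDocument("Dogs bark");

        // A stand-in "tokenizer" stores its output as Token annotations.
        doc.annotate("Token", 0, 4);
        doc.annotate("Token", 5, 9);

        // A stand-in "tagger" reads the tokens and adds a feature to each.
        for (TextAnnotation token : doc.ofType("Token")) {
            String word = doc.coveredText(token);
            boolean capitalized = Character.isUpperCase(word.charAt(0));
            token.features.put("pos", capitalized ? "NOUN" : "VERB");
        }

        for (TextAnnotation token : doc.ofType("Token")) {
            System.out.println(doc.coveredText(token) + "/" + token.features.get("pos"));
        }
    }
}
```

In this sketch the higher-level tool never talks to the lower-level one directly; both read from and write to the same document, which is the property the paragraph above describes.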

In addition to data structures, an actual set of part-of-speech tags, syntactic phrase types, syntactic dependency relations, etc. is required. The project DAP-DKPro_1_8 provides a standard set of NLP types, borrowed from the DKPro project.

Batteries included

Users can start working with DAP right away with dozens of state-of-the-art NLP tools for several languages by using the DAP-DKPro_1_8 library, which wraps DKPro tools inside DAP.

A demo is provided in DAP-DKPro_1_8-demo.

Usage in Maven

The project has been uploaded to the Maven Central repository.

In a Maven project, add the following:

<dependency>
  <groupId>com.github.document-analysis</groupId>
  <artifactId>dap</artifactId>
  <version>0.1.1</version>
</dependency>

To get started, related projects should be imported as well. See:

  1. dap-uimafit
  2. dap-dkpro_1_8
  3. dap-dkpro_1_8-demo

Your first steps

Start by reading the 20-minutes-tutorial.

Then jump to the demo.

License

DAP is licensed under the Apache 2.0 license, a permissive license that also allows commercial use.

Note that DAP-DKPro_1_8-demo depends on external libraries, which have more restrictive licenses.

Versions

Version
0.1.1
0.1