BioMedICUS Tokenizer

A lightweight (small and dependency-free) Java 8 library for Penn-like tokenization. This was developed as a stand-alone component of BioMedICUS, a biomedical and clinical NLP engine developed by the NLP-IE Group at the University of Minnesota Institute for Health Informatics.

GroupId: edu.umn.biomedicus
ArtifactId: biomedicus-tokenizer
Last Version: 0.0.3
Type: jar
Project URL: https://github.com/nlpie/biomedicus-tokenizer
Project Organization: University of Minnesota Institute for Health Informatics NLP/IE Program
Source Code Management: https://github.com/nlpie/biomedicus-tokenizer

How to add to project

Maven:

<dependency>
    <groupId>edu.umn.biomedicus</groupId>
    <artifactId>biomedicus-tokenizer</artifactId>
    <version>0.0.3</version>
</dependency>

Gradle (Groovy DSL):

implementation 'edu.umn.biomedicus:biomedicus-tokenizer:0.0.3'

Gradle (Kotlin DSL):

implementation("edu.umn.biomedicus:biomedicus-tokenizer:0.0.3")

Buildr:

'edu.umn.biomedicus:biomedicus-tokenizer:jar:0.0.3'

Ivy:

<dependency org="edu.umn.biomedicus" name="biomedicus-tokenizer" rev="0.0.3">
  <artifact name="biomedicus-tokenizer" type="jar" />
</dependency>

Grape:

@Grapes(
  @Grab(group='edu.umn.biomedicus', module='biomedicus-tokenizer', version='0.0.3')
)

sbt:

libraryDependencies += "edu.umn.biomedicus" % "biomedicus-tokenizer" % "0.0.3"

Leiningen:

[edu.umn.biomedicus/biomedicus-tokenizer "0.0.3"]

Dependencies

compile (2)

Group / Artifact                     Type  Version  Notes
org.slf4j : slf4j-api                jar   1.7.25
com.google.code.findbugs : jsr305    jar   3.0.2    optional

test (3)

Group / Artifact                          Type  Version
org.junit.jupiter : junit-jupiter-engine  jar   5.3.2
org.mockito : mockito-core                jar   2.23.4
org.slf4j : slf4j-nop                     jar   1.7.25

Project Modules

There are no modules declared in this project.

Using in your project

To use the library in a Maven project, add the following to your pom.xml:

<dependencies>
  <dependency>
    <groupId>edu.umn.biomedicus</groupId>
    <artifactId>biomedicus-tokenizer</artifactId>
    <version>0.0.3</version>
  </dependency>
</dependencies>

Alternatively, download the .jar and add it to your project's classpath.

Detecting tokens from strings

Iteratively

import edu.umn.biomedicus.tokenization.Tokenizer;
import edu.umn.biomedicus.tokenization.TokenResult;

public class Example {
  public void example() {
    String text = "An example sentence.";
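    // Tokens are produced one at a time as the loop advances.
    for (TokenResult result : Tokenizer.tokenize(text)) {
      // Recover the token's characters from the original text.
      CharSequence tokenText = result.text(text);
    }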
  }
}

All at once

import java.util.List;

import edu.umn.biomedicus.tokenization.Tokenizer;
import edu.umn.biomedicus.tokenization.TokenResult;

public class Example {
  public void example() {
    String text = "An example sentence.";
    List<TokenResult> results = Tokenizer.allTokens(text);
    for (TokenResult result : results) {
      CharSequence tokenText = result.text(text);
    }
  }
}
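
Both examples above pass the original text back into result.text(text), which suggests that a token result identifies a region of the input rather than holding its own copy of the characters. The following is a minimal, self-contained sketch of that offset-based pattern. It is not the library's implementation: the Span class, the spans method, and the regular expression are hypothetical stand-ins, and the splitting rule is far simpler than Penn-like tokenization.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpanSketch {
  /** Hypothetical stand-in for a token result: stores offsets only. */
  static final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }

    /** Recovers the token's characters by slicing the original text. */
    CharSequence text(CharSequence source) {
      return source.subSequence(begin, end);
    }
  }

  // Simplified rule (an assumption for illustration): a run of word
  // characters, or a single non-space symbol such as punctuation.
  private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

  /** Returns the begin/end offsets of every token found in the text. */
  static List<Span> spans(CharSequence text) {
    List<Span> result = new ArrayList<>();
    Matcher matcher = TOKEN.matcher(text);
    while (matcher.find()) {
      result.add(new Span(matcher.start(), matcher.end()));
    }
    return result;
  }

  public static void main(String[] args) {
    String text = "An example sentence.";
    for (Span span : spans(text)) {
      System.out.println(span.begin + "-" + span.end + " " + span.text(text));
    }
  }
}
```

Keeping only offsets means each token can be mapped back to its exact position in the source document, which is useful when annotations from several processing steps need to be aligned against the same text.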

Javadoc

You can find the API documentation for this project here.

Contact and Support

To report an issue or request an enhancement, open an issue on the GitHub Issues tab.

BioMedICUS has a Gitter chat and a Google Group for contacting developers with questions, suggestions, or feedback.

About Us

BioMedICUS is developed by the University of Minnesota Institute for Health Informatics NLP/IE Group with assistance from the Open Health Natural Language Processing (OHNLP) Consortium.

Contributing

Anyone is welcome and encouraged to contribute. If you discover a bug, or think the project could use an enhancement, follow these steps:

  1. Create an issue and offer to code a solution. We can discuss the issue and decide whether any code would be a good addition to the project.
  2. Fork the project: https://github.com/nlpie/biomedicus-tokenizer/fork
  3. Create a feature branch (git checkout -b feature-name).
  4. Code your solution.
     • Follow the Google style guide for Java. There are IDE profiles available here.
     • Write unit tests for any non-trivial aspects of your code. If you are fixing a bug, write a regression test: one that confirms the behavior you fixed stays fixed.
  5. Commit to your branch (git commit -am 'Summary of changes').
  6. Push to GitHub (git push origin feature-name).
  7. Create a pull request on this repository from your forked project. We will review and discuss your code and merge it.
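
As an illustration of the regression-test guideline in the steps above, here is a minimal sketch in plain Java. Everything in it is hypothetical: the splitOnWhitespace function and the imagined bug (runs of spaces producing empty tokens) are invented for the example and are not taken from this project. In the project itself such a check would more likely be written as a JUnit 5 test, since junit-jupiter-engine is listed among the test dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RegressionSketch {
  // Hypothetical function under test. The imagined bug: consecutive
  // spaces used to yield empty strings in the result.
  static List<String> splitOnWhitespace(String text) {
    List<String> tokens = new ArrayList<>();
    for (String piece : text.trim().split("\\s+")) {
      if (!piece.isEmpty()) {
        tokens.add(piece);
      }
    }
    return tokens;
  }

  // Regression test: pins the input from the imagined bug report so the
  // fix cannot silently be undone by a later change.
  static void doubleSpaceDoesNotProduceEmptyToken() {
    List<String> tokens = splitOnWhitespace("two  spaces");
    if (!tokens.equals(Arrays.asList("two", "spaces"))) {
      throw new AssertionError("unexpected tokens: " + tokens);
    }
  }

  public static void main(String[] args) {
    doubleSpaceDoesNotProduceEmptyToken();
    System.out.println("regression test passed");
  }
}
```

The test name describes the fixed behavior and the input reproduces the original failure exactly, so a future failure points directly back at the bug that was fixed.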
Versions

Version
0.0.3
0.0.2
0.0.1