Boilerpipe -- Boilerplate Removal and Fulltext Extraction from HTML pages

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page. It ships with ready-made strategies for common tasks (for example, news article extraction) and can be extended for individual problem settings. Extraction is fast (on the order of milliseconds), requires only the input document (no global or site-level information), and is usually quite accurate.

Boilerpipe is a Java library written by Christian Kohlschütter and released under the Apache License 2.0. Its algorithms are based on, and extend, concepts from the paper "Boilerplate Detection using Shallow Text Features" by Christian Kohlschütter et al., presented at WSDM 2010, the Third ACM International Conference on Web Search and Data Mining, in New York City, NY, USA.
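To make the idea of "shallow text features" more concrete, here is a minimal, self-contained Java sketch. It is not boilerpipe's API and not the paper's actual classifier: the class ShallowFeatureSketch, the Block type, and both thresholds are hypothetical and chosen only for illustration. It assumes the page has already been split into text blocks, and it keeps a block when the block has enough words and few of those words sit inside links.

```java
import java.util.ArrayList;
import java.util.List;

public class ShallowFeatureSketch {

    /** A text block with two shallow features: total words and words inside links. */
    static final class Block {
        final String text;
        final int numWords;
        final int numLinkedWords;

        Block(String text, int numLinkedWords) {
            this.text = text;
            String trimmed = text.trim();
            // Count whitespace-separated tokens; an empty block has zero words.
            this.numWords = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            this.numLinkedWords = numLinkedWords;
        }

        /** Fraction of the block's words that are link anchor text. */
        double linkDensity() {
            return numWords == 0 ? 0.0 : (double) numLinkedWords / numWords;
        }
    }

    // Hypothetical thresholds for this sketch, not values taken from boilerpipe.
    static final int MIN_WORDS = 10;
    static final double MAX_LINK_DENSITY = 0.33;

    /** Treats a block as main content if it is long enough and not link-heavy. */
    static boolean isContent(Block b) {
        return b.numWords >= MIN_WORDS && b.linkDensity() <= MAX_LINK_DENSITY;
    }

    /** Joins the text of all blocks classified as content, one block per line. */
    static String extract(List<Block> blocks) {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            if (isContent(b)) {
                if (sb.length() > 0) {
                    sb.append('\n');
                }
                sb.append(b.text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("Home News Sports Contact", 4)); // navigation bar
        blocks.add(new Block("The city council voted on Tuesday to approve the new budget after a long public debate.", 0));
        blocks.add(new Block("Copyright 2010", 0)); // footer
        System.out.println(extract(blocks));
    }
}
```

In this sketch the navigation bar is rejected because it is short and consists entirely of link text, and the footer is rejected because it is short. Only the article sentence passes both checks. Each block is judged from the document alone, which matches the description above that no global or site-level information is needed.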

License: Apache License 2.0
GroupId: de.l3s.boilerpipe
ArtifactId: boilerpipe
Last Version: 1.1.0
Release Date:
Type: jar

How to add to project

Maven:
<!-- https://jarcasting.com/artifacts/de.l3s.boilerpipe/boilerpipe/ -->
<dependency>
    <groupId>de.l3s.boilerpipe</groupId>
    <artifactId>boilerpipe</artifactId>
    <version>1.1.0</version>
</dependency>

Gradle (Groovy DSL):
// https://jarcasting.com/artifacts/de.l3s.boilerpipe/boilerpipe/
implementation 'de.l3s.boilerpipe:boilerpipe:1.1.0'

Gradle (Kotlin DSL):
// https://jarcasting.com/artifacts/de.l3s.boilerpipe/boilerpipe/
implementation ("de.l3s.boilerpipe:boilerpipe:1.1.0")

Buildr:
'de.l3s.boilerpipe:boilerpipe:jar:1.1.0'

Ivy:
<dependency org="de.l3s.boilerpipe" name="boilerpipe" rev="1.1.0">
  <artifact name="boilerpipe" type="jar" />
</dependency>

Groovy Grape:
@Grapes(
@Grab(group='de.l3s.boilerpipe', module='boilerpipe', version='1.1.0')
)

SBT:
libraryDependencies += "de.l3s.boilerpipe" % "boilerpipe" % "1.1.0"

Leiningen:
[de.l3s.boilerpipe/boilerpipe "1.1.0"]

Dependencies

This project declares no dependencies. It is standalone and does not depend on any other jars.

Project Modules

There are no modules declared in this project.

Versions

Version
1.1.0
1.0.4