essence

Automatically extracts the main text content (and much more) from an HTML document

License: Apache 2.0
GroupId: io.github.cdimascio
ArtifactId: essence
Last Version: 0.13.0
Type: jar
Description: Automatically extracts the main text content (and much more) from an HTML document
Project URL: https://github.com/cdimascio/essence
Project Organization: Carmine DiMascio OSS
Source Code Management: https://github.com/cdimascio/essence

Download essence

How to add to project

Maven

<!-- https://jarcasting.com/artifacts/io.github.cdimascio/essence/ -->
<dependency>
    <groupId>io.github.cdimascio</groupId>
    <artifactId>essence</artifactId>
    <version>0.13.0</version>
</dependency>

Gradle (Groovy DSL)

// https://jarcasting.com/artifacts/io.github.cdimascio/essence/
implementation 'io.github.cdimascio:essence:0.13.0'

Gradle (Kotlin DSL)

// https://jarcasting.com/artifacts/io.github.cdimascio/essence/
implementation ("io.github.cdimascio:essence:0.13.0")

Buildr

'io.github.cdimascio:essence:jar:0.13.0'

Ivy

<dependency org="io.github.cdimascio" name="essence" rev="0.13.0">
  <artifact name="essence" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='io.github.cdimascio', module='essence', version='0.13.0')
)

sbt

libraryDependencies += "io.github.cdimascio" % "essence" % "0.13.0"

Leiningen

[io.github.cdimascio/essence "0.13.0"]

Dependencies

compile (4)

Group / Artifact                                      Type  Version
org.jetbrains.kotlin : kotlin-stdlib                  jar   1.3.0
org.jsoup : jsoup                                     jar   1.11.3
com.fasterxml.jackson.core : jackson-core             jar   2.9.7
com.fasterxml.jackson.module : jackson-module-kotlin  jar   2.9.7

test (2)

Group / Artifact                          Type  Version
junit : junit                             jar   4.12
org.jetbrains.kotlin : kotlin-test-junit  jar   1.3.0

Project Modules

There are no modules declared in this project.

essence


An automatic web page content extractor for Kotlin and Java.

Given an HTML document, essence automatically extracts the main text content (and much more).

Try out the demo - a simple webapp to demonstrate essence.

This library is inspired by node-unfluff and its lineage.

Usage

Java

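import io.github.cdimascio.essence.Essence;
// EssenceResult is assumed to live in the same package as Essence
import io.github.cdimascio.essence.EssenceResult;

// html holds the page source as a String
EssenceResult data = Essence.extract(html);
System.out.println(data.getText());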

Kotlin

import io.github.cdimascio.essence.Essence

// html holds the page source as a String
val data = Essence.extract(html)
println(data.text)

See Extracted data elements for additional extracted metadata.

Install

Maven

<dependency>
  <groupId>io.github.cdimascio</groupId>
  <artifactId>essence</artifactId>
  <version>0.13.0</version>
</dependency>

Gradle

implementation 'io.github.cdimascio:essence:0.13.0'

Try the Essence web demo

Essence web is a simple web page that fetches the content at a given URL and passes the HTML to the essence library.

The essence web project lives here
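
The fetch-then-extract flow described above can be sketched in plain Java. This is a minimal illustration, not the demo's actual code: the class name `FetchThenExtract` and its helper methods are hypothetical, it assumes JDK 11 or newer for `java.net.http`, and the extractor is passed in as a function so the sketch compiles without essence on the classpath.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Function;

// Hypothetical helper class, not part of essence or the essence web demo.
public class FetchThenExtract {

    // Builds a plain GET request for the page.
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url)).GET().build();
    }

    // Downloads the page and returns its body as a string.
    static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpResponse<String> response =
                client.send(buildRequest(url), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    // Fetches the URL, then hands the HTML to whatever extractor is supplied.
    static String fetchAndExtract(String url, Function<String, String> extractor)
            throws IOException, InterruptedException {
        return extractor.apply(fetchHtml(url));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: FetchThenExtract <url>");
            return;
        }
        // Placeholder extractor that only reports the size of the HTML.
        // With essence on the classpath this could instead be:
        //   html -> Essence.extract(html).getText()
        Function<String, String> extractor =
                html -> html.length() + " characters of HTML";
        System.out.println(fetchAndExtract(args[0], extractor));
    }
}
```

Keeping the extractor as a `Function<String, String>` separates the network step from the extraction step, so either can be swapped or tested on its own.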

Extracted data elements

essence attempts to extract the following content:

  • title - The document's title
  • softTitle - A version of title with less truncation
  • date - The document's publication date
  • copyright - The document's copyright line, if present
  • author - The document's author
  • publisher - The document's publisher (website name)
  • text - The main text of the document with all the junk thrown away
  • image - The main image for the document (what's used by Facebook, etc.)
  • videos (coming soon) - An array of videos embedded in the article. Each video has src, width and height.
  • tags - Any tags or keywords found by checking rel attributes on links or by looking at href URLs.
  • canonicalLink - The canonical URL of the document, if given.
  • lang - The language of the document, either detected or supplied by you.
  • description - The description of the document, from <meta> tags
  • favicon - The URL of the document's favicon.
  • links - An array of links embedded within the article text (text and href for each).
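
To make a couple of these fields concrete, the sketch below shows where description and canonicalLink typically sit in a page's head. It is a naive, regex-based illustration only, not how essence works internally (the dependency list above suggests essence parses HTML with jsoup). The class `MetaSketch` and its methods are hypothetical, and the patterns assume double- or single-quoted attributes in a fixed order.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class, not part of essence.
public class MetaSketch {

    // Matches <meta name="description" content="..."> with name before content.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<meta\\s+name=[\"']description[\"']\\s+content=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Matches <link rel="canonical" href="..."> with rel before href.
    private static final Pattern CANONICAL = Pattern.compile(
            "<link\\s+rel=[\"']canonical[\"']\\s+href=[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Returns the first capture group of the first match, if there is one.
    static Optional<String> firstGroup(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    static Optional<String> description(String html) {
        return firstGroup(DESCRIPTION, html);
    }

    static Optional<String> canonicalLink(String html) {
        return firstGroup(CANONICAL, html);
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta name=\"description\" content=\"A short summary\">"
                + "<link rel=\"canonical\" href=\"https://example.com/post\">"
                + "</head><body><p>Hello</p></body></html>";
        System.out.println(description(html).orElse("(none)"));
        System.out.println(canonicalLink(html).orElse("(none)"));
    }
}
```

Real pages vary attribute order, quoting and casing far more than these patterns allow, which is one reason a proper HTML parser is the better tool for this job.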

Credits

License

Apache 2.0


Contributors

Thanks go to these wonderful people (emoji key):


Clément P. 💻

This project follows the all-contributors specification. Contributions of any kind welcome!

Versions

Version
0.13.0
0.12.6