co.mailtarget.durian

durian web extractor

License:
GroupId: co.mailtarget
ArtifactId: durian
Last Version: 0.0.10
Release Date:
Type: jar
Description: durian web extractor
Project URL: https://github.com/mailtarget/durian
Source Code Management: https://github.com/mailtarget/durian

How to add to project

Maven:

    <!-- https://jarcasting.com/artifacts/co.mailtarget/durian/ -->
    <dependency>
        <groupId>co.mailtarget</groupId>
        <artifactId>durian</artifactId>
        <version>0.0.10</version>
    </dependency>

Gradle (Groovy DSL):

    // https://jarcasting.com/artifacts/co.mailtarget/durian/
    implementation 'co.mailtarget:durian:0.0.10'

Gradle (Kotlin DSL):

    // https://jarcasting.com/artifacts/co.mailtarget/durian/
    implementation ("co.mailtarget:durian:0.0.10")

Buildr:

    'co.mailtarget:durian:jar:0.0.10'

Ivy:

    <dependency org="co.mailtarget" name="durian" rev="0.0.10">
      <artifact name="durian" type="jar" />
    </dependency>

Grape:

    @Grapes(
    @Grab(group='co.mailtarget', module='durian', version='0.0.10')
    )

SBT:

    libraryDependencies += "co.mailtarget" % "durian" % "0.0.10"

Leiningen:

    [co.mailtarget/durian "0.0.10"]

Dependencies

compile (4)

Group / Artifact                           Type  Version
org.jetbrains.kotlin : kotlin-stdlib-jdk8  jar   1.3.70
org.jsoup : jsoup                          jar   1.13.1
org.apache.commons : commons-text          jar   1.8
junit : junit                              jar   4.12

test (1)

Group / Artifact                          Type  Version
org.jetbrains.kotlin : kotlin-test-junit  jar   1.3.70

Project Modules

There are no modules declared in this project.

Durian Extractor

Web page extractor and readability using Jsoup.

Prerequisites:

Install

Because this project is not published to any public Maven repository, install it locally first:

    mvn clean install

Then add it as a dependency of your project:

    <dependency>
        <groupId>co.mailtarget</groupId>
        <artifactId>durian</artifactId>
        <version>0.0.10</version>
    </dependency>

Usage

###Kotlin

    val extractor = WebExtractor.Builder()
                    .strategy(Strategy.HYBRID)
                    .build()
    
    val webData = extractor.extract(url)

or

    val forceJavascript = false
    val webData = extractor.extract(url, forceJavascript)

###Java

    WebExtractor extractor = new WebExtractor.Builder()
                    .strategy(Strategy.HYBRID)
                    .build();
    WebData webData = extractor.extract(url);

or

    boolean forceJavascript = false;
    WebData webData = extractor.extract(url, forceJavascript);

Options

###Extract Strategy

  • META : the fastest method; parses content from the page's meta tags only
  • CONTENT : prefers the page content as the source of extraction
  • HYBRID : reads from the meta tags first and, if nothing is found, searches deeper in the content
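
To make the difference between the three strategies concrete, here is a minimal, self-contained sketch of the fallback idea for a single field (the page title). It is not Durian's implementation: the class `StrategySketch`, the helper `firstGroup`, and the two regular expressions are hypothetical and exist only for this illustration. The library itself is built on Jsoup, and a real extractor would use an HTML parser rather than regular expressions.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrategySketch {

    // Mirrors the three strategy names listed above.
    enum Strategy { META, CONTENT, HYBRID }

    // Hypothetical patterns for this sketch only; they are assumptions,
    // not what Durian does internally.
    private static final Pattern OG_TITLE = Pattern.compile(
            "<meta\\s+property=\"og:title\"\\s+content=\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);
    private static final Pattern H1 = Pattern.compile(
            "<h1[^>]*>([^<]*)</h1>", Pattern.CASE_INSENSITIVE);

    // Returns the first non-blank capture of the pattern, if any.
    static Optional<String> firstGroup(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        if (m.find()) {
            String value = m.group(1).trim();
            if (!value.isEmpty()) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    static Optional<String> extractTitle(String html, Strategy strategy) {
        switch (strategy) {
            case META:
                // Meta tags only: cheap, but empty when the tag is missing.
                return firstGroup(OG_TITLE, html);
            case CONTENT:
                // Body content only.
                return firstGroup(H1, html);
            default:
                // HYBRID: try meta first, fall back to the content.
                Optional<String> meta = firstGroup(OG_TITLE, html);
                return meta.isPresent() ? meta : firstGroup(H1, html);
        }
    }

    public static void main(String[] args) {
        String withMeta = "<head><meta property=\"og:title\" content=\"Meta Title\"></head>"
                + "<body><h1>Body Title</h1></body>";
        String withoutMeta = "<body><h1>Body Title</h1></body>";

        System.out.println(extractTitle(withMeta, Strategy.HYBRID));    // Optional[Meta Title]
        System.out.println(extractTitle(withoutMeta, Strategy.HYBRID)); // Optional[Body Title]
        System.out.println(extractTitle(withoutMeta, Strategy.META));   // Optional.empty
    }
}
```

In this sketch HYBRID costs the same as META when the meta tag is present and only does the extra content search when it is absent, which is the trade-off the list above describes.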

###System Config

Tested on macOS and works well. On CentOS, install the following packages first:

    yum groupinstall -y "Fonts"
    yum install gtk2 

Optional packages: gtkhtml3, libXtst, libxslt, alsa-lib

co.mailtarget

MTARGET

MTARGET Developer

Versions

Version
0.0.10