lt.tokenmill.crawling:elasticsearch

Framework to simplify news crawling

License

Categories

Search Business Logic Libraries Elasticsearch

GroupId

lt.tokenmill.crawling

ArtifactId

elasticsearch

Last Version

0.2.0

Release Date

Type

jar

Description

Framework to simplify news crawling

Source Code Management

https://github.com/tokenmill/crawling-framework/tree/master/elasticsearch


How to add to project

Maven

<!-- https://jarcasting.com/artifacts/lt.tokenmill.crawling/elasticsearch/ -->
<dependency>
    <groupId>lt.tokenmill.crawling</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>0.2.0</version>
</dependency>

Gradle (Groovy)

// https://jarcasting.com/artifacts/lt.tokenmill.crawling/elasticsearch/
implementation 'lt.tokenmill.crawling:elasticsearch:0.2.0'

Gradle (Kotlin)

// https://jarcasting.com/artifacts/lt.tokenmill.crawling/elasticsearch/
implementation("lt.tokenmill.crawling:elasticsearch:0.2.0")

Buildr

'lt.tokenmill.crawling:elasticsearch:jar:0.2.0'

Ivy

<dependency org="lt.tokenmill.crawling" name="elasticsearch" rev="0.2.0">
  <artifact name="elasticsearch" type="jar" />
</dependency>

Grape

@Grapes(
  @Grab(group='lt.tokenmill.crawling', module='elasticsearch', version='0.2.0')
)

SBT

libraryDependencies += "lt.tokenmill.crawling" % "elasticsearch" % "0.2.0"

Leiningen

[lt.tokenmill.crawling/elasticsearch "0.2.0"]

Dependencies

compile (4)

Group / Artifact Type Version
lt.tokenmill.crawling : data-model jar 0.2.0
org.elasticsearch : elasticsearch jar 5.5.1
org.elasticsearch.client : transport jar 5.5.1
com.google.guava : guava jar 19.0

provided (3)

Group / Artifact Type Version
org.apache.logging.log4j : log4j-api jar 2.7
org.apache.logging.log4j : log4j-core jar 2.7
org.slf4j : slf4j-log4j12 jar 1.7.12

test (2)

Group / Artifact Type Version
org.elasticsearch.plugin : transport-netty4-client jar 5.5.1
junit : junit jar 4.12
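
The compile scope pins Elasticsearch 5.5.1 together with its transport client (transport-netty4-client is pulled in for tests). As a hedged sketch, not code taken from this project, a 5.5.x TransportClient compatible with these versions is typically opened like this; the host, port and cluster name below are assumptions:

import java.net.InetAddress;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.transport.client.PreBuiltTransportClient;

public class EsClientSketch {

    public static void main(String[] args) throws Exception {
        // Assumed defaults: local node on the transport port with the stock cluster name.
        Settings settings = Settings.builder()
                .put("cluster.name", "elasticsearch")
                .build();
        try (TransportClient client = new PreBuiltTransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress(
                        InetAddress.getByName("localhost"), 9300))) {
            // Print the nodes the client managed to connect to.
            client.connectedNodes().forEach(node -> System.out.println(node.getName()));
        }
    }
}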

Project Modules

There are no modules declared in this project.

Crawling Framework


Crawling Framework provides instruments to configure and run a Storm Crawler based crawler. It mainly aims at easing the crawling of article-publishing sites such as news portals or blogs. With the GUI tool the Crawling Framework provides, you can:

  1. Specify which sites to crawl.
  2. Configure URL inclusion and exclusion filters, thus controlling which sections of the site will be fetched.
  3. Specify which elements of the page provide the article's publication name, title, and main body.
  4. Define tests which validate that extraction rules are working.

Once configuration is done, the Crawling Framework runs a Storm Crawler based crawl following the rules specified in the configuration.
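
To give a feel for what the inclusion/exclusion rules amount to (a conceptual sketch only, not the framework's API; the class name and patterns below are made up), a URL filter boils down to matching each candidate URL against include and exclude patterns:

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of URL inclusion/exclusion filtering;
// this is not a class from lt.tokenmill.crawling.
public class UrlFilterSketch {

    private final List<Pattern> includes = Arrays.asList(
            Pattern.compile("https?://example\\.com/news/.*"));
    private final List<Pattern> excludes = Arrays.asList(
            Pattern.compile(".*/tag/.*"),
            Pattern.compile(".*\\.(jpg|png|css|js)$"));

    // A URL is fetched only if it matches an include pattern and no exclude pattern.
    public boolean accept(String url) {
        boolean included = includes.stream().anyMatch(p -> p.matcher(url).matches());
        boolean excluded = excludes.stream().anyMatch(p -> p.matcher(url).matches());
        return included && !excluded;
    }
}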

Introduction

We have recorded a video on how to set up and use Crawling Framework. Click on the image below to watch it on YouTube.

Crawling Framework Intro

Requirements

The framework writes its configuration to and stores crawled data in Elasticsearch. Before starting a crawl project, install Elasticsearch (Crawling Framework is tested to work with Elastic v7.x).

Crawling Framework is a Java library which has to be extended to run a Storm Crawler topology, so Java (JDK 8, Maven) infrastructure is needed.

Using password-protected Elasticsearch

Some providers put Elasticsearch behind an authentication step (which makes sense). Just set the environment variables ES_USERNAME and ES_PASSWORD accordingly; everything else can remain the same. Authentication is performed implicitly when the proper credentials are present.
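
As a hedged illustration of how such credentials can be picked up from the environment (this is not the framework's internal code; the local endpoint URL is an assumption), a Basic-auth call to the standard Elasticsearch cluster health endpoint looks roughly like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch: read ES_USERNAME / ES_PASSWORD and query Elasticsearch cluster health with Basic auth.
public class EsAuthCheck {

    public static void main(String[] args) throws Exception {
        String user = System.getenv("ES_USERNAME");
        String pass = System.getenv("ES_PASSWORD");
        URL url = new URL("http://localhost:9200/_cluster/health"); // assumed local instance
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        if (user != null && pass != null) {
            String token = Base64.getEncoder()
                    .encodeToString((user + ":" + pass).getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + token);
        }
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            System.out.println(reader.readLine()); // one-line JSON with the cluster status
        }
    }
}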

Configuring and Running a crawl

See Crawling Framework Example project's documentation.

License

Copyright © 2017-2019 TokenMill UAB.

Distributed under the Apache License, Version 2.0.


Versions

0.2.0