com.amihaiemil.web:charles

Charles is a smart web crawling library.

GroupId: com.amihaiemil.web
ArtifactId: charles
Last Version: 1.1.1
Type: jar
Description: Charles is a smart web crawling library.
Project URL: http://www.amihaiemil.com
Source Code Management: https://github.com/amihaiemil/charles

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.amihaiemil.web/charles/ -->
<dependency>
    <groupId>com.amihaiemil.web</groupId>
    <artifactId>charles</artifactId>
    <version>1.1.1</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.amihaiemil.web/charles/
implementation 'com.amihaiemil.web:charles:1.1.1'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.amihaiemil.web/charles/
implementation ("com.amihaiemil.web:charles:1.1.1")

Buildr:

'com.amihaiemil.web:charles:jar:1.1.1'

Ivy:

<dependency org="com.amihaiemil.web" name="charles" rev="1.1.1">
  <artifact name="charles" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='com.amihaiemil.web', module='charles', version='1.1.1')
)

sbt:

libraryDependencies += "com.amihaiemil.web" % "charles" % "1.1.1"

Leiningen:

[com.amihaiemil.web/charles "1.1.1"]

Dependencies

compile (8)

Group / Artifact                                  Type  Version
org.glassfish.jaxb : jaxb-runtime                 jar   2.2.11
org.slf4j : slf4j-log4j12                         jar   1.7.21
org.seleniumhq.selenium : selenium-remote-driver  jar   2.41.0
org.seleniumhq.selenium : selenium-java           jar   2.41.0
com.jcabi : jcabi-http                            jar   1.16
com.fasterxml.jackson.core : jackson-core         jar   2.7.4
com.fasterxml.jackson.core : jackson-databind     jar   2.7.4
org.glassfish : javax.json                        jar   1.0.4

test (5)

Group / Artifact                             Type  Version
com.github.detro : phantomjsdriver           jar   1.2.0
junit : junit                                jar   4.12
com.sun.grizzly : grizzly-servlet-webserver  jar   1.9.64
org.mockito : mockito-all                    jar   1.9.5
org.hamcrest : hamcrest-all                  jar   1.3

Project Modules

There are no modules declared in this project.

charles

Smart web crawler.

A smart web crawler that fetches data from a website and stores it in some way (writes it to files on disk, POSTs it to an HTTP endpoint, etc.).
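
The "stores it in some way" part can be pictured as a small storage abstraction with interchangeable back ends. The following is only a minimal sketch under that assumption: `PageStore`, `FilePageStore` and `InMemoryPageStore` are hypothetical names made up for illustration and are not taken from the Charles API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical storage abstraction; not part of the Charles API. */
interface PageStore {
    void save(String url, String text) throws IOException;
}

/** Writes each crawled page to its own text file in a directory. */
final class FilePageStore implements PageStore {
    private final Path dir;
    private int counter = 0;

    FilePageStore(Path dir) {
        this.dir = dir;
    }

    @Override
    public void save(String url, String text) throws IOException {
        Files.createDirectories(dir);
        // Files are named page-1.txt, page-2.txt, ... in the order saved.
        Path file = dir.resolve("page-" + (++counter) + ".txt");
        // First line is the URL, the rest is the page text.
        Files.write(file, (url + "\n" + text).getBytes(StandardCharsets.UTF_8));
    }
}

/** Keeps pages in memory, e.g. to batch them before sending them elsewhere. */
final class InMemoryPageStore implements PageStore {
    final List<String> saved = new ArrayList<>();

    @Override
    public void save(String url, String text) {
        saved.add(url + "\n" + text);
    }
}
```

A crawler written against `PageStore` does not need to know where pages end up, so a back end that POSTs to an HTTP endpoint could be added without touching the crawling code.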

More options for crawling:

  1. crawl the links from a sitemap.xml

  2. crawl the website as a graph starting from a given url (the index)

  3. crawl with retries if a RuntimeException occurs, etc.
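
To make options 2 and 3 more concrete, here is a rough, library-independent sketch of the two ideas: a breadth-first walk of a site's link graph starting from the index, and a wrapper that retries a crawl when a RuntimeException is thrown. `CrawlSketch`, `graphCrawl`, `withRetry` and the `linksOf` function are hypothetical names used only for illustration; they are not the Charles API, and Charles itself may implement these options differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Supplier;

/** Illustration only; these are not Charles classes. */
final class CrawlSketch {

    /**
     * Breadth-first walk of a site, starting from the index URL.
     * linksOf is a stand-in for "render the page and collect its links";
     * every URL is visited at most once.
     */
    static List<String> graphCrawl(String index, Function<String, List<String>> linksOf) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(index);
        queue.add(index);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) { // add() returns false for already-seen URLs
                    queue.add(link);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    /** Runs the crawl, retrying up to maxAttempts times on RuntimeException. */
    static <T> T withRetry(Supplier<T> crawl, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return crawl.get();
            } catch (RuntimeException ex) {
                last = ex; // remember the failure and try again
            }
        }
        throw last; // every attempt failed; rethrow the last failure
    }
}
```

The two pieces compose: `CrawlSketch.withRetry(() -> CrawlSketch.graphCrawl(index, linksOf), 3)` would restart the whole walk up to three times if page rendering fails part-way through.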

More details in this post.

Maven dependency

Get it using Maven:

<dependency>
    <groupId>com.amihaiemil.web</groupId>
    <artifactId>charles</artifactId>
    <version>1.1.1</version>
</dependency>

or take the fat jar.

Under the hood

Charles is powered by Selenium WebDriver. Any WebDriver implementation can be used to build a WebCrawl.

Because it uses a web driver to render pages, dynamic content (e.g. content generated by JavaScript) is crawled as well.

How to contribute

Read this post.

  1. Open an issue regarding an improvement you thought of, or a bug you noticed.
  2. If the issue is confirmed, fork the repository, make the changes on a separate branch and open a Pull Request.
  3. After review and acceptance, the PR is merged and closed.
  4. You are automatically listed as a contributor on the project's site.

Make sure the Maven build

$ mvn clean install -Dgoogle.chrome={path/to/chrome} -Pitcases

passes before making a PR.

Google Chrome must be version 59 or later to support headless mode.

Integration tests

Integration tests run against Google Chrome in headless mode. You also need to install chromedriver for them to work.

You can skip the integration tests by omitting -Pitcases from the build command.

Versions

Version
1.1.1
1.1.0
1.0.2
1.0.1
1.0.0