scriptspider

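ScriptSpider (SS) is a distributed, general-purpose crawler framework written in Java. Every component is hot-swappable (defaults are provided), proxies are switched automatically, and data is structured and stored automatically. It relies on Redis, distributed scheduling, and related techniques.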

License	The Apache Software License, Version 2.0
Categories	IDE Development Tools
GroupId	com.github.xjtushilei
ArtifactId	scriptspider
Last Version	0.3
Release Date	Mar 30, 2018
Type	jar
Description	ScriptSpider (SS) is a Java distributed crawler framework that supports hot-swappable components (defaults are provided), automatic proxy switching, and automatic data structuring and storage. It uses Redis, distributed scheduling, and related techniques.
Project URL	https://github.com/xjtushilei/ScriptSpider
Source Code Management	https://github.com/xjtushilei/ScriptSpider

Download scriptspider

Filename	Size
scriptspider-0.3.pom
scriptspider-0.3.jar	51 KB
scriptspider-0.3-sources.jar	34 KB
scriptspider-0.3-javadoc.jar	238 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.github.xjtushilei/scriptspider/ -->
<dependency>
    <groupId>com.github.xjtushilei</groupId>
    <artifactId>scriptspider</artifactId>
    <version>0.3</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.github.xjtushilei/scriptspider/
implementation 'com.github.xjtushilei:scriptspider:0.3'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.github.xjtushilei/scriptspider/
implementation ("com.github.xjtushilei:scriptspider:0.3")

Apache Buildr

'com.github.xjtushilei:scriptspider:jar:0.3'

Apache Ivy

<dependency org="com.github.xjtushilei" name="scriptspider" rev="0.3">
  <artifact name="scriptspider" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.github.xjtushilei', module='scriptspider', version='0.3')
)

Scala SBT

libraryDependencies += "com.github.xjtushilei" % "scriptspider" % "0.3"

Leiningen

[com.github.xjtushilei/scriptspider "0.3"]

Dependencies

compile (9)

Group / Artifact	Type	Version
ch.qos.logback : logback-classic	jar	1.2.3
ch.qos.logback : logback-access	jar	1.2.3
ch.qos.logback : logback-core	jar	1.2.3
org.slf4j : slf4j-api	jar	1.7.25
org.jsoup : jsoup	jar	1.11.2
redis.clients : jedis	jar	2.9.0
org.apache.httpcomponents : httpclient	jar	4.5.3
com.google.code.gson : gson	jar	2.8.2
commons-io : commons-io	jar	2.6

Project Modules

There are no modules declared in this project.

ScriptSpider

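ScriptSpider (SS for short) aims to be an easy-to-use crawler framework.

The current feature set already covers most use cases. ScriptSpider will keep moving toward ease of use, extensibility, and up-to-date technology.

Stars and forks are welcome!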

Project home page

International: github
China: coding.net

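Features

Written in Java (a good way to learn Java)
Easy to understand (Chinese comments, many code examples)
Easy to use (a crawler can be started with as little as one line of code)
Little code to write (most functionality is already implemented by default)
Based on Jsoup (convenient for custom page parsing)
Highly extensible (hot-swappable components; every stage can be customized)
Fast (multithreaded crawling, thread-pool management, thread-pool downloading, distributed)
Distributed (based on Redis, MQ, etc.; simple to deploy and fast)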

Usage statistics

Maven Central usage over the last 12 months.

Installation

Using Maven

<dependency>
    <groupId>com.github.xjtushilei</groupId>
    <artifactId>scriptspider</artifactId>
    <version>0.3</version>
    <!-- Please use the latest version where possible. Updated: 2018-03-29 18:37:44 -->
</dependency>

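About versions

Please use the latest version where possible; search Maven Central at http://search.maven.org to find it.

The documentation is always updated against the latest version.

Using the jars offline

Go to the releases directory on the project home page.

Under the latest release, download dependency.zip, the archive that bundles all required dependency jars.

Open your own project and import them.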

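Getting started

Before you start, you should understand how the framework works.

Flow chart

Basically, you only need to provide two modules: the "parser" and the "downloader".

That is because SS cannot know which part of the content you want or where you want to store it.

If the flow chart above is already clear to you, you can start coding right away. Otherwise, read the brief usage introduction below first.

All sample programs can be found in src/main/java/com/github/xjtushilei/example.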
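To make the idea of swappable stages more concrete, here is a minimal, self-contained sketch of a crawl loop in which the downloader, parser, and saver are passed in as interchangeable parts. It is not SS's actual API: the class PipelineSketch, the interfaces Downloader, Parser, and Saver, the crawl method, and the in-memory "site" are all hypothetical names invented for illustration, and the real framework may organize these stages quite differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Hypothetical illustration only; none of these types belong to ScriptSpider.
final class PipelineSketch {
    interface Downloader { String download(String url); }          // fetches a page body, or null
    interface Parser { List<String> extractLinks(String body); }   // finds follow-up URLs
    interface Saver { void save(String url, String body); }        // stores the result

    // Toy parser: treats every whitespace-separated token as a link.
    static final Parser WHITESPACE_PARSER = body -> {
        List<String> links = new ArrayList<>();
        for (String token : body.split("\\s+")) {
            if (!token.isEmpty()) {
                links.add(token);
            }
        }
        return links;
    };

    // Breadth-first crawl from one seed; each stage can be replaced independently.
    static List<String> crawl(String seed, Downloader downloader, Parser parser,
                              Saver saver, int maxPages) {
        Queue<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();      // de-duplicates discovered URLs
        List<String> visited = new ArrayList<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            String url = queue.poll();
            String body = downloader.download(url);
            if (body == null) {
                continue;                        // download failed, skip this URL
            }
            saver.save(url, body);
            visited.add(url);
            for (String link : parser.extractLinks(body)) {
                if (seen.add(link)) {            // true only for URLs not seen before
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a website: URL -> body listing its links.
        Map<String, String> site = new HashMap<>();
        site.put("a", "b c");
        site.put("b", "a c");
        site.put("c", "");
        List<String> visited = crawl("a", site::get, WHITESPACE_PARSER,
                (url, body) -> System.out.println("saved " + url), 10);
        System.out.println(visited);
    }
}
```

Swapping a stage here means passing a different lambda or implementation to crawl; the loop itself does not change. A distributed variant would presumably replace the in-memory queue and seen-set with shared storage such as Redis, but that part is not shown.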

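Minimal Spider

    //Crawl every page reachable from the Jiaotong University News site and print the information to the console!

    Spider.build().addUrlSeed("http://news.xjtu.edu.cn").run();

A single statement is enough to build a crawler!

That works because many default components are provided for you.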

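Minimal multithreaded Spider

    //Crawl every page reachable from the Jiaotong University News site and print the information to the console!
    Spider.build()
          .thread(10)   //number of threads to use
          .addUrlSeed("http://news.xjtu.edu.cn")
          .run();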

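If you do not set the thread option, the default is 5 threads.

Of course, you can use .thread(1) to run single-threaded, although we do not recommend it.

You can even call .thread(-100); the spider still starts and falls back to the default of 5 threads.

On a typical machine, we recommend trying 10 or more threads.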
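The fallback behavior described above can be pictured with a small sketch. This is not SS's source code: the class ThreadCountDemo, the method normalize, and the constant DEFAULT_THREADS are hypothetical names, and the rule that every value below 1 (including 0) falls back to 5 is an assumption inferred only from the .thread(-100) remark.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not part of ScriptSpider.
final class ThreadCountDemo {
    static final int DEFAULT_THREADS = 5;

    // Assumption: any requested value below 1 is replaced by the default.
    static int normalize(int requested) {
        return requested >= 1 ? requested : DEFAULT_THREADS;
    }

    public static void main(String[] args) throws InterruptedException {
        int threads = normalize(-100);           // invalid request, falls back to the default
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            final int id = i;
            pool.submit(() -> System.out.println("worker " + id + " started"));
        }
        pool.shutdown();                         // accept no new tasks, let submitted ones finish
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Normalizing the value before creating the pool matters because Executors.newFixedThreadPool throws IllegalArgumentException for a size of zero or less.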

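Reflections

Designing a framework involves many considerations. My abilities are limited and this is my first design, so please open an issue for anything that looks wrong.
Debugging multithreaded code is painful.
An open-source project takes a lot of energy, and I sometimes went a little crazy tinkering with it. Looking back, it was still quite enjoyable.
If you are interested, join ScriptSpider and let's build a better Java crawler framework together!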

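Background

The background is not very dignified, so it is placed near the end.

I happened to come across a software design competition and found one of the problems interesting. The staff kept delaying the password for the sample files, and my attempt to crack it failed, so I picked another topic on a whim: a crawler.

Join us

Contact me via the email address, QQ, or other details listed on my personal home page.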

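Release notes

V-0.2
- Completed the basic documentation and sample programs. Fixed known bugs.
V-0.1.1
- Completed Redis-based distributed scheduling
V-0.0.1
- Basic crawler functionality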

Versions

Version
0.3 Mar 30, 2018
0.2 Apr 13, 2017
0.1.1 Apr 12, 2017
0.0.1 Apr 12, 2017
