Gecco-Redis

Gecco Crawler Support Distributed By Redis

License

License

Categories

Categories

Redis Data Databases
GroupId

GroupId

com.geccocrawler
ArtifactId

ArtifactId

gecco-redis
Last Version

Last Version

1.3.0
Release Date

Release Date

Type

Type

jar
Description

Description

Gecco-Redis
Gecco Crawler Support Distributed By Redis
Project URL

Project URL

https://github.com/xtuhcy/gecco-redis
Source Code Management

Source Code Management

https://github.com/xtuhcy/gecco-redis

Download gecco-redis

How to add to project

<!-- https://jarcasting.com/artifacts/com.geccocrawler/gecco-redis/ -->
<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco-redis</artifactId>
    <version>1.3.0</version>
</dependency>
// https://jarcasting.com/artifacts/com.geccocrawler/gecco-redis/
implementation 'com.geccocrawler:gecco-redis:1.3.0'
// https://jarcasting.com/artifacts/com.geccocrawler/gecco-redis/
implementation ("com.geccocrawler:gecco-redis:1.3.0")
'com.geccocrawler:gecco-redis:jar:1.3.0'
<dependency org="com.geccocrawler" name="gecco-redis" rev="1.3.0">
  <artifact name="gecco-redis" type="jar" />
</dependency>
@Grapes(
@Grab(group='com.geccocrawler', module='gecco-redis', version='1.3.0')
)
libraryDependencies += "com.geccocrawler" % "gecco-redis" % "1.3.0"
[com.geccocrawler/gecco-redis "1.3.0"]

Dependencies

compile (2)

Group / Artifact Type Version
org.redisson : redisson jar 2.8.2
com.geccocrawler : gecco jar 1.3.0

Project Modules

There are no modules declared in this project.

gecco-redis

Gecko crawler supports distributed by redis

gecco爬虫通过将初始的start request放入redis的队列进行分布式抓取,sprider线程首先从redis中获取初始抓取请求,在线程内完成子请求的抓取任务。完成后再从redis中获取下一个初始请求。这样实现了初始请求的深度优先策略,子请求的广度优先策略。

建议分布式抓取策略下,抓取线程的数量要小于等于初始请求(start request)的数量。

##Download

<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco-redis</artifactId>
    <version>1.0.6</version>
</dependency>

##QuickStart

//每台服务器启动1个GeccoEngine服务
//启动GeccoEngine时,设置RedisStartScheduler
GeccoEngine.create()
.scheduler(new RedisStartScheduler("127.0.0.1:6379"))
.classpath("com.geccocrawler.gecco.demo")
//开始抓取的页面地址
.start("https://github.com/xtuhcy/gecco")
//开启几个爬虫线程
.thread(1)
//单个爬虫每次抓取完一个请求后的间隔时间
.interval(2000)
.run();

##DEMO 参考源代码中测试用例src/test

Versions

Version
1.3.0
1.2.8
1.1.3
1.0.5
1.0.4
1.0.3
1.0.0