logtrix

Parses and summarises Heritrix crawl logs

License	Apache License, Version 2.0
Categories	Net
GroupId	org.netpreserve
ArtifactId	logtrix
Last Version	0.1.0
Release Date	May 15, 2019
Type	jar
Description	logtrix: Parses and summarises Heritrix crawl logs
Project URL	https://github.com/iipc/logtrix
Source Code Management	https://github.com/iipc/logtrix

Download logtrix

Filename	Size
logtrix-0.1.0.pom
logtrix-0.1.0.jar	22 KB
logtrix-0.1.0-sources.jar	13 KB
logtrix-0.1.0-javadoc.jar	49 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.netpreserve/logtrix/ -->
<dependency>
    <groupId>org.netpreserve</groupId>
    <artifactId>logtrix</artifactId>
    <version>0.1.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.netpreserve/logtrix/
implementation 'org.netpreserve:logtrix:0.1.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.netpreserve/logtrix/
implementation ("org.netpreserve:logtrix:0.1.0")

Apache Buildr

'org.netpreserve:logtrix:jar:0.1.0'

Apache Ivy

<dependency org="org.netpreserve" name="logtrix" rev="0.1.0">
  <artifact name="logtrix" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.netpreserve', module='logtrix', version='0.1.0')
)

Scala SBT

libraryDependencies += "org.netpreserve" % "logtrix" % "0.1.0"

Leiningen

[org.netpreserve/logtrix "0.1.0"]

Dependencies

compile (4)

Group / Artifact	Type	Version
org.slf4j : slf4j-api	jar	1.7.25
com.fasterxml.jackson.core : jackson-databind	jar	2.9.8
com.fasterxml.jackson.datatype : jackson-datatype-jsr310	jar	2.9.8
com.google.guava : guava	jar	27.1-jre

test (2)

Group / Artifact	Type	Version
junit : junit	jar	4.12
org.slf4j : slf4j-simple	jar	1.7.25

Project Modules

There are no modules declared in this project.

logtrix

Examples

Parsing a log file

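// try-with-resources closes the underlying log file when iteration finishes
try (CrawlLogIterator log = new CrawlLogIterator(Paths.get("crawl.log"))) {
    // each CrawlDataItem corresponds to one line of the crawl log
    for (CrawlDataItem line : log) {
        System.out.println(line.getStatusCode());
        System.out.println(line.getURL());
    }
}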
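
To make concrete what a single line of the log carries, here is a minimal sketch that splits one line by hand using only the JDK. It does not use logtrix and is not how CrawlLogIterator is implemented. It assumes the standard Heritrix crawl.log layout of twelve whitespace-separated fields (log timestamp, status code, size, URI, discovery path, referrer, MIME type, worker thread, fetch timestamp, digest, source tag, annotations). The class name CrawlLogLineSketch and the sample line are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CrawlLogLineSketch {
    /** Splits one crawl.log line into at most 12 whitespace-separated fields. */
    public static List<String> splitFields(String line) {
        // The limit of 12 keeps any spaces inside the final (annotations) field intact.
        return Arrays.asList(line.trim().split("\\s+", 12));
    }

    /** Field 2 (index 1) is assumed to be the fetch status code. */
    public static int statusCode(String line) {
        return Integer.parseInt(splitFields(line).get(1));
    }

    /** Field 4 (index 3) is assumed to be the fetched URI. */
    public static String url(String line) {
        return splitFields(line).get(3);
    }

    public static void main(String[] args) {
        // Hypothetical sample line, not taken from a real crawl.
        String sample = "2019-05-15T01:02:03.456Z   404       1234 "
                + "http://example.org/missing L http://example.org/ text/html "
                + "#001 20190515010203123+45 sha1:EXAMPLEDIGEST - -";
        System.out.println(statusCode(sample)); // prints 404
        System.out.println(url(sample));        // prints http://example.org/missing
    }
}
```

For real crawl logs the library's iterator is the better choice, since it already exposes typed accessors such as getStatusCode() and getURL().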

Grouping the summary by registered domain, host, or a custom key

CrawlSummary.byRegisteredDomain(log);
CrawlSummary.byHost(log);
CrawlSummary.byKey(log, item -> item.getCaptureBegan().toString().substring(0, 4)); // by year
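
The byKey call above takes a function that maps each item to a grouping key. The idea can be sketched with plain JDK streams, assuming the summary is reduced to a simple count per key. This is an illustration only, not the CrawlSummary implementation; GroupByKeySketch, countByKey, and the timestamps are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class GroupByKeySketch {
    /** Counts items per key, where the key is derived by the supplied function. */
    public static <T> Map<String, Long> countByKey(List<T> items, Function<T, String> keyFunction) {
        // TreeMap keeps the keys sorted so the output order is predictable.
        return items.stream()
                .collect(Collectors.groupingBy(keyFunction, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up capture timestamps standing in for item.getCaptureBegan().toString().
        List<String> captureTimes = Arrays.asList(
                "2018-12-31T23:59:59Z",
                "2019-01-01T00:00:00Z",
                "2019-05-15T10:20:30Z");

        // Same key function as the "by year" example: the first four characters.
        Map<String, Long> perYear = countByKey(captureTimes, t -> t.substring(0, 4));
        System.out.println(perYear); // prints {2018=1, 2019=2}
    }
}
```

Any other key function works the same way, for example one that returns the URL scheme or the MIME type.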

Limiting to the top N results

CrawlSummary.build(log).topN(10); // top 10 status codes, MIME types, etc.
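
What "top N" means for a table of counts can be sketched as follows, assuming the entries are ranked by count in descending order with ties broken by key. This is a guess at the general technique, not the library's topN implementation; TopNSketch and the counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TopNSketch {
    /** Returns the n entries with the highest counts, highest first. */
    public static Map<String, Long> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        // Sort by count descending; break ties by key so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        // LinkedHashMap preserves the ranked order in the returned map.
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : entries.subList(0, Math.min(n, entries.size()))) {
            result.put(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts per status code.
        Map<String, Long> statusCounts = new LinkedHashMap<>();
        statusCounts.put("301", 3L);
        statusCounts.put("200", 90L);
        statusCounts.put("-4", 1L);
        statusCounts.put("404", 7L);

        System.out.println(topN(statusCounts, 2)); // prints {200=90, 404=7}
    }
}
```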

Working with status codes

StatusCodes.describe(404);      // "Not found"
StatusCodes.describe(-4);       // "HTTP timeout"
StatusCodes.isError(-4);        // true
StatusCodes.isServerError(503); // true

Command-line interface

Output a JSON crawl summary grouped by registered domain:

java -jar target/*.jar -g registered-domain crawl.log

For more options:

java -jar target/*.jar --help

Compiling

Install Maven and then run:

mvn package

IIPC

International Internet Preservation Consortium

Versions

Version	Release Date
0.1.0	May 15, 2019
