Html2Sax

HTML parser that maps to the Java SAX interface.

License	GNU Lesser General Public License 2.1
GroupId	de.sfuhrm
ArtifactId	html2sax
Last Version	2.1.4
Release Date	Sep 10, 2017
Type	jar
Description	HTML parser that maps to the Java SAX interface.
Project URL	https://github.com/sfuhrm/html2sax
Source Code Management	https://github.com/sfuhrm/html2sax

Download html2sax

Filename	Size
html2sax-2.1.4.pom
html2sax-2.1.4.jar	49 KB
html2sax-2.1.4-sources.jar	28 KB
html2sax-2.1.4-javadoc.jar	83 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/de.sfuhrm/html2sax/ -->
<dependency>
    <groupId>de.sfuhrm</groupId>
    <artifactId>html2sax</artifactId>
    <version>2.1.4</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/de.sfuhrm/html2sax/
implementation 'de.sfuhrm:html2sax:2.1.4'

Gradle Kotlin

// https://jarcasting.com/artifacts/de.sfuhrm/html2sax/
implementation ("de.sfuhrm:html2sax:2.1.4")

Apache Buildr

'de.sfuhrm:html2sax:jar:2.1.4'

Apache Ivy

<dependency org="de.sfuhrm" name="html2sax" rev="2.1.4">
  <artifact name="html2sax" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='de.sfuhrm', module='html2sax', version='2.1.4')
)

Scala SBT

libraryDependencies += "de.sfuhrm" % "html2sax" % "2.1.4"

Leiningen

[de.sfuhrm/html2sax "2.1.4"]

Dependencies

compile (1)

Group / Artifact	Type	Version
org.slf4j : slf4j-api	jar	1.7.25

test (1)

Group / Artifact	Type	Version
junit : junit Optional	jar	4.12

Project Modules

There are no modules declared in this project.

html2sax

html2sax is a parser for HTML documents. It reads HTML documents and reports their contents through callbacks of the Java(tm) SAX API.

Background

Many HTML documents on the web are partly malformed. Many page authors don't care about correctness as long as their browser manages to display the page. Web browsers have specially adapted parsers that can repair malformed HTML pages to a certain degree. Unfortunately, there is no standard for correcting mistakes, so every browser behaves differently on invalid pages. A tool that works with the web needs to parse web pages. There is a lot of fancy XML technology out there, but HTML is not XML. XHTML is XML, but it is not very widespread today.

Purpose

The original intention was to have a really simple HTML parser that just splits an HTML document into its lexical parts (tags, attributes, and text). The parser should handle errors gracefully and continue after them. It should not try to repair documents, because the goal was to extract certain parts using XPath queries. There is also a Java version of HTML Tidy, which does its best to repair malformed HTML documents. It works quite well, but in my opinion it is too much for some applications.

The library

html2sax is designed to be the front end of a web spider reading websites. It can handle (almost?) all error situations, but will not try to correct problematic HTML pages. It operates on a very low level and is quite fast: tests showed that it is twice as fast as HTML Tidy. html2sax works as a SAX parser. SAX is usually meant for real XML, and html2sax will emit malformed XML that may confuse SAX-using code, so don't expect much existing SAX code to work with this 'weak' parser. The parser was written using a simple DFA (deterministic finite automaton).
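
To illustrate what a DFA-based lexer looks like in general, here is a minimal sketch with only two states. It is not the html2sax implementation; the class `TinyHtmlDfa`, its states, and the `TEXT:`/`TAG:` token prefixes are made up for this example, and it ignores attributes, comments, and entities.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-state DFA, for illustration only (not html2sax code).
class TinyHtmlDfa {
    enum State { TEXT, TAG }

    // Splits the input into text runs and tag bodies without repairing anything.
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buf = new StringBuilder();
        State state = State.TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        // Leaving text: emit what was collected so far.
                        if (buf.length() > 0) {
                            tokens.add("TEXT:" + buf);
                            buf.setLength(0);
                        }
                        state = State.TAG;
                    } else {
                        buf.append(c);
                    }
                    break;
                case TAG:
                    if (c == '>') {
                        tokens.add("TAG:" + buf);
                        buf.setLength(0);
                        state = State.TEXT;
                    } else {
                        buf.append(c);
                    }
                    break;
            }
        }
        // End of input: flush the buffer, even if a tag was never closed.
        if (buf.length() > 0) {
            tokens.add((state == State.TEXT ? "TEXT:" : "TAG:") + buf);
        }
        return tokens;
    }
}
```

Calling `TinyHtmlDfa.tokenize("<p>Hi <b>there</p")` yields the tokens `TAG:p`, `TEXT:Hi `, `TAG:b`, `TEXT:there`, `TAG:/p`. The unclosed `<b>` and the truncated `</p` are passed through as they are, which mirrors the "report, don't repair" idea described above.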

Features

The parser supports the following features:

Speed.
Simple.
Pure Java(tm).
Using the well-understood and fast SAX API.
JUnit test cases.
HTML entity expansion (example: "&amp;" becomes "&").
Handles errors gracefully and will continue close to the error.
No DTD logic that won't work with many documents anyway.
No structure-level repair efforts that may fail.
Full source provided.

Restrictions

There are several restrictions for html2sax that you should be aware of:

HTML is not XML. SAX serves as the API for the callbacks, but some existing tools with a SAX input interface will fail on html2sax input because the documents are not well-formed.
No DTD support. You need to do the HTML thinking yourself.
Won't protect your parser callback from senseless trash if documents are really weird.
Won't repair corrupt documents.
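
Because the event stream may be unbalanced, a callback should not assume that every end tag matches the most recent start tag. The following is a minimal sketch of a defensive handler, assuming the tag name arrives in `qName` (falling back to `localName`). The class name `TolerantDepthHandler` and its fields are hypothetical and not part of html2sax.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that tolerates unbalanced start/end events.
class TolerantDepthHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();
    int strayEndTags = 0;

    // Assumption: prefer qName, fall back to localName; compare in lower case.
    private static String name(String localName, String qName) {
        String n = (qName == null || qName.isEmpty()) ? localName : qName;
        return n.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        open.push(name(localName, qName));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String n = name(localName, qName);
        if (!open.contains(n)) {
            // End tag without a matching start tag: count it and move on.
            strayEndTags++;
            return;
        }
        // Pop up to and including the matching element; anything above it
        // was never closed in the document and is closed implicitly here.
        while (!open.pop().equals(n)) {
            // keep popping
        }
    }

    int depth() {
        return open.size();
    }
}
```

The stack-based matching is one possible policy. A callback that only extracts text or links may not need any nesting information and can ignore end tags completely.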

Requirements

The only requirement for the parser is a Java(tm) 1.6 JRE.

Example

Usage is quite simple. The following example runs the parser:

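    // Needs: javax.xml.parsers.SAXParser, javax.xml.parsers.SAXParserFactory,
    //        org.xml.sax.InputSource, java.net.URL

    // Ask JAXP for the html2sax factory by class name (null = default class loader).
    SAXParserFactory factory =
        SAXParserFactory.newInstance(
        "de.sfuhrm.htmltosax.HtmlToSaxParserFactory",
        null);
    SAXParser parser = factory.newSAXParser();
    // YourCallback is your own subclass of org.xml.sax.helpers.DefaultHandler.
    YourCallback s = new YourCallback();
    // Parse the document at the URL given as the first command-line argument.
    parser.parse(new InputSource(
        new URL(args[0]).openStream()), s);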

A working example is in the file Sample.java in the source distribution.
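
As a rough idea of what a callback such as `YourCallback` could look like, here is a minimal sketch of a handler that collects link targets and text. The class name `LinkCollector` and its fields are hypothetical. It assumes that tag and attribute names arrive in `qName` (falling back to the local name) and may be in any case; check Sample.java for how html2sax actually reports them.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for YourCallback: records href values and text.
class LinkCollector extends DefaultHandler {
    final List<String> links = new ArrayList<>();
    final StringBuilder text = new StringBuilder();

    // Assumption: prefer the qualified name, fall back to the local name.
    private static String pick(String qName, String localName) {
        return (qName == null || qName.isEmpty()) ? localName : qName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (!"a".equalsIgnoreCase(pick(qName, localName))) {
            return;
        }
        // HTML attribute names are case-insensitive, so scan them by hand.
        for (int i = 0; i < attrs.getLength(); i++) {
            String attrName = pick(attrs.getQName(i), attrs.getLocalName(i));
            if ("href".equalsIgnoreCase(attrName)) {
                links.add(attrs.getValue(i));
            }
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may be delivered in several chunks; append each one.
        text.append(ch, start, length);
    }
}
```

An instance of such a handler would be passed as the second argument to `parser.parse(...)`, and `links` and `text` read after the call returns.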

Download

You can either download the library from the releases section on GitHub

https://github.com/sfuhrm/html2sax/releases

or add this dependency to your Maven pom:

    <dependency>
        <groupId>de.sfuhrm</groupId>
        <artifactId>html2sax</artifactId>
        <version>2.1.4</version>
    </dependency>

Author & License

Author

html2sax was written by Stephan Fuhrmann. You can reach me at s (at) sfuhrm.de.

License

The library is licensed under the LGPL 2.1.

Versions

Version
2.1.4 Sep 10, 2017
2.1.3 Jan 26, 2017
