Html2Sax

HTML parser that maps to the Java SAX interface.

GroupId: de.sfuhrm
ArtifactId: html2sax
Last Version: 2.1.4
Type: jar
Description: HTML parser that maps to the Java SAX interface.
Project URL: https://github.com/sfuhrm/html2sax
Source Code Management: https://github.com/sfuhrm/html2sax

Download html2sax

How to add to project

Maven:

    <!-- https://jarcasting.com/artifacts/de.sfuhrm/html2sax/ -->
    <dependency>
        <groupId>de.sfuhrm</groupId>
        <artifactId>html2sax</artifactId>
        <version>2.1.4</version>
    </dependency>

Gradle (Groovy DSL):

    // https://jarcasting.com/artifacts/de.sfuhrm/html2sax/
    implementation 'de.sfuhrm:html2sax:2.1.4'

Gradle (Kotlin DSL):

    // https://jarcasting.com/artifacts/de.sfuhrm/html2sax/
    implementation ("de.sfuhrm:html2sax:2.1.4")

Buildr:

    'de.sfuhrm:html2sax:jar:2.1.4'

Ivy:

    <dependency org="de.sfuhrm" name="html2sax" rev="2.1.4">
      <artifact name="html2sax" type="jar" />
    </dependency>

Grape:

    @Grapes(
    @Grab(group='de.sfuhrm', module='html2sax', version='2.1.4')
    )

SBT:

    libraryDependencies += "de.sfuhrm" % "html2sax" % "2.1.4"

Leiningen:

    [de.sfuhrm/html2sax "2.1.4"]

Dependencies

compile (1)

Group / Artifact        Type   Version
org.slf4j : slf4j-api   jar    1.7.25

test (1)

Group / Artifact   Optional   Type   Version
junit : junit      yes        jar    4.12

Project Modules

There are no modules declared in this project.

html2sax


html2sax is a parser for HTML documents. It reads HTML documents and reports their content through callbacks of the Java(tm) SAX API.

Background

There are many partly malformed HTML documents on the web. Many page authors don't care about correctness as long as their browser manages to display the page. Web browsers have specially adapted parsers that can repair malformed HTML to a certain degree. Unfortunately, there is no standard for correcting mistakes, so every browser behaves differently on broken pages. Any tool that works with the web needs to parse web pages. There is a lot of fancy XML technology out there, but HTML is not XML. XHTML is XML, but it is not very widespread today.

Purpose

The original intention was to have a really simple HTML parser that just splits an HTML document into its lexical parts (tags, attributes, and text). The parser should handle errors gracefully and continue after them. It should not try to repair documents, because the goal was to extract certain parts using XPath queries. There is also a Java version of HTML Tidy, which does its best to repair malformed HTML documents. It works quite well, but in my opinion it is too much for some applications.

The library

html2sax is designed to be the front end for a web spider reading websites. It can handle (almost?) all error situations, but will not try to correct problematic HTML pages. It operates on a very low level and is quite fast. Tests showed that it is twice as fast as HTML Tidy. html2sax works as a SAX parser. SAX is normally designed to handle real XML, whereas html2sax emits events for documents that are not well-formed XML, which may confuse code built on SAX. Don't expect much existing SAX code to work with this 'weak' parser. The parser was written as a simple deterministic finite automaton (DFA).
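To illustrate what a DFA-style lexer for HTML looks like, here is a minimal sketch. It is not the html2sax implementation: the class `TinyHtmlLexer`, its two states, and the `TEXT:`/`TAG:` token format are assumptions made up for this example. It only shows the idea of walking the input one character at a time, switching state on `<` and `>`, and never failing on broken input.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical two-state lexer; illustrates the DFA idea, not html2sax itself. */
class TinyHtmlLexer {
    private enum State { TEXT, TAG }

    /** Splits the input into "TEXT:..." and "TAG:..." tokens. */
    static List<String> tokenize(String html) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder buf = new StringBuilder();
        State state = State.TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        // Leaving text: emit what was collected so far.
                        if (buf.length() > 0) {
                            tokens.add("TEXT:" + buf);
                            buf.setLength(0);
                        }
                        state = State.TAG;
                    } else {
                        buf.append(c);
                    }
                    break;
                case TAG:
                    if (c == '>') {
                        // Leaving a tag: emit its raw content.
                        tokens.add("TAG:" + buf);
                        buf.setLength(0);
                        state = State.TEXT;
                    } else {
                        buf.append(c);
                    }
                    break;
            }
        }
        // End of input: flush the remainder, even an unterminated tag,
        // instead of reporting an error.
        if (buf.length() > 0) {
            tokens.add((state == State.TEXT ? "TEXT:" : "TAG:") + buf);
        }
        return tokens;
    }
}
```

Calling `TinyHtmlLexer.tokenize("<p>Hi<b>there")` yields the tokens `TAG:p`, `TEXT:Hi`, `TAG:b`, `TEXT:there`. The missing end tags are neither reported nor repaired, which mirrors the "no repair" approach described above.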

Features

The parser supports the following features:

  • Speed.
  • Simplicity.
  • Pure Java(tm).
  • Using the well-understood and fast SAX API.
  • JUnit test cases.
  • HTML entity expansion (example: "&amp;" becomes "&").
  • Handles errors gracefully and resumes parsing close to the error.
  • No DTD logic, which would not work with many documents anyway.
  • No structure-level repair efforts that may fail.
  • Full source provided.
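As a rough illustration of what entity expansion means, the following sketch replaces a handful of named entities in a string. It is not taken from html2sax: the class `EntityExpander` and its four-entry table are assumptions for this example, and the real parser may support a different set of entities and behave differently on unknown ones.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper; shows the idea of entity expansion only. */
class EntityExpander {
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    /** Replaces known named entities and copies everything else unchanged. */
    static String expand(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            // Look for the terminating ';' only when an entity may start here.
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            String replacement =
                (semi > 0) ? NAMED.get(text.substring(i + 1, semi)) : null;
            if (replacement != null) {
                out.append(replacement);
                i = semi + 1;
            } else {
                // Unknown or unterminated entity: keep the character as it is.
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

In this sketch, unknown entities such as `&bogus;` and a lone `&` pass through unchanged, in the same spirit as the graceful error handling listed above.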

Restrictions

There are several restrictions for html2sax that you should be aware of:

  • HTML is not XML. SAX serves as the callback API, but some existing tools with a SAX input interface will fail on html2sax input because the documents are not well-formed.
  • No DTD support. You need to apply your own knowledge of HTML.
  • Won't shield your parser callback from nonsensical input if documents are really weird.
  • Won't repair corrupt documents.
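Because of these restrictions, a callback should not assume that start and end events are balanced. The following sketch shows one defensive way to write such a handler. The class `TolerantHandler` is hypothetical, and the sketch assumes that element names arrive in the `qName` argument, which may not match how html2sax reports them.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Locale;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that tolerates unbalanced start and end events. */
class TolerantHandler extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<String>();
    private int strayEndTags = 0;

    // HTML element names are case-insensitive, so normalize them.
    private static String normalize(String qName) {
        return qName == null ? "" : qName.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        open.push(normalize(qName));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String name = normalize(qName);
        if (!open.contains(name)) {
            // End tag without a matching start tag: count it and move on.
            strayEndTags++;
            return;
        }
        // Pop up to and including the matching element; this implicitly
        // closes children that were never closed, such as <li> or <p>.
        while (!open.pop().equals(name)) {
            // keep popping
        }
    }

    int depth() { return open.size(); }

    int strayEndTags() { return strayEndTags; }
}
```

The handler keeps its own view of the open elements and never throws on unexpected events, so a stray `</p>` or an unclosed `<li>` does not derail the rest of the processing.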

Requirements

The only requirement for the parser is a Java(tm) 1.6 JRE.

Example

Usage is quite simple. The following example runs the parser:


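    import java.net.URL;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.InputSource;

    // Select the html2sax factory explicitly by its class name.
    SAXParserFactory factory = SAXParserFactory.newInstance(
        "de.sfuhrm.htmltosax.HtmlToSaxParserFactory", null);
    SAXParser parser = factory.newSAXParser();
    // YourCallback is your own subclass of org.xml.sax.helpers.DefaultHandler.
    YourCallback s = new YourCallback();
    // Parse the page at the URL given as the first command-line argument.
    parser.parse(new InputSource(
        new URL(args[0]).openStream()), s);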

A working example is in the file Sample.java in the source distribution.
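For readers wondering what a callback class might look like, here is a minimal sketch of a handler that collects link targets. The class `LinkCollector` is hypothetical and not part of html2sax. It assumes that element and attribute names arrive in `qName` (falling back to the local name), and it compares names case-insensitively because the exact casing reported by the parser is not known here.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical callback that collects the href values of <a> elements. */
class LinkCollector extends DefaultHandler {
    private final List<String> links = new ArrayList<String>();

    // Prefer the qualified name and fall back to the local name.
    private static String pick(String qName, String localName) {
        return (qName != null && !qName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes attributes) {
        if (!"a".equalsIgnoreCase(pick(qName, localName))) {
            return;
        }
        for (int i = 0; i < attributes.getLength(); i++) {
            String attrName =
                pick(attributes.getQName(i), attributes.getLocalName(i));
            if ("href".equalsIgnoreCase(attrName)) {
                links.add(attributes.getValue(i));
            }
        }
    }

    List<String> getLinks() {
        return links;
    }
}
```

Such a handler can be passed as the second argument to `parser.parse(...)`; after parsing, `getLinks()` returns the collected values in document order.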

Download

You can either download the library from the releases section on GitHub

https://github.com/sfuhrm/html2sax/releases

or add this dependency to your Maven pom:


    <dependency>
        <groupId>de.sfuhrm</groupId>
        <artifactId>html2sax</artifactId>
        <version>2.1.4</version>
    </dependency>

Author & License

Author

html2sax was written by Stephan Fuhrmann. You can reach me at s (at) sfuhrm.de.

License

The library is licensed under the LGPL 2.1.

Versions

Version
2.1.4
2.1.3