java-rdfa

A SAX-based java RDFa parser.

License

BSD
Categories

Java Languages Net
GroupId

net.rootdev
ArtifactId

java-rdfa
Last Version

1.0.0-BETA1
Release Date

Type

jar
Description

java-rdfa
A SAX-based java RDFa parser.
Project URL

http://rootdev.net/maven/projects/java-rdfa/
Source Code Management

https://github.com/iteggmbh/java-rdfa

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/net.rootdev/java-rdfa/ -->
<dependency>
    <groupId>net.rootdev</groupId>
    <artifactId>java-rdfa</artifactId>
    <version>1.0.0-BETA1</version>
</dependency>

Gradle (Groovy):

// https://jarcasting.com/artifacts/net.rootdev/java-rdfa/
implementation 'net.rootdev:java-rdfa:1.0.0-BETA1'

Gradle (Kotlin):

// https://jarcasting.com/artifacts/net.rootdev/java-rdfa/
implementation ("net.rootdev:java-rdfa:1.0.0-BETA1")

Buildr:

'net.rootdev:java-rdfa:jar:1.0.0-BETA1'

Ivy:

<dependency org="net.rootdev" name="java-rdfa" rev="1.0.0-BETA1">
  <artifact name="java-rdfa" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='net.rootdev', module='java-rdfa', version='1.0.0-BETA1')
)

sbt:

libraryDependencies += "net.rootdev" % "java-rdfa" % "1.0.0-BETA1"

Leiningen:

[net.rootdev/java-rdfa "1.0.0-BETA1"]

Dependencies

compile (3)

Group / Artifact                               Type  Version
org.apache.jena : jena-iri                     jar   3.16.0
org.slf4j : slf4j-api                          jar   1.7.30
net.rootdev : java-rdfa-htmlparser (optional)  jar   1.0.0-BETA1

provided (3)

Group / Artifact              Type  Version
org.apache.jena : jena-core   jar   3.16.0
org.apache.jena : jena-arq    jar   3.16.0
org.slf4j : slf4j-log4j12     jar   1.7.30

test (1)

Group / Artifact   Type  Version
junit : junit      jar   4.5

Project Modules

There are no modules declared in this project.

Welcome to java-rdfa

The cruftiest RDFa parser in the world, I'll bet. Apologies that there isn't much documentation. Things may explode: you have been warned.

Currently passing all conformance tests for XHTML, and the HTML 4 and 5 tests with one exception.

This was written by Damian Steer. It is an offshoot of the Stars Project, which was funded by JISC.

Basic Use

$ ls
htmlparser-1.4.16.jar	java-rdfa-1.0.0-BETA1.jar

$ java -jar java-rdfa-1.0.0-BETA1.jar http://examples.tobyinkster.co.uk/hcard
<http://examples.tobyinkster.co.uk/hcard> <http://xmlns.com/foaf/0.1/primaryTopic> <http://examples.tobyinkster.co.uk/hcard#jack> .
...

or (equivalent):

$ java -cp '*' rdfa.simpleparse http://examples.tobyinkster.co.uk/hcard
<http://examples.tobyinkster.co.uk/hcard> <http://xmlns.com/foaf/0.1/primaryTopic> <http://examples.tobyinkster.co.uk/hcard#jack> .
...

For HTML sources add the format argument, and you will need the validator.nu parser:

$ java -cp '*' rdfa.simpleparse --format HTML http://www.slideshare.net/intdiabetesfed/world-diabetes-day-2009
<http://www.slideshare.net/intdiabetesfed/world-diabetes-day-2009> <http://www.w3.org/1999/xhtml/vocab#stylesheet> <http://public.slidesharecdn.com/v3/styles/combined.css?1265372095> .
...

The output of simpleparse is N-Triples, which is hard to read. If you have jena, try adding it to your classpath and using rdfa.parse instead:

$ java -cp '*:/path/to/jena/lib/*' rdfa.parse --format HTML http://www.slideshare.net/intdiabetesfed/world-diabetes-day-2009
@prefix dc:      <http://purl.org/dc/terms/> .
@prefix hx:      <http://purl.org/NET/hinclude> .
... nice turtle output ...

Java Use

To use the parser directly, without the assistance of an RDF toolkit (a bold choice), implement a StatementSink to collect the triples, then use ParserFactory to make a reader:

XMLReader reader = ParserFactory.createReaderForFormat(sink, Format.XHTML); // or HTML, still an XMLReader
reader.parse(source); // Your sink will be sent triples
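
As a rough illustration of the sink idea, here is a minimal, self-contained sketch of a collector that gathers triples and renders them as N-Triples-style lines. The TripleSink interface below is a hypothetical stand-in defined only for this example, and its method names are assumptions. The real StatementSink interface in java-rdfa may declare different or additional methods, so check the library source before implementing it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the library's StatementSink (names are assumptions).
interface TripleSink {
    void addObject(String subject, String predicate, String object);
    void addLiteral(String subject, String predicate, String lex, String lang, String datatype);
}

// Collects every triple it is sent as one N-Triples-style line.
class CollectingSink implements TripleSink {
    private final List<String> lines = new ArrayList<>();

    @Override
    public void addObject(String subject, String predicate, String object) {
        lines.add(term(subject) + " " + term(predicate) + " " + term(object) + " .");
    }

    @Override
    public void addLiteral(String subject, String predicate, String lex, String lang, String datatype) {
        StringBuilder literal = new StringBuilder("\"").append(escape(lex)).append("\"");
        if (lang != null && !lang.isEmpty()) {
            literal.append("@").append(lang);                    // language-tagged literal
        } else if (datatype != null) {
            literal.append("^^<").append(datatype).append(">");  // typed literal
        }
        lines.add(term(subject) + " " + term(predicate) + " " + literal + " .");
    }

    List<String> getLines() {
        return lines;
    }

    // Assumption: blank nodes arrive as "_:label"; anything else is treated as an IRI.
    private static String term(String value) {
        return value.startsWith("_:") ? value : "<" + value + ">";
    }

    // Escapes backslashes first, then quotes and newlines.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }
}
```

With the real library, an object playing the role of CollectingSink would be the sink handed to the factory call shown above, and getLines() would be read once parsing has finished.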

java-rdfa can be used from jena. Simply invoke:

Class.forName("net.rootdev.javardfa.RDFaReader");

This hooks the two readers into jena, after which you will be able to:

model.read(url, "XHTML"); // xml parsing
model.read(other, "HTML"); // html parsing

java-rdfa is available in the maven central repositories. Note that it does not depend on jena.

A sesame reader provided by Henry Story is also available.

Open Graph Protocol

A very simple OGP reader is provided. This follows what (I think) Toby Inkster did:

    Map<String, String> prop =
        OGPReader.getOGP("http://uk.rottentomatoes.com/m/1217700-kick_ass",
                         Format.HTML);

Result:

    title => 'Kick-Ass'
    http://www.facebook.com/2008/fbml#app_id => '326803741017'
    http://www.w3.org/1999/xhtml/vocab#icon => 'http://images.rottentomatoes.com/images/icons/favicon.ico'
    http://www.w3.org/1999/xhtml/vocab#stylesheet => 'http://images.rottentomatoes.com/files/inc_beta/generated/css/mob.css'
    image => 'http://images.rottentomatoes.com/images/movie/custom/00/1217700.jpg'
    site_name => 'Rotten Tomatoes'
    type => 'movie'
    url => 'http://www.rottentomatoes.com/m/1217700-kick_ass/'
    http://www.facebook.com/2008/fbml#admins => '1106591'
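
The result above mixes bare OGP names (title, image, ...) with keys that are full URIs. A small sketch of separating the two is shown here. OgpFilter and shortKeysOnly are hypothetical helper names, not part of java-rdfa, and the sketch assumes that bare names never contain a colon.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of java-rdfa.
class OgpFilter {
    // Keeps only entries whose key is a bare name, dropping entries keyed by a full URI.
    // Assumption: bare OGP names never contain ':' while URI keys always do.
    static Map<String, String> shortKeysOnly(Map<String, String> props) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : props.entrySet()) {
            if (!entry.getKey().contains(":")) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

In practice the input map would be the one returned by OGPReader.getOGP; for the result listed above, this would keep title, image, site_name, type and url.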

Form Mode

There is a secret form mode (which prompted the development of this parser). In this mode you can generate basic graph patterns by including ?variables where CURIEs are allowed, and INPUT tags generate @name variables.

Simple example (from the tests) and the query that results.

Changes

1.0.0

  • Port to jena-3, finally using jena-3.16.0 (courtesy of user yevster)
  • Make test cases operational again.
  • Introduce modern maven standards.
  • Compile for java 8 onwards to encompass the move of jena-3 to java 8.
  • Restructure the project to make it work with maven-release-plugin.
  • Hopefully preserving all the valuable work of shellac for the next decade!

0.4

  • (Finally) support overlapping literals. No one noticed this didn't work!
  • Added turtle-ish output. Slightly less nasty than N-Triples.
  • Bug fixes...
  • Turned OFF html 5 streaming. Such a bad idea on my part.
  • Started RDFa 1.1 support.
  • Added simple OGP reader.

0.3

  • Updated to current conformance tests
  • Switched validator.nu to streaming mode (may live to regret this).
  • Created very simple n-triple and rdf/xml streaming serialisers.
  • Usual bug fixes etc.
  • Jena is now a provided maven dependency. Using java-rdfa won't pull in jena.
  • Sesame reader created by Henry Story added. Can't be added to central maven repository since Sesame isn't available, so spun out in small module.
  • Tests for query, and some utilities.
net.rootdev

ITEG IT-Engineers GmbH

Versions

Version
1.0.0-BETA1
0.4.2
0.4.2-RC2
0.4.2-RC1
0.4.1
0.4
0.3
0.2.1
0.2
0.1.1
0.1