Graph Mapper

Tools for mapping structured data (e.g. a CSV file) into a network graph format

License	Apache Software License v2.0
GroupId	uk.gov.nca.graph
ArtifactId	mapper
Last Version	1.1
Release Date	Nov 7, 2018
Type	jar
Description	Graph Mapper: tools for mapping structured data (e.g. a CSV file) into a network graph format
Project URL	https://github.com/NationalCrimeAgency/graph-mapper
Source Code Management	https://github.com/NationalCrimeAgency/graph-mapper

Download mapper

Filename	Size
mapper-1.1.pom
mapper-1.1.jar	48 KB
mapper-1.1-sources.jar	31 KB
mapper-1.1-shaded.jar	36 MB
mapper-1.1-javadoc.jar	498 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/uk.gov.nca.graph/mapper/ -->
<dependency>
    <groupId>uk.gov.nca.graph</groupId>
    <artifactId>mapper</artifactId>
    <version>1.1</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/uk.gov.nca.graph/mapper/
implementation 'uk.gov.nca.graph:mapper:1.1'

Gradle Kotlin

// https://jarcasting.com/artifacts/uk.gov.nca.graph/mapper/
implementation ("uk.gov.nca.graph:mapper:1.1")

Apache Buildr

'uk.gov.nca.graph:mapper:jar:1.1'

Apache Ivy

<dependency org="uk.gov.nca.graph" name="mapper" rev="1.1">
  <artifact name="mapper" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='uk.gov.nca.graph', module='mapper', version='1.1')
)

Scala SBT

libraryDependencies += "uk.gov.nca.graph" % "mapper" % "1.1"

Leiningen

[uk.gov.nca.graph/mapper "1.1"]

Dependencies

compile (11)

Group / Artifact	Type	Version
uk.gov.nca.graph : utils	jar	1.1
commons-cli : commons-cli	jar	1.4
commons-io : commons-io	jar	2.6
org.elasticsearch.client : elasticsearch-rest-high-level-client	jar	6.4.2
org.apache.tinkerpop : gremlin-core	jar	3.3.4
org.apache.tinkerpop : tinkergraph-gremlin	jar	3.3.4
com.fasterxml.jackson.core : jackson-databind	jar	2.9.7
net.andreinc.mockneat : mockneat	jar	0.2.4
com.opencsv : opencsv	jar	4.3.2
com.google.re2j : re2j	jar	1.2
org.slf4j : slf4j-api	jar	1.7.25

test (2)

Group / Artifact	Type	Version
junit : junit	jar	4.12
com.h2database : h2	jar	1.4.197

Project Modules

There are no modules declared in this project.

This project is currently unmaintained. Please use with caution.

Graph Mapper

This tool maps from a structured data source (e.g. CSV or JSON-lines) to a graph format (e.g. GraphML, or directly into a Gremlin database). It is fully configurable, and designed to be schema-agnostic.

Using the tool

To run the tool, use the following command:

java -cp mapper-1.1.jar uk.gov.nca.graph.mapper.cli.MapDataToGraph

This will display a help message detailing the different command line options. Some of these command line options are required, and the help message is displayed if these aren't provided. The full list of options is as follows.

Short Option	Long Option	Default Value	Required	Description
c	config		Yes	The mapping configuration file with which to parse the data. See below for full details.
d	data		Yes	The input data file to process and convert into a graph, or the JDBC connection string if using SQL.
f	format	CSV	No	The format that the data file is in. Possible options are CSV, TSV, JSON, JSONL (for JSON-Lines) or SQL (case-insensitive).
g	graph		Yes	The Tinkerpop graph configuration file (follows the standard Tinkerpop format). Examples of this file for GraphML and OrientDB are provided in the `examples/` folder.
h	headers	false	No	The CSV/TSV file has a header row as the first row.
t	table		If using SQL format	The SQL table to process
u	username		No	The username for the SQL database (authentication will not be used if this isn't supplied)
p	password		No	The password for the SQL database (authentication will not be used if this isn't supplied)
q	query		No	The SQL query to use to select data (if provided, `table` will be ignored)
	prov		No	If provided, then the given prov key will be added to every element

An example full command would therefore be as follows:

java -cp mapper-1.1.jar uk.gov.nca.graph.mapper.cli.MapDataToGraph -c example/companies/companies.map -d example/companies/companies.jsonl -g example/graphml.properties -f JSONL -prov abc123

If you are using the SQL format, you must include the necessary SQL driver on the classpath. For example:

java -classpath .:mapper-1.1.jar:postgresql-42.1.4.jar uk.gov.nca.graph.mapper.cli.MapDataToGraph -c example/companies/companies.map -d jdbc:postgresql://localhost:5432/example -t companies -g example/graphml.properties -f SQL

Generating Files

An additional tool is provided to create sample graphs using a mapping file, for testing purposes.

To run the tool, use the following command:

java -cp mapper-1.1.jar uk.gov.nca.graph.mapper.cli.GenerateGraphFromMap

This will display a help message detailing the available options.

Mapping Configuration

The mapping configuration file describes how the structured data should be transformed into a graph. It is flexible enough to describe most graph schemas, although support for complex operations such as manipulating data prior to mapping it into the graph is still limited. The configuration file is a YAML file, and standard YAML formatting rules apply.

The configuration file has two required top level objects, vertices and edges. These objects consist of a list of sub-objects, with each sub-object describing either a vertex (node) or an edge (link) respectively.

A third top level object, filters, is optional and can be used to filter the data prior to converting it to a graph.

You can also specify lenient: true to make the parsing of fields more lenient. If true, any data field that cannot be parsed to the requested type is kept as its String value rather than being skipped.

When processing the data, each row in the data file will be processed separately and the mapping configuration will be used to map that row of data to a sub-graph and insert it into the main graph via the Gremlin interface.

Vertex Mappings

Each sub-object under the vertices object represents a single vertex in the output graph. Some properties are 'reserved', and used by the tool. Any other property will be added to the vertex. The reserved properties are:

_except allows you to skip a particular vertex for a data row. The value should be a map of field names to values; if any of the listed fields has the given value, the vertex is skipped.
_id sets an internal ID for the vertex, which is used when creating edges between vertices. The value can be anything, and is not copied across onto the graph.
_merge controls merging. If true, the vertex will be merged with the first existing vertex in the graph for which all the properties on the new vertex exist and have the same value (additional properties on the existing vertex are ignored). If no matching vertex is found, or if _merge is false, a new vertex will be created. Note that all vertices are merged on the identifier property regardless of this setting.
_type sets the type or label of the vertex.
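
The _except and _merge rules can be illustrated with a minimal Python sketch. This is not the tool's own code: the graph is modelled as a plain list of dictionaries, the vertex label is treated as just another property to match on, and the merge on the identifier property is not modelled. The function names and the age property are hypothetical.

```python
def is_excepted(row, except_map):
    """Return True if any field listed in _except has the given value in this row."""
    return any(row.get(field) == value for field, value in except_map.items())


def find_merge_target(graph, new_props):
    """Return the first existing vertex that carries every property of the new vertex."""
    for vertex in graph:
        if all(key in vertex and vertex[key] == value for key, value in new_props.items()):
            return vertex
    return None


def add_vertex(graph, new_props, merge):
    """Reuse a matching vertex when merge is true, otherwise append a new one."""
    if merge:
        existing = find_merge_target(graph, new_props)
        if existing is not None:
            return existing
    graph.append(dict(new_props))
    return graph[-1]


# The extra "age" property on the existing vertex is ignored when matching.
graph = [{"_type": "Person", "name": "Paul Jones", "age": 52}]
merged = add_vertex(graph, {"_type": "Person", "name": "Paul Jones"}, merge=True)
created = add_vertex(graph, {"_type": "Person", "name": "Paul Jones"}, merge=False)
print(len(graph))  # 2: one reused vertex, one newly created
```

With merge=True the existing vertex is returned unchanged; with merge=False a second vertex with the same properties is added.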

The value of each property can be one of the following to read data from the data file.

_BOOLEAN(*field name*)
_DATETIME(*field name*)
_DATE(*field name*)
_DOUBLE(*field name*)
_INTEGER(*field name*)
_IPADDRESS(*field name*)
_STRING(*field name*)
_TIME(*field name*)
_URL(*field name*)
_LITERAL(*literal value*)

The field name is either the JSON field name (or CSV/TSV column header if -h was specified), or the one-based column number for CSV/TSV files. If the field can't be found, or the value can't be parsed to the correct type, then the property will be skipped for that data row. Any value not in the format above will be interpreted as a literal value.
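
As a rough illustration of these lookup rules, the following Python sketch resolves a property value against one data row. It is an assumption-laden model rather than the tool's implementation: only four of the documented types are covered, dates are assumed to be in ISO format, and the names resolve, lookup and CONVERTERS are hypothetical.

```python
import re
from datetime import date

# Only a subset of the documented types is modelled in this sketch.
CONVERTERS = {
    "STRING": str,
    "INTEGER": int,
    "DOUBLE": float,
    "DATE": date.fromisoformat,  # assumes ISO dates such as 1980-02-12
}
REFERENCE = re.compile(r"^_(STRING|INTEGER|DOUBLE|DATE|LITERAL)\((.*)\)$")


def lookup(row, header, field):
    """Find a field by header name, or by one-based column number."""
    if header and field in header:
        return row[header.index(field)]
    if field.isdigit() and 1 <= int(field) <= len(row):
        return row[int(field) - 1]
    return None


def resolve(value, row, header=None, lenient=False):
    """Return the property value, or None when the property should be skipped."""
    match = REFERENCE.match(str(value))
    if match is None:
        return value  # not in the _TYPE(...) form, so treated as a literal
    type_name, argument = match.groups()
    if type_name == "LITERAL":
        return argument
    raw = lookup(row, header, argument)
    if raw is None:
        return None  # field not found: skip the property
    try:
        return CONVERTERS[type_name](raw)
    except ValueError:
        # Lenient mode keeps the raw String value instead of skipping.
        return raw if lenient else None


header = ["name", "gender", "dob"]
row = ["Amy", "female", "1980-02-12"]
print(resolve("_STRING(name)", row, header))        # Amy
print(resolve("_DATE(3)", row))                     # 1980-02-12
print(resolve("_INTEGER(1)", row))                  # None
print(resolve("_INTEGER(1)", row, lenient=True))    # Amy
```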

If a list of values is provided, then the parsed outputs of each of these will be concatenated into a single string. However, the list can be cast to a specific type by including an empty type as the first item in the list. For instance, the following would create a URL rather than a String:

url:
  - _URL()
  - _LITERAL(http://www.example.com/users/)
  - _INTEGER(user_id)
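
The concatenate-then-cast behaviour might be modelled as in the Python sketch below. The resolver is a deliberately simplified stand-in that returns raw field values, the CASTS table is hypothetical, and Python's urlparse is used only to represent "a URL type"; none of this reflects how the tool itself represents URLs.

```python
import re
from urllib.parse import urlparse

REFERENCE = re.compile(r"^_([A-Z]+)\((.*)\)$")

# Hypothetical casts for this sketch; the tool's own type handling is not modelled.
CASTS = {"URL": urlparse, "INTEGER": int, "DOUBLE": float}


def resolve(item, row):
    """Stand-in resolver: literals pass through, references read the raw field value."""
    match = REFERENCE.match(item)
    if match is None:
        return item
    kind, argument = match.groups()
    return argument if kind == "LITERAL" else row[argument]


def combine(items, row):
    """Concatenate the resolved items, casting the result if the list opens with an empty type."""
    cast = str
    head = REFERENCE.match(items[0])
    if head and head.group(2) == "" and head.group(1) in CASTS:
        cast = CASTS[head.group(1)]
        items = items[1:]
    return cast("".join(str(resolve(item, row)) for item in items))


row = {"user_id": "42", "firstName": "Paul", "surname": "Jones"}
url = combine(["_URL()", "_LITERAL(http://www.example.com/users/)", "_INTEGER(user_id)"], row)
name = combine(["_STRING(firstName)", " ", "_STRING(surname)"], row)
print(url.geturl())  # http://www.example.com/users/42
print(name)          # Paul Jones
```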

Edge Mappings

Edges are simpler than vertices, and only support three properties: _type, _src, and _tgt.

_type sets the type or label of the edge.
_src sets the source vertex of the edge, and should be the _id of the vertex.
_tgt sets the target vertex of the edge, and should be the _id of the vertex.

Any other properties are ignored - they are not added to the edge. As with vertices, property values can be pulled from the data using the same format.
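
A short Python sketch of how edge mappings could be resolved for a single data row is given here. It assumes the vertices created for the row are held in a dictionary keyed by their internal _id, and that an edge is dropped when either endpoint was skipped, which is what the annotated example's comment on _except suggests. The build_edges name and the tuple representation of an edge are hypothetical.

```python
def build_edges(edge_mappings, vertices_by_id):
    """Link the vertices created for one data row, using their internal _id values."""
    edges = []
    for mapping in edge_mappings:
        source = vertices_by_id.get(mapping["_src"])
        target = vertices_by_id.get(mapping["_tgt"])
        if source is None or target is None:
            continue  # an endpoint was skipped (e.g. by _except), so drop the edge
        edges.append((source["name"], mapping["_type"], target["name"]))
    return edges


mappings = [{"_type": "parentOf", "_src": 1, "_tgt": 2}]
row_vertices = {1: {"name": "Amy"}, 2: {"name": "Paul Jones"}}
print(build_edges(mappings, row_vertices))           # [('Amy', 'parentOf', 'Paul Jones')]
print(build_edges(mappings, {1: {"name": "Amy"}}))   # []
```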

Filters

Filters can be used to filter the data before processing. The optional filters object is used to configure this filtering.

The filters object has one reserved property, _exists, which takes a list of fields. If these fields don't exist or they are empty, then the data row will be skipped.

Any other properties listed must be present with the specified value for the data row to be processed.
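
The filtering rules could be sketched in Python as follows. Two assumptions are made: _exists may be given as a single field or a list of fields, and a list of values for any other property means "any one of these", as the annotated example in this document suggests. The passes_filters function is hypothetical and works on rows represented as dictionaries.

```python
def as_list(value):
    """Accept either a single value or a list of values."""
    return value if isinstance(value, list) else [value]


def passes_filters(row, filters):
    """Return True if the data row should be processed."""
    for key, expected in filters.items():
        if key == "_exists":
            # Every listed field must be present and non-empty.
            if any(not row.get(field) for field in as_list(expected)):
                return False
        elif row.get(key) not in as_list(expected):
            return False
    return True


filters = {"_exists": "name", "surname": ["Jones", "Smith"]}
print(passes_filters({"name": "Amy", "surname": "Jones"}, filters))     # True
print(passes_filters({"name": "", "surname": "Jones"}, filters))        # False
print(passes_filters({"name": "Mary", "surname": "Edwards"}, filters))  # False
```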

Annotated Example

The following example (taken from the unit tests) demonstrates all of the above features. It has been annotated to explain what each line is doing:

filters:			# Create a filters object to filter our data
  _exists: name			# Only accept data rows that have a non-empty name property
  surname:			# Only accept data rows that have a surname of Jones or Smith
  - Jones
  - Smith

vertices:			# Create a vertices object to configure our vertex mapping
- _id: 1			# Create a vertex with internal ID 1
  _type: Person			# Set the type to Person
  name: _STRING(1)		# Read the name from the first column in the data as a String
  dateOfBirth: _DATE(3)		# Read the date of birth from the third column in the data and parse it to a Date
  gender: _STRING(2)		# Read the gender from the second column in the data as a String
  source: example.txt		# Set the source property to "example.txt"
- _id: 2			# Create a vertex with internal ID 2
  _type: Person			# Set the type to Person
  _merge: true			# Merge this vertex with any Person entity that has the same name
  _except:
    3: Sam			# If the third column has the value "Sam", then it will be skipped and this vertex (and any edges linking it) won't be created
  name:				# Set the name property to be the concatenated value of the fourth and fifth columns
  - _STRING(4)
  - " "
  - _STRING(5)

edges:				# Create an edges object to configure our edge mapping
- _type: parentOf		# Set the type to parentOf
  _src: 1			# Link from the node with internal ID 1...
  _tgt: 2			# ...to the node with internal ID 2

If we were to feed in the following CSV data

name,gender,dob,firstName,surname
Amy,female,1980-02-12,Paul,Jones
Peter,male,1982-11-30,Paul,Jones
Mary,female,1981-04-12,Bob,Edwards
,male,1981-05-01,Bob,Jones

Then we would end up with a graph that looks like the following (note that the third and fourth rows have been filtered out):

(name: Amy, gender: female, dateOfBirth: 1980-02-12, source: example.txt) -- parentOf --> (name: Paul Jones) <-- parentOf -- (name: Peter, gender: male, dateOfBirth: 1982-11-30, source: example.txt)
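
To check this result by hand, the walk-through below replays the example in Python with the mapping hard-coded rather than read from the configuration file. It is only a sketch under several assumptions: the graph is a plain list, dates are left as strings, merging compares the label and name only, and the _except rule is omitted because no row has "Sam" in its third column.

```python
import csv
import io

DATA = """name,gender,dob,firstName,surname
Amy,female,1980-02-12,Paul,Jones
Peter,male,1982-11-30,Paul,Jones
Mary,female,1981-04-12,Bob,Edwards
,male,1981-05-01,Bob,Jones
"""

vertices = []  # stand-in for the graph's vertex store
edges = []

for row in csv.DictReader(io.StringIO(DATA)):
    # filters: name must be non-empty and surname must be Jones or Smith
    if not row["name"] or row["surname"] not in ("Jones", "Smith"):
        continue

    # Vertex with internal ID 1: always created for the row.
    first = {"_type": "Person", "name": row["name"], "gender": row["gender"],
             "dateOfBirth": row["dob"], "source": "example.txt"}
    vertices.append(first)

    # Vertex with internal ID 2 has _merge: true, so reuse a matching vertex if one exists.
    wanted = {"_type": "Person", "name": row["firstName"] + " " + row["surname"]}
    second = next((v for v in vertices
                   if all(v.get(k) == val for k, val in wanted.items())), None)
    if second is None:
        second = wanted
        vertices.append(second)

    # Edge mapping: _src 1 -> _tgt 2 with type parentOf.
    edges.append((first["name"], "parentOf", second["name"]))

print(len(vertices))  # 3
for edge in edges:
    print(edge)
```

The two surviving rows produce three vertices in total, because the second "Paul Jones" vertex is merged into the first.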

National Crime Agency

National Crime Agency. Leading the UK's fight to cut serious and organised crime.

Versions

Version	Release Date
1.1	Nov 7, 2018
1.0	Nov 6, 2018
