arc

License	License MIT
GroupId	GroupId ai.tripl
ArtifactId	ArtifactId arc_2.11
Last Version	Last Version 2.14.0
Release Date	Release Date Jun 15, 2020
Type	Type jar
Description	Description arc arc
Project URL	Project URL https://arc.tripl.ai
Project Organization	Project Organization ai.tripl
Source Code Management	Source Code Management https://github.com/tripl-ai/arc

Download arc_2.11

Filename	Size
arc_2.11-2.14.0.pom
arc_2.11-2.14.0.jar	2 MB
arc_2.11-2.14.0-sources.jar	171 KB
arc_2.11-2.14.0-javadoc.jar	1 MB
Browse

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/ai.tripl/arc_2.11/ -->
<dependency>
    <groupId>ai.tripl</groupId>
    <artifactId>arc_2.11</artifactId>
    <version>2.14.0</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/ai.tripl/arc_2.11/
implementation 'ai.tripl:arc_2.11:2.14.0'

Gradle Kotlin

// https://jarcasting.com/artifacts/ai.tripl/arc_2.11/
implementation ("ai.tripl:arc_2.11:2.14.0")

Apache Buildr

'ai.tripl:arc_2.11:jar:2.14.0'

Apache Ivy

<dependency org="ai.tripl" name="arc_2.11" rev="2.14.0">
  <artifact name="arc_2.11" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='ai.tripl', module='arc_2.11', version='2.14.0')
)

Scala SBT

libraryDependencies += "ai.tripl" % "arc_2.11" % "2.14.0"

Leiningen

[ai.tripl/arc_2.11 "2.14.0"]

Dependencies

compile (5)

Group / Artifact	Type	Version
org.scala-lang : scala-library	jar	2.11.12
com.typesafe : config	jar	1.3.1
org.apache.hadoop : hadoop-common	jar	2.9.2
org.apache.hadoop : hadoop-aws	jar	2.9.2
com.databricks : spark-xml_2.11	jar	0.9.0

provided (5)

Group / Artifact	Type	Version
org.apache.spark : spark-core_2.11	jar	2.4.6
org.apache.spark : spark-sql_2.11	jar	2.4.6
org.apache.spark : spark-hive_2.11	jar	2.4.6
org.apache.spark : spark-mllib_2.11	jar	2.4.6
org.apache.spark : spark-avro_2.11	jar	2.4.6

test (3)

Group / Artifact	Type	Version
org.scalatest : scalatest_2.11	jar	3.0.7
org.mortbay.jetty : jetty	jar	6.1.26
org.postgresql : postgresql	jar	42.2.8

Project Modules

There are no modules declared in this project.

Arc is an opinionated framework for defining data pipelines which are predictable, repeatable and manageable.

Documentation

Full documentation is available here: https://arc.tripl.ai

Index

What is Arc?

Arc is an opinionated framework for defining predictable, repeatable and manageable data transformation pipelines;

predictable in that data is used to define transformations - not code.
repeatable in that if a job is executed multiple times it will produce the same result.
manageable in that execution considerations and logging have been baked in from the start.

Getting Started

Arc has an interactive Jupyter Notebook extension to help with rapid development of jobs. Start by cloning https://github.com/tripl-ai/arc-starter and running through the tutorial.

This extension is available at https://github.com/tripl-ai/arc-jupyter.

Principles

Many of these principles have come from 12factor with the aim of deploying predictable, repeatable and manageable data transformation pipelines:

single responsibility components/stages with first-class support for extending.
stateless jobs where possible and use of immutable datasets.
precise logging to allow management of jobs at scale.
library dependencies limited or avoided where possible.

Not just for data engineers

The intent of the pipeline is to provide a simple way of creating Extract-Transform-Load (ETL) pipelines which are able to be maintained in production, and captures the answers to simple operational questions transparently to the user.

monitoring: is it working each time it's run? and how much resource was consumed in creating it?
devops: is packaged as a Docker image to allow rapid deployment on ephemeral compute.

These concerns are supported at run time to ensure that as deployment grows in uses and complexity it does not become opaque and unmanageable.

Why abstract from code?

From experience a very high proportion of data pipelines perform very similar extract, transform and load actions on datasets. Unfortunately, whilst the desired outcomes are largely similar, the implementations are vastly varied resulting in higher maintenance costs, lower test-coverage and high levels of rework.

The intention of this project is to define and implement an opinionated standard approach for declaring data pipelines which is open and extensible. Abstraction from underlying code allows rapid deployment, a consistent way of defining transformation tasks (such as data typing) and allows abstraction of the pipeline definition from the pipeline execution (to support changing of the underlying execution engines) - see declarative programming.

Currently it is tightly coupled to Apache Spark due to its fault-tolerance, performance and solid API for standard data engineering tasks but the definitions are human and machine readable HOCON (a JSON derivative) allowing the transformation definitions to be implemented against future execution engines.

Why SQL first?

SQL first (copied from the Mobile First UX principle) is an approach where, if possible, transformations are done using Structured Query Language (SQL) as a preference. This is because SQL is a very good way of expressing standard data transformation intent in a declarative way. SQL is so widely known and taught that finding people who are able to understand the business context and able to write basic SQL is much easier than finding a Scala developer who also understands the business context (for example).

Currently the HIVE dialect of SQL is supported as Spark SQL uses the same SQL dialect and has a lot of the same functions that would be expected from other SQL dialects. This could change in the future.

Example pipeline

This is an example of a fairly standard pipeline:

First load a set of CSV files from an input directory. Separator is a comma and the file has a header.
Convert the data to the correct datatypes using metadata defined in a separate JSON.
Execute a SQL statement that will perform custom validation to ensure the data conversion in the previous step resulted in an acceptable data conversion error rate. If not successful the job will terminate.
Write out the aggreate resultset to a Parquet target.

{
  "stages": [
    {
      "type": "DelimitedExtract",
      "name": "extract data from green_tripdata/0",
      "environments": [
        "production",
        "test"
      ],
      "inputURI": ${ETL_CONF_BASE_URL}"/data/green_tripdata/0/*",
      "outputView": "green_tripdata0_raw",
      "delimiter": "Comma",
      "quote": "DoubleQuote",
      "header": true
    },
    {
      "type": "TypingTransform",
      "name": "apply green_tripdata/0 data types",
      "environments": [
        "production",
        "test"
      ],
      "inputURI": ${ETL_CONF_BASE_URL}"/meta/green_tripdata/0/green_tripdata.json",
      "inputView": "green_tripdata0_raw",
      "outputView": "green_tripdata0",
      "persist": true
    },
    {
      "type": "SQLValidate",
      "name": "ensure no errors exist after data typing",
      "environments": [
        "production",
        "test"
      ],
      "inputURI": ${ETL_CONF_BASE_URL}"/job/0/sqlvalidate_errors.sql",
      "sqlParams": {
        "table_name": "green_tripdata0"
      }
    },
    {
      "type": "ParquetLoad",
      "name": "write trips back to filesystem",
      "environments": [
        "production",
        "test"
      ],
      "inputView": "green_tripdata0",
      "outputURI": ${ETL_CONF_BASE_URL}"/data/output/green_tripdata0.parquet",
      "saveMode": "Overwrite"
    }
  ]
}

A full worked example job is available here. It is for the Australian Geocoded National Address File G-NAF and will load the pipe-separated data, apply data types, ensure there are no errors in applying data types and write the output to Parquet. This could easily be modified to write to a JDBC target.

Building

Library

To compile the main library run:

sbt +package

To build a library to use with a Databricks Runtime environment it is easiest to assembly Arc with all the dependencies into a single JAR to simplify the deployment.

Example:

sbt assembly

If you are having problems compiling it is likely due to environment setup. This command is executed in CICD and uses a predictable build environment pulled from Dockerhub:

docker run --rm -v $(pwd):/app -w /app mozilla/sbt:8u212_1.2.8 sbt +package

Tests

To run unit tests:

sbt +test

To run integration tests (which have external service depenencies):

./it.sh

License Report

To rebuild the license report run:

sbt dumpLicenseReport

Documentation

To generate the documentation you need to download Hugo to /docs-src and run ./hugo in that directory. The /docs directory is the output of the docuementation generation and should not be edited by hand. The /docs directory is automatically published by the Github Pages process on commit.

You can also generate documentation for previous version by checking out a previous tag (git checkout tags/1.3.0) and running hugo server from the /docs-src directory.

Contributing

If you have suggestions of additional components or find issues that you believe need fixing then please raise an issue. An issue with a test case is even more appreciated.

When you contribute code, you affirm that the contribution is your original work and that you license the work to the project under the project’s open source license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project’s open source license and warrant that you have the legal authority to do so.

Support

For questions around use (which are not clear from the documentation) post a new 'issue' in the questions repository. This repository acts as a forum where questions can be posted, discussed and searched.

For commercial support requests please contact us via email.

Authors/Contributors

Attribution

Thanks to the following projects:

AGL Energy for the original development.
Apache Spark for the underlying framework that has made this library possible.
slf4j-json-logger Copyright (c) 2016 Savoir Technologies released under the Apache 2.0 License. We have slightly altered their library to change the default logging format.
nyc-taxi-data for preparing an easy to use set of real-world data for the tutorial.

License

Arc is released under the MIT License.

tripl.ai

Versions

Version
2.14.0 Jun 15, 2020
2.13.0 Jun 11, 2020
2.12.5 May 30, 2020
2.12.4 May 21, 2020
2.12.3 May 20, 2020
2.12.2 May 9, 2020
2.12.1 May 7, 2020
2.12.0 May 6, 2020
2.11.0 May 3, 2020
2.10.2 Apr 26, 2020
2.10.1 Apr 25, 2020
2.10.0 Apr 19, 2020
2.9.0 Apr 8, 2020
2.8.1 Mar 25, 2020
2.8.0 Mar 19, 2020
2.7.0 Feb 1, 2020
2.6.0 Jan 9, 2020
2.5.0 Dec 14, 2019
2.4.0 Nov 23, 2019
2.3.1 Oct 29, 2019
2.3.0 Oct 28, 2019
2.2.0 Oct 4, 2019
2.1.0 Sep 23, 2019
2.0.1 Sep 2, 2019
2.0.0 Jul 16, 2019
1.15.0 Jun 20, 2019
1.14.0 Jun 18, 2019
1.13.3 Jun 6, 2019

arc

License

GroupId

ArtifactId

Last Version

Release Date

Type

Description

Project URL

Project Organization

Source Code Management

Download arc_2.11

How to add to project

Dependencies

compile (5)

provided (5)

test (3)

Project Modules

Documentation

Index

What is Arc?

Getting Started

Principles

Not just for data engineers

Why abstract from code?

Why SQL first?

Example pipeline

Building

Library

Tests

License Report

Documentation

Contributing

Support

Authors/Contributors

Attribution

License

tripl.ai

Versions