Lumio ETL

This library provides utilities for building lumio-flow based ETLs.

License	MIT License
GroupId	com.lumiomedical
ArtifactId	lumio-etl
Last Version	0.5
Release Date	Feb 15, 2021
Type	jar
Description	Lumio ETL: This library provides utilities for building lumio-flow based ETLs
Project URL	https://github.com/lumio-medical/lumio-etl
Project Organization	Lumio Medical
Source Code Management	https://github.com/lumio-medical/lumio-etl

Download lumio-etl

Filename	Size
lumio-etl-0.5.pom
lumio-etl-0.5.jar	156 KB
lumio-etl-0.5-sources.jar	90 KB
lumio-etl-0.5-javadoc.jar	1 MB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.lumiomedical/lumio-etl/ -->
<dependency>
    <groupId>com.lumiomedical</groupId>
    <artifactId>lumio-etl</artifactId>
    <version>0.5</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.lumiomedical/lumio-etl/
implementation 'com.lumiomedical:lumio-etl:0.5'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.lumiomedical/lumio-etl/
implementation ("com.lumiomedical:lumio-etl:0.5")

Apache Buildr

'com.lumiomedical:lumio-etl:jar:0.5'

Apache Ivy

<dependency org="com.lumiomedical" name="lumio-etl" rev="0.5">
  <artifact name="lumio-etl" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.lumiomedical', module='lumio-etl', version='0.5')
)

Scala SBT

libraryDependencies += "com.lumiomedical" % "lumio-etl" % "0.5"

Leiningen

[com.lumiomedical/lumio-etl "0.5"]

Dependencies

compile (9)

Group / Artifact	Type	Version
com.lumiomedical : lumio-flow	jar	[0.12,)
com.lumiomedical : lumio-vault	jar	[0.9,)
tech.tablesaw : tablesaw-core	jar	0.38.1
com.amazonaws : aws-java-sdk-s3	jar	1.11.721
org.apache.commons : commons-compress	jar	1.20
org.jsoup : jsoup	jar	1.13.1
com.noleme : noleme-json	jar	[0.9,)
com.noleme : noleme-commons	jar	[0.17,)
org.slf4j : slf4j-api	jar	1.7.30

test (3)

Group / Artifact	Type	Version
com.noleme : noleme-mongodb-test	jar	[0.2,)
org.slf4j : slf4j-simple	jar	1.7.30
org.junit.jupiter : junit-jupiter	jar	5.6.0

Project Modules

There are no modules declared in this project.

Lumio ETL

This library provides utilities for building lumio-flow based ETLs.

It is important to note that this library was very much a work in progress, in both spirit and body. The general concept is here: it is an aggregate of general-purpose implementations of lumio-flow actors, with the goal of enabling the implementation of simple yet flexible ETL programs. However, much was still being worked on or left for later, so there are several shortcomings to keep in mind when reading what is here:

The "best" level of abstraction at which to design actors is not decided: sometimes it is good to have small atomic operations (e.g. reading a file from a filesystem or from AWS), and sometimes it is good to have high-level operations encompassing a variety of actions. So far, ETL programs implemented with this project have been a mix of what I would consider successes and design failures. My belief was that this library needed a higher-level API for building lumio-flow DAGs, in combination with well-designed actors; sometimes the issue is not the scope of an actor itself so much as how fragmented its manipulation can be. Maybe some kind of "actor recipes" could also be part of the solution.
The implementations of actors can be minimalist at times, reflecting the level of sophistication that was needed in the project for which they were developed. This is generally not an issue, since new features can be added.
As it stands, there are also a handful of basic implementations for things that aren't exactly ETL-related (e.g. most things in transformer.text or transformer.jsoup); these could end up being scrapped at a later date.

Implementations found in this package shouldn't be tied to any specific Lumio project.

Note: This library is considered to be in beta, and as such significant API changes may occur without prior warning.

I. Installation

Add the following in your pom.xml:

<dependency>
    <groupId>com.lumiomedical</groupId>
    <artifactId>lumio-etl</artifactId>
    <version>0.5</version>
</dependency>

II. Notes on Structure and Design

There are four facilities in this library:

The ETL class is a higher-level abstraction for lumio-flow DAGs, meant to be a place for declaring the DAG structure, hosting configuration parameters, and specifying which FlowCompiler to use. All ETL pipelines used throughout lumio-core use this class as a base.
Anything in the dataframe package pertains to the manipulation of tech.tablesaw dataframes. It notably provides helper functions and the TableProcessor feature, which was meant to be used in conjunction with a higher abstraction for lumio-flow (essentially, a means of specifying a particular dataframe refinement stage). It is possible that the final implementation of that idea would have dropped the TableProcessor contract in favour of lumio-flow actor contracts (that would certainly have been preferable).
The extractor, generator, loader and transformer packages provide lumio-flow actor implementations.
The vault package provides a handful of VaultModule implementations for handling custom configuration. As it stands, these were also very much a work in progress: the general idea was to start by satisfying the need for conf-based ETL pipeline definition (at least for the entry points, for instance to allow swapping between loading files from an AWS instance and from a filesystem) and to tackle "prettiness" later. It is currently fairly verbose for pipelines with many data sources; I believe just building a more opinionated abstraction on top of it could go a long way.
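To make the conf-based entry-point idea more concrete, here is a minimal sketch in plain Java. It does not use lumio-vault or any VaultModule: the class name ConfiguredSourceSketch, the method sourceFrom and the keys source.type and source.location are all hypothetical, and the "inline" source is only a stand-in for a remote source such as AWS.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names belong to lumio-vault or lumio-etl.
class ConfiguredSourceSketch {

    // Picks an entry-point implementation from the (assumed) keys
    // "source.type" and "source.location".
    static Supplier<InputStream> sourceFrom(Properties conf) {
        String type = conf.getProperty("source.type", "filesystem");
        String location = conf.getProperty("source.location", "");
        switch (type) {
            case "filesystem":
                // The file is only opened when the supplier is invoked.
                return () -> {
                    try {
                        return Files.newInputStream(Path.of(location));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                };
            case "inline":
                // Stand-in for a remote source: the "location" is the content itself.
                return () -> new ByteArrayInputStream(location.getBytes(StandardCharsets.UTF_8));
            default:
                throw new IllegalArgumentException("Unknown source.type: " + type);
        }
    }

    public static void main(String[] args) throws IOException {
        Properties conf = new Properties();
        conf.setProperty("source.type", "inline");
        conf.setProperty("source.location", "key,value\n0,234\n");
        try (InputStream in = sourceFrom(conf).get()) {
            System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

The rest of a pipeline would only see a Supplier<InputStream>, so switching the entry point becomes a configuration change rather than a code change.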

TODO

III. Usage

Note that two sample "toy" programs are also provided: sample-nlp and sample-crawl. Neither of them leverages lumio-vault configuration features, but their structure could be simplified and made more resilient to change with a bit of lumio-vault sprinkled in.

We'll also write down a basic example of an ETL pipeline leveraging some features found in this library. We won't touch on the ETL classes; these are covered in the sample projects.

Most of the syntax actually comes from lumio-flow, so it could be a good idea to start by having a look at that project.

Let us start by imagining we have a tiny CSV dataset like this:

key,value,metadata
0,234,interesting
1,139,not_interesting
3,982,interesting

Here is what a pipeline for manipulating this could look like:

var flow = Flow
    .from(new FileStreamer(), "path/to/my.csv") //We open an input stream from the CSV file
    .pipe(new TablesawCSVParser()) //We interpret it as CSV and transform it into a tablesaw dataframe
    .pipe(Tablesaw::print) // We print the dataframe to stdout
;

Flow.runAsPipeline(flow);

Running the above should display the following, provided a logger is configured to print INFO-level information:

[main] INFO com.lumiomedical.etl - Initializing stream from filesystem at path/to/my.csv
[main] INFO com.lumiomedical.etl - Extracting CSV data into dataframe...
[main] INFO com.lumiomedical.etl - Extracted 3 lines into dataframe.
                                               
 index  |  key  |  value  |     metadata      |
-----------------------------------------------
     0  |    0  |    234  |      interesting  |
     1  |    1  |    139  |  not_interesting  |
     2  |    3  |    982  |      interesting  |
(row_count=3)

Note that it added an index column; we can remove it by specifying a TableProperties object with setAddRowIndex(false). Let's also add a filter and a persistence operation:

var tableProperties = new TableProperties().setAddRowIndex(false);

var flow = Flow
    .from(new FileStreamer(), "path/to/my.csv")
    .pipe(new TablesawCSVParser(tableProperties))
    .pipe(Criterion.whereIsEqualTo("metadata", "interesting")) //We use a helper query feature; there are many other ways to do this, notably using the tablesaw API
    .sink(new TablesawCSVWrite("path/to/my-filtered.csv")) //We dump the dataframe as CSV into another file
;

Flow.runAsPipeline(flow);

Upon running, the above should produce a CSV file like this one:

key,value,metadata
0,234,interesting
3,982,interesting

We'll wrap up this very simple example by replacing the source with one that loads the file from AWS:

var tableProperties = new TableProperties().setAddRowIndex(false);

var flow = Flow
    .from(new AmazonS3Streamer(s3, "my-bucket", "my.csv")) // Given a properly configured AmazonS3 instance
    .pipe(new TablesawCSVParser(tableProperties))
    .pipe(Criterion.whereIsEqualTo("metadata", "interesting"))
    .sink(new TablesawCSVWrite("path/to/my-filtered.csv")) // We still write the output to the filesystem
;

Flow.runAsPipeline(flow);

As the reader can guess, the general idea is to define the execution plan (general structure and type transitions) separately from the choice of implementation used for performing the transformations. For instance, here, we would likely make the Extractor and Loader swappable, while retaining the interpretation as a CSV and subsequent filtering. Some situations may call for entire little pipelines with remote extracting, unzipping, streaming, etc. The goal was to make it possible to focus on the core logic and retain control over how the pipeline interacts with the outside world.
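The separation described above can be illustrated without lumio-flow at all. The following is a minimal sketch in plain Java, in which the CSV interpretation and the filtering are fixed while the extractor and loader are passed in. The class SwappablePipelineSketch and its run method are hypothetical names chosen for illustration, and the naive comma splitting assumes cells contain no quoted commas.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical sketch: none of these names come from lumio-etl or lumio-flow.
class SwappablePipelineSketch {

    // The fixed "plan": read CSV lines, keep the header plus the rows whose
    // "metadata" cell equals the wanted value, then hand the result to the loader.
    static void run(Supplier<Reader> extractor, String wanted, Consumer<List<String>> loader) {
        List<String> kept = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(extractor.get())) {
            String header = reader.readLine();
            if (header != null) {
                kept.add(header);
                int metadataIndex = List.of(header.split(",")).indexOf("metadata");
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] cells = line.split(",");
                    if (metadataIndex >= 0 && metadataIndex < cells.length
                            && cells[metadataIndex].equals(wanted)) {
                        kept.add(line);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        loader.accept(kept);
    }

    public static void main(String[] args) {
        String csv = "key,value,metadata\n0,234,interesting\n1,139,not_interesting\n3,982,interesting\n";
        // In-memory extractor and stdout loader; a file-backed or remote pair
        // could be passed instead without touching run().
        run(() -> new StringReader(csv), "interesting", rows -> rows.forEach(System.out::println));
    }
}
```

With the sample dataset used earlier, main prints the header followed by the two "interesting" rows. Only the two lambdas would change when moving from an in-memory source to a filesystem or remote one.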

TODO

IV. Dev Installation

This project will require you to have the following:

Java 11+
Git (versioning)
Maven (dependency resolving, publishing and packaging)

License

Lumio Medical

Versions

Version
0.5 Feb 15, 2021
0.4 Dec 26, 2020
0.3 Dec 26, 2020
0.2 Dec 26, 2020
0.1 Dec 24, 2020
