Smart Data Lake

Build your data lake the smart way.

License:
Categories: Data
GroupId: io.smartdatalake
ArtifactId: smartdatalake_2.11
Last Version: 1.1.1
Release Date:
Type: jar
Description: Smart Data Lake. Build your data lake the smart way.
Project URL: http://www.smartdatalake.io
Project Organization: ELCA Informatique SA
Source Code Management: http://github.com/smart-data-lake/smart-data-lake/tree/master

Download smartdatalake_2.11

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/io.smartdatalake/smartdatalake_2.11/ -->
<dependency>
    <groupId>io.smartdatalake</groupId>
    <artifactId>smartdatalake_2.11</artifactId>
    <version>1.1.1</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/io.smartdatalake/smartdatalake_2.11/
implementation 'io.smartdatalake:smartdatalake_2.11:1.1.1'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/io.smartdatalake/smartdatalake_2.11/
implementation("io.smartdatalake:smartdatalake_2.11:1.1.1")

Buildr:

'io.smartdatalake:smartdatalake_2.11:jar:1.1.1'

Ivy:

<dependency org="io.smartdatalake" name="smartdatalake_2.11" rev="1.1.1">
  <artifact name="smartdatalake_2.11" type="jar" />
</dependency>

Groovy Grape:

@Grapes(
  @Grab(group='io.smartdatalake', module='smartdatalake_2.11', version='1.1.1')
)

sbt:

libraryDependencies += "io.smartdatalake" % "smartdatalake_2.11" % "1.1.1"

Leiningen:

[io.smartdatalake/smartdatalake_2.11 "1.1.1"]

Dependencies

compile (36)

Group / Artifact Type Version
org.slf4j : slf4j-api jar 1.7.30
com.databricks : dbutils-api_2.11 jar 0.0.4
log4j : log4j jar 1.2.17
com.github.scopt : scopt_2.11 jar 3.7.1
org.apache.hadoop : hadoop-common jar 2.6.5
org.apache.zookeeper : zookeeper jar 3.4.13
org.apache.spark : spark-core_2.11 jar 2.4.4
org.apache.spark : spark-sql_2.11 jar 2.4.4
org.apache.spark : spark-catalyst_2.11 jar 2.4.4
io.smartdatalake : spark-extensions_2.11 jar 1.0.0
org.apache.spark : spark-sql-kafka-0-10_2.11 jar 2.4.4
org.apache.kafka : kafka-clients jar 2.4.1
io.confluent » kafka-schema-registry-client jar 5.4.1
org.apache.spark : spark-avro_2.11 jar 2.4.4
commons-io : commons-io jar 2.4
com.healthmarketscience.jackcess : jackcess jar 2.1.11
org.apache.avro : avro jar 1.8.2
com.hierynomus : sshj jar 0.21.1
org.scala-lang : scala-library jar 2.11.12
org.scala-lang : scala-reflect jar 2.11.12
org.scala-lang : scala-compiler jar 2.11.12
org.scala-lang.modules : scala-xml_2.11 jar 1.0.5
com.typesafe : config jar 1.3.4
com.jsuereth : scala-arm_2.11 jar 2.0
com.splunk » splunk jar 1.6.5.0
org.scalaj : scalaj-http_2.11 jar 2.3.0
javax.jms : jms jar 1.1
org.keycloak : keycloak-core jar 4.5.0.Final
org.keycloak : keycloak-admin-client jar 4.5.0.Final
com.github.kxbmap : configs_2.11 jar 0.4.4
io.monix : monix-eval_2.11 jar 3.1.0
io.monix : monix-execution_2.11 jar 3.1.0
com.github.mutcianm : ascii-graphs_2.11 jar 0.0.6
org.apache.commons : commons-pool2 jar 2.8.0
joda-time : joda-time jar 2.9.3
io.delta : delta-core_2.11 jar 0.5.0

runtime (8)

Group / Artifact Type Version
org.apache.spark : spark-hive_2.11 jar 2.4.4
com.databricks : spark-xml_2.11 jar 0.9.0
org.apache.commons : commons-lang3 jar 3.5
com.crealytics : spark-excel_2.11 jar 0.12.0
org.apache.poi : poi jar 4.0.0
org.apache.poi : poi-ooxml jar 4.0.0
net.sf.ucanaccess : ucanaccess jar 4.0.4
org.jboss.resteasy : resteasy-client jar 3.1.3.Final

test (10)

Group / Artifact Type Version
org.apache.spark : spark-mllib_2.11 jar 2.4.4
org.scalatest : scalatest_2.11 jar 3.0.1
org.scalactic : scalactic_2.11 jar 3.0.1
org.scalacheck : scalacheck_2.11 jar 1.14.3
org.apache.sshd : sshd-sftp jar 2.3.0
org.apache.sshd : sshd-common jar 2.3.0
org.apache.sshd : sshd-core jar 2.3.0
com.github.tomakehurst : wiremock-standalone jar 2.25.1
net.i2p.crypto : eddsa jar 0.3.0
io.github.embeddedkafka : embedded-kafka_2.11 jar 2.4.1

Project Modules

There are no modules declared in this project.

Smart Data Lake

Smart Data Lake Builder is a data lake automation framework that makes loading and transforming data a breeze. It is implemented in Scala, builds on open-source big data technologies like Apache Hadoop and Apache Spark, and includes connectors for diverse data sources (HadoopFS, Hive, DeltaLake, JDBC, Splunk, Webservice, SFTP, JMS, Excel, Access) and file formats.

A Data Lake

  • is a central raw data store for analytics
  • facilitates cheap raw storage to handle growing volumes of data
  • enables top-notch artificial intelligence (AI) and machine learning (ML) technologies for data-driven enterprises

The Smart Data Lake adds

  • a layered data architecture that provides not only raw data, but also prepared, secured, high-quality data organized by business entities and ready to use for analytical use cases, also called «Smart Data». This is comparable to the Databricks Lakehouse architecture; in fact, Smart Data Lake Builder is a very good choice for automating a Lakehouse, including on Databricks.
  • a declarative, configuration-driven approach to creating data pipelines. Metadata about data pipelines allows for efficient operations and maintenance, and enables more business self-service.

Benefits of Smart Data Lake Builder

  • Cheaper implementation of data lakes
  • Increased productivity of data scientists
  • Higher level of self-service
  • Decreased operations and maintenance costs
  • Fully open source, no vendor lock-in

When should you consider using Smart Data Lake Builder?

Some common use cases include:

  • Building Data Lakes, drastically increasing productivity and usability
  • Data Apps - building complex data processing apps
  • DWH automation - reading and writing to relational databases via SQL
  • Data migration - Efficiently create one-time data pipelines
  • Data Catalog / Data Lineage - Generated automatically from metadata

See Features for a comprehensive list of Smart Data Lake Builder features.

How it works

The following diagram shows the core concepts:

(Diagram: How it works — core concepts: data objects, actions and feeds)

Data object

A data object defines the location and format of data. Some data objects require a connection to access remote data (e.g. a database connection).
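
For illustration, here is a minimal HOCON sketch of two data object definitions. The ids, path and table are hypothetical; the type names (CsvFileDataObject, JdbcTableDataObject) follow the Smart Data Lake Builder documentation, though the exact set of attributes may vary by version:

dataObjects {
  # A CSV file data object on the local filesystem or HadoopFS (hypothetical id and path)
  stg-airports {
    type = CsvFileDataObject
    path = "data/stg-airports"
  }
  # A table data object accessed through a JDBC connection (connection sketched under Configuration below)
  btl-airports {
    type = JdbcTableDataObject
    connectionId = dwh-connection
    table = { db = "btl", name = "airports" }
  }
}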

Action

The "data processors" are called actions. An action requires at least one input and output data object. An action reads the data from the input data object, processes and writes it to the output data object. Many actions are predefined e.g. transform data from json to csv but you can also define your custom transformer action.

Feed

Actions connect different data objects and implicitly define a directed acyclic graph (DAG), as they model the dependencies needed to fill a data object. This automatically generated, arbitrarily complex data flow can be divided into feeds (subgraphs) for execution and monitoring.
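
Feeds are typically assigned per action through its metadata. The sketch below extends the action above; the feed name is hypothetical, and the metadata.feed attribute follows the SDL documentation. A run can then restrict execution to one feed subgraph via the command line's feed selection parameter:

actions {
  load-airports {
    # ... type, inputId, outputId as above ...
    # Assign this action to the "airports" feed; a run selecting that feed
    # executes only the subgraph of actions belonging to it.
    metadata { feed = airports }
  }
}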

Configuration

All metadata, i.e. connections, data objects and actions, is defined in a central configuration file, usually called application.conf. The file format is HOCON, which makes the configuration easy to edit.
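
Putting the pieces together, a complete application.conf might look like the following sketch. All ids, paths and URLs are hypothetical; the type names follow the SDL documentation, but available attributes may differ between versions:

# application.conf - all pipeline metadata in one HOCON file

connections {
  dwh-connection {
    type = JdbcTableConnection
    url = "jdbc:postgresql://localhost:5432/dwh"  # hypothetical URL
    driver = org.postgresql.Driver
  }
}

dataObjects {
  stg-airports {
    type = CsvFileDataObject
    path = "data/stg-airports"
  }
  btl-airports {
    type = JdbcTableDataObject
    connectionId = dwh-connection
    table = { db = "btl", name = "airports" }
  }
}

actions {
  load-airports {
    type = CopyAction
    inputId = stg-airports
    outputId = btl-airports
    metadata { feed = airports }
  }
}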

Getting Started

To see how all this works in action, head over to the Getting Started page.

Major Contributors

SBB
www.sbb.ch : Provided the previously developed software as a foundation for the open source project

ELCA
www.elca.ch : Carried out the comprehensive revision and published it as an open source project

Additional Documentation

Getting Started
Reference
Architecture
Testing
Glossary
Troubleshooting
FAQ
Contributing
Running in the Public Cloud

Versions

Version
1.1.1
1.1.0
1.0.7
1.0.6
1.0.5
1.0.4
1.0.3
1.0.2
1.0.1