za.co.absa:enceladus-menas

Enceladus is a Dynamic Conformance Engine which allows data from different formats to be standardized to parquet and conformed to group-accepted common reference.

License

GroupId

za.co.absa

ArtifactId

enceladus-menas

Last Version

1.7.1

Release Date

Type

war

Description

Enceladus is a Dynamic Conformance Engine which allows data from different formats to be standardized to parquet and conformed to group-accepted common reference.

Project Organization

ABSA Group Limited

Download enceladus-menas

How to add to project

Maven

<!-- https://jarcasting.com/artifacts/za.co.absa/enceladus-menas/ -->
<dependency>
    <groupId>za.co.absa</groupId>
    <artifactId>enceladus-menas</artifactId>
    <version>1.7.1</version>
    <type>war</type>
</dependency>

Gradle

// https://jarcasting.com/artifacts/za.co.absa/enceladus-menas/
implementation 'za.co.absa:enceladus-menas:1.7.1'

Gradle (Kotlin DSL)

// https://jarcasting.com/artifacts/za.co.absa/enceladus-menas/
implementation("za.co.absa:enceladus-menas:1.7.1")

Buildr

'za.co.absa:enceladus-menas:war:1.7.1'

Ivy

<dependency org="za.co.absa" name="enceladus-menas" rev="1.7.1">
  <artifact name="enceladus-menas" type="war" />
</dependency>

Grape

@Grapes(
@Grab(group='za.co.absa', module='enceladus-menas', version='1.7.1')
)

SBT

libraryDependencies += "za.co.absa" % "enceladus-menas" % "1.7.1"

Leiningen

[za.co.absa/enceladus-menas "1.7.1"]

Dependencies

compile (36)

Group / Artifact Type Version
org.scala-lang : scala-library jar 2.11.12
org.scala-lang.modules : scala-xml_2.11 jar 1.0.4
org.apache.httpcomponents : httpclient jar 4.4.1
org.apache.spark : spark-core_2.11 jar 2.4.4
org.apache.spark : spark-sql_2.11 jar 2.4.4
org.apache.hadoop : hadoop-client jar 2.7.7
com.fasterxml.jackson.module : jackson-module-scala_2.11 jar 2.9.8
com.fasterxml.jackson.datatype : jackson-datatype-jsr310 jar 2.9.8
com.fasterxml.jackson.core : jackson-databind jar 2.9.8
com.google.code.gson : gson jar 2.8.2
org.mongodb.scala : mongo-scala-driver_2.11 jar 2.7.0
io.github.cbartosiak : bson-codecs-jsr310 jar 3.5.4
org.springframework.boot : spring-boot-starter-web jar 2.0.0.RELEASE
org.springframework.boot : spring-boot-starter-actuator jar 2.0.0.RELEASE
org.springframework.boot : spring-boot-starter-security jar 2.0.0.RELEASE
org.springframework.security : spring-security-ldap jar 5.0.3.RELEASE
org.springframework.security.kerberos : spring-security-kerberos-web jar 1.0.1.RELEASE
org.springframework.security.kerberos : spring-security-kerberos-client jar 1.0.1.RELEASE
io.jsonwebtoken : jjwt-api jar 0.10.7
io.jsonwebtoken : jjwt-jackson jar 0.10.7
org.apache.htrace : htrace-core jar 3.1.0-incubating
za.co.absa : enceladus-dataModel jar 1.7.1
za.co.absa : enceladus-migrations jar 1.7.1
za.co.absa : enceladus-utils jar 1.7.1
org.scala-lang.modules : scala-java8-compat_2.11 jar 0.9.0
africa.absa » openui5 jar 1.67.1
africa.absa » cronstrue-webjar jar 1.79.0
org.webjars.bower : lodash jar 4.17.10
org.webjars : momentjs jar 2.22.2
org.webjars : webjars-locator-core jar 0.35
org.apache.oozie : oozie-client jar 4.3.0
org.webjars.bower : chart.js jar 2.7.3
za.co.absa.cobrix : spark-cobol jar 0.5.3
org.apache.parquet : parquet-hadoop jar 1.10.0
za.co.absa : spark-hofs jar 0.3.1
za.co.absa : atum jar 0.2.5

runtime (1)

Group / Artifact Type Version
io.jsonwebtoken : jjwt-impl jar 0.10.7

test (5)

Group / Artifact Type Version
org.springframework.boot : spring-boot-starter-test jar 2.0.0.RELEASE
org.specs2 : specs2-core_2.11 jar 2.4.16
org.scalatest : scalatest_2.11 jar 3.0.5
junit : junit jar 4.11
org.mockito : mockito-core jar 2.10.0

Project Modules

There are no modules declared in this project.

Enceladus


What is Enceladus?

Enceladus is a Dynamic Conformance Engine which allows data from different formats to be standardized to parquet and conformed to group-accepted common reference (e.g. data for country designation which are DE in one source system and Deutschland in another, can be conformed to Germany).

The project is comprised of three main components:

Menas

This is the user-facing web client, used to specify the standardization schema, and define the steps required to conform a dataset.
There are three models used to do this:

  • Dataset: Specifies where the dataset will be read from on HDFS (RAW), the conformance rules that will be applied to it, and where it will land on HDFS once it is conformed (PUBLISH)
  • Schema: Specifies the schema towards which the dataset will be standardized
  • Mapping Table: Specifies where tables with master reference data can be found (parquet on HDFS), which are used when applying Mapping conformance rules (e.g. the dataset uses Germany, which maps to the master reference DE in the mapping table)
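
To make the Mapping Table model concrete, a mapping table is just a parquet dataset on HDFS that pairs the values used in source data with the group-accepted reference values. The sketch below is purely illustrative; the base path, the reportDate partitioning and the column values are assumptions, not something prescribed by Menas:

# Hypothetical listing of a "country" mapping table stored as parquet on HDFS
hdfs dfs -ls /bigdata/mapping_tables/country/reportDate=2019-11-27
# part-00000-....snappy.parquet   <- rows pairing source values with reference values,
#                                    e.g. ("Germany" -> "DE"), ("Deutschland" -> "DE")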

Standardization

This is a Spark job which reads an input dataset in any of the supported formats and produces a parquet dataset with the Menas-specified schema as output.

Conformance

This is a Spark job which applies the Menas-specified conformance rules to the standardized dataset.

Standardization and Conformance

This is a Spark job which executes both Standardization and Conformance together in the same job.

How to build

Build requirements:

  • Maven 3.5.4+
  • Java 8

Each module provides configuration file templates with reasonable default values. Make a copy of the *.properties.template and *.conf.template files in each module's src/resources directory, removing the .template extension (see the sketch below). Ensure the properties there fit your environment.
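
For example, a minimal sketch of preparing the configuration of one module from the repository root (the module path is illustrative; repeat for each module you build):

# Hypothetical: create editable config files from the provided templates
cd menas/src/resources
for f in *.template; do cp "$f" "${f%.template}"; done
# then adjust the resulting *.properties / *.conf files to your environment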

Build commands:

  • Without tests: mvn clean package -DskipTests
  • With unit tests: mvn clean package
  • With integration tests: mvn clean package -Pintegration
  • With component preload file generated: mvn clean package -PgenerateComponentPreload

Test coverage:

  • Test coverage: mvn scoverage:report

The coverage reports are written in each module's target directory and aggregated in the root target directory.

How to run

Menas requirements:

The Spline UI can be omitted; in that case, set the Menas spline.urlTemplate setting to an empty string.

Deploying Menas

Simply copy the menas.war file produced when building the project into Tomcat's webapps directory.
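
A minimal sketch of the deployment, assuming a standard Tomcat installation with CATALINA_HOME set and the war built under menas/target (both assumptions):

# Hypothetical: copy the war built by Maven into Tomcat's webapps directory
cp menas/target/menas.war "$CATALINA_HOME/webapps/"
# Tomcat auto-deploys the application; with this war name it is served under the /menas context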

Speed up the initial loading time of Menas

  • Build the project with the generateComponentPreload profile. Component preload greatly reduces the number of HTTP requests required for the initial load of Menas.
  • Enable HTTP compression.
  • Configure spring.resources.cache.cachecontrol.max-age in Menas's application.properties to cache static resources (see the example below).
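
For the last point, a hypothetical example (the one-day value is illustrative; choose a max-age that suits your deployment):

# Hypothetical: cache Menas static resources for one day (the value is in seconds)
echo "spring.resources.cache.cachecontrol.max-age=86400" >> application.properties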

Standardization and Conformance requirements:

  • Menas Credentials File in your home directory or on HDFS, containing the username and password used to authenticate to Menas, e.g.:

username=user
password=changeme

  • Menas Keytab File in your home directory or on HDFS
    • Use with Kerberos authentication; see link for details on creating keytab files
  • Directory structure for the RAW dataset should follow the convention of <path_to_dataset_in_menas>/<year>/<month>/<day>/v<dataset_version> (see the example layout below). This date is specified with the --report-date option when running the Standardization and Conformance jobs.
  • An _INFO file must be present along with the RAW data on HDFS as per the above directory structure. This is a file tracking control measures via Atum; an example can be found here.
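
A hypothetical example of the expected RAW layout on HDFS (the base path, dataset name and versions are illustrative):

# Hypothetical RAW layout for dataset "mydataset", report date 2019-11-27, dataset version 1
hdfs dfs -ls /bigdata/raw/mydataset/2019/11/27/v1
# _INFO              <- Atum control-measure file that must accompany the raw data
# part-00000.xml     <- the raw data files themselves (format depends on the source system)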

Running Standardization

<spark home>/spark-submit \
--num-executors <num> \
--executor-memory <num>G \
--master yarn \
--deploy-mode <client/cluster> \
--driver-cores <num> \
--driver-memory <num>G \
--conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=<menas_api_uri:port> -Dstandardized.hdfs.path=<path_for_standardized_output>-{0}-{1}-{2}-{3} -Dspline.mongodb.url=<mongo_url_for_spline> -Dspline.mongodb.name=<spline_database_name> -Dhdp.version=<hadoop_version>" \
--class za.co.absa.enceladus.standardization.StandardizationJob \
<spark-jobs_<build_version>.jar> \
--menas-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>
  • Here, row-tag is an option specific to the XML raw-format. For more options for other formats, please see our WIKI.
  • In case Menas is configured for in-memory authentication (e.g. in dev environments), replace --menas-auth-keytab with --menas-credentials-file
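
For orientation, a hypothetical, filled-in version of the template above. Every concrete value below (host names, paths, the jar name, resource sizes and the dataset parameters) is an assumption and must be replaced with values from your environment:

# Hypothetical example only; substitute your own URIs, paths, jar name and resource settings
$SPARK_HOME/bin/spark-submit \
--num-executors 4 \
--executor-memory 4G \
--master yarn \
--deploy-mode client \
--driver-cores 2 \
--driver-memory 2G \
--conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menas.example.com:8080 -Dstandardized.hdfs.path=/bigdata/std/std-{0}-{1}-{2}-{3} -Dspline.mongodb.url=mongodb://mongo.example.com:27017 -Dspline.mongodb.name=spline -Dhdp.version=2.7.3" \
--class za.co.absa.enceladus.standardization.StandardizationJob \
spark-jobs-1.7.1.jar \
--menas-auth-keytab /home/user/menas.keytab \
--dataset-name mydataset \
--dataset-version 1 \
--report-date 2019-11-27 \
--report-version 1 \
--raw-format xml \
--row-tag row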

Running Conformance

<spark home>/spark-submit \
--num-executors <num> \
--executor-memory <num>G \
--master yarn \
--deploy-mode <client/cluster> \
--driver-cores <num> \
--driver-memory <num>G \
--conf 'spark.ui.port=29000' \
--conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=<menas_api_uri:port> -Dstandardized.hdfs.path=<path_of_standardized_input>-{0}-{1}-{2}-{3} -Dconformance.mappingtable.pattern=reportDate={0}-{1}-{2} -Dspline.mongodb.url=<mongo_url_for_spline> -Dspline.mongodb.name=<spline_database_name>" -Dhdp.version=<hadoop_version> \
--packages za.co.absa:enceladus-parent:<version>,za.co.absa:enceladus-conformance:<version> \
--class za.co.absa.enceladus.conformance.DynamicConformanceJob \
<spark-jobs_<build_version>.jar> \
--menas-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version>

Running Standardization and Conformance together

<spark home>/spark-submit \
--num-executors <num> \
--executor-memory <num>G \
--master yarn \
--deploy-mode <client/cluster> \
--driver-cores <num> \
--driver-memory <num>G \
--conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=<menas_api_uri:port> -Dstandardized.hdfs.path=<path_for_standardized_output>-{0}-{1}-{2}-{3} -Dspline.mongodb.url=<mongo_url_for_spline> -Dspline.mongodb.name=<spline_database_name> -Dhdp.version=<hadoop_version>" \
--class za.co.absa.enceladus.standardization_conformance.StandardizationAndConformanceJob \
<spark-jobs_<build_version>.jar> \
--menas-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>
  • In case Menas is configured for in-memory authentication (e.g. in dev environments), replace --menas-auth-keytab with --menas-credentials-file

Helper scripts for running Standardization, Conformance or both together

The scripts in the scripts folder can be used to simplify the command lines for running the Standardization and Conformance jobs.

Steps to configure the scripts are as follows (Linux):

  • Copy all the scripts in scripts/bash directory to a location in your environment.
  • Copy enceladus_env.template.sh to enceladus_env.sh.
  • Change enceladus_env.sh according to your environment settings.
  • Use run_standardization.sh and run_conformance.sh scripts instead of directly invoking spark-submit to run your jobs.

Similar scripts exist for Windows in directory scripts/cmd.

The syntax for running Standardization and Conformance is similar to running them using spark-submit. The only difference is that you don't have to provide environment-specific settings. Several resource options, like driver memory and driver cores, also have default values and can be omitted. The number of executors is still a mandatory parameter.

The basic command to run Standardization becomes:

<path to scripts>/run_standardization.sh \
--num-executors <num> \
--deploy-mode <client/cluster> \
--menas-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>

The basic command to run Conformance becomes:

<path to scripts>/run_conformance.sh \
--num-executors <num> \
--deploy-mode <client/cluster> \
--menas-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version>

The basic command to run Standardization and Conformance combined becomes:

<path to scripts>/run_standardization_conformance.sh \
--num-executors <num> \
--deploy-mode <client/cluster> \
--menas-auth-keytab <path_to_keytab_file> \
--dataset-name <dataset_name> \
--dataset-version <dataset_version> \
--report-date <date> \
--report-version <data_run_version> \
--raw-format <data_format> \
--row-tag <tag>

Similarly for Windows:

<path to scripts>/run_standardization.cmd ^
--num-executors <num> ^
--deploy-mode <client/cluster> ^
--menas-auth-keytab <path_to_keytab_file> ^
--dataset-name <dataset_name> ^
--dataset-version <dataset_version> ^
--report-date <date> ^
--report-version <data_run_version> ^
--raw-format <data_format> ^
--row-tag <tag>

The other Windows scripts follow the same pattern.

The list of options for configuring Spark deployment mode in Yarn and resource specification:

Option Description
--deploy-mode cluster/client Specifies a Spark Application deployment mode when Spark runs on Yarn. Can be either client or cluster.
--num-executors n Specifies the number of executors to use.
--executor-memory mem Specifies an amount of memory to request for each executor. See memory specification syntax in Spark. Examples: 4g, 8g.
--executor-cores n Specifies a number of cores to request for each executor (default=1).
--driver-cores n Specifies a number of CPU cores to allocate for the driver process.
--driver-memory mem Specifies an amount of memory to request for the driver process. See memory specification syntax in Spark. Examples: 4g, 8g.
--persist-storage-level level Advanced Specifies the storage level to use for persisting intermediate results. Can be one of NONE, DISK_ONLY, MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK (default), MEMORY_AND_DISK_SER, etc. See more here.
--conf-spark-executor-memoryOverhead mem Advanced. The amount of off-heap memory to be allocated per executor, in MiB unless otherwise specified. Sets spark.executor.memoryOverhead Spark configuration parameter. See the detailed description here. See memory specification syntax in Spark. Examples: 4g, 8g.
--conf-spark-memory-fraction value Advanced. Fraction of (heap space - 300MB) used for execution and storage (default=0.6). Sets spark.memory.fraction Spark configuration parameter. See the detailed description here.

For more information on these options see the official documentation on running Spark on Yarn: https://spark.apache.org/docs/latest/running-on-yarn.html
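
To illustrate how these resource options combine with the helper scripts, a hypothetical Standardization invocation (every concrete value is illustrative):

# Hypothetical: Standardization via the helper script with explicit resource settings
<path to scripts>/run_standardization.sh \
--num-executors 8 \
--executor-memory 6g \
--executor-cores 2 \
--driver-memory 4g \
--deploy-mode cluster \
--menas-auth-keytab /home/user/menas.keytab \
--dataset-name mydataset \
--dataset-version 1 \
--report-date 2019-11-27 \
--report-version 1 \
--raw-format csv \
--header true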

The list of all options for running Standardization, Conformance and the combined Standardization And Conformance jobs:

Option Description
--menas-auth-keytab filename A keytab file used for Kerberized authentication to Menas. Cannot be used together with --menas-credentials-file.
--menas-credentials-file filename A credentials file containing a login and a password used to authenticate to Menas. Cannot be used together with --menas-auth-keytab.
--dataset-name name A dataset name to be standardized or conformed.
--dataset-version version A version of a dataset to be standardized or conformed.
--report-date YYYY-mm-dd A date specifying the day for which the raw data was landed.
--report-version version A version of the data for a particular day.
--std-hdfs-path path A path pattern where to put standardized data. The following tokens are expanded in the pattern: {0} - dataset name, {1} - dataset version, {2} - report date, {3} - report version.
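
For example, a hypothetical --std-hdfs-path pattern and its expansion (the base path is illustrative):

# Hypothetical pattern passed to the job:
--std-hdfs-path /bigdata/std/std-{0}-{1}-{2}-{3}
# For dataset "mydataset", dataset version 1, report date 2019-11-27, report version 2 it expands to:
#   /bigdata/std/std-mydataset-1-2019-11-27-2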

The list of additional options available for running Standardization:

Option Description
--raw-format format A format for input data. Can be one of parquet, json, csv, xml, cobol, fixed-width.
--charset charset Specifies a charset to use for csv, json or xml. Default is UTF-8.
--cobol-encoding encoding Specifies the encoding of a mainframe file (ascii or ebcdic). A code page can be specified using the --charset option.
--cobol-is-text true/false Specifies if the mainframe file is an ASCII text file.
--cobol-trimming-policy policy Specifies the way leading and trailing spaces should be handled. Can be none (do not trim spaces), left, right, both (default).
--copybook string Path to a copybook for the COBOL data format.
--csv-escape character Specifies a character to be used for escaping other characters. By default '\' (backslash) is used. *
--csv-quote character Specifies a character to be used as a quote for creating fields that might contain the delimiter character. By default " is used. *
--debug-set-raw-path path Override the path of the raw data (used for testing purposes).
--delimiter character Specifies a delimiter character to use for CSV format. By default , is used. *
--empty-values-as-nulls true/false If true, treats empty values as nulls.
--folder-prefix prefix Adds a folder prefix before the date tokens.
--header true/false Indicates whether the input CSV data has a header as the first row of each file.
--is-xcom true/false If true, a mainframe input file is expected to have XCOM RDW headers.
--null-value string Defines how null values are represented in the csv and fixed-width file formats.
--row-tag tag A row tag if the input format is xml.
--strict-schema-check true/false If true, processing ends the moment a row not adhering to the schema is encountered; false (default) proceeds over it with an entry in errCol.
--trimValues true/false Indicates whether string fields of fixed-width text data should be trimmed.

Most of these options are format specific. For details see the documentation.

* Can also be specified as a unicode value in the following ways: U+00A1, u00a1, or just the code 00A1. If an empty string needs to be specified, the keyword none can be used.
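
For instance, a hypothetical combination of CSV-specific options using the unicode notation and the none keyword described above (all values are illustrative):

# Hypothetical: CSV input with a header row, a semicolon delimiter given as a unicode value,
# and the quote and escape characters set to the empty string via the keyword none
--raw-format csv \
--header true \
--delimiter U+003B \
--csv-quote none \
--csv-escape none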

The list of additional options available for running Conformance:

Option Description
--mapping-table-pattern pattern A pattern used to look up the mapping table for the specified date.
The list of possible substitutions: {0} - year, {1} - month, {2} - day of month. By default the pattern is reportDate={0}-{1}-{2}. Special symbols in the pattern need to be escaped. For example, an empty pattern can be specified as \'\' (single quotes are escaped using a backslash character).
--experimental-mapping-rule true/false If true, the experimental optimized mapping rule implementation is used. The default value is build-specific and is set in 'application.properties'.
--catalyst-workaround true/false Turns on (true) or off (false) a workaround for a Catalyst optimizer issue. It is true by default. Turn this off only if you encounter timing freeze issues when running Conformance.
--autoclean-std-folder true/false If true, the standardized folder will be cleaned automatically after successful execution of a Conformance job.

All the additional options valid for both Standardization and Conformance can also be specified when running the combined StandardizationAndConformance job.
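
As an illustration of --mapping-table-pattern, a hypothetical Conformance run via the helper script that passes the pattern explicitly (the pattern shown is just the documented default, and all other values are illustrative):

# Hypothetical: Conformance with the mapping table pattern spelled out
<path to scripts>/run_conformance.sh \
--num-executors 4 \
--deploy-mode client \
--menas-auth-keytab /home/user/menas.keytab \
--dataset-name mydataset \
--dataset-version 1 \
--report-date 2019-11-27 \
--report-version 1 \
--mapping-table-pattern 'reportDate={0}-{1}-{2}'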

Plugins

Standardization and Conformance support plugins that allow executing additional actions at certain times of the computation. To learn how plugins work, when and how their logic is executed, please refer to the documentation.

Built-in Plugins

The purpose of this module is to provide plugins with additional but relatively elementary functionality, and also to serve as an example of how plugins are written: detailed description.

Examples

A module containing examples of the project usage.

How to contribute

Please see our Contribution Guidelines.

Documentation

Please see the documentation pages.


Versions

1.7.1
1.7.0
1.6.3
1.6.2
1.6.1
1.6.0
1.5.0
1.4.0
1.3.1
1.3.0
1.2.2
1.2.1
1.2.0
1.1.1
1.1.0
1.0.0
1.0.0-RC5
1.0.0-RC4
1.0.0-RC3
1.0.0-RC2
1.0.0-RC1
1.0.0-RC0