DataFlattener
This project allows us to transform a JSON record into one or more records using a simple mapping configuration DLS called "mapper" file. This project provides the parser for files with .mapper
extension. Following is the data flow - given a source JSON and mapping configuration, Data Flattener produces a list of Table
object records.
Table
:Table
is a scala case class that holdstableName
for name of the table androws
for list ofRow
objects.
case class Table(tableName: String, rows: List[Row])
Row
:Row
contains a listColumn
.
case class Row(columns: List[Column])
Column
:Column
contains the name of the column and value associated with it.
case class Column(mappingContext: StraightMappingContext, value: JValue])
Motivation
Before we aggregate a single record, we often times have to apply some transformation logic of following types:
- normalize/clean,
- encrypt some field (PII or PHI security concerns),
- shrink the size of the record by only selecting fields required downstream, or
- we need to "flatten" the JSON structure into relational tables so that they can be inserted into a database downstream.
All of these transformations can be communicated using some sort of language mechanism.
Installation
libraryDependencies += "com.github.deprosun" % "dataflattener_2.11" % "7.0"
Example 1
Create a FoodDetail table such that description
field is lowercase and rename the field name fdcId
to id
.
{
"fdcId": 563346,
"description": "BAKING POWDER"
}
Table FoodDetail (
Mapping (
fdcId = id INT NOT NULL
lower(description) = description VARCHAR NULL
)
)
In the above configuration language we have defined the table called FoodDetail
which contains two columns:
id
of typeINT
, which gets the value fromfdcId
field in JSON. This field cannot benull
.description
of typeVARCHAR
which gets the value from applying a functionlower
ondescription
field in JSON. This field can benull
.
Example 1 Output
Example 2
I'd like to create two tables - one table called Donut
and a child table called Batter
. Donut
table will have two columns id
and donutName
where the value is retrieved from donutUniqueId
and name
fields in the JSON, respectively. The child table Batter
will be created from batters.batter
array field in the JSON. Batter
table will contain columns id
, and batterType
where values come from each element in batters.batter
. We also want to add donutId
so that the table has a foreign key referencing the parent id. Therefore, in this case the value of donutId
should be 0001
.
{
"donutUniqueId": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters": {
"batter": [
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}
TOPIC Donut (
MAPPING (
donutUniqueId = donutUID VARCHAR NOT NULL PK
name = donutName VARCHAR NOT NULL
)
TOPIC Batter FROM batters.batter WITH (donutUniqueId = donutId) (
MAPPING (
id = batterUID VARCHAR NOT NULL PK
donutId = donutParentId VARCHAR NOT NULL FK
type = batterType VARCHAR NOT NULL
)
)
)
/*
--Parent/Child Table--
you can create child tables simply by adding another TOPIC inside the parent TOPIC clause
by telling which field you want to create rows FROM. Sometimes we need to add extra
information (or rather need extra values) needed to generate child rows. We depict such
language with `WITH (donutUniqueId = donutId)` where `donutUniqueId` is a field key that
evaluate the value of it in the parent JSON and `donutId` is a variable name holding the
value for it. This value is then accessed using the variable you defined, here variable
is `donutId`. As we can the second mapping using `donutId` as a mapping rule for column
name `donutParentId`.
--Why is it not "batters.batter.id"?--
when we say "FROM batters.batter", in our mapping we use the array's element's JSON schema
as the root for each row. In other words, that is why you don't see `batters.batter.id` -
simply `id` will result in access the value.
*/
Example 2 Output
In this case, Donut
table will yield 1 row with id
= "0001". Table Batter
will yield 4 rows referring to "0001" in donutId
column
//Donut Table
| id | donutName |
|------|-----------|
| 0001 | Cake |
//Batter Table
| id | donutId | batterType |
|------|---------|--------------|
| 1001 | 0001 | Regular |
| 1002 | 0001 | Chocolate |
| 1003 | 0001 | Blueberry |
| 1004 | 0001 | Devil's Food |
Example 3
I'd like to create a Table Topping
from the topping
field. Moreover, I'd like to create a "flat" topping and donut info where each row contains a unique topping along side the name of the donut that topping is used on. The table should end up looking like:
//Topping Table
| id | name | donutName |
|------|--------------------------|-----------|
| 5001 | None | Cake |
| 5002 | Glazed | Cake |
| 5005 | Sugar | Cake |
| 5007 | Powdered Sugar | Cake |
| 5006 | Chocolate with Sprinkles | Cake |
| 5003 | Chocolate | Cake |
| 5004 | Maple | Cake |
{
"donutUniqueId": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters": {
"batter": [
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping": [
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}
TOPIC Topping (
MAPPING (
type = donutName VARCHAR NOT NULL
EXPLODE topping (
id = id VARCHAR (100) NOT NULL
type = toppingName VARCHAR (100) NOT NULL
)
)
)
Example 3 Output
Let's look at EXPLODE
. So far we have only seen 1 row getting created in the above example for the parent table. Sometimes, there are requirements, like above, where 1 row actually yield multiple. EXPLODE
tells DataFlattener exactly just that. It says:
the total number of rows yielded in Topping table will be the number of rows in
topping
fields.
Any mappings defined outside of EXPLODE
clause, simply "get added" to each row came by "exploding" the topping
field. That is why you see "Cake" multiple times in the final output.
//Topping Table
| id | name | donutName |
|------|--------------------------|-----------|
| 5001 | None | Cake |
| 5002 | Glazed | Cake |
| 5005 | Sugar | Cake |
| 5007 | Powdered Sugar | Cake |
| 5006 | Chocolate with Sprinkles | Cake |
| 5003 | Chocolate | Cake |
| 5004 | Maple | Cake |
Let's try it out!
There is nothing better than the actual demo to learn how DataFlattener behaves in the actual environment. Since this is JVM based library, you can add the code dependency based on your language from maven. Use version 6.0
. DataFlattener is leveraging Scala as the language so we will perform a demo using sbt
REPL - specifically, sbt console
. If you do not want to use SBT, you can simply write this piece of code in a test!
- Go ahead and start up a
sbt shell
from the project root directory.
$ sbt console
scala>
- Let's create our source JSON from above. Imports some stuff and create your sample json
import com.github.deprosun.dataflattener.transform.Transformer
import com.github.deprosun.dataflattener.model.MapperContext
import org.json4s
import org.json4s.JValue
import org.json4s.JsonAST.{JNothing, JNull, JString}
import org.json4s.native.JsonMethods._
import org.slf4j.{Logger, LoggerFactory}
val sourceJSON: String =
"""{
"donutUniqueId": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters": {
"batter": [
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
}""".stripMargin
- We will be performing Example 2. Lets create our mapper config:
val mapperConfig =
"""
TOPIC Donut (
MAPPING (
donutUniqueId = donutUID VARCHAR NOT NULL PK
name = donutName VARCHAR NOT NULL
)
TOPIC Batter FROM batters.batter (
MAPPING (
id = batterUID VARCHAR NOT NULL PK
donutId = donutId VARCHAR NOT NULL FK
type = batterType VARCHAR NOT NULL
)
)
)
""".stripMargin
- Since this project also provides a function to read the
mapper
configuration, let's use it. importimport com.github.deprosun.dataflattener.model.MapperContext
. Then create a mapper object usingmapperConfig
.
val mappers: List[MapperContext] = MapperContext.getMappers(mapperConfig) //get mapper objects that holds the above configuration information
- We use Json4s for parsing JSON string into an object called
JValue
from Json4s for parsing. Let's createJValue
object from our source JSON.
val record: JValue = parse(sourceJSON)
- When we define a mapper we also define a set of "business" rules and function that are sometimes referred to as UDF in sql world. Lets defined (or rather implements) these rules so that transformation only adheres only businesse's definition of transformation. For now, please ignore the variables:
logger
,toLowerCase
, andudfMap
. We will go over them later. Simply copy and paste the following and create your transformer:
import com.github.deprosun.dataflattener.transform.Transformer
import org.slf4j.{Logger, LoggerFactory}
val transformer: Transformer = new Transformer {
override val logger: Logger = LoggerFactory.getLogger("demo")
//convert a string to lowercase
def toLowerCase(json: List[JValue]): JValue = {
json match {
case (JNothing | JNull) :: Nil => JNothing //if the value as null, then just resort to null
case JString(x) :: Nil => JString(x.toLowerCase) //perform lowercase function
case (_: JValue) :: xs =>
throw new RuntimeException("You cannot supply more than 1 argument to function toLowerCase")
}
}
//a set of internal business modification value function
override val udfMap: Map[String, MapFunc] = Map(
"toLowerCase" -> toLowerCase
)
}
- Let run it by executing function
transformer.transform
.mapper
is a list ofMapperContext
.
val tranformed = mappers map { mapper =>
transformer.transform(record, mapper)
}
- Oops! We got an error.
java.lang.IllegalArgumentException: Column donutId cannot be null.
at com.github.deprosun.dataflattener.transform.Transformerransformer$class.getValueFromStraight(Transformer.scala:67)
at $anon$1.getValueFromStraight(<console>:34)
-
Something went wrong. Hmm, seems like we we mentioned
donutId
as "not null". Also, it is defined to beFK
. As per defintion,donutId
is comingdonutUniqueId
in the mapper configuration. But if you look at the source json the schema ofbatters.batter
does not containdonutUniqueId
field. Since the value of this column comes from the parent primary key column, we need to somehow "push" the value down to child tableBatter
. This is where we useWITH (x.y = someAlias)
. -
Lets look at the new mapper. Then lets reload it to the same
mapper
object.
val mapperConfig =
"""
TOPIC Donut (
MAPPING (
donutUniqueId = donutUID VARCHAR NOT NULL PK
name = donutName VARCHAR NOT NULL
)
TOPIC Batter FROM batters.batter WITH (donutUniqueId = donutId) (
MAPPING (
id = batterUID VARCHAR NOT NULL PK
donutId = donutParentId VARCHAR NOT NULL FK
type = batterType VARCHAR NOT NULL
)
)
)
""".stripMargin
val mapper: List[MapperContext] = MapperContext.getMappers(mapperConfig)
- Lets run the code again
// initiate transformation for each mapper/table
mappers foreach { x =>
//get all the transformed tables
val tables = transformer.transform(json, x)
//print each table in json format
tables foreach { table =>
println(table.toJsonString)
}
}
- Result! Each line shows you a json representation of the table.
{"Donut":[{"donutUID":"0001","donutName":"Cake"}]}
{"Batter":[{"batterUID":"1001","donutParentId":"0001","batterType":"Regular"},{"batterUID":"1002","donutParentId":"0001","batterType":"Chocolate"},{"batterUID":"1003","donutParentId":"0001","batterType":"Blueberry"},{"batterUID":"1004","donutParentId":"0001","batterType":"Devil's Food"}]}