spark-hats

Spark extensions for working with nested arrays and structs

License	Apache-2.0
GroupId	za.co.absa
ArtifactId	spark-hats_2.12
Last Version	0.2.2
Release Date	Mar 8, 2021
Type	jar
Description	Spark extensions for working with nested arrays and structs
Project URL	https://github.com/AbsaOSS/spark-hats
Project Organization	ABSA Group Limited
Source Code Management	https://github.com/AbsaOSS/spark-hats/tree/master

Download spark-hats_2.12

Filename	Size
spark-hats_2.12-0.2.2.pom
spark-hats_2.12-0.2.2.jar	34 KB
spark-hats_2.12-0.2.2-sources.jar	14 KB
spark-hats_2.12-0.2.2-javadoc.jar	1 MB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/za.co.absa/spark-hats_2.12/ -->
<dependency>
    <groupId>za.co.absa</groupId>
    <artifactId>spark-hats_2.12</artifactId>
    <version>0.2.2</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/za.co.absa/spark-hats_2.12/
implementation 'za.co.absa:spark-hats_2.12:0.2.2'

Gradle Kotlin

// https://jarcasting.com/artifacts/za.co.absa/spark-hats_2.12/
implementation ("za.co.absa:spark-hats_2.12:0.2.2")

Apache Buildr

'za.co.absa:spark-hats_2.12:jar:0.2.2'

Apache Ivy

<dependency org="za.co.absa" name="spark-hats_2.12" rev="0.2.2">
  <artifact name="spark-hats_2.12" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='za.co.absa', module='spark-hats_2.12', version='0.2.2')
)

Scala SBT

libraryDependencies += "za.co.absa" % "spark-hats_2.12" % "0.2.2"

Leiningen

[za.co.absa/spark-hats_2.12 "0.2.2"]

Dependencies

compile (1)

Group / Artifact	Type	Version
za.co.absa : spark-hofs_2.12	jar	0.4.0

provided (4)

Group / Artifact	Type	Version
org.scala-lang : scala-library	jar	2.12.10
org.apache.spark : spark-core_2.12	jar	2.4.7
org.apache.spark : spark-sql_2.12	jar	2.4.7
org.apache.spark : spark-catalyst_2.12	jar	2.4.7

test (1)

Group / Artifact	Type	Version
org.scalatest : scalatest_2.12	jar	3.0.3

Project Modules

There are no modules declared in this project.

spark-hats

Spark "Helpers for Array Transformations"

This library extends Spark DataFrame API with helpers for transforming fields inside nested structures and arrays of arbitrary levels of nesting.

Usage

Reference the library

Scala 2.11

groupId: za.co.absa
artifactId: spark-hats_2.11
version: 0.2.2

Scala 2.12

groupId: za.co.absa
artifactId: spark-hats_2.12
version: 0.2.2

Use the table below to determine which version of spark-hats is compatible with your Spark version.

spark-hats version	Scala version	Spark version
0.1.x	2.11, 2.12	2.4.3+
0.2.x	2.11, 2.12	2.4.3+
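
For sbt builds, the coordinates above can be written once for both Scala versions. The fragment below is a minimal sketch of a build.sbt, assuming a Scala 2.12 project; the scalaVersion and the Provided scope for spark-sql are assumptions taken from the dependency list of the 0.2.2 artifact, not requirements stated by the project. The %% operator makes sbt append the Scala binary version (_2.11 or _2.12) to the artifact name.

```scala
// build.sbt (sketch; adjust versions to your cluster)
scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  // %% resolves to spark-hats_2.12 for the scalaVersion above
  "za.co.absa" %% "spark-hats" % "0.2.2",
  // Spark itself is normally supplied by the runtime, hence Provided
  "org.apache.spark" %% "spark-sql" % "2.4.7" % Provided
)
```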

To use the extensions you need to add this import to your Spark application or shell:

import za.co.absa.spark.hats.Extensions._

Motivation

Here is a small example we will use to show how spark-hats works. The important thing is that the dataframe contains an array of struct fields.

scala> df.printSchema()
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
       
scala> df.show(false)
+---+------------------------------+
|id |my_array                      |
+---+------------------------------+
|1  |[[1, foo]]                    |
|2  |[[1, bar], [2, baz], [3, foz]]|
+---+------------------------------+
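
The source does not show how df was created. One way to reproduce it, as a sketch, is to let Spark infer the schema from JSON strings, which yields nullable fields like those in the schema above. The object name ExampleData, the method exampleDf and the local[1] master are hypothetical choices made for this illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExampleData {
  // Builds the two-row example dataframe from JSON strings.
  // Schema inference produces id: long and my_array: array<struct<a: long, b: string>>.
  def exampleDf(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val json = Seq(
      """{"id": 1, "my_array": [{"a": 1, "b": "foo"}]}""",
      """{"id": 2, "my_array": [{"a": 1, "b": "bar"}, {"a": 2, "b": "baz"}, {"a": 3, "b": "foz"}]}"""
    )
    spark.read.json(json.toDS())
  }

  def main(args: Array[String]): Unit = {
    // A local session is assumed here; in spark-shell, reuse the provided `spark`.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("spark-hats-example")
      .getOrCreate()
    val df = exampleDf(spark)
    df.printSchema()
    df.show(false)
    spark.stop()
  }
}
```

In spark-shell, calling ExampleData.exampleDf(spark) gives a dataframe that can stand in for df in the examples that follow.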

Now, say, we want to add a field c as part of the struct alongside a and b from the example above. The expression for c is c = a + 1.

Here is the code you can use in Spark:

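    import org.apache.spark.sql.functions.{col, struct}
    import za.co.absa.spark.hofs._ // provides transform() for the Scala API

    // Rebuild each array element as a struct with the extra field c = a + 1.
    val dfOut = df.select(col("id"), transform(col("my_array"), c => {
      struct(c.getField("a").as("a"),
        c.getField("b").as("b"),
        (c.getField("a") + 1).as("c"))
    }).as("my_array"))

(To use transform() in the Scala API, you need to add spark-hofs as a dependency.)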

Here is the same transformation using the spark-hats library:

    val dfOut = df.nestedMapColumn("my_array.a","c", a => a + 1)

Both produce the following results:

scala> dfOut.printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: long (nullable = true)

scala> dfOut.show(false)
+---+---------------------------------------+
|id |my_array                               |
+---+---------------------------------------+
|1  |[[1, foo, 2]]                          |
|2  |[[1, bar, 2], [2, baz, 3], [3, foz, 4]]|
+---+---------------------------------------+

Imagine what the plain Spark code would look like with more levels of array nesting.

Methods

Add a column

The nestedWithColumn method allows adding new fields inside nested structures and arrays.

The add-column API comes in two flavors: basic and extended. The basic API is simpler to use, but the expressions it accepts can only reference columns at the root of the schema. Here is an example of the basic add-column API:

scala> df.nestedWithColumn("my_array.c", lit("hello")).printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: string (nullable = false)

scala> df.nestedWithColumn("my_array.c", lit("hello")).show(false)
+---+---------------------------------------------------+
|id |my_array                                           |
+---+---------------------------------------------------+
|1  |[[1, foo, hello]]                                  |
|2  |[[1, bar, hello], [2, baz, hello], [3, foz, hello]]|
+---+---------------------------------------------------+
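
The basic API is not limited to literals. Since its expressions may reference root-level columns, the new nested field can be derived from one. The following is a minimal sketch under that reading; the object name BasicAddColumn, the method withIdTimesTen and the expression id * 10 are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import za.co.absa.spark.hats.Extensions._

object BasicAddColumn {
  // Adds my_array.c to every array element.
  // The expression only references the root-level column id,
  // which is what the basic API permits.
  def withIdTimesTen(df: DataFrame): DataFrame =
    df.nestedWithColumn("my_array.c", col("id") * 10)
}
```

Every element of my_array within one row receives the same value of c, because the expression depends only on that row's id. Referencing my_array.a or my_array.b instead requires the extended API.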

Add column (extended)

The extended API method nestedWithColumnExtended works like the basic one, but lets the caller reference other array elements, possibly at different levels of nesting. To make this possible, the second parameter is a function that returns a column rather than a column itself. That function receives one argument, getField(), which is itself a function. Inside the transformation, call getField() with a fully qualified column name to reference other columns in the dataframe.

In the following example, a transformation adds a new field my_array.c to the dataframe by concatenating a root level column id with a nested field my_array.b:

scala> val dfOut = df.nestedWithColumnExtended("my_array.c", getField =>
         concat(getField("id").cast("string"), getField("my_array.b"))
       )

scala> dfOut.printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: string (nullable = true)

scala> dfOut.show(false)
+---+------------------------------------------------+
|id |my_array                                        |
+---+------------------------------------------------+
|1  |[[1, foo, 1foo]]                                |
|2  |[[1, bar, 2bar], [2, baz, 2baz], [3, foz, 2foz]]|
+---+------------------------------------------------+

Note: You can still use col to reference root-level columns. But if a column is inside an array (like my_array.b), invoking col("my_array.b") references the whole array, not an individual element. The getField() function passed to the transformation solves this by providing a generic way of addressing array elements at arbitrary levels of nesting.
Advanced note: If there are several arrays in the schema, getField() can reference elements of an array if that array is one of the parents of the output column.
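
The two ways of addressing columns can be mixed in one expression. The sketch below repeats the concatenation example, using col for the root-level id and getField for the nested my_array.b. The object name ExtendedAddColumn and the method withIdAndB are hypothetical; per the note above, the result should match the earlier example, although that has not been verified here against a specific Spark version.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat}
import za.co.absa.spark.hats.Extensions._

object ExtendedAddColumn {
  // Adds my_array.c = string(id) + my_array.b for every array element.
  def withIdAndB(df: DataFrame): DataFrame =
    df.nestedWithColumnExtended("my_array.c", getField =>
      concat(
        col("id").cast("string"),  // root-level column: col works
        getField("my_array.b")     // field inside the array: getField is needed
      )
    )
}
```

Replacing getField("my_array.b") with col("my_array.b") would pass the whole array of b values into concat instead of the current element's b.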

Drop a column

The nestedDropColumn method allows dropping fields inside nested structures and arrays.

scala> df.nestedDropColumn("my_array.b").printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a: long (nullable = true)

scala> df.nestedDropColumn("my_array.b").show(false)
+---+---------------+
|id |my_array       |
+---+---------------+
|1  |[[1]]          |
|2  |[[1], [2], [3]]|
+---+---------------+

Map a column

The nestedMapColumn method applies a transformation to a nested field. If the input column is a primitive field, the method adds outputColumnName at the same level of nesting. If the input column is a struct, you can use the .getField(...) method to operate on its children.

The output column name can omit the full path as the field will be created at the same level of nesting as the input column.

scala> df.nestedMapColumn(inputColumnName = "my_array.a", outputColumnName = "c", expression = a => a + 1).printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: long (nullable = true)

scala> df.nestedMapColumn(inputColumnName = "my_array.a", outputColumnName = "c", expression = a => a + 1).show(false)
+---+---------------------------------------+
|id |my_array                               |
+---+---------------------------------------+
|1  |[[1, foo, 2]]                          |
|2  |[[1, bar, 2], [2, baz, 3], [3, foz, 4]]|
+---+---------------------------------------+
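
The example above maps a primitive field. For the struct case mentioned in the method description, a minimal sketch could look as follows. It assumes a different, hypothetical schema in which each element of my_array contains a struct s with a long field x; the names s, x, sx, StructMapColumn and withIncrementedX are not from the source, and the exact output placement is inferred from the description rather than confirmed.

```scala
import org.apache.spark.sql.DataFrame
import za.co.absa.spark.hats.Extensions._

object StructMapColumn {
  // Assumed input schema (hypothetical):
  //   my_array: array<struct<a: long, s: struct<x: long>>>
  // The lambda receives the struct column my_array.s and reads its child x.
  // The result is expected to be added as sx next to s, inside each array element.
  def withIncrementedX(df: DataFrame): DataFrame =
    df.nestedMapColumn(
      inputColumnName = "my_array.s",
      outputColumnName = "sx",
      expression = s => s.getField("x") + 1
    )
}
```

Checking the result with printSchema before relying on it is advisable, since the behavior for struct inputs is only described briefly in the text above.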

Other transformations

Unstruct

Syntax: df.nestedUnstruct("NestedStructColumnName").

Flattens one level of nesting when a struct is nested in another struct. For example:

scala> df.printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: struct (nullable = true)
 |    |    |    |-- nestedField1: string (nullable = true)
 |    |    |    |-- nestedField2: long (nullable = true)

scala> df.nestedUnstruct("my_array.c").printSchema
root
 |-- id: long (nullable = true)
 |-- my_array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- nestedField1: string (nullable = true)
 |    |    |-- nestedField2: long (nullable = true)

Note that the output schema doesn't have the c struct. All fields of c are now part of the parent struct.

Changelog

0.2.2 released 8 March 2021.
- #23 Added nestedUnstruct() method that flattens one level of nesting for a given struct.
0.2.1 released 21 January 2020.
- #10 Fixed error column aggregation when the input array is null.
0.2.0 released 16 January 2020.
- #5 Added the extended nested transformation API that allows referencing arbitrary columns.

ABSA OSS

ABSA Open Source

Versions

Version
0.2.2 Mar 8, 2021
0.2.1 Jan 21, 2020
0.2.0 Jan 16, 2020
0.1.0 Jan 6, 2020
