SparkUtils


License

GroupId
com.warren-r

ArtifactId
sparkutils_2.12

Last Version
0.1.2

Release Date

Type
jar

Description
SparkUtils

Project URL
https://github.com/warrenronsiek/SparkUtils

Project Organization
com.warren-r

Source Code Management
https://github.com/warrenronsiek/SparkUtils


How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.warren-r/sparkutils_2.12/ -->
<dependency>
    <groupId>com.warren-r</groupId>
    <artifactId>sparkutils_2.12</artifactId>
    <version>0.1.2</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.warren-r/sparkutils_2.12/
implementation 'com.warren-r:sparkutils_2.12:0.1.2'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.warren-r/sparkutils_2.12/
implementation("com.warren-r:sparkutils_2.12:0.1.2")

Buildr:

'com.warren-r:sparkutils_2.12:jar:0.1.2'

Ivy:

<dependency org="com.warren-r" name="sparkutils_2.12" rev="0.1.2">
  <artifact name="sparkutils_2.12" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='com.warren-r', module='sparkutils_2.12', version='0.1.2')
)

sbt:

libraryDependencies += "com.warren-r" % "sparkutils_2.12" % "0.1.2"

Leiningen:

[com.warren-r/sparkutils_2.12 "0.1.2"]
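Since the artifact name carries the standard Scala binary-version suffix, an sbt build on Scala 2.12 can also use the `%%` operator, which appends the `_2.12` suffix automatically:

```scala
// Equivalent sbt coordinate: %% derives the _2.12 suffix from scalaVersion
libraryDependencies += "com.warren-r" %% "sparkutils" % "0.1.2"
```

This form keeps the build file correct if the project's Scala binary version is the one the artifact is published for.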

Dependencies

compile (5)

Group / Artifact Type Version
org.scala-lang : scala-library jar 2.12.12
org.apache.spark : spark-core_2.12 jar 3.0.0
org.apache.spark : spark-sql_2.12 jar 3.0.0
com.typesafe.scala-logging : scala-logging_2.12 jar 3.9.2
org.scalatest : scalatest_2.12 jar 3.2.3

test (2)

Group / Artifact Type Version
ch.qos.logback : logback-classic jar 1.2.3
ch.qos.logback : logback-core jar 1.2.3

Project Modules

There are no modules declared in this project.

SparkUtils

Generic Spark utilities. At the moment it provides only the DataFrame SnapshotTest utility, with more to come.

SnapshotTest

Snapshot tests store copies of data as Parquet files in your test/resources directory. When you call assertSnapshot("snapName", newDf, "joinCol1", "joinCol2"), the library reads the stored snapshot as a DataFrame, joins it to the provided DataFrame on the join columns, and compares the two DataFrames across every column. If it finds any differences, it prints them and fails the test; otherwise the assertion succeeds. If no stored snapshot with a matching name exists in test/resources, the library creates a new one.
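The join-and-diff step described above can be sketched with plain Scala collections. This is an illustrative sketch only, not the library's actual implementation (which operates on Spark DataFrames and Parquet files); the names SnapshotSketch and diff are hypothetical:

```scala
// Illustrative sketch of snapshot comparison: rows as Maps,
// joined on key columns, then diffed column by column.
object SnapshotSketch {
  type Row = Map[String, Any]

  /** Join two row sets on the key columns and report per-column diffs. */
  def diff(snapshot: Seq[Row], current: Seq[Row], keys: Seq[String]): Seq[String] = {
    def keyOf(r: Row): Seq[Any] = keys.map(r(_))
    // Index the stored snapshot by its join-key values.
    val snapByKey = snapshot.map(r => keyOf(r) -> r).toMap
    current.flatMap { row =>
      snapByKey.get(keyOf(row)) match {
        case None => Seq(s"row with key ${keyOf(row)} missing from snapshot")
        case Some(old) =>
          // Compare every non-key column against the stored value.
          row.keys.toSeq.filterNot(keys.contains).flatMap { col =>
            if (old.get(col) == row.get(col)) None
            else Some(s"column $col differs at key ${keyOf(row)}: " +
              s"${old.get(col)} vs ${row.get(col)}")
          }
      }
    }
  }
}
```

A real snapshot test would fail when the diff list is non-empty, print the differences, and write a fresh snapshot when none exists yet.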

Example Usage

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.scalatest.flatspec.AnyFlatSpec
import com.warren_r.sparkutils.snapshot.SnapshotTest

class SnapshotTestTest extends AnyFlatSpec with SnapshotTest {
  val sparkConf: SparkConf = new SparkConf()
  val sparkSession: SparkSession = SparkSession
    .builder
    .master("local[*]")
    .appName("RunningTests")
    .config(sparkConf)
    .getOrCreate()
  
  // A DataFrame built from an RDD of tuples gets default column names _1, _2, ...
  val df: DataFrame = sparkSession.createDataFrame(
    sparkSession.sparkContext.parallelize(Seq((1, "a"), (2, "b")))
  )

  "snapshots" should "pass" in {
    assertSnapshot("demoSnapShot", df, "_1")
  }
}

Versions

Version
0.1.2
0.1.1
0.1