online-stats

License	MIT-style
GroupId	org.tupol
ArtifactId	online-stats_2.12
Last Version	0.0.3
Release Date	May 16, 2019
Type	jar
Description	online-stats
Project URL	https://github.com/tupol/spark-utils
Project Organization	org.tupol
Source Code Management	https://github.com/tupol/spark-utils.git

Download online-stats_2.12

Filename	Size
online-stats_2.12-0.0.3.pom
online-stats_2.12-0.0.3.jar	71 KB
online-stats_2.12-0.0.3-tests.jar	110 KB
online-stats_2.12-0.0.3-tests-sources.jar	8 KB
online-stats_2.12-0.0.3-tests-javadoc.jar	1 MB
online-stats_2.12-0.0.3-sources.jar	14 KB
online-stats_2.12-0.0.3-javadoc.jar	1 MB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/org.tupol/online-stats_2.12/ -->
<dependency>
    <groupId>org.tupol</groupId>
    <artifactId>online-stats_2.12</artifactId>
    <version>0.0.3</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/org.tupol/online-stats_2.12/
implementation 'org.tupol:online-stats_2.12:0.0.3'

Gradle Kotlin

// https://jarcasting.com/artifacts/org.tupol/online-stats_2.12/
implementation ("org.tupol:online-stats_2.12:0.0.3")

Apache Buildr

'org.tupol:online-stats_2.12:jar:0.0.3'

Apache Ivy

<dependency org="org.tupol" name="online-stats_2.12" rev="0.0.3">
  <artifact name="online-stats_2.12" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='org.tupol', module='online-stats_2.12', version='0.0.3')
)

Scala SBT

libraryDependencies += "org.tupol" % "online-stats_2.12" % "0.0.3"

Leiningen

[org.tupol/online-stats_2.12 "0.0.3"]

Dependencies

compile (1)

Group / Artifact	Type	Version
org.scala-lang : scala-library	jar	2.12.6

test (3)

Group / Artifact	Type	Version
org.scalatest : scalatest_2.12	jar	3.0.1
org.scalacheck : scalacheck_2.12	jar	1.14.0
org.apache.commons : commons-math3	jar	3.2

Project Modules

There are no modules declared in this project.

Spark Utils

Motivation

One of the biggest challenges after taking the first steps into the world of writing Apache Spark applications in Scala is taking them to production.

An application of any kind needs to be easy to run and easy to configure.

This project is trying to help developers write Spark applications focusing mainly on the application logic rather than the details of configuring the application and setting up the Spark context.

This project is also trying to create and encourage a friendly yet professional environment for developers to help each other, so please do not be shy and join through gitter, twitter, issue reports or pull requests.

Description

This project contains some basic utilities that can help set up a Spark application project.

The main point is the simplicity of writing Apache Spark applications by focusing just on the logic, while providing for easy configuration and argument passing.

The code sample below shows how easy it can be to write a file format converter from any acceptable type, with any acceptable parsing configuration options, to any acceptable format.

object FormatConverterExample extends SparkApp[FormatConverterContext, DataFrame] {
  override def createContext(config: Config) = FormatConverterContext(config).get
  override def run(implicit spark: SparkSession, context: FormatConverterContext): DataFrame = {
    val inputData = spark.source(context.input).read
    inputData.sink(context.output).write
  }
}

Creating the configuration can be as simple as defining a case class to hold the configuration and a factory that helps extract simple and complex data types, like input sources and output sinks.

case class FormatConverterContext(input: FormatAwareDataSourceConfiguration,
                                  output: FormatAwareDataSinkConfiguration)

object FormatConverterContext extends Configurator[FormatConverterContext] {
  def validationNel(config: Config): ValidationNel[Throwable, FormatConverterContext] = {
    config.extract[FormatAwareDataSourceConfiguration]("input") |@|
      config.extract[FormatAwareDataSinkConfiguration]("output") apply
      FormatConverterContext.apply
  }
}
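The factory above combines two extraction results so that the context is built only when both succeed. As a rough illustration of that idea, here is a minimal plain-Scala sketch that accumulates errors using `Either` instead of the library's `ValidationNel`. Every name in it (`ValidationSketch`, `Validated`, `extract`, `map2`, `ConverterContext`) is hypothetical and stands in for the real configuration framework; it is not the library's implementation.

```scala
// Plain-Scala sketch of error-accumulating validation; all names are hypothetical.
object ValidationSketch {
  // Left carries every error found so far, Right carries the extracted value.
  type Validated[A] = Either[List[String], A]

  // Hypothetical stand-in for config.extract[String](key), reading from a flat Map.
  def extract(config: Map[String, String], key: String): Validated[String] =
    config.get(key).toRight(List(s"Missing key: $key"))

  // Combines two results: applies f if both succeeded, otherwise keeps all errors.
  def map2[A, B, C](va: Validated[A], vb: Validated[B])(f: (A, B) => C): Validated[C] =
    (va, vb) match {
      case (Right(a), Right(b)) => Right(f(a, b))
      case (Left(ea), Left(eb)) => Left(ea ++ eb)
      case (Left(ea), _)        => Left(ea)
      case (_, Left(eb))        => Left(eb)
    }

  final case class ConverterContext(input: String, output: String)

  // Mirrors the shape of the factory: extract "input" and "output", then build the context.
  def validate(config: Map[String, String]): Validated[ConverterContext] =
    map2(extract(config, "input"), extract(config, "output"))(ConverterContext.apply)
}
```

With this sketch, a configuration that is missing both keys reports both problems at once rather than stopping at the first one, which is the main reason to prefer an accumulating validation over fail-fast error handling for configuration.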

Optionally, SparkFun can be used instead of SparkApp to make the code even more concise.

object FormatConverterExample extends 
          SparkFun[FormatConverterContext, DataFrame](FormatConverterContext(_).get) {
  override def run(implicit spark: SparkSession, context: FormatConverterContext): DataFrame = 
    spark.source(context.input).read.sink(context.output).write
}
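The difference between the two styles is only in where the context factory lives: SparkApp asks for a `createContext` override, while SparkFun takes the factory as a constructor argument. The following Spark-free sketch shows that structure under stated assumptions. The names `MiniApp`, `MiniFun`, `execute`, `CopyContext` and `CopyExample` are hypothetical, a `Map[String, String]` stands in for the real `Config`, and the code is not taken from the library.

```scala
// Hypothetical, Spark-free sketch of the template pattern; not the library's code.
import scala.util.Try

object AppPatternSketch {
  // Stand-in for SparkApp: subclasses supply context creation and the logic.
  trait MiniApp[Context, Result] {
    def createContext(config: Map[String, String]): Context
    def run(context: Context): Result
    // Shared entry point: build the context, then run the logic; failures end up in the Try.
    def execute(config: Map[String, String]): Try[Result] =
      Try(createContext(config)).map(run)
  }

  // Stand-in for SparkFun: the context factory is passed as a constructor argument.
  abstract class MiniFun[Context, Result](factory: Map[String, String] => Context)
      extends MiniApp[Context, Result] {
    final override def createContext(config: Map[String, String]): Context = factory(config)
  }

  final case class CopyContext(input: String, output: String)

  // Only the logic is left to write; a missing key makes the factory throw, which Try captures.
  object CopyExample
      extends MiniFun[CopyContext, String](c => CopyContext(c("input"), c("output"))) {
    override def run(context: CopyContext): String =
      s"${context.input} -> ${context.output}"
  }
}
```

Calling `AppPatternSketch.CopyExample.execute(Map("input" -> "in.csv", "output" -> "out.parquet"))` yields a `Success` wrapping the string built in `run`, while an empty map yields a `Failure`.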

For structured streaming applications the format converter might look like this:

object StreamingFormatConverterExample extends SparkApp[StreamingFormatConverterContext, Unit] {
  override def createContext(config: Config) = StreamingFormatConverterContext(config).get
  // awaitTermination() blocks until the streaming query stops and returns Unit
  override def run(implicit spark: SparkSession, context: StreamingFormatConverterContext): Unit = {
    val inputData = spark.source(context.input).read
    inputData.streamingSink(context.output).write.awaitTermination()
  }
}

The streaming configuration can be as simple as the following:

case class StreamingFormatConverterContext(input: FormatAwareStreamingSourceConfiguration, 
                                           output: FormatAwareStreamingSinkConfiguration)

object StreamingFormatConverterContext extends Configurator[StreamingFormatConverterContext] {
  def validationNel(config: Config): ValidationNel[Throwable, StreamingFormatConverterContext] = {
    config.extract[FormatAwareStreamingSourceConfiguration]("input") |@|
      config.extract[FormatAwareStreamingSinkConfiguration]("output") apply
      StreamingFormatConverterContext.apply
  }
}

SparkRunnable and SparkApp or SparkFun, together with the configuration framework, make it easy to create Spark applications with configuration that can be managed through configuration files or application parameters.

The IO frameworks for reading and writing data frames add extra convenience for setting up batch and structured streaming jobs that transform various types of files and streams.

Last but not least, there are many utility functions that provide convenience for loading resources, dealing with schemas and so on.

Most of the common features are also implemented as decorators for the main Spark classes, such as SparkContext, DataFrame and StructType, and they are conveniently available by importing org.tupol.spark.implicits._.
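In Scala, decorators of this kind are commonly written as implicit classes that become available through a wildcard import. The sketch below illustrates the mechanism on a plain `Map[String, String]` rather than on a Spark class. The names `DecoratorSketch`, `implicits`, `ConfigMapOps` and `subConfig`, as well as the sample settings, are hypothetical and are not part of the library.

```scala
// Hypothetical sketch of the decorator-by-implicit-class mechanism; not the library's code.
object DecoratorSketch {
  // Plays the role of an implicits object whose members are brought in by a wildcard import.
  object implicits {
    // Adds a method to Map[String, String] without modifying the Map class itself.
    implicit class ConfigMapOps(underlying: Map[String, String]) {
      // Returns the entries under `prefix.` with that prefix stripped from the keys.
      def subConfig(prefix: String): Map[String, String] =
        underlying.collect {
          case (key, value) if key.startsWith(prefix + ".") =>
            key.stripPrefix(prefix + ".") -> value
        }
    }
  }

  // After this import, subConfig can be called as if it were a method of Map.
  import implicits._

  val settings: Map[String, String] =
    Map("input.format" -> "csv", "input.path" -> "/tmp/in", "output.format" -> "json")

  val inputSettings: Map[String, String] = settings.subConfig("input")
}
```

The appeal of this design is that the extra methods are opt-in: code that does not import the implicits sees the original classes unchanged.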

Documentation

The documentation for the main utilities and frameworks available:

SparkApp, SparkFun and SparkRunnable
DataSource Framework for both batch and structured streaming applications
DataSink Framework for both batch and structured streaming applications

Latest stable API documentation is available here.

An extensive tutorial and walk-through can be found here. Extensive samples and demos can be found here.

A nice example of how this library can be used can be found in the spark-tools project, through the implementation of a generic format converter and a SQL processor for both batch and structured streams.

Prerequisites

Java 8 or higher
Scala 2.11 or 2.12
Apache Spark 2.4.X

Getting Spark Utils

Spark Utils is published to Maven Central and Spark Packages:

Group id / organization: org.tupol
Artifact id / name: spark-utils
Latest stable version is 0.4.2

Usage with SBT: add a dependency on the latest version to your sbt build definition file:

libraryDependencies += "org.tupol" %% "spark-utils" % "0.4.2"
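The `%%` operator in sbt appends the Scala binary version of the build to the artifact name. As a small illustration, assuming a build that targets Scala 2.11.12 (the `scalaVersion` value here is an assumption made for the example), the two declarations below should resolve to the same artifact:

```scala
// build.sbt fragment; the scalaVersion value is assumed for illustration.
scalaVersion := "2.11.12"

// %% appends the Scala binary version, so this resolves spark-utils_2.11.
libraryDependencies += "org.tupol" %% "spark-utils" % "0.4.2"

// Equivalent explicit form with a single %, spelling out the suffix by hand:
// libraryDependencies += "org.tupol" % "spark-utils_2.11" % "0.4.2"
```

Tools that do not know the Scala version of the build, such as `--packages` on the Spark command line, need the explicit form with the suffix.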

Include this package in your Spark applications using spark-shell or spark-submit:

$SPARK_HOME/bin/spark-shell --packages org.tupol:spark-utils_2.11:0.4.2

Starting a New `spark-utils` Project

The simplest way to start a new spark-utils project is to make use of the spark-apps.seed.g8 template project.

To fill in the project options manually, run:

g8 tupol/spark-apps.seed.g8

The default options look like the following:

name [My Project]:
appname [My First App]:
organization [my.org]:
version [0.0.1-SNAPSHOT]:
package [my.org.my_project]:
classname [MyFirstApp]:
scriptname [my-first-app]:
scalaVersion [2.11.12]:
sparkVersion [2.4.0]:
sparkUtilsVersion [0.4.0]:

To fill in the options in advance, run:

g8 tupol/spark-apps.seed.g8 --name="My Project" --appname="My App" --organization="my.org" --force

What's new?

0.4.2

The project compiles with both Scala 2.11.12 and 2.12.12
Updated Apache Spark to 2.4.6
Updated the spark-xml library to 0.10.0
Removed the com.databricks:spark-avro dependency, as avro support is now built into Apache Spark
Removed the shadow org.apache.spark.Logging class, which is replaced by the org.tupol.spark.Logging knock-off

For previous versions please consult the release notes.

License

This code is open source software licensed under the MIT License.

Versions

Version
0.0.3 May 16, 2019
