spark-plug

GroupId: com.bizo
ArtifactId: spark-plug_2.11
Last Version: 1.2.6
Type: jar
Description: spark-plug
Project URL: https://github.com/ogrodnek/spark-plug
Project Organization: com.bizo
Source Code Management: https://github.com/ogrodnek/spark-plug
How to add to project

Maven

<!-- https://jarcasting.com/artifacts/com.bizo/spark-plug_2.11/ -->
<dependency>
    <groupId>com.bizo</groupId>
    <artifactId>spark-plug_2.11</artifactId>
    <version>1.2.6</version>
</dependency>

Gradle (Groovy DSL)

// https://jarcasting.com/artifacts/com.bizo/spark-plug_2.11/
implementation 'com.bizo:spark-plug_2.11:1.2.6'

Gradle (Kotlin DSL)

// https://jarcasting.com/artifacts/com.bizo/spark-plug_2.11/
implementation("com.bizo:spark-plug_2.11:1.2.6")

Buildr

'com.bizo:spark-plug_2.11:jar:1.2.6'

Ivy

<dependency org="com.bizo" name="spark-plug_2.11" rev="1.2.6">
  <artifact name="spark-plug_2.11" type="jar" />
</dependency>

Groovy Grape

@Grapes(
  @Grab(group='com.bizo', module='spark-plug_2.11', version='1.2.6')
)

SBT

libraryDependencies += "com.bizo" % "spark-plug_2.11" % "1.2.6"

Leiningen

[com.bizo/spark-plug_2.11 "1.2.6"]

Dependencies

compile (4)

Group / Artifact                           Type  Version
org.scala-lang : scala-library             jar   2.11.5
com.amazonaws : aws-java-sdk               jar   1.10.16
com.googlecode.json-simple : json-simple   jar   1.1.1
commons-lang : commons-lang                jar   2.6

test (2)

Group / Artifact                 Type  Version
junit : junit                    jar   4.10
com.novocode : junit-interface   jar   0.10-M4

Project Modules

There are no modules declared in this project.

spark-plug


A Scala driver for launching Amazon EMR jobs

why?

We run a lot of reports. In the past, these were kicked off by bash scripts that typically did date math, copied scripts and config files to S3, and then called the Amazon elastic-mapreduce command-line client to launch the job. The EMR client invocation ends up being dozens of lines of bash adding each step and passing arguments.

It's been a pain to share defaults or add any abstraction over common job steps, and performing date arithmetic and conditionally adding EMR steps in bash is awkward. Lastly, the EMR command-line client offers less control over certain options than the EMR API does.

simple example

val flow = JobFlow(
  name      = s"${stage}: analytics report [${date}]",
  cluster   = Master() + Core(8) + Spot(10),
  bootstrap = Seq(MemoryIntensive),
  steps     = Seq(
    SetupDebugging(),
    new HiveStep("s3://bucket/location/report.sql",
      Map("YEAR" -> year, "MONTH" -> month, "DAY" -> day))
  )
)

val id = Emr.run(flow)(ClusterDefaults(hadoop="1.0.3"))
println(id)
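The example above hard-codes year, month, and day; in Scala, the date math and conditional step assembly that were painful in bash become ordinary code. A hypothetical sketch of that pattern (the DEBUG switch is invented for illustration, and plain strings stand in for spark-plug step objects such as SetupDebugging and HiveStep):

```scala
import java.time.LocalDate

object AssembleSteps extends App {
  // Run yesterday's report -- the kind of date math the bash
  // scripts used to do by hand.
  val date  = LocalDate.now().minusDays(1)
  val year  = date.getYear.toString
  val month = f"${date.getMonthValue}%02d"
  val day   = f"${date.getDayOfMonth}%02d"

  // Conditionally include a debugging step. The DEBUG environment
  // switch is an assumption for this sketch, and the strings stand
  // in for spark-plug step objects.
  val debug = sys.env.get("DEBUG").contains("1")
  val steps =
    (if (debug) Seq("SetupDebugging()") else Seq.empty) :+
      s"HiveStep(report.sql, $year-$month-$day)"

  println(steps.mkString(", "))
}
```

The resulting Seq would be passed as the steps argument of a JobFlow, as in the example above.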

API documentation

download

Available in Maven Central as com.bizo spark-plug_2.11

Versions

1.2.6