spark-ecs-s3

The spark-ecs-s3 project makes it possible to work with data stored in ECS using Apache Spark.

License:
GroupId: com.emc.ecs
ArtifactId: spark-ecs-s3_2.11
Last Version: 1.4.1
Release Date:
Type: jar
Description: The spark-ecs-s3 project makes it possible to work with data stored in ECS using Apache Spark.
Project URL: https://github.com/EMCECS/spark-ecs-s3
Source Code Management: https://github.com/EMCECS/spark-ecs-s3

Download spark-ecs-s3_2.11

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.emc.ecs/spark-ecs-s3_2.11/ -->
<dependency>
    <groupId>com.emc.ecs</groupId>
    <artifactId>spark-ecs-s3_2.11</artifactId>
    <version>1.4.1</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.emc.ecs/spark-ecs-s3_2.11/
implementation 'com.emc.ecs:spark-ecs-s3_2.11:1.4.1'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.emc.ecs/spark-ecs-s3_2.11/
implementation("com.emc.ecs:spark-ecs-s3_2.11:1.4.1")

Buildr:

'com.emc.ecs:spark-ecs-s3_2.11:jar:1.4.1'

Ivy:

<dependency org="com.emc.ecs" name="spark-ecs-s3_2.11" rev="1.4.1">
  <artifact name="spark-ecs-s3_2.11" type="jar" />
</dependency>

Groovy Grape:

@Grapes(
  @Grab(group='com.emc.ecs', module='spark-ecs-s3_2.11', version='1.4.1')
)

sbt:

libraryDependencies += "com.emc.ecs" % "spark-ecs-s3_2.11" % "1.4.1"

Leiningen:

[com.emc.ecs/spark-ecs-s3_2.11 "1.4.1"]

Dependencies

provided (2)

Group / Artifact Type Version
org.apache.spark : spark-core_2.11 jar 2.0.1
org.apache.spark : spark-sql_2.11 jar 2.0.1

Project Modules

There are no modules declared in this project.

Bucket Metadata Search with Spark SQL (2.x)

The spark-ecs-connector project makes it possible to view an ECS bucket as a Spark dataframe. Each row in the dataframe corresponds to an object in the bucket, and each column corresponds to a piece of object metadata.
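
As a quick illustration, once a bucket has been read with the bucket reader shown later in this README (endpoint, credentials, and bucket name below are placeholders), the metadata columns can be inspected directly on the resulting dataframe:

// Assumes endpointUri and credential are defined as in the Zeppelin example below
val df = sqlContext.read.bucket(endpointUri, credential, "ben_bucket", withSystemMetadata = false)
df.printSchema()   // one field per metadata key exposed by the bucket
df.show(10)        // one row per object in the bucket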

How it Works

Spark SQL supports querying external data sources and rendering the results as a dataframe. By implementing the PrunedFilteredScan trait, an external data source handles column pruning and predicate pushdown; in other words, the WHERE clause is pushed down to ECS by taking advantage of the bucket metadata search feature introduced in ECS 2.2.
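
A minimal sketch of this mechanism (not the connector's actual source; the relation class, helper logic, and query-string format are hypothetical) shows how Spark hands the pruned columns and pushed-down filters to the data source, which can translate them into an ECS metadata-search query:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types.StructType

// Hypothetical relation illustrating the PrunedFilteredScan contract
class EcsBucketRelation(override val sqlContext: SQLContext,
                        override val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  // Spark supplies only the columns the query needs plus the WHERE-clause filters
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Translate the supported filters into an illustrative search expression
    val expression = filters.collect {
      case GreaterThanOrEqual(attr, value) => s"$attr >= $value"
      case LessThanOrEqual(attr, value)    => s"$attr <= $value"
      case EqualTo(attr, value)            => s"$attr == $value"
    }.mkString(" and ")

    // A real implementation would issue the metadata-search request to ECS here
    // and map each returned object to a Row; this stub returns an empty result.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}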

Using

Linking to your Spark 2.x Application

The library is published to Maven Central. Link to the library using these dependency coordinates:

com.emc.ecs:spark-ecs-connector_2.11:1.4.2
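
In an sbt build, for example, the same coordinates translate to the following (assuming a Scala 2.11 / Spark 2.x project):

libraryDependencies += "com.emc.ecs" % "spark-ecs-connector_2.11" % "1.4.2"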

Using in Zeppelin

  1. Install Zeppelin 0.7+.
  2. export SPARK_LOCAL_IP=127.0.0.1
  3. bin/zeppelin.sh

Create a notebook with the following commands. Replace *** with your S3 credentials.

%dep
z.load("com.emc.ecs:spark-ecs-connector_2.11:1.4.2")

import java.net.URI
import com.emc.ecs.spark.sql.sources.s3._

val endpointUri = new URI("http://10.1.83.51:9020/")
val credential = ("***ACCESS KEY ID***", "***SECRET ACCESS KEY***")

val df = sqlContext.read.bucket(endpointUri, credential, "ben_bucket", withSystemMetadata = false)
df.createOrReplaceTempView("ben_bucket")

%sql
SELECT * FROM ben_bucket
WHERE `image-viewcount` >= 5000 AND `image-viewcount` <= 10000
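
The same read works outside Zeppelin from a standalone Spark 2.x application; the sketch below assumes the endpoint, credentials, and bucket name are replaced with your own:

import java.net.URI
import org.apache.spark.sql.SparkSession
import com.emc.ecs.spark.sql.sources.s3._

object BucketQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ecs-bucket-query").getOrCreate()

    val endpointUri = new URI("http://10.1.83.51:9020/")
    val credential = ("***ACCESS KEY ID***", "***SECRET ACCESS KEY***")

    // Same bucket reader used in the Zeppelin notebook above
    val df = spark.sqlContext.read.bucket(endpointUri, credential, "ben_bucket", withSystemMetadata = false)
    df.createOrReplaceTempView("ben_bucket")

    spark.sql("SELECT * FROM ben_bucket WHERE `image-viewcount` >= 5000 AND `image-viewcount` <= 10000").show()

    spark.stop()
  }
}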

Contributing

Building

The project uses the Gradle build system and includes a wrapper script that automatically downloads Gradle.

Build and install the library to your local Maven repository as follows:

$ ./gradlew publishShadowPublicationToMavenLocal
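
To consume the locally installed build from another sbt project, add the local Maven repository as a resolver and depend on the published coordinates (the exact artifact id and version depend on what the build produced):

resolvers += Resolver.mavenLocal
// then reference the published coordinates, e.g. the spark-ecs-connector_2.11 artifact shown above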

TODO

  1. Implement 'OR' pushdown. ECS supports 'or', but not in combination with 'and'.
  2. Avoid sending a query containing a non-indexable key.
com.emc.ecs

Dell EMC ECS: Cloud Scale Object Storage

Versions

1.4.1
1.4.0