Spark2Cassandra


Categories: Cassandra Data Databases
GroupId: com.github.leoromanovsky
ArtifactId: spark2cassandra_2.11
Last Version: 3.0.0
Type: jar
Description: Spark2Cassandra
Project URL: https://github.com/leoromanovsky/Spark2Cassandra
Project Organization: leoromanovsky
Source Code Management: https://github.com/leoromanovsky/Spark2Cassandra

How to add to project

Maven

<!-- https://jarcasting.com/artifacts/com.github.leoromanovsky/spark2cassandra_2.11/ -->
<dependency>
    <groupId>com.github.leoromanovsky</groupId>
    <artifactId>spark2cassandra_2.11</artifactId>
    <version>3.0.0</version>
</dependency>

Gradle (Groovy DSL)

// https://jarcasting.com/artifacts/com.github.leoromanovsky/spark2cassandra_2.11/
implementation 'com.github.leoromanovsky:spark2cassandra_2.11:3.0.0'

Gradle (Kotlin DSL)

// https://jarcasting.com/artifacts/com.github.leoromanovsky/spark2cassandra_2.11/
implementation("com.github.leoromanovsky:spark2cassandra_2.11:3.0.0")

Buildr

'com.github.leoromanovsky:spark2cassandra_2.11:jar:3.0.0'

Ivy

<dependency org="com.github.leoromanovsky" name="spark2cassandra_2.11" rev="3.0.0">
  <artifact name="spark2cassandra_2.11" type="jar" />
</dependency>

Grape

@Grapes(
  @Grab(group='com.github.leoromanovsky', module='spark2cassandra_2.11', version='3.0.0')
)

SBT

libraryDependencies += "com.github.leoromanovsky" % "spark2cassandra_2.11" % "3.0.0"

Leiningen

[com.github.leoromanovsky/spark2cassandra_2.11 "3.0.0"]

Dependencies

compile (4)

Group / Artifact | Type | Version
org.scala-lang : scala-library | jar | 2.11.8
org.clapper : grizzled-slf4j_2.11 | jar | 1.3.0
com.datastax.spark : spark-cassandra-connector_2.11 | jar | 2.0.5
org.apache.cassandra : cassandra-all | jar | 3.9

provided (4)

Group / Artifact | Type | Version
org.scoverage : scalac-scoverage-runtime_2.11 | jar | 1.1.1
org.scoverage : scalac-scoverage-plugin_2.11 | jar | 1.1.1
org.apache.spark : spark-core_2.11 | jar | 2.2.0
org.apache.spark : spark-sql_2.11 | jar | 2.2.0

test (3)

Group / Artifact | Type | Version
com.holdenkarau : spark-testing-base_2.11 | jar | 2.2.0_0.7.2
org.scalatest : scalatest_2.11 | jar | 3.0.4
org.cassandraunit : cassandra-unit | jar | 3.1.1.0

Project Modules

There are no modules declared in this project.

Spark2Cassandra

Spark Library for Bulk Loading into Cassandra

Requirements

Spark2Cassandra supports Spark 2.2.

It is compatible with the following versions of Cassandra:

  • 2.1.5+
  • 2.2
  • 3.0.x

Downloads

SBT

libraryDependencies += "com.github.leoromanovsky" %% "spark2cassandra" % "3.0.0"

Maven

<dependency>
  <groupId>com.github.leoromanovsky</groupId>
  <artifactId>spark2cassandra_2.11</artifactId>
  <version>x.y.z</version>
</dependency>

Features

Usage

Bulk Loading into Cassandra

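// Import the following to have access to the `bulkLoadToCass()` function for RDDs or DataFrames.
import com.github.jparkie.spark.cassandra.rdd._
import com.github.jparkie.spark.cassandra.sql._
// Spark and connector types used below.
import com.datastax.spark.connector.writer.{ RowWriterFactory, SqlRowWriter }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.sql.{ Row, SQLContext }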

val sparkConf = new SparkConf()
val sc = SparkContext.getOrCreate(sparkConf)
val sqlContext = SQLContext.getOrCreate(sc)

// https://datastax-oss.atlassian.net/browse/SPARKC-475
implicit val rwf: RowWriterFactory[Row] = SqlRowWriter.Factory

val rdd = sc.parallelize(???)

val df = sqlContext.read.parquet("<PATH>")

// Specify the `keyspaceName` and the `tableName` to write.
rdd.bulkLoadToCass(
  keyspaceName = "twitter",
  tableName = "tweets_by_date"
)

// Specify the `keyspaceName` and the `tableName` to write.
df.bulkLoadToCass(
  keyspaceName = "twitter",
  tableName = "tweets_by_author"
)

For more information, refer to:

Configurations

Spark2Cassandra relies on https://github.com/datastax/spark-cassandra-connector for serialization from Spark and for session management. For the connector's own configuration options, refer to https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md.

SparkCassWriteConf

For more, refer to SparkCassWriteConf.scala.

Property Name | Default | Description
spark.cassandra_bulk.write.partitioner | org.apache.cassandra.dht.Murmur3Partitioner | The 'partitioner' defined in cassandra.yaml.
spark.cassandra_bulk.write.throughput_mb_per_sec | Int.MaxValue | The maximum throughput, in MB per second, at which to throttle.
spark.cassandra_bulk.write.connection_per_host | 1 | The number of connections per host to utilize when streaming SSTables.
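
As a rough illustration of how the write properties above might be assembled, the sketch below collects them into a plain Scala `Map`. The `BulkWriteSettings` object and its parameter names are hypothetical and not part of Spark2Cassandra; only the property keys come from the table. It assumes the library reads these keys from the application's `SparkConf`, as the `spark.` prefix suggests.

```scala
// Hypothetical helper (not part of Spark2Cassandra): gathers the bulk-write
// properties as plain key/value pairs so they can be applied to a SparkConf.
object BulkWriteSettings {
  def apply(throughputMbPerSec: Int, connectionsPerHost: Int): Map[String, String] = {
    // Reject values that could never be meaningful for these settings.
    require(throughputMbPerSec > 0, "throughput must be positive")
    require(connectionsPerHost > 0, "connections per host must be positive")
    Map(
      // Must match the 'partitioner' configured in the cluster's cassandra.yaml.
      "spark.cassandra_bulk.write.partitioner" -> "org.apache.cassandra.dht.Murmur3Partitioner",
      "spark.cassandra_bulk.write.throughput_mb_per_sec" -> throughputMbPerSec.toString,
      "spark.cassandra_bulk.write.connection_per_host" -> connectionsPerHost.toString
    )
  }
}
```

With Spark on the classpath, the resulting map could be applied in one call, for example `new SparkConf().setAll(BulkWriteSettings(throughputMbPerSec = 50, connectionsPerHost = 2))`. The values 50 and 2 are illustrative only.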

SparkCassServerConf

For more, refer to SparkCassServerConf.scala.

Property Name | Default | Description
spark.cassandra_bulk.server.storage.port | 7000 | The 'storage_port' defined in cassandra.yaml.
spark.cassandra_bulk.server.sslStorage.port | 7001 | The 'ssl_storage_port' defined in cassandra.yaml.
spark.cassandra_bulk.server.internode.encryption | "none" | The 'server_encryption_options:internode_encryption' defined in cassandra.yaml.
spark.cassandra_bulk.server.keyStore.path | conf/.keystore | The 'server_encryption_options:keystore' defined in cassandra.yaml.
spark.cassandra_bulk.server.keyStore.password | cassandra | The 'server_encryption_options:keystore_password' defined in cassandra.yaml.
spark.cassandra_bulk.server.trustStore.path | conf/.truststore | The 'server_encryption_options:truststore' defined in cassandra.yaml.
spark.cassandra_bulk.server.trustStore.password | cassandra | The 'server_encryption_options:truststore_password' defined in cassandra.yaml.
spark.cassandra_bulk.server.protocol | TLS | The 'server_encryption_options:protocol' defined in cassandra.yaml.
spark.cassandra_bulk.server.algorithm | SunX509 | The 'server_encryption_options:algorithm' defined in cassandra.yaml.
spark.cassandra_bulk.server.store.type | JKS | The 'server_encryption_options:store_type' defined in cassandra.yaml.
spark.cassandra_bulk.server.cipherSuites | TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA | The 'server_encryption_options:cipher_suites' defined in cassandra.yaml.
spark.cassandra_bulk.server.requireClientAuth | false | The 'server_encryption_options:require_client_auth' defined in cassandra.yaml.
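
For a cluster with internode encryption enabled, the server properties above would presumably need to mirror the cluster's `server_encryption_options`. The sketch below shows one way to derive them. `InternodeTls` and `BulkServerSettings` are hypothetical names introduced for illustration, and the choice of `"all"` is an assumption: it is a value that cassandra.yaml accepts for `internode_encryption`, but how Spark2Cassandra interprets it should be checked against SparkCassServerConf.scala.

```scala
// Hypothetical container for the keystore and truststore locations.
final case class InternodeTls(
  keyStorePath: String,
  keyStorePassword: String,
  trustStorePath: String,
  trustStorePassword: String
)

// Hypothetical helper (not part of Spark2Cassandra): builds only the
// encryption-related properties; anything left out keeps its default.
object BulkServerSettings {
  private val Prefix = "spark.cassandra_bulk.server"

  def apply(tls: Option[InternodeTls]): Map[String, String] = tls match {
    case None =>
      // Matches the documented default of "none".
      Map(s"$Prefix.internode.encryption" -> "none")
    case Some(t) =>
      Map(
        // Assumption: "all" is passed through as in cassandra.yaml.
        s"$Prefix.internode.encryption" -> "all",
        s"$Prefix.keyStore.path" -> t.keyStorePath,
        s"$Prefix.keyStore.password" -> t.keyStorePassword,
        s"$Prefix.trustStore.path" -> t.trustStorePath,
        s"$Prefix.trustStore.password" -> t.trustStorePassword
      )
  }
}
```

Keeping the TLS details in an `Option` makes the unencrypted case explicit rather than relying on blank strings, and the resulting map can be passed to `SparkConf.setAll` alongside any other settings.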

Versions

Version
3.0.0