kina mongodb

Project containing the actual implementation of the RDD integrating Apache Spark and MongoDB

Categories: MongoDB Data Databases
GroupId: com.github.lucarosellini
ArtifactId: kina-mongodb
Last Version: 0.9.5
Type: jar
Project URL: https://github.com/lucarosellini/kina
Source Code Management: https://github.com/lucarosellini/kina

Download kina-mongodb

How to add to project

Maven:
<!-- https://jarcasting.com/artifacts/com.github.lucarosellini/kina-mongodb/ -->
<dependency>
    <groupId>com.github.lucarosellini</groupId>
    <artifactId>kina-mongodb</artifactId>
    <version>0.9.5</version>
</dependency>

Gradle (Groovy DSL):
// https://jarcasting.com/artifacts/com.github.lucarosellini/kina-mongodb/
implementation 'com.github.lucarosellini:kina-mongodb:0.9.5'

Gradle (Kotlin DSL):
// https://jarcasting.com/artifacts/com.github.lucarosellini/kina-mongodb/
implementation ("com.github.lucarosellini:kina-mongodb:0.9.5")

Buildr:
'com.github.lucarosellini:kina-mongodb:jar:0.9.5'

Ivy:
<dependency org="com.github.lucarosellini" name="kina-mongodb" rev="0.9.5">
  <artifact name="kina-mongodb" type="jar" />
</dependency>

Grape:
@Grapes(
@Grab(group='com.github.lucarosellini', module='kina-mongodb', version='0.9.5')
)

sbt:
libraryDependencies += "com.github.lucarosellini" % "kina-mongodb" % "0.9.5"

Leiningen:
[com.github.lucarosellini/kina-mongodb "0.9.5"]

Dependencies

compile (14)

Group / Artifact | Type | Version
com.github.lucarosellini : kina-commons | jar | 0.9.5
log4j : log4j | jar | 1.2.17
de.flapdoodle.embed : de.flapdoodle.embed.mongo | jar | 1.46.0
org.mongodb : mongo-hadoop-core | jar | 1.3.0
org.mongodb : mongo-java-driver | jar | 2.12.2
org.apache.cassandra : cassandra-all | jar | 2.1.2
org.javassist : javassist | jar | 3.18.2-GA
com.datastax.cassandra : cassandra-driver-core | jar | 2.1.2
org.apache.spark : spark-core_2.10 | jar | 1.1.1
commons-io : commons-io | jar | 2.4
org.apache.hadoop : hadoop-common | jar | 2.4.0
org.apache.hadoop : hadoop-mapreduce-client-app | jar | 2.4.0
com.esotericsoftware.kryo : kryo | jar | 2.24.0
com.twitter : chill_2.10 | jar | 0.3.6

test (3)

Group / Artifact | Type | Version
com.github.lucarosellini : kina-commons | test-jar | 0.9.5
org.testng : testng | jar | 6.8.7
org.mockito : mockito-all | jar | 1.9.5

Project Modules

There are no modules declared in this project.

What is Kina?

Kina is a high-level storage API integrating Apache Spark and several NoSQL datastores. It currently supports Apache Cassandra and MongoDB, and we plan to add support for several other datastores in the near future.

Apache Cassandra integration

The integration is not based on Cassandra's Hadoop interface.

Kina comes with a user-friendly API that lets developers create Spark RDDs mapped to Cassandra column families. We provide two different interfaces:

  • The first one lets developers map Cassandra tables to plain old Java objects (POJOs), just as you would with any other ORM. We call this API the 'entity objects' API. This abstraction is quite handy: it lets you work on an RDD of entity objects, while under the hood Kina transparently maps Cassandra's columns to entity properties. Your domain entities must be correctly annotated using Kina annotations (take a look at the example entities in package kina.testentity of kina-examples).

  • The second one is a more generic 'cell' API that lets developers work on RDD<kina.entity.Cells>, where a 'Cells' object is a collection of kina.entity.Cell objects. Column metadata is automatically fetched from the data store. This interface is a little more cumbersome to work with (see the example below), but has the advantage that it doesn't require the definition of additional entity classes. Example: you have a table called 'users' and you decide to use the 'Cells' interface. Once you get an instance 'c' of the Cells object, you can get the value of column 'address' by calling c.getCellByName("address").getCellValue(). Please refer to the Kina API documentation to learn more about the Cells and Cell objects.
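The lookup described for the 'cell' API can be illustrated with a small, self-contained sketch. It does not use Kina itself: SketchCell and SketchCells are hypothetical stand-ins for kina.entity.Cell and kina.entity.Cells, and only the method names getCellByName and getCellValue are taken from the text above. The constructor, the add method, the getCellName accessor and the sample data are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Entry point declared first so the file can be run directly in source-file mode.
class CellsLookupSketch {
    public static void main(String[] args) {
        // One row of a hypothetical 'users' table.
        SketchCells c = new SketchCells();
        c.add(new SketchCell("name", "Ada"));
        c.add(new SketchCell("address", "12 Example Street"));

        // Same call chain as in the text: look the cell up by column name, then read its value.
        Object address = c.getCellByName("address").getCellValue();
        System.out.println(address); // prints "12 Example Street"
    }
}

// Hypothetical stand-in for kina.entity.Cell: a column name plus its value.
class SketchCell {
    private final String name;
    private final Object value;

    SketchCell(String name, Object value) {
        this.name = name;
        this.value = value;
    }

    String getCellName() { return name; }   // assumed accessor, not taken from the Kina API
    Object getCellValue() { return value; }
}

// Hypothetical stand-in for kina.entity.Cells: an ordered collection of cells keyed by column name.
class SketchCells {
    private final Map<String, SketchCell> byName = new LinkedHashMap<>();

    void add(SketchCell cell) { byName.put(cell.getCellName(), cell); }

    // Returns null when no cell with that name exists (a choice made for this sketch only).
    SketchCell getCellByName(String name) { return byName.get(name); }
}
```

With the entity API the same value would be read through a typed getter on the annotated POJO, which is why the cell API is described as more cumbersome but free of extra entity classes.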

Kina comes with an example subproject called 'kina-examples' containing a set of working examples, both in Java and Scala. Please refer to the kina-examples project README for further information on how to set up a working environment.

MongoDB integration

The Spark-MongoDB connector is based on mongo-hadoop, the MongoDB connector for Hadoop.

Support for MongoDB was added in version 0.3.0 and is not yet considered feature-complete.

We provide two different interfaces:

  • ORM API: you just annotate your POJOs with Kina annotations, and you will be able to connect MongoDB with Spark using your own model entities.

  • Generic cell API: you do not need to specify the collection's schema or add anything to your POJOs; each document is transformed into a 'Cells' object.

We added a few working examples for MongoDB to the kina-examples subproject. Take a look at:

Entities:

  • kina.examples.java.ReadingEntityFromMongoDB
  • kina.examples.java.WritingEntityToMongoDB
  • kina.examples.java.GroupingEntityWithMongoDB

Cells:

  • kina.examples.java.ReadingCellFromMongoDB
  • kina.examples.java.WritingCellToMongoDB
  • kina.examples.java.GroupingCellWithMongoDB

We are working on further improvements!

Requirements

  • Cassandra: we tested versions from 1.2.8 up to 2.0.8 (for Spark <=> Cassandra integration).
  • MongoDB: we tested the integration with MongoDB versions 2.2, 2.4 and 2.6 using Standalone, Replica Set and Sharded Cluster deployments (for Spark <=> MongoDB integration).
  • Spark 1.0.0
  • Apache Maven >= 3.0.4
  • Java 1.7
  • Scala 2.10.3

Configure the development and test environment

  • Clone the project

  • To configure a development environment in Eclipse: import as Maven project. In IntelliJ: open the project by selecting the parent POM file

  • Install the project in your local Maven repository. Enter the Kina root directory and run: mvn clean install (add -DskipTests to skip tests)

  • Put Kina to work on a working Cassandra + Spark cluster. You have several options:

    • Download a pre-configured VM of Stratio's BigData platform (SDS). This VM works on both VirtualBox and VMware, and comes with a fully configured distribution that also includes Stratio Kina. We also distribute the VM with several preloaded datasets in Cassandra. This distribution includes Stratio's customized Cassandra distribution containing our powerful open-source Lucene-based secondary indexes; see the Stratio documentation for further information. Once your VM is up and running you can test Kina using the shell. Enter /opt/sds and run bin/kina-shell.

    • Install a new cluster using the Stratio installer. Please refer to Stratio's website to download the installer and its documentation.

    • You already have a working Cassandra server on your development machine: you need a Spark + Kina bundle. We suggest creating one by running:

      cd kina-scripts

      ./make-distribution.sh

    This builds a Spark distribution package with Kina and Cassandra's jars included (depending on your machine, this script can take a while, since it compiles Spark from sources). The package is called kina-distribution-X.Y.Z.tgz. Untar it to a folder of your choice, enter that folder and run ./kina-shell to start an interactive shell where you can test Kina (note that this starts a development cluster with MASTER="local").

    • You already have a working installation of Cassandra and Spark on your development machine: this is the most difficult way to start testing Kina, but if you know what you're doing, you will have to:
      1. copy the Stratio Kina jars to Spark's 'jars' folder ($SPARK_HOME/jars).

      2. copy Cassandra's jars to Spark's 'jars' folder.

      3. copy the Datastax Java Driver jar (v 2.0.x) to Spark's 'jars' folder.

      4. start spark shell and import the following:

        import kina.config._
        import kina.entity._
        import kina.context._
        import kina.rdd._

Once you have a working development environment you can finally start testing Kina. These are the basic steps you will always have to perform in order to use Kina:

First steps with Spark and Cassandra

  • Build an instance of a configuration object: this lets you tell Kina the Cassandra endpoint, the keyspace, the table you want to access and much more. It also lets you specify which interface to use (the domain entity or the generic interface). We have a factory that helps you create a configuration object using a fluent API. Creating a configuration object is an expensive operation. Please take the time to read the Java and Scala examples provided in the 'kina-examples' subproject.
  • Create an RDD: use the KinaSparkContext helper methods, providing the configuration object you've just instantiated.
  • Perform some computation over the RDD(s): this is up to you. We only help you fetch the data efficiently from Cassandra; from there you can use the powerful Spark API.
  • (optional) Write the computation results out to Cassandra: we provide a way to efficiently save the result of your computation to Cassandra. To do that you must have another configuration object where you specify the output keyspace/column family. We can create the output column family for you if needed.

First steps with Spark and MongoDB

  • Build an instance of a configuration object: this lets you tell Stratio Kina the MongoDB endpoint, the MongoDB database and collection you want to access and much more. It also lets you specify which interface to use (the domain entity). We have a factory that helps you create a configuration object using a fluent API. Creating a configuration object is an expensive operation. Please take the time to read the Java and Scala examples provided in the 'kina-examples' subproject.
  • Create an RDD: use the KinaSparkContext helper methods, providing the configuration object you've just instantiated.
  • Perform some computation over the RDD(s): this is up to you. We only help you fetch the data efficiently from MongoDB; from there you can use the powerful Spark API.
  • (optional) Write the computation results out to MongoDB: we provide a way to efficiently save the result of your computation to MongoDB.

Migrating from version 0.2.9

From version 0.4.x, Kina supports multiple datastores. To implement this feature correctly, Kina underwent a major refactoring between versions 0.2.9 and 0.4.x. To port your code to the new version, you should take into account a few changes we made.

New Project Structure

From version 0.4.x, Kina supports multiple datastores. In your project you should import only the Maven dependency you will use: kina-cassandra or kina-mongodb.

Changes to 'kina.entity.Cells'

  • Before version 0.4.x, a 'Cells' object was implicitly associated with a record coming from a specific table. When performing a join in Spark, 'Cell' objects coming from different tables are mixed into a single 'Cells' object. Kina now keeps track of the table each Cell comes from: the internal structure of 'Cells' has changed so that each 'Cell' is associated with its table.
    1. If you are a user of 'Cells' objects returned from Kina, nothing changes for you. The 'Cells' API keeps working as usual.
    2. If you manually create 'Cells' objects you can keep using the original API. In this case each Cell you add to your Cells object is automatically associated with a default table name.
    3. You can specify the default table name, or let Kina choose an internal default table name for you.
    4. We added a new constructor to 'Cells' accepting the default table name. This way the 'old' API will always manipulate 'Cell' objects associated with the specified default table.
    5. For each method manipulating the content of a 'Cells' object, we added a new method that also accepts the table name: if you call the method whose signature does not have the table name, the action is performed over the Cell objects associated with the default table; otherwise the action is performed over the 'Cell'(s) associated with the specified table.
    6. size() and isEmpty() compute their results taking into account all the 'Cell' objects contained.
    7. size(String tableName) and isEmpty(String tableName) compute their results taking into account only the 'Cell' objects associated with the specified table.
    8. When Kina itself builds Cells objects, it always associates each Cell with the correct table name.

Examples:

Cells cells1 = new Cells(); // instantiate a Cells object whose default table name is generated internally.
Cells cells2 = new Cells("my_default_table"); // creates a new Cells object whose default table name is specified by the user
cells2.add(new Cell(...)); // adds to the 'cells2' object a new Cell object associated to the default table
cells2.add("my_other_table", new Cell(...)); // adds to the 'cells2' object a new Cell associated to "my_other_table"  
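The default-table rules listed above can be modelled in a few lines of plain Java. The sketch below is not Kina code: TableCells is a hypothetical stand-in for kina.entity.Cells that stores only column names (as strings) grouped by table, which is enough to show how the overloads with and without a table name behave. Only the method names add, size and isEmpty and the default-table constructor follow the text; the internal map and the sample names are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Entry point declared first so the file can be run directly in source-file mode.
class TableAwareCellsSketch {
    public static void main(String[] args) {
        TableCells cells = new TableCells("my_default_table");
        cells.add("id");                        // no table given: goes to the default table
        cells.add("my_other_table", "name");    // explicitly associated with another table
        cells.add("my_other_table", "email");

        System.out.println(cells.size());                      // 3: counts cells of every table
        System.out.println(cells.size("my_default_table"));    // 1
        System.out.println(cells.size("my_other_table"));      // 2
        System.out.println(cells.isEmpty("unknown_table"));    // true
    }
}

// Hypothetical stand-in for kina.entity.Cells; each cell is reduced to its column name.
class TableCells {
    private final String defaultTable;
    private final Map<String, List<String>> cellsByTable = new LinkedHashMap<>();

    TableCells(String defaultTable) {
        this.defaultTable = defaultTable;
    }

    // 'Old' API: the cell is associated with the default table.
    void add(String cellName) {
        add(defaultTable, cellName);
    }

    // Table-aware API: the cell is associated with the given table.
    void add(String table, String cellName) {
        cellsByTable.computeIfAbsent(table, t -> new ArrayList<>()).add(cellName);
    }

    // Counts the cells of all tables.
    int size() {
        int total = 0;
        for (List<String> cells : cellsByTable.values()) {
            total += cells.size();
        }
        return total;
    }

    // Counts only the cells associated with the given table.
    int size(String table) {
        List<String> cells = cellsByTable.get(table);
        return cells == null ? 0 : cells.size();
    }

    boolean isEmpty() { return size() == 0; }

    boolean isEmpty(String table) { return size(table) == 0; }
}
```

Delegating the single-argument add to the two-argument one keeps the 'old' API a thin wrapper over the table-aware one, which mirrors how the text describes the relationship between the two sets of methods.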

RDD creation

The methods used to create Cell and Entity RDDs have been merged into a single method:

  • CassandraKinaContext: cassandraEntityRDD(...) and cassandraGenericRDD(...) have been merged into cassandraRDD(...)
  • MongoKinaContext: mongoEntityRDD(...) and mongoCellRDD(...) have been merged into mongoRDD(...)

Versions

Version
0.9.5
0.9.4
0.9.3