laurelin

Spark datasource for ROOT (http://root.cern) files

License	BSD 3-Clause
GroupId	edu.vanderbilt.accre
ArtifactId	laurelin
Last Version	1.1.1
Release Date	May 24, 2020
Type	jar
Description	laurelin Spark datasource for ROOT (http://root.cern) files
Project URL	https://github.com/spark-root/laurelin
Source Code Management	https://github.com/spark-root/laurelin

Download laurelin

Filename	Size
laurelin-1.1.1.pom
laurelin-1.1.1.jar	1 MB
laurelin-1.1.1-sources.jar	72 KB
laurelin-1.1.1-javadoc.jar	500 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/edu.vanderbilt.accre/laurelin/ -->
<dependency>
    <groupId>edu.vanderbilt.accre</groupId>
    <artifactId>laurelin</artifactId>
    <version>1.1.1</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/edu.vanderbilt.accre/laurelin/
implementation 'edu.vanderbilt.accre:laurelin:1.1.1'

Gradle Kotlin

// https://jarcasting.com/artifacts/edu.vanderbilt.accre/laurelin/
implementation ("edu.vanderbilt.accre:laurelin:1.1.1")

Apache Buildr

'edu.vanderbilt.accre:laurelin:jar:1.1.1'

Apache Ivy

<dependency org="edu.vanderbilt.accre" name="laurelin" rev="1.1.1">
  <artifact name="laurelin" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='edu.vanderbilt.accre', module='laurelin', version='1.1.1')
)

Scala SBT

libraryDependencies += "edu.vanderbilt.accre" % "laurelin" % "1.1.1"

Leiningen

[edu.vanderbilt.accre/laurelin "1.1.1"]

Dependencies

compile (4)

Group / Artifact	Type	Version
org.tukaani : xz	jar	1.2
org.lz4 : lz4-java	jar	1.5.1
org.apache.logging.log4j : log4j-api	jar	2.11.2
org.apache.logging.log4j : log4j-core	jar	2.11.2

provided (4)

Group / Artifact	Type	Version
org.apache.hadoop : hadoop-common	jar	2.7.4
org.apache.hadoop : hadoop-client	jar	2.7.4
org.apache.spark : spark-core_2.11	jar	2.4.5
org.apache.spark : spark-sql_2.11	jar	2.4.5

test (1)

Group / Artifact	Type	Version
junit : junit	jar	4.11

Project Modules

There are no modules declared in this project.

spark-ttree

Implementation of ROOT I/O designed to get TTrees into Spark DataFrames. Consists of the following three components:

DataSource - Spark DataSourceV2 implementation
ArrayInterpretation - Accepts raw TBasket byte ranges and returns deserialized arrays
root_proxy - Deserializes ROOT metadata to locate TBasket byte ranges
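
As a rough illustration of what "raw bytes in, deserialized array out" means, the sketch below decodes a buffer of 32-bit floats. It is not laurelin's code: `decode_float32_be` is a hypothetical helper, and the buffer is assumed to be the already-decompressed payload of a single fixed-width branch. The one fact it relies on is that ROOT serializes numeric values in big-endian byte order.

```python
import struct


def decode_float32_be(raw):
    """Decode a byte string of big-endian 32-bit floats into a list.

    Hypothetical helper for illustration; trailing bytes that do not
    form a complete 4-byte value are ignored.
    """
    count = len(raw) // 4
    return list(struct.unpack(">{}f".format(count), raw[:count * 4]))


# Build a small buffer the same way it would appear on disk (big-endian)
raw = struct.pack(">3f", 1.0, 2.5, -4.0)
values = decode_float32_be(raw)
```

A real reader also has to locate the byte range in the file and decompress it first, which is the part the metadata component handles.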

This project only aims to perform vectorized (i.e. column-based) reads of TTrees consisting of relatively simple branches: fundamental numeric types, and both fixed-length and jagged arrays of those types.
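
For readers unfamiliar with jagged arrays, the sketch below shows one common columnar layout for them: a flat buffer of values plus an offsets array marking where each entry starts and ends. This is a general technique, not necessarily how laurelin stores them internally, and `split_jagged` is a hypothetical name.

```python
def split_jagged(content, offsets):
    """Split a flat value buffer into per-entry lists.

    Entry i spans content[offsets[i]:offsets[i + 1]], so offsets has
    one more element than there are entries.
    """
    return [content[offsets[i]:offsets[i + 1]]
            for i in range(len(offsets) - 1)]


# Three entries holding 2, 0 and 3 values respectively
content = [1.0, 2.0, 3.0, 4.0, 5.0]
offsets = [0, 2, 2, 5]
entries = split_jagged(content, offsets)
```

A fixed-length array is the special case where every pair of consecutive offsets differs by the same amount, so the offsets do not need to be stored.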

Usage example

The most recent version number is listed in the Versions section (1.1.1, released May 24, 2020). To use a different version, replace 1.0.0 in the example below with your desired version.

import pyspark.sql

# Resolve the laurelin jar by its Maven coordinates when the session starts
spark = pyspark.sql.SparkSession.builder \
    .master("local[1]") \
    .config('spark.jars.packages', 'edu.vanderbilt.accre:laurelin:1.0.0') \
    .getOrCreate()
sc = spark.sparkContext
# Read the TTree named "tree" from the ROOT file into a DataFrame
df = spark.read.format('root') \
    .option("tree", "tree") \
    .load('small-flat-tree.root')
df.printSchema()
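
To avoid hard-coding the version in several places, the coordinate string can be built from a single variable. The following is a minimal sketch: `laurelin_coordinate` and `read_tree` are hypothetical helper names, while the group and artifact IDs, the `root` format name and the `tree` option are taken from this page.

```python
def laurelin_coordinate(version):
    """Return the Maven coordinate string for a given laurelin version."""
    return "edu.vanderbilt.accre:laurelin:{}".format(version)


def read_tree(path, tree, version="1.1.1"):
    """Open one TTree of a ROOT file as a Spark DataFrame.

    pyspark is imported lazily, so laurelin_coordinate() can be used
    even where Spark is not installed.
    """
    import pyspark.sql

    spark = pyspark.sql.SparkSession.builder \
        .master("local[1]") \
        .config("spark.jars.packages", laurelin_coordinate(version)) \
        .getOrCreate()
    return spark.read.format("root").option("tree", tree).load(path)
```

With these helpers, `read_tree('small-flat-tree.root', 'tree').printSchema()` would be equivalent to the example above, using version 1.1.1 instead of 1.0.0.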

Known issues/not yet implemented functionality

The I/O is currently completely unoptimized: there is no caching or prefetching, so remote reads will be slow.
Arrays (both fixed-length and jagged) of booleans return incorrect results.
Float16 and Double32 are currently not supported.
String types are currently not supported.
C++ standard library types are currently not supported (importantly, std::vector).

Spark-ROOT

Integrating the Spark and ROOT ecosystems

Versions

Version	Release Date
1.1.1	May 24, 2020
1.0.2	Apr 13, 2020
1.0.1	Mar 20, 2020
1.0.0	Dec 17, 2019
0.6.2	Dec 17, 2019
0.6.1	Dec 3, 2019
0.5.1	Oct 25, 2019
0.5.0	Oct 16, 2019
0.4.0	Oct 9, 2019
0.3.0	Sep 5, 2019
0.2.1	Jul 24, 2019
0.1.1	Jul 17, 2019
0.1.0	Jul 12, 2019
0.0.19	Jun 23, 2019
0.0.15	Jun 18, 2019
0.0.14	Jun 12, 2019
0.0.4	Jun 4, 2019
0.0.3	Jun 4, 2019
0.0.3 Jun 4, 2019
