laurelin

Spark datasource for ROOT (http://root.cern) files

GroupId: edu.vanderbilt.accre
ArtifactId: laurelin
Last Version: 1.1.1
Type: jar
Description: Spark datasource for ROOT (http://root.cern) files
Project URL: https://github.com/spark-root/laurelin
Source Code Management: https://github.com/spark-root/laurelin

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/edu.vanderbilt.accre/laurelin/ -->
<dependency>
    <groupId>edu.vanderbilt.accre</groupId>
    <artifactId>laurelin</artifactId>
    <version>1.1.1</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/edu.vanderbilt.accre/laurelin/
implementation 'edu.vanderbilt.accre:laurelin:1.1.1'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/edu.vanderbilt.accre/laurelin/
implementation("edu.vanderbilt.accre:laurelin:1.1.1")

Buildr:

'edu.vanderbilt.accre:laurelin:jar:1.1.1'

Ivy:

<dependency org="edu.vanderbilt.accre" name="laurelin" rev="1.1.1">
  <artifact name="laurelin" type="jar" />
</dependency>

Grape:

@Grapes(
    @Grab(group='edu.vanderbilt.accre', module='laurelin', version='1.1.1')
)

SBT:

libraryDependencies += "edu.vanderbilt.accre" % "laurelin" % "1.1.1"

Leiningen:

[edu.vanderbilt.accre/laurelin "1.1.1"]

Dependencies

compile (4)

Group / Artifact                         Type  Version
org.tukaani : xz                         jar   1.2
org.lz4 : lz4-java                       jar   1.5.1
org.apache.logging.log4j : log4j-api     jar   2.11.2
org.apache.logging.log4j : log4j-core    jar   2.11.2

provided (4)

Group / Artifact                         Type  Version
org.apache.hadoop : hadoop-common        jar   2.7.4
org.apache.hadoop : hadoop-client        jar   2.7.4
org.apache.spark : spark-core_2.11       jar   2.4.5
org.apache.spark : spark-sql_2.11        jar   2.4.5

test (1)

Group / Artifact                         Type  Version
junit : junit                            jar   4.11
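
The provided-scope entries mean that the Spark and Hadoop jars are expected to come from the runtime rather than from this artifact, which was built against Spark 2.4.5 for Scala 2.11. A quick sanity check before submitting a job might look like the sketch below. The helper name matches_build_line is hypothetical, and treating "same major.minor" as compatible is an assumption made for illustration, not a guarantee stated by the project.

```python
def matches_build_line(spark_version: str, built_against: str = "2.4.5") -> bool:
    """Return True when the major.minor parts of both version strings agree.

    This is a heuristic only: it assumes that sharing a major.minor release
    line with the build-time Spark version is a reasonable compatibility hint.
    """
    def major_minor(version: str):
        parts = version.split(".")
        return int(parts[0]), int(parts[1])

    return major_minor(spark_version) == major_minor(built_against)


# Example inputs; with PySpark installed, pass pyspark.__version__ instead.
same_line = matches_build_line("2.4.8")
other_line = matches_build_line("3.1.2")
```

If the check fails, it may be worth looking for a Laurelin release built for the Spark line in use before debugging further.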

Project Modules

There are no modules declared in this project.

spark-ttree

Implementation of ROOT I/O designed to get TTrees into Spark DataFrames. Consists of the following three components:

  • DataSource - Spark DataSourceV2 implementation
  • ArrayInterpretation - Accepts raw TBasket byte ranges and returns deserialized arrays
  • root_proxy - Deserializes ROOT metadata to locate TBasket byte ranges
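
To make the division of labour concrete, here is a minimal pure-Python sketch of the last two steps: slicing a byte range out of a buffer and interpreting it as an array. It is not Laurelin's code (the library ships as a JVM jar). The function name interpret_float32 and the sample buffer are hypothetical, and the sketch assumes an uncompressed payload of big-endian 32-bit floats, big-endian being the byte order ROOT uses on disk.

```python
import struct


def interpret_float32(buffer: bytes, start: int, stop: int) -> list:
    """Decode buffer[start:stop] as a list of big-endian 32-bit floats."""
    raw = buffer[start:stop]
    if len(raw) % 4 != 0:
        raise ValueError("byte range is not a whole number of float32 values")
    count = len(raw) // 4
    # ">" selects big-endian; "f" is a 4-byte IEEE 754 float.
    return list(struct.unpack(">%df" % count, raw))


# Hypothetical stand-in for a file: 3 unrelated bytes, then three floats.
fake_file = b"\x00\x00\x00" + struct.pack(">3f", 1.0, 2.5, -4.0)

# The metadata step would supply the byte range; here it is hard-coded.
values = interpret_float32(fake_file, 3, 15)
```

In the real pipeline the byte range would come from the deserialized ROOT metadata, and the payload would typically need decompressing first.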

This project only aims to perform vectorized (i.e. column-based) reads of TTrees consisting of relatively simple branches: fundamental numeric types and fixed-length or jagged arrays of those types.
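
For readers unfamiliar with jagged arrays in a columnar setting, the sketch below shows one common way to represent them: a flat content array plus an offsets array marking where each event's values begin and end. This layout is an assumption chosen for illustration, and the names jagged_to_rows, content and offsets are hypothetical; the source does not describe how Laurelin stores jagged data internally.

```python
def jagged_to_rows(content, offsets):
    """Rebuild per-event lists from a flat content array and an offsets array.

    offsets has one more entry than there are events; event i owns
    content[offsets[i]:offsets[i + 1]].
    """
    if not offsets or offsets[0] != 0 or offsets[-1] != len(content):
        raise ValueError("offsets must start at 0 and end at len(content)")
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# Three events holding 2, 0 and 3 values respectively.
content = [10.0, 11.0, 12.0, 13.0, 14.0]
offsets = [0, 2, 2, 5]
rows = jagged_to_rows(content, offsets)
```

A fixed-length array is the special case where every pair of consecutive offsets differs by the same amount.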

Usage example

The most recent version number is listed under Versions. To use a different version, replace 1.0.0 in the example below with the desired version.

import pyspark.sql

# Start a local Spark session and pull in the Laurelin package.
spark = pyspark.sql.SparkSession.builder \
    .master("local[1]") \
    .config("spark.jars.packages", "edu.vanderbilt.accre:laurelin:1.0.0") \
    .getOrCreate()
sc = spark.sparkContext

# Read the TTree named "tree" from the ROOT file into a DataFrame.
df = spark.read.format("root") \
    .option("tree", "tree") \
    .load("small-flat-tree.root")
df.printSchema()
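
Since the version appears inside a package coordinate string, one way to avoid editing it by hand is to build the coordinate from its parts. The following is a small sketch along those lines. The helper names laurelin_coordinate and build_session are hypothetical and not part of Laurelin or PySpark; the group and artifact IDs are the ones listed for this artifact.

```python
GROUP_ID = "edu.vanderbilt.accre"
ARTIFACT_ID = "laurelin"


def laurelin_coordinate(version: str) -> str:
    """Return the group:artifact:version string for spark.jars.packages."""
    if not version or any(ch.isspace() for ch in version):
        raise ValueError("version must be a non-empty string without whitespace")
    return f"{GROUP_ID}:{ARTIFACT_ID}:{version}"


def build_session(version: str, master: str = "local[1]"):
    """Create a SparkSession configured to fetch the requested Laurelin version."""
    # Imported lazily so laurelin_coordinate works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master)
        .config("spark.jars.packages", laurelin_coordinate(version))
        .getOrCreate()
    )


coordinate = laurelin_coordinate("1.1.1")
```

Calling build_session("1.1.1") would then replace the hand-written builder chain; note that spark.jars.packages only takes effect when the session is first created.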

Known issues/not yet implemented functionality

  • The I/O is currently completely unoptimized -- there is no caching or prefetching. Remote reads will be slow as a consequence.
  • Arrays (both fixed and jagged) of booleans return the wrong result
  • Float16_t/Double32_t are currently not supported
  • String types are currently not supported
  • C++ standard library types are currently not supported (importantly, std::vector)
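
Because boolean arrays are read back incorrectly rather than rejected, it can help to flag such columns before trusting them. The sketch below scans (name, type string) pairs of the shape returned by a DataFrame's dtypes attribute. The helper name risky_boolean_array_columns and the example schema are hypothetical, and the check assumes Spark's usual type string array<boolean> for boolean arrays.

```python
def risky_boolean_array_columns(dtypes):
    """Return names of columns whose type string contains array<boolean>.

    dtypes is a list of (column_name, type_string) pairs, the shape
    returned by DataFrame.dtypes.
    """
    return [name for name, type_string in dtypes if "array<boolean>" in type_string]


# Hypothetical schema; with a real DataFrame, pass df.dtypes instead.
example_dtypes = [
    ("run", "int"),
    ("pt", "array<float>"),
    ("passed", "boolean"),
    ("trigger_bits", "array<boolean>"),
]
flagged = risky_boolean_array_columns(example_dtypes)
```

Scalar boolean columns are deliberately not flagged, since the known issue above concerns arrays only.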

edu.vanderbilt.accre

Spark-ROOT

Integrating the Spark and ROOT ecosystems

Versions

Version
1.1.1
1.0.2
1.0.1
1.0.0
0.6.2
0.6.1
0.5.1
0.5.0
0.4.0
0.3.0
0.2.1
0.1.1
0.1.0
0.0.19
0.0.15
0.0.14
0.0.4
0.0.3