Ballista Compute

JVM query engine based on Apache Arrow

License	The Apache License, Version 2.0
GroupId	org.ballistacompute
ArtifactId	execution
Last Version	0.2.5
Release Date	May 10, 2020
Type	module
Description	Ballista Compute JVM query engine based on Apache Arrow
Project URL	https://github.com/ballista-compute/ballista
Source Code Management	https://github.com/ballista-compute/ballista/

Download execution

Filename	Size
execution-0.2.5.pom
execution-0.2.5.module	4 KB
execution-0.2.5-sources.jar	1 KB
execution-0.2.5-javadoc.jar	261 bytes

Dependencies

runtime (12)

Group / Artifact	Type	Version
org.jetbrains.kotlin : kotlin-stdlib-jdk8	jar	1.3.50
org.jetbrains.kotlinx : kotlinx-serialization-runtime	jar	0.14.0
org.jetbrains.kotlin : kotlin-stdlib	jar	1.3.60
org.ballistacompute : datatypes	jar	0.2.5
org.ballistacompute : datasource	jar	0.2.5
org.ballistacompute : logical-plan	jar	0.2.5
org.ballistacompute : physical-plan	jar	0.2.5
org.ballistacompute : query-planner	jar	0.2.5
org.ballistacompute : optimizer	jar	0.2.5
org.ballistacompute : sql	jar	0.2.5
org.ballistacompute : fuzzer	jar	0.2.5
org.apache.arrow : arrow-vector	jar	0.17.0

Project Modules

There are no modules declared in this project.

Ballista: Distributed Compute Platform

Overview

Ballista is a distributed compute platform primarily implemented in Rust, using Apache Arrow as the memory model. It is built on an architecture that allows other programming languages to be supported as first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

Apache Arrow memory model and compute kernels for efficient processing of data.
Apache Arrow Flight Protocol for efficient data transfer between processes.
Google Protocol Buffers for serializing query plans.
Docker for packaging up executors along with user-defined code.

Ballista can be deployed in Kubernetes, or as a standalone cluster using etcd for discovery.

Architecture

The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.

How does this compare to Apache Spark?

Although Ballista is largely inspired by Apache Spark, there are some key differences.

The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still largely row-based today.
The combination of Rust and Arrow provides excellent memory efficiency: memory usage can be 5x to 10x lower than Apache Spark in some cases, so more processing can fit on a single node, reducing the overhead of distributed compute.
The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
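
The difference between row-based and columnar layouts mentioned above can be illustrated with a small sketch. This is not Ballista or Apache Arrow code: the `TripRow` and `TripColumns` types, their fields, and the helper functions are hypothetical names chosen for illustration, and plain `Vec` buffers stand in for Arrow arrays. The sketch only shows why an aggregate over one field touches less memory when each field lives in its own contiguous buffer.

```rust
// Hypothetical row-oriented record: all fields of one record sit together.
struct TripRow {
    passenger_count: u32,
    fare_amount: f64,
}

// Hypothetical column-oriented batch: one contiguous buffer per field.
// (Arrow arrays also carry validity bitmaps and metadata; omitted here.)
struct TripColumns {
    passenger_count: Vec<u32>,
    fare_amount: Vec<f64>,
}

// Pivot a slice of rows into per-field buffers.
fn to_columns(rows: &[TripRow]) -> TripColumns {
    TripColumns {
        passenger_count: rows.iter().map(|r| r.passenger_count).collect(),
        fare_amount: rows.iter().map(|r| r.fare_amount).collect(),
    }
}

// Row layout: the scan strides over every record, including unused fields.
fn sum_fares_rows(rows: &[TripRow]) -> f64 {
    rows.iter().map(|r| r.fare_amount).sum()
}

// Columnar layout: the scan reads a single dense f64 buffer, which is the
// access pattern that compilers and SIMD kernels can vectorize most easily.
fn sum_fares_columns(batch: &TripColumns) -> f64 {
    batch.fare_amount.iter().sum()
}

fn main() {
    let rows = vec![
        TripRow { passenger_count: 1, fare_amount: 7.5 },
        TripRow { passenger_count: 2, fare_amount: 12.0 },
        TripRow { passenger_count: 1, fare_amount: 3.25 },
    ];
    let batch = to_columns(&rows);

    println!("records in batch = {}", batch.passenger_count.len());
    println!("row-based sum    = {}", sum_fares_rows(&rows));
    println!("columnar sum     = {}", sum_fares_columns(&batch));
}
```

Both functions return the same total; the point of the sketch is the memory access pattern, not the result. A real engine would operate on Arrow record batches rather than these stand-in structs.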

Examples

The following examples should help illustrate the current capabilities of Ballista.

Status

Ballista releases are now available on crates.io, Maven Central and Docker Hub. Please refer to the user guide for instructions on using a released version of Ballista.

Roadmap

We are currently working on performance tuning and adding support for more complex operators, particularly joins, using the TPC-H benchmarks to drive requirements. The full roadmap is available here.

Documentation

The user guide is hosted at https://ballistacompute.org, along with the blog where news and release notes are posted.

Developer documentation can be found in the docs directory.

Contributing

See CONTRIBUTING.md for information on contributing to this project.

Versions

Version	Release Date
0.2.5	May 10, 2020
0.2.4	May 3, 2020
0.2.3	May 1, 2020
0.2.1	Apr 25, 2020
