circus-train-datasqueeze

Circus Train ⨉ DataSqueeze

Categories: Data
GroupId: com.expedia.dsp
ArtifactId: circus-train-datasqueeze
Last Version: 1.0
Type: jar
Description: Circus Train ⨉ DataSqueeze
Project Organization: Expedia Group (Data Services and Platforms)
Source Code Management: https://github.com/ExpediaInceCommercePlatform/circus-train-datasqueeze/tree/master

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.expedia.dsp/circus-train-datasqueeze/ -->
<dependency>
    <groupId>com.expedia.dsp</groupId>
    <artifactId>circus-train-datasqueeze</artifactId>
    <version>1.0</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.expedia.dsp/circus-train-datasqueeze/
implementation 'com.expedia.dsp:circus-train-datasqueeze:1.0'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.expedia.dsp/circus-train-datasqueeze/
implementation ("com.expedia.dsp:circus-train-datasqueeze:1.0")

Buildr:

'com.expedia.dsp:circus-train-datasqueeze:jar:1.0'

Ivy:

<dependency org="com.expedia.dsp" name="circus-train-datasqueeze" rev="1.0">
  <artifact name="circus-train-datasqueeze" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='com.expedia.dsp', module='circus-train-datasqueeze', version='1.0')
)

SBT:

libraryDependencies += "com.expedia.dsp" % "circus-train-datasqueeze" % "1.0"

Leiningen:

[com.expedia.dsp/circus-train-datasqueeze "1.0"]

Dependencies

provided (4)

Group / Artifact | Type | Version
com.expedia.dsp : datasqueeze | jar | 2.0
com.hotels : circus-train-api | jar | 12.1.0
com.hotels : circus-train-core | jar | 12.1.0
com.hotels : circus-train-s3-mapreduce-cp-copier | jar | 12.1.0

test (2)

Group / Artifact | Type | Version
org.powermock : powermock-module-junit4 | jar | 1.7.4
org.powermock : powermock-api-mockito | jar | 1.7.4

Project Modules

There are no modules declared in this project.

Circus Train DataSqueeze Copier

Overview

This project implements new Copiers for Circus Train that integrate DataSqueeze.

These CopierFactories are available:

  • DataSqueezeCopier - Runs DataSqueeze to compact data from source to target (within the same filesystem)
  • S3DataSqueezeCopier - Composite Copier that first replicates data using the built-in S3DistCpCopier, then compacts with DataSqueezeCopier

The S3DataSqueezeCopier Composite Copier is built like this:

S3DistCpCopier: Source -> Temp
CompactionCopier: Temp -> Replica
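
The two-stage flow above can be pictured with a small Python sketch. It is not the project's Java implementation: `run_composite`, the `Stage` signature, and the bucket names are hypothetical, and the stages here only record what they would do. The one thing it aims to show is the ordering, with the temporary location as the hand-off point between replication and compaction.

```python
from typing import Callable, List, Tuple

# Hypothetical stage signature: (input location, output location).
Stage = Callable[[str, str], None]


def run_composite(source: str, temp: str, replica: str,
                  replicate: Stage, compact: Stage) -> List[Tuple[str, str]]:
    """Run replication, then compaction, and return the hops performed in order."""
    hops = []
    replicate(source, temp)    # stage 1: Source -> Temp
    hops.append((source, temp))
    compact(temp, replica)     # stage 2: Temp -> Replica
    hops.append((temp, replica))
    return hops


# Stand-in stages that only log; real copiers would move and rewrite data.
log: List[str] = []
hops = run_composite(
    "s3://source-bucket/table",                # assumed example location
    "s3://your-bucket/tmp/composite-tmp-dir",  # staging dir from the YAML example
    "s3://replica-bucket/table",               # assumed example location
    replicate=lambda src, dst: log.append(f"replicate {src} -> {dst}"),
    compact=lambda src, dst: log.append(f"compact {src} -> {dst}"),
)
print("\n".join(log))
```

Because the second stage reads exactly what the first stage wrote, the temporary location has to be reachable by both copiers.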

Installation

This extension must be added to the Circus Train classpath to be available for use. These copiers will only be used if specifically requested via the Circus Train YAML, so adding this extension will not affect other Circus Train jobs accidentally.

Neither Circus Train nor DataSqueeze is bundled with this extension, so both must be installed separately.

Classpath

Either copy the JAR from this project into the Circus Train lib folder, or modify CIRCUS_TRAIN_CLASSPATH to include it:

export CIRCUS_TRAIN_CLASSPATH=$CIRCUS_TRAIN_CLASSPATH:/opt/circus-train-datasqueeze/lib/*

Configuration

Add the following to your Circus Train YAML file:

---
  extension-packages: com.expedia.dsp.circustrain
  copier-options:
    copier-factory-class: com.expedia.dsp.circustrain.S3DataSqueezeCopierFactory
    composite-tmp-dir: s3://your-bucket/tmp/composite-tmp-dir
    threshold: 268435456
  table-replications:
    -
      ...

The composite-tmp-dir property is used to stage the data between the S3DistCpCopier and the CompactionCopier.

Property | Required | Description
extension-packages | Yes | Allows Circus Train to discover this extension
copier-options.copier-factory-class | Yes | Forces Circus Train to use the given copier: com.expedia.dsp.circustrain.S3DataSqueezeCopierFactory
copier-options.composite-tmp-dir | Yes | The temporary location to stage data between the S3 and DataSqueeze copiers
copier-options.threshold | No | A threshold determining which files will be compacted; any files over this size (in bytes) will not be compacted. If omitted, DataSqueeze will use its default value.
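
To make the threshold value concrete, here is a minimal Python sketch of how a byte threshold partitions files. The function `split_by_threshold` and the file names are hypothetical and this is not DataSqueeze code. It follows the wording above literally, so a file exactly at the threshold counts as a compaction candidate; DataSqueeze's actual boundary handling may differ. The example value 268435456 is 256 MiB.

```python
MIB = 1024 * 1024
THRESHOLD = 268435456  # value from the YAML example; equals 256 MiB


def split_by_threshold(file_sizes, threshold):
    """Partition files into compaction candidates and files left uncompacted.

    Assumption: only files strictly over the threshold are excluded.
    """
    candidates = {name: size for name, size in file_sizes.items() if size <= threshold}
    skipped = {name: size for name, size in file_sizes.items() if size > threshold}
    return candidates, skipped


# Hypothetical file listing with sizes in bytes.
files = {
    "part-0000": 5 * MIB,
    "part-0001": 300 * MIB,
    "part-0002": 256 * MIB,
}
candidates, skipped = split_by_threshold(files, THRESHOLD)
print("candidates:", sorted(candidates))
print("skipped:", sorted(skipped))
```

A larger threshold pulls more files into compaction, while a smaller one leaves more files untouched.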

Contributing

We gladly accept contributions to this project in the form of issues, feature requests, and pull requests!

Legal

This project is available under the Apache 2.0 License.

Copyright 2018 Expedia Group

Versions

Version
1.0