Genomics Dataflow Core Components

Components for implementing Google Cloud Dataflow pipelines that solve genomics processing tasks

Categories: Data
GroupId: com.google.allenday
ArtifactId: genomics-dataflow-core
Last Version: 0.1.4
Type: jar
Description: Genomics Dataflow Core Components. Components for implementing Google Cloud Dataflow pipelines that solve genomics processing tasks
Project URL: https://github.com/allenday/dataflow-genomics
Source Code Management: https://github.com/allenday/dataflow-genomics.git

How to add to project

Maven:

<!-- https://jarcasting.com/artifacts/com.google.allenday/genomics-dataflow-core/ -->
<dependency>
    <groupId>com.google.allenday</groupId>
    <artifactId>genomics-dataflow-core</artifactId>
    <version>0.1.4</version>
</dependency>

Gradle (Groovy DSL):

// https://jarcasting.com/artifacts/com.google.allenday/genomics-dataflow-core/
implementation 'com.google.allenday:genomics-dataflow-core:0.1.4'

Gradle (Kotlin DSL):

// https://jarcasting.com/artifacts/com.google.allenday/genomics-dataflow-core/
implementation ("com.google.allenday:genomics-dataflow-core:0.1.4")

Buildr:

'com.google.allenday:genomics-dataflow-core:jar:0.1.4'

Ivy:

<dependency org="com.google.allenday" name="genomics-dataflow-core" rev="0.1.4">
  <artifact name="genomics-dataflow-core" type="jar" />
</dependency>

Grape:

@Grapes(
@Grab(group='com.google.allenday', module='genomics-dataflow-core', version='0.1.4')
)

SBT:

libraryDependencies += "com.google.allenday" % "genomics-dataflow-core" % "0.1.4"

Leiningen:

[com.google.allenday/genomics-dataflow-core "0.1.4"]

Dependencies

compile (14)

Group : Artifact | Type | Version
org.apache.beam : beam-sdks-java-core | jar | 2.20.0
org.apache.beam : beam-runners-direct-java | jar | 2.20.0
org.apache.beam : beam-runners-google-cloud-dataflow-java | jar | 2.20.0
com.google.cloud : google-cloud-storage | jar | 1.49.0
com.google.cloud : google-cloud-resourcemanager | jar | 0.116.0-alpha
com.google.inject : guice | jar | 4.2.3
com.google.auth : google-auth-library-oauth2-http | jar | 0.18.0
com.google.apis : google-api-services-lifesciences | jar | v2beta-rev20191015-1.30.3
com.github.samtools : htsjdk | jar | 2.20.3
commons-io : commons-io | jar | 2.6
commons-validator : commons-validator | jar | 1.4.0
org.javatuples : javatuples | jar | 1.2
org.slf4j : slf4j-simple | jar | 1.7.25
com.google.http-client : google-http-client-jackson | jar | 1.29.2

test (2)

Group : Artifact | Type | Version
junit : junit | jar | 4.11
org.mockito : mockito-all | jar | 1.10.19

Project Modules

There are no modules declared in this project.

Dataflow Genomics Core Components

Ready-to-use components for implementing Google Cloud Dataflow pipelines that solve genomics processing tasks

Overview

This library provides a broad set of components for building genomics data processing pipelines that are based on the Apache Beam unified programming model and run on Google Cloud Dataflow. The current package includes tools for:

  • Building batch and streaming transformation graphs for genomics data
  • Working with SRA metadata annotations
  • Manipulating FASTQ files
  • Working with FASTA genome references
  • Sequence alignment
  • Various SAM/BAM data manipulations (sorting, merging, etc.)
  • Variant calling
  • Exporting variant calling results (VCF)

Efficiency

Components from this library let you build highly scalable, parallel, and efficient genomics data processing pipelines. The schematic below shows a pipeline, built from library components, that identifies genetic variations from sequence data:

[Figure: Pipeline principle schema]

Prerequisites

Structure

The repository contains two Maven modules:

High-level components

Several high-level classes can serve as the main building blocks for your pipeline. Here are some of them:

Sequence aligning

By default, the minimap2 aligner is used for the sequence alignment stage. Optionally, you can use the BWA aligner by passing --aligner=BWA to the Apache Beam PipelineOptions.

You can also add a custom aligner by extending the AlignService class.
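
As a rough illustration of how an argument such as --aligner=BWA can resolve to an aligner choice with minimap2 as the fallback, here is a minimal standard-library sketch. The AlignerFlagSketch class, its Aligner enum, the MINIMAP2 constant name, and parseAligner are hypothetical names invented for this example. The library itself reads the flag through Apache Beam PipelineOptions, which this sketch does not reproduce.

```java
import java.util.Locale;

// Hypothetical sketch: not part of genomics-dataflow-core.
class AlignerFlagSketch {

    // BWA is the value named in the text; MINIMAP2 is an assumed name for the default.
    enum Aligner { MINIMAP2, BWA }

    // Returns the aligner requested via "--aligner=<NAME>",
    // or MINIMAP2 when no such argument is present.
    static Aligner parseAligner(String[] args) {
        String prefix = "--aligner=";
        for (String arg : args) {
            if (arg.startsWith(prefix)) {
                String value = arg.substring(prefix.length()).toUpperCase(Locale.ROOT);
                // valueOf throws IllegalArgumentException for unknown names.
                return Aligner.valueOf(value);
            }
        }
        return Aligner.MINIMAP2;
    }

    public static void main(String[] args) {
        System.out.println(parseAligner(args));
    }
}
```

Failing fast on an unknown name, rather than silently falling back to the default, makes a mistyped flag visible before any data is processed.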

Variant Calling

By default, the pipeline uses the GATK HaplotypeCaller. Optionally, you can run the pipeline with the DeepVariant variant caller by passing --variantCaller=DEEP_VARIANT to the Apache Beam PipelineOptions.

You can also add a custom variant caller by extending the VariantCallingService class.
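
The extension pattern described here, subclassing a service base class to plug in a different tool, can be sketched in plain Java. Everything in the block below is an assumption made for illustration: SketchVariantCallingService is a stand-in and not the library's VariantCallingService, the buildCommand method and its parameters are invented, and "my-caller" is a placeholder tool name. Check the real class in the repository before relying on any of these shapes.

```java
import java.util.List;

// Hypothetical sketch of the "extend a service class" pattern.
class VariantCallerExtensionSketch {
    public static void main(String[] args) {
        SketchVariantCallingService service = new SketchCustomCaller();
        List<String> command = service.buildCommand("sample.bam", "ref.fasta", "out.vcf");
        System.out.println(String.join(" ", command));
    }
}

// Stand-in base class; the real VariantCallingService API may look quite different.
abstract class SketchVariantCallingService {
    // Builds the command line a caller would run for one BAM input (assumed shape).
    abstract List<String> buildCommand(String bamPath, String referencePath, String outputVcfPath);
}

// A custom caller only has to supply its own command construction.
class SketchCustomCaller extends SketchVariantCallingService {
    @Override
    List<String> buildCommand(String bamPath, String referencePath, String outputVcfPath) {
        return List.of(
                "my-caller", // placeholder tool name
                "--input", bamPath,
                "--reference", referencePath,
                "--output", outputVcfPath);
    }
}
```

Keeping the tool-specific part behind one abstract method means the rest of a pipeline can stay unchanged when the caller is swapped.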

Usage

This repository contains a usage example for the Dataflow Genomics Core Components library: a demo pipeline that batch-processes the NA12878 sample from Genome in a Bottle.

Already used by

Nanostream Dataflow - a scalable, reliable, and cost-effective end-to-end pipeline for fast DNA sequence analysis using Dataflow on Google Cloud

GCP-PopGen Processing Pipeline - a repository that contains a number of Apache Beam pipeline configurations for processing different populations of genomes (e.g. Homo sapiens, rice, cannabis)

Testing

The repository contains unit tests that cover all main components, plus one end-to-end integration test. To run the integration test, you have to set the TEST_BUCKET environment variable.
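
One way to guard an integration test on the TEST_BUCKET variable is to resolve it up front and skip when it is missing. The sketch below uses only the Java standard library. The TestBucketSketch class and its testBucket helper are hypothetical names for this example, and the repository's own test setup may read the variable differently.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: not taken from the repository's test code.
class TestBucketSketch {

    // Returns the bucket name when TEST_BUCKET is set to a non-blank value.
    // The environment is passed in as a map so the lookup is easy to test.
    static Optional<String> testBucket(Map<String, String> env) {
        String value = env.get("TEST_BUCKET");
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }

    public static void main(String[] args) {
        Optional<String> bucket = testBucket(System.getenv());
        if (bucket.isPresent()) {
            System.out.println("Integration test would use bucket: " + bucket.get());
        } else {
            System.out.println("TEST_BUCKET is not set; skipping integration test");
        }
    }
}
```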

Versions

Version
0.1.4
0.1.3
0.1.2
0.1.1
0.0.17
0.0.16
0.0.15
0.0.14
0.0.13
0.0.12
0.0.11
0.0.10
0.0.9
0.0.8
0.0.7
0.0.6
0.0.5
0.0.3
0.0.2