Genomics Dataflow Core Components

Components for implementing Google Cloud Dataflow pipelines that solve genomics processing tasks

License	The Apache Software License, Version 2.0
Categories	Data
GroupId	com.google.allenday
ArtifactId	genomics-dataflow-core
Last Version	0.1.4
Release Date	Oct 18, 2020
Type	jar
Description	Components for implementing Google Cloud Dataflow pipelines that solve genomics processing tasks
Project URL	https://github.com/allenday/dataflow-genomics
Source Code Management	https://github.com/allenday/dataflow-genomics.git

Download genomics-dataflow-core

Filename	Size
genomics-dataflow-core-0.1.4.pom
genomics-dataflow-core-0.1.4.jar	145 KB
genomics-dataflow-core-0.1.4-sources.jar	62 KB
genomics-dataflow-core-0.1.4-javadoc.jar	536 KB

How to add to project

Apache Maven

<!-- https://jarcasting.com/artifacts/com.google.allenday/genomics-dataflow-core/ -->
<dependency>
    <groupId>com.google.allenday</groupId>
    <artifactId>genomics-dataflow-core</artifactId>
    <version>0.1.4</version>
</dependency>

Gradle Groovy

// https://jarcasting.com/artifacts/com.google.allenday/genomics-dataflow-core/
implementation 'com.google.allenday:genomics-dataflow-core:0.1.4'

Gradle Kotlin

// https://jarcasting.com/artifacts/com.google.allenday/genomics-dataflow-core/
implementation ("com.google.allenday:genomics-dataflow-core:0.1.4")

Apache Buildr

'com.google.allenday:genomics-dataflow-core:jar:0.1.4'

Apache Ivy

<dependency org="com.google.allenday" name="genomics-dataflow-core" rev="0.1.4">
  <artifact name="genomics-dataflow-core" type="jar" />
</dependency>

Groovy Grape

@Grapes(
@Grab(group='com.google.allenday', module='genomics-dataflow-core', version='0.1.4')
)

Scala SBT

libraryDependencies += "com.google.allenday" % "genomics-dataflow-core" % "0.1.4"

Leiningen

[com.google.allenday/genomics-dataflow-core "0.1.4"]

Dependencies

compile (14)

Group / Artifact	Type	Version
org.apache.beam : beam-sdks-java-core	jar	2.20.0
org.apache.beam : beam-runners-direct-java	jar	2.20.0
org.apache.beam : beam-runners-google-cloud-dataflow-java	jar	2.20.0
com.google.cloud : google-cloud-storage	jar	1.49.0
com.google.cloud : google-cloud-resourcemanager	jar	0.116.0-alpha
com.google.inject : guice	jar	4.2.3
com.google.auth : google-auth-library-oauth2-http	jar	0.18.0
com.google.apis : google-api-services-lifesciences	jar	v2beta-rev20191015-1.30.3
com.github.samtools : htsjdk	jar	2.20.3
commons-io : commons-io	jar	2.6
commons-validator : commons-validator	jar	1.4.0
org.javatuples : javatuples	jar	1.2
org.slf4j : slf4j-simple	jar	1.7.25
com.google.http-client : google-http-client-jackson	jar	1.29.2

test (2)

Group / Artifact	Type	Version
junit : junit	jar	4.11
org.mockito : mockito-all	jar	1.10.19

Project Modules

There are no modules declared in this project.

Dataflow Genomics Core Components

Ready-to-use components for implementing Google Cloud Dataflow pipelines that solve genomics processing tasks

Overview

This library provides components for building genomics data processing pipelines that are based on the Apache Beam unified programming model and run on Google Cloud Dataflow. The current package includes tools for:

Building batch and streaming transformation graphs for genomics data
Working with SRA metadata annotations
Manipulating FASTQ files
Working with FASTA genome references
Sequence alignment
SAM/BAM data manipulation (sorting, merging, etc.)
Variant calling
Exporting variant calling results (VCF)

Efficiency

The components in this library let you build highly scalable, parallel, and efficient genomics data processing pipelines.

[Figure: schematic of a pipeline, built on the library components, that identifies genetic variations from sequence data]

Prerequisites

Java Development Kit (JDK) version 8
Apache Maven

Structure

The repository contains two Maven modules:

genomics-dataflow-core - module with the Java source code of the Dataflow Genomics Core Components
giab-example - module with a demo project that shows example usage of genomics-dataflow-core

High-level components

Several high-level classes can serve as the main building blocks for your pipeline. Here are some of them:

ParseSourceCsvTransform - provides a chain of input data transformations: reading the input CSV file, parsing, filtering, and checking the metadata for anomalies. Returns a ready-to-use key-value pair of SampleMetaData and a list of FileWrapper
SplitFastqIntoBatches - provides a FASTQ splitting mechanism to increase parallelism and balance load between workers
AlignAndSamProcessingTransform - contains a chain of genomics transformations: sequence alignment (FASTQ->SAM), conversion to binary format (SAM->BAM), sorting BAM, and merging BAM within a specific contig region
VariantCallingTransform - Apache Beam PTransform that provides variant calling logic. Currently supports GATK HaplotypeCaller and the DeepVariant pipeline from Google.
PrepareAndExecuteVcfToBqTransform - Apache Beam PTransform that groups variant calling results (VCF) of contig regions and exports them into a BigQuery table. Uses the vcf-to-bigquery transform from GCP Variant Transforms
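
To make the idea behind FASTQ splitting concrete, here is a minimal standard-library sketch of cutting a FASTQ file into fixed-size batches of records. It is not the implementation of SplitFastqIntoBatches: the class name FastqBatchSketch, the method splitIntoBatches, and the recordsPerBatch parameter are hypothetical, and the sketch assumes the common unwrapped FASTQ layout of exactly four lines per record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: NOT the library's SplitFastqIntoBatches implementation.
final class FastqBatchSketch {

    // Assumption: unwrapped FASTQ, i.e. header, sequence, separator, quality.
    static final int LINES_PER_RECORD = 4;

    // Splits FASTQ lines into batches holding at most recordsPerBatch records each.
    static List<List<String>> splitIntoBatches(List<String> fastqLines, int recordsPerBatch) {
        if (recordsPerBatch <= 0) {
            throw new IllegalArgumentException("recordsPerBatch must be positive");
        }
        if (fastqLines.size() % LINES_PER_RECORD != 0) {
            throw new IllegalArgumentException("Line count is not a multiple of 4");
        }
        int linesPerBatch = recordsPerBatch * LINES_PER_RECORD;
        List<List<String>> batches = new ArrayList<>();
        for (int start = 0; start < fastqLines.size(); start += linesPerBatch) {
            int end = Math.min(start + linesPerBatch, fastqLines.size());
            // Copy the slice so each batch is independent of the input list.
            batches.add(new ArrayList<>(fastqLines.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "@read1", "ACGT", "+", "IIII",
                "@read2", "TTGA", "+", "IIII",
                "@read3", "GGCC", "+", "IIII");
        List<List<String>> batches = splitIntoBatches(lines, 2);
        for (int i = 0; i < batches.size(); i++) {
            int records = batches.get(i).size() / LINES_PER_RECORD;
            System.out.println("Batch " + i + ": " + records + " record(s)");
        }
    }
}
```

Because each batch contains only whole records, every batch can be aligned independently, which is what allows the work to be spread across workers.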

Sequence aligning

By default, the minimap2 aligner is used for the sequence alignment stage. Optionally, you can use the BWA aligner by passing --aligner=BWA in the Apache Beam PipelineOptions.

You can also add a custom aligner by extending the AlignService class.
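
The default-with-override behavior of the --aligner flag can be sketched with the standard library alone. In a real Beam pipeline such options are normally declared on a PipelineOptions interface and parsed by PipelineOptionsFactory; the sketch below only mimics the selection logic and is not the library's code. The class AlignerOptionSketch, the enum Aligner, and the constant name MINIMAP2 are assumptions made for illustration (the source confirms only the value BWA).

```java
import java.util.Locale;

// Illustrative only: mimics "--aligner=BWA" handling without Apache Beam.
final class AlignerOptionSketch {

    // Hypothetical enum; only the value BWA is named in the documentation.
    enum Aligner { MINIMAP2, BWA }

    // Returns the aligner requested on the command line, or MINIMAP2 if none is given.
    static Aligner parseAligner(String[] args) {
        final String prefix = "--aligner=";
        Aligner selected = Aligner.MINIMAP2; // default, per the text above
        for (String arg : args) {
            if (arg.startsWith(prefix)) {
                String value = arg.substring(prefix.length());
                // valueOf throws IllegalArgumentException for unknown names.
                selected = Aligner.valueOf(value.toUpperCase(Locale.ROOT));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        System.out.println("Selected aligner: " + parseAligner(args));
    }
}
```

Failing fast on an unknown aligner name is a deliberate choice in this sketch: a typo in the flag surfaces at startup rather than after the pipeline has begun processing data.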

Variant Calling

By default, the pipeline uses GATK HaplotypeCaller. Optionally, you can run the pipeline with the DeepVariant variant caller by passing --variantCaller=DEEP_VARIANT in the Apache Beam PipelineOptions.

You can also add a custom variant caller by extending the VariantCallingService class.

Usage

The repository includes a usage example for the Dataflow Genomics Core Components library: a demo pipeline that batch-processes the NA12878 sample from Genome in a Bottle.

Already used by

Nanostream Dataflow - a scalable, reliable, and cost-effective end-to-end pipeline for fast DNA sequence analysis using Dataflow on Google Cloud

GCP-PopGen Processing Pipeline - a repository that contains a number of Apache Beam pipeline configurations for processing different populations of genomes (e.g. Homo sapiens, Rice, Cannabis)

Testing

The repository contains unit tests that cover all main components and one end-to-end integration test. For integration testing, you must set the TEST_BUCKET environment variable.
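
As a rough illustration of how a test setup might check for TEST_BUCKET before running the integration test, here is a standard-library sketch. The class TestBucketSketch and the method resolveTestBucket are hypothetical and do not come from the repository; how the project's own tests read the variable is not shown in this document.

```java
import java.util.Map;
import java.util.Optional;

// Illustrative only: a hypothetical helper, not part of the repository.
final class TestBucketSketch {

    // Returns the trimmed TEST_BUCKET value, or empty if it is unset or blank.
    static Optional<String> resolveTestBucket(Map<String, String> env) {
        String value = env.get("TEST_BUCKET");
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }

    public static void main(String[] args) {
        Optional<String> bucket = resolveTestBucket(System.getenv());
        if (bucket.isPresent()) {
            System.out.println("Integration test would use bucket: " + bucket.get());
        } else {
            System.out.println("TEST_BUCKET is not set; skipping the integration test.");
        }
    }
}
```

Passing the environment in as a map, rather than calling System.getenv() inside the helper, keeps the check easy to exercise in a unit test.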

Versions

Version
0.1.4 Oct 18, 2020
0.1.3 May 6, 2020
0.1.2 May 6, 2020
0.1.1 May 4, 2020
0.0.17 Apr 23, 2020
0.0.16 Apr 17, 2020
0.0.15 Apr 16, 2020
0.0.14 Apr 15, 2020
0.0.13 Apr 15, 2020
0.0.12 Apr 13, 2020
0.0.11 Apr 10, 2020
0.0.10 Apr 9, 2020
0.0.9 Apr 6, 2020
0.0.8 Apr 6, 2020
0.0.7 Apr 2, 2020
0.0.6 Apr 2, 2020
0.0.5 Mar 20, 2020
0.0.3 Mar 10, 2020
0.0.2 Nov 7, 2019
