"Dataflow variables are spectacularly expressive in concurrent programming"
Henri E. Bal, Jennifer G. Steiner, Andrew S. Tanenbaum
Quick overview
Nextflow is a bioinformatics workflow manager that enables the development of portable and reproducible workflows. It supports deploying workflows on a variety of execution platforms including local, HPC schedulers, AWS Batch, Google Cloud Life Sciences, and Kubernetes. Additionally, it helps you manage your workflow dependencies through built-in support for Conda, Docker, Singularity, and Modules.
Contents
- Rationale
- Quick start
- Documentation
- Tool management
- HPC Schedulers
- Cloud support
- Community
- Build from source
- Contributing
- License
- Citations
- Credits
Rationale
With the rise of big data, techniques to analyse and run experiments on large datasets are increasingly necessary.
Parallelization and distributed computing are the best ways to tackle this problem, but the tools commonly available to the bioinformatics community often lack good support for these techniques, provide a model that fits badly with the specific requirements of the bioinformatics domain, or, most of the time, require knowledge of complex tools or low-level APIs.
The Nextflow framework is based on the dataflow programming model, which greatly simplifies writing parallel and distributed pipelines without adding unnecessary complexity, letting you concentrate on the flow of data, i.e. the functional logic of the application or algorithm.
It doesn't aim to be yet another pipeline scripting language. Instead, it is built around the idea that the Linux platform is the lingua franca of data science, since it provides many simple command-line and scripting tools which are powerful by themselves and, when chained together, facilitate complex data manipulations.
In practice, this means that a Nextflow script is defined by composing many different processes. Each process can execute a given bioinformatics tool or scripting language, and Nextflow adds the ability to coordinate and synchronize process execution simply by specifying their inputs and outputs.
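As a rough illustration of this composition model, the sketch below chains two processes purely through their declared inputs and outputs. It assumes the DSL2 syntax used by recent Nextflow releases; the process names and the input string are illustrative and not taken from this project.

```groovy
// Sketch only: assumes DSL2 syntax; names and the input string are illustrative.

// First process: split a string into 6-byte chunks, one file per chunk.
process splitLetters {
    output:
    path 'chunk_*'

    script:
    """
    printf 'Hello world!' | split -b 6 - chunk_
    """
}

// Second process: runs once per chunk file and upper-cases its content.
process convertToUpper {
    input:
    path x

    output:
    stdout

    script:
    """
    cat $x | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    // The output channel of one process feeds the input of the next;
    // 'flatten' emits each chunk file as a separate item.
    splitLetters | flatten | convertToUpper | view { it.trim() }
}
```

No explicit scheduling code is needed: convertToUpper starts as soon as a chunk is available, and the chunks may be processed in parallel, so the order of the printed results can vary between runs.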
Quick start
Download the package
Nextflow does not require any installation procedure. Just download the distribution package by copying and pasting this command into your terminal:
curl -fsSL https://get.nextflow.io | bash
It creates the nextflow executable file in the current directory. You may want to move it to a folder accessible from your $PATH.
Download from Conda
Nextflow can also be installed from Bioconda:
conda install -c bioconda nextflow
Documentation
Nextflow documentation is available at http://docs.nextflow.io
HPC Schedulers
Nextflow supports common HPC schedulers, abstracting the submission of jobs from the user.
Currently the following clusters are supported:
For example, to submit the execution to an SGE cluster, create a file named nextflow.config in the directory where the pipeline is going to be launched, with the following content:
process {
    executor = 'sge'
    queue = '<your execution queue>'
}
With this configuration, Nextflow executes processes as SGE jobs using the qsub command. Your pipeline will behave like any other SGE job script, with the benefit that Nextflow automatically and transparently manages process synchronisation, file staging and un-staging, etc.
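The same configuration file can also carry the resources requested for each job. The fragment below is a minimal sketch: the resource values, the process name align, and the parallel environment name smp are placeholders chosen for illustration, and the right values depend on your cluster.

```groovy
// nextflow.config (sketch; all values below are placeholders)
process {
    executor = 'sge'
    queue    = '<your execution queue>'

    // Default resources requested for every process
    cpus   = 1
    memory = '2 GB'
    time   = '1h'

    // Override for a hypothetical process named 'align'
    withName: 'align' {
        cpus   = 8
        memory = '16 GB'
        // SGE needs a parallel environment for multi-CPU jobs;
        // 'smp' is an assumed, site-specific name.
        penv   = 'smp'
    }
}
```

Keeping these settings in nextflow.config rather than in the pipeline script means the same workflow can be moved to another scheduler by changing only the configuration.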
Cloud support
Nextflow also supports running workflows across various clouds and cloud technologies. Nextflow can create AWS EC2 or Google GCE clusters and deploy your workflow. Managed solutions from both Amazon and Google are also supported through AWS Batch and Google Genomics Pipelines. Additionally, Nextflow can run workflows on either on-prem or managed cloud Kubernetes clusters.
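As a hedged sketch of what a managed cloud setup might look like, the fragment below targets AWS Batch. The queue, region, and bucket are placeholders you would replace with your own resources, and the container image simply reuses the one from the Containers example in this README.

```groovy
// nextflow.config (sketch for AWS Batch; queue, region, and bucket are placeholders)
process {
    executor  = 'awsbatch'
    queue     = '<your AWS Batch job queue>'
    // AWS Batch runs each task in a container, so an image must be set
    container = 'biocontainers/samtools:1.3.1'
}

aws {
    region = '<your AWS region>'
}

// Intermediate files are kept in an S3 bucket instead of a local directory
workDir = 's3://<your-bucket>/work'
```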
Currently supported cloud platforms:
Tool management
Containers
Nextflow has first-class support for containerization. It supports both the Docker and Singularity container engines. Additionally, Nextflow can easily switch between container engines, enabling workflow portability.
process samtools {
    container 'biocontainers/samtools:1.3.1'

    """
    samtools --version
    """
}
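One possible way to switch engines without touching the process above is to define configuration profiles. This is a sketch, and the profile names are arbitrary choices rather than anything this project prescribes.

```groovy
// nextflow.config (sketch; profile names are arbitrary)
profiles {
    // Run process containers with Docker
    docker {
        docker.enabled = true
    }
    // Run the same container images with Singularity
    singularity {
        singularity.enabled = true
    }
}
```

A profile is then selected at launch time with the -profile option of nextflow run, for example -profile singularity on an HPC system where Docker is not available.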
Conda environments
Conda environments provide another option for managing software packages in your workflow.
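A minimal sketch, mirroring the container example above: the conda directive names the packages the process needs. The channel, package, and version shown here are illustrative.

```groovy
// Sketch only: the package spec is an illustrative choice.
process samtools {
    // Nextflow creates a Conda environment with this package for the process
    conda 'bioconda::samtools=1.9'

    """
    samtools --version
    """
}
```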
Environment Modules
Environment modules commonly found in HPC environments can also be used to manage the tools used in a Nextflow workflow.
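A comparable sketch using the module directive is shown below. Module names are defined by each site, so samtools/1.9 is only an assumed example; check the output of module avail on your cluster for the real name.

```groovy
// Sketch only: 'samtools/1.9' is an assumed, site-specific module name.
process samtools {
    // Load the environment module before running the process script
    module 'samtools/1.9'

    """
    samtools --version
    """
}
```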
Community
You can post questions or report problems using the Nextflow discussion forum or the Nextflow channel on Gitter.
Nextflow also hosts a yearly workshop showcasing researchers' workflows and advancements in the language. Talks from past workshops are available on the Nextflow YouTube Channel.
The nf-core project is a community effort aggregating high-quality Nextflow workflows which can be used by the community.
Build from source
Required dependencies
- Compiler Java 8 or later
- Runtime Java 8 or later
Build from source
Nextflow is written in Groovy (a scripting language for the JVM). A pre-compiled, ready-to-run package is available on the GitHub releases page, so it is not necessary to compile it in order to use it.
If you are interested in modifying the source code, or contributing to the project, it is worth knowing that the build process is based on the Gradle build automation system.
You can compile Nextflow by typing the following command in the project home directory on your computer:
make compile
The very first time you run it, it will automatically download all the libraries required by the build process. It may take some minutes to complete.
When complete, execute the program by using the launch.sh script in the project directory.
The self-contained runnable Nextflow packages can be created by using the following command:
make pack
Once compiled, use the script ./launch.sh as a replacement for the usual nextflow command.
The compiled packages can be locally installed using the following command:
make install
A self-contained distribution can be created with the command: make pack. To include support for GA4GH and its dependencies in the binary, use make packGA4GH instead.
IntelliJ IDEA
Nextflow development with IntelliJ IDEA requires a recent version of the IDE (2019.1.2 or later).
If you have it installed on your computer, follow the steps below in order to use it with Nextflow:
- Clone the Nextflow repository to a directory on your computer.
- Open IntelliJ IDEA and choose "Import project" in the "File" menu bar.
- Select the Nextflow project root directory on your computer and click "OK".
- Then, choose the "Gradle" item in the "external module" list and click the "Next" button.
- Confirm the default import options and click "Finish" to finalize the project configuration.
- When the import process completes, select the "Project structure" command in the "File" menu bar.
- In the dialog that appears, click the "Project" item in the list on the left, and make sure that the "Project SDK" choice on the right contains Java 8.
- Set the code formatting options with the settings provided here.
Contributing
Project contributions are more than welcome. See the CONTRIBUTING file for details.
License
The Nextflow framework is released under the Apache 2.0 license.
Citations
If you use Nextflow in your research, please cite:
P. Di Tommaso, et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017) doi:10.1038/nbt.3820
Credits
Nextflow is built on two great pieces of open source software, namely Groovy and GPars.
YourKit is kindly supporting this open source project with its full-featured Java Profiler. Read more at http://www.yourkit.com