
PopinSnake workflow

This Git repository contains a generic Snakemake workflow for running the program PopIns4Snake. It is intended to facilitate your PopIns runs and provides examples for a streamlined workflow.

Contents

  1. Prerequisites
  2. Download and installation
  3. Setting up your workflow
  4. Workflow execution
  5. Example data
  6. References

Prerequisites

As a minimum, an installation of the Conda package manager (Miniconda) is required for running the workflow. In addition, we recommend installing Mamba in the Conda base environment:

$ conda install mamba -n base -c conda-forge

If you do not have Snakemake installed already, create and activate a Conda environment for it:

$ conda activate base
$ mamba create -c conda-forge -c bioconda -n snakemake snakemake
$ conda activate snakemake

Most dependencies of the PopIns workflow will be automatically installed via Conda/Mamba when running the workflow for the first time.

Dependencies that are not available in Conda are included as additional programs in this repository and need to be installed before you can run the workflow. These installations require a C++14-capable compiler (e.g. GCC version ≥4.9.2), CMake version ≥2.8.12, and Zlib. If these tools are not already available on your system, they can be installed via Conda/Mamba:

$ conda activate base
$ mamba create -c conda-forge -n gcc cxx-compiler zlib cmake
$ conda activate gcc

Download and installation

Download the PopinSnake workflow from this repository via

$ git clone --recursive https://gitlab.informatik.hu-berlin.de/fonda_a6/popinSnake.git
$ cd popinSnake

The flag --recursive is required to also download the bundled programs, which are included as Git submodules.
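If the repository was cloned without --recursive, the submodules can still be fetched afterwards. The sketch below initializes a scratch repository only so the command can be demonstrated end to end; in practice, run the final command inside your popinSnake checkout:

```shell
# Recovering submodules after a non-recursive clone.
# The scratch repository is only a stand-in for your checkout.
cd "$(mktemp -d)"
git init -q demo && cd demo          # in practice: cd popinSnake
git submodule update --init --recursive
echo "submodules up to date"
```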

Before you can run the workflow, you need to install the programs popins4snake and sickle. The gatb-minia-pipeline program requires no separate installation.

Installing the PopIns4Snake program

PopIns4Snake is the main program executed by the workflow. Before we can compile PopIns4Snake, we need to compile and install its dependency Bifrost. For this, navigate to the bifrost folder and compile Bifrost using the flag MAX_KMER_SIZE=64:

$ cd program_bin/popins4snake/external/bifrost
$ mkdir local
$ mkdir build && cd build
$ cmake .. -DCMAKE_INSTALL_PREFIX=../local -DMAX_KMER_SIZE=64
$ make
$ make install

This installs Bifrost in the folder program_bin/popins4snake/external/bifrost/local, which is where the PopIns4Snake Makefile will look for it.

Now we can go back to the main popins4snake folder and compile PopIns4Snake:

$ cd ../../../
$ mkdir build
$ make

A binary popins4snake should now be available in program_bin/popins4snake/. The workflow is pre-configured to search for it in this folder.

Installing the Sickle program

Sickle is a small program for filtering and trimming sequencing reads. All you need to do to install Sickle is to navigate to its folder and run make:

$ cd program_bin/sickle
$ make

This should create the binary sickle in program_bin/sickle/. It is not required to copy the binary to a location available in your path as described in the Sickle installation instructions. The workflow is pre-configured to search for the sickle binary in the program_bin/sickle/ folder.

If you run into environment issues while compiling Sickle, try:

$ make CFLAGS+="-I${CONDA_PREFIX}/include"

In case you created the GCC Conda environment, remember to now activate the Snakemake environment again:

$ conda activate snakemake

Setting up your workflow

You can find all configurable parameters of the workflow in the file snake_config.yaml. You will need to adjust them according to your input data, preferred output folders, etc. as described in the following.

The pre-configured values allow running the workflow on the example data included in this repository.

Configuring locations of the input data

Input to the workflow is a set of BAM files along with BAM index (BAI) files, and the corresponding reference genome in FASTA format. Optionally, an "alternative reference" in FASTA format for contamination removal can be specified (see also Workflow behaviour).

# Define input directories here
REFERENCE:
   ../path/to/hg38.fa
ALTREF:
   ../path/to/altref.fa
INPUT_DIR:
   ../path/to/my_bam_files/

Further, the workflow requires you to specify all sample names:

# Provide sample names here
SAMPLES:
   - genome1
   - genome2
   - genome3

Please make sure the list is consistent with your BAM files in the INPUT_DIR. The workflow expects a separate subfolder for each sample (input BAM file) in the INPUT_DIR; hence, your folder structure for the input BAM files should look similar to:

INPUT_DIR
├── genome1
│   ├── genome1.bam
│   └── genome1.bam.bai
├── genome2
│   ├── genome2.bam
│   └── genome2.bam.bai
└── genome3
    ├── genome3.bam
    └── genome3.bam.bai
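If your BAM files currently sit in a flat directory, a small shell loop can create the expected per-sample layout. This is only a sketch with placeholder sample names; here, touch stands in for moving or linking your real .bam and .bam.bai files:

```shell
# Sketch: lay out INPUT_DIR in the structure the workflow expects.
# Sample names and paths below are placeholders; adjust to your data.
INPUT_DIR=$(mktemp -d)   # in practice: your real input directory
for SAMPLE in genome1 genome2 genome3; do
    mkdir -p "${INPUT_DIR}/${SAMPLE}"
    # move (or symlink) each sample's BAM and index into its subfolder:
    touch "${INPUT_DIR}/${SAMPLE}/${SAMPLE}.bam" \
          "${INPUT_DIR}/${SAMPLE}/${SAMPLE}.bam.bai"
done
find "${INPUT_DIR}" -type f | sort
```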

Configuring workflow and file locations

MYPATH:
   ..prefix/of/workflow/path
WORK_DIR:
   /workdir
RESULTS_DIR:
   /results

MYPATH should be set to the full path of the directory where the workflow is located, e.g. /home/user/popinSnake. This path serves as the prefix for all sub-directories within the workflow.
The workflow automatically creates the two output folders WORK_DIR and RESULTS_DIR within the workflow directory.
WORK_DIR is intended for intermediate data per sample, i.e., the workflow creates a subfolder per sample in this directory. Depending on the number of BAM files input to the workflow, the space requirements of this folder may grow quite large. You may place it in a temporary location or delete it after successful completion of the workflow.
RESULTS_DIR holds the final VCF and FASTA output files describing the called insertions, as well as the figures created during intermediate data analyses. You will probably want to keep most files written to this folder.

Configuring alternative installation directories

Define paths to program binaries here. If you strictly followed the installation instructions above, you will not need to change the preconfigured values in this section. By default, program_bin is located under MYPATH. If you installed any of the programs in another location, adjust the paths to their binaries:

POPINS2_BIN:
   path/to/program_bin/popins4snake/popins4snake
MINIA_BIN:
   path/to/program_bin/gatb-minia-pipeline/gatb
SICKLE_BIN:
   path/to/program_bin/sickle/sickle
VELVET_BIN:
   path/to/program_bin/velvet/velvet

Configuring workflow behaviour

You can influence the behaviour of the workflow by changing the parameters in this section:

Use sickle filtering or the default popins4snake filtering:

SICKLE:
   "yes" or "no"

The SICKLE parameter selects the method used for trimming the reads according to quality and length thresholds.
With "yes", sickle is used for quality trimming and filtering.
With "no", the built-in popins4snake trimming and filtering is used instead.
For both methods, the default quality threshold is 20 and the default minimum read length is 60; these can be adjusted via the min-qual and min-read-len parameters in the parameter section of the configuration file.

Choose assembly tool:

ASSEMBLER:
   "minia" or "velvet"

The ASSEMBLER parameter specifies the program used for assembling the reads that did not align to the reference genome into contiguous sequences (contigs). The default workflow installation supports two different assembly programs, "velvet" and "minia".

Choose whether to view the data analysis with Jupyter notebooks:

ANALYSIS:
   "yes" or "no"

The ANALYSIS parameter takes either "yes" or "no" and activates a sub-module for data analysis with Jupyter notebooks.
The unmapped_analysis returns figures of unmapped reads before assembly, the assembly_analysis returns figures showing the quality of the assembly, and the coverage_analysis returns figures showing contig coverage before positioning and genotyping. Using the Jupyter notebook analysis requires an installation of Snakemake version 5.32.0 and the Snakemake flag --edit-notebook. Please refer to the Snakemake documentation on Jupyter notebook integration for further information.

The popinSnake workflow can identify and remove contamination using kraken2.
Configure the contamination removal module here:

remove_contamination:
   "yes" or "no"
kraken_db_path:
   /example_data/kraken_db
install_kraken_db:
   "yes" or "no"
kraken_lib:
   "https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz"
python_script:
   modules/scripts/remap_classified_human.py
  • The remove_contamination parameter lets you choose whether sample contamination is identified and removed using kraken2. It takes either "yes" or "no".
  • The kraken_db_path parameter defines the path where the kraken2 database is stored.
  • If install_kraken_db is set to "yes", the workflow automatically installs the kraken2 database from the link defined in the following parameter, kraken_lib, so you do not have to download the database yourself. It takes either "yes" or "no".
  • The kraken_lib parameter takes the download link of the chosen kraken2 database. Combined with install_kraken_db set to "yes", the workflow automatically downloads the database from this link and stores it under /example_data/kraken_db. The default database in the config file is Standard-8GB. All possible databases can be found here.
  • The python_script parameter points to the Python code that parses the contamination information to create files for remapping.
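If you prefer to manage the database yourself (install_kraken_db set to "no"), the folder can be prepared manually along these lines. This is only a sketch: the temporary directory stands in for your real kraken_db_path, and the actual download (an archive of roughly 8 GB) is shown commented out:

```shell
# Sketch: prepare the kraken2 database folder manually instead of letting
# the workflow download it (install_kraken_db: "no").
DB_DIR=$(mktemp -d)/kraken_db      # in practice: example_data/kraken_db
mkdir -p "${DB_DIR}"
# The download itself (~8 GB) is commented out; the URL matches kraken_lib:
# wget -O "${DB_DIR}/k2_standard_08gb.tar.gz" \
#     "https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20240605.tar.gz"
# tar -xzf "${DB_DIR}/k2_standard_08gb.tar.gz" -C "${DB_DIR}"
echo "database folder ready at ${DB_DIR}"
```

Remember to point kraken_db_path at the folder you prepared.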

Configuring workflow parameters

The main parameters of the programs called by the workflow can be configured in this section. In most cases you will not need to change the default values.

Available parameters include:

  • memory: The maximum amount of memory samtools sort is allowed to use per thread. Default: 2G.
  • threads: The number of threads allowed for parallelizing rules calling samtools sort, bwa mem, minia, and popins4snake merge-contigs. Default: 16.
  • kmerlength: The k-mer length for Velvet assembly. Default: 29.
  • k_merge: The k-mer length for merging contigs across samples using popins4snake merge-contigs. Default: 63.
  • readlen: The read length required by popins4snake place-refalign and popins4snake place-splitalign. Default: 150.
  • min-qual: For rules popins2_crop_unmapped, popins2_crop_remapped and popins2_sickle, minimum average quality value in windows for read quality trimming. Default: 30.
  • min-read-len: For rules popins2_crop_unmapped, popins2_crop_remapped and popins2_sickle, minimum read length to keep the read after quality trimming. Default: 60.
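Put together, a parameter section using the documented defaults might look as follows (a sketch; check the exact key names and indentation against the shipped snake_config.yaml):

```yaml
# Sketch of the parameter section with the documented default values.
memory:
   2G
threads:
   16
kmerlength:
   29
k_merge:
   63
readlen:
   150
min-qual:
   30
min-read-len:
   60
```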

Workflow Execution

After the initial setup, you can execute the workflow from the main directory of popinSnake/ which contains the main Snakefile:

$ snakemake --use-conda --cores 1

The flag --use-conda is required for successful workflow execution unless you have all workflow dependencies (in their correct versions, as specified in the Conda environments in the workflow/envs/ folder) installed in a location available on your PATH.

Note: When you run the workflow for the first time using --use-conda, it will create the conda environments. This process may take some time. The next time you run the workflow, this process is not necessary again.

If you chose not to install Mamba (see Prerequisites), you now have to add --conda-frontend conda to the snakemake command.

Similarly, other Snakemake options are available. For example, you can specify --cores all to use all available cores instead of just a single core or specify any other number of cores.

The Snakemake command includes execution of all Jupyter notebooks we implemented for intermediate data analysis. The output of these analyses can be found in the RESULTS_DIR folder.

If you chose in the config file to use the data analysis module with Jupyter notebooks, execute the workflow with the following command:

$ snakemake --use-conda --cores 1 --edit-notebook [MYPATH]/popinSnake/results/insertions_genotypes.vcf.gz

[MYPATH] is the location of the main popinSnake workflow as defined in the config file.

Example data

We provide an example_data folder containing exemplary data generated for testing purposes. It also shows you how to prepare your data for the workflow. The snake_config.yaml file is preconfigured to run the workflow on the example data. You can simply execute Snakemake to run the workflow on this data (see also Workflow execution):

$ snakemake --use-conda --cores 1

In this example, the FASTA file chr21_ins.fa is used as a reference genome. A subfolder for each sample is present and its name matches the BAM file name. You can find three BAM files (.bam) and the corresponding index files (.bam.bai) in the three sample folders S0001, S0002, S0003.

References

Cao K., Elfaramawy N., Weidlich M., Kehr B. (2023) From Program Chains to Exploratory Workflows: PopinSnake for Insertion Detection in Genomics. In: 2023 IEEE 19th International Conference on e-Science (e-Science).

Krannich T., White W. T. J., Niehus S., Holley G., Halldórsson B. V., Kehr B. (2022) Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics, 38(3):604–611.

Kehr B., Helgadóttir A., Melsted P., Jónsson H., Helgason H., Jónasdóttir Að., Jónasdóttir As., Sigurðsson Á., Gylfason A., Halldórsson G. H., Kristmundsdóttir S., Þorgeirsson G., Ólafsson Í., Holm H., Þorsteinsdóttir U., Sulem P., Helgason A., Guðbjartsson D. F., Halldórsson B. V., Stefánsson K. (2017). Diversity in non-repetitive human sequences not found in the reference genome. Nature Genetics, 49(4):588–593.

Kehr B., Melsted P., Halldórsson B. V. (2016). PopIns: population-scale detection of novel sequence insertions. Bioinformatics, 32(7):961-967.