diff --git a/README.md b/README.md index faf861787bb6aa64f3dcd2169552671274105787..47f3b05f58421ee5393c80c35303de10aeed22ed 100644 --- a/README.md +++ b/README.md @@ -1,161 +1,189 @@ -# PopIns2 +# popins4snake + +A modularized version of the program [PopIns2](https://github.com/kehrlab/PopIns2) for population-scale detection of non-reference sequence variants. + +__Note: The recommended way to run popins4snake is via the Snakemake workflow [*PopinSnake*](https://gitlab.informatik.hu-berlin.de/fonda_a6/popinSnake).__ -[](https://doi.org/10.5281/zenodo.4890793) -Population-scale detection of non-reference sequence variants using colored de Bruijn Graphs ## Contents + 1. [Requirements](#requirements) -2. [Installation](#installation) -3. [Usage](#usage) -4. [Example](#example) -5. [Snakemake](#snakemake) -6. [Help](#help) +1. [Installation](#installation) +1. [Usage](#usage) +1. [Help](#help) +1. [References](#references) + -## Requirements: +## Requirements + +Prior to the installation make sure your system meets all the requirements: | Requirement | Tested with | | --- | --- | -| 64 bits POSIX-compliant operating system | Ubuntu 16.04 / 18.04, CentOS Linux 7.6 | -| C++14 capable compiler | g++ vers. 4.9.2, 5.5.0, 7.2.0 | -| [Bifrost](https://github.com/pmelsted/bfgraph) | vers. 1.0.4-ab43065 | -| [bwa](https://github.com/lh3/bwa) | vers. 0.7.15-r1140 | -| [samtools](https://github.com/samtools/samtools) | vers. 1.3, 1.5 | -| [sickle](https://github.com/najoshi/sickle) | vers. 1.33 | -| [gatb-minia-pipeline](https://github.com/Krannich479/gatb-minia-pipeline) | (*submodule; no need to install*) | -| [SeqAn](https://www.seqan.de/) | (*header library; no need to install*) | +| 64 bits POSIX-compliant operating system | Ubuntu 20.04, CentOS Linux 7.6 | +| C++14 capable compiler | g++ vers. 4.9.2, 5.5.0, 7.2.0, 9.4.0 | +| CMake | >= 2.8.12 (available through Conda) | -Prior to the installation make sure your system meets all the requirements. 
For the default settings of PopIns2 a *Bifrost* installation with MAX_KMER_SIZE=64 is required. Presently, the conda package of Bifrost does not meet this requirement. If the executables of the software dependencies (bwa, samtools, sickle) are not accessible systemwide, you have to write the full paths to the executables into a configfile (see [Installation](#installation)). Submodules and header libraries come by default with the git clone, there is no need for a manual installation. For backward compatibility PopIns2 still offers to use the *Velvet assembler* (see [popins](https://github.com/bkehr/popins) for installation recommendation). +For the default settings of popins4snake a *Bifrost* installation with MAX_KMER_SIZE=64 is required (see below). *Bifrost* is included as a submodule in this repository and comes with a recursive clone. Presently, the conda package of Bifrost does not meet this requirement. -## Installation: +CMake is required for installing *Bifrost*. + +The [SeqAn](https://www.seqan.de/) header library is included in this repository and comes with the git clone. There is no need for a manual installation. -``` -git clone --recursive https://github.com/kehrlab/PopIns2.git -cd PopIns2 -mkdir build -make -``` -If the binaries of the software dependencies are not globally available on your system (e.g. by appending them to your `PATH`) you have to set the paths to the binaries within the *popins2.config* prior to executing `make`. After the compilation with `make` you should see the binary *popins2* in the main folder. The PopIns2 [Wiki](https://github.com/kehrlab/PopIns2/wiki) gathers known issues that might occur during installation or runtime. -## Usage: +## Installation -PopIns2 is a program consisting of several submodules. The submodules are designed to be executed one after another and fit together into a consecutive workflow. To display the help page of a submodule type `popins2 <command> --help` as shown in the [help section](#help). 
+First clone the repository with the `--recursive` flag: -#### The assemble command ``` -popins2 assemble [OPTIONS] sample.bam +git clone --recursive https://gitlab.informatik.hu-berlin.de/fonda_a6/popins4snake.git ``` -The assemble command identifies reads without high-quality alignment to the reference genome, filters reads with poor base quality and assembles them into a set of contigs. The reads, given as BAM file, must be indexed by _bwa index_. Optionally, reads can be remapped to an additional reference FASTA before the filtering and assembly such that only the remaining reads without a high-quality alignment are further processed (e.g. useful for decontamination). The additional reference FASTQ must be indexed by _bwa index_ too. -#### The merge command +Next, compile and install *Bifrost* with `MAX_KMER_SIZE=64`. You can either install it globally on your system or locally in your home directory. +We recommend installing it locally to your home directory in the folder `~/local` by using: + ``` -popins2 merge [OPTIONS] {-s|-r} DIR +mkdir -p ~/local +cd external/bifrost && mkdir build && cd build +cmake .. -DCMAKE_INSTALL_PREFIX=~/local -DMAX_KMER_SIZE=64 +make +make install ``` -\[Default\] The merge command builds a colored and compacted de Bruijn Graph (ccdbg) of all contigs of all samples in a given source directory _DIR_. -By default, the merge module finds all files of the pattern `<DIR>/*/assembly_final.contigs.fa`. To process the contigs of the [assemble command](#the-assemble-command) the __-r__ input parameter is recommended. Once the ccdbg is built, the merge module identifies paths in the graph and returns _supercontigs_. + +To install *Bifrost* globally, omit the `-DCMAKE_INSTALL_PREFIX=~/local` in the CMake command. + +If you follow our recommendation of installing *Bifrost* locally, make sure that the local directory is appended to the relevant system variables. +Setting these paths is necessary for compiling and running *popins4snake*. 
``` -popins2 merge [OPTIONS] -y GFA -z BFG_COLORS +export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:~/local/include/ +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/local/lib/ +export LIBRARY_PATH=$LIBRARY_PATH:~/local/lib/ +export PATH=$PATH:~/local/lib/ ``` -An alternative way of providing input for the merge command is to directly pass a ccdbg. Here, the merge command expects a _GFA_ file and a _bfg_colors_ file, which is specific to the Bifrost. If you choose to run the merge command with a _pre_-built GFA graph, mind that you have to set the Algorithm options accordingly (in particular __-k__). -#### The contigmap command +To make the local install directory (and *Bifrost*) permanently available (for running *popins4snake*), we recommend to add these exports to your `.bashrc`. + +Now, you can compile *popins4snake*: + ``` -popins2 contigmap [OPTIONS] SAMPLE_ID +cd popins4snake +mkdir build +make ``` -The contigmap command maps all reads with low-quality alignments of a sample to the set of supercontigs using BWA-MEM. The mapping information is then merged with the reads' mates. -#### The place commands +After the compilation with `make` you should see the binary *popins4snake* in the cloned directory. + +The [PopIns2 Wiki](https://github.com/kehrlab/PopIns2/wiki/Troubleshooting---FAQ) gathers known issues that might occur during installation or runtime. + + + +## Usage + +*Popins4snake* is a program consisting of several functions. +The functions are designed to be chained into a workflow together with calls to standard bioinformatics programs (samtools, bwa, ...) and bash commands. +__The recommended way of running *popins4snake* is using the Snakemake workflow [PopinSnake](https://gitlab.informatik.hu-berlin.de/fonda_a6/popinSnake).__ +To display the help page of each of the *popins4snake* function, type `popins2 <command> --help` as shown in the [help section](#help). 
+ + +### The `crop-unmapped` command ``` -popins2 place-refalign [OPTIONS] -popins2 place-splitalign [OPTIONS] SAMPLE_ID -popins2 place-finish [OPTIONS] +popins2 crop-unmapped [OPTIONS] sample.bam ``` -In brief, the place commands attempt to anker the supercontigs to the samples. At first, all potential anker locations from all samples are collected. Then prefixes/suffixes of the supercontigs are aligned to all collected locations. For successful alignments records are written to a VCF file. In the second step, all remaining locations are split-aligned per sample. Finally, all locations from all successful split-alignments are combined and added to the VCF file. +The crop-unmapped command identifies reads without high-quality alignment to the reference genome. The reads given in the input BAM file must be indexed, i.e. the file `sample.bam.bai` is expected to exist. + -#### The genotype command +### The `merge-bams` command ``` -popins2 genotype [OPTIONS] SAMPLE_ID +popins2 merge-bams [OPTIONS] input1.bam input2.bam ``` -The genotype command generates alleles (ALT) of the supercontigs with some flanking reference genome sequence. Then, the reads of a sample are aligned to ALT and the reference genome around the breakpoint (REF). The ratio of alignments to ALT and REF determines a genotype quality and a final genotype prediction per variant per sample. -## Example: -Test data for a minimum working example can be found at [zenodo](https://doi.org/10.5281/zenodo.4890793). A simple project structure for PopIns2 looks like +### The `merge-contigs` command +``` +popins2 merge-contigs [OPTIONS] {-s|-r} /path/to/sample_directories/ +``` +\[Default\] The merge command builds a colored and compacted de Bruijn Graph (ccdbg) of all contigs of all samples in a given source directory _DIR_. +By default, the merge module finds all files of the pattern `<DIR>/*/assembly_final.contigs.fa`. 
To process the contigs of the [assemble command](#the-assemble-command) the __-r__ input parameter is recommended. Once the ccdbg is built, the merge module identifies paths in the graph and returns _supercontigs_. ``` -$ tree /path/to/your/project/ -/path/to/your/project/ -├── myFirstSample -│ ├── first_sample.bam -│ └── first_sample.bam.bai -├── mySecondSample -│ ├── second_sample.bam -│ └── second_sample.bam.bai -└── myThirdSample - ├── third_sample.bam - └── third_sample.bam.bai +popins2 merge [OPTIONS] -y input.gfa -z input.bfg_colors ``` +An alternative way of providing input for the merge command is to directly pass a ccdbg. Here, the merge command expects a _GFA_ file and a _bfg_colors_ file, which is specific to the Bifrost. If you choose to run the merge command with a _pre_-built GFA graph, mind that you have to set the Algorithm options accordingly (in particular __-k__). -and a simple workflow could look like +### The `find-locations` command +``` +popins2 find-locations [OPTIONS] SAMPLE_ID ``` -cd /path/to/your/project -ln -s /path/to/reference_genome.fa genome.fa -ln -s /path/to/reference_genome.fa.fai genome.fa.fai -popins2 assemble --sample sample1 /path/to/your/project/myFirstSample/first_sample.bam -popins2 assemble --sample sample2 /path/to/your/project/mySecondSample/second_sample.bam -popins2 assemble --sample sample3 /path/to/your/project/myThirdSample/third_sample.bam -popins2 merge -r /path/to/your/project -di +### The `merge-locations` command +``` +popins2 merge-locations [OPTIONS] +``` -popins2 contigmap sample1 -popins2 contigmap sample2 -popins2 contigmap sample3 -popins2 place-refalign -popins2 place-splitalign sample1 -popins2 place-splitalign sample2 -popins2 place-splitalign sample3 -popins2 place-finish +### The `place` commands +``` +popins2 place-refalign [OPTIONS] +popins2 place-splitalign [OPTIONS] SAMPLE_ID +popins2 place-finish [OPTIONS] +``` +In brief, the place commands attempt to anker the supercontigs to the samples. 
At first, all potential anker locations from all samples are collected. Then prefixes/suffixes of the supercontigs are aligned to all collected locations. For successful alignments records are written to a VCF file. In the second step, all remaining locations are split-aligned per sample. Finally, all locations from all successful split-alignments are combined and added to the VCF file. -popins2 genotype sample1 -popins2 genotype sample2 -popins2 genotype sample3 +### The `genotype` command +``` +popins2 genotype [OPTIONS] SAMPLE_ID ``` +The genotype command generates alleles (ALT) of the supercontigs with some flanking reference genome sequence. Then, the reads of a sample are aligned to ALT and the reference genome around the breakpoint (REF). The ratio of alignments to ALT and REF determines a genotype quality and a final genotype prediction per variant per sample. -## Snakemake -The workflow of PopIns2 can be effectively distributed among a HPC cluster environment. This [Github project](https://github.com/Krannich479/PopIns2_snakeproject) provides a template of a full PopIns2 workflow as individual cluster jobs using [Snakemake](https://snakemake.readthedocs.io/en/stable/), a Python-based workflow management tool. -## Help: +## Help ``` -$ popins2 -h +$ popins4snake -h -Population-scale detection of non-reference sequence insertions using colored de Bruijn Graphs -================================================================ +===================================================================== +A modularized version of the program PopIns2 + for population-scale detection of non-reference sequence variants +===================================================================== SYNOPSIS - popins2 COMMAND [OPTIONS] + ../popins4snake/popins4snake COMMAND [OPTIONS] COMMAND - assemble Filter, clip and assemble unmapped reads from a sample. - merge Generate supercontigs from a colored compacted de Bruijn Graph. 
- multik Multi-k framework for a colored compacted de Bruijn Graph. - contigmap Map unmapped reads to (super-)contigs. - place-refalign Find position of (super-)contigs by aligning contig ends to the reference genome. - place-splitalign Find position of (super-)contigs by split-read alignment (per sample). - place-finish Combine position found by split-read alignment from all samples. + crop-unmapped Extract unmapped and poorly aligned reads from a BAM file. + merge-bams Merge two name-sorted BAM files of the same sample and set mate information of now paired reads. + merge-contigs Merge sets of contigs into supercontigs using a colored compacted de Bruijn Graph. + find-locations Find insertion locations of (super-)contigs per sample. + merge-locations Merge insertion locations from all samples into one file. + place-refalign Find positions of (super-)contigs by aligning contig ends to the reference genome. + place-splitalign Find positions of (super-)contigs by split-read alignment (per sample). + place-finish Combine (super-)contig positions found by split-read alignment from all samples. genotype Determine genotypes of all insertions in a sample. VERSION - 0.12.0-a935f00, Date: on 2020-10-21 12:50:29 - -Try `popins2 COMMAND --help' for more information on each command. + 0.1.0-a52d4f5, Date: 2022-08-25 14:42:31 +Try `../popins4snake/popins4snake COMMAND --help' for more information on each command. ``` + + + +## References + +Krannich T., White W. T. J., Niehus S., Holley G., Halldórsson B. V., Kehr B. (2022) +Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. +[Bioinformatics, 38(3):604–611](https://academic.oup.com/bioinformatics/article/38/3/604/6415820). + +Kehr B., Helgadóttir A., Melsted P., Jónsson H., Helgason H., Jónasdóttir Að., Jónasdóttir As., Sigurðsson Á., Gylfason A., Halldórsson G. H., Kristmundsdóttir S., Þorgeirsson G., Ólafsson Í., Holm H., Þorsteinsdóttir U., Sulem P., Helgason A., Guðbjartsson D. 
F., Halldórsson B. V., Stefánsson K. (2017). +Diversity in non-repetitive human sequences not found in the reference genome. +[Nature Genetics,](http://rdcu.be/pDbJ) [doi:10.1038/ng.3801](http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3801.html). -For more troubleshooting, FAQs and tips about the usage of PopIns2 please have a look into the PopIns2 [Wiki](https://github.com/kehrlab/PopIns2/wiki). +Kehr B., Melsted P., Halldórsson B. V. (2016). +PopIns: population-scale detection of novel sequence insertions. +[Bioinformatics, 32(7):961-967](https://academic.oup.com/bioinformatics/article/32/7/961/2240308). \ No newline at end of file
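
The Installation section added in this diff appends `~/local` subdirectories to `CPLUS_INCLUDE_PATH`, `LD_LIBRARY_PATH` and `LIBRARY_PATH` with plain `export VAR=$VAR:dir` lines. When a variable is unset, that form leaves an empty leading entry, which GCC and the dynamic loader interpret as the current directory. The following is a minimal sketch of a more defensive variant for a `.bashrc`. It is not part of popins4snake: the helper name `append_path` and the `PREFIX` variable are hypothetical, and `PREFIX` assumes the `~/local` install prefix recommended in the README.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with popins4snake): append a directory
# to a colon-separated search path. It avoids an empty leading entry when
# the variable is unset and skips directories that are already listed.
append_path() {
    local current="$1" dir="$2"
    case ":${current}:" in
        *":${dir}:"*) printf '%s\n' "$current" ;;           # already present
        "::")         printf '%s\n' "$dir" ;;                # variable was empty
        *)            printf '%s\n' "${current}:${dir}" ;;   # normal append
    esac
}

# Assumed install prefix, matching the README's recommendation of ~/local.
PREFIX="${HOME}/local"

# Same three compiler/linker variables as in the README's export lines.
export CPLUS_INCLUDE_PATH="$(append_path "${CPLUS_INCLUDE_PATH:-}" "${PREFIX}/include")"
export LD_LIBRARY_PATH="$(append_path "${LD_LIBRARY_PATH:-}" "${PREFIX}/lib")"
export LIBRARY_PATH="$(append_path "${LIBRARY_PATH:-}" "${PREFIX}/lib")"
```

Because the helper is idempotent, sourcing the file repeatedly (for example from nested shells) should not make the variables grow.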