# popins4snake

A modularized version of the program [PopIns2](https://github.com/kehrlab/PopIns2) for population-scale detection of non-reference sequence variants.


*Popins4snake* is a program consisting of several functions.
The functions are designed to be chained into a workflow, together with calls to standard bioinformatics programs (samtools, bwa, ...) and bash commands.

__The recommended way of running *popins4snake* is using the Snakemake workflow [PopinSnake](https://gitlab.informatik.hu-berlin.de/fonda_a6/popinSnake).__

You can find installation instructions for all dependencies of the PopinSnake workflow, including instructions for installing *popins4snake*, in the [PopinSnake README file](https://gitlab.informatik.hu-berlin.de/fonda_a6/popinSnake/-/blob/main/README.md).



## Contents

1. [Requirements](#requirements)
1. [Installation](#installation)
1. [Usage](#usage)
1. [Summary of popins4snake functions](#summary-of-popins4snake-functions)
1. [References](#references)


## Requirements

Prior to the installation make sure your system meets all the requirements:

| Requirement | Tested with |
| --- | --- |
| 64-bit POSIX-compliant operating system | Ubuntu 20.04, CentOS Linux 7.6 |
| C++14 capable compiler | g++ versions 4.9.2, 5.5.0, 7.2.0, 9.4.0 |
| CMake | >= 2.8.12 (available through Conda) |

For the default settings of *popins4snake*, a *Bifrost* installation with `MAX_KMER_SIZE=64` is required (see below). *Bifrost* is included as a submodule in this repository and comes with a recursive clone. Presently, the conda package of *Bifrost* does not meet this requirement.

CMake is required for installing *Bifrost*.

The [SeqAn](https://www.seqan.de/) header library is included in this repository and comes with the git clone. There is no need for a manual installation.



## Installation

First clone the repository with the `--recursive` flag:

```
git clone --recursive https://gitlab.informatik.hu-berlin.de/fonda_a6/popins4snake.git
```

Next, compile and install *Bifrost* with `MAX_KMER_SIZE=64`. You can either install it globally on your system or locally in your home directory.
We recommend installing it locally to your home directory in the folder `~/local` by using:

```
mkdir -p ~/local
cd external/bifrost && mkdir build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=~/local -DMAX_KMER_SIZE=64
make
make install
```

To install *Bifrost* globally, omit `-DCMAKE_INSTALL_PREFIX=~/local` from the CMake command.

If you follow our recommendation of installing *Bifrost* locally, make sure that the local directory is appended to the relevant environment variables.
Setting these paths is necessary for compiling and running *popins4snake*.

```
export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:~/local/include/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:~/local/lib/
export LIBRARY_PATH=$LIBRARY_PATH:~/local/lib/
export PATH=$PATH:~/local/lib/
```

To make the local install directory (and *Bifrost*) permanently available for running *popins4snake*, we recommend adding these exports to your `.bashrc`.
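
Before compiling, it can help to confirm that the exports took effect in the current shell. The following is a minimal sketch, not part of *popins4snake*: the helper name `path_contains` is hypothetical, and the check assumes the `~/local` prefix recommended above. It only tests whether a directory appears as an entry in a colon-separated variable.

```bash
# Hypothetical helper (not part of popins4snake): succeeds if directory $2
# is an entry of the colon-separated list $1, with or without a trailing slash.
path_contains() {
    case ":$1:" in
        *":$2:"*|*":$2/:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether the recommended local prefix is present in each variable.
path_contains "$CPLUS_INCLUDE_PATH" "$HOME/local/include" \
    && echo "CPLUS_INCLUDE_PATH: ok" || echo "CPLUS_INCLUDE_PATH: missing ~/local/include"
path_contains "$LD_LIBRARY_PATH" "$HOME/local/lib" \
    && echo "LD_LIBRARY_PATH: ok" || echo "LD_LIBRARY_PATH: missing ~/local/lib"
path_contains "$LIBRARY_PATH" "$HOME/local/lib" \
    && echo "LIBRARY_PATH: ok" || echo "LIBRARY_PATH: missing ~/local/lib"
```

If a variable is reported as missing, re-run the corresponding `export` line or open a new shell after editing `.bashrc`.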

Now, you can compile *popins4snake*:

```
cd popins4snake
mkdir build
make
```

After the compilation with `make`, you should see the binary *popins4snake* in the cloned directory.

The [PopIns2 Wiki](https://github.com/kehrlab/PopIns2/wiki/Troubleshooting---FAQ) gathers known issues that might occur during installation or runtime.



## Usage

To get an overview of the functions offered in *popins4snake*, you can run `./popins4snake -h` after installation:

```
=====================================================================
A modularized version of the program PopIns2
    for population-scale detection of non-reference sequence variants
=====================================================================

SYNOPSIS
    ./popins4snake COMMAND [OPTIONS]

COMMAND
    crop-unmapped       Extract unmapped and poorly aligned reads from a BAM file.
    merge-bams          Merge two name-sorted BAM files of the same sample and set mate information of now paired reads.
    merge-contigs       Merge sets of contigs into supercontigs using a colored compacted de Bruijn Graph.
    find-locations      Find insertion locations of (super-)contigs per sample.
    merge-locations     Merge insertion locations from all samples into one file.
    place-refalign      Find positions of (super-)contigs by aligning contig ends to the reference genome.
    place-splitalign    Find positions of (super-)contigs by split-read alignment (per sample).
    place-finish        Combine (super-)contig positions found by split-read alignment from all samples.
    genotype            Determine genotypes of all insertions in a sample.

VERSION
    0.1.0-a52d4f5, Date: 2022-08-25 14:42:31

Try `../popins4snake/popins4snake COMMAND --help' for more information on each command.
```



## Summary of *popins4snake* functions

To display the help page of each of the *popins4snake* functions, type `./popins4snake <command> --help`.


### The `crop-unmapped` function
```
popins4snake crop-unmapped [OPTIONS] sample.bam
```
The `crop-unmapped` command identifies reads without a high-quality alignment to the reference genome. The input BAM file must be indexed, i.e. the file `sample.bam.bai` is expected to exist.
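
The index requirement can be checked before calling `crop-unmapped`. This is a minimal sketch under stated assumptions: the helper name `has_bam_index` is hypothetical, `sample.bam` is the placeholder name from the usage line, and the suggested `samtools index` call is only printed, not executed.

```bash
# Hypothetical helper (not part of popins4snake): succeeds only if the
# index file <bam>.bai exists next to the given BAM file.
has_bam_index() {
    [ -f "$1.bai" ]
}

bam="sample.bam"   # placeholder file name taken from the usage line
if has_bam_index "$bam"; then
    echo "index found: $bam.bai"
else
    # Only print the suggestion; run it yourself once the BAM file exists.
    echo "index missing; create it with: samtools index $bam"
fi
```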


### The `merge-bams` function
```
popins4snake merge-bams [OPTIONS] input1.bam input2.bam
```


### The `merge-contigs` function
```
popins4snake merge-contigs [OPTIONS] {-s|-r} /path/to/sample_directories/
```
\[Default\] The merge command builds a colored and compacted de Bruijn graph (ccdbg) from all contigs of all samples in a given source directory _DIR_.
By default, the merge module finds all files matching the pattern `<DIR>/*/assembly_final.contigs.fa`. To process the contigs of the assemble command, the __-r__ input parameter is recommended. Once the ccdbg is built, the merge module identifies paths in the graph and returns _supercontigs_.
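
To preview which contig files the default pattern would pick up, the directory layout can be inspected beforehand. The sketch below is an illustration only: the helper name `list_contig_files` is hypothetical, and it mimics the pattern `<DIR>/*/assembly_final.contigs.fa` with a shell glob rather than calling *popins4snake* itself.

```bash
# Hypothetical helper (not part of popins4snake): print every file that
# matches <DIR>/*/assembly_final.contigs.fa, one path per line.
list_contig_files() {
    dir="${1%/}"   # tolerate a trailing slash on the directory argument
    for f in "$dir"/*/assembly_final.contigs.fa; do
        # An unmatched glob stays literal, so test that the file exists.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}

# Placeholder path from the usage line; prints nothing if no file matches.
list_contig_files /path/to/sample_directories/
```

Sample directories that do not appear in the output lack an `assembly_final.contigs.fa` file and would not be matched by the pattern.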

```
popins4snake merge-contigs [OPTIONS] -y input.gfa -z input.bfg_colors
```
An alternative way of providing input for the merge command is to pass a ccdbg directly. Here, the merge command expects a _GFA_ file and a _bfg_colors_ file, which is specific to *Bifrost*. If you choose to run the merge command with a _pre_-built GFA graph, note that you have to set the Algorithm options accordingly (in particular __-k__).


### The `find-locations` function
```
popins4snake find-locations [OPTIONS] SAMPLE_ID
```


### The `merge-locations` function
```
popins4snake merge-locations [OPTIONS]
```


### The `place` function
```
popins4snake place-refalign [OPTIONS]
popins4snake place-splitalign [OPTIONS] SAMPLE_ID
popins4snake place-finish [OPTIONS]
```
In brief, the place commands attempt to anchor the supercontigs to the samples. First, all potential anchor locations from all samples are collected. Then prefixes/suffixes of the supercontigs are aligned to all collected locations. For successful alignments, records are written to a VCF file. In the second step, all remaining locations are split-aligned per sample. Finally, all locations from all successful split-alignments are combined and added to the VCF file.
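
The order of the three place commands can be sketched as a dry run. This is an illustration under stated assumptions: the function name `print_place_steps` and the sample IDs are hypothetical, all options are omitted, and the commands are only printed, not executed.

```bash
# Dry-run sketch (not part of popins4snake): print the place commands in
# the order described above. Options are omitted on purpose; consult
# `popins4snake <command> --help` for the options each step needs.
print_place_steps() {
    echo "popins4snake place-refalign"
    for sample in "$@"; do
        # Split-read alignment runs once per sample.
        echo "popins4snake place-splitalign $sample"
    done
    echo "popins4snake place-finish"
}

print_place_steps sampleA sampleB   # placeholder sample IDs
```

Only `place-splitalign` is run per sample; `place-refalign` and `place-finish` each run once across all samples.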

### The `genotype` function
```
popins4snake genotype [OPTIONS] SAMPLE_ID
```
The genotype command generates alleles (ALT) of the supercontigs with some flanking reference genome sequence. Then, the reads of a sample are aligned to ALT and the reference genome around the breakpoint (REF). The ratio of alignments to ALT and REF determines a genotype quality and a final genotype prediction per variant per sample.
    


## References

Krannich T., White W. T. J., Niehus S., Holley G., Halldórsson B. V., Kehr B. (2022).
Population-scale detection of non-reference sequence variants using colored de Bruijn graphs.
[Bioinformatics, 38(3):604–611](https://academic.oup.com/bioinformatics/article/38/3/604/6415820).

Kehr B., Helgadóttir A., Melsted P., Jónsson H., Helgason H., Jónasdóttir Að., Jónasdóttir As., Sigurðsson Á., Gylfason A., Halldórsson G. H., Kristmundsdóttir S., Þorgeirsson G., Ólafsson Í., Holm H., Þorsteinsdóttir U., Sulem P., Helgason A., Guðbjartsson D. F., Halldórsson B. V., Stefánsson K. (2017).
Diversity in non-repetitive human sequences not found in the reference genome.
[Nature Genetics,](http://rdcu.be/pDbJ) [doi:10.1038/ng.3801](http://www.nature.com/ng/journal/vaop/ncurrent/full/ng.3801.html).

Kehr B., Melsted P., Halldórsson B. V. (2016).
PopIns: population-scale detection of novel sequence insertions.
[Bioinformatics, 32(7):961–967](https://academic.oup.com/bioinformatics/article/32/7/961/2240308).