# Racon
[Latest release](https://github.com/lbcb-sci/racon/releases/latest)
[Build status](https://travis-ci.com/lbcb-sci/racon)
[Publication (DOI)](https://doi.org/10.1101/gr.214270.116)
Consensus module for raw de novo DNA assembly of long uncorrected reads.
## Description
Racon is intended as a standalone consensus module to correct raw contigs generated by rapid assembly methods which do not include a consensus step. The goal of Racon is to generate genomic consensus which is of similar or better quality compared to the output generated by assembly methods which employ both error correction and consensus steps, while providing a speedup of several times compared to those methods. It supports data produced by both Pacific Biosciences and Oxford Nanopore Technologies.
Racon can be used as a polishing tool after the assembly with either **short accurate data** or **data produced by third-generation sequencing**. The input data type is detected automatically. Note that Racon expects single-end short reads; paired-end reads should be renamed so that each has a unique name up to the first whitespace, and joined into a single file before mapping (this can be done with `misc/racon_preprocess.py`).
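The renaming step for paired-end reads can be pictured with the minimal Python sketch below. It is not the code of `misc/racon_preprocess.py`; the function name `rename_paired_reads` and the `1`/`2` suffix scheme are assumptions made for illustration, and plain four-line FASTQ records are assumed.

```python
import io


def rename_paired_reads(fastq_1, fastq_2, out):
    """Write both mate files to `out`, giving every read a unique name.

    Only the header token before the first whitespace is kept. The
    suffixes "1" and "2" are a hypothetical naming scheme for this sketch.
    """
    for suffix, handle in (("1", fastq_1), ("2", fastq_2)):
        for i, line in enumerate(handle):
            line = line.rstrip("\n")
            if i % 4 == 0:  # header line of a four-line FASTQ record
                name = line[1:].split()[0]
                line = "@" + name + suffix
            out.write(line + "\n")


# Tiny in-memory example: both mates start with the same name "read7".
mates_1 = io.StringIO("@read7 1:N:0\nACGT\n+\nIIII\n")
mates_2 = io.StringIO("@read7 2:N:0\nTTGA\n+\nIIII\n")
joined = io.StringIO()
rename_paired_reads(mates_1, mates_2, joined)
print(joined.getvalue())
```

After this step the two mates no longer share a name, so overlaps reported by a mapper can be attributed to a single sequence.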
Racon takes only three input files: contigs in FASTA/FASTQ format, reads in FASTA/FASTQ format and overlaps/alignments between the reads and the contigs in MHAP/PAF/SAM format. Output is a set of polished contigs in FASTA format printed to stdout. All input files **can be compressed with gzip** (which will affect parsing time).
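Because the polished contigs go to stdout, a calling script has to redirect them itself. The sketch below shows one way to do that from Python, assuming `racon` is on the `PATH`; the helper names `racon_command` and `polish` and all file names are hypothetical.

```python
import subprocess


def racon_command(sequences, overlaps, targets, threads=1):
    """Build the argument list in the order racon expects:
    reads, overlaps, then the target sequences to be polished."""
    return ["racon", "-t", str(threads), sequences, overlaps, targets]


def polish(sequences, overlaps, targets, output_fasta, threads=1):
    """Run racon and store the FASTA it prints to stdout in `output_fasta`."""
    with open(output_fasta, "w") as out:
        subprocess.run(
            racon_command(sequences, overlaps, targets, threads),
            stdout=out,
            check=True,  # raise if racon exits with a non-zero status
        )


# Example call (hypothetical file names):
# polish("reads.fastq.gz", "overlaps.paf", "contigs.fasta", "polished.fasta", threads=4)
```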
Racon can also be used as a read error-correction tool. In this scenario, the MHAP/PAF/SAM file needs to contain pairwise overlaps between reads **including dual overlaps**.
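A quick sanity check on an overlap file before read correction might look like the sketch below. It assumes that "dual" means each overlapping read pair is listed in both directions (A to B and B to A) and that the overlaps are in PAF format, where column 1 is the query name and column 6 the target name. The function `missing_dual_overlaps` is hypothetical and not part of Racon.

```python
def missing_dual_overlaps(paf_lines):
    """Return (query, target) pairs whose mirrored overlap is absent."""
    pairs = set()
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # PAF has 12 mandatory columns
            continue
        pairs.add((fields[0], fields[5]))
    # Self overlaps (q == t) are their own mirror, so they are skipped.
    return sorted((q, t) for q, t in pairs if q != t and (t, q) not in pairs)


example = [
    "a\t100\t0\t90\t+\tb\t120\t10\t100\t80\t90\t60",
    "b\t120\t10\t100\t+\ta\t100\t0\t90\t80\t90\t60",
    "a\t100\t5\t95\t-\tc\t150\t20\t110\t70\t90\t60",
]
print(missing_dual_overlaps(example))
```

In this example the pair `a`/`c` is reported, since only one direction of that overlap is present.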
A **wrapper script** is also available to make Racon easier to use on large datasets. It has the same interface as `racon` and adds two features on top of it: sequences can be **subsampled** to decrease the total execution time (accuracy might be lower), and target sequences can be **split** into smaller chunks that are run sequentially to decrease memory consumption. Both features can be used at the same time.
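The idea behind splitting can be sketched as follows. This is only an illustration of chunking by size, not the wrapper's actual implementation; `split_targets` is a hypothetical name, and the sketch assumes whole sequences are never cut.

```python
def split_targets(records, chunk_size):
    """Group (name, sequence) records into chunks of roughly `chunk_size` bytes.

    A record is never cut: a sequence longer than `chunk_size` simply
    forms a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for name, sequence in records:
        if current and current_size + len(sequence) > chunk_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append((name, sequence))
        current_size += len(sequence)
    if current:
        chunks.append(current)
    return chunks


contigs = [("c1", "A" * 60), ("c2", "A" * 50), ("c3", "A" * 30), ("c4", "A" * 200)]
for chunk in split_targets(contigs, chunk_size=100):
    print([name for name, _ in chunk])
```

Each chunk could then be polished in a separate, sequential run, so only one chunk of target sequences has to be held in memory at a time.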
## Dependencies
1. gcc 4.8+ or clang 3.4+
2. cmake 3.2+
3. zlib
### CUDA Support
1. gcc 5.0+
2. cmake 3.10+
3. CUDA 9.0+
## Installation
To install Racon run the following commands:
```bash
git clone https://github.com/lbcb-sci/racon && cd racon && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make
```
After successful installation, an executable named `racon` will appear in `build/bin` (alongside unit tests `racon_test`).
Optionally, you can run `sudo make install` to install the `racon` executable on your machine.
To build the wrapper script add `-Dracon_build_wrapper=ON` while running `cmake`. After installation, an executable named `racon_wrapper` (python script) will be created in `build/bin`.
### CUDA Support
Racon makes use of [NVIDIA's GenomeWorks SDK](https://github.com/clara-parabricks/GenomeWorks) for CUDA accelerated polishing and alignment.
To build `racon` with CUDA support, add `-Dracon_enable_cuda=ON` while running `cmake`. If CUDA support is unavailable, the `cmake` step will error out.
Note that the CUDA support flag does not produce a new binary target. Instead it augments the existing `racon` binary itself.
```bash
cd build
cmake -DCMAKE_BUILD_TYPE=Release -Dracon_enable_cuda=ON ..
make
```
***Note***: Short read polishing with CUDA is still in development!
### Packaging
To generate a Debian package for `racon`, run the following command from the build folder:
```bash
make package
```
## Usage
Usage of `racon` is as follows:

    racon [options ...] <sequences> <overlaps> <target sequences>

        # default output is stdout
        <sequences>
            input file in FASTA/FASTQ format (can be compressed with gzip)
            containing sequences used for correction
        <overlaps>
            input file in MHAP/PAF/SAM format (can be compressed with gzip)
            containing overlaps between sequences and target sequences
        <target sequences>
            input file in FASTA/FASTQ format (can be compressed with gzip)
            containing sequences which will be corrected

        options:
            -u, --include-unpolished
                output unpolished target sequences
            -f, --fragment-correction
                perform fragment correction instead of contig polishing (overlaps
                file should contain dual/self overlaps!)
            -w, --window-length <int>
                default: 500
                size of window on which POA is performed
            -q, --quality-threshold <float>
                default: 10.0
                threshold for average base quality of windows used in POA
            -e, --error-threshold <float>
                default: 0.3
                maximum allowed error rate used for filtering overlaps
            --no-trimming
                disables consensus trimming at window ends
            -m, --match <int>
                default: 3
                score for matching bases
            -x, --mismatch <int>
                default: -5
                score for mismatching bases
            -g, --gap <int>
                default: -4
                gap penalty (must be negative)
            -t, --threads <int>
                default: 1
                number of threads
            --version
                prints the version number
            -h, --help
                prints the usage

        only available when built with CUDA:
            -c, --cudapoa-batches <int>
                default: 0
                number of batches for CUDA accelerated polishing per GPU
            -b, --cuda-banded-alignment
                use banding approximation for polishing on GPU. Only applicable when -c is used.
            --cudaaligner-batches <int>
                default: 0
                number of batches for CUDA accelerated alignment per GPU
            --cudaaligner-band-width <int>
                default: 0
                Band width for cuda alignment. Must be >= 0. Non-zero allows user defined
                band width, whereas 0 implies auto band width determination.
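To get a feel for how the `-m`, `-x` and `-g` values interact, the sketch below scores a pairwise alignment with the default values. It assumes a simple linear gap model in which every gap column costs `g`; the function `alignment_score` is hypothetical and only illustrates the arithmetic, not Racon's internal POA code.

```python
def alignment_score(a, b, match=3, mismatch=-5, gap=-4):
    """Score two aligned strings of equal length, with '-' marking a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap       # one gap column
        elif x == y:
            score += match     # matching bases
        else:
            score += mismatch  # mismatching bases
    return score


# 4 matches, 1 mismatch, 1 gap: 4 * 3 + (-5) + (-4) = 3
print(alignment_score("ACGT-A", "ACCTTA"))
```

With the defaults, a mismatch is penalised slightly more than a single gap column.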
`racon_test` is run without any parameters.
`racon_wrapper` is used in the same way as `racon`, with two additional options:

    ...
    options:
        --split <int>
            split target sequences into chunks of desired size in bytes
        --subsample <int> <int>
            subsample sequences to desired coverage (2nd argument) given the
            reference length (1st argument)
    ...
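The two `--subsample` arguments together define a base budget: reference length times desired coverage. The sketch below illustrates that arithmetic with random read selection. It is an assumption about how such subsampling could work, not the wrapper's actual code, and the function name `subsample` and its `seed` parameter are hypothetical.

```python
import random


def subsample(reads, reference_length, coverage, seed=0):
    """Pick random (name, sequence) reads until the base budget is reached."""
    target_bases = reference_length * coverage
    order = list(reads)
    random.Random(seed).shuffle(order)  # fixed seed keeps the sketch reproducible
    picked, total = [], 0
    for name, sequence in order:
        if total >= target_bases:
            break
        picked.append((name, sequence))
        total += len(sequence)
    return picked


# 100 reads of 100 bases each; a 1,000-base reference at 5x needs 5,000 bases.
reads = [("read%d" % i, "A" * 100) for i in range(100)]
subset = subsample(reads, reference_length=1000, coverage=5)
print(len(subset), sum(len(seq) for _, seq in subset))
```

If the reads contain fewer bases than the budget, all of them are kept.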
## Contact information
For additional information, help and bug reports please send an email to one of the following: ivan.sovic@irb.hr, robert.vaser@fer.hr, mile.sikic@fer.hr, nagarajann@gis.a-star.edu.sg
## Acknowledgment
This work has been supported in part by the Croatian Science Foundation under the project UIP-11-2013-7353. IS is supported in part by the Croatian Academy of Sciences and Arts under the project "Methods for alignment and assembly of DNA sequences using nanopore sequencing data". NN is supported by funding from A*STAR, Singapore.