[go: up one dir, main page]

File: README

package info (click to toggle)
seqan 1.2-1
  • links: PTS, VCS
  • area: main
  • in suites: squeeze
  • size: 21,368 kB
  • ctags: 17,376
  • sloc: cpp: 81,339; ansic: 58,225; makefile: 96; sh: 14
file content (498 lines) | stat: -rw-r--r-- 20,284 bytes parent folder | download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
*** RazerS - Fast Read Mapping with Sensitivity Control ***
http://www.seqan.de/projects/razers.html

---------------------------------------------------------------------------
Table of Contents
---------------------------------------------------------------------------
  1.   Overview
  2.   Installation
  3.   Usage
  4.   Output Format
  5.   Example
  6.   Contact

---------------------------------------------------------------------------
1. Overview
---------------------------------------------------------------------------

RazerS is a tool for mapping millions of short genomic reads onto a 
reference genome. It was designed with focus on mapping next-generation 
sequencing reads onto whole DNA genomes. RazerS searches for matches of
reads with a percent identity above a given threshold, whereby it detects
matches with mismatches as well as gaps.
RazerS uses a k-mer index of all reads and counts common k-mers of reads
and the reference genome in parallelograms. Each parallelogram with a k-mer
count above a certain threshold triggers a verification. On success, the 
genomic subsequence and the read number are stored and written to the 
output file.

---------------------------------------------------------------------------
2. Installation
---------------------------------------------------------------------------

RazerS is distributed with SeqAn - The C++ Sequence Analysis Library (see 
http://www.seqan.de). To build RazerS do the following:

  1)  Download the latest snapshot of SeqAn
  2)  Unzip it to a directory of your choice (e.g. snapshot)
  3)  cd snapshot/apps
  4)  make razers
  5)  cd razers
  6)  ./razers --help

Alternatively you can check out the latest SVN version of RazerS and SeqAn
with:

  1)  svn co http://svn.mi.fu-berlin.de/seqan/trunk/seqan
  2)  cd seqan
  3)  make forwards
  4)  cd projects/library/apps
  5)  make razers
  6)  cd razers
  7)  ./razers --help

On success, an executable file razers was build and a brief usage 
description was dumped.

---------------------------------------------------------------------------
3. Usage
---------------------------------------------------------------------------

To get a short usage description of RazerS, you can execute razers -h or 
razers --help.

Usage: razers [OPTION]... <GENOME FILE> <READS FILE>
       razers [OPTION]... <GENOME FILE> <MP-READS FILE1> <MP-READS FILE2>

RazerS expects the names of two or three DNA (multi-)Fasta files. The first
contains a reference genome and the second (and third) contains genomic reads
that should be mapped onto the reference. If two read files are given, both 
have to contain exactly the same number of reads, which are considered as
mate-pairs. Without any additional parameters RazerS would map all reads onto
both strands of the reference genome with 92% identity (i.e. 8% errors per
read) and dump all found matches in an output file. The output file name is
the read file name extended by the suffix ".result". The default behaviour 
can be modified by adding the following options to the command line:

---------------------------------------------------------------------------
3.1. Main Options
---------------------------------------------------------------------------

  [ -f ],  [ --forward ]

  Only map reads onto the positive/forward strand of the genome. By
  default, both strands are scanned.

  [ -r ],  [ --reverse ]

  Only map reads onto the negative/reverse-complement strand of the
  genome. By default, both strands are scanned.

  [ -i NUM ],  [ --percent-identity NUM ]

  Set the percent identity threshold. NUM must be a value between 50 and
  100 (default is 92). RazerS searches for matches with a percent identity
  of at least NUM. A match of a read R with e errors has percent identity
  of 100*(1 - e/|R|), whereby |R| is the read length. In other words, a
  read is allowed to have not more than |R|*(100-NUM)/100 errors.

  [ -rr NUM ],  [ --recognition-rate NUM ]
  
  Set the percent recognition rate. NUM must be a value between 80 and 100
  (default is 99). The recognition rate controls the sensitivity of RazerS.
  The higher the recognition rate the more sensitive is RazerS. The lower
  the recognition rate the faster runs RazerS. A value of 100 corresponds
  to a lossless read mapping. The recognition rate corresponds to the
  expected fraction of matches RazerS will find compared to a lossless
  mapping. Depending on the desired recogition rate, the percent identity
  and the read length the filter is configured to run as fast as possible.
  Therefore it needs access to files with precomputed filtration settings
  in a 'gapped_params' subfolder which resides in the razers folder. This 
  value is ignored if the shape (-s) or the minimum threshold (-t) is set
  manually.
  
  [ -id ],  [ --indels ]
  
  Consider insertions, deletions and mismatches as errors. By default, only 
  mismatches are recognized.
  
  [ -ll NUM ],  [ --library-length NUM ]
  
  Set the mean library size, default is 220. The library size is the outer
  distance of the two reads of a mate-pair. This value is used only for
  paired-end read mapping.
  
  [ -le NUM ],  [ --library-error NUM ]
  
  Set the tolerated absolute deviation of the library size, default value is
  50. This value is used only for paired-end read mapping.
  
  [ -m NUM ],  [ --max-hits NUM ]
  
  Output at most NUM of the best matches, default value is 100.
  
  [ --unique ]
  
  Output only unique best matches (like ELAND). This flag corresponds to 
  '-m 1 -dr 0 -pa'.
  
  [ -tr NUM ],  [ --trim-reads NUM ]
  
  Trim reads to length NUM.

  [ -o FILE ],  [ --output FILE ]

  Change the output filename to FILE. By default, this is the read file
  name extended by the suffix ".result".

  [ -v ],  [ --verbose ]
  
  Verbose. Print extra information and running times.

  [ -vv ],  [ --vverbose ]

  Very verbose. Like -v, but also print filtering statistics like true and
  false positives (TP/FP).

  [ -V ],  [ --version ]
  
  Print version information.

  [ -h ],  [ --help ]

  Print a brief usage summary.

---------------------------------------------------------------------------
3.2. Output Format Options
---------------------------------------------------------------------------

  [ -a ],  [ --alignment ]

  Dump the alignment for each match in the ".result" file. The alignment is
  written directly after the match and has the following format:
  #Read:   CAGGAGATAAGCTGGATCGTTTACGGT
  #Genome: CAGGAGATAAGC-GGATCTTTTACG--

  [ -pa ],  [ --purge-ambiguous ]
  
  Omit reads with more than #max-hits many matches.
  
  [ -dr NUM ], [ --distance-range NUM ]
  
  If the best match of a read has E errors, only consider hits with 
  E <= X <= E+NUM errors as matches.

  [ -of NUM ], [ --output-format NUM ]
  
  Select the output format the matches should be stored in. See section 4.

  [ -gn NUM ],  [ --genome-naming NUM ]

  Select how genomes are named in the output file. If NUM is 0, the Fasta
  ids of the genome sequences are used (default). If NUM is 1, the genome
  sequences are enumerated beginning with 1.

  [ -rn NUM ],  [ --read-naming NUM ]
  
  Select how reads are named in the output file. If NUM is 0, the Fasta ids 
  of the reads are used (default). If NUM is 1, the reads are enumerated
  beginning with 1. If NUM is 2, the read sequence itself is used.

  [ -so NUM ],  [ --sort-order NUM ]
  
  Select how matches are ordered in the output file. 
  If NUM is 0, matches are sorted primarily by the read number and 
  secondarily by their position in the genome sequence (default).
  If NUM is 1, matches are sorted primarily by their position in the genome
  sequence and secondarily by the read number.

  [ -pf NUM ],  [ --position-format NUM ]
  
  Select how positions are stored in the output file. 
  If NUM is 0, the gap space is used, i.e. gaps around characters are 
  enumerated beginning with 0 and the beginning and end position is the 
  postion of the gap before and after a match (default).
  If NUM is 1, the position space is used, i.e. characters are enumerated 
  beginning with 1 and the beginning and end position is the postion of the 
  first and last character involved in a match.
  
  Example: Consider the string CONCAT. The beginning and end positions 
  of the substring CAT are (3,6) in gap space and (4,6) in position space.

---------------------------------------------------------------------------
3.3. Filtration Options
---------------------------------------------------------------------------

  [ -s BITSTRING ],  [ --shape BITSTRING ]
  
  Define the k-mer shape. BITSTRING must be a sequence of bits beginning
  and ending with 1, e.g. 1111111001101. A '1' defines a relevant and a
  '0' an irrelevant position. Two k-mers are equal, if all characters at 
  relevant postitions are equal.  
  
  [ -t NUM ],  [ --threshold NUM ]

  Depending on the percent identity and the length, for each read a
  threshold of common k-mers between read and reference genome is
  calculated. These thresholds determine the filtratition strictness and are 
  crucial to the overall running time. With this option the threshold values 
  can manually be raised to a minimum value to reduce the running time at 
  cost of the mapping sensitivity. All threshold values smaller than NUM 
  are raised to NUM. The default value is 1.
  
  [ -oc NUM ],  [ --overabundance-cut NUM ]
  
  Remove overabundant read k-mers from the k-mer index. k-mers with a 
  relative abundance above NUM are removed. NUM must be a value between 
  0 (remove all) and 1 (remove nothing, default).

  [ -rl NUM ],  [ --repeat-length NUM ]

  The repeat length is the minimal length a simple-repeat in the 
  genome sequence must have to be filtered out by the repeat masker of 
  RazerS. Simple repeats are tandem repeats of only one repeated   
  character, e.g. AAAAA. Independently of this parameter, N characters in 
  the genome are filtered out automatically. Default value is 1000.

  [ -tl NUM ],  [ --taboo-length NUM ]

  The taboo length is the minimal distance two k-mer must have in the
  reference genome when counting common k-mers between reads and reference
  genome (default is 1).
  
---------------------------------------------------------------------------
3.4. Verification Options
---------------------------------------------------------------------------

  [ -mN ],  [ --match-N ]
  
  By default, 'N' characters in read or genome sequences equal to nothing, 
  not even to another 'N'. They are considered as errors. By activating this 
  option, 'N' equals to every other character and produces no mismatch in 
  the verification process. The filtration is not affected by this option.

  [ -ed FILE ],  [ --error-distr FILE ]
  
  Produce an error distribution file containing the relative frequencies of
  mismatches for each read position. If the "--indels" option is given, the
  relative frequencies of insertions and deletions are also recorded.


---------------------------------------------------------------------------
4. Output Formats
---------------------------------------------------------------------------

RazerS supports currently 4 different output formats selectable via the
"--output-format NUM" option. The following values for NUM are possible:

  0 = Razer Format
  1 = Enhanced Fasta Format
  2 = Eland Format
  3 = General Feature Format (GFF)
 
---------------------------------------------------------------------------
4.1. Razer Format
---------------------------------------------------------------------------

The output file is a text file whose lines represent matches. A line 
consists of different tab separated match values in the following format:

RNAME RBEGIN REND GSTRAND GNAME GBEGIN GEND PERCID [PAIRID PAIRSCR MATEDIST]

Match value description:

  RNAME        Name of the read sequence (see --read-naming)
  RBEGIN       Beginning position in the read sequence (0/1 see -pf option)
  REND         End position in the read sequence (length of the read)
  GSTRAND      'F'=forward strand or 'R'=reverse strand
  GNAME        Name of the genome sequence (see --genome-naming)
  GBEGIN       Beginning position in the genome sequence
  GEND         End position in the genome sequence
  PERCID       Percent identity (see --percent-identity)

For paired-end read mapping 3 additional values are dumped:

  PAIRID       Unique number to identify the two corresponding mate matches
  PAIRSCR      The sum of the negative number of errors in both matches
  MATEDIST     Relative outer distance to the mate match
               The absolute value is the insert size

For matches on the reverse strand, GBEGIN and GEND are positions on the 
related forward strand. It holds GBEGIN < GEND, regardless of GSTRAND.

---------------------------------------------------------------------------
4.2. Enhanced Fasta Format
---------------------------------------------------------------------------

The matches are stored in the same order as in the Razer format. Each match
is stored in two lines:

>GBEGIN,GEND[KEY1=VALUE1,KEY2=VALUE2,...]
READSEQ

Match value description:

  GBEGIN       Beginning position in the genome sequence
  GEND         End position in the genome sequence
  READSEQ      Read sequence.

The following keys are output:

  id           ID value of the input file Fasta header (>..[id=ID,..]..)
  fragId       Fragment ID value (>..[..,fragId=FRAGID,..]..)
  contigId     Name of the genome sequence (see --genome-naming)
  errors       Absolute numbers of errors in this match
  percId       Percent identity (see --percent-identity)
  ambiguity    Number of matches of this read as good as or better than this

If the ID or fragment ID values of a read couldn't be found in the reads file
the read number (beginning with 0) is used instead. For matches on the
reverse strand, GBegin and GEnd are positions on the related forward strand
and GBEGIN > GEND.

---------------------------------------------------------------------------
4.3. Eland Format
---------------------------------------------------------------------------

Each line of the output file corresponds to a read appearing in the same 
order as they are stored in the reads file. A line consists of the following
tab separated values:

>RNAME READSEQ TYPE N0 N1 N2 GNAME GBEGIN* GSTRAND '..' SUBST1 SUBST2 ...

Additional value description:

  TYPE         NM = No match found
               QC = No matching done (too many Ns in read sequence)
               Ux = Best match found was unique with x errors
               Rx = Multiple best matches found having x errors
  N0 N1 N2     Number of exact, 1-error, and 2-error matches
  GBEGIN*      Minimum of GBEGIN and GEND
  SUBSTx       Position and type of the x'th mismatch (not for --indels)
               (e.g. 12A: 12'th base was A in the genome)

---------------------------------------------------------------------------
4.4. General Feature Format
---------------------------------------------------------------------------

The General Feature Format is specified by the Sanger Institute as a tab-
delimited text format with the following columns:

<seqname> <src> <feat> <start> <end> <score> <strand> <frame> [attr] [cmts]

See also: http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml
Consistent with this specification razers GFF output looks as follows:

GNAME razers read GBEGIN GEND PERCID GSTRAND . ATTRIBUTES

Match value description:

  GNAME        Name of the genome sequence (see --genome-naming)
  razers       Constant
  read         Constant
  GBEGIN       Beginning position in the genome sequence 
               (positions are counted from 1)
  GEND         End position in the genome sequence (included!)
  PERCID       Percent identity (see --percent-identity)
  GSTRAND      '+'=forward strand or '-'=reverse strand
  .            Constant
  ATTRIBUTES   A list of attributes in the format <tag_name>[=<tag>] 
               separated by ';'

Attributes are: 

  ID=          Name of the read
  quality=     Ascii coded quality values of the read
  cigar=       Read-reference alignment description in cigar format*
  mutations=   Positions and bases that differ from the reference
               with respect to the read (counting from 1)
  unique       This is the best read match and it is unique
  multi        This is one of multiple best machtes 
  suboptimal   This is a suboptimal read match

The original read sequence can be retrieved using the genomic subsequence 
and the information contained in the 'cigar' and 'mutations' tags. 

For matches on the reverse strand, GBEGIN and GEND are positions on the 
related forward strand. It holds GBEGIN < GEND, regardless of GSTRAND.

*http://may2005.archive.ensembl.org/Docs/wiki/html/EnsemblDocs/CigarFormat.html

---------------------------------------------------------------------------
5. Example
---------------------------------------------------------------------------

---------------------------------------------------------------------------
5.1. Single read mapping
---------------------------------------------------------------------------

There are example read and genome files in the folder in snapshot/apps/
razers/example containing 3 27bp reads and two short contigs. The 3
reads and their reverse-complements were implanted with errors into the 
genome. To map the example reads onto the genome do the following:

  1)  cd snapshot/apps/razers
  2)  ./razers example/genome.fa example/reads.fa -id -a -mN -v
  3)  less example/reads.fa.result

On success, RazerS dumped the resulting matches with their corresponding 
semi-global alignments into the file example/reads.fa.result:

read1   0   27  F   contig1 47  73  92.593
#Read:   AATTGAATGAGGTCTTGCAGCCATGGC
#Genome: AATTGAATGACGTC-TGCAGCCATGGC
read1   0   27  R   contig2 300 328 92.593
#Read:   AATTGAATGAGGTCTT-GCAGCCATGGC
#Genome: AATTGAATGAGGTCTTCGCAGTCATGGC
read2   0   27  R   contig2 228 255 96.296
#Read:   CAGGAGATAAGCTGGATCGTTTACGGT
#Genome: CAGGAGATAAGCTGGATCGTTTACAGT
read3   0   27  F   contig2 335 362 100
#Read:   GCCATTAGAGGCCACCACACCAGACGT
#Genome: GCCATTAGAGGCCACCACACCAGNNNN

If alignments are not needed '-a' can be omited resulting in:

read1   0   27  F   contig1 47  73  92.593
read1   0   27  R   contig2 300 328 92.593
read2   0   27  R   contig2 228 255 96.296
read3   0   27  F   contig2 335 362 100

---------------------------------------------------------------------------
5.2. Paired-end read mapping
---------------------------------------------------------------------------

To demonstrate how paired-end read mapping works, we provide a read file
with mates of the reads in the example above. To run RazerS in paired-end
mode you simply have to add the second read file to the command line.

  1)  cd snapshot/apps/razers
  2)  ./razers example/genome.fa example/reads.fa example/reads2.fa -id -mN
  3)  less example/reads.fa.result
  
read1/L 0   27  F   contig1 47  73  92.593  1   -3  217
read1/R 0   27  R   contig1 236 264 96.296  1   -3  -217
read2/L 0   27  R   contig2 228 255 96.296  3   -1  -207
read2/R 0   27  F   contig2 48  75  100     3   -1  207
read3/L 0   27  F   contig2 335 362 100     2   0   221
read3/R 0   27  R   contig2 529 556 100     2   0   -221

In this short example the library sizes vary between 207bp and 221bp. The
default library size of RazerS is 220bp with a tolerance of 50bp, i.e. all
libraries between 170bp and 270bp are recognized. To alter the library size
settings use the -ll and -le options.

The pair score value in the second last column is the negative sum of 
errors in both matches and the third last column is a unique number to 
identify the two corresponding mate matches. Remember that RazerS can 
output more than one match per mate-pair and two corresponding mate matches
are not necessarily in consecutive lines in the output file, e.g. when 
altering the sort-order with the -so option.

---------------------------------------------------------------------------
6. Contact
---------------------------------------------------------------------------

For questions or comments, contact:
  David Weese <weese@inf.fu-berlin.de>
  Anne-Katrin Emde <emde@inf.fu-berlin.de>