reformat.sh producing output with 4 reads having the same id
BBMap short read aligner, and other bioinformatic tools.
Brought to you by:
brian-jgi
samtools fastq 10ng-TS-2.100k.bam \
| reformat.sh in=stdin.fq interleaved=true out=stdout.sam samplerate=1 sampleseed=4 forcetrimright=98 \
| samtools view -@3 -b - -o 10ng-TS-2.100k.99bp.bam
This produces an invalid output bam with a sneaky but that I did not detect until attemptint to realign these reads (dups were silently dropped).
I also tried 0.999999 for the samplerate, which produced the same invalid bam.
$ samtools view 10ng-TS-2.100k.bam | grep A00336:A00336:HCVHJDMXX:2:2488:7835:5353
A00336:A00336:HCVHJDMXX:2:2488:7835:5353 1187 chr1 517107 0 101M = 517212 206 CACAATAAAAAATAATATCTAAAACTCAAATATATTAAAATCCCCATACCTAAACTATAAATACTAAATAAAATTTACTTATCCATCCATTTTCTATATTC FFFFF:FFFFFFFFFFFFFFFFFFFFFF,::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFF,FFFFFFFF:FF NM:i:0 MD:Z:101 AS:i:101 XS:i:101 RG:Z:TGACCAATTCTTTCCC YC:Z:GA YD:Z:r
A00336:A00336:HCVHJDMXX:2:2488:7835:5353 1107 chr1 517212 0 101M = 517107 -206 ACTAAAAAACTAAAAACAATACTTATTCTCAAAAAAATCTTCAACTTAAATAACTCTATAAAAAAAAAATTACATCATTAAAAAATCATCGCAAATCAAAT FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFF NM:i:0 MD:Z:101 AS:i:101 XS:i:101 RG:Z:TGACCAATTCTTTCCC YC:Z:CT YD:Z:r
$ samtools view 10ng-TS-2.100k.99bp.bam | grep A00336:A00336:HCVHJDMXX:2:2488:7835:5353
A00336:A00336:HCVHJDMXX:2:2488:7835:5353 77 * 0 0 * * 0 0 CACAATAAAAAATAATATCTAAAACTCAAATATATTAAAATCCCCATACCTAAACTATAAATACTAAATAAAATTTACTTATCCATCCATTTTCTATAT FFFFF:FFFFFFFFFFFFFFFFFFFFFF,::FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFF,FFFFFFFF:
A00336:A00336:HCVHJDMXX:2:2488:7835:5353 141 * 0 0 * * 0 0 TTTTTTTGAGAATAAGTATTGTTTTTAGTTTTTTAGTGTTGGAATATAGAAAATGGATGGATAAGTAAATTTTATTTAGTATTTATAGTTTAGGTATGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
A00336:A00336:HCVHJDMXX:2:2488:7835:5353 77 * 0 0 * * 0 0 ATTTGATTTGCGATGATTTTTTAATGATGTAATTTTTTTTTTATAGAGTTATTTAAGTTGAAGATTTTTTTGAGAATAAGTATTGTTTTTAGTTTTTTA FFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFF
A00336:A00336:HCVHJDMXX:2:2488:7835:5353 141 * 0 0 * * 0 0 ATATAAGCCACCTCACCTAACCTACAACAATTTTTCAATAATATAATTTCTCTTTTACAAAACCACCTAAACTAAAAATTCCCTTAAAAACAAATACTA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FF:FFF,FFFFFFF,FFFFFFFF:FFFFFFFFFFFFF
Hmmm... looks like the issue here is that the intermediate fastq file is not really interleaved. Sam/bam files are not interleaved in the first place and should never be treated as such; the order of records is arbitrary and reformatting them as fastq (with samtools) doesn't change that. If you force Reformat to process a noninterleaved file as interleaved, strange things will happen; in this case, the records for that pair are nonadjacent and that causes the header replication, as per the sam specification (pairs have to get the same header in sam output even if they had different headers in fastq format).
Anyway, this operation works as expected as long as the interleaved flag is correctly set to false.