TDAlab @ GA Tech.

Software

Benchmarking Short Sequence Mapping Tools

General information

Next-generation sequencing instruments generate millions of short sequences (reads) in a single run. Aligning these reads to a reference genome is time consuming and demands fast and accurate alignment tools. However, the currently proposed tools make different compromises between mapping accuracy and speed. Moreover, many important aspects are overlooked when the performance of a newly developed tool is compared to the state of the art. There is therefore a need for an objective evaluation method that covers all of these aspects. In this work, we introduce a benchmarking suite that extensively analyzes short sequence mapping tools with respect to various aspects and provides an objective comparison. The tools and the options we used in the experiments are listed below. The code used to verify the tools is also included.

Related Publications

A. Hatem, D. Bozdag, A. E. Toland, U. V. Catalyurek "Benchmarking short sequence mapping tools" BMC Bioinformatics, 14(1):184, 2013.

Experimental setup

The experiments in this study can be repeated in three major steps: getting the reference genomes, generating the synthetic data sets, and choosing the right options for the tools. Each of these is described in detail below. If needed, instead of regenerating the data sets, you can also download them from here.

Getting the reference genomes:

http://genome.ucsc.edu/

Generating the synthetic data:

http://samtools.sourceforge.net/

http://www.niehs.nih.gov/research/resources/software/biostatistics/art/
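As a rough illustration of the data generation step, the sketch below prints (without running) one command for each of the two simulators linked above: wgsim, which is distributed with samtools, and ART. The helper name print_sim_cmds, the read count, the coverage, the read length and all file names are assumptions made for this example; the options actually used in the study are not listed on this page.

```shell
#!/bin/sh
# Dry-run sketch: print read-simulation commands instead of running them.
# All values below (read count, coverage, file names) are illustrative
# assumptions, not the settings of the original experiments.

# print_sim_cmds REF LEN PREFIX
#   REF    reference FASTA (for example, downloaded from genome.ucsc.edu)
#   LEN    read length in bases
#   PREFIX prefix for the generated read files
print_sim_cmds() {
  ref=$1
  len=$2
  prefix=$3
  # wgsim: -N number of read pairs, -1/-2 length of first/second read.
  printf 'wgsim -N 1000000 -1 %s -2 %s %s %s_1.fq %s_2.fq\n' \
    "$len" "$len" "$ref" "$prefix" "$prefix"
  # ART: -i reference, -l read length, -f fold coverage, -o output prefix.
  printf 'art_illumina -i %s -l %s -f 10 -o %s_art\n' \
    "$ref" "$len" "$prefix"
}

print_sim_cmds ref.fa 36 sim
```

Once the simulators are installed, the printed lines can be reviewed and then piped to sh to execute them.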

Choosing the right options for the tools:

pMap:

http://bmi.osu.edu/hpc/software/pmap/pmap.html

pmap_index $genomefile $indexdir $indexprefix $programname

pmap_dist $workdir $outdir $readsfile [-r $readfile2]

pmap [-pe](paired end) -i $indexdir $indexprefix $workdir $outdir $programname $options

Bowtie

BOWTIE_INDEXES=/home/dayat/out-bowtie/index/lancelet; export BOWTIE_INDEXES

Quality threshold: -e 140 -n 2 -l 28 -S
Number of mismatches: -n 2 -l 28 -S -e (40, 60, 80, 100, 120, 140)
Seed length: -n 2 -l (20, 24, 28, 32, 36) -e 100 -S
Read length: -n 2 -l 28 -e 100 -S
Paired end: -n 2 -l 28 -e 100 -S -I 0 -X 500
Genome type: -n 2 -l 28 -e 100 -S
Performance: -n 2 -l 28 -e 100 -S -p (2, 4, 8)
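The parenthesized lists above denote one run per value. As a minimal sketch of how the "Number of mismatches" row might be expanded, the helper below prints one Bowtie command line per -e value. The function name bowtie_e_sweep, the reads file and the output naming are hypothetical; the index name is resolved through BOWTIE_INDEXES as set above.

```shell
#!/bin/sh
# Dry-run sketch: print one Bowtie command per -e value of the
# "Number of mismatches" experiment. File names are placeholders.

# bowtie_e_sweep INDEX READS OUTDIR
#   INDEX  index basename (looked up in $BOWTIE_INDEXES by bowtie)
#   READS  FASTQ file with the reads
#   OUTDIR directory for the SAM output
bowtie_e_sweep() {
  index=$1
  reads=$2
  outdir=$3
  for e in 40 60 80 100 120 140; do
    printf 'bowtie -n 2 -l 28 -e %s -S %s %s %s/bowtie_e%s.sam\n' \
      "$e" "$index" "$reads" "$outdir" "$e"
  done
}

bowtie_e_sweep lancelet reads.fq out
```

The same loop shape applies to the seed length (-l) and thread count (-p) rows by changing the varied option.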

Bowtie2

bowtie2 -t --ignore-quals $indexdir/$indexprefix -U $readfile -S $workdir/out.txt

Quality threshold: --score-min L,-21,0 --mp 3,3 --gbar 125
Number of mismatches: --score-min L,-(6, 9, 12, 15, 18, 21),0 --mp 3,3 --gbar 125
Read length: --score-min L,-15,0 --mp 3,3 --gbar (36, 70, 125)
Paired end: --score-min L,-15,0 --mp 3,3 (for ungapped --gbar 70) --no-discordant --no-mixed
Genome type: --score-min L,-15,0 --mp 3,3 --gbar 125
Gapped alignment: --score-min L,-15,0 --mp 3,3
Performance: -p 2 --score-min L,-15,0 --mp 3,3 --gbar 125
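One possible reading of the --score-min values above: with --mp 3,3 every mismatch costs 3, and with L,-x,0 the minimum score is the constant -x, so a threshold of -3k admits up to k mismatches (-6 for 2, up to -21 for 7). The sketch below derives the thresholds from that arithmetic and prints the matching commands. This interpretation, the function name bowtie2_mismatch_sweep and the file names are assumptions. The -x flag naming the index follows the Bowtie2 manual and does not appear in the command line shown above.

```shell
#!/bin/sh
# Dry-run sketch: derive --score-min from a mismatch budget k, assuming
# each mismatch costs the --mp penalty (3) and the threshold is constant.

# bowtie2_mismatch_sweep INDEX READS OUTDIR
bowtie2_mismatch_sweep() {
  index=$1
  reads=$2
  outdir=$3
  penalty=3   # taken from --mp 3,3
  for k in 2 3 4 5 6 7; do
    min_score=$((penalty * k))   # 6, 9, 12, 15, 18, 21
    printf 'bowtie2 -t --ignore-quals --score-min L,-%s,0 --mp 3,3 --gbar 125 -x %s -U %s -S %s/bowtie2_k%s.sam\n' \
      "$min_score" "$index" "$reads" "$outdir" "$k"
  done
}

bowtie2_mismatch_sweep idx reads.fq out
```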

BWA

Quality threshold: -n 5 -l 28 -k 2 -o 0
Number of mismatches: -l 28 -k 2 -o 0 -n (2, 3, 4, 5, 6, 7)
Seed length: -n 5 -l (20, 24, 28, 32, 36) -k 2 -o 0
Read length: -n 5 -l 28 -k 2 -o 0
Paired end: -n 5 -l 28 -k 2 -o 0 (or -o 1 -e 3 for gapped alignment); conversion step: -a 500 -s (disables Smith-Waterman alignment)
Genome type: -n 5 -l 28 -k 2 -o 0
Gapped alignment: -n 5 -l 28 -k 2 -o 1 -e 3
Performance: -n 5 -l 28 -k 2 -o 0 -t (2, 4, 8)
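BWA's aln-based workflow runs in two stages: bwa aln writes a .sai file, which is then converted to SAM with bwa samse (single end) or bwa sampe (paired end). The sketch below prints where the options listed above would go, assuming the -n, -l, -k and -o options belong to bwa aln and the -a and -s options to bwa sampe, and that the reference has already been indexed with bwa index. The function names and file names are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print the two-stage BWA commands. Assumes the reference
# was indexed beforehand with "bwa index"; file names are placeholders.

# bwa_single_cmds REF READS PREFIX
bwa_single_cmds() {
  ref=$1
  reads=$2
  prefix=$3
  # Stage 1: alignment, writes suffix-array coordinates to a .sai file.
  printf 'bwa aln -n 5 -l 28 -k 2 -o 0 %s %s > %s.sai\n' \
    "$ref" "$reads" "$prefix"
  # Stage 2: conversion of the .sai file to SAM.
  printf 'bwa samse %s %s.sai %s > %s.sam\n' \
    "$ref" "$prefix" "$reads" "$prefix"
}

# bwa_paired_cmds REF READS1 READS2 PREFIX
bwa_paired_cmds() {
  ref=$1
  reads1=$2
  reads2=$3
  prefix=$4
  # Each mate file is aligned separately.
  printf 'bwa aln -n 5 -l 28 -k 2 -o 0 %s %s > %s_1.sai\n' \
    "$ref" "$reads1" "$prefix"
  printf 'bwa aln -n 5 -l 28 -k 2 -o 0 %s %s > %s_2.sai\n' \
    "$ref" "$reads2" "$prefix"
  # sampe: -a maximum insert size, -s disables Smith-Waterman mate rescue.
  printf 'bwa sampe -a 500 -s %s %s_1.sai %s_2.sai %s %s > %s.sam\n' \
    "$ref" "$prefix" "$prefix" "$reads1" "$reads2" "$prefix"
}

bwa_single_cmds ref.fa reads.fq run
bwa_paired_cmds ref.fa r1.fq r2.fq run
```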

SOAP2

Quality threshold: -l 28 -v 7
Number of mismatches: -l 28 -v (2, 3, 4, 5, 6, 7)
Seed length: -l (28, 32, 36) -v 5
Read length: -l 28 -v 5
Paired end: -l 28 -v 5 -m 0 -x 500 -2 outfile:unpaired.txt
Genome type: -l 28 -v 5
Gapped alignment (only for paired end): -g 3
Performance: -l 28 -v 5 -p (2, 5, 8)

GSNAP

Quality threshold: -m 7 -n 1 -w 0 -T 0 -A sam
Number of mismatches: -m (2, 3, 4, 5, 6, 7) -n 1 -w 0 -T 0 -A sam
Seed length: does not support seeding
Read length: -m 5 -n 1 -w 0 -T 0 -A sam
Paired end: -m 5 -n 1 -w 0 -T 0 -A sam (-i 0 -y 3 -Y 3 -z 3 -Z 3 for gapped alignment)
Genome type: -m 5 -n 1 -w 0 -T 0 -A sam
Gapped alignment: -m 5 -n 1 -w 0 -T 0 -A sam -i 0 -y 3 -Y 3 -z 3 -Z 3
Performance: -m 5 -n 1 -w 0 -T 0 -A sam -t (2, 4, 8)

Novoalign

novoalign -f $readsfile -o SAM -d $indexdir/$indexprefix > $workdir/out.txt

Quality threshold: -t 154 -o SAM -o FullNW -g 99 -x 99 -r Random
Number of mismatches: -t (44, 66, 88, 110, 132, 154) -o SAM -o FullNW -g 99 -x 99 -r Random
Read length: -t 110 -o SAM -o FullNW -g 99 -x 99
Paired end: -t 110 -o SAM -o FullNW (for ungapped alignment -g 99 -x 99) -i 500 50
Genome type: -t 110 -o SAM -o FullNW -g 99 -x 99
Gapped alignment: -t 110 -o SAM -o FullNW