Benchmarking Short Sequence Mapping Tools
General information
The development of next-generation sequencing instruments has led to the generation of millions of
short sequences in a single run. The process of aligning these reads to a reference genome is time
consuming and demands the development of fast and accurate alignment tools. However, the current
proposed tools make different compromises between the accuracy and the speed of mapping. Moreover,
many important aspects are overlooked while comparing the performance of a newly developed
tool to the state of the art. Therefore, there is a need for an objective evaluation method that covers all
the aspects. In this work, we introduce a benchmarking suite to extensively analyze sequencing tools
with respect to various aspects and provide an objective comparison.
Information about the tools and the options we used in the experiments are shown in the following. In addition,
the code used to verify the tools is included.
Related Publications
-
A. Hatem, D. Bozdag, A. E. Toland, U. V. Catalyurek
"Benchmarking short sequence mapping tools"
BMC Bioinformatics, 14(1):184, 2013.
Experimental setup
The experiments in this study can be repeated following three
major steps: getting the reference genomes, generating the synthetic
data sets, and choosing the right options for the tools. Each one of
these are described in detail below. If needed, instead of
regenerating datasets, you can also download them from here.
- Getting the reference genomes:
The reference genomes are available online from different websites.
For our experiments, we downloaded the genomes from UCSC Genome Bioinformatics Center
(http://genome.ucsc.edu/).
There are also other available reference genomes.
- Generating the synthetic data:
To generate the synthetic data, we used wgsim, which is part of SAMtools package
( http://samtools.sourceforge.net/ ).
You can use different options for wgsim to
emulate the base error rate, the mutation rate, and percentage of indels beside other options.
For our experiments, we used the following options with the following values:
-e 0.02 (default value), -r 0.0009, -R 0.0001
In addition, based on every experiments, we changed -N (total number of reads)
and -1 and -2 (length for the first and second read).
In addition to wgsim, we used ART (
http://www.niehs.nih.gov/research/resources/software/biostatistics/art/ )
to generate reads with a varying sequencing error rate. The default options were used.
- Choosing the right options for the tools:
It is important to disable and enable the right options beside choosing the right values for them.
In the following, we show
how the different options are used to run a fair comparison.
It is important to note that in all of the experiments we used
pMap on a single node to provide the execution time for some of the tools;
not all of the tools provide the execution time.
Therefore, we first explain what pMap is and how to use it.
Then, we mention the scripts we used to run pMap with the different tools.
pMap:
pMap (
http://bmi.osu.edu/hpc/software/pmap/pmap.html) is an open-source
implementation of MPI-based tool that enable
parallelization of existing short sequence mapping tools.
Currently, it supports the following tools:
Bowtie, BWA, SOAP, GSNAP, MAQ, and RMAP.
In addition, it can be extended easily to integrate other tools.
The followings are the main commands to use pMap:
pmap_index $genomefile $indexdir $indexprefix $programname
pmap_dist $workdir $outdir $readsfile [-r $readfile2]
pmap [-pe](paired end) -i $indexdir $indexprefix $workdir $outdir $programname $options
- Bowtie
First of all, before calling the mapping program,
bowtie needs the bowtie_indexes environment variable to contain
the location of the reference genome index.
BOWTIE_INDEXES=/home/dayat/out-bowtie/index/lancelet; export BOWTIE_INDEXES
The experiments specific options are as follows:
- Quality threshold: -e 140 -n 2 -l 28 -S
- Number of mismatches: -n 2 -l 28 -S -e (40, 60, 80, 100, 120, 140)
- Seed length: -n 2 -l (20, 24, 28, 32, 36) -e 100 -S
- Read length: -n 2 -l 28 -e 100 -S
- Paired end: -n 2 -l 28 -e 100 -S -I 0 -X 500
- Genome type: -n 2 -l 28 -e 100 -S
- Performance: -n 2 -l 28 -e 100 -S -p (2, 4, 8)
- Bowtie2
Bowtie2 is not supported by pMap. To run Bowtie2, the following command is used:
Bowtie2 -t --ignore-quals $indexdir/indexprefix -U $readfile -S $workdir/out.txt
The experiments specific options are as follow:
- Quality threshold: --score-min L,-21,0 --mp 3,3 --gbar 125
- Number of mismatches: --score-min L,-(6, 9, 12, 15, 18, 21),0 --mp 3,3 --gbar 125
- Read length: --score-min L,-15,0 --mp 3,3 --gbar (36, 70, 125)
- Paired end: --score-min L,-15,0 --mp 3,3 (for ungapped --gbar 70) --no-discordant --no-mixed
- Genome type: - --score-min L,-15,0 --mp 3,3 --gbar 125
- Gaped alignment: --score-min L,-15,0 --mp 3,3
- Performance: -p 2 --score-min L,-15,0 --mp 3,3 --gbar 125
- BWA
Experiments specific options:
- Quality threshold: -n 5 -l 28 -k 2 -o 0
- Number of mismatches: -l 28 -k 2 -o 0 -n (2, 3, 4, 5, 6, 7)
- Seed length: -n 5 -l (20, 24, 28, 32, 36) -k 2 -o 0
- Read length: -n 5 -l 28 -k 2 -o 0
- Paired end: -n 5 -l 28 -k 2 -o 0 (or -o 1 -e 3 for gaped alignment) -conversion -a 500 -s(disable Smith-Waterman alignment)
- Genome type: -n 5 -l 28 -k 2 -o 0
- Gaped alignment: -n 5 -l 28 -k 2 -o 1 -e 3
- Performance: I-n 5 -l 28 -k 2 . o 0 -t(2, 4, 8)
- SOAP2
Experiments specific options:
- Quality threshold: -l 28 -v 7
- Number of mismatches: -l 28 -v (2, 3, 4, 5, 6, 7)
- Seed length: -l (28, 32, 36) -v 5
- Read length: -l 28 -v 5
- Paired end: -l 28 -v 5 -m 0 -x 500 -2 outfile:unpaired.txt
- Genome type: -l 28 -v 5
- Gaped alignment (only for paired end): -g 3
- Performance: -l 28 -v 5 -p (2, 5, 8)
- GSNAP
Experiments specific options:
- Quality threshold: -m 7 -n 1 -w 0 -T 0 -A sam
- Number of mismatches: -m (2, 3, 4, 5, 6, 7) -n 1 -w 0 -T 0 -A sam
- Seed length: does not support seeding
- Read length: -m 5 -n 1 -w 0 -T 0 -A sam
- Paired end: -m 5 -n 1 -w 0 -T 0 -A sam ( -i 0 -y 3 -Y 3 -z 3 -Z 3 for gaped alignment)
- Genome type: -m 5 -n 1 -w 0 -T 0 -A sam
- Gaped alignment: -m 5 -n 1 -w 0 -T 0 -A sam -i 0 -y 3 -Y 3 -z 3 -Z 3
- Performance: -m 5 -n 1 -w 0 -T 0 -A sam -t (2, 4, 8)
- Novoalign
Novoalign is not yet supported by pMap. Therefore, to run Novoalign, call the following command:
novoalign -f $readsfile -o SAM -d $indexdir/$indexprefix > $workdir/out.txt
Experiments specific options:
- Quality threshold: -t 154 -o SAM -o FullNW -g 99 -x 99 -r Random
- Number of mismatches: -t (44, 66, 88, 110, 132, 154) -o SAM -o FullNW -g 99 -x 99 -r Random
- Read length: -t 110 -o SAM -o FullNW -g 99 -x 99
- Paired end: -t 110 -o SAM -o FullNW (for ungapped alignment -g 99 -x 99) -i 500 50
- Genome type: -t 110 -o SAM -o FullNW -g 99 -x 99
- Gaped alignment: -t 110 -o SAM -o FullNW
- MAQ
Experiments specific options:
- Quality threshold: -e 140
- Number of mismatches: -e (40, 60, 80, 100, 120, 140)
- Read length: -e 100
- Read length: -e 100
- Genome type: -e 100
- RMAP
Experiments specific options:
- Quality threshold: -m 7 -w 125 -a outfile:amb.txt
- Number of mismatches: -m (2, 3, 4, 5, 6, 7) -w 125
- Read length: -m 5 -w (36, 70, 125, 200, 300)
- Paired end: -m 5 -w 70 -min-sep 0 -max-sep 500 -a outfile:amb.txt
- Genome type: -m 5 -w 70 -a outfile:amb.txt
- FANGS
FANGS is not yet supported by pMap. Therefore, to run FANGS, call the following command:
fangs $indexdir/$indexprefix $readsfile -out=pslx $workdir/out.txt
FANGS is used only in one experiment (Read length) and there is no specific options needed
to use for this experiment. In addition, by default, it allows five mismatches.
Download
Latest release: verifTools.tar
Genomes
Human genome: human.fa.gz (905MB)
Chimp genome: chimp.fa.gz (795MB)
Mouse genome: mouse.fa.gz (801MB)
Zebrafish genome: zebrafish.fa.gz (395MB)
Lancelet genome: lancelet.fa.gz (265MB)
Mellifera genome: mellifera.fa.gz (72MB)
C.Elegans genome: elegans.fa.gz (31MB)
wgsim synthetic data
Human data
Read length 36, paired end: human.36.1M.1.fq.gz (26MB), human.36.1M.2.fq.gz (26MB)
Read length 70, paired end: human.70.1M.1.fq.gz (37MB), human.70.1M.2.fq.gz (37MB)
Read length 125, paired end: human.125.1M.1.fq.gz (53MB), human.125.1M.2.fq.gz (53MB)
Read length 200, paired end: human.200.1M.1.fq.gz (75MB), human.200.1M.2.fq.gz (75MB)
Read length 300, paired end: human.300.1M.1.fq.gz (105MB), human.300.1M.2.fq.gz (105MB)
Read length 500, paired end: human.500.1M.1.fq.gz (161MB), human.500.1M.2.fq.gz (161MB)
Chimp data
Read length 70, paired end: chimp.1M.1.fq.gz (36MB), chimp.1M.2.fq.gz (36MB)
Mouse data
Read length 70, paired end: mouse.1M.1.fq.gz (37MB), mouse.1M.2.fq.gz (37MB)
Zebrafish data
Read length 70, paired end: zebrafish.1M.1.fq.gz (38MB), zebrafish.1M.2.fq.gz (38MB)
Lancelet data
Read length 70, paired end: lancelet.1M.1.fq.gz (37MB), lancelet.1M.2.fq.gz (37MB)
Mellifera data
Read length mellifera, paired end: mellifera.1M.1.fq.gz (24MB), mellifera.1M.2.fq.gz (24MB)
C.Elegans data
Read length 70, paired end: cElegans.1M.1.fq.gz (38MB), cElegans.1M.2.fq.gz (38MB)
ART synthetic data
Human, read length 70: human.70.1M.fq.gz (74MB)
Human, read length 100: human.100.1M.fq.gz (97MB)
Chimp, read length 70: chimp.1M.fq.gz (74MB)
Mouse, read length 36: mouse.36.1M.fq.gz (43MB)
Mouse, read length 70: mouse.70.1M.fq.gz (77MB)
Mouse, read length 100: mouse.100.1M.fq.gz (100MB)
Zebrafish, read length 70: zebrafish.1M.fq.gz (75MB)
Lancelet, read length 70: lancelet.1M.fq.gz (74MB)
Mellifera, read length 70: mellifera.1M.fq.gz (64MB)
C.Elegans, read length 70: cElegans.1M.fq.gz (75MB)