pMap is an MPI-based tool to parallelize the alignment step of state-of-the-art sequence mapping programs. It allows transparent execution of the alignment step of a selected program in parallel on a compute-cluster. The results are the same as those from a sequential run of the selected mapping program, except for differences due to random choices within the mapping program. The parallel execution using pMap is approximately NxP times faster, where N is the number of nodes and P is the number of cores/threads per node. Currently, pMap supports BWA, SOAP, Bowtie, GSNAP, MAQ and RMAP.
pMap divides the workload by assigning approximately R/N reads to each node, where R is the number of reads provided in a formatted input file (e.g. FASTA or FASTQ). Moreover, it replicates the pre-built index file(s) associated with the reference genome on each node. The selected sequence mapping program is executed independenty to map the reads assigned to each node, and finally, the output files are combined and written to a desired location. pMap consists of three subprograms corresponding to different steps of the parallel mapping process: pmap_index, pmap_dist and pmap.
Latest release: pMap-11-25-2010.tar.gz
After extracting the archive, change the default compilers mpicc and gcc in the Makefile, if desired. Then, run make to create three executables: pmap_index, pmap_dist and pmap. Make sure that the mapping programs to be used with pMap are included in the $PATH.
pmap_index
pmap_index <genomefile> <indexdir> <indexprefix> <program> [programpars]
genomefile: file that contains the
genome/chromosome sequences in FASTA format.
indexdir:
directory in which output index files will be written
indexprefix: the prefix of the
index files in indexdir
program: the sequence mapping
program to be called (bowtie, bwa or soap)
programpars: additional parameters
to be passed to program
When pmap_index is called:
#!/bin/bash genomefile=/home/myuser/mygenome.fa indexdir=/home/myuser/myindexdir indexprefix=myindexprefix pmap_index $genomefile $indexdir $indexprefix bowtie #Sequential equivalent is: bowtie-build $genomefile $indexdir/$indexprefix |
#!/bin/bash genomefile=/home/myuser/mygenome.fa indexdir=/home/myuser/myindexdir indexprefix=myindexprefix pmap_index $genomefile $indexdir $indexprefix bwa -a bwtsw #Sequential equivalent is: bwa index -a bwtsw -p $indexdir/$indexprefix $genomefile |
pmap_dist
pmap_dist <workdir> <outdir> <readsfile> [-r readsfile2] [-l num_lines_per_read] [-i indexdir indexprefix] [-f inpfile]
workdir: local directory in which intermediate files will be written
outdir: directory on the head node
in which output files will be written
readsfile: the reads file on
the head node that contains the reads to be mapped
readsfile2: the second reads file on
the head node needed when mapping paired end reads
num_lines_per_read: number of
lines per read in readsfile and readsfile2
indexdir: shared directory in which index files are stored
indexprefix: the prefix of the index files in indexdir
inpfile: an input file on
the head node that contains additional information needed
during mapping
When pmap_dist is called:
Please also see pmap examples below.
pmap
pmap [options] <workdir> <outdir> <program> [programpars] [-conversion <convpars>]
options: pmap has the following optional parameters:
-pe: paired end mapping. readsfile2
must have been distributed by pmap_dist if needed by the program.
-dc: don't cleanup intermediate files in workdir after the execution of pmap completes. -i <indexdir> <indexprefix>: use index files from a shared directory. The index files are in indexdir and file names start with indexprefix. If -i is not specified, index files must have been replicated using -i option of pmap_dist | |
workdir: local work directory created
by pmap_dist
outdir: directory on the head node
in which output files will be written
program: the sequence mapping
program to be called (bowtie, bwa, soap, gsnap, maq or rmap)
programpars: additional parameters
to be passed to program. If programpars includes an output
file, the output file name
should be specified in outfile:filename format. If programpars includes an input file
which is not to be read from a shared location,
the input file should be replicated using the -f <inpfile> option of pmap_dist. Then, in programpars, the input file name
should be specified in infile:filename format.
For both input and output files, filename is just
the name of the file, not a path.
convpars: additional paramaters of the conversion program, if
any, that is used to convert intermediate output files
from program-native format to human-readable
format. [-conversion <convpars>] should be the last
parameters of pmap
When pmap is called:
workdir=/temp outdir=/home/myuser readsfile=/home/myuser/reads.fq readsfile2=/home/myuser/reads2.fq #if needed indexdir=/home/myuser/myindexdir indexprefix=myindexprefix |
#PBS -l nodes=4:ppn=8 mpiexec -n 32 pmap_dist $workdir $outdir $readsfile mpiexec -n 32 pmap -i $indexdir $indexprefix $workdir $outdir bowtie |
#PBS -l nodes=4:ppn=8 mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -r $readsfile2 -i $indexdir $indexprefix mpiexec -n 32 pmap -pe $workdir $outdir bwa |
#PBS -l nodes=4:ppn=8 uniq $PBS_NODEFILE > $ids # Get ids of the nodes and run a single instance with 8 threads on each node mpiexec -n 4 -machinefile $ids pmap_dist $workdir $outdir $readsfile -r $readsfile2 -i $indexdir $indexprefix mpiexec -n 4 -machinefile $ids pmap -pe $workdir $outdir soap -p 8 -2 outfile:unpaired.txt |
Please also note the following for pmap_dist and pmap:
Using pMap with BWA
When pmap is called for BWA program, bwa aln is executed with appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist. Subsequently, the generated output files in BWA-native ".sai" format are converted to SAM format by executing bwa samse for single end reads and bwa sampe for paired end reads. The optional parameters of bwa sampe or bwa samse can be provided to pmap by using the -conversion option of pmap.
* Example: Map paired end reads. Index files are replicated. Use -l 28 option with bwa aln and -a 400 option with bwa sampe.
#PBS -l nodes=4:ppn=8 mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -r $readsfile2 -i $indexdir $indexprefix mpiexec -n 32 pmap -pe $workdir $outdir bwa -l 28 -conversion -a 400 #Sequential equivalent is: bwa aln -l 28 $indexdir/$indexprefix $readsfile > $workdir/out.sai bwa aln -l 28 $indexdir/$indexprefix $readsfile2 > $workdir/out2.sai bwa sampe -a 400 $indexdir/$indexprefix $workdir/out.sai $workdir/out2.sai $readsfile $readsfile2 > $outdir/out.txt |
For more information on BWA: http://bio-bwa.sourceforge.net/
(pMap was tested using version 0.5.0 of BWA)
Using pMap with SOAP
When pmap is called for SOAP program, soap is executed with -D option and appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist.
-a,-b,-D,-o options of SOAP are automatically used within pmap and therefore should not be specified in programpars.
* Example: Map paired end reads. Index files are replicated. Write unpaired reads into unpaired.txt using -2 option of soap.
#PBS -l nodes=4:ppn=8 mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -r $readsfile2 -i $indexdir $indexprefix mpiexec -n 32 pmap -pe $workdir $outdir soap -2 outfile:unpaired.txt #Sequential equivalent is: soap -D $indexdir/$indexprefix -a $readsfile -b $readsfile2 -o $outdir/out.txt -2 $outdir/unpaired.txt |
For more information on SOAP: http://soap.genomics.org.cn/
(pMap was tested using version 2.20 of SOAP)
Note: Not using "-2 <str>" option when mapping paired
end reads generates an error in SOAP.
Using pMap with Bowtie
When pmap is called for Bowtie program, bowtie is executed with appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist. The BOWTIE_INDEXES environment variable must be set to the folder name that contains the index files (e.g. the following line should be present in .bashrc file: BOWTIE_INDEXES=/home/myuser/myindexdir; export BOWTIE_INDEXES. Also see Bowtie manual). Currently, using Bowtie with replicated index files (i.e. -i option of pmap_dist) is not supported by pMap. Therefore it should be used with shared index files (i.e. using -i option of pmap).
--12,-c,-s,-u,-b,--refout,--mm,-f
(with ".mfa" files) options of Bowtie are not supported by
pmap.
-1,-2 options of Bowtie are
automatically used within pmap and therefore should not be
specified in programpars.
#PBS -l nodes=4:ppn=8 outdir1=/home/myuser/myoutdir1 #Write output of first run here outdir2=/home/myuser/myoutdir2 #Write output of second run here mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -r $readsfile2 mpiexec -n 32 pmap -pe -dc -i $indexdir $indexprefix $workdir $outdir1 bowtie mpiexec -n 32 pmap -pe -i $indexdir $indexprefix $workdir $outdir2 bowtie --best #Sequential equivalent is: outdir1=/home/myuser/myoutdir1 #Write output of first run here outdir2=/home/myuser/myoutdir2 #Write output of second run here bowtie $indexprefix -1 $readsfile -2 $readsfile2 $outdir1/out.txt bowtie --best $indexprefix -1 $readsfile -2 $readsfile2 $outdir2/out.txt |
For more information on Bowtie: http://bowtie-bio.sourceforge.net/
(pMap was tested using version 0.10.1 of Bowtie)
Using pMap with GSNAP
When pmap is called for GSNAP program, gsnap is executed with appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist. When using pMap with GSNAP, the indexdir should be set to the database directory that corresponds to the genome being considered. Also, indexprefix should be set to the name of the database. GSNAP uses a single input file when mapping paired end reads, therefore readsfile2 paramater is not needed.
-D,-d options of GSNAP are automatically used within pmap and therefore should not be specified in programpars.
* Example: Map paired end reads. The readsfile is in a special format that contains 3 lines per read. Use known SNP information in /home/myuser/snp.iit file using -V option of gsnap. Replicate the SNP information file as well as the index files (database files).
#PBS -l nodes=4:ppn=8 indexdir=/home/myuser/mygenomedb indexprefix=mygenomedb inpfile=/home/myuser/snp.iit mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -l 3 -i $indexdir $indexprefix -f $inpfile mpiexec -n 32 pmap -pe $workdir $outdir gsnap -V infile:snp.iit #Sequential equivalent is: indexdir=/home/myuser/mygenomedb indexprefix=mygenomedb inpfile=/home/myuser/snp.iit gsnap -V $inpfile -D $indexdir -d $indexprefix $readsfile > $outdir/out.txt |
For more information on GSNAP: http://research-pub.gene.com/gmap/
(pMap was tested using version 2010-07-27 of GMAP/GSNAP)
Using pMap with MAQ
When pmap is called for MAQ program, first maq fastq2bfa and maq fastq2bfq are called to convert the input genome and the group of reads assigned to the node to an maq-internal format (optional [-n nreads] parameter of maq fastq2bfa is ignored). Then, maq map is executed with appropriate parameters on each node to map the reads. Subsequently, the generated output files in MAQ-native ".map" format are converted to a human-readable format by executing maq mapview. The optional parameters of maq mapview can be provided to pmap by using the -conversion option of pmap. Since there's no distinct indexing step in MAQ, the indexdir and indexprefix paramaters of pmap_dist and pmap should be set to the directory of the genome file and the genome file name respectively.
* Example: Map paired end reads. Index file (genome file) is replicated.
#PBS -l nodes=4:ppn=8 indexdir=/home/myuser indexprefix=mygenome.fa mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -i $indexdir $indexprefix mpiexec -n 32 pmap -pe $workdir $outdir maq #Sequential equivalent is: indexdir=/home/myuser indexprefix=mygenome.fa maq fasta2bfa $indexdir/$indexprefix $workdir/index.bfa maq fastq2bfq $readsfile $workdir/reads.bfq maq fastq2bfq $readsfile2 $workdir/reads2.bfq maq map $workdir/out.map $workdir/index.bfa $workdir/reads.bfq $workdir/reads2.bfq maq mapview $workdir/out.map > $outdir/out.txt |
For more information on MAQ: http://maq.sourceforge.net/
(pMap was tested using version 0.7.1 of MAQ)
Using pMap with RMAP
When pmap is called for RMAP program, rmap is executed with -Q option and appropriate parameters on each node to map the group of reads assigned to the node by pmap_dist. Since there's no distinct indexing step in RMAP, the indexdir and indexprefix paramaters of pmap_dist and pmap should be set to the directory of the genome file and the genome file name respectively. The genome should be stored in a file in FASTA format, which may contain one or multiple chromosomes.
-s,-F,-p
options of RMAP are not supported by pmap.
-c,-Q,-o options of
RMAP are automatically used within pmap and therefore
should not be specified in programpars.
#PBS -l nodes=4:ppn=8 indexdir=/home/myuser indexprefix=mygenome.fa mpiexec -n 32 pmap_dist $workdir $outdir $readsfile -i $indexdir $indexprefix mpiexec -n 32 pmap -pe $workdir $outdir rmap #Sequential equivalent is: indexdir=/home/myuser indexprefix=mygenome.fa rmappe -c $indexdir/$indexprefix -Q $readsfile -o $outdir/out.txt |
For more information on RMAP: http://rulai.cshl.edu/rmap/
(pMap was tested using version 2.05 of RMAP)