Ben Langmead, Michael C Schatz, Jimmy Lin, Mihai Pop, Steven L Salzberg.
Abstract
As DNA sequencing outpaces improvements in computer speed, there is a critical need to accelerate tasks like alignment and SNP calling. Crossbow is a cloud-computing software tool that combines the aligner Bowtie and the SNP caller SOAPsnp. Executing in parallel using Hadoop, Crossbow analyzes data comprising 38-fold coverage of the human genome in three hours using a 320-CPU cluster rented from a cloud computing service for about $85. Crossbow is available from http://bowtie-bio.sourceforge.net/crossbow/.
Year: 2009 PMID: 19930550 PMCID: PMC3091327 DOI: 10.1186/gb-2009-10-11-r134
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Experimental parameters for Crossbow experiments using simulated reads from human chromosomes 22 and X
| Reference chromosome | Chromosome 22 | Chromosome X |
|---|---|---|
| Reference base pairs | 49.7 million | 155 million |
| Chromosome copy number | Diploid | Haploid |
| HapMap SNPs introduced | 36,096 | 71,976 |
| Heterozygous | 24,761 | 0 |
| Homozygous | 11,335 | 71,976 |
| Novel SNPs introduced | 10,490 | 30,243 |
| Heterozygous | 6,967 | 0 |
| Homozygous | 3,523 | 30,243 |
| Simulated coverage | 40-fold | 40-fold |
| Read type | 35-bp paired | 35-bp paired |
SNP calling measurements for Crossbow experiments using simulated reads from human chromosomes 22 and X
| SNP site category | Chromosome 22: true number of sites | Chromosome 22: Crossbow sensitivity | Chromosome 22: Crossbow precision | Chromosome X: true number of sites | Chromosome X: Crossbow sensitivity | Chromosome X: Crossbow precision |
|---|---|---|---|---|---|---|
| All SNP sites | 46,586 | 99.0% | 99.1% | 102,219 | 99.0% | 99.6% |
| Only HapMap SNP sites | 36,096 | 99.8% | 99.9% | 71,976 | 99.9% | 99.9% |
| Only novel SNP sites | 10,490 | 96.3% | 96.3% | 30,243 | 96.8% | 98.8% |
| Only homozygous | 14,858 | 98.7% | 99.9% | NA | NA | NA |
| Only heterozygous | 31,728 | 99.2% | 98.8% | NA | NA | NA |
| Only novel het | 6,967 | 96.6% | 94.6% | NA | NA | NA |
| All other | 39,619 | 99.4% | 99.9% | NA | NA | NA |
Sensitivity is the proportion of true SNPs that were correctly identified. Precision is the proportion of called SNPs that were genuine. NA denotes "not applicable" because of the ploidy of the chromosome.
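The two definitions above can be illustrated with a minimal Python sketch. It assumes that a call counts as correct only when the site and the genotype both match; that matching rule, the function name, and the toy data are illustrative assumptions and are not taken from the Crossbow study.

```python
def sensitivity_and_precision(true_snps, called_snps):
    """Return (sensitivity, precision) for two collections of SNP calls.

    Each SNP is a hashable record such as (chromosome, position, genotype).
    Assumption: a call is a true positive only if the whole record matches.
    """
    true_snps = set(true_snps)
    called_snps = set(called_snps)
    true_positives = true_snps & called_snps
    # Sensitivity: proportion of true SNPs that were correctly identified.
    sensitivity = len(true_positives) / len(true_snps) if true_snps else 0.0
    # Precision: proportion of called SNPs that were genuine.
    precision = len(true_positives) / len(called_snps) if called_snps else 0.0
    return sensitivity, precision


# Hypothetical toy data, not values from the paper.
truth = {("chr22", 100, "AG"), ("chr22", 250, "TT"),
         ("chr22", 400, "CT"), ("chr22", 900, "GG")}
calls = {("chr22", 100, "AG"), ("chr22", 250, "TT"),
         ("chr22", 400, "CC"), ("chr22", 700, "AA")}

sens, prec = sensitivity_and_precision(truth, calls)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

In this toy case two of the four true SNPs are recovered and two of the four calls are genuine, so both measures come out at one half.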
Coverage and agreement measurements comparing Crossbow (CB) and SOAP/SOAPsnp (SS) to the genotyping results obtained by an Illumina 1 M genotyping assay in the SOAPsnp study
| Illumina 1 M genotype | Sites | Sites covered (SS) | Sites covered (CB) | Agreed (SS) | Agreed (CB) | Missed allele (SS) | Other disagreement (SS) | Missed allele (CB) | Other disagreement (CB) |
|---|---|---|---|---|---|---|---|---|---|
| Chromosome X | | | | | | | | | |
| HOM reference | 27,196 | 98.65% | 99.83% | 99.99% | 99.99% | NA | 0.004% | NA | 0.011% |
| HOM mutant | 10,737 | 98.49% | 99.19% | 99.89% | 99.85% | NA | 0.113% | NA | 0.150% |
| Total | 37,933 | 98.61% | 99.65% | 99.97% | 99.95% | NA | 0.035% | NA | 0.050% |
| Autosomal | | | | | | | | | |
| HOM reference | 540,878 | 99.11% | 99.88% | 99.96% | 99.92% | NA | 0.044% | NA | 0.078% |
| HOM mutant | 208,436 | 98.79% | 99.28% | 99.81% | 99.70% | NA | 0.194% | NA | 0.296% |
| HET | 250,667 | 94.81% | 99.64% | 99.61% | 99.75% | 0.374% | 0.017% | 0.236% | 0.014% |
| Total | 999,981 | 97.97% | 99.70% | 99.84% | 99.83% | 0.091% | 0.069% | 0.059% | 0.108% |
'Sites covered' is the proportion of BeadChip sites covered by a sufficient number of sequencing reads (roughly four reads for diploid and two reads for haploid chromosomes). 'Agreed' is the proportion of covered BeadChip sites where the BeadChip call equaled the SOAPsnp/Crossbow call. 'Missed allele' is the proportion of covered sites where SOAPsnp/Crossbow called a position as homozygous for one of two heterozygous alleles called by BeadChip. 'Other disagreement' is the proportion of covered sites where the BeadChip call differed from the SOAPsnp/Crossbow in any other way. NA denotes "not applicable" due to ploidy.
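As a rough illustration of how the four measures defined above relate to one another, the following Python sketch classifies individual sites and then summarizes them. It handles diploid genotypes only, represents each genotype as a two-letter string, and uses `None` for a site without sufficient read coverage; these representations, the function names, and the example pairs are assumptions made for illustration, not the procedure used in the study.

```python
from collections import Counter


def classify_site(beadchip, sequenced):
    """Classify one diploid site given the BeadChip and sequencing genotypes.

    Genotypes are two-letter strings such as "AG"; `sequenced` is None when
    the site is not covered by enough reads (an illustrative convention).
    """
    if sequenced is None:
        return "not_covered"
    chip = "".join(sorted(beadchip))   # allele order does not matter
    seq = "".join(sorted(sequenced))
    if chip == seq:
        return "agreed"
    chip_is_het = chip[0] != chip[1]
    seq_is_hom = seq[0] == seq[1]
    # Missed allele: BeadChip is heterozygous, the sequencing call is
    # homozygous for one of the two BeadChip alleles.
    if chip_is_het and seq_is_hom and seq[0] in chip:
        return "missed_allele"
    return "other_disagreement"


def summarize(pairs):
    """Return the four proportions; the last three are relative to covered sites."""
    counts = Counter(classify_site(chip, seq) for chip, seq in pairs)
    total = sum(counts.values())
    covered = total - counts["not_covered"]
    if total == 0 or covered == 0:
        raise ValueError("need at least one covered site")
    return {
        "sites_covered": covered / total,
        "agreed": counts["agreed"] / covered,
        "missed_allele": counts["missed_allele"] / covered,
        "other_disagreement": counts["other_disagreement"] / covered,
    }


# Hypothetical (BeadChip, sequencing) genotype pairs.
pairs = [("AG", "GA"), ("AG", "AA"), ("CC", "CT"), ("TT", None), ("GG", "GG")]
summary = summarize(pairs)
print(summary)
```

Because 'Agreed', 'Missed allele', and 'Other disagreement' partition the covered sites, the three proportions sum to one in this sketch.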
Timing and cost for Crossbow experiments using reads from the Wang et al. study [5]
| EC2 Nodes | 1 master, 10 workers | 1 master, 20 workers | 1 master, 40 workers |
|---|---|---|---|
| Worker CPU cores | 80 | 160 | 320 |
| Wall clock time | 6 h:30 m | 4 h:33 m | 2 h:53 m |
| Approximate cluster setup time | 18 m | 18 m | 21 m |
| Approximate Crossbow time | 6 h:12 m | 4 h:15 m | 2 h:32 m |
| Approximate cost (US/Europe) | $52.36/$60.06 | $71.40/$81.90 | $83.64/$95.94 |
Costs are approximate and based on the pricing as of this writing, that is, $0.68 per extra-large high-CPU EC2 node per hour in the US and $0.78 in Europe. Times can vary subject to, for example, congestion and Internet traffic conditions.
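The cost row can be reproduced with a short calculation. The sketch below assumes that every node, including the master, is billed at the same hourly rate and that each started hour is billed in full; neither billing rule is stated in the table notes, but together they happen to match the published figures. The function name is illustrative.

```python
import math


def cluster_cost(workers, wall_clock_hours, rate_per_node_hour, masters=1):
    """Approximate rental cost in dollars for one run.

    Assumptions (not stated in the source): master and worker nodes cost
    the same, and partial hours are rounded up to whole billed hours.
    """
    nodes = masters + workers
    billed_hours = math.ceil(wall_clock_hours)
    return nodes * billed_hours * rate_per_node_hour


# Wall clock times from the timing table, keyed by number of worker nodes.
runs = {10: 6 + 30 / 60, 20: 4 + 33 / 60, 40: 2 + 53 / 60}

for workers, hours in runs.items():
    us = cluster_cost(workers, hours, 0.68)  # US rate per node per hour
    eu = cluster_cost(workers, hours, 0.78)  # Europe rate per node per hour
    print(f"{workers} workers: ${us:.2f} (US) / ${eu:.2f} (Europe)")
```

For the 40-worker run this gives 41 nodes × 3 billed hours × $0.68 = $83.64, in line with the table.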
Figure 1. Number of worker CPU cores allocated from EC2 versus throughput measured in experiments per hour: that is, the reciprocal of the wall clock time required to conduct a whole-human experiment on the Wang et al. dataset [5]. The line labeled 'linear speedup' traces hypothetical linear speedup relative to the throughput for 80 CPU cores.
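The quantities plotted in this figure can be recomputed from the wall clock times in the timing table. The sketch below is a minimal illustration: throughput is the reciprocal of wall clock time in hours, and the linear-speedup line scales the 80-core throughput by the ratio of core counts. The efficiency column is an added convenience, not a quantity reported in the source.

```python
def throughput(hours, minutes):
    """Experiments per hour: the reciprocal of wall clock time in hours."""
    return 1.0 / (hours + minutes / 60.0)


# Wall clock times from the timing table, keyed by worker CPU cores.
measured = {
    80: throughput(6, 30),
    160: throughput(4, 33),
    320: throughput(2, 53),
}

BASELINE_CORES = 80


def linear_speedup(cores):
    """Hypothetical throughput if scaling from the 80-core run were linear."""
    return measured[BASELINE_CORES] * cores / BASELINE_CORES


for cores, observed in measured.items():
    ideal = linear_speedup(cores)
    print(f"{cores} cores: observed {observed:.3f}/h, "
          f"linear {ideal:.3f}/h, efficiency {observed / ideal:.0%}")
```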
Figure 2. Crossbow workflow. Previously copied and pre-processed read files are downloaded to the cluster, decompressed and aligned using many parallel instances of Bowtie. Hadoop then bins and sorts the alignments according to primary and secondary keys. Sorted alignments falling into each reference partition are then submitted to parallel instances of SOAPsnp. The final output is a stream of SNP calls made by SOAPsnp.
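The bin-and-sort step in this workflow can be mimicked in memory with a few lines of Python. This is only a sketch of the idea, not Hadoop code: it assumes the primary key is the pair (chromosome, partition index) and the secondary key is the alignment offset, and the partition size, tuple layout, and `call_snps_stub` placeholder are all hypothetical.

```python
from collections import defaultdict

PARTITION_SIZE = 2_000_000  # hypothetical partition width in base pairs


def bin_and_sort(alignments, partition_size=PARTITION_SIZE):
    """Group alignments by reference partition, then sort within each group.

    `alignments` holds (chromosome, offset, read_id) tuples. The primary key
    (chromosome, partition index) selects the bin; the secondary key (offset)
    orders alignments inside it. Both key choices are assumptions.
    """
    bins = defaultdict(list)
    for chrom, offset, read_id in alignments:
        primary_key = (chrom, offset // partition_size)
        bins[primary_key].append((offset, read_id))
    return {key: sorted(values) for key, values in sorted(bins.items())}


def call_snps_stub(partition_key, sorted_alignments):
    """Placeholder for the per-partition SNP caller; only reports a count."""
    return f"{partition_key}: {len(sorted_alignments)} alignments"


# Hypothetical alignments in arbitrary arrival order.
alignments = [
    ("chr22", 2_500_000, "r3"),
    ("chr22", 100, "r1"),
    ("chrX", 50, "r4"),
    ("chr22", 1_999_999, "r2"),
]

partitions = bin_and_sort(alignments)
for key, sorted_alignments in partitions.items():
    print(call_snps_stub(key, sorted_alignments))
```

Each partition can be handed to an independent SNP-calling task because no alignment is assigned to more than one bin in this sketch; reads spanning a partition boundary are not treated specially here.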
Figure 3. Four basic steps to running the Crossbow computation. Two scenarios are shown: one where Amazon's EC2 and S3 services are used, and one where a local cluster is used. In step 1 (red) short reads are copied to the permanent store. In step 2 (green) the cluster is allocated (may not be necessary for a local cluster) and the scripts driving the computation are uploaded to the master node. In step 3 (blue) the computation is run. The computation downloads reads from the permanent store, operates on them, and stores the results in the Hadoop distributed filesystem. In step 4 (orange), the results are copied to the client machine and the job completes. SAN (Storage Area Network) and NAS (Network-Attached Storage) are two common ways of sharing filesystems across a local network.