Shanrong Zhao, Kurt Prenger, Lance Smith, Thomas Messina, Hongtao Fan, Edward Jaeger, Susan Stephens.
Abstract
BACKGROUND: Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources that are required for large-scale WGS data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses.
Year: 2013 PMID: 23802613 PMCID: PMC3698007 DOI: 10.1186/1471-2164-14-425
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. The Rainbow pipeline. S3 centralizes data storage, including inputs, intermediate results, and outputs. A typical Rainbow run on large-scale WGS data follows the sequence import → execute → export. Alignment and SNP calling are performed by Crossbow in a cluster with multiple nodes. AWS, Amazon Web Services.
Figure 2. The four main steps in WGS data analysis using Rainbow. Steps 1, 2, and 3 can be executed in parallel in the cloud by launching multiple instances or clusters. Crossbow performs both the alignment and the SNP calling.
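Description of the WGS data for 44 subjects and a summary of the detected SNPs
| Subject | | | | Homo_SNPs (a) | Hetero_SNPs (b) | Het2_SNPs (c) | Total SNPs |
|---|---|---|---|---|---|---|---|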
| SG1226 | 150 | 427 | 1900 | 1503239 | 2507716 | 2105 | 4013060 |
| SG1227 | 84 | 269 | 1220 | 1508162 | 2452631 | 1768 | 3962561 |
| SG1229 | 115 | 283 | 1280 | 1548421 | 2242797 | 1480 | 3792698 |
| SG1230 | 111 | 301 | 1362 | 1494063 | 2581379 | 1906 | 4077348 |
| SG1231 | 128 | 388 | 1740 | 1481840 | 2626852 | 2138 | 4120830 |
| SG1232 | 97 | 258 | 1162 | 1497747 | 2729469 | 2468 | 4229684 |
| SG1233 | 72 | 244 | 1204 | 1707533 | 2145142 | 1710 | 3854385 |
| SG1234 | 103 | 333 | 1496 | 1550433 | 2790889 | 3233 | 4344555 |
| SG1235 | 89 | 283 | 1258 | 1479108 | 2370792 | 1465 | 3851365 |
| SG1236 | 96 | 311 | 1382 | 1490258 | 2752238 | 2800 | 4245296 |
| SG1237 | 115 | 346 | 1552 | 1520718 | 2715088 | 2983 | 4238789 |
| SG1238 | 102 | 322 | 1432 | 1504343 | 2327337 | 1457 | 3833137 |
| SG1239 | 93 | 260 | 1170 | 1463430 | 2517434 | 1596 | 3982460 |
| SG1240 | 91 | 251 | 1122 | 1472236 | 2439010 | 1599 | 3912845 |
| SG1241 | 89 | 251 | 1120 | 1477814 | 2477763 | 1617 | 3957194 |
| SG1242 | 97 | 300 | 1360 | 1501310 | 2470358 | 1856 | 3973524 |
| SG1243 | 93 | 305 | 1356 | 1476636 | 2467669 | 1646 | 3945951 |
| SG1244 | 95 | 269 | 1202 | 1477605 | 2462121 | 1613 | 3941339 |
| SG1245 | 98 | 277 | 1244 | 1504849 | 2401576 | 1716 | 3908141 |
| SG1246 | 97 | 269 | 1202 | 1496063 | 2474375 | 1747 | 3972185 |
| SG1248 | 149 | 363 | 1632 | 1461303 | 2493556 | 1876 | 3956735 |
| SG1249 | 126 | 382 | 1702 | 1484862 | 2502888 | 2005 | 3989755 |
| SG1250 | 141 | 418 | 1860 | 1491431 | 2556946 | 2249 | 4050626 |
| SG1251 | 144 | 418 | 1860 | 1507550 | 2516188 | 2161 | 4025899 |
| SG1252 | 146 | 427 | 1918 | 1470888 | 2633297 | 2108 | 4106293 |
| SG1253 | 142 | 432 | 1940 | 1492478 | 2594146 | 2158 | 4088782 |
| SG1254 | 127 | 374 | 1682 | 1527922 | 2504784 | 1971 | 4034677 |
| SG1255 | 138 | 392 | 1760 | 1472663 | 2622476 | 2130 | 4097269 |
| SG1256 | 143 | 420 | 1872 | 1470631 | 2666175 | 2235 | 4139041 |
| SG1257 | 122 | 381 | 1712 | 1481487 | 2555804 | 2118 | 4039409 |
| SG1258 | 112 | 284 | 1278 | 1608618 | 2267405 | 1681 | 3877704 |
| SG1259 | 124 | 330 | 1492 | 1463934 | 2661886 | 1991 | 4127811 |
| SG1260 | 133 | 307 | 1382 | 1523274 | 2414508 | 1616 | 3939398 |
| SG1263 | 112 | 349 | 1552 | 1470360 | 2653711 | 2075 | 4126146 |
| SG1264 | 122 | 376 | 1690 | 1486174 | 2631507 | 2034 | 4119715 |
| SG1265 | 101 | 307 | 1378 | 1489230 | 2495349 | 1800 | 3986379 |
| SG1266 | 118 | 352 | 1576 | 1479540 | 2552931 | 2014 | 4034485 |
| SG1267 | 99 | 310 | 1382 | 1486385 | 2490343 | 1874 | 3978602 |
| SG1268 | 105 | 334 | 1488 | 1473782 | 3225627 | 2422 | 4701831 |
| SG1269 | 114 | 358 | 1608 | 1512683 | 2477654 | 2043 | 3992380 |
| SG1270 | 108 | 298 | 1340 | 1557762 | 2692982 | 2656 | 4253400 |
| SG1271 | 127 | 388 | 1742 | 1516008 | 2596244 | 2149 | 4114401 |
| SG1272 | 87 | 256 | 1134 | 1406043 | 2820949 | 1743 | 4228735 |
| SG1273 | 183 | 461 | 1950 | 1488525 | 2393320 | 2188 | 3884033 |
(a) Homo_SNPs are SNPs where both alleles are the same but different from the reference.
(b) Hetero_SNPs are SNPs where one allele is the same as the reference and the other allele is different.
(c) Het2_SNPs are SNPs where both alleles are different from the reference, and different from each other.
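The three footnote definitions can be read as a simple decision rule on a diploid genotype call. The following is a minimal Python sketch of that rule, not the code used by Crossbow or Rainbow; the function names `classify_snp` and `summarize` and the tuple-based input format are assumptions made for illustration only.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def classify_snp(ref: str, allele1: str, allele2: str) -> Optional[str]:
    """Classify one diploid call against the reference base.

    Hypothetical helper: it follows the table footnotes only and
    returns None when both alleles match the reference (no SNP).
    """
    if allele1 == ref and allele2 == ref:
        return None  # genotype equals the reference: not a SNP
    if allele1 == allele2:
        return "Homo_SNP"  # both alleles identical, both differ from ref
    if ref in (allele1, allele2):
        return "Hetero_SNP"  # one reference allele, one alternate allele
    return "Het2_SNP"  # two different alternate alleles


def summarize(calls: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count SNP categories over (ref, allele1, allele2) tuples."""
    labels = (classify_snp(ref, a1, a2) for ref, a1, a2 in calls)
    return Counter(label for label in labels if label is not None)


# Illustrative calls (made-up genotypes, not data from the study).
example_calls = [
    ("A", "G", "G"),  # homozygous alternate
    ("C", "C", "T"),  # heterozygous with one reference allele
    ("G", "A", "T"),  # two different alternate alleles
    ("T", "T", "T"),  # matches the reference, ignored
]
counts = summarize(example_calls)
total_snps = sum(counts.values())
```

Because the three categories are mutually exclusive, their counts add up to the total number of SNPs, which matches the table: for subject SG1226, 1503239 + 2507716 + 2105 = 4013060.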
Figure 3. Download time from S3 to EC2 instance in the cloud versus the BAM file size.
Figure 4. Running time of Picard versus the number of sequence reads.
Figure 5. Running time of Bowtie alignment versus the number of paired sequence reads. The cluster consisted of 40 c1.xlarge instances.
Figure 6. Running time of SOAPsnp versus the number of paired mapped reads. The cluster consisted of 40 c1.xlarge instances.