Adam C English, Stephen Richards, Yi Han, Min Wang, Vanesa Vee, Jiaxin Qu, Xiang Qin, Donna M Muzny, Jeffrey G Reid, Kim C Worley, Richard A Gibbs.
Abstract
Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors combine to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to "phase 3 finished" status using time-consuming and expensive Sanger-based manual finishing processes. To make assembly and finishing of draft genomes easier, we present an automated approach to finishing that uses long reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly (publicly available at https://sourceforge.net/projects/pb-jelly/), automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides "lift-over" coordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the sooty mangabey. With 24× mapped coverage of PacBio long reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long reads, we saw reads address 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long reads, we addressed 97% of gaps, closed 66% of addressed gaps, and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.
Year: 2012 PMID: 23185243 PMCID: PMC3504050 DOI: 10.1371/journal.pone.0047768
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Gap numbers and size distributions for representative high quality draft assemblies of highly studied species.
| Organism | Common Name | % Bases in Gaps | Mean Gap Size | Median Gap Size |
| Apis mellifera | Honey Bee | 8.40% | 1892 | 54 |
| Equus caballus | Horse | 1.80% | 1010 | 282 |
| Gallus gallus | Chicken | 1.30% | 1267 | 302 |
| Glycine max | Soy Bean | 1.80% | 1208 | 176 |
| Macaca mulatta | Rhesus macaque | 7.20% | 1513 | 374 |
| Monodelphis domestica | Opossum | 2.50% | 1699 | 386 |
| Ornithorhynchus anatinus | Platypus | 7.70% | 585 | 249 |
| Pan troglodytes | Chimpanzee | 6.70% | 1799 | 539 |
| Pongo abelii | Orangutan | 6.40% | 852 | 428 |
| Rattus norvegicus | Brown Rat | 8.90% | 1590 | 57 |
| Sus scrofa | Wild hog | 10.30% | 1218 | 100 |
| Strongylocentrotus purpuratus | Purple Sea Urchin | 12.80% | 844 | 50 |
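The columns in this table can be derived directly from an assembly's sequence. The following is a minimal sketch, assuming gaps are encoded as runs of `N` characters in a scaffold sequence (a common convention, not something stated in the source). The function name `gap_stats` and the toy sequence are hypothetical and exist only for illustration.

```python
import re
from statistics import mean, median


def gap_stats(sequence):
    """Summarize gaps in one scaffold, assuming a gap is a run of N or n."""
    gap_lengths = [len(m.group()) for m in re.finditer(r"[Nn]+", sequence)]
    if not gap_lengths:
        return {
            "gap_count": 0,
            "pct_bases_in_gaps": 0.0,
            "mean_gap_size": 0.0,
            "median_gap_size": 0.0,
        }
    return {
        "gap_count": len(gap_lengths),
        "pct_bases_in_gaps": 100.0 * sum(gap_lengths) / len(sequence),
        "mean_gap_size": mean(gap_lengths),
        "median_gap_size": median(gap_lengths),
    }


# Toy scaffold (made up): four 20 bp contigs separated by gaps of 10, 50 and 300 bp.
toy = "ACGT" * 5 + "N" * 10 + "ACGT" * 5 + "N" * 50 + "ACGT" * 5 + "N" * 300 + "ACGT" * 5
stats = gap_stats(toy)
print(stats)
```

In the toy data a single large gap pulls the mean well above the median, which is the same skew visible in the table (for example, honey bee: mean 1892 versus median 54).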
Figure 1. A schematic of PBJelly's workflow and decision-making.
(A) A flow chart of PBJelly's steps. (B) A schematic describing two hypothetical gaps supported by reads and the classifications used during the Support step. (C) A detailed flow chart for local assembly of PacBio reads in a gap region used during the assembly step.
Figure 2. Description of sequencing data sets used.
Histograms of read lengths in (A) Dmel, (B) Dpse, (C) Mund, (D) Caty. Panel (E) contains detailed metrics of each dataset.
Figure 3. Gap-filling improvements and categories produced by PBJelly.
Histograms showing gap-size distribution in the original and upgraded (A) Dmel, (B) Dpse, (C) Mund, and (D) Caty references, as well as a summary of the upgrade categories for gaps.
Gap Fill Statistics for PBJelly.
| Dmel | Original | Upgraded | Improvement |
| Gap Count | 4,651 | 311 | 15.0× |
| Gap n50 | 1,815 bp | 3,504 bp | 1.9× |
| Total Gap Size | 3.19 Mb | 541.3 Kb | 5.9× |
| Contig n50 | 64,006 bp | 723,621 bp | 11.3× |
| Total Contig Size | 133.6 Mb | 136.3 Mb | 1.0× |
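The Improvement column appears to be the ratio of the larger to the smaller value in each row, so it reads as a fold change whether the metric is expected to shrink (gap count) or grow (contig n50). The sketch below checks that reading against the table and shows the usual N50 definition (the length at which contigs of that size or longer cover at least half of the total). The function names are illustrative assumptions and are not part of PBJelly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or gap) lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("lengths must not be empty")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def fold_change(original, upgraded):
    """Ratio of the larger to the smaller value (assumed table convention)."""
    return max(original, upgraded) / min(original, upgraded)


# Values copied from the Dmel table above.
gap_count_fold = round(fold_change(4651, 311), 1)
gap_n50_fold = round(fold_change(1815, 3504), 1)
contig_n50_fold = round(fold_change(64006, 723621), 1)

# Toy contig lengths (made up) to exercise n50.
toy_n50 = n50([2, 3, 4, 5, 6])
print(gap_count_fold, gap_n50_fold, contig_n50_fold, toy_n50)
```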
Sanger Validation Results Per Gap.
| Gap Id | Flanking Contig Accuracy | Gap Filling Sequence Accuracy | Sanger #bp Over Flanking Contig | Base Pairs Placed In Gap by Sanger Read | Base Pairs Placed in Gap by PBJelly |
| ref0002030_110_111 | 82.3% | 67.8% | 439 | 111 | 90 |
| ref0003044_87_88 | 85.5% | 73.3% | 539 | 67 | 72 |
| ref0003044_71_72 | 86.5% | 76.0% | 637 | 90 | 93 |
| ref0002030_202_203 | 87.7% | 76.0% | 467 | 24 | 24 |
| ref0004545_3_4__ | 88.1% | 83.1% | 551 | 89 | 81 |
| ref0003477_0_1__ | 94.5% | 86.7% | 414 | 233 | 239 |
| ref0004673_6_7__ | 80.0% | 87.8% | 429 | 485 | 520 |
| ref0004554_129_130 | 97.9% | 90.3% | 471 | 91 | 90 |
| ref0004824_98_99 | 98.1% | 90.8% | 208 | 123 | 127 |
| ref0004554_20_21 | 98.3% | 90.9% | 231 | 11 | 10 |
| ref0004554_360_361 | 99.0% | 91.4% | 413 | 32 | 35 |
| ref0004554_67_68 | 97.9% | 91.6% | 478 | 164 | 179 |
| ref0000204_139_140 | 88.5% | 91.7% | 582 | 55 | 60 |
| ref0000204_415_416 | 99.7% | 91.9% | 645 | 34 | 37 |
| ref0003302_34_35 | 99.6% | 92.3% | 480 | 36 | 39 |
| ref0000495_174_175 | 99.7% | 92.8% | 637 | 80 | 80 |
| ref0003625_75_76 | 99.2% | 93.3% | 511 | 154 | 165 |
| ref0003625_121_122 | 99.7% | 93.3% | 602 | 42 | 45 |
| ref0003625_268_269 | 99.1% | 93.4% | 571 | 113 | 121 |
| ref0003302_346_347 | 99.3% | 93.7% | 610 | 74 | 79 |
| ref0000495_153_154 | 99.8% | 94.0% | 407 | 190 | 201 |
| ref0003625_397_398 | 99.0% | 94.1% | 516 | 192 | 200 |
| ref0003625_175_176 | 99.0% | 94.3% | 510 | 100 | 106 |
| ref0003625_408_409 | 99.3% | 94.3% | 554 | 50 | 53 |
| ref0003625_5_6__ | 99.2% | 94.4% | 529 | 204 | 216 |
| ref0004554_292_293 | 99.4% | 95.0% | 506 | 113 | 119 |
| ref0003625_91_92 | 99.8% | 95.2% | 502 | 99 | 104 |
| ref0004554_288_289 | 99.6% | 95.3% | 540 | 204 | 210 |
| ref0004554_300_301 | 99.8% | 95.2% | 543 | 118 | 124 |
| ref0003302_398_399 | 76.3% | 95.3% | 277 | 121 | 127 |
| ref0003302_675_676 | 99.0% | 95.5% | 488 | 63 | 66 |
| ref0003441_8_9__ | 98.1% | 96.0% | 580 | 24 | 25 |
| ref0004554_180_181 | 99.8% | 97.3% | 425 | 107 | 110 |
| ref0000641_380_381 | 98.7% | 97.6% | 527 | 164 | 168 |
| ref0003625_84_85 | 98.4% | 98.2% | 497 | 55 | 56 |
| ref0002030_105_106 | 99.5% | 98.9% | 564 | 94 | 94 |
| ref0003302_100_101 | 99.5% | 100.0% | 561 | 31 | 31 |
| ref0002030_529_530 | 99.2% | 100.0% | 510 | 11 | 11 |
| ref0004709_25_26 | 96.7% | 100.0% | 244 | 6 | 6 |
| ref0003625_282_283* | 98.7% | 100.0% | 539 | −12 | −12 |
| ref0000204_291_292* | 99.8% | 100.0% | 589 | −25 | −25 |
| ref0004554_137_138* | 96.7% | 100.0% | 426 | −1 | −1 |
| ref0003044_23_24* | 88.1% | 100.0% | 370 | −43 | −43 |
| ref0002030_252_253* | 99.8% | 100.0% | 520 | −23 | −23 |
| ref0004554_118_119* | 98.0% | 85.7% | 457 | −42 | −36 |
Negative gaps are marked with an asterisk.
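One way to read the last two columns is as a per-gap size comparison between the Sanger read and the PBJelly fill. The sketch below is a minimal illustration using four rows copied from the table; the helper name `size_difference` and the choice to report PBJelly minus Sanger are assumptions made here, not the authors' validation procedure.

```python
# (gap id, bp placed by Sanger read, bp placed by PBJelly), copied from the table.
rows = [
    ("ref0002030_110_111", 111, 90),
    ("ref0002030_202_203", 24, 24),
    ("ref0004673_6_7__", 485, 520),
    ("ref0004554_118_119*", -42, -36),
]


def size_difference(sanger_bp, pbjelly_bp):
    """Signed difference in placed bases: positive means PBJelly placed more."""
    return pbjelly_bp - sanger_bp


diffs = {gap_id: size_difference(sanger, pbjelly) for gap_id, sanger, pbjelly in rows}

# Negative gaps carry a trailing asterisk in the table.
negative_gaps = [gap_id for gap_id, _, _ in rows if gap_id.endswith("*")]

for gap_id, diff in diffs.items():
    print(f"{gap_id}\t{diff:+d} bp")
```

A difference of zero means both methods placed the same number of bases; it says nothing about whether the bases themselves agree, which is what the Gap Filling Sequence Accuracy column reports.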
Figure 4. Validation of PBJelly.
Using Sanger sequencing of Dpse, we validated 7 negative gap closures (A) and 45 closed gaps (B). We also compared PBJelly's gap-closing sequence with the original Dmel reference (C).
Figure 5. Distribution of amount of sequence placed in closed gaps compared to overfilled gaps.
Frequency plots of the absolute value of sequence placed into gaps subtracted from the predicted gap size in closed gaps versus overfilled gaps in (A) Dpse (B) Mund (C) Caty. Data for Dmel is not shown because synthetically inserted gaps' predicted gap sizes matched the amount of sequence that should have been placed into the gaps.