Jason R Miller, Peng Zhou, Joann Mudge, James Gurtowski, Hayan Lee, Thiruvarangan Ramaraj, Brian P Walenz, Junqi Liu, Robert M Stupar, Roxanne Denny, Li Song, Namrata Singh, Lyza G Maron, Susan R McCouch, W Richard McCombie, Michael C Schatz, Peter Tiffin, Nevin D Young, Kevin A T Silverstein.
Abstract
BACKGROUND: Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation.
Keywords: Genome assembly; Hybrid assembly pipeline; Medicago truncatula; Tandem repeats
Year: 2017 PMID: 28724409 PMCID: PMC5518131 DOI: 10.1186/s12864-017-3927-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Change in reference agreement attributable to hybrid assembly methods
| Source | Metric | ALLPATHS | PBJelly | Alpaca |
|---|---|---|---|---|
| Agreement | ||||
| Nucmer | Alignment N50 | 20,539 | +86% | +99% |
| ATAC | Alignment N50 | 174,306 | +12% | +27% |
| Quast | NGA50 | 86,432 | 0% | +30% |
| Disagreement | ||||
| Quast | Misassemblies | 3784 | +50% | −17% |
| Quast | Local misassemblies | 9444 | −21% | −43% |
| Quast | Misassembled contigs | 1423 | +17% | −13% |
The rice Nipponbare genome was assembled with ALLPATHS and then re-assembled with the PBJelly and Alpaca hybrid methods. All assemblies were compared to the independently derived reference, and reference agreement was measured relative to the ALLPATHS level. Top: the sizes of alignments to the reference, characterized by N50. Nucmer alignments are bounded by contigs, while ATAC “M c” alignments can span intra-scaffold gaps. Quast NGA50 is an N50 adjusted by breaking at mis-assemblies. Bottom: Quast uses Nucmer alignments to infer global and local mis-assemblies, where the former involve spans or transpositions of 1Kbp or larger.
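The N50 statistic and the "relative to ALLPATHS" percentages in the table can be illustrated with a minimal sketch. The function names `n50` and `relative_change` are hypothetical helpers for illustration, not part of Nucmer, ATAC, or Quast, and the sketch uses the common definition of N50 (the length at which the sorted pieces first cover half of their combined length).

```python
def n50(lengths):
    """Return the N50 of a collection of alignment (or contig) lengths.

    Assumes the common definition: sort lengths from longest to shortest
    and return the length at which the running total first reaches half
    of the combined length. Returns 0 for an empty input.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def relative_change(value, baseline):
    """Percent change of `value` relative to `baseline` (e.g. the ALLPATHS level)."""
    return 100.0 * (value - baseline) / baseline


# Toy alignment lengths, not data from the paper.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
print(relative_change(130, 100))
```

Note that this plain N50 is computed over whatever lengths are supplied; the adjustment Quast applies for NGA50 (breaking at mis-assemblies) is not reproduced here.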
Analysis of short and long tandem repeats in three assemblies of rice
| Panel | Unit size | Category | ALLPATHS | PBJelly | Alpaca |
|---|---|---|---|---|---|
| A | Unit ≥ 2Kbp | One scaffold | 2.4% | 6.9% | 51.6% |
| | | Two scaffolds | 4.2% | 25.3% | 36.5% |
| | | Underrepresented | 93.4% | 67.8% | 11.8% |
| | | Total | 4734 | 4734 | 4734 |
| | Unit < 2Kbp | One scaffold | 71.3% | 81.8% | 80.1% |
| | | Two scaffolds | 12.8% | 12.0% | 6.7% |
| | | Underrepresented | 15.9% | 6.2% | 13.2% |
| | | Total | 61,140 | 61,140 | 61,140 |
| B | Unit ≥ 2Kbp | One chromosome | 43.9% | 32.1% | 61.3% |
| | | Two chromosomes | 0.9% | 1.1% | 4.7% |
| | | Underrepresented | 55.3% | 66.8% | 33.9% |
| | | Total | 114 | 184 | 548 |
| | Unit < 2Kbp | One chromosome | 61.6% | 58.1% | 73.3% |
| | | Two chromosomes | 4.1% | 4.2% | 1.9% |
| | | Underrepresented | 34.3% | 37.7% | 24.7% |
| | | Total | 8079 | 8034 | 9368 |
A. Repeat pairs on reference chromosomes were classified by whether both repeated units were 50% covered by alignments to one scaffold, two scaffolds, or were “underrepresented”, in each of three assemblies. B. Conversely, repeat pairs on assembled scaffolds were classified by whether they were 50% covered by alignments to chromosomes in the reference. There are fewer total repeats in (B) because the number of same-scaffold repeats is lower in each assembly than the number of same-chromosome repeats in the reference
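The three-way classification described in the caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function `classify_repeat_pair` and its input layout (a mapping from scaffold name to the fraction of a repeat unit covered by alignments to that scaffold) are hypothetical, and "50% covered" is read here as "at least 50%".

```python
def classify_repeat_pair(cov_a, cov_b, threshold=0.5):
    """Classify one repeat pair by how its two units are covered.

    cov_a, cov_b: dicts mapping a scaffold name to the fraction (0..1) of
    the first and second repeat unit covered by alignments to that scaffold.
    The threshold of 0.5 mirrors the 50% coverage rule in the caption;
    treating it as "at least" is an assumption.
    """
    hits_a = {name for name, frac in cov_a.items() if frac >= threshold}
    hits_b = {name for name, frac in cov_b.items() if frac >= threshold}
    if hits_a & hits_b:
        # Some single scaffold covers both units sufficiently.
        return "one scaffold"
    if hits_a and hits_b:
        # Both units are covered, but only by different scaffolds.
        return "two scaffolds"
    return "underrepresented"


# Toy inputs, not data from the paper.
print(classify_repeat_pair({"s1": 0.9}, {"s1": 0.7}))
print(classify_repeat_pair({"s1": 0.9}, {"s2": 0.6}))
print(classify_repeat_pair({"s1": 0.9}, {"s2": 0.2}))
```

The same logic applies to panel B with the roles reversed: repeat units on assembled scaffolds and coverage by alignments to reference chromosomes.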
Counts and lengths of alignments to the reference
| | Accession | ALLPATHS | PBJelly | Alpaca |
|---|---|---|---|---|
| Count, 2Kbp or longer | HM034 | 296 | 553 | 2058 |
| | HM056 | 257 | 436 | 1652 |
| | HM340 | 273 | 443 | 1947 |
| Count, Under 2Kbp | HM034 | 14,990 | 14,911 | 18,888 |
| | HM056 | 14,665 | 14,110 | 14,603 |
| | HM340 | 18,206 | 17,225 | 19,334 |
| Average length | HM034 | 294 | 388 | 769 |
| | HM056 | 291 | 373 | 767 |
| | HM340 | 271 | 336 | 730 |
In each of three Medicago accessions assembled three ways, the Alpaca assembly contained the most repeats and the largest average repeat length
Fig. 1 Tandem array counts per assembly. The assemblies of four Medicago truncatula accessions were analyzed for gene cluster content. Each vertical bar of the histogram indicates the number of tandem gene clusters. Left to right per cluster: light blue = HM056 ALLPATHS, blue = HM056 PBJelly, dark blue = HM056 Alpaca, light green = HM034 ALLPATHS, green = HM034 PBJelly, dark green = HM034 Alpaca, light orange = HM340 ALLPATHS, orange = HM340 PBJelly, dark orange = HM340 Alpaca, and purple = the Mt4.0 reference assembly of the A17 (HM101) accession
Gene copy number predictions and validations for a CRP3710 subfamily
| accession: | HM101 | HM034 | HM056 | HM340 |
|---|---|---|---|---|
| A. Assembly | | | | |
| Mt4.0 | 2 | | | |
| ALLPATHS | | 1 | 3 | 3 |
| PBJelly | | 8 | 5 | 3 |
| Alpaca | | 29 | 26 | 30 |
| B. Coverage (RPM) | | | | |
| Medtr1g061160 (control 1) | 0.26 | 0.38 | 0.29 | 0.50 |
| Medtr1g080770 (control 2) | 0.29 | 0.59 | 0.51 | 0.57 |
| CRP3710 | 7.00 | 5.60 | 9.00 | 14.00 |
| estimated copy number | 25.5 | 11.5 | 22.5 | 26.2 |
| C. qPCR | | | | |
| estimated copy number | 12.0 | 11.0 | 9.7 | 8.9 |
A. Annotation found between 1 and 30 copies per assembly. B. Coverages in reads per million bases for this gene and two controls, followed by the copy numbers estimated by fold increase of gene over control average, per accession. C. Copy numbers estimated from quantitative PCR per accession
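The coverage-based estimate in panel B can be reproduced from the table's own numbers: the gene's coverage is divided by the mean coverage of the two control genes. The sketch below is a worked check of that arithmetic; the function name `estimate_copy_number` and the dictionary layout are hypothetical and chosen only for illustration.

```python
def estimate_copy_number(gene_coverage, control_coverages):
    """Fold increase of the gene's coverage over the mean control coverage."""
    control_mean = sum(control_coverages) / len(control_coverages)
    return gene_coverage / control_mean


# Coverage values (RPM) copied from panel B of the table:
# accession -> (CRP3710 coverage, [control 1, control 2]).
panel_b = {
    "HM101": (7.00, [0.26, 0.29]),
    "HM034": (5.60, [0.38, 0.59]),
    "HM056": (9.00, [0.29, 0.51]),
    "HM340": (14.00, [0.50, 0.57]),
}

for accession, (gene_cov, controls) in panel_b.items():
    estimate = estimate_copy_number(gene_cov, controls)
    print(f"{accession}: {estimate:.1f}")
```

For example, for HM101 the control average is (0.26 + 0.29) / 2 = 0.275, and 7.00 / 0.275 is approximately 25.5, matching the tabulated estimate.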
Fig. 2 Alpaca pipeline schematic. The figure shows inputs (dashed outline), processes (light-filled boxes), and outputs (blue boxes)