Iria Fernandez-Silva, James B Henderson, Luiz A Rocha, W Brian Simison.
Abstract
The diversity of DNA sequencing methods and algorithms for genome assemblies presents scientists with a bewildering array of choices. Here, we construct and compare eight candidate assemblies, combining overlapping shotgun read data, mate-pair and Chicago libraries, and four different genome assemblers, to produce a high-quality draft genome of the iconic coral reef Pearlscale Pygmy Angelfish, Centropyge vrolikii (family Pomacanthidae). The best candidate assembly combined all four data types and had a scaffold N50 127.5 times higher than the candidate assembly obtained from shotgun data only. Our best candidate assembly had a scaffold N50 of 8.97 Mb and a contig N50 of 189,827 bp, and was 97.4% complete for BUSCO v2 (Actinopterygii set) and 95.6% complete for CEGMA matches. These contiguity and accuracy scores are higher than those of any other fish assembly released to date that did not apply linkage map information, including those based on more expensive long-read sequencing data. Our analysis of how different data types improve assembly quality will help others choose the most appropriate de novo genome sequencing strategy based on resources and target applications. Furthermore, the draft genome of the Pearlscale Pygmy Angelfish will play an important role in future studies of coral reef fish evolution, diversity and conservation.
Year: 2018 PMID: 29367590 PMCID: PMC5784092 DOI: 10.1038/s41598-018-19430-x
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The Pearlscale Pygmy Angelfish, Centropyge vrolikii.
Summary of data types and Illumina read statistics used in assemblies.
| Library | Read length | Illumina sequencing mode | Raw read pair count | Raw size (bp) | Raw coverage* | Filtered read pair count | Filtered size (bp) | Filtered coverage* |
|---|---|---|---|---|---|---|---|---|
| Shotgun | 250 PE | 1 lane HiSeq 2500 Rapid mode | 166,049,777 | 83,024,888,500 | 118.61 X | 124,528,871 | 61,983,918,428 | 88.55 X |
| Mate-pair 3.2 Kb | 150 PE | 1/3 lane Illumina HiSeq X SBS | 136,170,869 | 41,123,602,438 | 58.75 X | 91,604,453 | 21,700,169,234 | 31.00 X |
| Mate-pair 6.5 Kb | 150 PE | 1/3 lane Illumina HiSeq X SBS | 135,661,062 | 40,969,640,724 | 58.53 X | 33,812,494 | 8,032,596,441 | 11.48 X |
| Chicago | 100 PE | 1 lane HiSeq 2500 Rapid mode | 133,206,279 | 26,907,668,358 | 38.44 X | 133,206,279 | 26,907,668,358 | 38.44 X |
*Genome size = 700,000 kb (700 Mb).
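The coverage column can be reproduced by dividing each data size by the genome size. The sketch below assumes the footnote's genome size means 700 Mb (700,000,000 bp), a reading that matches the tabulated values; the function and variable names are illustrative only.

```python
# Assumed genome size in bp (700 Mb); this reading of the footnote
# reproduces the coverage values listed in the table.
GENOME_SIZE_BP = 700_000_000


def coverage(total_bases: int, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Return sequencing depth (X) as sequenced bases divided by genome size."""
    return total_bases / genome_size_bp


# Filtered data sizes (bp) copied from the table above.
filtered_bases = {
    "Shotgun": 61_983_918_428,
    "Mate-pair 3.2 Kb": 21_700_169_234,
    "Mate-pair 6.5 Kb": 8_032_596_441,
    "Chicago": 26_907_668_358,
}

for library, bases in filtered_bases.items():
    print(f"{library}: {coverage(bases):.2f} X")
```

Under this assumption, the filtered shotgun library works out to roughly 88.55 X, in line with the table.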
Figure 2. Flow chart of each of the eight candidate assemblies. Colored ovals represent each of the eight assemblies with color indicating which sources of DNA were used. The bold oval represents our highest scoring and final assembly C_vrolikii_CAS243847_v1.0.
Figure 3. Comparison of four contiguity and accuracy statistics among the eight candidate assemblies described in Fig. 2. The statistics are contig N50, scaffold N50, scaffold N90 and the BUSCO score, i.e. the proportion of a set of 3,023 highly conserved single-copy orthologs that is at least partially present (complete or fragmented) in our assembly, considering fragments longer than 500 bp.
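For readers unfamiliar with the contiguity statistics named in this caption, the following is a minimal sketch of the usual N50/N90 definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches 50% or 90% of the assembly length. It is not the authors' pipeline; the function name `nx`, the `min_length` filter and the toy lengths are hypothetical.

```python
def nx(lengths, fraction=0.5, min_length=0):
    """Return the Nx statistic for a list of sequence lengths.

    Sequences shorter than `min_length` are ignored. The result is the
    length of the sequence at which the cumulative sum, taken from the
    longest sequence downwards, first reaches `fraction` of the total.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no sequences left after filtering")
    threshold = fraction * sum(kept)
    running = 0
    for length in kept:
        running += length
        if running >= threshold:
            return length


# Toy scaffold lengths (hypothetical values, not from the paper).
toy_lengths = [100, 80, 60, 40, 20]
n50 = nx(toy_lengths, 0.5)
n90 = nx(toy_lengths, 0.9)
print(n50, n90)
```

A length cutoff such as the `min_length` argument changes the total and therefore the statistic, so values are only comparable between assemblies when the same cutoff is applied.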
Summary of quantitative measures for selected fish genomes. For BUSCO v1 and v2, each cell gives counts first and percentages after the semicolon: count of complete matches [count of duplicated completes], count of fragmented, count of missing; then percent complete [percent duplicated completes], percent fragmented, percent missing.
| Name | Scaffold N50 (bp) | Contig N50 (bp) | BUSCO v1 Vertebrata set of 3,023 | BUSCO v2 Actinopterygii set of 4,584 |
|---|---|---|---|---|
| Centropyge vrolikii | 8,966,845 | 189,827 | 2,820 [219], 92, 111; 93% [7.2%], 3.0%, 3.6% | 4,465 [341], 40, 79; 97.4% [7.4%], 0.9%, 1.7% |
| | 170,231 | 38,328 | 2,735 [154], 117, 171; 90% [5.0%], 3.8%, 5.6% | 4,393 [146], 90, 101; 95.8% [3.2%], 2.0%, 2.2% |
| | 25,848,596 | 1,721,997 | 2,744 [118], 75, 204; 90% [3.9%], 2.4%, 6.7% | 4,382 [140], 52, 150; 95.6% [3.1%], 1.1%, 3.3% |
| | 53,345,113 | 1,263,519 | 2,728 [138], 99, 196 | 4,368 [168], 76, 140; 95.3% [3.7%], 1.7%, 3.0% |
| | 3,158,421 | 79,912 | 2,790 [103], 92, 141; 92% [3.4%], 3.0%, 4.6% | 4,454 [95], 54, 76; 97.2% [2.1%], 1.2%, 1.6% |
| | 37,007,722 | 3,090,215 | 2,797 [102], 76, 150; 92% [3.3%], 2.5%, 4.9% | 4,464 [94], 42, 78; 97.4% [2.1%], 0.9%, 1.7% |
| | 11,516,971 | 52,883 | 2,628 [92], 97, 298; 86% [3.0%], 3.2%, 9.8% | 4,419 [106], 74, 91; 96.4% [2.3%], 1.6%, 2.0% |
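The percentages in the BUSCO columns follow from the counts. The sketch below assumes that duplicated matches are a subset of complete matches, so that complete + fragmented + missing equals the size of the ortholog set (this holds for the counts in the table). The helper name is hypothetical, and published values may differ in the last digit depending on how they were rounded.

```python
def busco_percentages(complete, duplicated, fragmented, missing):
    """Convert BUSCO counts to percentages of the ortholog set.

    Assumes `duplicated` is a subset of `complete`, so the set size is
    complete + fragmented + missing.
    """
    total = complete + fragmented + missing
    counts = {
        "complete": complete,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
    }
    return {name: round(100 * value / total, 1) for name, value in counts.items()}


# BUSCO v2 counts for the first row of the table (set of 4,584 orthologs).
first_row_v2 = busco_percentages(complete=4465, duplicated=341, fragmented=40, missing=79)
print(first_row_v2)
```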
Figure 4. Comparison of the Pearlscale Pygmy Angelfish assembly with 18 other recently published fish assemblies. Assemblies are ranked by scaffold N50 and color coded by the type of data used to generate the assembly (see also Table S3). Only scaffolds 1,000 bp and longer were considered for calculating scaffold N50 and CEGMA scores.
Summary of annotation statistics for C_vrolikii_CAS243847_v1.0.
| Length (bp) | Scaffold N50 (bp) | Scaffold L50 | Gene Models | Gene Length (assembly %) | Average Gene Length | Repeats (assembly %) | BUSCO Complete (% of 3,023) |
|---|---|---|---|---|---|---|---|
| 696,494,240 | 8,966,845 | 22 | 28,113 | 36.53% | 9,049 | 15.95% | 93.30% |
Figure 5. SyMAP synteny analyses between C. vrolikii and O. niloticus. (A) Syntenic mapping of C. vrolikii contigs to the O. niloticus chromosomes (a.k.a. linkage groups). (B) Whole genome dot plot. Dots represent anchors (“sequence matches”) and blue boxes indicate synteny blocks determined by the SyMAP synteny-finding algorithm[27]. (C) Circular view of the synteny between C. vrolikii contigs and the O. niloticus chromosomes.