Ole K Tørresen1, Bastiaan Star2, Sissel Jentoft2,3, William B Reinar2, Harald Grove4, Jason R Miller5, Brian P Walenz6, James Knight7, Jenny M Ekholm8, Paul Peluso8, Rolf B Edvardsen9, Ave Tooming-Klunderud2, Morten Skage2, Sigbjørn Lien4, Kjetill S Jakobsen2, Alexander J Nederbragt10,11.
Abstract
BACKGROUND: The first Atlantic cod (Gadus morhua) genome assembly, published in 2011, was one of the early genome assemblies based exclusively on high-throughput 454 pyrosequencing. Since then, rapid advances in sequencing technologies have led to a multitude of assemblies generated for complex genomes, although many of these are fragmented, with a significant fraction of bases in gaps. The development of long-read sequencing and improved software now enables the generation of more contiguous genome assemblies.
Keywords: Assembly algorithms; Assembly consolidation; Dinucleotide repeats; Gadus morhua; Heterozygosity; Indel polymorphism; Long-read sequencing technology; Microsatellites; PacBio; Repetitive DNA
Year: 2017 PMID: 28100185 PMCID: PMC5241972 DOI: 10.1186/s12864-016-3448-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Contig and scaffold N50 lengths of the different cod assemblies. gadMor2 was created by following the sequences in CA454ILM in a path through a graph created from a multiple alignment of the four original assemblies, and outputting the contig sequences from CA454PB for each alignment. NEWB454 and ALPILM were used to extend the scaffolds; see Table 1
Overview of assembly statistics
| Assembly | Total size assembly (Mbp) | N50 contig (kbp) | N50 scaffold (Mbp) | Percentage gap bases | CEGMAa | BUSCOb | REAPRc | FRCd | Potential conflict (sequences)e |
|---|---|---|---|---|---|---|---|---|---|
| gadMor1f | 832 | 2.3 | 0.14 | 26.9 | 444 (96.9%) | 3 308 (89.4%) | 2 547 | 4 210 772 | 76 |
| ALPILM | 660 | 4.4 | 0.16 | 28.7 | 424 (92.6%) | 3 016 (81.6%) | 19 787 | 2 182 096 | 122 |
| NEWB454 | 656 | 6.2 | 1.30 | 24.4 | 435 (95.0%) | 3 109 (84.1%) | 18 117 | 2 044 008 | 26 |
| CA454ILM | 647 | 9.9 | 0.50 | 3.49 | 447 (97.5%) | 3 379 (91.4%) | 7 406 | 1 351 500 | 96 |
| CA454PB | 682 | 95 | 0.27 | 1.62 | 431 (94.1%) | 3 310 (89.5%) | 8 617 | 1 508 054 | 188 |
| gadMor2g | 643 | 116 | 1.15 | 1.69 | 435 (95.0%) | 3 447 (93.2%) | 7 359 | 1 248 792 | 15 |
aCEGMA annotates 458 highly conserved eukaryotic genes
bBUSCO annotates 3,698 actinopterygii specific genes
cREAPR analyses the discordance between the expected order, orientation and distance of mapped paired reads; the value is the number of detected potential errors, fewer is better
dFRC uses a similar approach to REAPR; the value is the total number of features (i.e., potential assembly problems), fewer is better
eNumber of sequences mapping to more than one linkage group or to multiple linkage groups, fewer is better
fFrom [5]
g93% of the gadMor2 assembly is additionally oriented and ordered into 23 linkage groups (Additional file 1: Table S3)
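The "Percentage gap bases" column can in principle be reproduced from the assembly sequences by counting gap characters. The following is a minimal sketch, assuming gaps are encoded as `N` (upper or lower case) and using a hypothetical in-memory dictionary of toy scaffolds rather than any of the assemblies above; the function name `gap_percentage` is illustrative only.

```python
def gap_percentage(sequences):
    """Return the percentage of bases that are gap characters (N or n).

    sequences: dict mapping sequence name -> sequence string.
    """
    total = sum(len(seq) for seq in sequences.values())
    if total == 0:
        raise ValueError("no bases in input")
    gaps = sum(seq.upper().count("N") for seq in sequences.values())
    return 100.0 * gaps / total


# Toy scaffolds (hypothetical data, not from any cod assembly)
toy = {"scaffold_1": "ACGTNNNNACGT", "scaffold_2": "ACGTACGT"}
print(gap_percentage(toy))
```

In this toy input, 4 of 20 bases are gap characters.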
Fig. 2 The HoxC cluster in gadMor1 and gadMor2. Blocks of dark and light blue are contig sequences, white blocks are gaps and red lines are tandem repeats. Gene models are sketched at the top of the figure. This region is a single contig in gadMor2 and 21 contigs in gadMor1. Tandem repeats are at the borders between almost all gaps and contigs in gadMor1
Comparison between the gene annotations of gadMor1 and gadMor2
| Assembly | Total size transcriptome (Mbp)a | Number of genes | N50 length (bp)b | Amount gap bases (Mbp)c | BUSCOd |
|---|---|---|---|---|---|
| gadMor1 | 32.2 (24.8) | 22 618 e | 1 854 (1 398) | 1.7 | 2 947 (79.7%) |
| gadMor2 | 52.9 (33.4) | 23 246 f | 3 239 (1 995) | 0 | 2 714 (73.4%) |
aSum of bases in transcripts with UTRs (without UTRs)
bHalf the transcriptome is in sequences of this length or longer, with UTRs (without UTRs)
cGaps represented as ’N’s in annotated transcripts
dNumber (percentage) of conserved actinopterygii genes detected out of a total of 3,698
eWhen excluding pseudogenes, alternative transcripts, etc., the number of protein-coding genes is 20,095
fProtein-coding genes only
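Footnote b gives the working definition of N50 used in these tables: half of the total sequence lies in sequences of this length or longer. A minimal sketch of that calculation follows, assuming a plain list of sequence lengths as input; the function name `n50` and the toy lengths are illustrative and not taken from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("empty input")


# Toy lengths (hypothetical): total 30, so half is 15;
# 10 + 8 = 18 is the first running sum to reach 15.
print(n50([2, 4, 6, 8, 10]))
```

The same function applies to contigs, scaffolds or transcripts; only the input lengths change.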
Comparison of the SNP and indel rates of selected organisms
| Species | SNP rate (SNPs/base) | Indel rate (indels/base) | N50 contig (kbp) | N50 scaffold (Mbp) |
|---|---|---|---|---|
| Atlantic cod (gadMor2) | 4.07 × 10⁻³ | 0.98 × 10⁻³ | 116 | 1.15 |
| Sticklebacka | 1.43 × 10⁻³ | NA | 83.2 | 10.8 |
| Miiuy croakerb | 2.24 × 10⁻³ | 0.61 × 10⁻³ | 73.3 | 1.15 |
| Atlantic herringc | 3.2 × 10⁻³ | NA | 21.3 | 1.84 |
| | 46 × 10⁻³ | NA | 12 | 0.192 |
| | 46 × 10⁻³ | NA | 47 | 0.989 |
aFrom [68]
bFrom [67]
cFrom [69]
dFrom [66]
eFrom [66], with haplotype assembly and merging
The repeat content of the Atlantic cod genome assembly
| Repeat | Number of elements | Coverage (Mbp) | Coveragea (%) |
|---|---|---|---|
| LINEs | 64 344 | 18.4 | 2.86 |
| LTR elements | 81 087 | 22.3 | 3.47 |
| DNA elements | 269 835 | 46.5 | 7.23 |
| Unclassified | 215 676 | 59.2 | 9.21 |
| Total interspersed repeatsb | 636 132 | 147.1 | 22.86 |
| Tandem repeats | 582 198 | 51.2 | 7.96 |
aGroups of elements covering less than 1% of the genome assembly are not shown
bThis is the sum of all annotated interspersed repeats, including the first four rows plus SINEs
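The percentage column appears to be the coverage in Mbp divided by the total assembly size. The sketch below checks this arithmetic under one assumption that the source does not state explicitly: that the denominator is the 643 Mbp gadMor2 assembly size from the assembly statistics table. The names `ASSEMBLY_MBP` and `coverage_percent` are illustrative.

```python
# Assumed denominator: gadMor2 total assembly size in Mbp
ASSEMBLY_MBP = 643

# Coverage values (Mbp) copied from the repeat content table
coverage_mbp = {
    "LINEs": 18.4,
    "LTR elements": 22.3,
    "DNA elements": 46.5,
    "Unclassified": 59.2,
    "Tandem repeats": 51.2,
}


def coverage_percent(mbp, assembly_mbp=ASSEMBLY_MBP):
    """Coverage as a percentage of the assembly, rounded to two decimals."""
    return round(100.0 * mbp / assembly_mbp, 2)


for name, mbp in coverage_mbp.items():
    print(name, coverage_percent(mbp))
```

Under this assumption the five rows listed reproduce the tabulated percentages (2.86, 3.47, 7.23, 9.21 and 7.96). The total row does not match exactly (147.1 Mbp gives 22.88 rather than 22.86), presumably because the Mbp values in the table are themselves rounded.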
Fig. 3 The density of TRs and the size of the assembly for different cod assemblies. The different assemblies (black) are all similar in size, around 650 Mbp, with the exception of the much larger gadMor1, while the amount of sequence in contigs in the different assemblies (grey) differs substantially. The vertical distance between pairs of points for each assembly equals the amount of sequence in gaps
Overview of tandem repeat statistics
| Assembly | Total size assembly (Mbp) | Number of TRs | Mean length ± standard deviation (bp) | Density of TRs (% of assembly) |
|---|---|---|---|---|
| gadMor1 | 832 | 970 798 | 56.50 ±45.17 | 8.75 |
| ALPILM | 660 | 530 801 | 49.64 ±53.64 | 5.41 |
| NEWB454 | 656 | 601 043 | 60.35 ±62.72 | 7.01 |
| CA454ILM | 647 | 921 184 | 73.43 ±97.89 | 10.2 |
| CA454PB | 682 | 890 967 | 86.01 ±130.64 | 10.6 |
| gadMor2 | 643 | 876 691 | 84.32 ±121.86 | 10.9 |
Fig. 4 The number of tandem repeats categorized by unit size. Only tandem repeats with a unit size of 1-20 bp are shown. A unit size of one indicates a mononucleotide tandem repeat, two a dinucleotide repeat, three a trinucleotide repeat, etc. The horizontal axis denotes the unit size of the repeat, while the vertical axis shows the count of repeats with that unit size
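The categorization described in this caption can be sketched as a simple histogram over repeat unit lengths. The example below assumes tandem repeat annotations are available as `(start, end, unit)` tuples, which is a hypothetical format chosen for illustration and not the output of the tool used in the study; `count_by_unit_size` is likewise an illustrative name.

```python
from collections import Counter


def count_by_unit_size(repeats, max_unit=20):
    """Count tandem repeats per unit size, keeping unit sizes 1..max_unit.

    repeats: iterable of (start, end, unit) tuples, where unit is the
    repeated motif as a string (e.g. "AC" for a dinucleotide repeat).
    """
    counts = Counter(
        len(unit) for _, _, unit in repeats if 1 <= len(unit) <= max_unit
    )
    # Report every unit size in range, including those with zero repeats
    return {size: counts.get(size, 0) for size in range(1, max_unit + 1)}


# Toy annotations (hypothetical); the last repeat has a 25 bp unit
# and is therefore excluded from the 1-20 bp histogram.
toy = [
    (0, 12, "A"),
    (20, 40, "AC"),
    (50, 80, "GT"),
    (100, 130, "CAG"),
    (200, 500, "A" * 25),
]
histogram = count_by_unit_size(toy)
print(histogram[1], histogram[2], histogram[3])
```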
Fig. 5 The density of TRs in genome assemblies, promoters and coding regions. The assemblies shown here are from Ensembl release 81, excluding gadMor1, plus a human genome based on PacBio data, the California sea hare Aplysia californica and gadMor2 (n = 71). The panels show the density (percentage of bases) of TRs in the whole assembly, coding regions and promoter regions, respectively. The human PacBio assembly is not included in the gene and promoter analysis because it has no annotation, and the opossum is missing because of technical limitations. The species marked are Oc (Ochotona princeps, pika), Hs (Homo sapiens, human), Hs(PB) (Homo sapiens, human, PacBio based assembly), Cf (Canis familiaris, dog), Do (Dipodomys ordii, kangaroo rat), Xt (Xenopus tropicalis, frog), Pf (Poecilia formosa, Amazon molly), Dr (Danio rerio, zebrafish), Pm (Petromyzon marinus, lamprey), Sc (Saccharomyces cerevisiae, yeast), Ac (Aplysia californica, California sea hare) and Gm (Gadus morhua, Atlantic cod, gadMor2)
Fig. 6 The intersections between contig termini and different annotated features. The percentage of contig termini (the position of the terminal nucleotides of each contig) intersecting different annotations of the genome
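The quantity plotted here, the percentage of contig termini that intersect an annotation, can be sketched as a point-in-interval test. The example below is a minimal illustration under stated assumptions, not the authors' method: termini and features are on a single sequence, features are half-open `(start, end)` intervals, and the function name `percent_termini_in_features` is hypothetical.

```python
from bisect import bisect_right


def percent_termini_in_features(termini, features):
    """Percentage of terminal positions that fall inside any feature.

    termini: list of integer positions on one sequence.
    features: list of half-open (start, end) intervals on the same sequence.
    """
    if not termini:
        raise ValueError("no termini given")
    # Merge overlapping intervals so one binary search per position suffices
    merged = []
    for start, end in sorted(features):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [start for start, _ in merged]
    hits = 0
    for pos in termini:
        # Rightmost merged interval whose start is <= pos, if any
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            hits += 1
    return 100.0 * hits / len(termini)


# Toy data (hypothetical): three of the four termini lie inside a feature
termini = [0, 99, 100, 249]
features = [(90, 110), (105, 120), (240, 260)]
print(percent_termini_in_features(termini, features))
```

For a real assembly the same test would be run per sequence and per annotation type (for example tandem repeats or interspersed repeats), which is where a dedicated interval library would be more practical than this sketch.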