Stefan E Seemann, Christian Anthon, Oana Palasca, Jan Gorodkin.
Abstract
The era of high-throughput sequencing has made it relatively simple to sequence the genomes and transcriptomes of individuals from many species. Analyzing the resulting sequencing data requires high-quality reference genome assemblies. Producing them is still a major challenge, and many domesticated animal genomes need to be sequenced more deeply to yield high-quality assemblies. Meanwhile, ironically, the amount of RNAseq and other next-generation data produced frequently far exceeds that of the genomic sequence. Furthermore, basic comparative analysis is often hampered by the lack of genomic sequence. Herein, we quantify the quality of the genome assemblies of 20 domesticated animals and related species by assessing a range of measurable parameters, and we show that there is a positive correlation between the fraction of mappable reads from RNAseq data and genome assembly quality. We rank the genomes by their assembly quality and discuss the implications for genotype analyses.
Keywords: assembly quality; domesticated animals; genome assembly
Year: 2016 PMID: 27279738 PMCID: PMC4898645 DOI: 10.4137/BBI.S29333
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Discrepancy between phylogeny and gene annotation.
Notes: (A) The phylogenetic tree of the 21 investigated species is shown, with a clear separation between placental mammals and birds. The tree is a subset of the UCSC-generated 100-way tree. (B) A UCSC genome browser view in human of the genomic region around PROZ. PROZ is missing in the pig assembly susScr102 and in the phylogenetic subtree around dog, but the gene is conserved in the phylogenetic subtree of pig and even in the more distant birds.
Table 1. Genome assembly quality features of human, domesticated animals, and related species.
Feature set sizes: PCE, 4,856 exons; UC, 473 loci; BUSCO, 3,023 genes; tRNA, 21 aa.

| Species | Assembly version | Year | Size [Mbp] | PCE D [#] | PCE P [#] | PCE S [#] | PCE W [#] | UC D [#] | UC P [#] | UC S [#] | BUSCO C [%] | BUSCO CD [%] | BUSCO F [%] | BUSCO M [%] | rRNA [score] | tRNA [#] | Gaps [#] | N50 [kbp] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Human | hg19 | 2009 | 3,137 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 90 | 1.7 | 5.1 | 4.5 | 8 | 20 | 411 | 46,396 |
| Mouse | mm10 | 2012 | 2,731 | 17 | 23 | 2 | 5 | 0 | 0 | 0 | 91 | 2.2 | 4.8 | 3.8 | 3 | 21 | 582 | 52,589 |
| Panda | ailMel1 | 2009 | 2,300 | 30 | 34 | 11 | 44 | 0 | 0 | 0 | 88 | 0.5 | 8.0 | 3.4 | 0 | 21 | 108,147 | 1,282 |
| Cow | bosTau8 | 2009 | 2,670 | 23 | 26 | 6 | 24 | 3 | 1 | 0 | 84 | 1.8 | 8.6 | 6.9 | 6 | 21 | 72,051 | 6,380 |
| Dog | canFam3 | 2011 | 2,411 | 33 | 22 | 9 | 24 | 6 | 2 | 0 | 89 | 2.0 | 6.3 | 4.2 | 8 | 20 | 23,876 | 45,877 |
| Domestic goat | capHir1 | 2013 | 2,636 | 126 | 91 | 23 | 79 | 3 | 1 | 0 | 79 | 1.0 | 12 | 8.2 | 0 | 21 | 260,474 | 14,391 |
| Horse | equCab2 | 2007 | 2,485 | 76 | 39 | 12 | 27 | 2 | 4 | 1 | 86 | 0.5 | 8.9 | 4.0 | 8 | 21 | 55,283 | 46,750 |
| Hedgehog | eriEur2 | 2012 | 2,716 | 63 | 28 | 3 | 85 | 2 | 1 | 0 | 86 | 1.4 | 8.5 | 5.3 | 0 | 21 | 219,764 | 3,265 |
| Cat | felCat8 | 2014 | 2,641 | 34 | 28 | 7 | 22 | 2 | 0 | 0 | 88 | 0.7 | 7.3 | 4.3 | 4 | 21 | 100,040 | 18,072 |
| Ferret | musFur1 | 2011 | 2,411 | 52 | 26 | 7 | 53 | 2 | 1 | 0 | 89 | 1.0 | 6.5 | 3.8 | 0 | 20 | 109,700 | 9,335 |
| Microbat | myoLuc2 | 2010 | 2,035 | 162 | 36 | 13 | 93 | 7 | 5 | 0 | 83 | 3.9 | 9.1 | 7.4 | 4 | 20 | 61,131 | 4,293 |
| Sheep | oviAri31 | 2012 | 2,619 | 67 | 77 | 14 | 54 | 0 | 0 | 1 | 81 | 1.2 | 11 | 7.2 | 3 | 21 | 125,067 | 100,080 |
| Megabat | pteVam2 | 2014 | 2,198 | 65 | 29 | 17 | 110 | 0 | 0 | 1 | 87 | 0.7 | 7.7 | 4.8 | 4 | 20 | 189,339 | 5,954 |
| Shrew | sorAra2 | 2012 | 2,423 | 117 | 33 | 10 | 66 | 6 | 0 | 0 | 85 | 1.2 | 7.6 | 6.9 | 0 | 21 | 188,953 | 22,794 |
| Pig | susScr102 | 2011 | 2,809 | 210 | 81 | 28 | 213 | 25 | 12 | 1 | 69 | 2.4 | 12 | 17 | 7 | 20 | 238,439 | 576 |
| Dolphin | turTru2 | 2012 | 2,552 | 24 | 41 | 23 | 469 | 1 | 3 | 8 | 72 | 1.5 | 14 | 13 | 4 | 21 | 313,713 | 116 |
| Alpaca | vicPac2 | 2013 | 2,172 | 48 | 33 | 19 | 56 | 4 | 0 | 0 | 87 | 1.2 | 7.9 | 4.1 | 0 | 21 | 174,225 | 7,264 |
| Zebra finch | taeGut2 | 2013 | 1,232 | 816 | 125 | 15 | 81 | 13 | 12 | 6 | 77 | 2.0 | 8.7 | 13 | 4 | 20 | 87,710 | 8,237 |
| Mallard duck | anaPla1 | 2013 | 1,105 | 979 | 117 | 19 | 119 | 11 | 14 | 2 | 72 | 0.7 | 10 | 16 | 0 | 20 | 125,115 | 1,234 |
| Chicken | galGal4 | 2011 | 1,047 | 686 | 79 | 8 | 50 | 10 | 7 | 0 | 85 | 0.9 | 5.5 | 8.8 | 4 | 20 | 11,109 | 12,877 |
| Turkey | melGal5 | 2014 | 1,128 | 637 | 96 | 20 | 376 | 7 | 5 | 2 | 74 | 0.5 | 10 | 14 | 0 | 20 | 64,955 | 3,801 |
Notes: rRNA is the completeness of one 45S ribosomal DNA cluster consisting of pRNA, 28S, 5.8S, and 18S rRNAs in exactly this 5′ to 3′ order. tRNA is the occurrence of 21 amino acids (aa). Gaps are 10 or more nucleotides long. Contiguity is the scaffold N50. PCE, UC, tRNA, and Gaps are absolute counts [#], BUSCO is in percentage [%], rRNA is presented as a score, and genome size and N50 are sequence lengths. The assembly version is the UCSC Genome Browser assembly ID.
Assembly level is scaffold, otherwise chromosome.
Abbreviations: PCE, protein-coding exons, which can be D, deleted; P, partially deleted; S, split; or W, in wrong order. UC, ultraconserved elements, which can be D, deleted; P, partially deleted; or S, split. BUSCO, universal single-copy orthologs, which can be C, complete; CD, complete duplicated; F, fragmented; or M, missing.
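The contiguity and gap columns can be illustrated with a short sketch. It assumes that gaps appear in the assembly sequence as runs of `N` characters, which the notes do not state explicitly, and the function names `scaffold_n50` and `count_gaps` are hypothetical rather than taken from the paper.

```python
import re


def scaffold_n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no scaffold lengths given")


def count_gaps(sequence, min_len=10):
    """Count runs of at least min_len N characters.

    Treating N runs as gaps is an assumption; the 10-nucleotide
    minimum follows the table notes.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return len(pattern.findall(sequence))


# Toy input, not real assembly data.
toy_lengths = [10, 20, 30, 40]
toy_sequence = "ACGT" + "N" * 12 + "AC" + "N" * 5 + "GT" + "n" * 10
print(scaffold_n50(toy_lengths), count_gaps(toy_sequence))
```

In the toy input, the run of five `N` characters falls below the 10-nucleotide threshold and is therefore not counted as a gap.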
Figure 2. Relationship between principal components and quality features.
Notes: The first three principal components (PCs) account for 75% of the feature variance (PC1: 47.1%, PC2: 17.2%, and PC3: 10.8%). Rectangular nodes describe the 15 quality features. The edge weight describes how much variance of a feature is explained by the principal component. Green edges connect features negatively related to genome quality, and purple edges connect features positively related to genome quality. Relations (edges) are shown if greater than 0.3 or smaller than −0.3. See Table 1 for abbreviations of the quality features. Gaps are normalized by assembly size. The figure was made using the R qgraph package.
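The arithmetic behind these notes can be sketched as follows. The variance percentages and the 0.3 cutoff come from the notes, and the gap counts and assembly sizes come from the table rows for human and pig. The normalization unit (gaps per Mbp), the helper names, and the example edge weights are assumptions made only for illustration.

```python
# Explained variance per principal component, as stated in the notes.
explained = {"PC1": 47.1, "PC2": 17.2, "PC3": 10.8}
cumulative = round(sum(explained.values()), 1)  # reported as 75% in the notes


def gaps_per_mbp(gap_count, size_mbp):
    """Normalize a gap count by assembly size (unit is an assumption)."""
    return gap_count / size_mbp


def visible_edges(weights, cutoff=0.3):
    """Keep (feature, PC) relations whose absolute weight exceeds the cutoff."""
    return {key: w for key, w in weights.items() if abs(w) > cutoff}


# Gap counts and sizes taken from the table: human (hg19), pig (susScr102).
human = gaps_per_mbp(411, 3137)
pig = gaps_per_mbp(238439, 2809)

# Hypothetical edge weights, invented purely to demonstrate the filter.
example_weights = {
    ("N50", "PC1"): 0.62,
    ("Gaps", "PC1"): -0.71,
    ("tRNA", "PC2"): 0.12,
}
shown = visible_edges(example_weights)
print(cumulative, round(human, 2), round(pig, 2), sorted(shown))
```

Normalizing by size keeps a large assembly from looking worse than a small one merely because it has more sequence in which gaps can occur.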
Figure 3. Correlation of genome assembly quality to (A) phylogenetic distance to human and to (B) frequency of mapped reads from RNAseq experiments.
Notes: The genome assembly quality (PCA score) is measured as the Euclidean distance in principal component 1 (PC1), PC2, and PC3 between human and each of the 20 other species. The human genome serves as the reference and has a Euclidean distance of zero. RNAseq experiments are divided into polyA-selected RNA (blue circle) and total RNA (green diamond), and the mean and standard deviation of mapped reads are shown. After removing human (reference) and bird genomes (large phylogenetic distance), the Pearson correlation coefficient between assembly quality and phylogenetic distance is 0.26, between assembly quality and polyA-selected RNA mapped reads is 0.91, and between polyA-selected RNA mapped reads and phylogenetic distance is 0.43. Only the correlation between assembly quality and polyA-selected RNA mapped reads is significant (P < .005).
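A minimal sketch of the two quantities described in these notes is given below: the Euclidean distance to the reference in (PC1, PC2, PC3) space and the Pearson correlation coefficient. The function names and the coordinates in the usage lines are hypothetical, and no values from the study are reproduced.

```python
import math


def pca_score(species_pc, reference_pc):
    """Euclidean distance between two points in (PC1, PC2, PC3) space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(species_pc, reference_pc)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical coordinates: the reference sits at the origin here only
# to keep the toy example easy to follow.
reference = (0.0, 0.0, 0.0)
print(pca_score(reference, reference))        # the reference scores zero
print(pca_score((3.0, 4.0, 0.0), reference))  # distance of a toy species
```

By construction the reference genome always receives a score of zero, which matches the notes' statement about human.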