Michael P Dunne1, Steven Kelly2.
Abstract
BACKGROUND: The accurate determination of the genomic coordinates for a given gene (its gene model) is of vital importance to the utility of its annotation and to the accuracy of bioinformatic analyses derived from it. Currently available methods of computational gene prediction, while on the whole successful, frequently disagree on the model for a given predicted gene, with some or all of the variant gene models often failing to match the biologically observed structure. Many prediction methods can be bolstered by using experimental data such as RNA-seq. However, these resources are not always available, and rarely give a comprehensive portrait of an organism's transcriptome due to temporal and tissue-specific expression profiles.
Keywords: Annotation errors; Gene model; Genome annotation; Orthogroups; Orthology
Year: 2018 PMID: 29703150 PMCID: PMC5923031 DOI: 10.1186/s12864-018-4704-z
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Species sets used for algorithm validation
| Species Name | Source | Version/Strain | Taxonomy ID | References |
|---|---|---|---|---|
| Plants | | | | |
| | JGI¹ | TAIR10 | 3702 | [ |
| | JGI | v1.3 | 3711 | [ |
| | JGI | ASGPBv0.4 | 3649 | [ |
| | JGI | v1.0 | 81985 | [ |
| | JGI | v1.1 | 3641 | [ |
| Mammals | | | | |
| | NCBI² | CanFam3.1/AR105 | 6915 | [ |
| | CCDS³ | GRCh38.p7/CCDS20 | 9606 | [ |
| | NCBI | MonDom5/AR103 | 13616 | [ |
| | CCDS | GRCm38.p4/CCDS21 | 10090 | [ |
| | NCBI | OryCun2.0/AR102 | 568996 | [ |
| Fungi | | | | |
| | JGI | | 284811 | [ |
| | JGI | | 284592 | [ |
| | JGI | | 284590 | [ |
| | SGD⁴ | | 559292 | [ |
| | JGI | | 284591 | [ |

¹ Joint Genome Institute; ² National Centre for Biotechnology Information; ³ Consensus Coding Sequence Project; ⁴ Saccharomyces Genome Database
SRA RNA-seq data sources
| Species | SRA ID | Instrument/details |
|---|---|---|
| Plants | | |
| | SRR3932355 | Illumina HiSeq 2500, paired end. Wild type Columbia |
| | SRR2984945 | Illumina HiSeq 2000, paired end. ga-deficient dwarf (gad1–2) + GA rep2 |
| | SRR3509576 | Illumina HiSeq 2500, paired end. SunUp/Sunset cultivar, young hermaphrodite leaf |
| | SRR3993756 | Illumina HiSeq 2000, paired end. Leaf sample |
| | SRR3217315 | Illumina HiSeq 2000, paired end. Flower/leaf sample |
| Mammals | | |
| | ERR266386 | Illumina Genome Analyzer II, paired end, brain frontal cortex, male |
| | ERR266355 | Illumina Genome Analyzer II, paired end, brain frontal cortex, female |
| | ERR266382 | Illumina Genome Analyzer II, paired end, brain frontal cortex, male |
| | SRR5938455 | Illumina HiSeq 2000, paired end, dorsolateral prefrontal cortex, male |
| | SRR500906 | Illumina HiSeq 2000, paired end, brain |
| | SRR500925 | Illumina HiSeq 2000, paired end, brain |
| | SRR5441717 | Illumina HiSeq 2000, paired end, brain (striatum) |
| | SRR6269591 | Illumina NovaSeq 6000, paired end, cerebellum |
| | ERR266399 | Illumina Genome Analyzer II, paired end, brain frontal cortex, female |
| | SRR400990 | Illumina Genome Analyzer II, paired end, brain frontal cortex |
| | SRR401040 | Illumina Genome Analyzer II, paired end, brain frontal cortex |
| | SRR401041 | Illumina Genome Analyzer II, paired end, brain frontal cortex |
| | SRR401042 | Illumina Genome Analyzer II, paired end, brain frontal cortex |
| Fungi | | |
| | SRR1200528 | Illumina Genome Analyzer II, single |
| | SRR539284 | Illumina HiSeq 2000, paired end |
| | SRR868669 | Illumina HiSeq 2000, single |
Fig. 1 Multipartite choice function. The choice function aims to find optimal variants from a set of protein sequences. a) Sequences are aligned; b) A consensus alignment is produced: column by column, the amino acid choice for each sequence that optimises the alignment score for that column is taken as the representative; c) A binary representation is produced from the original alignment: for each position in the alignment, a 1 is assigned if the amino acid matches the consensus, and a 0 if it does not. This leaves a sequence of vertical binary strings. The aim is to find a single vertical binary string that agrees with (i.e. is a bitwise subset of) as many of these as possible, and that is also compatible with the category constraints. The best such string in this case is shown to the right in green. d) The result
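The selection step described in panel c can be illustrated with a small brute-force sketch. It assumes that the category constraints mean "choose exactly one sequence variant per category" (for example, per gene); that reading, the function names and the toy alignment are illustrative assumptions and not OMGene's actual implementation. A choice agrees with a column when every chosen sequence has a 1 in that column.

```python
from itertools import product


def binary_columns(alignment, consensus):
    """Return one 0/1 tuple per alignment column (1 = matches consensus)."""
    return [
        tuple(1 if seq[i] == consensus[i] else 0 for seq in alignment)
        for i in range(len(consensus))
    ]


def best_choice(alignment, consensus, categories):
    """Pick one sequence index per category, maximising agreeing columns.

    `categories` is a list of lists of row indices (an assumed encoding of
    the category constraints). Exhaustive search, so only for toy inputs.
    """
    columns = binary_columns(alignment, consensus)
    best, best_score = None, -1
    for choice in product(*categories):
        # The chosen rows form a binary string; it is a bitwise subset of
        # a column when every chosen row has a 1 there.
        score = sum(all(col[i] for i in choice) for col in columns)
        if score > best_score:
            best, best_score = choice, score
    return best, best_score


# Toy data: two genes (categories), each with two candidate variants.
alignment = ["MKTA", "MKSA", "MRTA", "MKTA"]
consensus = "MKTA"
categories = [[0, 1], [2, 3]]
choice, score = best_choice(alignment, consensus, categories)
print(choice, score)
```

Here rows 0 and 3 are selected because they match the consensus in all four columns. A real data set would need a smarter search than full enumeration.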
Fig. 2 Calculation of adjacency groups. a) Amino acid sequences for individual putative exons are strung together and aligned. b) A graph is formed whose vertices are gene parts (or exons), with an edge drawn when the overlap between two parts is greater than or equal to one third of the length of one of them. c) Cliques are extracted and then ordered lexicographically to form the adjacency groups
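The graph construction and clique extraction in panels b and c can be sketched as follows. This is a minimal illustration under stated assumptions: gene parts are represented as half-open `(start, end)` intervals in alignment coordinates, "one third the length of one of them" is read as one third of the shorter part, and maximal cliques are found with a plain Bron-Kerbosch recursion. None of these names or choices are taken from the OMGene code.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def build_graph(parts):
    """Connect two parts when their overlap is >= 1/3 of the shorter one."""
    adj = {i: set() for i in range(len(parts))}
    for i in range(len(parts)):
        for j in range(i + 1, len(parts)):
            ov = overlap(parts[i], parts[j])
            shorter = min(parts[i][1] - parts[i][0], parts[j][1] - parts[j][0])
            if ov > 0 and 3 * ov >= shorter:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def adjacency_groups(parts):
    """Maximal cliques of the overlap graph, in lexicographic order."""
    adj = build_graph(parts)
    cliques = []

    def expand(r, p, x):
        # Bron-Kerbosch without pivoting: r is the current clique,
        # p the candidates, x the already-processed vertices.
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Toy gene parts in alignment coordinates (hypothetical values).
parts = [(0, 30), (5, 32), (28, 60), (40, 90), (85, 120)]
groups = adjacency_groups(parts)
print(groups)
```

Parts 0 and 1 overlap heavily and form one group, parts 2 and 3 form another, and part 4 stands alone because its 5-column overlap with part 3 is below the threshold.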
Fig. 3 Simplified overview of the OMGene workflow. a) Gene regions are extracted from around the gene model; b) Exonerate is used to cross-align all constituent exons and full open reading frames to construct basic prototype gene models; c) The exonic regions from these prototype gene models are sorted into adjacency groups, which are then sequentially optimised using the multipartite choice function; d) Results are compared against the original gene models to incorporate potentially overlooked combinations, and filtered under various criteria to produce results
Fig. 4 Examples of individual gene model changes for genes in A. thaliana. a) AT1G01320.1, orthogroup OG0010924, exon extension, splice acceptor side; b) AT1G76280.3.TAIR10, orthogroup OG10336, exon contraction, splice acceptor side; c) AT1G22860.1, orthogroup OG0010738, novel exon introduced; d) AT2G38720.1, orthogroup OG0009331, removed exon; e) AT3G01980.3, orthogroup OG0011814, novel intron introduced; f) AT4G14590.1, orthogroup OG0010029, intron removed; g) AT3G01380.1, orthogroup OG0012127, moved start codon; h) AT5G11490.2, orthogroup OG0013306, complex event: exon has been removed and the previous exon boundary has been extended to include the stop codon
Per-species gene change breakdown
| Species | No. changed genes | Nucleotides added (mean per change) | Nucleotides removed (mean per change) | Net nucleotides (mean per change) | In original annotation as alternative “non-primary” gene model |
|---|---|---|---|---|---|
| Plants | | | | | |
| | 175 | 1749 (43) | −23,747 (−118) | −22,139 (−92) | 53 (30.3%) |
| | 97 | 1787 (58) | −25,740 (−250) | −23,953 (−179) | 4 (4.1%) |
| | 540 | 23,820 (65) | −72,053 (−128) | −48,233 (−52) | 0 (0.0%) |
| | 298 | 6568 (71) | −55,005 (−170) | −48,437 (−117) | 2 (0.7%) |
| | 556 | 3700 (45) | −120,984 (−118) | −117,284 (−124) | 95 (17.1%) |
| TOTAL | 1666 | 37,624 (61) | −297,529 (−145) | −259,905 (−97) | 154 (9.2%) |
| | 598 | 13,623 (42) | −167,038 (−35) | −51,177 (−57) | N/A |
| Mammals | | | | | |
| | 698 | 8993 (64) | −98,467 (−117) | −89,474 (−91) | 375 (53.7%) |
| | 397 | 4429 (59) | −41,401 (−101) | −36,972 (−76) | 218 (54.9%) |
| | 787 | 8637 (53) | −100,256 (−101) | −91,619 (−80) | 349 (44.3%) |
| | 270 | 9685 (120) | −19,236 (−79) | −9551 (−29) | 81 (30.0%) |
| | 534 | 12,038 (61) | −72,398 (−112) | −60,360 (−71) | 243 (45.5%) |
| TOTAL | 2686 | 43,782 (67) | −331,758 (−106) | −287,976 (−76) | 1266 (47.1%) |
| | 2907 | 251,344 (79) | −952,864 (−167) | −701,520 (−79) | N/A |
| Fungi | | | | | |
| | 46 | 0 (0) | −4338 (−93) | −4338 (−93) | N/A |
| | 13 | 0 (0) | −2080 (−149) | −2080 (−149) | N/A |
| | 11 | 0 (0) | −1314 (−110) | −1314 (−110) | N/A |
| | 11 | 93 (93) | −2483 (−191) | −2390 (−170) | N/A |
| | 23 | 117 (29) | −4186 (−199) | −4069 (−163) | N/A |
| TOTAL | 104 | 210 (42) | −14,401 (−135) | −14,191 (−127) | N/A |
| | 19 | 601 (120) | −5561 (−347) | −4960 (−236) | N/A |
Fig. 5 Chart showing the number of changes made. a) C. papaya and T. cacao experienced the most changes in the plant data set. The de novo predicted gene models for the A. thaliana genome underwent three times more changes than the publicly available ones. b) The number of changes in mammals was roughly the same as in plants. As expected, M. musculus and H. sapiens experienced the fewest changes. The de novo predicted gene models for H. sapiens underwent considerably more changes than the curated genome annotation. c) Far fewer changes were made in the fungi data set. As for the plants and mammals, the de novo predicted genes for S. cerevisiae underwent more changes than the curated version
Fig. 6 The average number of nucleotides added or removed from gene models as a result of changes made by OMGene. The units of the y-axis are the number of nucleotides. a) Average magnitudes of each change for plants; b) Average magnitudes for changes made to mammal genes; c) Average magnitudes for changes made to fungal genes. In all cases, predicted gene sequences were shortened to a greater extent than they were lengthened.
Summary of gene model change categories
| Species | No. changes | Exon boundary contraction | Exon boundary extension | Exon add | Exon del | Intron add | Intron del | Moved start |
|---|---|---|---|---|---|---|---|---|
| Plants | | | | | | | | |
| | 242 | 47 | 23 | 4 | 117 | 5 | 13 | 33 |
| | 134 | 11 | 14 | 9 | 56 | 3 | 8 | 33 |
| | 928 | 148 | 205 | 95 | 345 | 18 | 42 | 74 |
| | 415 | 32 | 32 | 39 | 101 | 1 | 19 | 191 |
| | 949 | 117 | 59 | 9 | 624 | 10 | 13 | 117 |
| TOTAL | 2668 | 355 | 333 | 156 | 1243 | 37 | 95 | 448 |
| | 1344 | 151 | 255 | 49 | 780 | 2 | 10 | 97 |
| Mammals | | | | | | | | |
| | 980 | 116 | 72 | 38 | 418 | 9 | 4 | 323 |
| | 485 | 31 | 43 | 29 | 223 | 0 | 0 | 159 |
| | 1148 | 122 | 89 | 56 | 472 | 19 | 3 | 387 |
| | 324 | 20 | 29 | 42 | 111 | 6 | 6 | 110 |
| | 834 | 139 | 122 | 43 | 245 | 18 | 9 | 258 |
| TOTAL | 3771 | 428 | 355 | 208 | 1469 | 52 | 22 | 1237 |
| | 8883 | 606 | 1128 | 1866 | 4217 | 16 | 17 | 1033 |
| Fungi | | | | | | | | |
| | 46 | 0 | 0 | 0 | 1 | 0 | 0 | 45 |
| | 13 | 0 | 0 | 0 | 1 | 0 | 0 | 12 |
| | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 11 |
| | 13 | 1 | 0 | 0 | 0 | 1 | 1 | 10 |
| | 24 | 0 | 0 | 0 | 4 | 5 | 0 | 15 |
| TOTAL | 107 | 1 | 0 | 0 | 6 | 6 | 1 | 93 |
| | 20 | 0 | 2 | 0 | 4 | 0 | 2 | 12 |
Fig. 7 Distribution of types of changes made in the three data sets. a) The most common change in plants was exon deletion. b) Moved start codons and removed exons were most common in mammals. c) In fungi, the most common change was overwhelmingly a moved start codon
RNA-seq coverage and junction F-scores
| Species | Junction F-score: better | Junction F-score: worse | Coverage F-score: better | Coverage F-score: worse |
|---|---|---|---|---|
| Plants | | | | |
| | 94 (87.8%) | 13 (12.1%) | 109 (91.5%) | 10 (8.4%) |
| | 24 (63.1%) | 14 (36.8%) | 29 (56.8%) | 22 (43.1%) |
| | 246 (82.2%) | 53 (17.7%) | 344 (83.9%) | 66 (16.0%) |
| | 90 (89.1%) | 11 (10.8%) | 186 (91.6%) | 17 (8.3%) |
| | 275 (88.9%) | 34 (11.0%) | 358 (87.3%) | 52 (12.6%) |
| TOTAL | 729 (85.3%) | 125 (14.6%) | 1026 (86.0%) | 167 (13.9%) |
| | 422 (87.3%) | 61 (12.6%) | 475 (91.1%) | 46 (8.8%) |
| Mammals | | | | |
| | 323 (85.9%) | 53 (14.1%) | 478 (86.28%) | 76 (13.72%) |
| | 102 (82.9%) | 21 (17.1%) | 258 (87.46%) | 37 (12.54%) |
| | 239 (72.2%) | 92 (27.8%) | 439 (76.48%) | 135 (23.52%) |
| | 71 (62.3%) | 43 (37.7%) | 133 (71.12%) | 54 (28.88%) |
| | 213 (84.2%) | 40 (15.8%) | 353 (86.10%) | 57 (13.90%) |
| TOTAL | 948 (79.2%) | 249 (20.8%) | 1661 (82.23%) | 359 (17.77%) |
| | 832 (89.4%) | 99 (10.6%) | 2017 (88.74%) | 256 (11.26%) |
| Fungi | | | | |
| | 0 (N/A) | 0 (N/A) | 9 (100.0%) | 0 (0%) |
| | 0 (N/A) | 0 (N/A) | 6 (75.0%) | 2 (25.0%) |
| | 2 (28.5%) | 5 (71.4%) | 11 (64.7%) | 6 (35.2%) |
| TOTAL | 3 (37.5%) | 5 (62.5%) | 30 (75.0%) | 10 (25.0%) |
| | 4 (100%) | 0 (0%) | 11 (64.7%) | 6 (35.2%) |
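The "better" and "worse" counts above compare an F-score before and after a gene model is changed. The record does not spell out how the scores were calculated, so the sketch below simply assumes the standard definition (harmonic mean of precision and recall) applied to sets of splice junctions; the coordinates and the function name are hypothetical.

```python
def f_score(predicted, observed):
    """Harmonic mean of precision and recall between two sets of features."""
    predicted, observed = set(predicted), set(observed)
    true_pos = len(predicted & observed)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(observed)
    return 2 * precision * recall / (precision + recall)


# Hypothetical (start, end) intron junctions supported by RNA-seq reads.
rnaseq_junctions = {(100, 200), (300, 400), (700, 800)}
original_model = {(100, 200), (350, 400)}   # one junction misplaced
revised_model = {(100, 200), (300, 400)}    # junction corrected

before = f_score(original_model, rnaseq_junctions)
after = f_score(revised_model, rnaseq_junctions)
print(f"before={before:.2f} after={after:.2f}")
```

In this toy case the revised model would be counted as "better" because its junction F-score rises from 0.40 to 0.80. A coverage F-score could be computed the same way over covered nucleotide positions instead of junctions.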
Subcellular localisation predictions
| Category | | No. orthogroups with changed localisation predictions | Entropy score: better | Entropy score: same | Entropy score: worse |
|---|---|---|---|---|---|
| Plants | Public data | 55 | 42 (76.4%) | 5 (9.1%) | 8 (14.5%) |
| Plants | | 11 | 9 (81.8%) | 0 (0%) | 2 (18.2%) |
| Mammals | Public data | 509 | 444 (87.2%) | 19 (3.7%) | 46 (9.0%) |
| Mammals | | 527 | 458 (86.9%) | 23 (4.4%) | 46 (8.7%) |
| Fungi | Public data | 7 | 6 (85.7%) | 0 (0%) | 1 (14.3%) |
| Fungi | | 1 | 0 (0%) | 0 (0%) | 1 (100%) |
Fig. 8 Example change in subcellular localisation prediction for a gene. Thecc1EG021604t1.CGDv1.1 from T. cacao has undergone a change in start codon, revealing a signal peptide at its 5′ end. In this case, what was previously assumed to be cytosolic has been found to be targeted to the secretory pathway, the same as the other members of the orthogroup (OG0009265). In this case, the Shannon entropy score for the orthogroup has fallen from 0.72 to 0
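The entropy score in this example can be reproduced with a short calculation. The sketch assumes base-2 Shannon entropy over the localisation labels predicted for the members of an orthogroup, and it assumes a five-member orthogroup with one cytosolic and four secretory predictions before the change; that split is not stated in the caption but is consistent with the reported value of 0.72.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Base-2 Shannon entropy of a list of categorical labels."""
    counts = Counter(labels)
    n = len(labels)
    # p * log2(1/p) summed over labels; zero when all labels agree.
    return sum((c / n) * log2(n / c) for c in counts.values())


# Assumed predictions for a five-member orthogroup (illustrative only).
before_change = ["cytosol", "secretory", "secretory", "secretory", "secretory"]
after_change = ["secretory"] * 5

entropy_before = shannon_entropy(before_change)
entropy_after = shannon_entropy(after_change)
print(round(entropy_before, 2), entropy_after)
```

A lower score means the localisation predictions within the orthogroup agree more closely, which is why a fall to 0 is counted as an improvement.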