Patrick J Monnahan1,2,3, Jean-Michel Michno1, Christine O'Connor1,2, Alex B Brohammer1, Nathan M Springer3, Suzanne E McGaugh2, Candice N Hirsch4.
Abstract
BACKGROUND: Advances in sequencing technologies have led to the release of reference genomes and annotations for multiple individuals within more well-studied systems. While each of these new genome assemblies shares significant portions of synteny with the others, the annotated structure of gene models within these regions can differ. Of particular concern are split-gene misannotations, in which a single gene is incorrectly annotated as two distinct genes or two genes are incorrectly annotated as a single gene. These misannotations can have major impacts on functional prediction, estimates of expression, and many downstream analyses.
Keywords: Annotation; Genome assembly; Maize; Split-gene
Year: 2020 PMID: 32264824 PMCID: PMC7140576 DOI: 10.1186/s12864-020-6696-8
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Identifying syntenic homologs and isolating split-gene candidates. a Homology classifications from syntenic homology pipeline. b Schematic for calculation of tandem duplicate percentage. We require the ratio of L1 to L2 to be < 0.1 (i.e. the proportionate overlap of the BLAST query genes with respect to the total aligned space of the subject gene). c Summary of homology classifications and split-gene candidate filtration. A ‘Testable candidate’ is one in which all of the genes involved are expressed. d Corroboration of testable candidates. E.g. 43 ‘Corroborated’ split-gene candidates in the B73 annotation (‘B73 - Split’) were simultaneously identified as a single gene in W22 and PH207, while there were 61 genes in B73 that corresponded to multiple genes in both PH207 and W22 (‘B73 - Merged’), and the 438 ‘Unique’ split-gene candidates in B73 were identified as a single gene in W22 or PH207
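The overlap criterion in panel b can be illustrated with a minimal sketch. It assumes that L1 is the length of overlap between the two query-gene alignments on the subject gene and that L2 is the total subject space covered by either alignment; the caption does not spell out these definitions in full, so both are interpretations. The function names and the half-open `(start, end)` interval convention are hypothetical.

```python
def overlap_ratio(query_a, query_b):
    """Return L1 / L2 for two alignments given as half-open (start, end)
    intervals in subject-gene coordinates.

    Assumption: L1 = overlap of the two alignments,
                L2 = total space covered by either alignment (their union).
    """
    (s1, e1), (s2, e2) = query_a, query_b
    if e1 <= s1 or e2 <= s2:
        raise ValueError("intervals must have positive length")
    l1 = max(0, min(e1, e2) - max(s1, s2))   # overlapping bases
    l2 = (e1 - s1) + (e2 - s2) - l1          # union of aligned bases
    return l1 / l2


def passes_overlap_filter(query_a, query_b, max_ratio=0.1):
    """Keep a candidate only if L1 / L2 is strictly below the 0.1 cutoff."""
    return overlap_ratio(query_a, query_b) < max_ratio


# Two query genes aligning to nearly disjoint parts of the subject gene
print(passes_overlap_filter((0, 100), (95, 200)))   # small overlap
# Two query genes aligning to largely the same region (tandem-duplicate-like)
print(passes_overlap_filter((0, 100), (50, 150)))   # large overlap
```

Under these assumptions, a pair of query genes whose alignments barely overlap is retained as a split-gene candidate, whereas a pair aligning to much the same subject region is filtered out.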
Fig. 2 M2f approach for determining correct gene model(s) for split-gene candidates. a Calculating average normalized expression across exons within a tissue for a pair of split-gene genes. b M2f calculation. The absolute log2-fold change in average expression (from a) across the split-genes is averaged across tissues. Higher values reflect large expression differences across split-genes. c Simulating the M2f distribution under the null hypothesis that split-gene expression differences come from a single underlying gene. Observed M2f values greater than the 90th percentile of this null distribution are unlikely to result if the single gene annotation is correct. d Simulating the M2f distribution under the null hypothesis that split-gene expression differences come from separate, adjacent genes
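The M2f statistic in panel b and the threshold comparison in panel c can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the `pseudocount` argument is an assumption added to avoid taking the log of zero, and the null values are taken as given because the simulation procedure itself is not described in the caption.

```python
import math
from statistics import mean, quantiles


def m2f(expr_a, expr_b, pseudocount=1.0):
    """Mean absolute log2-fold change between two split-genes across tissues.

    expr_a, expr_b: per-tissue average normalized expression across exons,
    in the same tissue order. pseudocount is an assumption, not from the source.
    """
    if len(expr_a) != len(expr_b) or not expr_a:
        raise ValueError("need equal-length, non-empty expression vectors")
    return mean(
        abs(math.log2((a + pseudocount) / (b + pseudocount)))
        for a, b in zip(expr_a, expr_b)
    )


def exceeds_null_90th(observed, null_values):
    """True if the observed M2f is above the 90th percentile of null_values."""
    # quantiles(..., n=10) returns the 9 decile cut points; the last is the 90th percentile
    threshold = quantiles(null_values, n=10, method="inclusive")[-1]
    return observed > threshold


# Toy values: expression flips between the two genes in the two tissues
print(m2f([4.0, 1.0], [1.0, 4.0], pseudocount=0.0))
```

An observed value for which `exceeds_null_90th` returns True against the single-gene null would, following the caption, be unlikely if the single-gene annotation were correct.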
Fig. 3 Results of M2f classification. a Observed M2f distribution across all split-genes detected in each annotation. The dotted lines are the threshold values generated by simulating null distributions in Fig. 2c-d. b Number of split-gene candidates (Multiple genes) classified as to whether the split-genes should be annotated as distinct genes or a single, merged gene for each pairwise comparison of annotations. c Correlation of M2f values for instances where a single gene from one annotation corresponded to split-gene candidates in both of the alternative annotations (‘Corroborated’ Merged genes in Fig. 1d). E.g. each point in the ‘B73 x W22’ comparison corresponds to a single PH207 gene. X-axis is the M2f value from the B73 split-gene candidate, and y-axis is the M2f value from the W22 split-gene candidate. Dotted lines indicate the M2f threshold values in part a. d Joint distribution of classifications across comparisons in part c
Summary of M2f distributions for split-gene candidates in each annotation. CV = coefficient of variation. N = number of tested candidates
| Split-genes | Mean | Median | Variance | CV | N |
|---|---|---|---|---|---|
| B73 | 2.45 | 2.09 | 2.49 | 0.693 | 506 |
| PH207 | 1.64 | 1.2 | 2.07 | 0.88 | 1129 |
| W22 | 2.05 | 1.66 | 2.42 | 0.759 | 614 |
Fig. 4 Features of one-to-one genes as well as split-gene candidates. a Split-gene candidates are classified based on whether they were initially annotated as split or merged for a given genotype followed by the classification based on the M2f method. E.g. the ‘SS’ box for the B73 genotype represents instances where multiple genes in B73 corresponded to a single gene in either PH207 or W22, and the multiple (split) genes of B73 were determined to be the correct annotation. Outliers were removed on all plots. b Length and Distance between genes. c AED calculated from MAKER-P for the B73 and PH207 annotations. For B73, multiple isoforms were annotated, and we took the max AED across all isoforms for a given gene model. d Number of IsoSeq cDNAs for genes in each category. Genes with no IsoSeq support were excluded and shown separately as a proportion on the right. IsoSeq cDNAs were filtered for mapping quality (MQ) > 20 and for coverage of at least 75% of the longest transcript sequence
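The per-gene summaries described in panels c and d can be sketched as below. The record layout (`mapq`, `aligned_len`) and the function names are hypothetical, and coverage is assumed here to mean aligned cDNA length divided by the length of the gene's longest transcript, which the caption does not state explicitly.

```python
def gene_level_aed(isoform_aeds):
    """Collapse per-isoform AED scores to one value per gene by taking the max."""
    if not isoform_aeds:
        raise ValueError("gene has no isoform AED scores")
    return max(isoform_aeds)


def supporting_cdnas(cdnas, longest_transcript_len, min_mq=20, min_cov=0.75):
    """Keep cDNAs with MQ > min_mq that cover at least min_cov of the longest transcript.

    cdnas: list of dicts with hypothetical keys 'mapq' and 'aligned_len'.
    """
    return [
        c for c in cdnas
        if c["mapq"] > min_mq
        and c["aligned_len"] / longest_transcript_len >= min_cov
    ]


# Toy records for a gene whose longest transcript is 1000 bp
cdnas = [
    {"mapq": 60, "aligned_len": 900},   # passes both filters
    {"mapq": 20, "aligned_len": 950},   # fails: MQ is not strictly above 20
    {"mapq": 45, "aligned_len": 600},   # fails: covers only 60%
    {"mapq": 30, "aligned_len": 750},   # passes: exactly 75% coverage
]
print(len(supporting_cdnas(cdnas, longest_transcript_len=1000)))
print(gene_level_aed([0.12, 0.30, 0.08]))
```

A gene for which `supporting_cdnas` returns an empty list would fall into the "no IsoSeq support" group that the caption reports separately.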
Fig. 5 Consequences of split-gene misannotations. a Comparing expression estimates across homologs. For the correct split-gene annotations (Split Supported (SS) in Fig. 4), expression of each split-gene is compared to the one expression value from the single gene to which they corresponded. b Exemplar of differential expression misinference when two distinct genes are incorrectly annotated as one. Expression differences (between immature ear and anther) of component genes cancel out resulting in no differential expression for the single rightmost gene. c Example of misinference for differential exon usage. Incorrect annotation as a single gene in PH207 should be two genes (split at location demarcated with the red X) as annotated in W22. Colored lines indicate separate tissues. d Median p-value across the per-exon tests of differential exon usage for each gene. Inflation of low p-values is observed when distinct genes are incorrectly treated as a single, merged gene (Merged is not supported (MNS) in Fig. 4)