Seungill Kim1,2, Kyeongchae Cheong3, Jieun Park1, Myung-Shin Kim1,3, Jihyun Kim1, Min-Ki Seo1, Geun Young Chae2, Min Jeong Jang2, Hyunggon Mang1, Sun-Ho Kwon4, Yong-Min Kim5, Namjin Koo5, Cheol Woo Min6, Kwang-Soo Kim7, Nuri Oh7, Ki-Tae Kim8, Jongbum Jeon3, Hyunbin Kim3, Yoon-Young Lee9, Kee Hoon Sohn9,10, Honour C McCann11, Sang-Kyu Ye4, Sun Tae Kim6, Kyung-Soon Park7, Yong-Hwan Lee3,8, Doil Choi1,3.
Abstract
Whole-genome annotation errors that omit essential protein-coding genes hinder further research. We developed Target Gene Family Finder (TGFam-Finder), an alternative tool for the structural annotation of protein-coding genes containing target domain(s) of interest in plant genomes. TGFam-Finder required considerably less annotation run-time and achieved higher accuracy than conventional annotation tools. Large-scale re-annotation of 50 plant genomes identified an average of 150, 166 and 86 additional far-red-impaired response 1 (FAR1), nucleotide-binding and leucine-rich-repeat (NLR), and cytochrome P450 (CYP450) genes, respectively, that were missed in previous annotations. Using mass spectrometry data from seven plant species, we detected a significantly higher number of translated genes in the new annotations than in the previous annotations. TGFam-Finder, along with the new gene models, can provide an optimized platform for comprehensive functional, comparative, and evolutionary studies in plants. ©2020 The Authors. New Phytologist ©2020 New Phytologist Trust.
Keywords: CYP450; FAR1; NLR; plant defense; plant genomics; structural gene annotation
Year: 2020 PMID: 32392385 PMCID: PMC7496378 DOI: 10.1111/nph.16645
Source DB: PubMed Journal: New Phytol ISSN: 0028-646X Impact factor: 10.151
Fig. 1 Annotation process of TGFam-Finder. An automated process for new identification of target‐gene families using TGFam-Finder is depicted. The diagram shows serial processes starting from six‐frame translation to generation of the final gene model. The gray block of the diagram shows the determination of target regions containing target domain(s) and their flanking sequences for further annotation. The blue and pink blocks indicate structural annotation using proteins and transcriptomes (blue), and the ab initio method (pink), respectively. Names of representative tools (Slater & Birney, 2005; Stanke et al., 2006; Kim et al., 2015; Ghosh & Chan, 2016) for structural annotation are given in the blue and pink blocks. Initial gene models are integrated from the structural annotation as depicted in the white block.
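The first step named in the caption, six‐frame translation, can be illustrated with a minimal Python sketch. It is not taken from TGFam-Finder itself: the function names are hypothetical, the standard genetic code is assumed, and any codon containing a non-ACGT base is translated as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

In a domain-targeted workflow of the kind the caption describes, the six peptide strings would then be searched for the target domain(s), and hits would be mapped back to genomic coordinates to define the target regions.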
Fig. 2 Comparison of annotation accuracy for gene models grouped by distinct trials from TGFam-Finder, GeMoMa and MAKER2. (a, b) Average sensitivity and specificity (a) with average positive predictive values (PPVs) and negative predictive values (NPVs) (b) of annotated genes from TGFam-Finder, GeMoMa and MAKER2 in Arabidopsis, rice and maize are depicted. The x‐ and y‐axes represent trial names and average of those evaluation values, respectively. (c) The number of newly identified genes (i.e. predicted genes absent in references; x‐axis) and the number of missed genes (i.e. reference genes omitted in predicted gene models; y‐axis) are depicted as dot plots. (d) The ratio of the number of newly identified genes in predictions to the number of omitted reference genes for each trial. (e) The x‐ and y‐axes indicate the number of predicted genes overlapping with references and the ratio of the number of overlapping predicted genes to the number of overlapping reference genes, respectively. The left (right) plot depicts genes sharing any (over 90%) coding sequence regions between references and gene models.
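The four evaluation values in panels (a) and (b) follow the standard confusion-matrix definitions. The sketch below computes them from true/false positive and negative counts. The level at which the counts are taken (nucleotide, exon or gene) is not stated in this record, so the function is agnostic to it, and the numbers in the example are invented for illustration only.

```python
def annotation_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, PPV and NPV from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # An empty denominator leaves the metric undefined.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of reference positives recovered
        "specificity": ratio(tn, tn + fp),  # share of reference negatives left unannotated
        "ppv": ratio(tp, tp + fp),          # share of predicted positives that are correct
        "npv": ratio(tn, tn + fn),          # share of predicted negatives that are correct
    }


# Illustrative counts only, not values from the study.
metrics = annotation_metrics(tp=90, fp=10, tn=880, fn=20)
```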
Fig. 3 Evaluation of annotation from TGFam-Finder, GeMoMa and MAKER2 considering families and species. (a, b) Sensitivity and negative predictive values (NPVs) of 33 gene models grouped by (a) families and (b) species from TGFam-Finder, GeMoMa and MAKER2 are depicted as line graphs. (c, d) The line graphs indicate the ratio of the number of newly identified genes (i.e. predicted genes absent in references) to the number of omitted genes (i.e. reference genes omitted in predicted gene models) for gene models grouped by (c) families and (d) species. (e, f) The dot plots represent the number of predicted genes overlapping with reference genes (x‐axis) and the ratio of the number of overlapping predicted genes to the number of overlapping reference genes (y‐axis) for gene models combined by families (e) and species (f).
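The ratio plotted in panels (c) and (d) can be sketched as follows. How "absent" and "omitted" were decided is not specified in this record; the sketch assumes that a gene counts as shared when its coordinates overlap a gene in the other set on the same chromosome. The tuple layout `(chromosome, start, end)` with inclusive coordinates and both function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chromosome, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def new_to_missed_ratio(predicted, reference):
    """Return (newly identified, missed, ratio of new to missed)."""
    # Newly identified: predicted genes overlapping no reference gene.
    new = [p for p in predicted if not any(overlaps(p, r) for r in reference)]
    # Missed: reference genes overlapping no predicted gene.
    missed = [r for r in reference if not any(overlaps(r, p) for p in predicted)]
    ratio = len(new) / len(missed) if missed else float("inf")
    return len(new), len(missed), ratio


# Toy coordinates for illustration only.
reference = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
predicted = [("chr1", 150, 250), ("chr1", 800, 900),
             ("chr1", 1000, 1100), ("chr2", 300, 400)]
n_new, n_missed, ratio = new_to_missed_ratio(predicted, reference)
```

A ratio above 1 means a trial adds more genes than it loses relative to the reference. The pairwise comparison is quadratic and is meant only to show the definition; a sorted-interval or interval-tree approach would be preferable for whole genomes.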
Fig. 4 Annotation run‐times of TGFam-Finder, GeMoMa and MAKER2. Line graphs indicate average annotation run‐times of FAR1, NLR, and CYP450 families in six plant genomes using TGFam-Finder (red), GeMoMa (green) and MAKER2 (navy). Red, green and navy shadings represent the maximum and minimum run‐times of TGFam-Finder, GeMoMa and MAKER2, respectively. The numeric values between the line graphs indicate the differences in run‐time between TGFam-Finder and the other tools. The gray bar graph represents the genome sizes of the six plant species.
Fig. 5 Re‐annotation of FAR1, NLR and CYP450 genes. The heat map indicates the number of existing target genes in representative loci of 50 plant genomes. Bar graphs show the number of newly annotated genes. Colors in the bars represent the number of newly annotated genes from protein or transcriptome evidence (orange) and from the ab initio model (navy blue).
Fig. 6 Proteomic validation of the previously and newly annotated genes. (a–c) Bar graphs represent the percentages of protein‐coding genes in previously (sky blue) and newly (yellow) annotated genes, validated using mass spectrometry in seven plant genomes. Stars on the bar graphs indicate significant differences in protein‐coding gene abundance between the previously and newly annotated genes (Fisher's exact test, P < 0.05).
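The significance test named in the caption can be illustrated with a small standard-library implementation of the two-sided Fisher's exact test on a 2x2 table (validated vs. not validated, previous vs. new annotation). This is a sketch of the general test, not the authors' analysis script; the function name is hypothetical and the counts in the example are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


# Illustrative counts only: rows are annotation sets, columns are
# (validated by mass spectrometry, not validated).
p = fisher_exact_two_sided(8, 2, 1, 5)
significant = p < 0.05
```

With many genes per category, a library routine such as `scipy.stats.fisher_exact` would be the more usual choice; the hand-written version is shown only to keep the example dependency-free.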