Literature DB >> 31245718

Maize GO Annotation-Methods, Evaluation, and Review (maize-GAMER).

Kokulapalan Wimalanathan1,2, Iddo Friedberg1,3, Carson M Andorf4,5, Carolyn J Lawrence-Dill1,2,6.   

Abstract

We created a new high-coverage, robust, and reproducible functional annotation of maize protein-coding genes based on Gene Ontology (GO) term assignments. Whereas the existing Phytozome and Gramene maize GO annotation sets only cover 41% and 56% of maize protein-coding genes, respectively, this study provides annotations for 100% of the genes. We also compared the quality of our newly derived annotations with the existing Gramene and Phytozome functional annotation sets by comparing all three to a manually annotated gold standard set of 1,619 genes where annotations were primarily inferred from direct assay or mutant phenotype. Evaluations based on the gold standard indicate that our new annotation set is measurably more accurate than those from Phytozome and Gramene. To derive this new high-coverage, high-confidence annotation set, we used sequence similarity and protein domain presence methods as well as mixed-method pipelines that were developed for the Critical Assessment of Function Annotation (CAFA) challenge. Our project to improve maize annotations is called maize-GAMER (GO Annotation Method, Evaluation, and Review), and the newly derived annotations are accessible via MaizeGDB (http://download.maizegdb.org/maize-GAMER) and CyVerse (B73 RefGen_v3 5b+ at doi.org/10.7946/P2S62P and B73 RefGen_v4 Zm00001d.2 at doi.org/10.7946/P2M925).

Entities:  

Keywords:  GO; functional annotation; gene ontology; genomics; maize

Year:  2018        PMID: 31245718      PMCID: PMC6508527          DOI: 10.1002/pld3.52

Source DB:  PubMed          Journal:  Plant Direct        ISSN: 2475-4455


INTRODUCTION

Maize is an agriculturally important crop species and model organism for genetics and genomics research (Lawrence, Dong, Polacco, Seigfried, & Brendel, 2004). Not only is maize historically important for genetics research, along with other model species, significant efforts have been made to transition existing datasets into a more sequence‐centric paradigm (Sen et al., 2009), thus enabling genomics approaches to be brought to bear on both basic research problems and applied breeding (Lawrence et al., 2008). In 2009, the maize genome's reference sequence was made available to the research community (Schnable et al., 2009). Since then, much work has gone into improving the utility of the genome sequence to scientists with a focus on sequence annotation. In practice, making a genome sequence useful involves three basic steps: assembling the genome sequence, assigning gene structures, and assigning functions to genes. The quality of data generated at each step influences downstream inferences, with high‐quality sequence, assembly, and gene structure assignments generally resulting in better functional annotations overall. Functional predictions serve as the basis for formulating hypotheses that are subsequently tested in the laboratory. As such, experimentalists have a great interest in high‐quality functional annotation sets that cover all or most of the genes in their species of interest. The Gene Ontology (GO) is a controlled vocabulary of semi‐hierarchically related terms that describe gene product function. It consists three categories: Biological Process (BP), Cellular Component (CC), and Molecular Function (MF). In the context of GO, functional annotation of a gene consists of the assignment of one or more GO terms from one or more of the GO categories to a given gene or gene model (here, we will refer to genes and gene models simply as “genes” for simplicity). For individual GO term associations to genes, Evidence Codes (ECs) are assigned to assert how the association of term to gene was made (Harris et al., 2004). GO Evidence Codes are aggregated into five general categories: Experimental, Computational Analysis, Curator Statement, Author Statement, and Automatically Assigned. See Table 1 and http://www.geneontology.org/page/guide-go-evidence-codes for a detailed explanation of the GO Evidence Codes.
Table 1

Evidence Codes used in gene ontology annotations

TypeEvidence codeDescription
ExperimentalEXPInferred from Experiment
IDAInferred from Direct Assay
IPIInferred from Physical Interaction
IMPInferred from Mutant Phenotype
IGIInferred from Genetic Interaction
IEPInferred from Expression Pattern
Computational AnalysisISSInferred from Sequence or structural Similarity
ISOInferred from Sequence Orthology
ISAInferred from Sequence Alignment
ISMInferred from Sequence Model
IGCInferred from Genomic Context
IBAInferred from Biological aspect of Ancestor
IBDInferred from Biological aspect of Descendant
IKRInferred from Key Residues
IRDInferred from Rapid Divergence
RCAInferred from Reviewed Computational Analysis
Author statementTASTraceable Author Statement
NASa Non‐traceable Author Statement
Curatorial StatementICInferred by Curator
NDa No biological Data available
Automatically AssignedIEAa Inferred from Electronic Annotation

Evidence Codes without either curation or biological data supporting them.

Evidence Codes used in gene ontology annotations Evidence Codes without either curation or biological data supporting them. The use of experimental ECs asserts that the assignment results from a physical characterization of the protein's function as described in a publication. Computational approaches are based on in silico analyses. One of the simplest and most commonly conducted computational approaches involves matching similar genes between an existing, well‐annotated genome and an unannotated genome. Once the matches are assigned, annotations are inferred to genes in the unannotated genome. Such assignments receive the ISS (Inferred from Sequence or Structural Similarity) EC. The ISS EC is also assigned if an uncharacterized sequence contains a characterized domain. In such instances, the presence of the domain itself can be used to predict function for the uncharacterized sequence. For Curator and Author Statements, included EC types are based on judgment by curators and scientists in their expert opinion. As such, they are considered to be reviewed annotation types, although these do include two ECs based on little data: NAS (Non‐traceable Author Statement) and ND (No biological Data available). The Automatically Assigned EC type contains only one EC: Inferred from Electronic Annotation (IEA). IEA is unique in that no reviewed analysis of the assignment is required. Put another way, no curatorial judgment is applied, making it the least supported EC of the group. Sequence‐based approaches to automated functional annotation generally fall into three basic categories: sequence similarity, domain‐based methods, and mixed methods. Sequence similarity‐based gene matching most often relies on BLAST (e.g., BLAST2GO) followed by limiting the number of accepted matches based on e‐value or a reciprocal best hit (RBH) strategy (Altschul, Gish, Miller, Myers, & Lipman, 1990; Conesa & Götz, 2008; Moreno‐Hagelsieb & Latimer, 2008). Domain‐based methods score sequences for the presence of well‐described protein domain such as those included in Pfam, PANTHER, and ProSite (Finn et al., 2017). InterProScan is a commonly used domain‐based GO annotation pipeline (Jones et al., 2014). Mixed methods combine sequence similarity, domain‐based approaches, and other evidence such as inferred orthology through phylogenetics to assign GO terms systematically (Clark & Radivojac, 2011; Falda et al., 2012; Koskinen, Törönen, Nokso‐Koivisto, & Holm, 2015). For more of the latest methods, see Jiang et al. (2016). For maize, two genome‐scale GO annotation sets exist for the B73 reference assembly and gene set (i.e., B73 RefGen_v3 and 5b+, respectively). These functional annotations are generated by and accessible from the Gramene (www.gramene.org (Tello‐Ruiz et al., 2016)) and Phytozome (phytozome.jgi.doe.gov (Goodstein et al., 2012)) projects and websites, respectively. Gramene annotations are based on the Ensembl annotation pipeline (http://ensemblgenomes.org/info/data/cross_references), which is a mixed‐method approach. The primary sources of the Ensembl annotations are from UniProtKB, community‐based annotations from MaizeGDB (Andorf et al., 2016), InterPro2GO, and projections from orthologs inferred from phylogenetic analyses. Phytozome has a two‐step process for GO annotation. First, Pfam domains are assigned to proteins. Second, GO annotations are determined based on the Pfam2GO mapping (Hunter et al., 2009). Given the wealth of functional descriptions derived from mutational analyses, many researchers rely on the available maize GO‐based functional annotations from large‐scale, high‐profile community resources such as Gramene and Phytozome for formulating experimental hypotheses and also as input datasets to transitively annotate predicted functions to newly sequenced grass species and crop genomes (e.g., Hirsch et al., 2016). However, if we compare the EC types for GO assignments between the model species Arabidopsis thaliana and the Gramene and Phytozome functional annotations of the maize reference line B73, it is clear that the evidence supporting GO term assignments for these maize datasets is comparatively lacking (see Figure 1). Both the Gramene and Phytozome maize annotations have few annotations beyond those Inferred from Electronic Annotation (IEA). This situation is not intuitive to researchers given that maize has a wealth of functional descriptions in the literature.
Figure 1

Numbers of annotations by EC category. Arabidopsis (TAIR10) shown in magenta and maize in green and orange for Gramene and Phytozome, respectively. Annotation counts on the y‐axis are shown in thousands. Each bar in the histogram is labeled with the actual count to show where counts are so small that no bar is visible

Numbers of annotations by EC category. Arabidopsis (TAIR10) shown in magenta and maize in green and orange for Gramene and Phytozome, respectively. Annotation counts on the y‐axis are shown in thousands. Each bar in the histogram is labeled with the actual count to show where counts are so small that no bar is visible Exacerbating this problem, transfers of predicted function often are based on sequence similarity alone with no restriction of input data to associations based on well‐documented EC types. Furthermore, although mixed‐method pipelines such as the Ensemble COMPARA pipeline used by Gramene and the Phytozome Pfam2GO (Goodstein et al., 2012; Herrero et al., 2016) mappings may seem reproducible in principle given that they are based on the use of specific systems and software, details including input files and parameters often are unavailable or incomplete, making it impossible for research groups outside the group that generated those annotation resources to reproduce the annotation sets. In addition, because many computational pipelines inherit functional annotations that were also purely computationally derived, a single errant annotation can be propagated to many genomes (Andorf, Dobbs, & Honavar, 2007), making it mistakenly appear that many genomes agree on the errant function. For these reasons, existing computational functional annotations of maize (and many other plant genomes) should be approached with skepticism. Given these issues with the maize functional annotation, we endeavored to create an improved annotation set. This task requires both application of robust and reproducible methods and a gold standard set of maize GO annotations to compare generated result sets to each other as well as to the Gramene and Phytozome maize functional annotations. One small dataset of well‐curated GO‐based functional annotations does exist for maize. It was initially created by curators at MaizeGDB for the purpose of enriching the MaizeCyc metabolic pathway database (Monaco et al., 2013) and expanded through manual literature curation. This dataset constitutes 1,621 genes and 2,002 GO terms. We annotated the maize B73 RefGen_v3 annotation set 5b+ using only experimentally based annotations by filtering out GO assignments with IEA, NAS, and ND ECs from the input data and assigned GO terms using multiple input datasets then compared the performance of sequence similarity, domain presence, and mixed methods based on how well the methods predicted function for genes included in the MaizeGDB gold standard dataset. For mixed methods, we used pipelines developed for the Critical Assessment of Functional Annotation (CAFA) challenge, a competition designed to evaluate the latest computational functional annotation methods and to promote improvement of methods for functional annotation (Jiang et al., 2016; Radivojac et al., 2013). Groups competing in the CAFA challenge create tools that are applied to a set of specified target sequences. GO assignments are subsequently evaluated based on accumulation of functional data in the literature for the target sequence set. Some CAFA tools use preprocessing steps combined with a number of different computational and statistical approaches to reduce the number of false‐positive and false‐negative annotations (Clark & Radivojac, 2011; Falda et al., 2012; Koskinen et al., 2015). Some mixed‐method pipelines performed better on average than other methods in the first iteration of the CAFA competition (Radivojac et al., 2013), indicating that the use of mixed‐method pipelines for large‐scale GO annotations could potentially improve the overall quality of the annotation sets. The project to evaluate and improve maize GO annotations is called GAMER: GO Annotation Method, Evaluation, and Review. We compared GAMER annotations to annotations based on sequence similarity, domain, and three CAFA mixed methods. Next, we combined GAMER outputs to generate an aggregate maize‐GAMER GO annotation set and compared it to the existing Phytozome and Gramene GO annotations based on the hF 1 score. The GAMER annotations had three major advantages compared to the Gramene and Phytozome annotations: (i) an increased number of maize genes annotated with GO terms; (ii) more than twice the number of annotations (GO terms assigned) for maize protein‐coding genes; and (iii) similar or better quality scores relative to existing annotations sets based on hF 1 score. The B73 RefGen_v3 5b+ maize‐GAMER functional annotation dataset described here is accessible via MaizeGDB (http://download.maizegdb.org/maize-GAMER) and CyVerse (doi.org/10.7946/P2S62P). Scripts used to generate the annotation are available via GitHub at https://github.com/Dill-PICL/maize-GAMER.

MATERIALS AND METHODS

Functional annotation of maize genes

Three sequence‐based approaches were used to annotate function to genes in the maize reference genome: sequence similarity, domain‐based, and mixed‐method pipelines (see Figure 2; also described in Sections 2.1.1, 2.1.2, and 2.1.3, respectively). The scripts (bash, R, and Python) used to generate the annotations for maize B73 RefGen_v3 are available at https://github.com/Dill-PICL/maize-GAMER. These scripts run free and open‐source tools on different inputs required for these tools to generate annotation datasets. Please refer to the reproducibility supplemental file for details on versions of software, version of input datasets used, and commands and parameters used to run these tools.
Figure 2

Overview of steps to produce the maize‐GAMER datasets. Three types of methods are used: sequence similarity (yellow), domain presence (blue), and CAFA mixed‐method pipelines (green). Within sequence similarity, two input datasets were subjected to reciprocal best hit against maize: TAIR10 (Arabidopsis) and UniProt (the ten most well‐annotated plant species). For domain presence, InterPro signatures were applied to maize using InterProScan (IPRS). From the CAFA mixed‐method pipelines, Argot2, FANN‐GO, and PANNZER were applied to maize. For each individual output, duplications and redundancies were removed, and then, the datasets were combined. A second round of duplication and redundancy removal was carried out to produce the maize‐GAMER Aggregate dataset

Overview of steps to produce the maize‐GAMER datasets. Three types of methods are used: sequence similarity (yellow), domain presence (blue), and CAFA mixed‐method pipelines (green). Within sequence similarity, two input datasets were subjected to reciprocal best hit against maize: TAIR10 (Arabidopsis) and UniProt (the ten most well‐annotated plant species). For domain presence, InterPro signatures were applied to maize using InterProScan (IPRS). From the CAFA mixed‐method pipelines, Argot2, FANN‐GO, and PANNZER were applied to maize. For each individual output, duplications and redundancies were removed, and then, the datasets were combined. A second round of duplication and redundancy removal was carried out to produce the maize‐GAMER Aggregate dataset The B73 genome and protein sequences for gene models included in the filtered gene set (FGS) were downloaded from Gramene Release 42 (Tello‐Ruiz et al., 2016). The downloaded protein FASTA file contained sequences for all FGS transcripts (e.g., the gene model X has transcript models X_T01, X_T02, and X_T03). For each gene model, only the longest translated protein sequence derived from the transcripts was analyzed. The gold standard annotations used for evaluations were obtained from MaizeGDB, and they encompass GO annotations for 1,619 gene models from RefGen_v3. The number of annotations for Cellular Component (CC), Molecular Function (MF), and Biological Process (BP) was 1,584, 88, and 323, respectively.

Sequence similarity

The sequence similarity‐based annotation method has three main steps: (i) calculation of sequence similarity, (ii) valid hit detection, and (iii) inheritance of high‐confidence GO annotations. BLASTP was used (Altschul et al., 1990) with default parameters to calculate sequence similarity between maize protein sequences and two other datasets: the “Arabidopsis” dataset from TAIR, The Arabidopsis Information Resource (Berardini et al., 2015), and the “Plant” dataset from UniProt (UniProt Consortium, 2015). Valid hits were detected using the RBH method from BLASTP results. GO terms with nonreviewed ECs (i.e., IEA, NAS, and ND described in the introduction) were removed from input datasets. All others were inherited between the RBH pairs of maize and the other plant. Arabidopsis has the largest number of reviewed (human curated) EC GO annotations among plant model organisms (see Table S1). A FASTA file of Arabidopsis protein sequences along with the cognate GO Annotation File (GAF) was downloaded from TAIR v.10 (Berardini et al., 2015). The TAIR protein file contained predicted protein sequences from all transcripts. This file was filtered to retain only the protein sequence derived from longest transcript for each gene. Retained protein sequences from TAIR were used to create the TAIR BLAST database, and maize protein sequences were used to create a maize BLAST database. Maize protein sequences were used to query the TAIR BLAST database. Likewise, TAIR sequences were used to query the maize BLAST database. Results from both searches were used to detect RBH pairs between Arabidopsis and maize. All nonreviewed EC GO annotations were removed, and remaining GO associations to Arabidopsis genes were inherited to maize genes for each RBH pair. This maize/Arabidopsis RBH ortholog dataset is called “maize‐TAIR GO annotations”. All reviewed EC GO annotations and protein sequences for all plants from the UniProt‐GOA database were downloaded using the QuickGO tool hosted at EBI (Binns et al., 2009). Protein sequences and reviewed EC GO annotations were downloaded separately. The UniProt plant GO annotation dataset containing 304,426 annotations from 75,537 unique protein sequences. The protein sequences downloaded spanned 292 taxa. Only ten species had more than 1,000 annotations (see Table S1). Annotations from the top 10 species (in terms of number of reviewed GO annotations) were retained for our analyses. The process to annotate maize genes using UniProt plant data was similar to that for Arabidopsis. Maize protein sequences were matched against protein sequences from each species separately using BLASTP. Putative orthologs were determined using RBH for each maize‐plant pair. Terms annotated to the other plant protein were inherited to the maize protein sequence for each putative ortholog pair. GO annotations inherited from each plant species were concatenated together. The derived dataset is called the “maize‐UniProt GO annotations”.

Domain presence

InterProScan5 (IPRS) version 5.16‐55.0 was used to create domain‐based GO annotation of maize protein‐coding genes (Jones et al., 2014). IPRS was used to annotate GO terms to maize genes to produce the “maize‐IPRS GO annotations”.

Mixed‐method pipelines

At the beginning of this project, the first iteration of the CAFA challenge (CAFA1; described in the Introduction) had been completed. The results from the challenge indicated that CAFA1 mixed‐method pipelines performed as well or better than standard methods (Radivojac et al., 2013). To determine their predictive power for functional annotation in plants, the top‐performing mixed‐method pipelines from CAFA1 were reviewed to identify a group that could be implemented based upon availability of code and sufficient documentation. Three tools were selected as follows: Argot2, FANN‐GO, and PANNZER (Clark & Radivojac, 2011; Falda et al., 2012; Koskinen et al., 2015).

Argot2

Argot2 has a batch processing tool that can annotate up to 5,000 preprocessed input sequences. There are two different preprocessing steps for Argot2: (i) querying the UniProt database for sequence similarity matches to the input sequences and (ii) querying the Pfam database for putative domains present in the input sequences. The maize sequences were split into multiple FASTA files containing a maximum of 5,000 sequences. The eight FASTA files resulting from the previous step were used to query the UniProt database using BLASTP for matches, and the output was saved. HMMER was used to search a local Pfam database for potential hits for all the input protein sequences (Finn, Clements, & Eddy, 2011). Preprocessing each input FASTA resulted in a pair of input files for Argot2: BLAST and HMMER files. Each pair of preprocessed files was compressed and submitted to Argot2 batch processing tool. Results from each pair of preprocessed data were downloaded and concatenated to create the “maize‐Argot2 GO annotations.”

FANN‐GO

The file containing maize protein sequences was imported into MATLAB using a built‐in function (MATLAB:2017). The MAIN function from FANN‐GO was used to preprocess and annotate maize protein sequences. FANN‐GO uses BLASTP to query FANN‐GO training sequence dataset (derived from UniProt) for potential matches for the input sequences and converts the results to input feature vectors. The FANN‐GO predictor built from the training dataset is then used to process the input feature vectors and calculate the probability that a particular protein is associated with a particular GO term. These probabilities are represented in a matrix where rows represent sequences and columns represent GO terms. The matrix was converted to a GAF (GO Annotation File Format) file to be used for subsequent evaluations. This dataset is referred to as the “maize‐FANN‐GO annotations”.

PANNZER

Maize protein sequences were preprocessed using BLASTP to query a local UniProt protein BLAST database, and the output was saved in XML format as required by PANNZER. PANNZER was run on the xml file output from the previous step, and the output was converted into a GAF file. This dataset from PANNZER is referred to as the “maize‐PANNZER GO annotations.”

Metrics used in maize‐GAMER

A number of metrics defined and described by the AIGO (Analysis and the Inter‐comparison of GO functional annotations) library were used to select high‐confidence annotations, clean, and evaluate the maize‐GAMER‐derived annotation sets (Defoin‐Platel et al., 2011). AIGO has defined two types of metrics: analysis metrics and comparison metrics (see Table 2).
Table 2

Metrics used to analyze and evaluate the annotation datasets

MetricDescriptionFormula
Analysis metricsa
Duplication (%)Proportion of duplicate annotations in a dataset #of Total Annotations#of Unique Annotations#of Total Annotations
Redundancy (%)Proportion of redundant terms in a dataset y=1M#of Ancestral terms annotated to Gene(y)#of total Annotations to Gene(y)×100M
Coverage (%)Proportion of Genes annotated in a dataset #of Genes withone GO annotation#of Genes
SpecificitySpecificity of the Annotation for a given gene x=1N#of Ancestral terms inferred for Annotation(x)#of total Annotations
Comparison metricsb
hPr Hierarchical Precision calculated by evaluating against the gold standard GO terms predicted in ASGO terms annotated in GSGO terms predicted in AS
hRc Hierarchical Recall calculated by evaluating against the gold standard GO terms predicted in ASGO terms annotated in GSGO terms annotated in GS
hF 1 Harmonic mean of hPr and hRc 2×hPr×hRchPr+hRc

N, Total # of Annotations, M, Total # of Genes, AS, Annotation Set, GS, Gold Standard.

Comparison Metrics used here have been described in detail in (Defoin‐Platel et al., 2011).

See Supplementary document for precise steps to calculate hPr and hRc.

Metrics used to analyze and evaluate the annotation datasets N, Total # of Annotations, M, Total # of Genes, AS, Annotation Set, GS, Gold Standard. Comparison Metrics used here have been described in detail in (Defoin‐Platel et al., 2011). See Supplementary document for precise steps to calculate hPr and hRc.

Analysis metrics

Analysis metrics defined by AIGO measure features of a given annotation set. Four AIGO analysis metrics were used for maize‐GAMER: Duplication, Redundancy, Coverage, and Specificity (see Table 2). With respect to calculating AIGO metrics, an annotation is defined as a single gene–GO term pair. Duplication is the proportion of annotations that are not unique to a given annotation set. Duplication is calculated for each annotation set as described in Table 2. The collection of more general GO terms that can be inherited from a specific GO term is called “ancestral”. Redundancy occurs when a GO term and one or more of its ancestral terms are annotated to the same gene in a dataset. Redundancy metric was calculated by obtaining the mean of the proportion of ancestral terms annotated for each gene in a particular dataset. Coverage is the proportion of genes that have at least one GO term assigned. Specificity is measured by counting the number of ancestral terms that exist for a given annotation, then averaging those counts across all annotations in the set. An ideally cleaned annotation dataset would have no Duplication, no Redundancy, high Coverage, and high Specificity.

Comparison metrics

Comparison metrics defined by AIGO measure how well a given set of annotations match with another set of annotations. The AIGO comparison metrics hierarchical Precision (hPr) and hierarchical Recall (hRc) were used to evaluate annotation sets against gold standard annotations from MaizeGDB (see Table 2 and Supplementary Materials). Different metrics have been defined for the evaluation of GO annotations against a gold standard (Clark & Radivojac, 2013; Defoin‐Platel et al., 2011; Jiang et al., 2016; Radivojac et al., 2013). AIGO provided a set of well‐described evaluation metrics which were adapted by maize‐GAMER and was utilized for large number of annotations produced by mixed‐method pipelines. Both hPr and hRc evaluations start with propagating the GO terms in the annotations to the root. hPr is the proportion of the GO terms (directly annotated and inferred by propagation) in an annotation set which is shared with the GO terms (directly annotated and inferred by propagation) in the gold standard. hRc is the proportion of the GO terms in the gold standard, which are found in the annotation set. hPr and hRc were calculated for the genes in the gold standard dataset and were calculated independently for each GO category. If a gene was annotated in the gold standard, but was not annotated in the annotations set then both hPr and hRc were set to 0. See Supplementary Materials for precise steps used to calculate hPr and hRc. Harmonic mean (hF 1) of hPr and hRc was calculated for each annotation set for each GO category to use a single number to compare different annotation methods.

Cleaning and combining component datasets

Score threshold selection for mixed methods

Mixed‐method pipelines used in the maize‐GAMER project provide a confidence score for each GO annotation. The confidence score ranges from 0.0 to 1.0, where a higher score indicates more confidence for a given annotation. A score threshold which maximizes hF 1 (hFmax) will select the optimal set of annotations which reduces the total number of false positives and false negatives (see Section 2.2 for description of the metrics). The range of annotation scores from mixed‐method pipelines did not span the whole 0.0–1.0 range, so the scores were normalized to fall between 0.0 and 1.0 independently for each annotation set. A set of thresholds (every 0.05 from 0.0 to 1.0; i.e., 0.00, 0.05, 0.1, 0.1,…., 0.95, 1.00) were selected. hF 1 score for each GO category, and each threshold was calculated by selecting the annotations with a normalized score that was ≥ to the threshold and evaluating against the gold standard annotations. hFmax for each GO category was determined by getting the highest hF 1 obtained from the previous step. The score thresholds which resulted in hFmax were used to select a subset of annotations from each mixed‐method pipeline (see Table S2). The maize‐Argot2, maize‐FANN‐GO, and maize‐PANNZER GO annotations described in subsequent sections refer to the subset of annotations selected via this selection step.

Removing redundancy and duplication

Duplication is the presence of two or more instances of the same gene–GO term pair in a single annotation set. Redundancy is the presence of an ancestral GO term in the annotations of a gene which also contains a specific annotation from which the ancestral GO term can be inferred by propagation. Component annotation sets from all methods described above were cleaned by removing redundancy and duplication for each annotation set across all three GO categories. Duplication was cleaned by replacing multiple instances of a gene–GO term pair with a single instance for a given annotation set. Duplicate annotations from all six raw annotation sets were removed, and files with nonduplicate annotations were created for each annotation set. Redundancy was cleaned by removing annotations containing GO terms that could be inferred from other terms based on the GO hierarchy, and only retaining the annotations with GO terms that cannot be inferred.

The maize‐GAMER Aggregate dataset

Clean (nonredundant and nonduplicated) annotation sets from all component methods were merged to generate the maize‐GAMER Aggregate annotation set. Redundancy and duplication introduced by concatenating multiple datasets were removed. A new genome assembly (B73 RefGen_v4) and annotation set (Zm00001.2) for maize inbred line B73 were recently released (Jiao et al., 2017). Because this dataset has not been available for long, only few published analyses are available and the research community is only now in the process of transitioning to general use of RefGen_v4 for large‐scale analyses. As such, analyses and results described here derive from the well‐annotated v3 assembly and annotation set. To extend outcomes of the work described here for future v4 efforts, maize‐GAMER Aggregate annotations have also been created for the maize B73 RefGen_v4, which can be accessed at MaizeGDB (http://download.maizegdb.org/maize-GAMER) and via CyVerse (doi.org/10.7946/P2M925).

Evaluation of GAMER‐derived annotation sets

Component and aggregate annotation sets were compared at two levels; a general comparison and a GO category‐specific comparison. Comparison metrics mentioned in Section 2.2 were calculated for the general comparison (see Table 2). All metrics were calculated independently for each annotation set and compared among component annotation sets as well as the aggregate annotation set. Coverage and the number of annotations were calculated directly for each annotation set. Specificity was calculated for each annotation, and the mean across all annotations is reported. The annotations from component annotation sets and the aggregate annotation set were divided into specific GO categories, and category‐specific annotations were evaluated separately. Three different metrics were used for GO category‐specific evaluations: coverage, number of annotations, and hF 1 (see Section 2.2 for more details). Coverage and the number of annotations were calculated individually for each GO category for each dataset. hF 1 score was calculated for each annotation set for each GO category.

Comparisons among the maize‐GAMER Aggregate, gramene, and phytozome annotation sets

The existing Gramene, Phytozome, and maize‐GAMER annotations were compared to each other. Redundancy and duplication were removed from the Gramene and Phytozome annotation sets before evaluations were performed. Evaluation and comparisons were identical to the analyses performed in the previous Section 2.4. General evaluations for the maize‐GAMER, Gramene, and Phytozome annotation sets were based on coverage, number of annotations, and specificity. These metrics were calculated as described in the previous Section 2.4. The Gramene, Phytozome, and maize‐GAMER annotation sets were also compared in a GO category‐specific manner to account for biases in performance among different categories (i.e., CC, BP, and MF). Comparisons were made based on coverage, number of annotations, and mean hF 1 score.

Case study of the gene nana plant1 (na1)

The gene na1 (GRMZM2G449033) had the most terms (7 GO terms) associated with it in the gold standard dataset and all the terms were from BP GO category. Annotations for na1 from the three maize annotation sets were obtained. The ancestral nodes were inferred from the leaf nodes for each annotation set, and a subgraph for the BP ontology was generated (see Figure 5). The nodes in the subgraph were compared to gold standard, and nodes shared between a given annotation set and the gold standard dataset were identified. Nodes exclusively found only in a given annotation set or the gold standard were also identified. Illustrations of the subgraphs without node labels were drawn to compare among the three different GO annotation sets.
Figure 5

Biological Process GO graph for maize na1. Leaf terms are toward the bottom, and root terms are toward the top. Terms covered only by the gold standard are shown in orange (labeled G), those in the dataset but absent from the gold standard are shown in blue (labeled D), and those that appear in both are shown in green (labeled DG). Leaf terms in each subgraph have an * next to them. Phytozome graph is shown at the top (a), Gramene graph is shown in the middle (b), and maize‐GAMER Aggregate graph is shown at the bottom (c)

RESULTS

Evaluation of maize‐GAMER‐derived component annotation sets

The maize‐GAMER‐derived component annotation sets (i.e., the TAIR, UniProt, IPRS, Argot2, FANN‐GO, and PANNZER) and the maize‐GAMER Aggregate annotation set were evaluated across GO categories as well as within each GO category using metrics described in Table 2.

General evaluation of maize‐GAMER component annotation sets

Initial evaluations and comparisons of datasets created by the maize‐GAMER pipeline were assessed based on coverage, number of annotations, and specificity among all clean component annotation sets as well as the aggregate annotation set (see Tables 2 and 3). The TAIR and UniProt annotation sets had the lowest coverage and number of annotations among all maize‐GAMER component annotation sets (Table 3). The Argot2, FANN‐GO, and PANNZER annotation sets had the highest number of annotations compared to other annotation sets, as well as higher coverage compared to other annotation sets. Notably, FANN‐GO had the highest coverage at 100% of genes, and Argot2 had annotations for more than 90% of the genes. The IPRS annotation set had a lower number of annotations compared to the CAFA mixed‐method pipelines, but covered more genes than sequence similarity methods. Although sequence similarity methods and IPRS covered a lower number of genes, they had higher specificity compared to mixed‐method pipelines in general. Of the three mixed‐method pipelines, only PANNZER had comparable specificity to the methods, but had lower coverage than both Argot2 and FANN‐GO. Both Argot2 and FANN‐GO had lower average specificity, but had higher coverage than other methods. The maize‐GAMER Aggregate annotation (made up of all component annotation sets) covered all maize genes with at least one annotation (as expected given that the FANN‐GO component annotation set also covers 100% of genes). In addition, the aggregate annotation set contains more than double the number of annotations that occur in any component annotation set. This indicates that different component methods assign different GO terms to genes. Therefore, combining annotations from different methods results in increased diversity of GO term assignments. Moreover, the aggregate annotation set has higher specificity than the mixed‐method pipelines which have higher coverage, but has lower specificity than all other component annotation sets.
Table 3

Results from the general evaluation of maize‐GAMER‐derived annotation sets

Method typeSequence similarityDomainCAFA mixed methodsUnion
Annotation setTAIR‐RBHPlant‐RBHIPRSArgot2FANN‐GOPANNZERAggregate
Raw annotations61,528106,053200,324450,013618,3121,091,123
Duplication (%)9.1165.8972.180.331.070.97
Redundancy (%)14.189.1710.9148.2559.8473.09
Clean annotations35,79132,08546,599224,827187,850219,984515,059
Coverage (%)23.7020.1048.4892.64100.0047.86100.00
Specificity11.5412.2010.458.545.3011.589.56
Results from the general evaluation of maize‐GAMER‐derived annotation sets Genes that are annotated with at least one GO term from each component annotation set were compared among the three different method types (i.e., sequence similarity, domain‐based, and mixed methods; see Figure 3a). This comparison revealed that less than a quarter of genes had been annotated by all three methods, but more than half were annotated by two different methods. The remainder were only annotated by mixed‐method pipelines. Sequence similarity and domain‐based methods resulted in annotations to genes that were also annotated by mixed‐method pipelines. The number of genes annotated by domain‐based methods and mixed‐method pipelines and not annotated by sequence similarity methods is higher than genes annotated by all three methods. In contrast, sequence similarity methods shared more genes with both other annotation sets than only with mixed‐method pipelines. Moreover, only mixed‐method pipelines annotate at least one GO term to all genes in the maize FGS.
Figure 3

GO assignment metrics for each method type. Sequence similarity in yellow, domain presence in blue, and mixed‐method pipeline in green. (a) Number of genes with at least one GO term annotated. (b) Number of GO terms with at least one gene annotated. (c) Percent coverage, number of annotations, and average 1 score for each annotation set across the three GO graphs (i.e., Cellular Component, Molecular Function, and Biological Process). Color codes as used in (a) and (b), with the aggregate dataset shown in orange

GO assignment metrics for each method type. Sequence similarity in yellow, domain presence in blue, and mixed‐method pipeline in green. (a) Number of genes with at least one GO term annotated. (b) Number of GO terms with at least one gene annotated. (c) Percent coverage, number of annotations, and average 1 score for each annotation set across the three GO graphs (i.e., Cellular Component, Molecular Function, and Biological Process). Color codes as used in (a) and (b), with the aggregate dataset shown in orange Although the mixed‐method pipelines annotated all genes, they did not capture all GO terms annotated to genes by the other methods. GO term assignments were compared to evaluate the diversity of the GO terms present in the three types of GO annotation methods (see Figure 3b). GO terms annotated directly and ancestral terms inferred from the direct terms annotated to genes were compared among the three GO annotation methods used in maize‐GAMER. The number of GO terms annotated by sequence similarity, domain‐based, and mixed‐method pipelines was 3,794, 8,145, and 14,225, respectively. The number of GO terms annotated by the mixed‐method pipelines is significantly higher than both other methods; however, there are a small number of GO terms that are only annotated by sequence similarity (721) and domain presence (16) methods. Only a small proportion (23.05%) of the total (15,028) GO terms are annotated by all three methods.

GO category‐specific evaluations of maize‐GAMER component annotation sets

CAFA1 indicated that annotations for some GO categories are easier to predict than others (Radivojac et al., 2013). This indicated that the GO category‐specific evaluations could provide a more accurate comparison between component methods. This would also allow unbiased comparison of tools which do not predict certain categories (e.g., FANN‐GO does not predict the CC category). Therefore, maize‐GAMER‐derived annotation sets were divided into specific GO categories (i.e., CC, BP, and MF) and each category was evaluated separately based on coverage, number of annotations, and hF 1. Mixed‐method pipelines had higher coverage across all three GO categories. Argot2 covered more than 80% genes across all three categories (see Figure 3c). FANN‐GO does not annotate GO terms for CC category, but had 100% coverage in BP category and covered about 50% genes in MF category. PANNZER had the lowest coverage compared to the other mixed‐method pipelines, covered only 30%–50% of genes across different categories, and had highest coverage in BP. Sequence similarity methods consistently had lowest coverage compared to other methods in BP and MF, but IPRS had the lowest coverage in CC. IPRS covered higher number of genes than sequence similarity methods in BP and MF, but had lower coverage than mixed‐method pipelines. When comparing IPRS coverage across three GO categories, the coverage was highest in MF. Aggregate annotation set covered slightly more genes than the component annotation sets with highest coverage in each category and covered more than 88% of the FGS genes in all categories. In the BP category, the aggregate annotation set annotated all genes from maize FGS with at least one annotation. Mixed‐method pipelines produce a higher number of annotations than other methods in all three GO categories. Moreover, the number of annotations from mixed‐method pipelines loosely correlates with coverage in different GO categories. The only exception was PANNZER, which annotated more GO terms per gene in BP category (data not shown), than any other component annotation set. The number of annotations from sequence similarity methods and IPRS was consistently lower than mixed‐method pipelines. The variation in the number of annotations was proportional to the number of genes annotated in sequence similarity and IPRS methods. The lowest number of annotations was seen in the CC category from IPRS, and sequence similarity methods in other GO categories. As the union of all component method annotations, the aggregate annotation set had a higher number of annotations in all three GO categories. The highest number of annotations for the aggregate annotation set was from the BP GO category, followed by CC and MF. We used hF 1 scores as a representation of the quality of annotations and annotation sets. As described in Section 2, the hF 1 was calculated individually for all genes in the gold standard dataset and then averaged across all genes within an annotation set. There are clear differences in hF 1 across different GO categories. The highest performance was seen in the MF category, and the lowest performance is seen in the BP category. This fits the observation from CAFA1 (Radivojac et al., 2013). Mixed‐method pipelines outperformed other methods in all three GO categories. PANNZER produced the highest hF 1 within the MF category, but Argot2 had the highest hF 1 scores in CC and BP. IPRS outperformed sequence similarity methods in both MF and BP categories, but was the lowest performing method in the CC category. Comparison between two sequence similarity methods indicated that maize‐UniProt method performs better than the maize‐TAIR method in MF and BP categories. On the other hand, maize‐TAIR method performs better than maize‐UniProt method in the CC category. Aggregating the component annotations from maize‐GAMER increased the performance in the CC category. In contrast, aggregating the component annotation sets did not increase the performance compared to the top‐performing tool in other categories.

Evaluation of existing maize GO annotation sets and comparison with the maize‐GAMER Aggregate annotation set

Two existing maize GO annotation sets, Gramene and Phytozome, were downloaded, evaluated, cleaned (i.e., redundancies and duplicates were removed), and compared with maize‐GAMER Aggregate annotations (referred to as the “maize‐GAMER annotation set”). The same metrics used for the evaluation of maize‐GAMER‐derived annotation sets were used for the comparison among the existing maize GO annotation sets and maize‐GAMER Aggregate annotation set.

General evaluation of public maize GO annotation sets

The maize‐GAMER Aggregate annotations covered all genes in the maize FGS with at least one GO term, but Gramene and Phytozome covered only about half the genes (see Table 4). Phytozome covered the fewest genes (less than half of the genes), and Gramene covered slightly more than half of the genes (see Table 4). The maize‐GAMER annotation set had more annotations than both Gramene and Phytozome. Gramene had twofold more annotations than Phytozome, and the maize‐GAMER had severalfold more annotations than Gramene. While the maize‐GAMER has higher coverage and a higher number of annotations, it has lower average specificity than Gramene and Phytozome. Gramene has the highest average specificity of all three annotation sets.
Table 4

Overall results from maize‐GAMER and other existing maize datasets

Analysis metricExistingmaize‐GAMER
GramenePhytozomeAggregate
Raw annotations111,20366,709
Duplication (%)0.0037.90
Redundancy (%)23.257.24
Clean annotations81,31536,987515,059
Coverage (%)55.5540.87100.00
Specificity10.9010.419.56
Overall results from maize‐GAMER and other existing maize datasets Genes with annotations from each set were compared to see the distribution of annotated genes among different annotations (see Figure 4a). Genes from Gramene and Phytozome annotations were a subset of the maize‐GAMER annotations. Less than half of the genes were annotated in all three sets, and slightly more than half of the genes were annotated in at least two sets. Comparison of Gramene and Phytozome annotations shows that most of the genes that were annotated were shared. Both Gramene and Phytozome had genes that were annotated in only one of the two (i.e., Gramene or Phytozome but not both; See Figure 4a).
Figure 4

GO assignment metrics for Gramene, Phytozome, and maize‐GAMER. Gramene in green, Phytozome in rust, and maize‐GAMER in tan. (a) Number of genes with at least one GO term annotated. (b) Number of GO terms with at least one gene annotated. (c) Percent coverage, number of annotations, and 1 score for each annotation dataset across the three GO graphs (i.e., Cellular Component, Molecular Function, and Biological Process)

GO assignment metrics for Gramene, Phytozome, and maize‐GAMER. Gramene in green, Phytozome in rust, and maize‐GAMER in tan. (a) Number of genes with at least one GO term annotated. (b) Number of GO terms with at least one gene annotated. (c) Percent coverage, number of annotations, and 1 score for each annotation dataset across the three GO graphs (i.e., Cellular Component, Molecular Function, and Biological Process) GO terms annotated directly to genes by different methods and ancestral GO terms propagated from these annotations were compared among the three annotation sets. The number of GO terms annotated in each set varied greatly. The least diverse set in terms of number of GO terms annotated was Phytozome, which was annotated with only 3,234 GO terms (approximately 7% of total GO terms). Gramene has annotated 7,215 GO terms (approximately 16%) and was more diverse than Phytozome, but had lower diversity than maize‐GAMER. maize‐GAMER had the highest diversity and contained 15,028 GO terms (approximately 33%). A small number of GO terms were used by all three annotation sets, and the majority of terms from Phytozome were shared across all three annotation sets. Only a single GO term was exclusive to the Phytozome annotations, and small number of terms was found to be exclusive to Gramene annotations. Approximately 50% of the GO terms from maize‐GAMER were unique. The maize‐GAMER Aggregate annotations shared a higher number of GO terms with Gramene than Phytozome.

GO category‐specific evaluations of maize‐GAMER and existing maize GO annotation sets

Annotations from the three maize GO annotation sets were analyzed in a GO category‐specific manner to identify differences in performance among the different categories (see Figure 4c). As was true for the component annotation sets, three different metrics were used for evaluation and comparison: coverage, number of annotations, and hF 1 score. Comparison of coverage across GO categories indicated that all annotation sets had lower coverage in CC category, compared to other categories. Both Gramene and Phytozome had lower coverage in BP than MF, but maize‐GAMER had higher coverage in BP than MF. Lowest coverage for all annotation sets and categories was seen in the CC category for the Phytozome annotation set, and the highest coverage was seen in the maize‐GAMER Aggregate annotation set in the BP category. Comparison among the three maize annotation sets indicates that the maize‐GAMER annotation set had the highest coverage in all three categories by a large margin. Coverage from maize‐GAMER was almost twofold that of Gramene, which had the next highest coverage in all GO categories. Gramene had higher coverage than Phytozome in all three categories. When the number of annotations was compared across different GO categories, the lowest number of annotations for Gramene and Phytozome annotation sets was seen in the CC category. In contrast, maize‐GAMER had the lowest number of annotations in the MF category. Moreover, both Gramene and Phytozome both had a higher number of annotations in the MF category, whereas maize‐GAMER had the highest number of annotations in the BP category. Comparison among the annotations sets illustrated that the maize‐GAMER annotation set has the highest number of annotations in all three categories. Phytozome had the fewest annotations in all three GO categories. Number of annotations loosely correlated with coverage in different GO categories for both Gramene and Phytozome. Furthermore, maize‐GAMER had the highest number of annotations in the BP category. The number of annotations from the maize‐GAMER annotations for the BP was severalfold higher than other annotation sets (approximately 9× that of Gramene and approximately 28× that of Phytozome). The hF 1 score reflects the overall quality of annotations produced by different pipelines used by the three maize annotation projects. Comparing performance of different pipelines across the three GO categories revealed a similar trend that was seen in the previous section. All pipelines had higher hF 1 scores in the MF category and had lower hF 1 scores in the BP category. The only pipeline that did not fit this trend was Phytozome, which had lowest performance in the CC category. maize‐GAMER had a higher hF 1 score than other pipelines in the CC category. maize‐GAMER also had higher performance than Phytozome in other categories, but performed slightly lower than Gramene in those categories. Gramene performed better than other pipelines in the MF and BP categories. Phytozome consistently had lower performance than other pipelines across all three GO categories. Phytozome's performance was especially low in the CC category, which was the lowest hF 1 score seen for any annotation set in any GO category. To visualize how GO annotation methods perform, comparatively, the distribution of metrics hPr, hRc, and hF 1 can be calculated across all genes included in the gold standard dataset. In Figure S1, lower coverage by sequence similarity and domain presence methods is illustrated by the high number of genes with a value of zero. In general, mixed‐method pipelines are shown to perform better than other methods we used. They cover more genes, and two of the three mixed‐method pipelines have higher hPr and hRc than all other methods. The FANN‐GO distribution is different from both other mixed methods: In general, it has lower performance than other methods. This can be attributed to that fact FANN‐GO annotations have lower specificity than other methods, and lower specificity results in lower values for hPr and hRc. In addition, the maize‐GAMER has fewer genes with a value of 0 than both Phytozome and Gramene.

Example annotations from nana plant1 (na1)

The gene nana plant1 (na1; GRMZM2G449033) has more annotations than any other gene in the gold standard dataset. A classical maize mutant with a dwarf phenotype (Hartwig et al., 2011), the na1 recessive mutant results from a loss‐of‐function mutation in the gene that affects the brassinosteroid (BR) biosynthetic pathway where BR is a plant hormone that is required for normal plant growth (Hartwig et al., 2011). In the gold standard dataset, na1 had seven biological process GO terms annotated. Annotations for na1 from different maize annotation sets were compared to the gold standard, and a subgraph for each annotation set and gold standard dataset was plotted (see Figure 5). Phytozome did not annotate any GO terms to na1 (see Figure 5a), but both Gramene (see Figure 5b) and maize‐GAMER (see Figure 5c) have annotated BP GO terms for na1. Gramene annotates three GO terms na1, while maize‐GAMER has annotated 13 GO terms to na1. Two GO terms from the gold standard are known to be related to na1 dwarf phenotype from previous studies, “brassinosteroid biosynthetic process” (GO:0016132) and “unidimensional cell growth” (GO:0009826). While both of these were annotated correctly by maize‐GAMER (see Figure 5c), only one of them was correctly annotated by Gramene (see Figure 5b). Comparison of overlapping nodes indicates that the maize‐GAMER Aggregate annotation set also contains a number of less specific nonleaf terms which overlap with nodes inferred from gold standard dataset. Overall, the maize‐GAMER has larger proportion of overlapping nodes with the gold standard than the Gramene for the BP GO category. Biological Process GO graph for maize na1. Leaf terms are toward the bottom, and root terms are toward the top. Terms covered only by the gold standard are shown in orange (labeled G), those in the dataset but absent from the gold standard are shown in blue (labeled D), and those that appear in both are shown in green (labeled DG). Leaf terms in each subgraph have an * next to them. Phytozome graph is shown at the top (a), Gramene graph is shown in the middle (b), and maize‐GAMER Aggregate graph is shown at the bottom (c) The different approaches taken by the pipelines from Gramene and maize‐GAMER result in different annotations for the example case study of the maize na1 gene. Gramene has a lower number of GO terms annotated to na1 than maize‐GAMER. The average specificity of GO terms annotated in the BP category for na1 (see Figure 5) is not significantly different between GAMER (mean = 12.154) and Gramene (mean = 12.667) pipelines (two‐sided two‐group Wilcoxon rank‐sum test; p = .89). This example from na1 indicates that the specificity of the annotations is not significantly different for specific instances, but is different when compared overall. We further compared the GAMER component and aggregate dataset creation methodologies annotate gene function to that of Gramene and Phytozome by comparing the hPr, hRc, and hF 1 metrics for the na1 gene (see Figure S2). Phytozome's method stands out because it has no annotations for na1; thus, all the metrics have a value of 0. The GAMER aggregate dataset has higher hF 1 score compared to Gramene and has higher hPr and hRc as well. Among the component datasets, PANNZER has the highest hF 1 score and highest hRc. InterProScan has the highest hPr, but its lower hRc reduces the hF 1. FANN‐GO and Argot2 have lower metrics due to lower specificity of annotations compared to other methods.

DISCUSSION

In keeping with our goal, through the maize‐GAMER project, we were able to improve the GO annotation dataset for maize and to document inputs, methods, and results at a level that enables both reproducibility and reuse of the pipeline for future genome versions. To determine how best to create an improved maize annotation dataset, we tried out multiple different methods and compared the resulting datasets based on a gold standard set of gene functions. This also enabled us to better understand differences in term assignments among the methods we used. Using the same gold standard, we also were able to compare resulting datasets to those produced by and available from Gramene and Phytozome. We used the hFmax metric to select high‐confidence annotations from mixed‐method pipelines and to evaluate annotation sets resulting from all methods under evaluation. We found that mixed‐method pipelines developed for the CAFA1 challenge outperformed RBH and domain presence methods for GO annotation (Radivojac et al., 2013). They covered more genes with annotations, produced higher number of annotations, and had higher hF 1 score than both sequence similarity and domain‐based methods. The higher performance from mixed‐method pipelines is the outcome of advanced statistical (Falda et al., 2012; Koskinen et al., 2015) and machine learning approaches (Clark & Radivojac, 2011) used to reduce the false‐positive and false‐negative annotations. Mixed‐method pipelines do have a limitation: They have higher coverage, but annotations are less specific in general when compared with datasets produced using other approaches. This could be due to the dearth of training dataset for the more specific GO terms, which is required for training machine learning methods. When we aggregated the predictions from RBH, domain‐based methods, and three tools from CAFA1, we produced the maize‐GAMER Aggregate dataset, which covers more gene space than the datasets produced by Gramene and Phytozome, and with similar or better accuracy. While the higher coverage could cause concern, evaluating the annotations using the gold standard has shown that the performance is similar or better than existing datasets. This also indicates that maize‐GAMER annotations are no less reliable than Gramene and are in fact are better than those for Phytozome. Removing less specific GO terms annotated by some methods in cases where there were more specific GO terms annotated to the same genes was important for aggregating different datasets. In certain cases, more than one annotation with lower specificity was replaced by a single annotation with higher specificity. As with any computational approach to annotate GO terms, the current maize‐GAMER dataset should be considered as an initial step in improving the GO annotations in maize. As future iterations of the CAFA competition evaluate new tools and methods for GO annotations, we anticipate that the quality of computational maize GO annotations could be iteratively improved in a reproducible manner by continuing to apply the newest, best performing methods. To enable better reproducibility, we have generated a supplementary document with exact parameters and commands used to generate the maize dataset. We are currently in the process of formalizing the code used to generate the maize GO annotation set into a reusable pipeline called GO‐MAP. Once completed, the GO‐MAP pipeline can be used for GO annotation of newly sequenced plant genomes as well as existing plant genomes. The pipeline will be made freely available and will utilize the same methods and datasets used for maize. The set of manually reviewed gene function annotations for maize that we call the gold standard is both incomplete and sparse. This situation does not reflect the amount of published literature describing gene function for maize. Instead, this situation is due to limited curation of gene function into GO terms. While tools exist at MaizeGDB that enable researchers to assign GO terms to genes directly, these tools remain poorly utilized. In an effort to improve community engagement and to upgrade the Evidence Codes for GO assignments, our next step for maize‐GAMER will be to develop and deploy a tool to enable experts in the maize community to review existing GO annotations. By enabling GO annotation review through expert crowdsourcing, term assignments produced by computational pipelines including GAMER can be upgraded from IEA (Inferred from Electronic Annotation) to RCA (reviewed computational analysis). In this way, we will enable the transfer of collective knowledge members of the maize community have generated over the years to produce higher‐quality functional annotation datasets for maize with clear extension of this practice for other species.

CONFLICT OF INTERESTS

None declared.

AUTHORS’ CONTRIBUTIONS

K.W. carried out all of the research and development efforts. K.W., C.M.A., I.F., and C.J.L.D. designed the experiments and wrote the manuscript. Click here for additional data file. Click here for additional data file.
  24 in total

1.  Drought-Responsive ZmFDL1/MYB94 Regulates Cuticle Biosynthesis and Cuticle-Dependent Leaf Permeability.

Authors:  Giulia Castorina; Frédéric Domergue; Matteo Chiara; Massimo Zilio; Martina Persico; Valentina Ricciardi; David Stephen Horner; Gabriella Consonni
Journal:  Plant Physiol       Date:  2020-07-14       Impact factor: 8.340

2.  Diagnostic genes and immune infiltration analysis of colorectal cancer determined by LASSO and SVM machine learning methods: a bioinformatics analysis.

Authors:  Yan-Rong Li; Ke Meng; Guang Yang; Bao-Hai Liu; Chu-Qiao Li; Jia-Yuan Zhang; Xiao-Mei Zhang
Journal:  J Gastrointest Oncol       Date:  2022-06

3.  The widespread nature of Pack-TYPE transposons reveals their importance for plant genome evolution.

Authors:  Jack S Gisby; Marco Catoni
Journal:  PLoS Genet       Date:  2022-02-24       Impact factor: 5.917

4.  easyMF: A Web Platform for Matrix Factorization-Based Gene Discovery from Large-scale Transcriptome Data.

Authors:  Wenlong Ma; Siyuan Chen; Yuhong Qi; Minggui Song; Jingjing Zhai; Ting Zhang; Shang Xie; Guifeng Wang; Chuang Ma
Journal:  Interdiscip Sci       Date:  2022-05-18       Impact factor: 3.492

5.  Evolutionary Variation in MADS Box Dimerization Affects Floral Development and Protein Abundance in Maize.

Authors:  María Jazmín Abraham-Juárez; Amanda Schrager-Lavelle; Jarrett Man; Clinton Whipple; Pubudu Handakumbura; Courtney Babbitt; Madelaine Bartlett
Journal:  Plant Cell       Date:  2020-09-01       Impact factor: 11.277

6.  Meta Gene Regulatory Networks in Maize Highlight Functionally Relevant Regulatory Interactions.

Authors:  Peng Zhou; Zhi Li; Erika Magnusson; Fabio Gomez Cano; Peter A Crisp; Jaclyn M Noshay; Erich Grotewold; Candice N Hirsch; Steven P Briggs; Nathan M Springer
Journal:  Plant Cell       Date:  2020-03-17       Impact factor: 11.277

7.  Contrasting transcriptional responses to Fusarium virguliforme colonization in symptomatic and asymptomatic hosts.

Authors:  Amy Baetsen-Young; Huan Chen; Shin-Han Shiu; Brad Day
Journal:  Plant Cell       Date:  2021-04-17       Impact factor: 11.277

8.  PANNZER-A practical tool for protein function prediction.

Authors:  Petri Törönen; Liisa Holm
Journal:  Protein Sci       Date:  2021-10-14       Impact factor: 6.725

9.  Single-parent expression complementation contributes to phenotypic heterosis in maize hybrids.

Authors:  Jutta A Baldauf; Meiling Liu; Lucia Vedder; Peng Yu; Hans-Peter Piepho; Heiko Schoof; Dan Nettleton; Frank Hochholdinger
Journal:  Plant Physiol       Date:  2022-06-27       Impact factor: 8.005

10.  Systematic analysis of differentially expressed ZmMYB genes related to drought stress in maize.

Authors:  Peng-Yu Zhang; Xiao Qiu; Jia-Xu Fu; Guo-Rui Wang; Li Wei; Tong-Chao Wang
Journal:  Physiol Mol Biol Plants       Date:  2021-05-29
View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.