A quick guide for student-driven community genome annotation.

Prashant S Hosmani¹, Teresa Shippy², Sherry Miller², Joshua B Benoit³, Monica Munoz-Torres⁴,⁵, Mirella Flores-Gonzalez¹, Lukas A Mueller¹, Helen Wiersma-Koch⁶, Tom D'Elia⁶, Susan J Brown², Surya Saha¹

Abstract

High quality gene models are necessary to expand the molecular and genetic tools available for a target organism, but these are available for only a handful of model organisms that have undergone extensive curation and experimental validation over the course of many years. The majority of gene models present in biological databases today have been identified in draft genome assemblies using automated annotation pipelines that are frequently based on orthologs from distantly related model organisms and usually have minor or major errors. Manual curation is time consuming and often requires substantial expertise, but is instrumental in improving gene model structure and identification. Manual annotation may seem to be a daunting and cost-prohibitive task for small research communities but involving undergraduates in community genome annotation consortiums can be mutually beneficial for both education and improved genomic resources. We outline a workflow for efficient manual annotation driven by a team of primarily undergraduate annotators. This model can be scaled to large teams and includes quality control processes through incremental evaluation. Moreover, it gives students an opportunity to increase their understanding of genome biology and to participate in scientific research in collaboration with peers and senior researchers at multiple institutions.

Year: 2019 PMID: 30943207 PMCID: PMC6447164 DOI: 10.1371/journal.pcbi.1006682

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

This guide describes the workflow for a community genome annotation project that connects undergraduate students with bioinformaticians, faculty and peer mentors to foster educational development and produce quality student-driven annotation. In this guide, annotation or curation is defined as the manual improvement of computationally predicted gene structure and function associated with a genome. When annotation is integrated into undergraduate education, students get an authentic research experience and the opportunity to contribute to scientific discovery. Instructors can scale annotation projects according to the number of students and learning outcomes. The low cost of adding web-based annotation exercises to existing genome-based courses makes it an attractive option, especially for smaller institutions with limited budgets. There is substantial evidence that undergraduate research experiences that include gene annotation not only promote better understanding of genetics but also produce annotations that greatly improve existing resources [1-5]. Successful examples include large-scale, course-based programs that feature annotation as a key component [3,6]. There are many benefits for students who participate in these programs, including the development of a better understanding of genomics [3,7], retention and persistence in the sciences [6,8], and inclusion in peer-reviewed publications [9]. Bioinformatics skills required for detailed analysis of gene families and pathways align with curriculum guidelines established for biotechnology-based training [10,11]. The manual curation of gene families promotes a deeper understanding of underlying biology, a critical learning experience for undergraduate students. This provides a means to reinforce important analytical practices and standards of rigor that complement textbook learning to enhance the student’s overall education.
Annotation also introduces undergraduate students to the challenges of conducting research in non-model systems. Students are routinely presented with problems that are unique to the organism, gene family or pathway they are annotating. These obstacles foster the development of critical thinking, problem solving skills and resilience in a technology-based atmosphere. Students also learn to adapt to the complexities of collaborative research, a challenge that requires them to hone their teamwork and communication skills. Opportunities also exist to give students a positive research experience with a sense of ownership through presentations at scientific meetings and contribution to publications. We describe a workflow to establish curation resources, train undergraduate students, curate gene families, perform quality control and finally publish the results. The success of this community annotation model is based on a roadmap that includes building a collaborative ecosystem (see 1.1–3), recruiting new students (see 2.1) and providing them with initial training followed by continued support (see 2.3 and 3.1). The student-generated gene annotations are then reviewed before changes are committed to the official gene set. Following completion of the annotation and quality review, the student-driven annotation data is compiled for publication (see 3.3). We utilized these methods to establish the first official gene set for the insect vector of citrus greening disease [9]. S1 Table contains a list of infrastructure and resources required for manual gene annotation projects of varying scope. When successfully implemented, the student-driven annotation community model outlined here provides significant undergraduate-based educational opportunities that will yield a well-trained student population and also provide the scientific research community with quality curated data sets.

1. Build a collaborative ecosystem

1.1. Build an ecosystem that provides supporting resources and an integrated toolkit

Manual curation is a time-intensive process, but its efficiency can be improved by providing the annotators with a solid foundation of supporting resources. Most importantly, a high-quality, error-corrected and near-complete genome assembly will enable the annotators to identify and correct inaccurate gene models generated by automated prediction software. The completeness of the genome assembly can be evaluated using BUSCO, a popular metric based on the expected presence of a phylogenetically appropriate set of single copy genes [12]. Validating the quality of the assembly, on the other hand, is more difficult. Alignment of paired-end RNAseq or DNAseq reads to the genome can be used to evaluate assembly errors. In general, the concordant mapping rate is an indicator of assembly quality and can be used to detect misassembled regions. The goal of manual curation is to utilize different sources of evidence to produce the most accurate gene model possible. Automated gene prediction software, such as the MAKER annotation pipeline [13], provides a good starting point by producing consensus gene models that can be refined during manual curation. MAKER uses raw data from a variety of sources, including output from multiple ab-initio gene predictors, RNAseq reads and homologous proteins aligned to the genome. These evidence sources can also be used in the manual annotation process. Table 1 describes the utility of these and other resources. These resources can be prioritized based on funding and the goals of the annotation project.
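
As a rough illustration of the mapping-based check described above, the following Python sketch computes a concordant mapping rate per scaffold from read-pair counts and lists the scaffolds that fall below a cutoff. The scaffold names, the counts and the 0.8 cutoff are hypothetical values chosen for the example; in practice the counts would come from the summary of whichever read aligner the project uses, and a suitable cutoff depends on the data.

```python
def concordant_rate(concordant_pairs: int, total_pairs: int) -> float:
    """Fraction of read pairs that align concordantly to the assembly."""
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    if not 0 <= concordant_pairs <= total_pairs:
        raise ValueError("concordant_pairs must be between 0 and total_pairs")
    return concordant_pairs / total_pairs


def flag_low_concordance(per_scaffold: dict, threshold: float = 0.8) -> list:
    """Return sorted names of scaffolds whose concordant rate is below threshold.

    per_scaffold maps a scaffold name to (concordant_pairs, total_pairs).
    Scaffolds with no mapped pairs are skipped rather than flagged.
    The default threshold is an illustrative assumption, not a standard.
    """
    flagged = []
    for name, (concordant, total) in per_scaffold.items():
        if total > 0 and concordant_rate(concordant, total) < threshold:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical counts for three scaffolds.
counts = {
    "scaffold_1": (9500, 10000),
    "scaffold_2": (6200, 10000),
    "scaffold_3": (880, 1000),
}
print(flag_low_concordance(counts, threshold=0.8))
```

Scaffolds flagged this way are candidates for closer inspection in the annotation editor; a low rate alone does not prove a misassembly.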

Table 1

Use of evidence sets and other resources for manual curation.

Type of data	Application
DNAseq	Aligned reads can help to evaluate integrity of the assembly and correct SNPs and insertion or deletion errors
Consensus gene predictions (e.g. MAKER [13], Prokka [14])	Primary source of gene models for manual curation
Models from ab-initio gene prediction tools (e.g. Augustus [15], SNAP [16])	Alternative sources of gene models that are more comprehensive but may contain false positives
RNAseq	Illumina short reads aligned to the genome can act as raw data for curation. They provide evidence for splicing and exon structure. RNAseq data from different tissues, organs, life stages or conditions is helpful to discern alternative transcripts.
Transcriptome assemblies (e.g. Trinity [17] or StringTie [18])	These provide a condensed representation of the aligned RNAseq reads and assist in discovery of multiple isoforms. De novo assembled transcriptomes are a critical secondary resource to search for genes missing from the genome.
Homologous proteins	Well-annotated proteins from related species offer additional source evidence for validating the structure of genes. This is helpful in case of insufficient RNAseq coverage or lowly expressed genes. Moreover, these can provide functional descriptions for the gene.
Full-length cDNA sequences	PacBio or Nanopore sequencing of full-length transcripts is very useful for clearly deciphering multiple isoforms of a gene, eliminating the ambiguity from partial transcripts assembled from short reads.
Proteomics data	Peptides identified by mass spectrometry from different tissues of the organism can provide evidence of translation of genes predicted by ab initio gene predictors.

The cost of generating each data type varies according to the genome size and other factors. Resources that can be used in multiple steps are the most cost-effective. DNAseq data used for generating the genome assembly can be reused for evaluating the assembled contigs. Paired-end RNAseq data from one or two Illumina HiSeq runs that include multiple biological conditions and life stages (e.g. male, female, juvenile and adult samples) can provide sufficient coverage of the gene space to be used both in an independent transcriptome assembly and for genome assembly quality checks. The RNAseq reads can also be mapped to the genome to provide evidence for expression and splicing. PacBio Iso-Seq data is relatively expensive to produce compared to Illumina sequencing, but provides the most reliable evidence of gene structure since most reads correspond to full-length transcripts.

1.2. Identify curation targets according to project goals

Annotating all the genes in any genome assembly is a daunting task, so genes should be prioritized according to the aims of the project and available resources. Annotation can be targeted to major pathways of functional interest, or gene families can be selected according to the expertise of annotators. We found it helpful for team leaders to compile an initial list of pathways or gene families to be annotated, from which students could choose genes of interest. If at all possible, members of the relevant research community should be consulted to identify the most critical targets for annotation. Not only can their interests inform the selection of genes to be annotated, but interaction with researchers who are utilizing the gene annotations helps student annotators see the significance of their work. Once particular pathways have been selected for annotation, a metabolic pathway database can be used to classify gene families by pathway, providing a useful resource for the curation effort. For example, DiaphorinaCyc [19] contains pathways for Diaphorina citri, the insect vector of citrus greening disease, and is used by curators for discovering genes involved in a particular pathway [20-24]. In cases where resources are available, an alternative annotation strategy is to annotate genes by walking through each scaffold [3,7,25], but this requires many annotators working simultaneously. This method also complicates background research, since investigating genes whose only connection is their position on a chromosome is much less efficient than researching gene families or metabolic pathways. Moreover, the results are not conducive to subsequent analyses, as the genes annotated likely have very little functional relatedness.

1.3. Tools for collaboration

Manual curation tends to involve teams from multiple institutions located in different parts of the country or even the world. This creates a need for robust and user-friendly platforms for collaboration and frequent meetings. Guidelines for establishing effective bioinformatics communities strongly emphasize the importance of communication and openness [26]. Various platforms are available for working in a collaborative virtual environment and should be selected based on features, previous user experience and cost. Apollo [27] is one of the most commonly used web-based manual curation tools and is open-source. There are offline gene curation tools like Artemis [28] that are also popular, but may be limited by lack of functions required for collaboration. Selecting an annotation tool with an active user community is useful for getting support from others who have faced similar issues in the past. A majority of the communication needs can be met by using a combination of free online tools provided by Google, file sharing services like Dropbox, and a Wiki site. However, coordinating a large team of annotators located in different organizations may require project management solutions such as Basecamp, Atlassian Confluence or Asana that offer a common interface for managing multiple projects, user access, file sharing and forums. These may be available for free to educational institutions in some cases. Video conference platforms such as Google Hangouts, Skype and Zoom can be utilized for meetings.

2. Train annotators and formalize curation practices

2.1. Recruiting annotators—Harnessing the crowd

Undergraduate annotators can be recruited by offering annotation as part of their coursework. An effective strategy is to use annotation as part of a capstone or senior research experience, which gives students enough time to learn annotation skills along with an incentive to complete their analysis and provide a thorough report of their gene family or pathway. A sustainable program introduces annotation over the period of a semester and then utilizes a second semester or summer for completion of gene annotation, comparative analysis and writing of final reports. This expanded time frame helps to create experienced annotators who can mentor the next cohort of students. Additional incentives for undergraduate students include the potential to present their work at scientific meetings and to contribute to peer-reviewed publications. These motivating factors help recruit responsible students who are interested in research as they begin to consider their post-undergraduate career and education options. Depending on funding, paid internships for students to participate in annotation can also assist with recruitment. Overall, providing a research project focused on learning coupled with opportunities to publish builds a sense of ownership, reduces student turnover and ensures high quality annotation by undergraduate students.

2.2. Build teams according to expertise and annotation targets

Building teams of annotators has proven to be successful in our experience. If the annotation group is just starting, it is important to have a mix of undergraduate and graduate students or postdocs where the senior personnel help in defining annotation goals. At a minimum, undergraduate student participants should have completed an introductory course in genetics with an emphasis on molecular biology. Prior sequence analysis or familiarity with genome sequencing is not required and should not be a prerequisite for participation. Depending on the level of instructional support available, completion of more advanced coursework in bioinformatics may be helpful in reducing the learning curve. However, regardless of the students’ educational background, the annotation process provides an effective means for students to expand their understanding of molecular and genome biology. Teaming group leaders, experienced annotators and new annotators creates an environment that fosters learning and produces quality annotation. If sufficient numbers of students are available, it might be worthwhile to implement gamification and set up competing teams that can motivate one another to curate more gene models. This has been applied in the CACAO community functional curation project (https://gowiki.tamu.edu/wiki/index.php/Category:CACAO).

2.3. Train annotators and start curation

Student-driven annotation requires initial introduction to the annotation process and continued educational instruction within a framework of peer support. In-person workshops and webinars are effective for initial training and providing students with an overview of the annotation protocols. Continued instruction is necessary to explain the detailed biology related to gene structure and interpretation of the evidence required to make more complex decisions for structural annotation. Having experienced annotators develop tutorial resources (such as PowerPoint presentations or workbook style protocols) is also very helpful for new annotators. S2 Table contains a list of free online training resources and guidelines for genome annotation. A peer mentoring network that connects new and experienced annotators is a valuable means to instruct new students and encourage teamwork. Defining regular meeting times and weekly objectives further increases the benefits of peer mentoring. Ultimately, the process of annotation is best instructed using active learning strategies. Live annotation by students in weekly meetings with group leaders, or through video conferences, is useful in getting students over hurdles they encounter during annotation. Solving these issues in a live group setting makes students more comfortable with the annotation process and provides the foundation of knowledge required to solve problems on their own.

2.4. Establish the protocols for curation

The workflow of the annotation project greatly depends on the available resources. After meeting minimum requirements of data and tools as described in the previous sections, stipulating detailed procedures and minimum standards of annotation will guide the novice curators. Annotation procedures followed by expert annotators vary based on personal preferences and also change according to the gene family under consideration. In our recent publication [9], we defined a project-specific annotation workflow that has been generalized in Fig 1 for a broader audience. Despite the potential differences in protocols followed by each annotator, we recommend that minimum evidence, such as RNAseq and ortholog support, be required for all manually curated genes. An evaluation process should be established (see 3.1) to ensure that these criteria are met.
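
A minimum-evidence rule of this kind is easy to check mechanically before models go to review. The Python sketch below is one possible way to do so; the `CuratedModel` record, its field names and the gene identifiers are hypothetical and would need to be adapted to however a project actually records evidence for each curated gene.

```python
from dataclasses import dataclass


@dataclass
class CuratedModel:
    """Hypothetical record of the evidence noted for one curated gene model."""
    gene_id: str
    has_rnaseq_support: bool
    has_ortholog_support: bool


def missing_evidence(model: CuratedModel) -> list:
    """List the minimum evidence types that this model lacks."""
    missing = []
    if not model.has_rnaseq_support:
        missing.append("RNAseq")
    if not model.has_ortholog_support:
        missing.append("ortholog")
    return missing


def review_queue(models: list) -> dict:
    """Map gene IDs to their missing evidence, for models that fall short."""
    queue = {}
    for model in models:
        missing = missing_evidence(model)
        if missing:
            queue[model.gene_id] = missing
    return queue


# Hypothetical curated models.
models = [
    CuratedModel("gene_001", True, True),
    CuratedModel("gene_002", True, False),
    CuratedModel("gene_003", False, False),
]
print(review_queue(models))
```

Models that appear in the queue are not necessarily wrong; they are the ones a reviewer would look at first, or that might be labelled putative or partial until more evidence is available.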

Fig 1

Annotation workflow describing various steps in manual curation of protein-coding genes.

Annotation workflows can be broadly grouped into three sections: (i) obtaining orthologs from closely related species, (ii) curating gene models in an annotation editor like Apollo and (iii) reporting the structural and functional annotation in the form of gene family or pathway reports, culminating in an official gene set for the organism. Obtaining well-curated orthologs from model species will aid in structural curation. It is helpful to provide annotators with a list of closely related organisms with good quality genomic resources, from which they can collect orthologs. At this stage, a thorough literature review is recommended to gain a better understanding of the gene family or pathway and gather information relevant to the specific genes being annotated. In particular, reports of changes in gene copy number or domain organization during evolution should be noted. Student annotators should be instructed on how to keep a detailed record of their work in a lab notebook. This record can be as simple as log entries kept in a word processing document that features a cloud-based backup. Indeed, all results should be saved using a cloud-based service to prevent loss. Updates to the lab notebook should be monitored regularly, if possible, and completion should be encouraged for continuing annotation. This documentation will be essential for writing gene reports later (see 3.3). Orthologs from related species can be used to identify candidate gene models using the BLAT [29] sequence search tool in the Apollo genome annotation editor or by BLAST [30] searches against organism-specific databases. Reciprocal BLAST searches should be used to verify orthology. Gene models can then be refined using available evidence tracks. Table 1 summarizes use of different evidence tracks for structural curation of gene models. The accuracy of automatically predicted and manually annotated gene models depends on the quality of the genome assembly. If the genome is highly fragmented, a de novo transcriptome should be used to independently validate the gene models for both structure and presence or absence. The annotators should be aware that transcriptome assembly from short read data may also produce spurious and partial transcripts.
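
The reciprocal BLAST check mentioned above can be sketched in a few lines of Python. The example assumes that both searches were saved in BLAST's 12-column tabular format (query ID in the first column, subject ID in the second, bit score in the twelfth) and treats the highest bit score as the best hit; the sequence names and scores are made up for illustration, and real projects may add E-value or coverage filters.

```python
def best_hits(tabular_lines):
    """Return {query: best subject} using the highest bit score per query."""
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward_lines, reverse_lines):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(forward_lines)
    reverse = best_hits(reverse_lines)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


def make_row(query, subject, bitscore):
    """Build one hypothetical 12-column tabular line for the example."""
    return "\t".join([query, subject, "80.0", "300", "60", "0",
                      "1", "300", "1", "300", "1e-100", str(bitscore)])


# Orthologs searched against curated models, then the reverse search.
forward = [make_row("orthoA", "model1", 520),
           make_row("orthoA", "model2", 300),
           make_row("orthoB", "model2", 330)]
reverse = [make_row("model1", "orthoA", 520),
           make_row("model2", "orthoC", 400),
           make_row("model2", "orthoB", 330)]
print(reciprocal_best_hits(forward, reverse))
```

In this made-up data only orthoA and model1 choose each other, so model2 would need further evidence before being called an ortholog of orthoB.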

3. Quality evaluation and publication

3.1. Iterative evaluation through peer and expert review to improve annotation

We propose a curation procedure with multiple rounds of error-checking and evaluation since the annotations are primarily performed by annotators in training. Curation of gene models that are correctly predicted by automated gene callers and have well-curated orthologs is not prone to errors. However, lack of consensus from multiple sources of evidence and misassemblies in the genome can complicate the structural and functional annotation of a gene model. It is advantageous to have students present updates, even if they are minor, at regular coordination meetings, so that any challenges are identified early and the students can make steady progress. Refining the annotated gene models in consultation with peers and, if available, senior scientists is a useful learning experience for the students. Similar review processes can also be implemented for gene family reports, where undergraduate students evaluate each other’s reports before they are presented to the senior scientists. In some cases, there may not be sufficient evidence for even expert annotators to make an informed decision about a gene model. We advise that these models be deemed putative or partial and detailed documentation kept, so that they can be resolved once new evidence or an improved genome assembly is available. The process of manual curation has been divided into specific tasks (Table 2) for each step of annotation according to Fig 1. This list can be used by instructors for grading the students through the course. We have also included an example concept inventory test (S1 File) to evaluate students after the annotation course.

Table 2

Assessment plan for students with description of student objectives and related assessments to measure student annotation progress and quality.

Workflow step	Objectives	Assessment types and descriptions
Finding Orthologs	(1) Obtain orthologs: Collect orthologous sequences from appropriate organisms. (2) Identification of conserved domains: Use online tools and databases to identify conserved domains. (3) Structural assessment: Evaluate the structural organization of the gene of interest and copy numbers in closely related organisms.	Electronic lab book documentation of notes and work describing: (1) Names, organisms and accession numbers of orthologous sequences. Include database where orthologous sequences were collected. (2) Names of conserved domains, size and organization within protein. Record bioinformatics tools and database used to analyze domains. (3) Structural organization of the gene and copy number in closely related organisms. (1-3) Prepare a short report (PowerPoint or written) of the gene family/pathway, share with lab group or peers in class. Reports should include: literature review and determination of gene family/pathway function, copy number of genes, conservation in related organisms, estimation of number of each gene expected to be in the family/pathway.
Apollo	(1) Genome Search via BLAST or BLAT: Perform and interpret BLAT results from Apollo or BLAST results of databases. (2) Structural Curation: Complete manual structural annotation using evidence tracks in Apollo. (3) Functional Curation: Propose the functional classification of the annotated gene and show support from comparative analysis with homologous sequences, identification of conserved domains, phylogenetic analyses or other lines of evidence.	Electronic lab book documentation of notes and work describing: (1) Details of BLAT or BLAST results, including: similarity or identity scores, E values, query coverage and genome coordinates of matching sequences. (2) Record status of predicted models and evidence tracks for gene to be annotated. Record changes made to predicted model. Evaluate structural annotation by comparison of final sequence to orthologs and data collected on conserved domains to determine the completeness of the annotation. (3) Document comparative analysis to homologous proteins that supports the functional characterization. Record organisms, accession numbers and sequence similarity. Provide results of analyses using BLAST, multiple sequence alignments or phylogenetic analysis. (1-3) Iterative annotation with review: Examine accuracy of annotations through peer review and presentation of short reports (PowerPoint or written) to faculty and scientist mentors.
Reporting	(1) Gene/pathway report: Compose a written report to justify accuracy of final annotation and to detail results of the completed annotation in an evolutionary and genomic context. (2) Official gene set: Report data to lead scientists for official gene set. (3) Publication: Assemble reports and summaries of annotation and comparative analysis for peer-reviewed publication.	(1) Written report, poster presentations or oral presentations (class or professional meetings) that include: overview of gene family/pathway; description of the annotated genes, processes used, support and evidence collected; gene copy tables for each gene in family/pathway; pairwise comparisons of genes in other organisms; phylogenetic trees of genes with sequence/copy number different from those in orthologs; analysis of biological significance of genes in family/pathway based on evidence from related organisms. (2) Contribute information required for establishing the official gene set. (3) Contribute reports and information required for preparing peer-reviewed publications.

Objectives are outlined to ensure students follow the workflow in Fig 1. Students should be able to perform the activities at each step before starting the next phase of the workflow. Objective numbers correspond to the appropriate assessment type and descriptions.

3.2. Finalize annotation and evaluate quality of entire official gene set

Manually curated genes should be merged with the models from automated gene predictors after each round of annotation to create the official gene set for public release to the research community. Curated models selected for public release should be carefully screened for errors by expert curators. Tools such as the GFF3toolkit (https://github.com/NAL-i5K/GFF3toolkit/) are useful to identify errors in the curated gene models and automate the merging process. Updates to gene annotations across annotation releases can be tracked by using unique gene identifiers and version numbers. Version numbers should be incremented only if the sequence has been modified in the new annotation. Submission to public databases like the NCBI and ENA is recommended, but the process can be time consuming. There are other options like Figshare (https://figshare.com/), Dryad (http://datadryad.org/) and Ag Data Commons (https://data.nal.usda.gov/), as well as clade-specific databases like i5k (https://i5k.nal.usda.gov/ [31]) and SOL genomics (https://solgenomics.net/ [32]). Metrics for measuring improvements over the entire gene set are limited by the availability of gold standard annotations for comparison and the inherent complexity of annotation. Annotation Edit Distance (AED) gives a measure of the transcript evidence supporting a gene model. MAKER [13] calculates the annotation edit distance for all the gene models in an annotation set. We have shown that this metric can act as a good measure for quality of annotations [9]. AED has also been adopted for model organisms like Arabidopsis, where it replaced the five-star based ranking method used by TAIR [33]. A genome-independent de novo transcriptome can also be used to validate the structure of gene models in the official gene set.
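
To make the AED metric concrete, the sketch below computes it for a single gene model against a single set of aligned evidence. It follows the commonly described definition, AED = 1 - (SN + SP)/2, where SN is the fraction of evidence bases covered by the model and SP is the fraction of model bases covered by the evidence; a value of 0 means complete agreement and 1 means no overlap. This is an illustration of the idea only, not MAKER's implementation, and the exon coordinates are hypothetical.

```python
def covered_bases(intervals):
    """Set of positions covered by inclusive (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def annotation_edit_distance(model_exons, evidence_exons):
    """Base-level AED between a gene model and its supporting evidence.

    Returns 0.0 for identical coverage and 1.0 when there is no overlap
    (or when either side is empty).
    """
    model = covered_bases(model_exons)
    evidence = covered_bases(evidence_exons)
    if not model or not evidence:
        return 1.0
    shared = len(model & evidence)
    sensitivity = shared / len(evidence)  # evidence bases found in the model
    specificity = shared / len(model)     # model bases backed by evidence
    return 1.0 - (sensitivity + specificity) / 2.0


# Hypothetical model with two 100 bp exons; evidence supports the first
# exon fully and only the first half of the second exon.
model_exons = [(100, 199), (300, 399)]
evidence_exons = [(100, 199), (300, 349)]
print(annotation_edit_distance(model_exons, evidence_exons))  # 0.125
```

Building sets of positions keeps the example short; for whole-genome use an interval-based overlap calculation would be more memory efficient.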

3.3. Publication

The goals for publication depend on the scope of the annotation project. Annotation followed by phylogenetic analysis and functional characterization of biologically important gene families can be presented as a course project report or a poster at scientific conferences [20-24]. Results from a larger community curation project can represent a significant research contribution that warrants a journal publication [25,31]. In either case, we recommend that the undergraduate annotators summarize all their findings in gene reports that can then be iteratively revised in consultation with peers and senior scientists. The gene report can be structured like a mini-manuscript with an introduction followed by literature review, methods, results and discussion (see supplementary data in Saha et al., 2017 [9]). It is critical to provide uniform report guidelines so that reports can easily be merged for publication (e.g. as supplementary materials). The discussion should focus on structural features and domain organization of the gene family, in addition to copy number analysis. A phylogenetic comparison with related species can provide evolutionary context for the structure and function. This exercise trains undergraduate students to present their own work and introduces them to scientific writing.

Conclusion

The guidelines presented here provide a framework to build a successful student-driven annotation community that can contribute to ongoing research projects. A community that consists of experts, instructors and peer mentors provides the ideal framework to train and supervise undergraduate students so they can make a meaningful contribution. Benefits to undergraduate student participants include an increase in learning, critical thinking and problem-solving abilities related to molecular biology and genomics. Community curation provides knowledge and skills that help students progress in their undergraduate courses. A research experience also encourages students to explore and pursue graduate education. The inherent need for communication and teamwork in a diverse and sometimes virtual community develops skills that are transferable to a wide range of careers. Students are excited to participate in research projects that have tangible scientific outcomes. This builds a sense of ownership and responsibility, resulting in students who are eager to annotate and to mentor beginners, which sustains the community. Organizers should plan early for the turnover of annotators so that the incoming cohort overlaps sufficiently with experienced annotators. We deployed this strategy successfully over a three-year period to train over 40 student annotators from four different institutions including three universities, a state college, and a research institute. The entire community manually curated approximately 530 gene models. Other accomplishments include creation of the first official gene set for D. citri, an important insect vector, and a peer-reviewed publication featuring student annotators as contributing authors [9]. These results demonstrate that the student-driven community model is fully capable of producing high quality gene models while providing a supportive and valuable educational experience for undergraduate students.

List of suggested personnel, hardware and software resources for community curation projects of differing scope.

(DOCX)

Websites with training resources and guidelines for genome annotation.

(DOCX)

Concept inventory test for genome annotation to be taken by students at the end of the course.

(DOCX)

27 in total

1. What skills should students of undergraduate biochemistry and molecular biology programs have upon graduation?

Authors: Harold B White; Marilee A Benore; Takita F Sumter; Benjamin D Caldwell; Ellis Bell
Journal: Biochem Mol Biol Educ Date: 2013-09-10 Impact factor: 1.160

2. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads.

Authors: Mihaela Pertea; Geo M Pertea; Corina M Antonescu; Tsung-Cheng Chang; Joshua T Mendell; Steven L Salzberg
Journal: Nat Biotechnol Date: 2015-02-18 Impact factor: 54.908

3. Artemis: an integrated platform for visualization and analysis of high-throughput sequence-based experimental data.

Authors: Tim Carver; Simon R Harris; Matthew Berriman; Julian Parkhill; Jacqueline A McQuillan
Journal: Bioinformatics Date: 2011-12-22 Impact factor: 6.937

4. The Sol Genomics Network (SGN)--from genotype to phenotype to breeding.

Authors: Noe Fernandez-Pozo; Naama Menda; Jeremy D Edwards; Surya Saha; Isaak Y Tecle; Susan R Strickler; Aureliano Bombarely; Thomas Fisher-York; Anuradha Pujar; Hartmut Foerster; Aimin Yan; Lukas A Mueller
Journal: Nucleic Acids Res Date: 2014-11-26 Impact factor: 16.971

5. Web Apollo: a web-based genomic annotation editing platform.

Authors: Eduardo Lee; Gregg A Helt; Justin T Reese; Monica C Munoz-Torres; Chris P Childers; Robert M Buels; Lincoln Stein; Ian H Holmes; Christine G Elsik; Suzanna E Lewis
Journal: Genome Biol Date: 2013-08-30 Impact factor: 13.583

6. An inclusive Research Education Community (iREC): Impact of the SEA-PHAGES program on research outcomes and student learning.

Authors: David I Hanauer; Mark J Graham; Laura Betancur; Aiyana Bobrownicki; Steven G Cresawn; Rebecca A Garlena; Deborah Jacobs-Sera; Nancy Kaufmann; Welkin H Pope; Daniel A Russell; William R Jacobs; Viknesh Sivanathan; David J Asai; Graham F Hatfull
Journal: Proc Natl Acad Sci U S A Date: 2017-12-05 Impact factor: 11.205

7. Improved annotation of the insect vector of citrus greening disease: biocuration by a diverse genomics community.

Authors: Surya Saha; Prashant S Hosmani; Krystal Villalobos-Ayala; Sherry Miller; Teresa Shippy; Mirella Flores; Andrew Rosendale; Chris Cordola; Tracey Bell; Hannah Mann; Gabe DeAvila; Daniel DeAvila; Zachary Moore; Kyle Buller; Kathryn Ciolkevich; Samantha Nandyal; Robert Mahoney; Joshua Van Voorhis; Megan Dunlevy; David Farrow; David Hunter; Taylar Morgan; Kayla Shore; Victoria Guzman; Allison Izsak; Danielle E Dixon; Andrew Cridge; Liliana Cano; Xiaolong Cao; Haobo Jiang; Nan Leng; Shannon Johnson; Brandi L Cantarel; Stephen Richards; Adam English; Robert G Shatters; Chris Childers; Mei-Ju Chen; Wayne Hunter; Michelle Cilia; Lukas A Mueller; Monica Munoz-Torres; David Nelson; Monica F Poelchau; Joshua B Benoit; Helen Wiersma-Koch; Tom D'Elia; Susan J Brown
Journal: Database (Oxford) Date: 2017-01-01 Impact factor: 3.451

8. Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Authors: Manfred G Grabherr; Brian J Haas; Moran Yassour; Joshua Z Levin; Dawn A Thompson; Ido Amit; Xian Adiconis; Lin Fan; Raktima Raychowdhury; Qiandong Zeng; Zehua Chen; Evan Mauceli; Nir Hacohen; Andreas Gnirke; Nicholas Rhind; Federica di Palma; Bruce W Birren; Chad Nusbaum; Kerstin Lindblad-Toh; Nir Friedman; Aviv Regev
Journal: Nat Biotechnol Date: 2011-05-15 Impact factor: 54.908

9. Whole genome comparison of a large collection of mycobacteriophages reveals a continuum of phage genetic diversity.

Authors: Welkin H Pope; Charles A Bowman; Daniel A Russell; Deborah Jacobs-Sera; David J Asai; Steven G Cresawn; William R Jacobs; Roger W Hendrix; Jeffrey G Lawrence; Graham F Hatfull
Journal: Elife Date: 2015-04-28 Impact factor: 8.140

10. A broadly implementable research course in phage discovery and genomics for first-year undergraduate students.

Authors: Tuajuanda C Jordan; Sandra H Burnett; Susan Carson; Steven M Caruso; Kari Clase; Randall J DeJong; John J Dennehy; Dee R Denver; David Dunbar; Sarah C R Elgin; Ann M Findley; Chris R Gissendanner; Urszula P Golebiewska; Nancy Guild; Grant A Hartzog; Wendy H Grillo; Gail P Hollowell; Lee E Hughes; Allison Johnson; Rodney A King; Lynn O Lewis; Wei Li; Frank Rosenzweig; Michael R Rubin; Margaret S Saha; James Sandoz; Christopher D Shaffer; Barbara Taylor; Louise Temple; Edwin Vazquez; Vassie C Ware; Lucia P Barker; Kevin W Bradley; Deborah Jacobs-Sera; Welkin H Pope; Daniel A Russell; Steven G Cresawn; David Lopatto; Cheryl P Bailey; Graham F Hatfull
Journal: MBio Date: 2014-02-04 Impact factor: 7.867
