Prashant S Hosmani, Teresa Shippy, Sherry Miller, Joshua B Benoit, Monica Munoz-Torres, Mirella Flores-Gonzalez, Lukas A Mueller, Helen Wiersma-Koch, Tom D'Elia, Susan J Brown, Surya Saha.
Abstract
High-quality gene models are necessary to expand the molecular and genetic tools available for a target organism, but these are available for only a handful of model organisms that have undergone extensive curation and experimental validation over the course of many years. The majority of gene models present in biological databases today have been identified in draft genome assemblies using automated annotation pipelines that are frequently based on orthologs from distantly related model organisms and usually have minor or major errors. Manual curation is time-consuming and often requires substantial expertise, but is instrumental in improving gene model structure and identification. Manual annotation may seem to be a daunting and cost-prohibitive task for small research communities, but involving undergraduates in community genome annotation consortiums can benefit both education and genomic resources. We outline a workflow for efficient manual annotation driven by a team of primarily undergraduate annotators. This model can be scaled to large teams and includes quality control processes through incremental evaluation. Moreover, it gives students an opportunity to increase their understanding of genome biology and to participate in scientific research in collaboration with peers and senior researchers at multiple institutions.
Year: 2019 PMID: 30943207 PMCID: PMC6447164 DOI: 10.1371/journal.pcbi.1006682
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Use of evidence sets and other resources for manual curation.
| Type of data | Application |
|---|---|
| DNAseq | Aligned reads can help to evaluate integrity of the assembly and correct SNPs and insertion or deletion errors |
| Consensus gene predictions (e.g. MAKER) | Primary source of gene models for manual curation |
| Models from | Alternative sources of gene models that are more comprehensive but may contain false positives |
| RNAseq | Illumina short reads aligned to the genome can act as raw data for curation. They provide evidence for splicing and exon structure. RNAseq data from different tissues, organs, life stages or conditions is helpful to discern alternative transcripts. |
| Transcriptome assemblies (e.g. Trinity) | These provide a condensed representation of the aligned RNAseq reads and assist in discovery of multiple isoforms. |
| Homologous proteins | Well-annotated proteins from related species offer an additional source of evidence for validating the structure of genes. This is helpful when RNAseq coverage is insufficient or genes are lowly expressed. Moreover, these can provide functional descriptions for the gene. |
| Full-length cDNA sequences | PacBio or Nanopore sequencing of full-length transcripts is very useful for clearly deciphering multiple isoforms of a gene, eliminating the ambiguity of partial transcripts assembled from short reads. |
| Proteomics data | Peptides identified by mass spectrometry from different tissues of the organism can provide evidence of translation of genes predicted by |
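
As an illustration of how RNAseq alignments provide evidence for splicing and exon structure, the following minimal Python sketch derives the introns implied by a gene model and reports those with no matching splice junction. It is not part of the published workflow: the function names, the coordinate convention (1-based, inclusive) and the example coordinates are all assumptions made for this example.

```python
from typing import List, Set, Tuple

# Assumed convention: 1-based, inclusive (start, end) on a single strand.
Interval = Tuple[int, int]


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the introns implied by consecutive exons of one transcript."""
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def unsupported_introns(exons: List[Interval],
                        junctions: Set[Interval]) -> List[Interval]:
    """Return model introns that have no exactly matching RNAseq junction."""
    return [intron for intron in introns_from_exons(exons)
            if intron not in junctions]


# Hypothetical gene model and splice junctions, for illustration only.
model_exons = [(100, 200), (301, 400), (501, 650)]
rnaseq_junctions = {(201, 300), (401, 520)}

# The second intron of the model (401-500) disagrees with the junction
# seen in the reads (401-520), so it is flagged for manual review.
print(unsupported_introns(model_exons, rnaseq_junctions))  # [(401, 500)]
```

A flagged intron does not by itself mean the model is wrong. It may reflect an alternative transcript or low read coverage, which is why the other evidence types in the table are consulted before a model is edited.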
Fig 1. Annotation workflow describing various steps in manual curation of protein-coding genes.
Assessment plan for students with description of student objectives and related assessments to measure student annotation progress and quality.
Objectives are outlined to ensure students follow the workflow in Fig 1. Students should be able to perform the activities at each step before starting the next phase of the workflow. Objective numbers correspond to the appropriate assessment type and descriptions.
| Objective | Assessment types and descriptions |
|---|---|
| | Electronic lab book documentation of notes and work describing: Names, organisms and accession numbers of orthologous sequences. Include database where orthologous sequences were collected. Names of conserved domains, size and organization within protein. Record bioinformatics tools and database used to analyze domains. Structural organization of the gene and copy number in closely related organisms. Prepare a short report (PowerPoint or written) of the gene family/pathway, share with lab group or peers in class. Reports should include: literature review and determination of gene family/pathway function, copy number of genes, conservation in related organisms, estimation of number of each gene expected to be in the family/pathway. |
| | Electronic lab book documentation of notes and work describing: Details of BLAT or BLAST results, including: Similarity or identity scores, E values, query coverage and genome coordinates of matching sequences. Record status of predicted models and evidence tracks for gene to be annotated. Record changes made to predicted model. Evaluate structural annotation by comparison of final sequence to orthologs and data collected on conserved domains to determine the completeness of the annotation. Document comparative analysis to homologous proteins that supports the functional characterization. Record organisms, accession numbers and sequence similarity. Provide results of analyses using BLAST, multiple sequence alignments or phylogenetic analysis. Iterative annotation with review: Examine accuracy of annotations through peer review and presentation of short reports (PowerPoint or written) to faculty and scientist mentors. |
| | Written report, poster presentations or oral presentations (class or professional meetings) that include: overview of gene family/pathway; description of the annotated genes, processes used, support and evidence collected; gene copy tables for each gene in family/pathway; pairwise comparisons of genes in other organisms; phylogenetic trees of genes with sequence/copy number different from those in orthologs; analysis of biological significance of genes in family/pathway based on evidence from related organisms. Contribute information required for establishing the official gene set. Contribute reports and information required for preparing peer-reviewed publications. |
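
The BLAST details that students are asked to record (identity, E value, query coverage and genome coordinates) can be pulled from BLAST+ tabular output. The sketch below assumes the default 12-column `-outfmt 6` layout and that query lengths are supplied separately, since that default layout does not report them. The `Hit` class, `parse_hits` function and the sample hit line are hypothetical and are not taken from the article.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


@dataclass
class Hit:
    query: str
    subject: str
    percent_identity: float
    evalue: float
    query_coverage: float  # percent of the query spanned by this one HSP
    subject_start: int     # genome coordinates, reported as start <= end
    subject_end: int


def parse_hits(text: str, query_lengths: Dict[str, int]) -> List[Hit]:
    """Parse tabular BLAST output into records suitable for a lab book."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(FIELDS, row))
        aligned = abs(int(rec["qend"]) - int(rec["qstart"])) + 1
        coverage = 100.0 * aligned / query_lengths[rec["qseqid"]]
        s1, s2 = int(rec["sstart"]), int(rec["send"])
        hits.append(Hit(rec["qseqid"], rec["sseqid"], float(rec["pident"]),
                        float(rec["evalue"]), coverage,
                        min(s1, s2), max(s1, s2)))
    return hits


# Hypothetical hit of a 250-residue protein query against a scaffold.
sample = "geneA\tscaffold_12\t98.50\t200\t3\t0\t1\t200\t5000\t5599\t1e-100\t380\n"
hits = parse_hits(sample, {"geneA": 250})
print(hits[0].query_coverage)  # 80.0
```

Coverage here is computed per alignment (HSP). A gene split across several exons typically produces several HSPs, so coverage for the whole gene would need the query intervals of all HSPs to be merged first.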