Daniel R Zerbino, Adam Frankish, Paul Flicek.
Abstract
Our understanding of the human genome has continuously expanded since its draft publication in 2001. Over the years, novel assays have allowed us to progressively overlay layers of knowledge above the raw sequence of A's, T's, G's, and C's. The reference human genome sequence is now a complex knowledge base maintained under the shared stewardship of multiple specialist communities. Its complexity stems from the fact that it is simultaneously a template for transcription, a record of evolution, a vehicle for genetics, and a functional molecule. In short, the human genome serves as a frame of reference at the intersection of a diversity of scientific fields. In recent years, the progressive fall in sequencing costs has given increasing importance to the quality of the human reference genome, as hundreds of thousands of individuals are being sequenced yearly, often for clinical applications. Also, novel sequencing-based assays shed light on novel functions of the genome, especially with respect to gene expression regulation. Keeping the human genome annotation up to date and accurate is therefore an ongoing partnership between reference annotation projects and the greater community worldwide.
Keywords: annotation; genes; genome; human; regulatory elements; variants
Year: 2020 PMID: 32421357 PMCID: PMC7116059 DOI: 10.1146/annurev-genom-121119-083418
Source DB: PubMed Journal: Annu Rev Genomics Hum Genet ISSN: 1527-8204 Impact factor: 8.929
Figure 1. Gene annotation process.
Gene annotation uses diverse orthogonal data types to determine first the structure and then the most likely functional class of the transcript and gene locus. Long transcriptomic data aligned to the reference genome identify the overall exon-intron structure of the transcript, while short RNA sequencing reads give confidence to the annotation of precise intron/exon boundaries and extensions at the ends of the transcripts (5′ and 3′ untranslated regions), especially where coverage from longer reads is low. Some transcript structures may be annotated entirely based on RNA sequencing data, again where coverage from longer reads is low. Terminal short-read data sets help define the 5′ and 3′ ends of transcripts, which is important from both a structural and a functional point of view: where the termini of a transcript can be identified with confidence, lending certainty to the structural annotation, the annotators gain greater confidence in their functional annotation. The presence of high-quality proteomic data and evidence of the evolutionary conservation of coding sequence inform the annotation of coding potential.
Evidence relevant to the annotation of different types of genes
| Biotype | Transcription data (INSDC, RNA-seq, PacBio, ONT) | Terminal transcription data (CAGE, RAMPAGE, polyA-seq) | Protein homology data (UniProt) | Protein experimental data (MS, ribo-seq) | Conservation data (PhyloCSF, PhastCons, GERP) | RNA secondary structure data (Infernal) | External expert database (miRBase, Rfam, IMGT) |
|---|---|---|---|---|---|---|---|
| Protein coding | Yes | Yes | Yes | Yes | Yes | No | No |
| lncRNA | Yes | Yes | No | No | No | No | No |
| sRNA | Yes | No | No | No | Yes | Yes | Yes |
| Pseudogene | No (a) | No (a) | Yes | No | Yes (b) | No | No |
| IG/TR | No | No | No | Yes | No | No | Yes |
This table illustrates the evidence types generally used by manual annotators in the Ensembl team to determine the correct structure and function of a transcript model. Protein-coding genes require transcriptomic evidence to define structure and terminal transcription data sets to define transcript start and end coordinates. Homology with UniProt and proteomics data inform or validate the decision to assign a transcript or locus the protein-coding biotype (that is, to decide whether a functional protein is encoded). Similarly, evolutionary conservation of sequence and of protein-coding potential also informs this decision. Decisions about protein-coding genes do not generally use RNA secondary structure or other expert databases, although they may be consulted on a case-by-case basis. The annotation of lncRNAs utilizes the same transcriptomic data sets as protein-coding genes; however, the absence of protein homology, experimental proteomics data, and conservation is a key determinant in choosing not to annotate a transcript as protein coding. For sRNAs, transcriptomic data sets, conservation data, RNA secondary structure data, and expert external databases are utilized. Pseudogenes are annotated based solely on their homology to annotated protein sequences, although transcriptomic data are used to support the transcribed pseudogene biotypes. IG/TR gene segments are annotated on the basis of protein experimental data and homology to IG/TR sequences from the IMGT database.
Abbreviations: CAGE, cap analysis gene expression; GERP, Genomic Evolutionary Rate Profiling; IG, immunoglobulin; IMGT, International Immunogenetics; Infernal, Inference of RNA Alignment; INSDC, International Nucleotide Sequence Database Collaboration; lncRNA, long noncoding RNA; miRBase, MicroRNA Database; MS, mass spectrometry; ONT, Oxford Nanopore Technologies; PacBio, Pacific Biosciences; PhyloCSF, Phylogenetic Codon Substitution Frequencies; polyA-seq, polyA sequencing; RAMPAGE, RNA annotation and mapping of promoters for the analysis of gene expression; ribo-seq, ribosome profiling; RNA-seq, RNA sequencing; sRNA, small RNA; TR, T cell receptor.
(a) For nontranscribed pseudogenes only; transcribed pseudogenes may be supported by these data.
(b) While pseudogenes are not conserved over large evolutionary distances, known artifacts in the whole-genome alignments on which conservation detection is based permit their identification with care.
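
For readers who want to query the evidence table programmatically, the rows can be transcribed into a small lookup structure. The following is only an illustrative sketch: the identifiers (`EVIDENCE_TYPES`, `EVIDENCE_BY_BIOTYPE`, `evidence_for`, `biotypes_using`) are hypothetical names chosen here, not part of any Ensembl tool, and the pseudogene footnote caveats are not modelled.

```python
# Evidence types, in the column order of the table.
# All names below are illustrative assumptions, not an Ensembl API.
EVIDENCE_TYPES = (
    "transcription",
    "terminal_transcription",
    "protein_homology",
    "protein_experimental",
    "conservation",
    "rna_secondary_structure",
    "expert_database",
)

# Rows transcribed from the table (True = "Yes", False = "No").
# The footnoted pseudogene cells are recorded as their plain Yes/No value.
EVIDENCE_BY_BIOTYPE = {
    "protein_coding": (True, True, True, True, True, False, False),
    "lncRNA": (True, True, False, False, False, False, False),
    "sRNA": (True, False, False, False, True, True, True),
    "pseudogene": (False, False, True, False, True, False, False),
    "IG/TR": (False, False, False, True, False, False, True),
}


def evidence_for(biotype):
    """List the evidence types marked "Yes" for one biotype."""
    row = EVIDENCE_BY_BIOTYPE[biotype]
    return [name for name, used in zip(EVIDENCE_TYPES, row) if used]


def biotypes_using(evidence_type):
    """List the biotypes whose row marks this evidence type "Yes"."""
    column = EVIDENCE_TYPES.index(evidence_type)
    return [b for b, row in EVIDENCE_BY_BIOTYPE.items() if row[column]]
```

For example, `evidence_for("lncRNA")` lists only the two transcription-based evidence types, which mirrors the point that lncRNA annotation rests on the absence of coding evidence.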
Figure 2. Organizations that support the GRC assembly and its gene annotations.
Abbreviations: e!, Ensembl Project; GRC, Genome Reference Consortium; HGNC, Human Genome Organisation (HUGO) Gene Nomenclature Committee; INSDC, International Nucleotide Sequence Database Collaboration; NCBI, National Center for Biotechnology Information; UCSC, University of California, Santa Cruz.
Figure 3. A locus whose identification was possible only through the analysis of recent orthogonal data types.
The locus lacks any support from transcript evidence deposited in INSDC databases, and as such, it is not represented in any reference annotation database. Only by identifying the intersection of PhyloCSF data (to identify conserved protein-coding potential), RNA-seq data (to provide evidence of transcription and tissue specificity), Intropolis RNA-seq-supported intron-spanning reads (to provide evidence for precise split junctions and support tissue specificity from other datasets), CAGE data (to define transcript 5′ ends and tissue specificity support), and polyA-seq data (to define transcript 3′ ends and tissue specificity support) could a correctly splicing transcript model be built and the correct coding sequence added. Given the expectation of conservation, protein-coding genes identified by this annotation process were also annotated in mouse to provide an additional check on their validity.
Abbreviations: CAGE, cap analysis gene expression; INSDC, International Nucleotide Sequence Database Collaboration; PhyloCSF, Phylogenetic Codon Substitution Frequencies; polyA-seq, polyA sequencing; RNA-seq, RNA sequencing.
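
The idea of requiring several independent evidence types to agree on one locus can be sketched as a simple overlap check. This is a toy illustration, not the annotators' actual procedure: the `Feature` class, the `missing_evidence` helper, the half-open coordinate convention, and all coordinates are assumptions invented for the example. Only the five track names come from the caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A genomic interval on one chromosome (0-based, end-exclusive)."""
    start: int
    end: int


# Evidence types named in the figure caption.
REQUIRED_TRACKS = ("PhyloCSF", "RNA-seq", "Intropolis", "CAGE", "polyA-seq")


def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a.start < b.end and b.start < a.end


def missing_evidence(locus, tracks):
    """Return the required tracks with no feature overlapping the locus."""
    return [
        name
        for name in REQUIRED_TRACKS
        if not any(overlaps(locus, f) for f in tracks.get(name, []))
    ]


# Invented coordinates, for illustration only.
locus = Feature(1_000, 5_000)
tracks = {
    "PhyloCSF": [Feature(1_200, 1_500), Feature(3_000, 3_400)],
    "RNA-seq": [Feature(1_000, 1_600), Feature(2_900, 5_000)],
    "Intropolis": [Feature(1_600, 2_900)],
    "CAGE": [Feature(990, 1_010)],
    "polyA-seq": [Feature(6_000, 6_020)],  # lies outside the candidate locus
}
missing = missing_evidence(locus, tracks)
```

In this toy input, `missing` contains only `"polyA-seq"`, so the candidate would not yet have a supported 3′ end. A real workflow would also need to consider strand, tissue, and splice-junction consistency, which this sketch ignores.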
Figure 4. Progress in the annotation of gene loci in Ensembl/GENCODE.
(a) The number of protein-coding genes annotated has generally fallen over time but appears to have stabilized in recent years. The number of pseudogene loci increased rapidly during the annotation of the whole genome (2007–2012) and has maintained slow growth subsequently, while the number of lncRNA loci followed a similar pattern of increase but continues to rise. Small-RNA locus totals are generally stable, changing only when there is a significant update to their automated annotation pipeline, and the relatively few IG and TR segments have remained broadly stable since their initial annotation. (b) The number of transcripts continues to increase over time, particularly for protein-coding genes and lncRNA loci, and given the availability of high-quality long-read data sets, this trend is expected to continue. (c,d) The changes to protein-coding gene counts underlying the relatively stable headline totals for human and mouse, respectively, in three recent Ensembl/GENCODE annotation releases. Protein-coding genes were both added and removed in every human and mouse release, with a total of 33 additions and 48 removals in human and 80 additions and 188 removals in mouse, suggesting that the final gene annotation for protein-coding genes has not yet been settled. Abbreviations: IG, immunoglobulin; lncRNA, long noncoding RNA; TR, T cell receptor.
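
The totals quoted for panels c and d imply a small net loss of protein-coding genes alongside a larger gross turnover. The short calculation below makes that arithmetic explicit; the variable and function names are illustrative, and only the four counts are taken from the caption.

```python
# Additions and removals of protein-coding genes summed over the three
# releases described in the caption. Only these four counts come from the
# source; the names are illustrative.
changes = {
    "human": {"added": 33, "removed": 48},
    "mouse": {"added": 80, "removed": 188},
}


def net_change(counts):
    """Net change in gene count (negative means a net loss)."""
    return counts["added"] - counts["removed"]


def turnover(counts):
    """Total number of genes added or removed."""
    return counts["added"] + counts["removed"]


net = {species: net_change(c) for species, c in changes.items()}
gross = {species: turnover(c) for species, c in changes.items()}
# net   -> {"human": -15, "mouse": -108}
# gross -> {"human": 81, "mouse": 268}
```

The net change is much smaller than the gross turnover in both species, which is why the headline totals look stable even though individual gene calls are still being revised.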