
Next-generation genome annotation: we still struggle to get it right.

Abstract

While the genome sequencing revolution has led to the sequencing and assembly of many thousands of new genomes, genome annotation still uses very nearly the same technology that we have used for the past two decades. The sheer number of genomes necessitates the use of fully automated procedures for annotation, but errors in annotation are just as prevalent as they were in the past, if not more so. How are we to solve this growing problem?

Year: 2019 PMID: 31097009 PMCID: PMC6521345 DOI: 10.1186/s13059-019-1715-2

Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583

Introduction

When the first complete bacterial genome, Haemophilus influenzae, appeared in 1995, the 1.83 megabase (Mb) sequence was accompanied by annotation of 1742 protein-coding genes along with a small complement of transfer RNAs (tRNAs) and ribosomal RNAs [1]. This genome paper, and the dozen or so that followed in the next few years, defined genome annotation as it still exists today: the process of decorating the genome with information about where the genes are and what those genes (might) do. Over the years, efforts to expand the scope of annotation have flourished, and today we have information about a wide range of other functional elements, including noncoding RNAs, promoter and enhancer sequences, DNA methylation sites, and more. Nonetheless, the core feature of genome annotation is still the gene list, particularly the protein-coding genes. With hundreds of eukaryotic genomes and well over 100,000 bacterial genomes now residing in GenBank, and many thousands more soon to come, annotation is a critical element to help us understand the biology of genomes. Paradoxically, the incredibly rapid improvements in genome sequencing technology have made genome annotation less, not more, accurate. The main challenges can be divided into two categories: (i) automated annotation of large, fragmented “draft” genomes remains very difficult, and (ii) errors and contamination in draft assemblies lead to errors in annotation that tend to propagate across species. Thus, the more “draft” genomes we produce, the more errors we create and propagate. Fortunately, technological advances give us some hope that we can mitigate these problems, even if a full solution is still beyond our reach.

High-throughput annotation of eukaryotic genomes

Finding genes in bacteria is relatively easy, in large part because bacterial genomes are approximately 90% protein-coding, with relatively short intergenic stretches in between every pair of genes. The gene-finding problem is mostly about deciding which of the six possible reading frames (three in each direction) contains the protein, and computational gene finders take advantage of this to produce highly accurate results. Thus, although we still don’t know the functions of many bacterial genes, at least we can be confident that we have their amino acid sequences correct. In eukaryotes, by contrast, the gene-finding problem is far more difficult, because (i) genes are few and far between, and (ii) genes are interrupted by introns. Thus, while 90% of a typical bacterial genome is covered by protein-coding sequences, only about 1.3% of the human genome (40.2 Mb in the CHESS 2.2 database [2]) comprises protein-coding exons. The percentage is even lower in larger genomes, such as the mega-genomes of pine trees and other conifers. For this reason and others, the best automated gene finders are far less accurate on eukaryotes. Manual curation will not solve this quandary, for the obvious reason that it does not scale, and the less-obvious reason that even careful human analysis does not always provide a clear answer. To illustrate the latter point: in a recent comparison of all the protein-coding and lncRNA transcripts in the RefSeq and Gencode human gene databases, only 27.5% of the Gencode transcripts had exactly the same introns as the corresponding RefSeq genes [2]. Thus, even after 18 years of effort, the precise exon–intron structure of many human protein-coding genes is not settled. The annotation of most other eukaryotes—with the exception of small, intensively studied model organisms like yeast, fruit fly and Arabidopsis—is in worse shape than human annotation. 
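The six-reading-frame search described above for bacterial genomes can be sketched in a few lines of Python. This is a toy illustration, not how production gene finders work: it assumes the standard stop codons (TAA, TAG, TGA), treats ATG as the only start codon, and reports the first ATG in each open stretch. The function names (reverse_complement, orfs_in_frame, six_frame_orfs) and the min_codons threshold are hypothetical choices made for this example.

```python
# Toy six-frame open reading frame (ORF) scan.
# Assumptions: standard stop codons, ATG as the only start codon.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset, min_codons):
    """Yield (start, end) for ATG...stop ORFs in one frame (0-based, end exclusive)."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            # Count codons including the stop codon itself.
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end) tuples; coordinates refer to the scanned strand."""
    seq = seq.upper()
    found = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(strand_seq, offset, min_codons):
                found.append((strand, offset, start, end))
    return found


if __name__ == "__main__":
    print(six_frame_orfs("CCATGAAATTTTAGGG"))
```

For the short test sequence above, the scan reports a single ORF on the forward strand in frame 2. The same approach breaks down in eukaryotes, where introns interrupt the reading frame and coding sequence is a small fraction of the genome.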
One high-throughput technology provides at least a partial solution to this problem: RNA sequencing (RNA-seq). Prior to the invention of RNA-seq, scientists worked hard to generate full-length transcripts that could provide a “gold standard” annotation for a species. The idea was that if we had the full-length messenger RNA sequence for a gene, we could simply align it to the genome to reveal the gene’s exon–intron structure. The Mammalian Gene Collection, an effort to obtain these RNAs for humans and a few other species, concluded in 2009 with the announcement that 92% of human protein-coding genes had been captured [3]. That project, though extremely useful, was very expensive, not easily scalable, and still not comprehensive. (Notably, the Mammalian Gene Collection only attempted to capture a single isoform of each gene. We now know that most human genes have multiple isoforms.) RNA-seq technology, in contrast, provides a rapid way to capture most of the expressed genes for any species. By aligning RNA-seq reads to a genome and then assembling those reads, we can construct a reasonably good approximation (including alternative isoforms) of the complete gene content of a species, as my colleagues and I have done for the human genome [2]. Thus, a modern annotation pipeline such as MAKER [4] can use RNA-seq data, combined with alignments to databases of known proteins and other inputs, to do a passably good job of finding all genes and even assigning names to many of them. This solution comes with several major caveats. First, RNA-seq does not precisely capture all of the genes in a genome. Some genes are expressed at low levels or in only a few tissues, and they might be missed entirely unless the RNA sequencing data are truly comprehensive. In addition, many of the transcripts expressed in a tissue sample are not genes: they might represent incompletely spliced transcripts, or they might simply be noise. 
Therefore, we need independent verification before we can be certain that any expressed region is a functional gene. Even for genes that are repeatedly expressed at high levels, determining whether they encode proteins or instead represent noncoding RNAs is a still-unsolved problem. The current Gencode human annotation (version 30), for example, contains more RNA genes than proteins [5], but no one knows what most of those RNA genes do. Another caveat is that because draft genomes may contain thousands of disconnected contigs, many genes will be broken up among several contigs (or scaffolds) whose order and orientation are unknown. The problem occurs in all species, but it is much worse for draft genomes where the average contig size is smaller than the span of a typical gene. This makes it virtually impossible for annotation software to put genes together correctly; instead, the software will tend to annotate many gene fragments (residing on different contigs) with the same descriptions, and the total gene count might be vastly overinflated. Even where they don’t have gaps, some draft genomes have high error rates that may introduce erroneous stop codons or frame shifts in the middle of genes. There is no way that annotation software can easily fix these problems: the only solution is to improve the assemblies and re-annotate.
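The effect of sequence errors on annotated genes can be made concrete with a small check for the two symptoms mentioned above: a premature stop codon and a coding sequence whose length is not a multiple of three. The sketch below is illustrative only. It assumes the standard genetic code and a coding sequence supplied from start codon through stop codon, and the function name check_cds is hypothetical. A flagged gene may also be a genuine pseudogene, so the check cannot by itself distinguish an assembly error from real biology.

```python
# Flag coding sequences that show signs of assembly errors.
# Assumptions: standard stop codons; input runs from start codon to stop codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(cds):
    """Return a list of problem descriptions for one coding sequence."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    # Every codon except the last one should be a sense codon.
    internal = [i for i, codon in enumerate(codons[:-1]) if codon in STOP_CODONS]
    if internal:
        problems.append(f"internal stop codon at codon index {internal[0]}")
    return problems


if __name__ == "__main__":
    for name, seq in [("ok", "ATGAAATAG"), ("stop", "ATGTGAAAATAG"), ("shift", "ATGAAATA")]:
        print(name, check_cds(seq))
```

A check like this only detects the problem. As the text notes, the fix is to improve the assembly and re-annotate.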

Errors in assembly cause errors in annotation

Sequencing a bacterial genome or a small eukaryote is so fast and inexpensive today that a relatively small lab can easily afford the cost of deep whole-genome shotgun sequencing. After generating 100-fold coverage in 100–150 bp Illumina reads, a scientist can assemble the data into a draft genome using any of several genome assemblers. Ironically, though, the ease of sequencing and assembly presents another challenge for annotation: contamination of the assembly itself. When a genome is assembled into thousands of contigs, the person doing the assembly has no easy way to ensure that every one of those contigs truly represents the target species. In some recent projects, draft genomes contained hundreds of contigs from foreign species. One example is the tardigrade genome, which was sequenced from DNA collected from multiple whole animals. (This was a necessary step because a single tardigrade does not yield sufficient DNA for whole-genome sequencing.) The first publication of the tardigrade genome erroneously claimed that its contaminants represented an astounding number of horizontal gene transfer events; fortunately, a much better assembly was published very soon after the first one, in which the contaminants were identified and removed [6]. Other draft genomes have yielded similar claims of horizontal gene transfer, many of which are false positives due to contamination [7]. In addition, many draft genome assemblies are contaminated with common bacteria [8], sequencing vectors, or even human DNA [9], all of which are ubiquitous in sequencing labs. Although automated annotation is essential to keep pace with the vast number of new genomes, any error in existing annotation (whether it be a mistaken gene name, a gene labeled as belonging to the wrong species, or a non-genic sequence being called a gene) is likely to be quickly propagated to other species. 
This presents one more (and growing) annotation challenge: when an annotation error is found and corrected in one species, any other annotation that relied upon it needs to be corrected as well. Currently there is no way to achieve this; indeed, public annotation databases do not record the source of every gene assignment.
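To make the propagation problem concrete, the sketch below shows what becomes possible if the source of each gene assignment is recorded. It is a hypothetical design, not a description of any existing database: the provenance mapping, the identifier format, and the function names (build_dependents, affected_by) are assumptions made for illustration. Given a corrected annotation, a breadth-first walk finds every annotation derived from it, directly or transitively.

```python
from collections import defaultdict, deque


def build_dependents(provenance):
    """Invert {annotation: source or None} into {source: [dependent annotations]}."""
    dependents = defaultdict(list)
    for annotation, source in provenance.items():
        if source is not None:
            dependents[source].append(annotation)
    return dependents


def affected_by(corrected, provenance):
    """List annotations derived, directly or transitively, from `corrected`."""
    dependents = build_dependents(provenance)
    seen = {corrected}  # guards against cycles in the recorded provenance
    queue = deque([corrected])
    affected = []
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


if __name__ == "__main__":
    # Hypothetical records: each annotation names the annotation it was copied from.
    provenance = {
        "speciesA:gene1": None,
        "speciesB:gene7": "speciesA:gene1",
        "speciesC:gene3": "speciesB:gene7",
        "speciesD:gene9": None,
    }
    print(affected_by("speciesA:gene1", provenance))
```

In this example, correcting speciesA:gene1 flags the annotations in species B and C for review, while the independent annotation in species D is untouched. Without recorded provenance, no such list can be computed.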

Coming soon: direct RNA sequencing

Finally, a newly emerging technology, direct sequencing of RNA [10], offers the possibility of dramatically improving gene annotation in the future. Although still in early development, nanopore sequencing technology can be used to sequence RNA without first converting it to DNA, unlike RNA-seq and other methods. With direct RNA sequencing, we may soon have the ability to generate full-length transcripts in a truly high-throughput manner, replacing years-long efforts of the past [3] with a rapid, low-cost solution that will be within the reach of many individual scientific labs. This approach, although not a panacea, promises to greatly improve our ability to describe the full complement of genes for every species.

9 in total

1. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes.

Authors: Brandi L Cantarel; Ian Korf; Sofia M C Robb; Genis Parra; Eric Ross; Barry Moore; Carson Holt; Alejandro Sánchez Alvarado; Mark Yandell
Journal: Genome Res Date: 2007-11-19 Impact factor: 9.043

2. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.

Authors: R D Fleischmann; M D Adams; O White; R A Clayton; E F Kirkness; A R Kerlavage; C J Bult; J F Tomb; B A Dougherty; J M Merrick
Journal: Science Date: 1995-07-28 Impact factor: 47.728

3. The completion of the Mammalian Gene Collection (MGC).

Authors: Gary Temple; Daniela S Gerhard; Rebekah Rasooly; Elise A Feingold; Peter J Good; Cristen Robinson; Allison Mandich; Jeffrey G Derge; Jeanne Lewis; Debonny Shoaf; Francis S Collins; Wonhee Jang; Lukas Wagner; Carolyn M Shenmen; Leonie Misquitta; Carl F Schaefer; Kenneth H Buetow; Tom I Bonner; Linda Yankie; Ming Ward; Lon Phan; Alex Astashyn; Garth Brown; Catherine Farrell; Jennifer Hart; Melissa Landrum; Bonnie L Maidak; Michael Murphy; Terence Murphy; Bhanu Rajput; Lillian Riddick; David Webb; Janet Weber; Wendy Wu; Kim D Pruitt; Donna Maglott; Adam Siepel; Brona Brejova; Mark Diekhans; Rachel Harte; Robert Baertsch; Jim Kent; David Haussler; Michael Brent; Laura Langton; Charles L G Comstock; Michael Stevens; Chaochun Wei; Marijke J van Baren; Kourosh Salehi-Ashtiani; Ryan R Murray; Lila Ghamsari; Elizabeth Mello; Chenwei Lin; Christa Pennacchio; Kirsten Schreiber; Nicole Shapiro; Amber Marsh; Elizabeth Pardes; Troy Moore; Anita Lebeau; Mike Muratet; Blake Simmons; David Kloske; Stephanie Sieja; James Hudson; Praveen Sethupathy; Michael Brownstein; Narayan Bhat; Joseph Lazar; Howard Jacob; Chris E Gruber; Mark R Smith; John McPherson; Angela M Garcia; Preethi H Gunaratne; Jiaqian Wu; Donna Muzny; Richard A Gibbs; Alice C Young; Gerard G Bouffard; Robert W Blakesley; Jim Mullikin; Eric D Green; Mark C Dickson; Alex C Rodriguez; Jane Grimwood; Jeremy Schmutz; Richard M Myers; Martin Hirst; Thomas Zeng; Kane Tse; Michelle Moksa; Merinda Deng; Kevin Ma; Diana Mah; Johnson Pang; Greg Taylor; Eric Chuah; Athena Deng; Keith Fichter; Anne Go; Stephanie Lee; Jing Wang; Malachi Griffith; Ryan Morin; Richard A Moore; Michael Mayo; Sarah Munro; Susan Wagner; Steven J M Jones; Robert A Holt; Marco A Marra; Sun Lu; Shuwei Yang; James Hartigan; Marcus Graf; Ralf Wagner; Stanley Letovksy; Jacqueline C Pulido; Keith Robison; Dominic Esposito; James Hartley; Vanessa E Wall; Ralph F Hopkins; Osamu Ohara; Stefan Wiemann
Journal: Genome Res Date: 2009-09-18 Impact factor: 9.043

4. No evidence for extensive horizontal gene transfer in the genome of the tardigrade Hypsibius dujardini.

Authors: Georgios Koutsovoulos; Sujai Kumar; Dominik R Laetsch; Lewis Stevens; Jennifer Daub; Claire Conlon; Habib Maroon; Fran Thomas; Aziz A Aboobaker; Mark Blaxter
Journal: Proc Natl Acad Sci U S A Date: 2016-03-24 Impact factor: 11.205

5. Horizontal gene transfer is not a hallmark of the human genome.

Authors: Steven L Salzberg
Journal: Genome Biol Date: 2017-05-08 Impact factor: 13.583

6. Human Contamination in Public Genome Assemblies.

Authors: Kirill Kryukov; Tadashi Imanishi
Journal: PLoS One Date: 2016-09-09 Impact factor: 3.240

7. CHESS: a new human gene catalog curated from thousands of large-scale RNA sequencing experiments reveals extensive transcriptional noise.

Authors: Mihaela Pertea; Alaina Shumate; Geo Pertea; Ales Varabyou; Florian P Breitwieser; Yu-Chi Chang; Anil K Madugundu; Akhilesh Pandey; Steven L Salzberg
Journal: Genome Biol Date: 2018-11-28 Impact factor: 13.583

8. Removing contaminants from databases of draft genomes.

Authors: Jennifer Lu; Steven L Salzberg
Journal: PLoS Comput Biol Date: 2018-06-25 Impact factor: 4.475

9. GENCODE reference annotation for the human and mouse genomes.

Authors: Adam Frankish; Mark Diekhans; Anne-Maud Ferreira; Rory Johnson; Irwin Jungreis; Jane Loveland; Jonathan M Mudge; Cristina Sisu; James Wright; Joel Armstrong; If Barnes; Andrew Berry; Alexandra Bignell; Silvia Carbonell Sala; Jacqueline Chrast; Fiona Cunningham; Tomás Di Domenico; Sarah Donaldson; Ian T Fiddes; Carlos García Girón; Jose Manuel Gonzalez; Tiago Grego; Matthew Hardy; Thibaut Hourlier; Toby Hunt; Osagie G Izuogu; Julien Lagarde; Fergal J Martin; Laura Martínez; Shamika Mohanan; Paul Muir; Fabio C P Navarro; Anne Parker; Baikang Pei; Fernando Pozo; Magali Ruffier; Bianca M Schmitt; Eloise Stapleton; Marie-Marthe Suner; Irina Sycheva; Barbara Uszczynska-Ratajczak; Jinuri Xu; Andrew Yates; Daniel Zerbino; Yan Zhang; Bronwen Aken; Jyoti S Choudhary; Mark Gerstein; Roderic Guigó; Tim J P Hubbard; Manolis Kellis; Benedict Paten; Alexandre Reymond; Michael L Tress; Paul Flicek
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971
