Real time metagenomics: using k-mers to annotate metagenomes.

Robert A Edwards, Robert Olson, Terry Disz, Gordon D Pusch, Veronika Vonstein, Rick Stevens, Ross Overbeek.

Abstract

Annotation of metagenomes involves comparing the individual sequence reads with a database of known sequences and assigning a unique function to each read. This is a time-consuming task that is computationally intensive (though not computationally complex). Here we present a novel approach to annotate metagenomes using unique k-mer oligopeptide sequences from 7 to 12 amino acids long. We demonstrate that k-mer-based annotations are faster than blastx-based annotations and approach their sensitivity and precision without losing accuracy. A last-common-ancestor approach was also developed to describe the members of the community.

Year: 2012 PMID: 23047562 PMCID: PMC3519453 DOI: 10.1093/bioinformatics/bts599

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Metagenomics has revolutionized microbial ecology. The extraction, purification and sequencing steps have been trivialized by next-generation sequencing approaches, and environmental samples are routinely processed from collection to DNA sequence in a matter of days (Dinsdale et al., 2008). The bottleneck in metagenomics has become the analysis of the sequences. The computational comparison of sequences against all of the known proteins using blastx is limited by computational resources (Meyer et al., 2008; Wilkening et al.). Here, we describe a novel approach to analysing metagenomic sequences using unique signature k-mers that represent members of a protein family. Few additional computational resources are required when the size of the underlying database doubles because, in the best case, the search time depends on the length of the k-mer. We implemented an API and Web servers to support the annotation of metagenomes using k-mers.

2 METHODS

Fellowship for the Interpretation of Genomes (FIG) protein families, FIGfams, were constructed as described previously (Meyer et al., 2009). To identify the signature k-mers that represent members of a protein family, we identified all amino acid oligomers from 7 to 12 amino acids long that were (i) present in one or more members of the FIGfam and (ii) not present in any other FIGfam. These oligomers are unambiguous representatives of the family. A binary tree was built to allow rapid searching of the k-mers and to identify their cognate protein families.

DNA sequences from metagenomes are assigned functions based on the FIGfams that match. To search the k-mers, the DNA sequence is translated in all six frames, and an exact matching algorithm is used to find identical amino acid strings. The sensitivity of the match is adjusted by requiring either multiple independent k-mer matches from a single family or a minimum number of k-mer matches over a minimum sequence length. The search reports the first and last positions in the query sequence where the k-mers match, and the number of k-mers that match that region. These matches can be combined into subsystems (Overbeek et al., 2005). The last common ancestor of the organisms in each family is also identified.

To validate whether the k-mer approach could be used to annotate metagenomes, simulated metagenomes were made from 70 different microbial genomes representing a diverse selection of organisms that had been annotated using Rapid Annotation Using Subsystems Technology (RAST) but had not been included in the FIGfams (Supplementary Table 1). The metagenomes were constructed with Grinder (Angly et al., 2012) and were designed with median DNA fragment lengths of 30, 50, 75, 100, 250 and 500 bp. Metagenomes were annotated by searching for genes using the k-mers and by blastx searches of either the seed-nr database or the database of proteins used to generate the k-mer library. Not every protein in the seed-nr is included in a FIGfam (e.g. singleton proteins are not members of a family).
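The signature k-mer construction and six-frame search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SEED implementation: it uses a hash table (a Python dict) in place of the binary tree, it counts k-mer hits per family across all six frames rather than reporting match positions, and the family IDs, toy protein sequences and the `min_hits` parameter name are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def signature_kmers(families, k):
    """Map every k-mer that occurs in exactly one family to that family's ID."""
    owners = defaultdict(set)
    for fam_id, proteins in families.items():
        for protein in proteins:
            for i in range(len(protein) - k + 1):
                owners[protein[i:i + k]].add(fam_id)
    return {kmer: next(iter(fams)) for kmer, fams in owners.items() if len(fams) == 1}


def translate(dna):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Yield the six conceptual translations of a DNA read."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse):
        for offset in range(3):
            yield translate(strand[offset:])


def annotate(dna, index, k, min_hits=2):
    """Return {family: hits} for families with at least min_hits exact k-mer matches."""
    hits = defaultdict(int)
    for peptide in six_frames(dna):
        for i in range(len(peptide) - k + 1):
            fam = index.get(peptide[i:i + k])
            if fam is not None:
                hits[fam] += 1
    return {fam: n for fam, n in hits.items() if n >= min_hits}


# Toy data (hypothetical): the two families share the 7-mer "MKTAYIA",
# so that k-mer is not a signature of either family.
families = {"FIG_A": ["MKTAYIAKQR"], "FIG_B": ["GGMKTAYIAPP"]}
index = signature_kmers(families, k=7)
read = "AAAACTGCTTATATTGCTAAACAACGT"  # encodes KTAYIAKQR in frame 1
print(annotate(read, index, k=7))  # {'FIG_A': 3}
```

Raising `min_hits` trades sensitivity for confidence, which mirrors the adjustable match threshold described in the text.
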

3 RESULTS

The most recent build of the FIGfams (Release 59, constructed in July 2012) contains 11 856 938 proteins organized into 178 208 families. The FIGfams have a mean of 66.5 protein members and a median of 6 protein members. Both the FIGfams and k-mers are available for download from ftp://ftp.theseed.org/FIGfams/. The coverage of oligomers is shown in Table 1.

Table 1.

Statistics of the k-mers and FIGfams

k-mer size	Number of k-mers	Number of families	Mean number of k-mers per family	Median number of k-mers per family
7	207 362 319	171 606	1208.4	110
8	639 234 488	173 332	3687.9	325
9	812 679 565	173 513	4683.7	404
10	866 382 763	173 561	4991.8	425
11	896 943 566	173 587	5167.1	434
12	921 081 710	173 606	5305.6	441
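
The mean columns follow directly from the counts: each mean in Table 1 is the number of k-mers divided by the number of families, and the mean FIGfam size is the number of proteins divided by the number of families. The short check below recomputes them from the figures reported in the text; the variable and function names are our own.

```python
# (k, number of k-mers, number of families, reported mean) copied from Table 1
TABLE_1 = [
    (7, 207_362_319, 171_606, 1208.4),
    (8, 639_234_488, 173_332, 3687.9),
    (9, 812_679_565, 173_513, 4683.7),
    (10, 866_382_763, 173_561, 4991.8),
    (11, 896_943_566, 173_587, 5167.1),
    (12, 921_081_710, 173_606, 5305.6),
]


def mean_per_family(total, n_families):
    """Mean number of items per family, rounded to one decimal place."""
    return round(total / n_families, 1)


for k, n_kmers, n_families, reported in TABLE_1:
    print(k, mean_per_family(n_kmers, n_families), reported)

# Mean FIGfam size from the Release 59 totals quoted in the text.
print(mean_per_family(11_856_938, 178_208))  # 66.5
```
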

The sensitivity and specificity of the k-mer approach were measured using synthetic metagenomes constructed from genomes that were not included in the FIGfams build (Supplementary Fig. 1). The sensitivity [TP/(TP + FN)] measures whether genes that are there (i.e. in the complete genome annotation) are found; for very short DNA sequences, most genes that are there are missed. The k-mers are much more sensitive than blastx, finding almost one-fifth of the genes present when the fragment lengths are only 30 bp. When the fragment length exceeds 50 bp, blastx performance improves rapidly as the high-scoring pairs exceed the threshold for inclusion, and approaches that of the most sensitive k-mer searches. All approaches reach a sensitivity plateau once the sequence length exceeds 100 bp, and the best methods only find ∼70% of the genes on the fragments. For the longest fragment lengths (500 bp), the mean open reading frame length on each fragment was only 340 bp, as genes may start and end off the fragment. In all cases, longer sequence reads resulted in more sensitive assignment of annotations to the DNA sequences, as seen before (Wommack et al., 2008).

The precision [TP/(TP + FP)] reports whether too many genes are being called on a fragment. With very short fragments, both short k-mers and blastx overcall genes. However, longer k-mers result in more confident calls, regardless of fragment length. BLAST precision is improved by using only the set of confidently called proteins, i.e. those that are in families and were used to make the k-mers.

Accuracy measures whether the genes that are identified are correctly annotated (Overbeek et al., 2005). This measure of accuracy tests whether the function assigned by the best blastx hit or k-mer reflects the 'true' function of the protein as annotated in the genome.

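The two ratios defined above can be computed from a set of annotated ('true') genes and a set of called genes. The sketch below assumes that a call counts as a true positive when its identifier appears in the reference annotation; the exact matching rule used for the benchmark is not stated in the text, and the gene identifiers are hypothetical.

```python
def sensitivity(tp, fn):
    """TP / (TP + FN): fraction of annotated genes that were found."""
    return tp / (tp + fn) if tp + fn else 0.0


def precision(tp, fp):
    """TP / (TP + FP): fraction of called genes that are really present."""
    return tp / (tp + fp) if tp + fp else 0.0


def score_calls(true_genes, called_genes):
    """Return (sensitivity, precision) for one set of gene calls."""
    true_genes, called_genes = set(true_genes), set(called_genes)
    tp = len(true_genes & called_genes)   # annotated and called
    fn = len(true_genes - called_genes)   # annotated but missed
    fp = len(called_genes - true_genes)   # called but not annotated
    return sensitivity(tp, fn), precision(tp, fp)


# Hypothetical example: five annotated genes, two calls, one of them wrong.
truth = {"geneA", "geneB", "geneC", "geneD", "geneE"}
calls = {"geneA", "geneX"}
print(score_calls(truth, calls))  # (0.2, 0.5)
```
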
k-mer searches were on average 860 times faster than blastx because the k-mer approach neither extends the matches nor calculates alignment statistics for the resulting matches. DNA sequence annotations using k-mers are as sensitive, precise and accurate as blastx searches, especially for shorter reads. The main disadvantage of using unique k-mers is that, as the read length increases, the sensitivity of blastx exceeds that of the k-mers. As shown in Supplementary Figure 1, the length of the k-mer has a strong impact on the precision, sensitivity and accuracy of the search. Based on these data and empirical observations, we recommend that users require at least two k-mer matches per sequence (but generally no more than four k-mer matches), and use eight- or nine-amino acid k-mers. These parameters provide reasonable estimates of metagenome composition.

The k-mers represent a protein family, but not a specific organism from that family. However, most families only contain a few proteins (the median size of the protein families is only six), and thus most protein families only come from a handful of species and very few genera. To assign taxonomic groups to sequences, we use the taxonomy to identify the last common ancestor of the organisms whose proteins make up a family. This is an approximation that provides for a rapid assessment of the members of the community.

The SEED annotation servers (http://servers.theseed.org/) provide programmatic access to the k-mer annotation algorithm via an API. These servers support the assignment of functions to protein or DNA sequences. Detailed examples are provided at that page and at http://edwards.sdsu.edu/RTMg. The Web interface was built to provide rapid interpretation of a metagenome sample. The samples are analysed in groups of sequences (currently the default is 10 000 sequences at a time), in a round-robin fashion. Users see the results of their annotation as it is being performed.
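
One way to picture the batched, round-robin processing is the scheduler sketched below. It assumes that the round-robin runs across the samples currently being analysed, taking one batch from each in turn; the text does not spell this out, and the function names and toy sample data are hypothetical.

```python
from collections import deque
from itertools import islice


def batches(sequences, size=10_000):
    """Yield successive lists of at most `size` sequences."""
    it = iter(sequences)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def round_robin(samples, size=10_000):
    """Yield (sample_name, batch), taking one batch from each sample in turn."""
    queue = deque((name, batches(seqs, size)) for name, seqs in samples.items())
    while queue:
        name, pending = queue.popleft()
        batch = next(pending, None)
        if batch is None:
            continue  # this sample is finished; drop it from the rotation
        yield name, batch
        queue.append((name, pending))


# Toy run with a batch size of 2 so that the interleaving is visible.
samples = {"sample1": range(5), "sample2": range(3)}
schedule = list(round_robin(samples, size=2))
for name, batch in schedule:
    print(name, batch)
```

Because each batch is annotated and reported before the next one is taken, partial results for every sample become available early, which matches the behaviour described for the Web interface.
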
At any time, all of the raw data can be downloaded as raw text to import into any other analysis platform or package. In practice, we use this system to assess the quality of the metagenome and visualize similarities to the sample, leaving more detailed and thorough analysis until more time-consuming comparisons are complete. However, the results of the k-mer-based analysis are generally recapitulated in downstream analyses.
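
The last-common-ancestor step used for the taxonomic assignment can be sketched as a longest-common-prefix computation over root-to-leaf lineages. This is an illustration only: it assumes that every organism in a family comes with a lineage whose ranks are aligned position by position, and the example lineages are typed in by hand rather than retrieved from a taxonomy database.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages, or None."""
    shared = None
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break  # the lineages diverge at this rank
        shared = ranks[0]
    return shared


# Illustrative lineages for organisms contributing proteins to one family.
ecoli = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
         "Enterobacterales", "Enterobacteriaceae", "Escherichia", "Escherichia coli"]
salmonella = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
              "Enterobacterales", "Enterobacteriaceae", "Salmonella", "Salmonella enterica"]
vibrio = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
          "Vibrionales", "Vibrionaceae", "Vibrio", "Vibrio cholerae"]

print(last_common_ancestor([ecoli, salmonella]))          # Enterobacteriaceae
print(last_common_ancestor([ecoli, salmonella, vibrio]))  # Gammaproteobacteria
```

A family whose members all come from one species resolves to that species, whereas a widely distributed family resolves to a high-level taxon, which is why the assignment is only an approximation of community membership.
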

REFERENCES

1. Functional metagenomic profiling of nine biomes.

Authors: Elizabeth A Dinsdale; Robert A Edwards; Dana Hall; Florent Angly; Mya Breitbart; Jennifer M Brulc; Mike Furlan; Christelle Desnues; Matthew Haynes; Linlin Li; Lauren McDaniel; Mary Ann Moran; Karen E Nelson; Christina Nilsson; Robert Olson; John Paul; Beltran Rodriguez Brito; Yijun Ruan; Brandon K Swan; Rick Stevens; David L Valentine; Rebecca Vega Thurber; Linda Wegley; Bryan A White; Forest Rohwer
Journal: Nature Date: 2008-03-12 Impact factor: 49.962

2. Metagenomics: read length matters.

Authors: K Eric Wommack; Jaysheel Bhavsar; Jacques Ravel
Journal: Appl Environ Microbiol Date: 2008-01-11 Impact factor: 4.792

3. Grinder: a versatile amplicon and shotgun sequence simulator.

Authors: Florent E Angly; Dana Willner; Forest Rohwer; Philip Hugenholtz; Gene W Tyson
Journal: Nucleic Acids Res Date: 2012-03-19 Impact factor: 16.971

4. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes.

Authors: Ross Overbeek; Tadhg Begley; Ralph M Butler; Jomuna V Choudhuri; Han-Yu Chuang; Matthew Cohoon; Valérie de Crécy-Lagard; Naryttza Diaz; Terry Disz; Robert Edwards; Michael Fonstein; Ed D Frank; Svetlana Gerdes; Elizabeth M Glass; Alexander Goesmann; Andrew Hanson; Dirk Iwata-Reuyl; Roy Jensen; Neema Jamshidi; Lutz Krause; Michael Kubal; Niels Larsen; Burkhard Linke; Alice C McHardy; Folker Meyer; Heiko Neuweger; Gary Olsen; Robert Olson; Andrei Osterman; Vasiliy Portnoy; Gordon D Pusch; Dmitry A Rodionov; Christian Rückert; Jason Steiner; Rick Stevens; Ines Thiele; Olga Vassieva; Yuzhen Ye; Olga Zagnitko; Veronika Vonstein
Journal: Nucleic Acids Res Date: 2005-10-07 Impact factor: 16.971

5. The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes.

Authors: F Meyer; D Paarmann; M D'Souza; R Olson; E M Glass; M Kubal; T Paczian; A Rodriguez; R Stevens; A Wilke; J Wilkening; R A Edwards
Journal: BMC Bioinformatics Date: 2008-09-19 Impact factor: 3.169

6. FIGfams: yet another set of protein families.

Authors: Folker Meyer; Ross Overbeek; Alex Rodriguez
Journal: Nucleic Acids Res Date: 2009-09-17 Impact factor: 16.971
