Jordan A Fish, Benli Chai, Qiong Wang, Yanni Sun, C Titus Brown, James M Tiedje, James R Cole.
Abstract
Ribosomal RNA genes have become the standard molecular markers for microbial community analysis for good reasons, including universal occurrence in cellular organisms, availability of large databases, and ease of rRNA gene region amplification and analysis. As markers, however, rRNA genes have some significant limitations. The rRNA genes are often present in multiple copies, unlike most protein-coding genes. The slow rate of change in rRNA genes means that multiple species sometimes share identical 16S rRNA gene sequences, while many more species share identical sequences in the short 16S rRNA regions commonly analyzed. In addition, the genes involved in many important processes are not distributed in a phylogenetically coherent manner, potentially due to gene loss or horizontal gene transfer. While rRNA genes remain the most commonly used markers, key genes in ecologically important pathways, e.g., those involved in carbon and nitrogen cycling, can provide important insights into community composition and function not obtainable through rRNA analysis. However, working with ecofunctional gene data requires some tools beyond those required for rRNA analysis. To address this, our Functional Gene Pipeline and Repository (FunGene; http://fungene.cme.msu.edu/) offers databases of many common ecofunctional genes and proteins, as well as integrated tools that allow researchers to browse these collections and choose subsets for further analysis, build phylogenetic trees, test primers and probes for coverage, and download aligned sequences. Additional FunGene tools are specialized to process coding gene amplicon data. For example, FrameBot produces frameshift-corrected protein and DNA sequences from raw reads while finding the most closely related protein reference sequence. These tools can help provide better insight into microbial communities by directly studying key genes involved in important ecological processes.
Keywords: amplicon analysis; amplification primers; biogeochemical cycles; functional genes; microbial ecology; phylogeny
Year: 2013 PMID: 24101916 PMCID: PMC3787254 DOI: 10.3389/fmicb.2013.00291
Source DB: PubMed Journal: Front Microbiol ISSN: 1664-302X Impact factor: 5.640
List of Abbreviations.
| AAI | Average Amino Acid Identity |
| BIOM | Biological Observation Matrix |
| ENA | European Nucleotide Archive |
| FGP | RDP Functional Gene Pipeline |
| FGR | RDP Functional Gene Repository |
| HMM | Hidden Markov Model |
| HMP | Human Microbiome Project |
| indel | insertion or deletion |
| IUPAC | International Union of Pure and Applied Chemistry |
| JNI | Java Native Interface |
| MID | Multiplex Identifier |
| MSA | Multiple Sequence Alignment |
| NEON | National Ecological Observatory Network |
| NGS | Next Generation Sequencing |
| OTU | Operational Taxonomic Unit |
| RDP | Ribosomal Database Project |
| rQ | Read Quality |
| SFF | Standard Flowgram Format |
| UPGMA | Unweighted Pair Group Method with Arithmetic Mean |
| WGS | Whole-Genome Shotgun |
Names and sites for all first-party tools used in the FunGene Pipeline.
| Full pipeline scripts | |
| RDPTools | |
| Initial process | |
| Defined community analysis | |
| Dereplicator | See mcClust |
| FrameBot | |
| mcClust | |
| Rarefaction/Diversity measures | |
Names and sites for all third-party tools used in the FunGene Pipeline.
| USEARCH 6.0 (UCHIME) | |
| HMMER3 | |
Data columns on gene family pages.
| [+] | Click to view the protein sequence aligned to HMM and reference consensus sequence |
| Select | Check to mark the selection for analysis (selections are saved in the researcher's session and are not lost when navigating to a new page) |
| Score | The HMM alignment score in bits saved (the higher the score, the higher the probability this sequence is a member of this gene family) |
| New_Hit | Marker for sequence records new to the current release |
| Environmental | Marker for non-cultured, environmental samples |
| Prot_Accno | GenBank protein accession number (also links to the actual protein record in GenBank or FASTA format depending on the researcher's current display settings) |
| Nuc_Accno | GenBank nucleotide accession number (also links to the actual nucleotide record in GenBank or FASTA format depending on the researcher's current display settings) |
| Organism | Name of source organism as annotated in the GenBank record |
| Definition | Gene and product as annotated in the GenBank record |
| Reference | Publication from GenBank record, links to NCBI PubMed live records when available |
| Size | Protein length (number of amino acids, proxy for completeness of the sequence) |
| HMM_Coverage | How completely the sequence covers the model's length (measured by the percentage of HMM positions to which the sequence is aligned, can be used to filter out partial sequences and poor HMM matches) |
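The Score and HMM_Coverage columns above lend themselves to a simple filtering step. The following is a minimal sketch of one way such a coverage percentage could be computed, not FunGene's actual implementation. It assumes the aligned sequence is given in an A2M-like convention (uppercase residues and `-` occupy model positions, lowercase residues and `.` are insertions); the function names, the dictionary keys, and the default thresholds are hypothetical.

```python
def hmm_coverage(aligned: str) -> float:
    """Percent of model positions that are aligned to a residue.

    Assumes an A2M-like string: uppercase = residue in a model (match)
    column, '-' = deletion in a model column, lowercase and '.' = insert
    columns that do not count toward the model length.
    """
    matched = sum(1 for c in aligned if c.isupper())
    model_len = matched + aligned.count("-")
    if model_len == 0:
        return 0.0
    return 100.0 * matched / model_len


def filter_hits(hits, min_coverage=80.0, min_score=0.0):
    """Keep hits meeting both thresholds (threshold values are illustrative).

    Each hit is a dict with hypothetical keys 'score' (bits) and 'aligned'.
    """
    return [
        h for h in hits
        if h["score"] >= min_score and hmm_coverage(h["aligned"]) >= min_coverage
    ]


if __name__ == "__main__":
    hits = [
        {"id": "full", "score": 250.0, "aligned": "MKLVab.TR"},
        {"id": "partial", "score": 40.0, "aligned": "MK----..-"},
    ]
    for h in filter_hits(hits, min_coverage=80.0, min_score=50.0):
        print(h["id"], round(hmm_coverage(h["aligned"]), 1))
```

Filtering on coverage in addition to score reflects the column description: a partial sequence can score reasonably well over the region it covers while still being unsuitable for analyses that need the full gene region.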
Figure 1. Inputs, outputs, and the filters applied, in order, by Initial Process.
Number of amplicon reads that passed Initial Process assigned to each NIFH defined community organism.
| ACL19109.1 | 1 | 0 | 3784 | |
| ACL19859.1 | 2 | 2 | 4 | |
| ACL19409.1 | 4 | 6 | 0 | |
| ACL19588.1 | 1 | 3 | 0 | |
| BAB73411.1 | 1 | 1 | 405 | |
| BAB72831.1 | 1 | 2 | 2 | |
| YP_553849.1 | 0 | 0 | 1310 | |
These two sequences were poor matches at the amino acid level. Further testing found that both were chimeric sequences between Nostoc sp. 7120 and B. xenovorans LB400.
Figure 2. Bases in gray were not used as the primer sequence inputs in the Initial Process in order to capture the same gene region from each read.
Number of amplicon reads that passed Initial Process assigned to each BUK defined community organism.
| NP_811465.1 | 3 | 3 | 0 | 0 | |
| YP_079736.1 | 0 | 0 | 749 | 363 | |
| NP_349675.1 | 0 | 0 | 68 | 229 | |
| NP_348286.1 | 0 | 1 | 38 | 22 | |
| YP_001086582.1 | 0 | 1 | 1183 | 82 | |
| YP_697036.1 | 0 | 0 | 244 | 1110 | |
Summary information for the three defined community samples.
| NIFH | 5509 | 321 | 0.13 | 76.3 | 14.3 | 25 | 1 |
| BUK1 | 2334 | 420 | 0.41 | 28.1 | 54.6 | 2 | 0 |
| BUK2 | 2206 | 421 | 1.2 | 3.0 | 85.9 | 5 | 11 |
Figure 3. Indels and substitutions varying by differences to reverse primer. (A) Fraction of reads with the given number of indels for each reverse primer difference. (B) Fraction of reads with the given numbers of substitutions for each reverse primer difference. (C) Fraction of total reads with exactly the specified number of differences to the reverse primer. Eight BUK1, 57 BUK2, and no NIFH reads had two primer differences (NIFH and BUK1 with two differences were not shown).
Figure 4. (A) Expected error rate and observed error rate per base (nucleotide insertions, deletions, and substitutions) from the three defined community data sets. Only data points with more than 5 reads are shown. The expected error rate was calculated based on the formula in the Materials and Methods. (B–D) Percentage of sequences matching each defined community organism by read Q score for Samples NIFH, BUK1, and BUK2, respectively. Symbol “×” represents the relative abundance of reads at each read Q score, and solid lines represent the percent of sequences above the read Q score cutoff from each sample.
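The Materials and Methods formula for the expected error rate is not reproduced in this section. As a rough illustration only, the sketch below assumes the standard Phred relation between a base quality score Q and its error probability, p = 10^(-Q/10), averages that probability over a read, and converts the mean back to a read-level Q score. The function names and the cutoff helper are hypothetical, and the paper's exact formula may differ.

```python
import math


def error_prob(q: float) -> float:
    """Error probability for one base under the standard Phred definition."""
    return 10 ** (-q / 10)


def expected_error_rate(quals) -> float:
    """Mean per-base error probability of a read (assumed definition)."""
    if not quals:
        raise ValueError("read has no quality scores")
    return sum(error_prob(q) for q in quals) / len(quals)


def read_q(quals) -> float:
    """Read-level Q score derived from the mean error probability."""
    return -10 * math.log10(expected_error_rate(quals))


def percent_above_cutoff(reads, cutoff: float) -> float:
    """Percent of reads whose read-level Q is at or above the cutoff."""
    if not reads:
        return 0.0
    passing = sum(1 for quals in reads if read_q(quals) >= cutoff)
    return 100.0 * passing / len(reads)


if __name__ == "__main__":
    reads = [[30, 30, 30], [20, 20, 20], [40, 10, 40]]
    for quals in reads:
        print(quals, expected_error_rate(quals), read_q(quals))
    print(percent_above_cutoff(reads, cutoff=20))
```

Averaging error probabilities rather than raw Q scores means that a single low-quality base lowers the read-level Q substantially, which is why the third example read falls below a cutoff of 20 despite two high-quality bases.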