David M Tanenbaum, Johannes Goll, Sean Murphy, Prateek Kumar, Nikhat Zafar, Mathangi Thiagarajan, Ramana Madupu, Tanja Davidsen, Leonid Kagan, Saul Kravitz, Douglas B Rusch, Shibu Yooseph.
Abstract
The JCVI metagenomics analysis pipeline provides for the efficient and consistent annotation of shotgun metagenomics sequencing data for sampling communities of prokaryotic organisms. The process can be equally applied to individual sequence reads from traditional Sanger capillary electrophoresis sequences, newer technologies such as 454 pyrosequencing, or sequence assemblies derived from one or more of these data types. It includes the analysis of both coding and non-coding genes, whether full-length or, as is often the case for shotgun metagenomics, fragmentary. The system is designed to provide the best-supported conservative functional annotation based on a combination of trusted homology-based scientific evidence and computational assertions and an annotation value hierarchy established through extensive manual curation. The functional annotation attributes assigned by this system include gene name, gene symbol, GO terms, EC numbers, and JCVI functional role categories.
Keywords: Global Ocean Sampling; J. Craig Venter Institute; Sargasso Sea; environmental sequencing; functional annotation; prokaryotic shotgun metagenomics
Year: 2010; PMID: 21304707; PMCID: PMC3035284; DOI: 10.4056/sigs.651139
Source DB: PubMed; Journal: Stand Genomic Sci; ISSN: 1944-3277
Figure 1. Metagenomics annotation process diagram. This overview covers both the structural (yellow through blue) and functional (blue through green) components of the JCVI prokaryotic metagenomics processing pipeline. Attributes assigned include common name, gene symbol, EC number, GO term, JCVI role category, along with transmembrane character and lipoprotein motifs, as applicable.
Third party tools, cutoffs, and parameters used in this pipeline
| Step | Tool | Command line and parameters |
|---|---|---|
| Structural Annotation | | |
| tRNA identification | tRNAscan-SE (1.23) | tRNAscan-SE -q -b -G |
| ncRNA finder stage 1 | BLAST | blastall -p blastn -i -d -e 0.1 -F "T" |
| ncRNA finder stage 2 | BLAST | blastall -p blastn -i -d -e 1e-4 -F "m L" |
| Protein Identification | MetaGeneAnnotator | -m |
| Functional Annotation | | |
| Protein annotation | BLAST | blastall -p blastp -v 10 -b 10 -X 15 -e 1e-5 -M BLOSUM62 – |
| Protein annotation | hmmpfam | |
| Protein annotation | lipoprotein_motif | --is_micoplasm 0 |
| Protein annotation | tmhmm | |
| EC assignment | PRIAM | rpsblast -i -d -m 8 -e 1e-10 |
BLASTP evidence classes.
| Evidence class | Criteria |
|---|---|
| High Confidence | 35% identity or greater, across 80% or more of the length |
| Putative | less than 35% identity, but across 80% or more of the length |
| Conserved Domain | 35% identity or greater, but across less than 80% of the length |
| Low Confidence | less than 35% identity across less than 80% of the length |
These have been established through extensive manual curation and validation efforts, and are ordered by trustworthiness.
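The four evidence classes amount to a two-way split on identity and alignment length. The following is a minimal sketch of that decision, assuming cutoffs of 35% identity and 80% of the length. The function name `classify_blastp_hit` and its parameters are hypothetical, and the source does not say how "length" is measured (query, subject, or alignment), so the caller is assumed to supply a precomputed percentage.

```python
def classify_blastp_hit(percent_identity: float, percent_length: float) -> str:
    """Map a BLASTP hit to one of the four evidence classes.

    Both arguments are percentages in the range 0-100. How percent_length
    is derived is an assumption left to the caller.
    """
    high_identity = percent_identity >= 35.0   # "35% identity or greater"
    long_match = percent_length >= 80.0        # "80% or more of the length"

    if high_identity and long_match:
        return "High Confidence"
    if long_match:
        return "Putative"            # long match, low identity
    if high_identity:
        return "Conserved Domain"    # high identity, short match
    return "Low Confidence"


if __name__ == "__main__":
    for identity, length in [(52.0, 95.0), (28.0, 91.0), (60.0, 40.0), (22.0, 30.0)]:
        print(identity, length, classify_blastp_hit(identity, length))
```

Testing the two conditions independently keeps the boundaries explicit: a hit at exactly 35% identity and exactly 80% length falls into the highest class.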
HMM hit isotype classes.
| Isotype | Definition |
|---|---|
| Equivalog | All proteins scoring above the trusted cutoff have the exact same function |
| Equivalog Domain | All domains scoring above the trusted cutoff have the same function; can be part of a multi-function protein |
| Domain | Defines a region of homology that may or may not have a known function, and need not be the full length of the polypeptide |
| Subfamily | Hits in this category represent full-length homology within a subgroup contained within a gene family |
| Superfamily | This defines a set of proteins with full-length homology of sequence and domain architecture, but not necessarily the same function |
| Hypothetical - isotype | Unknown function |
| Uncharacterized | PFAM model cannot be assigned |
JCVI classifies HMMs into more than a dozen categories (isology types, or “isotypes”), each of which represents a different degree of confidence about the functional classification.
Metagenomics annotation hierarchy.
| Rank | Evidence type | Evidence class |
|---|---|---|
| 1 | HMM | TIGRfam Equivalog |
| 2 | HMM | Pfam Equivalog |
| 3 | HMM | TIGRfam Hypothetical Equivalog |
| 4 | HMM | Pfam Hypothetical Equivalog |
| 5 | HMM | TIGRfam Domain |
| 6 | PRIAM | PRIAM |
| 7 | HMM | TIGRfam Subfamily |
| 8 | HMM | TIGRfam Superfamily |
| 9 | HMM | TIGRfam Equivalog Domain |
| 10 | HMM | TIGRfam Hypothetical Equivalog Domain |
| 11 | HMM | TIGRfam Subfamily Domain |
| 12 | HMM | Pfam Subfamily |
| 13 | HMM | Pfam Superfamily |
| 14 | HMM | Pfam Equivalog Domain |
| 15 | HMM | Pfam Hypothetical Equivalog Domain |
| 16 | HMM | Pfam Subfamily Domain |
| 17 | BLAST | Panda BLASTP High Confidence |
| 18 | HMM | TIGRfam Domain |
| 19 | HMM | Pfam Domain |
| 20 | HMM | Pfam Uncharacterized |
| 21 | BLAST | Panda BLASTP Putative |
| 22 | BLAST | Panda BLASTP Conserved Domain |
| 23 | TMHMM | TMHMM |
| 24 | LIPOPROTEIN | Lipoprotein Motif |
| 25 | DEFAULT | Hypothetical |
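One way to read the hierarchy is as a ranked lookup: among all evidence collected for a peptide, the entry with the lowest rank wins, and rank 25 is the fallback when nothing else applies. The sketch below illustrates that reading with a subset of the ranks from the table. The `Evidence` type, the `best_evidence` function, and the default common name "hypothetical protein" are assumptions made for illustration and are not taken from the pipeline. The real system may also resolve each attribute (name, gene symbol, GO, EC) separately.

```python
from typing import Iterable, NamedTuple


class Evidence(NamedTuple):
    evidence_class: str  # value from the third column of the hierarchy table
    accession: str       # identifier of the supporting HMM or hit
    common_name: str     # name proposed by this evidence


# Subset of ranks copied from the table; a lower rank is more trusted.
RANK = {
    "TIGRfam Equivalog": 1,
    "Pfam Equivalog": 2,
    "PRIAM": 6,
    "Panda BLASTP High Confidence": 17,
    "Pfam Domain": 19,
    "Panda BLASTP Putative": 21,
    "Panda BLASTP Conserved Domain": 22,
    "TMHMM": 23,
    "Lipoprotein Motif": 24,
}

# Rank 25 fallback; the common name here is an assumed placeholder.
DEFAULT = Evidence("Hypothetical", "", "hypothetical protein")


def best_evidence(hits: Iterable[Evidence]) -> Evidence:
    """Return the most trusted evidence, or the default if none is ranked."""
    ranked = [hit for hit in hits if hit.evidence_class in RANK]
    if not ranked:
        return DEFAULT
    return min(ranked, key=lambda hit: RANK[hit.evidence_class])


if __name__ == "__main__":
    hits = [
        Evidence("Panda BLASTP High Confidence", "hit_A", "glutamine synthetase"),
        Evidence("Pfam Equivalog", "PF00120", "glutamine synthetase, catalytic domain"),
    ]
    print(best_evidence(hits))
    print(best_evidence([]))
```

Keeping the ranks in a plain dictionary makes the ordering easy to audit against the table and to change without touching the selection logic.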
Output format of the JCVI prokaryotic metagenomics functional annotation multi-FASTA file header, with example entries.
| Field | Example value |
|---|---|
| User Id | GCA1659448.b1 |
| Peptide Id | JCVI_PEP_5160785.1 |
| Common Name Section Starts | common name |
| Common Name | glutamine synthetase, catalytic domain |
| Common Name Evidence | PF00120 |
| Gene Symbol Section Starts | gene symbol |
| Gene Symbol | glnT |
| Gene Symbol Evidence | RF\|YP_266724.1\|71084004\|NC_007205 |
| GO Section Starts | GO |
| GO Terms | GO:0004356 // GO:0006542 |
| GO Term Evidence | PF00120 // PF00120 |
| EC Section Starts | EC |
| EC Id | 6.3.1.2 |
| EC Evidence | PF00120 |
| TIGR Role Section Starts | TIGR role |
| TIGR Role Id | 73 |
| TIGR Role Evidence | PF00120 |
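In the example entries, multi-valued fields such as the GO terms and their evidence are separated by `//`. The sketch below pairs each value with its evidence, assuming the two lists are positionally aligned, which the example suggests but the table does not state. The function name `pair_values_with_evidence` is hypothetical, and the delimiter between header fields is not shown in the table, so this works only on field values that have already been extracted.

```python
from typing import List, Tuple


def pair_values_with_evidence(
    values: str, evidence: str, sep: str = "//"
) -> List[Tuple[str, str]]:
    """Split two '//'-separated fields and pair them by position.

    Raises ValueError when the counts differ, because positional
    pairing (an assumption) would then be ambiguous.
    """
    value_items = [item.strip() for item in values.split(sep) if item.strip()]
    evidence_items = [item.strip() for item in evidence.split(sep) if item.strip()]
    if len(value_items) != len(evidence_items):
        raise ValueError(
            f"{len(value_items)} values but {len(evidence_items)} evidence entries"
        )
    return list(zip(value_items, evidence_items))


if __name__ == "__main__":
    pairs = pair_values_with_evidence(
        "GO:0004356 // GO:0006542", "PF00120 // PF00120"
    )
    for term, source in pairs:
        print(term, "supported by", source)
```

Raising on a count mismatch rather than truncating avoids silently attaching evidence to the wrong term.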