Thomas Junier, Vincent Hervé, Tina Wunderlin, Pilar Junier.
Abstract
We present a software package for classifying protein or nucleotide sequences to user-specified sets of reference sequences. The software trains a model using a multiple sequence alignment and a phylogenetic tree, both supplied by the user. The tree is used both to guide model construction and as a decision tree to speed up the classification process. The software was evaluated on all the 16S rRNA gene sequences of the reference dataset found in the GreenGenes database. On this dataset, the software achieved an error rate of around 1% at genus level. Examples of applications based on the nitrogenase subunit NifH gene and on a protein-coding gene found in endospore-forming Firmicutes are also presented. The programs in the package have a simple, straightforward command-line interface for the Unix shell, and are free and open-source. The package has minimal dependencies and thus can be easily integrated into command-line based classification pipelines.
Year: 2015 PMID: 26148002 PMCID: PMC4492669 DOI: 10.1371/journal.pone.0129384
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
The MLgsc programs and their functions, inputs, and outputs.
| Program | Function | Inputs | Outputs |
|---|---|---|---|
| mlgsc_xval | cross-validation | multiple alignment (a), phylogenetic tree (b) | predictions (c) |
| mlgsc_train (d) | training classifier | multiple alignment (a), phylogenetic tree (b) | classifier (e) |
| mlgsc | classification | query sequences (f), classifier (g) | predictions (h) |
(a) A multiple alignment provided by the user of known sequences from the gene or protein of interest, in Fasta format.
(b) A phylogeny of the reference taxa provided by the user in Newick format.
(c) A text output, containing (among others) the actual and predicted taxon names, thus enabling the detection of classification errors.
(d) This step is conditioned on a successful cross-validation (mlgsc_xval).
(e) This is a binary file, which saves space as well as reading and parsing time.
(f) A Fasta file containing the sequences to be classified, usually unaligned.
(g) The output of mlgsc_train.
(h) A text file that contains the predicted assignment to reference taxa for each query sequence (among other information).
Fig 1. Training the Model.
A tree of position-specific weight matrices (bottom) constructed from a phylogeny (top left) and a multiple alignment (top right). Each taxon in the tree is represented by at least one (preferably more) sequence(s) in the alignment. Column i in a taxon matrix contains the relative frequencies of residues at position i, computed over all sequences of that taxon. Matrices at inner nodes are averages of the matrices of the node’s children. The tree need not be bifurcating, but a fully bifurcating tree offers the best speed performance (see Discussion).
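The construction described in this caption can be illustrated with a minimal Python sketch. It is not the MLgsc implementation: the function names, the toy alignment, and the residue alphabet are assumptions made for illustration, and any smoothing or scaling the real software may apply to the frequencies is left out.

```python
from collections import Counter

def taxon_matrix(seqs, alphabet="ACGT-"):
    """Leaf matrix: column i holds the relative frequency of each residue
    at position i, computed over all aligned sequences of one taxon."""
    matrix = []
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        matrix.append({r: counts[r] / len(seqs) for r in alphabet})
    return matrix

def average_matrices(children):
    """Inner-node matrix: position-wise average of the children's matrices."""
    n = len(children)
    return [
        {r: sum(child[i][r] for child in children) / n for r in children[0][i]}
        for i in range(len(children[0]))
    ]

# Toy alignment (hypothetical data): two sequences for taxon1, one for taxon2.
alignment = {
    "taxon1": ["ACGT", "ACGA"],
    "taxon2": ["ACTT"],
}
leaves = {name: taxon_matrix(seqs) for name, seqs in alignment.items()}
inner = average_matrices(list(leaves.values()))

print(leaves["taxon1"][3])  # last column of taxon1: A and T each at 0.5
print(inner[2])             # third column of the parent: G and T each at 0.5
```

Because the inner matrix is a plain average of its children, a taxon with a single sequence weighs as much as one with many, which is one reason the caption recommends more than one sequence per taxon.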
Fig 2. Classifying.
The aligned query sequence (a) is first scored by the matrices at the root’s direct children nodes, in this example the matrix for taxa 1 and 2 (b) and the one for taxa 3 and 4 (c). Matrix (b) is found to yield the better score (solid arrow). Therefore, the query is now scored against matrix (b)’s children, namely taxa 1 and 2. The former is found to yield the better score, and OTU 1 is reported as the most likely (d). The shaded parts of the tree (matrices for taxa 3 and 4) are never tested. In a balanced, fully bifurcating tree of n nodes, only $2\log_2(n)$ matrices are tested.
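The descent shown in this figure can be sketched in Python as follows. This is an illustration only, under stated assumptions: the node layout (plain dictionaries), the helper names, the log-frequency score, and the `floor` value used for unseen residues are all hypothetical and are not taken from MLgsc.

```python
import math

def leaf(name, seq):
    """Leaf node whose matrix is built from a single reference sequence."""
    return {"name": name, "matrix": [{res: 1.0} for res in seq], "children": []}

def inner(name, children):
    """Inner node whose matrix is the average of its children's matrices."""
    matrix = []
    for i in range(len(children[0]["matrix"])):
        residues = set().union(*(c["matrix"][i] for c in children))
        matrix.append({r: sum(c["matrix"][i].get(r, 0.0) for c in children)
                          / len(children) for r in residues})
    return {"name": name, "matrix": matrix, "children": children}

def score(query, matrix, floor=1e-3):
    """Sum of log frequencies; `floor` (an assumption) avoids log(0)."""
    return sum(math.log(max(col.get(res, 0.0), floor))
               for res, col in zip(query, matrix))

def classify(query, node):
    """Descend from the root, always following the best-scoring child."""
    tested = 0
    while node["children"]:
        tested += len(node["children"])
        node = max(node["children"], key=lambda c: score(query, c["matrix"]))
    return node["name"], tested

# Toy tree with the same shape as the figure: ((taxon1, taxon2), (taxon3, taxon4)).
b = inner("b", [leaf("taxon1", "AAAA"), leaf("taxon2", "AACC")])
c = inner("c", [leaf("taxon3", "GGTT"), leaf("taxon4", "GGGG")])
root = inner("root", [b, c])

print(classify("AAAA", root))  # ('taxon1', 4)
```

In this toy tree with four leaf taxa, four matrices are scored (two per level), and the matrices for taxon3 and taxon4 are never touched.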
Comparison of speed and accuracy of MLgsc versus classifying methods implemented in Mothur for the 16S rRNA gene.
| Method | Run Time [s] | Classified | % Classified | Wrong | Error rate [%] |
|---|---|---|---|---|---|
| Mothur (RDP) | 5,590 | 11,935 | 99.84 | 382 | 3.2 |
| Mothur (KNN, k-mer) | 396 | 8,165 | 68.3 | 0 | 0 |
| Mothur (KNN, BLASTN) | 17,162 | 8,432 | 70.54 | 4 | 0.047 |
| Mothur (KNN, suffix tree) | 3,543 | 7,995 | 66.88 | 0 | 0 |
| MLgsc (no ER cutoff) | 35 | 11,954 | 100 | 297 | 2.5 |
| MLgsc (ER cutoff 10) | 35 | 11,247 | 94.09 | 65 | 0.58 |
| MLgsc (ER cutoff 20) | 35 | 10,041 | 84 | 15 | 0.15 |
(a) For the Mothur methods, a query was considered not classifiable if the corresponding output line did not indicate a genus (g__ prefix). For MLgsc, a query was considered not classifiable if any node in the corresponding output line had an evidence ratio (ER) below the cutoff. The use of the evidence ratio to detect sequences that are not confidently classified is described in section Procedure, subsection Validation.
(b) Mothur command: classify.seqs(fasta = aln, template = gg_13_8_99.fasta, taxonomy = gg_13_8_99.gg.tax, iters = 1000, method = wang, ksize = 8, processors = 1)
(c) Mothur command: classify.seqs(fasta = aln1, template = gg_13_8_99.fasta, taxonomy = gg_13_8_99.gg.tax, iters = 1000, method = knn, numwanted = 10, search = kmer, ksize = 8, processors = 1)
(d) Mothur command: classify.seqs(fasta = aln2, template = gg_13_8_99.fasta, taxonomy = gg_13_8_99.gg.tax, iters = 1000, method = knn, numwanted = 10, search = blast, processors = 1)
(e) Mothur command: classify.seqs(fasta = aln3, template = gg_13_8_99.fasta, taxonomy = gg_13_8_99.gg.tax, iters = 1000, method = knn, numwanted = 10, search = suffix, processors = 16)
(f) MLgsc command: mlgsc -A 16S.fasta 16S_classifier.bcls
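The two percentage columns of the table above follow from the counts. The short Python sketch below recomputes them for a few rows. It assumes that the total number of queries is 11,954 (the count classified by MLgsc without a cutoff, reported as 100%) and that the error rate is taken over classified queries only; the function name is hypothetical.

```python
def rates(total_queries, classified, wrong):
    """Return (% classified, error rate in %), both rounded to two decimals.
    The error rate is computed over classified queries only (an assumption
    that reproduces the table's figures)."""
    pct_classified = 100.0 * classified / total_queries
    error_rate = 100.0 * wrong / classified if classified else 0.0
    return round(pct_classified, 2), round(error_rate, 2)

TOTAL = 11954  # assumed total, read off the "no ER cutoff" row

print(rates(TOTAL, 11935, 382))  # Mothur (RDP):         (99.84, 3.2)
print(rates(TOTAL, 11247, 65))   # MLgsc, ER cutoff 10:  (94.09, 0.58)
print(rates(TOTAL, 10041, 15))   # MLgsc, ER cutoff 20:  (84.0, 0.15)
```

Raising the ER cutoff trades coverage for accuracy: fewer queries are classified, but a smaller share of those are wrong.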
Comparison of speed and accuracy of MLgsc versus FrameBot for the nitrogenase gene nifH.
| Method | Run Time [s] | Classified | % Classified | Wrong | Error rate [%] |
|---|---|---|---|---|---|
| FrameBot | 100.87 | 422 | 100 | 103 | 24.41 |
| MLgsc (no ER cutoff) | 4.9 | 383 | 100 | 52 | 13.58 |
| MLgsc (ER cutoff 10) | 4.9 | 318 | 83.03 | 8 | 2.52 |
(a) For MLgsc, a query was considered not classifiable if any node in the corresponding output line had an evidence ratio (ER) below the cutoff. The use of the evidence ratio to detect sequences that are not confidently classified is described in section Procedure, subsection Validation.
(b) A FrameBot classification was evaluated by examining the lines starting with STATS in the *framebot.txt output file, which contain the predicted genus name and the test sequence's ID. For example, in
STATS 454423|B|1|1594_2487_L23514 Nostoc_commune_UTEX_584_nitrogen_fixation_protein_nifUAAA21838
the predicted genus is Nostoc and the query ID is AAA21838. Using the ID to look up the genus in the reference file, we consider the prediction correct if the two genera match.
(c) FrameBot command: java -jar dist/FrameBot.jar framebot -N -o test_nifH refset/nifh_prot_ref.fasta ENA_nifH_full_cleanHdr.dna
(d) MLgsc command: mlgsc ENA_nifH_full_cleanHdr_orf800_bestORF.pep NifH_ref_clean_train.bcls