Leelavati Narlikar, Nidhi Mehta, Sanjeev Galande, Mihir Arjunwadkar.
Abstract
Their structural simplicity and ability to capture serial correlations make Markov models a popular modeling choice in several genomic analyses, such as the identification of motifs, genes and regulatory elements. A critical, yet relatively unexplored, issue is the determination of the order of the Markov model. Most biological applications use a predetermined order for all data sets indiscriminately. Here, we show the vast variation in the performance of such applications with the order. To identify the 'optimal' order, we investigated two model selection criteria: the Akaike information criterion (AIC) and the Bayesian information criterion (BIC). The BIC optimal order delivers the best performance for mammalian phylogeny reconstruction and motif discovery. Importantly, this order is different from the orders typically used by many tools, suggesting that a simple additional step determining this order can significantly improve results. Further, we describe a novel classification approach based on BIC optimal Markov models to predict the functionality of tissue-specific promoters. Our classifier discriminates between promoters active across 12 different tissues with remarkable accuracy, yielding 3 times the precision expected by chance. Application to the metagenomics problem of identifying the taxon from a short DNA fragment yields accuracies at least as high as the more complex mainstream methodologies, while retaining conceptual and computational simplicity.
Year: 2012 PMID: 23267010 PMCID: PMC3562003 DOI: 10.1093/nar/gks1285
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. The AIC- and BIC-based order selection procedure illustrated. The y-axis represents the (A) AIC or (B) BIC score minus its minimum value for a given genome. The order that minimizes a given score is considered optimal; this is the AIC- or the BIC-predicted optimal order. Optimal orders for these and other complete genomes are listed in Table 1.
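The selection procedure in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the textbook definitions AIC = 2p - 2 ln L and BIC = p ln n - 2 ln L, takes p = 3 x 4^k free parameters for an order-k model over the DNA alphabet, and scores every order on the same positions (from `max_order` onward) so that likelihoods are comparable. The function names and the choice of n are assumptions made here for illustration.

```python
import math
from collections import Counter


def markov_log_likelihood(seq, k, start):
    """Maximized log-likelihood of seq[start:] under an order-k Markov model.

    Transition probabilities are the maximum-likelihood estimates
    count(context, symbol) / count(context), taken from the same positions.
    """
    pair_counts = Counter()     # (context, next symbol) -> count
    context_counts = Counter()  # context -> count
    for i in range(start, len(seq)):
        context = seq[i - k:i]  # empty string when k == 0
        pair_counts[(context, seq[i])] += 1
        context_counts[context] += 1
    return sum(n * math.log(n / context_counts[ctx])
               for (ctx, _), n in pair_counts.items())


def select_order(seq, max_order, alphabet_size=4):
    """Return (AIC-optimal order, BIC-optimal order, {order: (aic, bic)})."""
    n_obs = len(seq) - max_order  # assumption: same scored positions for all orders
    scores = {}
    for k in range(max_order + 1):
        log_l = markov_log_likelihood(seq, k, max_order)
        n_params = (alphabet_size - 1) * alphabet_size ** k
        aic = 2 * n_params - 2 * log_l
        bic = n_params * math.log(n_obs) - 2 * log_l
        scores[k] = (aic, bic)
    apo = min(scores, key=lambda k: scores[k][0])
    bpo = min(scores, key=lambda k: scores[k][1])
    return apo, bpo, scores


if __name__ == "__main__":
    # Toy input: a perfectly periodic sequence is fully explained at order 1.
    apo, bpo, _ = select_order("ACGT" * 50, max_order=3)
    print(apo, bpo)
```

Because the BIC penalty grows with ln n while the AIC penalty is fixed at 2 per parameter, BIC tends to favor lower orders once n exceeds about 7 observations, which is consistent with the BPO order never exceeding the APO order in Table 1.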
Table 1. AIC-predicted optimal (APO) and BIC-predicted optimal (BPO) orders for a number of genomes, arranged by increasing genome length
| Genome (UCSC Version) | Length (Mb) | APO order | BPO order |
|---|---|---|---|
| Yeast (sacCer2) | 12.2 | 7 | 4 |
| Nematode (ce6) | 100.3 | 10 | 7 |
| Fruitfly (dm3) | 120.3 | 9 | 7 |
| Fugu (fr2) | 351.2 | 11 | 8 |
| Stickleback (gasAcu1) | 446.6 | 11 | 8 |
| Chicken (galGal3) | 984.9 | 11 | 8 |
| Zebra Finch (taeGut1) | 1112.7 | 11 | 8 |
| Xenopus (xenTro2) | 1359.4 | 12 | 10 |
| Zebrafish (danRer5) | 1523.3 | 13 | 10 |
| Cat (felCat3) | 1642.7 | 12 | 10 |
| Lizard (anoCar1) | 1741.5 | 12 | 10 |
| Platypus (ornAna1) | 1842.2 | 12 | 10 |
| Dog (canFam2) | 2385.0 | 12 | 10 |
| Horse (equCab2) | 2428.8 | 12 | 10 |
| Rat (rn4) | 2533.3 | 12 | 10 |
| Mouse (mm9) | 2558.5 | 12 | 10 |
| Macaque (rheMac2) | 2646.7 | 12 | 10 |
| Guinea pig (cavPor3) | 2663.4 | 13 | 10 |
| Cow (bosTau4) | 2731.8 | 12 | 10 |
| Orangutan (ponAbe2) | 2788.0 | 12 | 10 |
| Chimp (panTro2) | 2802.8 | 12 | 10 |
| Human (hg18) | 2858.0 | 12 | 10 |
| Marmoset (calJac1) | 2929.1 | 12 | 10 |
| Opossum (monDom5) | 3501.7 | 13 | 11 |
Genome source: UCSC Genome Browser (http://genome.ucsc.edu/). Maximum order considered for selection: 14; only one strand was used. Optimal order (AIC or BIC) generally increases with the length and the complexity of a genome.
Figure 2. Class-normalized sensitivity at each taxonomic level for 100 bp fragments tested across classifiers built from 13 different Markov orders. Thirteen models were built during each of the 5 folds of cross-validation. The held-out set of organisms was tested with each model (‘Materials and Methods’ section) to identify their taxa, by taking 10 random fragments of length 100 bp from each genome. The best average sensitivity for all five taxonomic levels is achieved at orders between 8 and 10. The BIC-predicted optimal order is 9.
Class-normalized accuracy of three taxa predictors in percentages
| Rank | Markov (100 bp) | Markov (1000 bp) | Phylopythia (1000 bp) | Phymm (100 bp) |
|---|---|---|---|---|
| Domain | 73.0 (2) | 85.2 (2) | 57.7 (3) | N/A |
| Phylum | 38.1 (14) | 56.8 (14) | 40.6 (14) | 36.7 (14) |
| Class | 35.6 (21) | 60.9 (21) | 30.7 (22) | 37.4 (21) |
| Order | 31.3 (39) | 60.2 (39) | 6.4 (29) | 32.8 (34) |
| Genus | 48.1 (27) | 75.0 (27) | 4.4 (31) | 25.0 (53) |
The numbers in parentheses indicate the number of clades considered by the program. The highest accuracy (class-normalized sensitivity; Supplementary Methods) achieved for fragment lengths of 100 and 1000 bp using Markov models is shown in the first two columns. Class-normalized sensitivity as published by Phylopythia for fragments of length 1000 bp and values computed from Phymm (21; Supplementary Tables , therein) for 100 bp are shown in the adjacent columns.
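The exact definition of class-normalized sensitivity is deferred to the Supplementary Methods and is not reproduced here. The sketch below assumes the common reading of the term: sensitivity is computed separately for each true clade and then averaged with equal weight per clade, so that clades with many test fragments do not dominate. The function name and this interpretation are assumptions, not the paper's stated formula.

```python
from collections import defaultdict


def class_normalized_sensitivity(true_labels, predicted_labels):
    """Unweighted mean of per-class sensitivities (assumed definition)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one labeled example is required")
    total = defaultdict(int)    # true class -> number of examples
    correct = defaultdict(int)  # true class -> number predicted correctly
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


if __name__ == "__main__":
    truth = ["Bacteria", "Bacteria", "Bacteria", "Archaea"]
    guess = ["Bacteria", "Bacteria", "Bacteria", "Bacteria"]
    # Plain accuracy would be 3/4; the class-normalized value is lower.
    print(class_normalized_sensitivity(truth, guess))
```

Under this reading, a predictor that always outputs the largest clade scores only 1 divided by the number of clades, which is why the clade counts in parentheses matter when comparing columns.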
Figure 3. Number of motifs identified correctly by Priority for yeast and human promoter data sets. Priority was run on each promoter set 20 times. A barplot of the number of times the returned motifs matched the literature consensus motif is shown here for each order. A match is declared when the Euclidean distance between a found motif and the literature consensus motif is less than a predetermined threshold; we used 0.24 and 0.18 as the thresholds. The highest number of matches occurs at (A) order 5 or 6 for the 156 yeast promoter sets (BIC-predicted optimal order = 5) and (B) order 7 for the 19 human promoter sets (BIC-predicted optimal order = 7).
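The matching rule in the caption can be illustrated with a short sketch. The caption does not say how the distance is normalized or how motifs are aligned, so the code below makes explicit assumptions: both motifs are position weight matrices of equal width (one probability vector per position), the distance is the Euclidean distance per column averaged over columns, and no shifting or reverse-complementing is attempted. The threshold is passed in by the caller; which of the two reported thresholds applies to which data set is not stated in the caption.

```python
import math


def motif_distance(pwm_a, pwm_b):
    """Mean per-column Euclidean distance between two equal-width PWMs.

    Assumed normalization; the caption only says "Euclidean distance".
    """
    if len(pwm_a) != len(pwm_b) or not pwm_a:
        raise ValueError("motifs must be non-empty and of equal width")
    per_column = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(col_a, col_b)))
        for col_a, col_b in zip(pwm_a, pwm_b)
    ]
    return sum(per_column) / len(per_column)


def is_match(found, reference, threshold):
    """True when the found motif is closer to the reference than the threshold."""
    return motif_distance(found, reference) < threshold


if __name__ == "__main__":
    # Columns are probabilities over A, C, G, T (hypothetical toy motifs).
    reference = [[0.9, 0.1, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    found = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    print(is_match(found, reference, threshold=0.18))
```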
Figure 4. Accuracy of tissue-specificity classifiers based on Markov models of different orders. Here, accuracy is defined as the number of tissue-specific promoters predicted correctly by a classifier at a given Markov order divided by the total number of promoters. Total number of tissues: 12; number of tissue-specific promoter sequences for each tissue: 50. On the full training set, the accuracy of the classifier increases with the order and reaches its maximum at order 6 (gray curve, right scale). On test data, however, the order-3 classifier performs best (solid black curve, left scale), with predictive power vanishing from order 6 onward. Adding a pseudocount when computing the probability distribution, as described in the ‘Materials and Methods’ section, improves the performance of the classifier at higher orders (dashed curve, left scale), but does not surpass the performance of the classifier without a pseudocount at the BIC-predicted optimal order of 3.
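A classifier of the kind evaluated in Figure 4 might look like the sketch below. It assumes one Markov model is trained per tissue and that a promoter is assigned to the tissue whose model gives it the highest log-likelihood; the decision rule, the function names and the way the pseudocount enters the estimate are illustrative assumptions, since the details are given in ‘Materials and Methods’ rather than in the caption. Accuracy follows the caption's definition: correct predictions divided by the total number of promoters.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train(sequences, k, pseudocount=0.0):
    """Count order-k transitions over the training sequences of one class."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1
    return counts, k, pseudocount


def log_likelihood(model, seq):
    """Log-likelihood of seq; -inf if a transition has zero probability."""
    counts, k, pseudocount = model
    total = 0.0
    for i in range(k, len(seq)):
        context = counts.get(seq[i - k:i], {})
        numerator = context.get(seq[i], 0.0) + pseudocount
        if numerator == 0:
            return float("-inf")  # unseen transition and no pseudocount
        denominator = sum(context.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


def classify(models, seq):
    """Return the class label whose model scores seq highest (assumed rule)."""
    return max(models, key=lambda label: log_likelihood(models[label], seq))


def accuracy(predicted, actual):
    """Correct predictions divided by the total number of sequences."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Hypothetical two-class toy data standing in for tissue-specific promoters.
    models = {
        "tissue_a": train(["ATATATATAT"], k=1, pseudocount=1.0),
        "tissue_b": train(["GCGCGCGCGC"], k=1, pseudocount=1.0),
    }
    print(classify(models, "ATATAT"), classify(models, "GCGCGC"))
```

In this sketch, a model trained without a pseudocount assigns probability zero to any transition it has not seen, and the number of possible contexts grows as 4^k, so with only 50 training promoters per tissue such zeros become increasingly likely at higher orders.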