A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses.

Shaokun An, Jie Ren, Fengzhu Sun, Lin Wan.

Abstract

The statistical inference of high-order Markov chains (MCs) for biological sequences is vital for molecular sequence analyses but can be hindered by the high dimensionality of the free parameters. In their seminal article, Bühlmann and Wyner proposed the variable length Markov chain (VLMC) model, which embeds the full-order MC in a sparse, structured context tree. In the key tree-pruning step of their context algorithm, a word count-based statistic is defined for each branch and compared with a fixed cutoff threshold, calculated from a common chi-square distribution, to decide whether to prune that branch of the context tree. In this study, we find that the word counts for each branch are highly intercorrelated, which has non-negligible effects on the distribution of the statistic of interest. We demonstrate that the context tree inferred by the original context algorithm of Bühlmann and Wyner, with its fixed cutoff threshold based on a common chi-square distribution, can be systematically biased and error-prone. We denote the original context algorithm as VLMC-Biased (VLMC-B). To solve this problem, we propose a new context tree inference algorithm that uses an adaptive tree-pruning scheme, termed VLMC-Consistent (VLMC-C). VLMC-C is founded on consistent branch-specific mixed chi-square distributions, calculated from the asymptotic normal distribution of multiple word patterns. We validate our theoretical branch-specific asymptotic distributions using simulated data. We compare VLMC-C with VLMC-B on context tree inference using both simulated and real genome sequence data and demonstrate that VLMC-C outperforms VLMC-B in both context tree reconstruction accuracy and model compression capacity.
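
The fixed-cutoff pruning test described in the abstract can be illustrated with a minimal sketch. It rests on assumptions the abstract does not spell out: the branch statistic is taken to be the count-weighted Kullback-Leibler divergence between the next-symbol distributions of a child context and its parent, and the default cutoff is half the 95% quantile of a chi-square distribution with 3 degrees of freedom (about 7.815, for a four-letter alphabet). The function names are hypothetical. The sketch covers only the fixed-cutoff scheme; the branch-specific mixed chi-square thresholds of VLMC-C are not reproduced here.

```python
import math
from collections import Counter


def next_symbol_counts(seq, context):
    """Count the symbols that immediately follow each occurrence of `context`."""
    k = len(context)
    counts = Counter()
    for i in range(len(seq) - k):
        if seq[i:i + k] == context:
            counts[seq[i + k]] += 1
    return counts


def branch_statistic(seq, parent, child):
    """Count-weighted KL divergence of the child context from its parent.

    `child` is assumed to extend `parent` one symbol further into the past,
    e.g. parent "A" and child "CA".
    """
    child_counts = next_symbol_counts(seq, child)
    parent_counts = next_symbol_counts(seq, parent)
    n_child = sum(child_counts.values())
    n_parent = sum(parent_counts.values())
    stat = 0.0
    for sym, c in child_counts.items():
        p_child = c / n_child
        # Every occurrence of child+sym is also an occurrence of parent+sym,
        # so parent_counts[sym] >= c > 0 and the ratio is well defined.
        p_parent = parent_counts[sym] / n_parent
        stat += c * math.log(p_child / p_parent)
    return stat


def prune_branch(seq, parent, child, cutoff=7.815 / 2):
    """Return True if the branch falls below the fixed cutoff (assumed value)."""
    return branch_statistic(seq, parent, child) < cutoff


# Toy sequence: after "A" the next symbol varies, but after "CA" it is always "G".
toy = "AAT" * 40 + "CAG" * 40
print(branch_statistic(toy, "A", "CA"))
print(prune_branch(toy, "A", "CA"))
```

In this toy sequence the longer context "CA" predicts the next symbol much better than "A" alone, so the statistic is large and the branch is kept. The abstract's point is that comparing every branch against one common cutoff like this ignores the correlation among word counts.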

Keywords:  biological sequence analyses; consistent context algorithm; variable length Markov chains; word count statistics

Year:  2022        PMID: 35451885      PMCID: PMC9419963          DOI: 10.1089/cmb.2021.0604

Source DB:  PubMed          Journal:  J Comput Biol        ISSN: 1066-5277            Impact factor:   1.549


