Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

Jie Ren¹, Kai Song², Minghua Deng², Gesine Reinert³, Charles H Cannon⁴, Fengzhu Sun⁵.

Abstract

MOTIVATION: Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data.
RESULTS: Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species.
AVAILABILITY AND IMPLEMENTATION: Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/∼fsun/Programs/NGS-MC/NGS-MC.html
CONTACT: fsun@usc.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
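
ILLUSTRATION: The abstract's central idea, pooling word counts over short reads and comparing them with the counts expected under a Markov chain, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' method and not the interface of the NGS.MC R package: the function names (`count_words`, `transition_probs`, `expected_count`) are hypothetical, only the forward strand is counted, reads are treated as error-free, and the transition probabilities are simple plug-in estimates from pooled counts.

```python
from collections import Counter
from itertools import product


def count_words(reads, k):
    """Pool the counts of every length-k word over all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def transition_probs(reads, order, alphabet="ACGT"):
    """Plug-in estimate of P(next base | previous `order` bases) from pooled counts."""
    ctx_next = count_words(reads, order + 1)
    probs = {}
    for ctx in map("".join, product(alphabet, repeat=order)):
        total = sum(ctx_next[ctx + b] for b in alphabet)
        if total:  # skip contexts never observed in the reads
            probs[ctx] = {b: ctx_next[ctx + b] / total for b in alphabet}
    return probs


def expected_count(word, reads, order, probs):
    """Expected pooled count of `word` under the fitted chain of the given order."""
    k = len(word)
    n_positions = sum(max(len(r) - k + 1, 0) for r in reads)
    prefix_counts = count_words(reads, order)
    total_prefix = sum(prefix_counts.values())
    if total_prefix == 0:
        return 0.0
    # Probability of the first `order` letters, taken from observed frequencies.
    p = prefix_counts[word[:order]] / total_prefix
    # Multiply in one transition probability per remaining letter.
    for i in range(order, k):
        p *= probs.get(word[i - order:i], {}).get(word[i], 0.0)
    return n_positions * p


# Toy reads, invented for illustration only.
reads = ["ACGTACGT", "CGTACG"]
probs = transition_probs(reads, order=1)
observed = count_words(reads, 3)["ACG"]
expected = expected_count("ACG", reads, order=1, probs=probs)
print(observed, round(expected, 3))
```

Comparing `observed` with `expected` for every word is the raw ingredient of word-count statistics such as the Chi-square statistic mentioned above; the distributional results in the paper account for read sampling, which this sketch does not attempt.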

Year: 2015 PMID: 26130573 PMCID: PMC6169497 DOI: 10.1093/bioinformatics/btv395

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
