Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

Jie Ren, Kai Song, Minghua Deng, Gesine Reinert, Charles H Cannon, Fengzhu Sun.

Abstract

MOTIVATION: Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. Because NGS reads are generally short, it is challenging to assemble them and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but it requires the underlying expected word counts. A plausible model for the underlying distribution of word counts is obtained by modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of the MC and its transition probability matrix. As NGS data do not provide a single long sequence, inference methods for Markovian properties that are based on single long sequences cannot be directly used for NGS short read data.
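As a rough illustration of the word-count and Markov chain ideas above, the following minimal Python sketch pools k-word counts over a set of reads and derives plug-in transition probability estimates for an MC of a given order. The function names (`count_words`, `estimate_transitions`) and the toy reads are hypothetical and are not taken from the NGS.MC package; the sketch ignores sequencing errors, strand orientation and coverage effects, which a real analysis would need to handle.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def count_words(reads, k):
    """Pool the counts of all k-words (k-mers) over a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            # Skip words containing ambiguous symbols such as N.
            if set(word) <= set(ALPHABET):
                counts[word] += 1
    return counts


def estimate_transitions(reads, order):
    """Plug-in estimate of P(next base | previous `order` bases).

    Uses pooled (order + 1)-word counts; contexts never seen in the
    reads are left out of the result.
    """
    counts = count_words(reads, order + 1)
    probs = {}
    for ctx in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[ctx + base] for base in ALPHABET)
        if total > 0:
            probs[ctx] = {base: counts[ctx + base] / total for base in ALPHABET}
    return probs


# Toy reads (illustrative only).
toy_reads = ["ACGT", "CGTA"]
word_counts = count_words(toy_reads, 2)
transitions = estimate_transitions(toy_reads, 1)
```

Because the counts are pooled across reads, no single long sequence is needed; each read contributes only the words that lie entirely within it.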
RESULTS: Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate them using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species.
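To make the role of a Chi-square-type statistic more concrete, here is a minimal Python sketch of the classical Pearson-type form used for single sequences, applied to k-word counts pooled over reads. It compares each observed k-word count with its expectation under an MC of order k-2, taking the expected count of a word as N(prefix) · N(suffix) / N(middle). The function name `chi_square_order_stat` is hypothetical, the exact statistics in the paper may be defined differently, and the sketch does not attempt the gamma approximation mentioned above; it only computes the value of the statistic.

```python
from collections import Counter

ALPHABET = "ACGT"


def chi_square_order_stat(reads, k):
    """Pearson-type statistic for k-word counts pooled over reads.

    Expected counts are computed under a Markov chain of order k - 2.
    Shorter-word counts are obtained by marginalizing the k-word counts,
    so prefix, suffix and middle counts are mutually consistent.
    """
    if k < 3:
        raise ValueError("this sketch requires k >= 3")

    n_w = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if set(word) <= set(ALPHABET):  # skip words with N etc.
                n_w[word] += 1

    prefix, suffix, middle = Counter(), Counter(), Counter()
    for word, c in n_w.items():
        prefix[word[:-1]] += c    # first k-1 letters
        suffix[word[1:]] += c     # last k-1 letters
        middle[word[1:-1]] += c   # inner k-2 letters

    stat = 0.0
    for p in prefix:
        for base in ALPHABET:
            word = p + base
            if suffix[word[1:]] == 0:
                continue  # expected count is zero; word contributes nothing
            expected = prefix[p] * suffix[word[1:]] / middle[word[1:-1]]
            stat += (n_w[word] - expected) ** 2 / expected
    return stat


# Toy reads: the base following "CA" or "GA" depends on the base before it,
# so a first-order model fits poorly and the statistic is positive.
stat_dependent = chi_square_order_stat(["ACAC", "AGAG"], 3)
stat_constant = chi_square_order_stat(["AAAA"], 3)
```

Larger values indicate that an order k-2 model explains the k-word counts poorly; how large is "large" for read data is exactly what the distributional result in the abstract addresses.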
AVAILABILITY AND IMPLEMENTATION: Our implementation of the statistics developed here is available as R package 'NGS.MC' at http://www-rcf.usc.edu/∼fsun/Programs/NGS-MC/NGS-MC.html
CONTACT: fsun@usc.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

Year:  2015        PMID: 26130573      PMCID: PMC6169497          DOI: 10.1093/bioinformatics/btv395

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


  20 in total

1.  CAFE: aCcelerated Alignment-FrEe sequence analysis.

Authors:  Yang Young Lu; Kujin Tang; Jie Ren; Jed A Fuhrman; Michael S Waterman; Fengzhu Sun
Journal:  Nucleic Acids Res       Date:  2017-07-03       Impact factor: 16.971

2.  A new statistic for efficient detection of repetitive sequences.

Authors:  Sijie Chen; Yixin Chen; Fengzhu Sun; Michael S Waterman; Xuegong Zhang
Journal:  Bioinformatics       Date:  2019-11-01       Impact factor: 6.937

3.  Alignment-free $d_2^*$ oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences.

Authors:  Nathan A Ahlgren; Jie Ren; Yang Young Lu; Jed A Fuhrman; Fengzhu Sun
Journal:  Nucleic Acids Res       Date:  2016-11-28       Impact factor: 16.971

4.  A New Context Tree Inference Algorithm for Variable Length Markov Chain Model with Applications to Biological Sequence Analyses.

Authors:  Shaokun An; Jie Ren; Fengzhu Sun; Lin Wan
Journal:  J Comput Biol       Date:  2022-04-22       Impact factor: 1.549

5.  Applications of species accumulation curves in large-scale biological data analysis.

Authors:  Chao Deng; Timothy Daley; Andrew D Smith
Journal:  Quant Biol       Date:  2015-10-17

6.  Confidence intervals for Markov chain transition probabilities based on next generation sequencing reads data.

Authors:  Lin Wan; Xin Kang; Jie Ren; Fengzhu Sun
Journal:  Quant Biol       Date:  2020-05-25

7.  A network-based integrated framework for predicting virus-prokaryote interactions.

Authors:  Weili Wang; Jie Ren; Kujin Tang; Emily Dart; Julio Cesar Ignacio-Espinoza; Jed A Fuhrman; Jonathan Braun; Fengzhu Sun; Nathan A Ahlgren
Journal:  NAR Genom Bioinform       Date:  2020-06-23

8.  Alignment-free Transcriptomic and Metatranscriptomic Comparison Using Sequencing Signatures with Variable Length Markov Chains.

Authors:  Weinan Liao; Jie Ren; Kun Wang; Shun Wang; Feng Zeng; Ying Wang; Fengzhu Sun
Journal:  Sci Rep       Date:  2016-11-23       Impact factor: 4.379

9.  Reads Binning Improves the Assembly of Viral Genome Sequences From Metagenomic Samples.

Authors:  Kai Song
Journal:  Front Microbiol       Date:  2021-05-21       Impact factor: 5.640

10.  Alignment-free microbial phylogenomics under scenarios of sequence divergence, genome rearrangement and lateral genetic transfer.

Authors:  Guillaume Bernard; Cheong Xin Chan; Mark A Ragan
Journal:  Sci Rep       Date:  2016-07-01       Impact factor: 4.379
