Naim U Rashid, Wei Sun, Joseph G Ibrahim.
Abstract
In DAE (DNA After Enrichment)-seq experiments, genomic regions associated with certain biological processes are enriched/isolated by an assay and then sequenced on a high-throughput sequencing platform to determine their genomic positions. Statistical analysis of DAE-seq data aims to detect genomic regions with significant aggregations of isolated DNA fragments ("enriched regions") as opposed to all other regions ("background"). However, many confounding factors may influence DAE-seq signals. In addition, the signals in adjacent genomic regions may exhibit strong correlations, which invalidate the independence assumption employed by many existing methods. To mitigate these issues, we develop a novel Autoregressive Hidden Markov Model (AR-HMM) that accounts for covariate effects and violations of the independence assumption. We demonstrate that our AR-HMM leads to improved performance in identifying enriched regions in both simulated and real datasets, especially in epigenetic datasets with broader regions of DAE-seq signal enrichment. We also introduce a variable selection procedure in the context of the HMM/AR-HMM, where the observations are not independent and the mean of each state-specific emission distribution is modeled as a function of covariates. We study the theoretical properties of this variable selection procedure and demonstrate its efficacy in simulated and real DAE-seq data. In summary, we develop several practical approaches for DAE-seq data analysis that are also applicable to more general problems in statistics.
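To make the idea of an autoregressive HMM with covariate-dependent emission means concrete, the following is a minimal sketch of a forward-algorithm log-likelihood. It is not the authors' model: the abstract does not specify the emission distribution or the form of the autoregressive term, so the Poisson emissions, the log-linear mean, the `log(1 + previous count)` lag term, the notion of fixed genomic "windows", and all function and parameter names are assumptions chosen for illustration.

```python
import math


def poisson_logpmf(y, mu):
    # log P(Y = y) for a Poisson distribution with mean mu > 0
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def logsumexp(values):
    # Numerically stable log(sum(exp(v))) over a list of log-values
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def ar_hmm_loglik(counts, covariate, init, trans, beta0, beta1, phi):
    """Forward-algorithm log-likelihood of a K-state AR(1)-HMM.

    Assumed (hypothetical) parameterization: in state k, the count in
    window t is Poisson with
        log mu = beta0[k] + beta1[k] * covariate[t]
                 + phi[k] * log(1 + counts[t - 1]).
    The first window has no predecessor, so its lag term is dropped.
    All entries of init and trans must be strictly positive.
    """
    n_states = len(init)

    def emission(t, k):
        # State-specific log-density of counts[t] given the previous count
        lag = math.log1p(counts[t - 1]) if t > 0 else 0.0
        log_mu = beta0[k] + beta1[k] * covariate[t] + phi[k] * lag
        return poisson_logpmf(counts[t], math.exp(log_mu))

    # alpha[k] = log P(counts[0..t], state at t = k)
    alpha = [math.log(init[k]) + emission(0, k) for k in range(n_states)]
    for t in range(1, len(counts)):
        alpha = [
            logsumexp([alpha[j] + math.log(trans[j][k]) for j in range(n_states)])
            + emission(t, k)
            for k in range(n_states)
        ]
    return logsumexp(alpha)


# Toy usage: state 0 plays the role of "background", state 1 of "enriched".
toy_counts = [1, 0, 2, 9, 12, 10, 1, 0]
toy_covariate = [0.1, 0.0, 0.2, 0.1, 0.3, 0.2, 0.0, 0.1]
toy_loglik = ar_hmm_loglik(
    toy_counts, toy_covariate,
    init=[0.9, 0.1],
    trans=[[0.95, 0.05], [0.10, 0.90]],
    beta0=[0.0, 1.5], beta1=[0.5, 0.5], phi=[0.1, 0.4],
)
```

Setting every `phi[k]` to zero recovers an ordinary HMM with covariate-dependent emission means, which is one way to see what the autoregressive term adds: it lets the expected signal in a window depend on the observed signal in the preceding window, not only on the hidden state.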
Keywords: Autoregressive modeling; Hidden Markov Model; High-throughput Sequencing; Mixture Regression; Variable Selection
Year: 2014; PMID: 24678134; PMCID: PMC3963211; DOI: 10.1080/01621459.2013.869222
Source DB: PubMed; Journal: J Am Stat Assoc; ISSN: 0162-1459; Impact factor: 5.033