RJaCGH: Bayesian analysis of aCGH arrays for detecting copy number changes and recurrent regions.

Abstract

SUMMARY: Several methods have been proposed to detect copy number changes and recurrent regions of copy number variation from aCGH, but few methods return probabilities of alteration explicitly, which are the direct answer to the question 'is this probe/region altered?' RJaCGH fits a Non-Homogeneous Hidden Markov model to the aCGH data using Markov Chain Monte Carlo with Reversible Jump, and returns the probability that each probe is gained or lost. Using these probabilities, recurrent regions (over sets of individuals) of copy number alteration can be found. AVAILABILITY: RJaCGH is available as an R package from CRAN repositories (e.g. http://cran.r-project.org/web/packages).

Year: 2009 PMID: 19420051 PMCID: PMC2712338 DOI: 10.1093/bioinformatics/btp307

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Genomic DNA copy number alterations (CNAs) are associated with complex diseases (McCarroll and Altshuler, 2007), and are often studied using array-based comparative genomic hybridization (aCGH). To be immediately useful in both clinical and basic research scenarios, aCGH data analysis requires accurate methods that do not impose unrealistic biological assumptions and that provide direct answers to the key question, ‘What is the probability that this gene/region has CNAs?’ Estimates of the probabilities of alteration (instead of P-values or smoothed means) are the most direct and usable answer to this problem (Broët and Richardson, 2006). Probabilities can be used in contexts from basic research to clinical applications (Lockwood et al., 2006; Pinkel and Albertson, 2005) so that a clinician might require high certainty of alteration of a specific gene before invasive procedures, whereas a basic researcher can consider for further study genes that show only a moderate probability of alteration. In addition, many aCGH platforms have probes located at variable distances, which should be incorporated in the analysis (Broët and Richardson, 2006; Lockwood et al., 2006).

A variety of methods have been developed for the analysis of aCGH data (see reviews in Lai et al., 2005; Rueda and Diaz-Uriarte, 2007a,b; Willenbrock and Fridlyand, 2005), but most of them do not return probabilities of alteration nor make use of the distance between probes. The few approaches that return probabilities of alteration either do not use distance between probes, or fix the number of possible states of alteration to three or four, a biologically unrealistic assumption.

In addition to locating probes that show copy number changes, the identification of common or recurrent regions of alteration is one frequent study objective: the regions more likely to harbor disease-critical genes are those that are recurrent or common among samples (Diskin et al., 2006; Pinkel and Albertson, 2005).
The identification of these regions should use the information about the probability of alteration (to avoid giving the same weight to probes with strong and weak evidence of alteration), and should allow the discovery of regions over subsets of samples, as it is known that many complex diseases, such as cancer or autism, are composed of subtypes of syndromes (Sebat, 2007). Most available methods for locating common regions (Klijn et al., 2008; Rouveirol et al., 2006; Shah et al., 2007; Taylor et al., 2008) neither allow for among-subject heterogeneity nor use probabilities. Finally, many of the existing methods are not as readily and freely available as those on CRAN, nor as easy to use without forcing (often arbitrary) choices on the user. We have developed a freely available R package, RJaCGH, for the analysis of aCGH data that incorporates distance between probes, returns probabilities of alteration and allows the identification of recurrent regions of CNA.

2 RESULTS

To estimate probabilities of copy number changes, we use a non-homogeneous Hidden Markov model (HMM) with an unknown number of hidden states fitted via Reversible Jump Markov Chain Monte Carlo (Cappé et al., 2005). By using a non-homogeneous HMM, we can account for the variable distance between probes/genes, and Reversible Jump allows us to use HMMs without fixing the number of hidden states. By exploring the full posterior probabilities and retaining the probabilities of models of different sizes, we can employ Bayesian model averaging (Hoeting et al., 1999), thus incorporating model uncertainty and not conditioning our inferences on the selection of a particular model. The statistical model is described in Rueda and Diaz-Uriarte (2007a), where it is shown that the method performs as well as, or better than, the competing methods ACE (Lingjaerde et al., 2005), BioHMM (Marioni et al., 2006), HMM (Fridlyand et al., 2004), CGHseg (Picard et al., 2005), DNAcopy (Venkatraman and Olshen, 2007) and GLAD (Hupé et al., 2004) in terms of calling gains and losses, and the performance advantage increases as the variability in inter-probe distance increases.

For the identification of recurrent regions of CNA, we have developed two algorithms, pREC-A and pREC-S (fully described in the documentation of the program and as a technical report from http://biostats.bepress.com/cobra/ps/art43/). pREC-A (probabilistic recurrent copy number regions, common threshold over all arrays) does not allow for among-subject heterogeneity and is, thus, similar in objectives to previous approaches except for the fact that we explicitly use probabilities. pREC-S (probabilistic recurrent copy number regions, subsets of arrays) identifies common regions over subsets of arrays; alternatively, we can think of this algorithm as identifying subsets of arrays that share regions of alteration. This is a novel algorithm, explicitly targeted to incorporate heterogeneity and use probabilities.
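The Bayesian model averaging step mentioned above can be illustrated with a minimal sketch. It is not taken from the RJaCGH code: the function name and all numbers are hypothetical, and it assumes that for each candidate number of hidden states k we already have its posterior probability P(k | data) and the per-probe probability of alteration conditional on k. The averaged probability for a probe is then the sum over k of P(k | data) times P(altered | k, data).

```python
def model_averaged_probability(model_posterior, prob_altered_by_model):
    """Average per-probe alteration probabilities over models of different sizes.

    model_posterior: dict mapping number of hidden states k -> P(k | data)
    prob_altered_by_model: dict mapping k -> list of P(probe altered | k, data)
    """
    total = sum(model_posterior.values())
    if abs(total - 1.0) > 1e-8:
        raise ValueError("model posterior probabilities must sum to 1")

    n_probes = len(next(iter(prob_altered_by_model.values())))
    averaged = [0.0] * n_probes
    for k, weight in model_posterior.items():
        probs = prob_altered_by_model[k]
        if len(probs) != n_probes:
            raise ValueError("every model must cover the same probes")
        # Each model contributes in proportion to its posterior probability.
        for i, p in enumerate(probs):
            averaged[i] += weight * p
    return averaged


# Hypothetical posterior over models with 2, 3 and 4 hidden states.
model_posterior = {2: 0.2, 3: 0.5, 4: 0.3}
# Hypothetical probability of gain for three probes under each model.
prob_gain_by_model = {
    2: [0.10, 0.80, 0.90],
    3: [0.05, 0.60, 0.95],
    4: [0.00, 0.70, 1.00],
}

bma_prob_gain = model_averaged_probability(model_posterior, prob_gain_by_model)
# Approximately [0.045, 0.67, 0.955]
```

No single model is selected here: a probe that looks altered only under an improbable model receives a correspondingly small averaged probability.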
Both methods use probabilities of alteration as returned by the non-homogeneous HMM. No hard thresholds are imposed, and thus the user decides what constitutes sufficient evidence (in terms of probability of alteration) to call a probe gained (or lost). The probabilities that we use are not the marginal probabilities of alteration but the joint probabilities of alteration of a region of probes. Our approach incorporates both within- and among-array variability: we use the information on the certainty of each call of gain/loss (i.e. the probability) in all computations of recurrent regions. Moreover, using probabilities of alteration (instead of magnitude of change), in addition to differentiating between evidence of alteration and estimated fold change, prevents inter-array differences in range of log2 ratios and tissue mixture from being confounded with evidence of alteration. Finally, both algorithms use at most two parameters and their biological meaning is immediate: probability of alteration, and number of samples that share an alteration. We can use the output of pREC-S as the basis for clustering and to display patterns of groupings of arrays; an example is shown in the documentation of the program.

The RJaCGH method has been implemented as an R package (R Development Core Team, 2006). All of the MCMC code for the HMM as well as the two algorithms for common regions have been implemented in C (dynamically loaded from R) for speed. The program is available from the standard R repositories (e.g. http://cran.r-project.org/web/packages/) under the GPL (v. 3) license and has been submitted to BioConductor. The package depends on no additional software (besides R itself). The flexibility and comprehensiveness of RJaCGH does have a computational cost: estimation of probabilities by RJaCGH is considerably slower than segmentation by alternative approaches.
If probabilities of alteration are desired (but finding recurrent regions or incorporating distance between probes is not needed), the bcp method of Erdman and Emerson (2007, 2008) is a much faster alternative. pREC-A and pREC-S, once the probabilities have been obtained, are very fast (on the order of seconds to a few minutes for datasets that include 50–70 samples).
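The distinction drawn above between marginal and joint probabilities of alteration can be made concrete with a small sketch. This is not the pREC-A or pREC-S algorithm; it only assumes that, for each array, we hold a set of posterior samples of the hidden state sequence (for example from an MCMC run), coded as 1 for altered and 0 otherwise. All function names and data are hypothetical. The joint probability of a region is estimated as the fraction of samples in which every probe of the region is altered at once, and the last function applies the two user parameters named in the text: a probability threshold and the number of arrays that share the alteration.

```python
def marginal_probabilities(samples):
    """Per-probe fraction of posterior samples in which the probe is altered."""
    n_samples = len(samples)
    n_probes = len(samples[0])
    return [sum(s[i] for s in samples) / n_samples for i in range(n_probes)]


def joint_probability(samples, start, end):
    """Fraction of samples in which all probes in [start, end] are altered."""
    hits = sum(1 for s in samples if all(s[start:end + 1]))
    return hits / len(samples)


def arrays_sharing_region(samples_by_array, start, end, p):
    """Names of arrays whose joint probability for the region is at least p."""
    return [name for name, samples in samples_by_array.items()
            if joint_probability(samples, start, end) >= p]


# Hypothetical posterior samples: rows are samples, columns are probes.
samples_by_array = {
    "A": [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]],
    "B": [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 1]],
}

marginal_a = marginal_probabilities(samples_by_array["A"])  # [0.75, 0.75, 0.5]
joint_a = joint_probability(samples_by_array["A"], 0, 1)    # 0.75
joint_b = joint_probability(samples_by_array["B"], 0, 1)    # 0.0

# Arrays supporting the region of probes 0-1 with joint probability >= 0.7.
supporting = arrays_sharing_region(samples_by_array, 0, 1, p=0.7)  # ["A"]
is_recurrent = len(supporting) >= 1  # second parameter: minimum number of arrays
```

In array A the two probes are altered together, so the joint probability (0.75) exceeds the product of the marginals (0.5625). In array B each probe has marginal probability 0.5, yet the two are never altered in the same sample, so the joint probability of the region is 0. Multiplying marginals would therefore misjudge the evidence for a region in both cases.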

19 in total

1. CGH-Explorer: a program for analysis of array-CGH data.

Authors: Ole Christian Lingjaerde; Lars O Baumbusch; Knut Liestøl; Ingrid K Glad; Anne-Lise Børresen-Dale
Journal: Bioinformatics Date: 2004-11-05 Impact factor: 6.937

Review 2. Recent advances in array comparative genomic hybridization technologies and their applications in human genetics.

Authors: William W Lockwood; Raj Chari; Bryan Chi; Wan L Lam
Journal: Eur J Hum Genet Date: 2006-02 Impact factor: 4.246

3. A comparison study: applying segmentation to array CGH data for downstream analyses.

Authors: Hanni Willenbrock; Jane Fridlyand
Journal: Bioinformatics Date: 2005-09-13 Impact factor: 6.937

Review 4. Array comparative genomic hybridization and its applications in cancer.

Authors: Daniel Pinkel; Donna G Albertson
Journal: Nat Genet Date: 2005-06 Impact factor: 38.330

5. Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data.

Authors: Weil R Lai; Mark D Johnson; Raju Kucherlapati; Peter J Park
Journal: Bioinformatics Date: 2005-08-04 Impact factor: 6.937

6. Computation of recurrent minimal genomic alterations from array-CGH data.

Authors: C Rouveirol; N Stransky; Ph Hupé; Ph La Rosa; E Viara; E Barillot; F Radvanyi
Journal: Bioinformatics Date: 2006-01-24 Impact factor: 6.937

7. Detection of gene copy number changes in CGH microarrays using a spatially correlated mixture model.

Authors: Philippe Broët; Sylvia Richardson
Journal: Bioinformatics Date: 2006-02-02 Impact factor: 6.937

8. BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data.

Authors: J C Marioni; N P Thorne; S Tavaré
Journal: Bioinformatics Date: 2006-03-13 Impact factor: 6.937

9. A fast Bayesian change point analysis for the segmentation of microarray data.

Authors: Chandra Erdman; John W Emerson
Journal: Bioinformatics Date: 2008-07-29 Impact factor: 6.937

10. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions.

Authors: Philippe Hupé; Nicolas Stransky; Jean-Paul Thiery; François Radvanyi; Emmanuel Barillot
Journal: Bioinformatics Date: 2004-09-20 Impact factor: 6.937
