Armand Valsesia, Aurélien Macé, Sébastien Jacquemont, Jacques S Beckmann, Zoltán Kutalik.
Abstract
Differences between genomes can be due to single nucleotide variants, translocations, inversions, and copy number variants (CNVs, gain or loss of DNA). The latter can range from sub-microscopic events to complete chromosomal aneuploidies. Small CNVs are often benign but those larger than 500 kb are strongly associated with morbid consequences such as developmental disorders and cancer. Detecting CNVs within and between populations is essential to better understand the plasticity of our genome and to elucidate its possible contribution to disease. Hence there is a need for better-tailored and more robust tools for the detection and genome-wide analyses of CNVs. While a link between a given CNV and a disease may have often been established, the relative CNV contribution to disease progression and impact on drug response is not necessarily understood. In this review we discuss the progress, challenges, and limitations that occur at different stages of CNV analysis from the detection (using DNA microarrays and next-generation sequencing) and identification of recurrent CNVs to the association with phenotypes. We emphasize the importance of germline CNVs and propose strategies to aid clinicians to better interpret structural variations and assess their clinical implications.
Keywords: bioinformatics; complex disease; copy number variation; genome-wide association studies; genomics; personalized medicine; sequencing
Year: 2013 PMID: 23750167 PMCID: PMC3667386 DOI: 10.3389/fgene.2013.00092
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Examples of algorithms for the detection of structural variants from array data.
| Software | Affymetrix 6.0 | Affymetrix 500 k | Illumina 1 M | Illumina 610 k | Illumina 550 k | CGH | Method | Use allelic intensities | Multi-sample analysis | Copy number output | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CRLMM (Scharpf et al., | X | X | X | X | Corrected robust linear model with maximum likelihood distance | X | X | Allele-specific copy number (CN) | |||
| ASCAT (Van Loo et al., | X | X | X | X | X | Allele-specific piecewise constant fitting | X | Allele-specific copy number (CN) | |||
| GMM (Valsesia et al., | X | X | X | X | X | X | Gaussian mixture model | X | Continuous CN | ||
| PICNIC (Greenman et al., | X | Hidden–Markov model (HMM) | X | Continuous CN + CN genotypes | |||||||
| GLAD (Hupé et al., | X | X | X | X | X | X | Adaptive weight smoothing | Discrete CN | |||
| PennCNV (Wang et al., | X* | X* | X | X | X | HMM | X | Trios only | Discrete CN + CN genotypes | ||
| Birdsuite (McCarroll et al., | X | HMM | X | X | Discrete CN + CN genotypes | ||||||
| QuantiSNP (Colella et al., | X* | X* | X | X | X | HMM | X | Discrete CN + CN genotypes | |||
| Affymetrix.aroma (Bengtsson et al., | X | X | Copy number estimation using robust multichip analysis (CRMA) | X | Unclassified segments | ||||||
| cn.Farms (Clevert et al., | X | X | Probabilistic latent variable model | X | Unclassified segments | ||||||
| GADA (Pique-Regi, | X | X | X | X | X | X | Sparse Bayesian learning | X | Unclassified segments | ||
| CBS (Olshen et al., | X | X | X | X | X | X | Binary segmentations assessed by permutations | Unclassified segments | |||
X* indicates software that was not initially designed for such analysis but that can be used after additional pre-processing steps.
Figure 1. SNP and CGH array analyses. (A) Analyses with SNP and CGH arrays of two melanoma samples (Me275, a tetraploid sample, and Me280, a sample with large deletions). Probes/SNPs are plotted as a function of their genomic position on the X axis. The Y axis for CGH arrays corresponds to hybridization ratios. The Y axis for SNP arrays corresponds to the predicted copy number. Colors indicate a copy number state (orange <2 copies; gray = 2 copies; cyan = 3 copies; dark blue >3 copies). (B) Analysis of the Me275 sample with SNP array. The top panel shows genome-wide copy number. Subsequent panels show chromosome 7 with, from top to bottom: hybridization log2 ratio, B allele frequency, and copy number prediction.
Algorithms for the detection of structural variants from NGS data.
| Strategy | Approach | Reference |
|---|---|---|
| Paired-end mapping | Detection of discordant end-pairs | Tuzun et al. ( |
| | Clustering of end-pairs | Korbel et al. ( |
| Read-depth analysis | Detection of local change points | Campbell et al. ( |
| | Detection of outliers compared to the read-depth baseline | Alkan et al. ( |
| | Event-wise testing | Yoon et al. ( |
| Split-read analysis | Identification of breakpoints with a pattern growth algorithm | Ye et al. ( |
| Sequence assembly analysis | | Simpson et al. ( |
| | Burrows–Wheeler transform | Simpson and Durbin ( |
| | Simultaneous assembly of multiple eukaryotic genomes | Boone et al. ( |
| | Detection of small indels through local reassembly | Massouras et al. ( |
| Mixed strategies | Combines both paired-end mapping and read-depth analysis | Medvedev et al. ( |
Figure 2. NGS approaches. Analytical strategies to detect CNVs from NGS data: (A) paired-end mapping approach, (B) read-depth approach, and (C) split-read approach.
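The read-depth idea (detecting outliers relative to a read-depth baseline) can be illustrated with a toy sketch. It is not the algorithm of any tool cited above: the window depths, the function name `depth_outliers`, and the cut-off `z_cut` are all assumptions made for illustration, and real callers also correct for GC content and mappability.

```python
from statistics import median

def depth_outliers(depths, z_cut=3.0):
    """Label each window as 'gain', 'loss' or 'neutral' relative to the baseline.

    The baseline is the median depth; the spread is estimated with the median
    absolute deviation (MAD), rescaled by 1.4826 to be comparable to a
    standard deviation under normality.
    """
    base = median(depths)
    mad = median([abs(d - base) for d in depths])
    if mad == 0:
        # No spread to compare against: nothing can be called an outlier.
        return ["neutral"] * len(depths)
    scale = 1.4826 * mad
    calls = []
    for d in depths:
        z = (d - base) / scale
        if z > z_cut:
            calls.append("gain")
        elif z < -z_cut:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls

# Hypothetical read counts per fixed-size genomic window
depths = [30, 31, 29, 30, 32, 15, 14, 16, 30, 29, 31, 60, 30]
calls = depth_outliers(depths)
```

With these made-up depths, the three consecutive low-depth windows are labeled as a loss and the single high-depth window as a gain; merging adjacent windows into segments would be a separate step.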
Figure 3. Impact of CNV post-filtering on false-discovery rate (FDR). Illustration of how the FDR evolves when CNVs are discarded based on their length (A) or on their confidence scores (B). (C,D) Show histograms of CNV length and CNV confidence score, respectively. Fluctuations in these histograms (such as inversion of the proportion “small CNVs over long CNVs” or “low-confidence over high-confidence CNVs”) are associated with non-monotonic changes in the FDR curve.
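A minimal sketch of how such an FDR curve could be computed, assuming each CNV call carries a length, a confidence score, and a validation flag from an independent method. The field names (`length_kb`, `score`, `validated`), the function `fdr_after_filter`, and the data are hypothetical.

```python
def fdr_after_filter(calls, key, thresholds):
    """Return (threshold, n_kept, fdr) for each threshold.

    Calls with call[key] >= threshold are kept; the FDR is the fraction of
    kept calls that were not validated. It is None when nothing is kept.
    """
    curve = []
    for t in thresholds:
        kept = [c for c in calls if c[key] >= t]
        if not kept:
            curve.append((t, 0, None))
            continue
        false_pos = sum(1 for c in kept if not c["validated"])
        curve.append((t, len(kept), false_pos / len(kept)))
    return curve

# Hypothetical CNV calls
calls = [
    {"length_kb": 5, "score": 3.0, "validated": False},
    {"length_kb": 12, "score": 8.0, "validated": True},
    {"length_kb": 40, "score": 4.0, "validated": False},
    {"length_kb": 150, "score": 20.0, "validated": True},
    {"length_kb": 600, "score": 35.0, "validated": True},
]
by_length = fdr_after_filter(calls, "length_kb", [0, 10, 20, 50])
by_score = fdr_after_filter(calls, "score", [0, 5, 10])
```

In this toy data the FDR by length drops from 0.40 to 0.25, rises to about 0.33, and then falls to 0: raising the threshold from 10 to 20 kb discards a validated call while a false positive remains, which is the kind of non-monotonic behavior the caption describes.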
Figure 4. Representation of CNV data and CNV-GWA analysis. (A) CNV representation on chromosome 10 (X axis) for different subjects (Y axis). (B) Frequency representation of the same CNV. (C) Matrix-based representation of the CNV along with the phenotype of the different subjects. (D) Representation of the CNV association results.
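The segment, frequency, and matrix representations in panels (A) to (C) can be related with a short sketch. It assumes CNV calls are given as (subject, start, end, copy number) tuples and that probes without a call are diploid (copy number 2); the function names and data are hypothetical.

```python
def cnv_matrix(segments, subjects, probes):
    """Build a subjects x probes matrix of copy numbers (default 2)."""
    row = {s: i for i, s in enumerate(subjects)}
    matrix = [[2] * len(probes) for _ in subjects]
    for subject, start, end, copy_number in segments:
        for j, pos in enumerate(probes):
            if start <= pos <= end:
                matrix[row[subject]][j] = copy_number
    return matrix

def cnv_frequency(matrix):
    """Fraction of subjects with a non-diploid copy number at each probe."""
    n_subjects = len(matrix)
    n_probes = len(matrix[0])
    return [
        sum(1 for r in matrix if r[j] != 2) / n_subjects
        for j in range(n_probes)
    ]

# Hypothetical subjects, probe positions (bp) and CNV segments
subjects = ["s1", "s2", "s3", "s4"]
probes = [100, 200, 300, 400, 500]
segments = [("s1", 150, 350, 1), ("s3", 250, 450, 3), ("s4", 150, 250, 1)]

matrix = cnv_matrix(segments, subjects, probes)
freq = cnv_frequency(matrix)
```

Each column of `matrix` can then be tested against the phenotype vector, one probe at a time, which is the input format a probe-level association analysis would need.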
Figure 5. QQ-plot investigation. From a real dataset (copy number predictions for more than 3,600 individuals at 95,770 probes from chromosome 1), association was tested with either a simulated phenotype (A–C) or a real phenotype (D). The simulated phenotype corresponds to normally distributed data influenced by a confounding factor [here the first principal component (PC1) obtained from the matrix of copy number predictions]. (A) Shows a strong p-value inflation (lambda ∼ 65) that is due to the confounding factor (PC1). (B) Corresponds to results from a model where PC1 is added as a covariate (to adjust for the confounding effect). Yet (B) shows a slight p-value deflation (lambda ∼ 0.87). This deflation arises because the tested probes are assumed to be independent, whereas many of them belong to the same CNV region (thus the presented p-values are not from truly independent tests). (C) Shows a QQ plot adjusting for PC1 and where P0 (the X axis) accounts for the fact that probes can come from the same CNV region. Such a plot can be produced (in the R programming language) by setting the vector of expected p-values (X axis) as P0 <- seq(1/N, 1, by = (1 - 1/N)/(n - 1)), where N is the number of CNV regions (number of effective tests) and n is the total number of CNV probes (number of observations). (D) Shows results from association with real data (here body mass index). In these QQ-plots, points with identical p-values correspond to rare but rather long CNVs that span multiple probes with identical values.
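The expected p-value vector from the caption can be reproduced outside R. The sketch below is a Python translation of that `seq` call, together with a genomic inflation factor computed in the usual way (median observed 1-df chi-square statistic divided by its null median). That lambda is defined this way in the figure is an assumption, and the function names are hypothetical.

```python
from statistics import NormalDist, median

def expected_pvalues(n_regions, n_probes):
    """Equivalent of R's seq(1/N, 1, by = (1 - 1/N)/(n - 1)).

    n_regions is N (number of effective tests); n_probes is n (number of
    observations). Returns n evenly spaced values from 1/N to 1.
    """
    if n_probes < 2:
        raise ValueError("need at least two probes")
    start = 1.0 / n_regions
    step = (1.0 - start) / (n_probes - 1)
    return [start + i * step for i in range(n_probes)]

def genomic_inflation(pvalues):
    """Median observed chi-square (1 df) divided by the null median.

    P-values must be in (0, 1]; values so small that 1 - p/2 rounds to 1.0
    in floating point would need a different conversion.
    """
    norm = NormalDist()
    # Two-sided p-value p corresponds to z = Phi^-1(1 - p/2), chi-square = z^2.
    chi2 = [norm.inv_cdf(1.0 - p / 2.0) ** 2 for p in pvalues]
    null_median = norm.inv_cdf(0.75) ** 2  # about 0.455
    return median(chi2) / null_median

# Toy sizes: 4 CNV regions covered by 10 probes
p0 = expected_pvalues(n_regions=4, n_probes=10)

# Evenly spread p-values should give lambda close to 1
uniform_like = [(i + 0.5) / 1000 for i in range(1000)]
lam = genomic_inflation(uniform_like)
```

For the QQ plot itself, the sorted observed p-values would be plotted against `p0`, both on a negative log10 scale.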
Figure 6. Possible strategies for CNV prioritization. (A) Overview of possible strategies. (B) Functional investigation in animal models (functional impact assessment). (C) Gene ranking based on text-mining approaches (prioritization). (D) Visualization in a genome browser (genomic characterization).