CNVineta: a data mining tool for large case-control copy number variation datasets.

Michael Wittig, Ingo Helbig, Stefan Schreiber, Andre Franke.

Abstract

MOTIVATION: Copy number variation (CNV), a major contributor to human genetic variation, comprises genomic deletions and insertions of ≥1 kb. Yet the identification of CNVs from microarray data is still hampered by high false-negative and false-positive prediction rates caused by the noisy nature of the raw data. Here, we present CNVineta, an R package for rapid data mining and visualization of CNVs in large case-control datasets genotyped with single nucleotide polymorphism oligonucleotide arrays. CNVineta is compatible with various established CNV prediction algorithms, can be used for genome-wide association analysis of rare and common CNVs, and enables rapid and serial display of the log2 of raw data ratios as well as B-allele frequencies for visual quality inspection. In summary, CNVineta aids in the interpretation of large-scale CNV datasets and the prioritization of target regions for follow-up experiments.
AVAILABILITY AND IMPLEMENTATION: CNVineta is available as an R package and can be downloaded from http://www.ikmb.uni-kiel.de/CNVineta/; the package contains a tutorial outlining a typical workflow. The CNVineta compatible HapMap dataset can also be downloaded from the link above.

Year:  2010        PMID: 20605930      PMCID: PMC2922892          DOI: 10.1093/bioinformatics/btq356

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Many recent research findings suggest that copy number variation (CNV) plays a major role in genetic variability and hence human disease (Manolio et al., 2009; Zhang et al., 2009). However, while methodologies for large-scale association studies of single nucleotide polymorphisms (SNPs) are well established, a comprehensive framework for the analysis and interpretation of genome-wide CNV predictions in large case–control datasets is still lacking (Manolio et al., 2009). CNV prediction algorithms for data derived from SNP microarrays have improved considerably over the last few years, but many predicted segments are still false positives (Winchester et al., 2009). These can only be excluded through visual inspection, analogous to the mandatory inspection of genotyping assay scatterplots in SNP-based genome-wide association studies (GWAS; WTCCC, 2007). In addition, false-negative rates tend to be high (Barnes et al., 2008), which can, for example, cause rare CNV candidates to be overlooked in the data. Current visualization tools such as the Affymetrix® Genotyping Console work well with small datasets and are suitable for cancer diagnostics and similarly small-sized sample sets. The difficulty of bridging the gap between packages for CNV prediction, association analysis and visualization, which is needed to distinguish genuine signals from false-positive and false-negative predictions in generally noisy datasets, has hindered many scientists from conducting genome-wide CNV analysis of their existing large GWAS datasets. Here, we present CNVineta, an R package capable of handling CNV data derived from large datasets, with implemented analysis tools for the detection and visualization of disease-associated rare (Supplementary Fig. S4) and common CNVs (Supplementary Figs S5, S6). CNVineta was designed to allow researchers to access large datasets from standard desktop computers.
We hope that CNVineta will empower scientists to perform genome-wide CNV screening and quickly evaluate raw data of candidate CNVs in existing GWAS datasets.

2 FEATURES OF CNVINETA

2.1 Input and output

To allow rapid access to both predicted CNV data and raw data, the latter as log2 of raw data ratios (LRR) and B-allele frequencies (BAF), CNVineta requires the CNV prediction output from third-party tools [e.g. Affymetrix® Power Tools (APT) or QuantiSNP (Colella et al., 2007)] as well as the LRR and BAF to be converted to the CNVineta file format. The file format consists of (i) a single large binary data file holding LRR and BAF values for all samples; (ii) a SNP array annotation file; (iii) a sample annotation file; (iv) a segment annotation file containing data on predicted CNVs; and (v) a RefGene file with standard annotations for all genes in the human genome [UCSC refGene table (Karolchik et al., 2009)]. A detailed description of the CNVineta input file format and a conversion workflow are available in the online tutorial of CNVineta as well as in the package vignette. An overview of the CNVineta workflow is given in Figure 1A.
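To make the two raw data tracks concrete, the following minimal sketch computes the values one would expect for LRR and BAF under an idealized, noise-free model. It is not part of CNVineta: the function names are hypothetical, a diploid reference of two copies is assumed, and real array data are noisier and typically show compressed ratios.

```python
import math

def expected_lrr(copy_number, reference_copies=2):
    """Idealized log2 ratio of observed to expected signal (assumption: no noise)."""
    if copy_number <= 0:
        return float("-inf")  # homozygous deletion: no signal in the ideal case
    return math.log2(copy_number / reference_copies)

def expected_baf_clusters(copy_number):
    """Idealized B-allele frequencies: B-allele fraction for each possible genotype."""
    if copy_number <= 0:
        return []  # BAF is undefined when no allele copies are present
    return [b / copy_number for b in range(copy_number + 1)]

# One-copy deletion, normal diploid state, three-copy duplication
for cn in (1, 2, 3):
    print(cn, round(expected_lrr(cn), 3), expected_baf_clusters(cn))
```

Under these assumptions a heterozygous deletion shifts the LRR downwards and removes the heterozygous BAF band at 0.5, while a duplication shifts the LRR upwards and splits that band in two. This is the kind of pattern that visual inspection of the raw data is meant to confirm.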
Fig. 1. Workflow and screenshots from CNVineta. (A) Workflow. Before starting the CNV screening, the SNP array data have to be processed with a third-party CNV prediction algorithm. Subsequent CNV association screening can be performed for rare and/or common CNVs. The functions automatically generate result tables and graphs for all regions that were identified as associated by CNVineta. The visual data mining can be performed in a stepwise fashion. (B–D) Plotting results for a known common deletion at the IRGM gene locus (McCarroll et al., 2008) in the Affymetrix® 6.0 HapMap dataset (International HapMap Consortium, 2003) comprising 180 samples (CEU and YRI). (B) Regional overview plot. From top to bottom: the predicted CNVs (deletions highlighted by red horizontal lines), array probe sets within the region (SNP markers as black and non-polymorphic probe sets as blue vertical lines) and annotated genes (purple arrows). (C) Raw data plots. For each sample, the raw data visualization includes LRR (upper panel) and BAF (lower panel). (D) Heat map. Heat maps of LRR data can be generated to obtain a sample set-wide impression of a particular CNV and to identify potential false-positive and false-negative CNVs that should be subjected to further follow-up.

Help from the authors is available upon request (Google Group: http://groups.google.com/group/CNVineta).

2.2 Handling and visualization of CNVineta objects

CNVineta combines the CNV predictions and the raw data of the entire dataset. CNVs can be visualized in user-defined regions of the human genome across all samples using the regional overview option (Fig. 1B). As CNVs often overlap and vary in size, a graphical overview gives a quick orientation of the CNV structure at a particular region across the entire sample set. To demonstrate CNVineta's compatibility with large datasets, a regional overview of the proximal part of the chromosome 15q region (13 Mb) in 1587 samples is shown in Supplementary Figure S1.
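The selection step behind such a regional overview can be sketched as a simple interval-overlap query. This is an illustration only and does not reproduce CNVineta's own functions or file format: the `Segment` record, the function name and the example coordinates are all hypothetical.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    sample: str
    chrom: str
    start: int  # first base of the predicted CNV
    end: int    # last base of the predicted CNV (inclusive)
    kind: str   # "deletion" or "duplication"

def segments_in_region(segments, chrom, region_start, region_end):
    """Group the segments overlapping [region_start, region_end] on chrom by sample."""
    by_sample = {}
    for seg in segments:
        overlaps = seg.start <= region_end and seg.end >= region_start
        if seg.chrom == chrom and overlaps:
            by_sample.setdefault(seg.sample, []).append(seg)
    return by_sample

# Made-up example predictions, not taken from any real dataset
example_segments = [
    Segment("s1", "15", 20_500_000, 21_000_000, "deletion"),
    Segment("s2", "15", 32_900_000, 33_400_000, "duplication"),
    Segment("s3", "15", 40_000_000, 40_100_000, "deletion"),
    Segment("s4", "5", 20_600_000, 20_700_000, "deletion"),
]
region_hits = segments_in_region(example_segments, "15", 20_000_000, 33_000_000)
```

Each sample with at least one hit would then occupy one row of the overview, with its segments drawn as horizontal lines over the shared genomic axis.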

2.3 Genome-wide screening of rare and common CNVs

Analysis methods for genome-wide screening of rare and common CNVs are implemented. To reduce the complexity of the CNV predictions, CNVineta minimizes the number of genomic positions used for association analysis. To this end, the array marker set is reduced to those markers at which a predicted CNV in any sample either starts or ends (Supplementary Fig. S2). In CNVineta, these markers are called atoms. The collated atoms are then used to scan genome wide for rare CNVs and/or common CNVs using a logistic regression model. The latter analysis method allows for the inclusion of covariates, e.g. gender, age and/or eigenvectors from a principal components analysis, which can help to remove confounding genotyping batch effects and/or population stratification (Price et al., 2006). LRR and BAF plots of samples with CNVs at the candidate regions, as well as heat maps, can be generated automatically. Primary, temporary and resulting data can all be processed by existing and/or custom R functions.
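The reduction to atoms can be illustrated with a short sketch. It reflects one possible reading of the description above rather than CNVineta's actual implementation: it assumes that a segment's start and end map to the first and last array marker lying inside it, and all function names and example values are hypothetical.

```python
from bisect import bisect_left, bisect_right

def collect_atoms(markers, segments):
    """Return the sorted marker positions at which any predicted CNV starts or ends.

    markers:  sorted marker positions on one chromosome
    segments: iterable of (sample, start, end) predictions on that chromosome
    """
    atoms = set()
    for _sample, start, end in segments:
        first = bisect_left(markers, start)    # index of first marker inside the segment
        last = bisect_right(markers, end) - 1  # index of last marker inside the segment
        if first <= last:                      # skip segments that cover no marker
            atoms.add(markers[first])
            atoms.add(markers[last])
    return sorted(atoms)

def carriers_per_atom(atoms, segments, is_case):
    """Count (case, control) samples whose predicted CNV covers each atom."""
    counts = {}
    for atom in atoms:
        carriers = {s for s, start, end in segments if start <= atom <= end}
        n_cases = sum(1 for s in carriers if is_case[s])
        counts[atom] = (n_cases, len(carriers) - n_cases)
    return counts

# Made-up example: six markers, three samples with overlapping predictions
markers = [100, 200, 300, 400, 500, 600]
segments = [("s1", 150, 420), ("s2", 300, 500), ("s3", 290, 410)]
is_case = {"s1": True, "s2": False, "s3": True}

atoms = collect_atoms(markers, segments)
counts = carriers_per_atom(atoms, segments, is_case)
```

In this toy example only four of the six markers remain as atoms, so only four positions need to be tested. The carrier status of each sample at an atom could then serve as the predictor in a logistic regression on case-control status, alongside covariates; fitting that model is deliberately left out of the sketch.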

3 CONCLUSIONS

In this application note, we introduce CNVineta, an R package designed to meet the needs of scientists working with large-scale CNV data derived from SNP arrays. While many recently published CNV analysis programs focus on CNV prediction algorithms, CNVineta aims at handling, visualizing and mining these CNV predictions. The CNVineta workflow, outlined in detail in the online tutorial and Figure 1A, allows for in-depth data mining of a given dataset for associated rare as well as common CNVs for a phenotype of interest. Previous studies on CNVs in large sample sets genotyped with SNP arrays identified a large number of apparently false positive and negative CNV calls, which hampered an accurate interpretation and follow-up. CNVineta aims to overcome these difficulties by combining the visualization of CNV calls and the respective raw data.

Funding: German Ministry of Education and Research (BMBF) through the National Genome Research Network (NGFN); DFG excellence cluster ‘Inflammation at Interfaces’.

Conflict of Interest: none declared.

REFERENCES

1.  The International HapMap Project.

Authors:  International HapMap Consortium
Journal:  Nature       Date:  2003-12-18       Impact factor: 49.962

2.  Principal components analysis corrects for stratification in genome-wide association studies.

Authors:  Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal:  Nat Genet       Date:  2006-07-23       Impact factor: 38.330

3.  Comparing CNV detection methods for SNP arrays.

Authors:  Laura Winchester; Christopher Yau; Jiannis Ragoussis
Journal:  Brief Funct Genomic Proteomic       Date:  2009-09-08

4.  The UCSC Genome Browser.

Authors:  Donna Karolchik; Angie S Hinrichs; W James Kent
Journal:  Curr Protoc Bioinformatics       Date:  2009-12

5.  Copy number variation in human health, disease, and evolution.

Authors:  Feng Zhang; Wenli Gu; Matthew E Hurles; James R Lupski
Journal:  Annu Rev Genomics Hum Genet       Date:  2009       Impact factor: 8.929

6.  Finding the missing heritability of complex diseases.

Authors:  Teri A Manolio; Francis S Collins; Nancy J Cox; David B Goldstein; Lucia A Hindorff; David J Hunter; Mark I McCarthy; Erin M Ramos; Lon R Cardon; Aravinda Chakravarti; Judy H Cho; Alan E Guttmacher; Augustine Kong; Leonid Kruglyak; Elaine Mardis; Charles N Rotimi; Montgomery Slatkin; David Valle; Alice S Whittemore; Michael Boehnke; Andrew G Clark; Evan E Eichler; Greg Gibson; Jonathan L Haines; Trudy F C Mackay; Steven A McCarroll; Peter M Visscher
Journal:  Nature       Date:  2009-10-08       Impact factor: 49.962

7.  A robust statistical method for case-control association testing with copy number variation.

Authors:  Chris Barnes; Vincent Plagnol; Tomas Fitzgerald; Richard Redon; Jonathan Marchini; David Clayton; Matthew E Hurles
Journal:  Nat Genet       Date:  2008-09-07       Impact factor: 38.330

8.  Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn's disease.

Authors:  Steven A McCarroll; Alan Huett; Petric Kuballa; Shannon D Chilewski; Aimee Landry; Philippe Goyette; Michael C Zody; Jennifer L Hall; Steven R Brant; Judy H Cho; Richard H Duerr; Mark S Silverberg; Kent D Taylor; John D Rioux; David Altshuler; Mark J Daly; Ramnik J Xavier
Journal:  Nat Genet       Date:  2008-09       Impact factor: 38.330

9.  Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.

Authors:  Wellcome Trust Case Control Consortium
Journal:  Nature       Date:  2007-06-07       Impact factor: 49.962

10.  QuantiSNP: an Objective Bayes Hidden-Markov Model to detect and accurately map copy number variation using SNP genotyping data.

Authors:  Stefano Colella; Christopher Yau; Jennifer M Taylor; Ghazala Mirza; Helen Butler; Penny Clouston; Anne S Bassett; Anneke Seller; Christopher C Holmes; Jiannis Ragoussis
Journal:  Nucleic Acids Res       Date:  2007-03-06       Impact factor: 16.971


