DeAnnCNV: a tool for online detection and annotation of copy number variations from whole-exome sequencing data.

Yuanwei Zhang¹, Zhenhua Yu², Rongjun Ban², Huan Zhang¹, Furhan Iqbal³, Aiwu Zhao⁴, Ao Li⁵, Qinghua Shi⁶.

Abstract

With the decrease in costs, whole-exome sequencing (WES) has become a very popular and powerful tool for the identification of genetic variants underlying human diseases. However, integrated tools to precisely detect and systematically annotate copy number variations (CNVs) from WES data are still in great demand. Here, we present an online tool, DeAnnCNV (Detection and Annotation of Copy Number Variations from WES data), to meet the current demands of WES users. Upon submission of a file generated from WES data by an in-house tool that can be downloaded from our server, DeAnnCNV detects CNVs in each sample and extracts the CNVs shared among multiple samples. DeAnnCNV also provides additional supporting information for the detected CNVs and associated genes to help users find potential candidates for further experimental study. The web server is implemented in PHP + Perl + MATLAB and is freely available online to all users at http://mcg.ustc.edu.cn/db/cnv/.

Year: 2015 PMID: 26013811 PMCID: PMC4489280 DOI: 10.1093/nar/gkv556

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Various genetic variants have been associated with human diseases, of which copy number variations (CNVs) are of great importance (1) as genome-wide CNVs are reported to be involved in various human diseases including cancer (2), autism (3,4), schizophrenia (5) and intellectual disability (6). Cancer studies show that segmental deletions or duplications of chromosomes frequently occur throughout the process of tumorigenesis and progression (7,8). These aberrations are often associated with abnormal expression of tumor suppressors and oncogenes (9). Therefore, accurate detection of CNVs is an important step to identify disease-causing genes and functionally disrupted pathways. Advances in experimental technologies from array-based technologies including array comparative genomic hybridization (10) and single nucleotide polymorphism genotyping (11) to recent high-throughput DNA sequencing (12) have greatly promoted studies on human genomes. As whole-exome sequencing (WES) continues to become cheaper and more reliable, it has been demonstrated to be an effective alternative to whole-genome sequencing for the identification of genetic variants underlying human diseases. Several state-of-the-art tools (13–16) have been developed to discover CNVs from WES data. These methods can be classified into two categories on the basis of the approach used: (i) those that detect deviations in read counts among a pool of examined samples without the need for control samples, such as CoNIFER (13) and XHMM (14); (ii) those that find deviations in the read count ratio by comparing the examined samples with controls, such as ExomeCNV (15) and EXCAVATOR (16). Most of these tools are stand-alone programs that require users to set up a local computational environment with the necessary hardware and software, which is sometimes difficult for users, or even impossible if the technical requirements cannot be met. 
On the other hand, few tools are available for systematic functional annotation of CNVs by integrating currently available resources (17–20). These tools need a file containing the genome coordinates of CNVs as input, and the annotation process is performed by finding genomic overlaps between the input and annotation features. However, sample information is not provided in the annotation results from these tools, which makes it inconvenient for users to assign the annotation information to the specific sample carrying a CNV, especially when annotating CNVs found in cohort studies. To our knowledge, integrated pipelines for detection and annotation of CNVs from WES data have not been reported yet. Therefore, online bioinformatics tools that can precisely detect and systematically annotate CNVs are highly needed for WES data. Here we introduce DeAnnCNV, an efficient web server designed for integrating Detection and Annotation of Copy Number Variations from WES data. DeAnnCNV is capable of identifying CNVs from each sample accurately based on our previously published algorithm GPHMM (21) and providing detailed visualization of the detected CNVs. 
It can also extract CNVs shared by multiple samples and further annotate them extensively based on several supporting features including: (i) whether a CNV has been reported or not (documented in dbVar (22)); (ii) detailed information on genes associated with CNVs; (iii) whether genetic variants of these genes have been reported in human diseases (collected from ClinVar (23)); (iv) phenotypes of mice deficient for these genes (collected from Mouse Genome Informatics (MGI) (24)); (v) mRNA expression of these genes in human tissues and cell lines; (vi) functional enrichment analysis for these genes (including enriched Gene Ontology (GO), pathway and protein domains) and (vii) construction of a protein–protein interaction (PPI) network for the genes involved in CNVs, in which it is indicated whether a gene is associated with a human disorder. To verify the practicability of our tool, we applied DeAnnCNV to a study of infertile men and found that two patients each carried a CNV with copy number loss (only one of the two copies retained) and that both CNVs encompassed the gene PABPN1L, hemizygous deletion of which causes male infertility in mice. This result indicates that DeAnnCNV is a powerful and reliable tool for the detection and annotation of CNVs from WES data.

WEB SERVER CONSTRUCTION

DeAnnCNV consists of two modules: (i) detection and visualization of CNVs from each sample, and finding CNVs shared by patients; (ii) annotation of the detected CNVs and the associated genes.

Detection of CNVs

For both patient and control samples, read counts for all exons were extracted from each sample. To make read counts comparable among samples, sample normalization was performed for each patient sample by dividing the read counts of an exon by the total read counts of all exons (25). We averaged read counts from all the control samples to create a common reference that was used to represent read counts of the normal genome. The ratio between the read counts of each patient and the reference was then calculated and further normalized to eliminate Guanine-Cytosine content (GC-content) bias, which allowed ratio comparison between different genomic loci. We used the logarithm of the normalized read count ratio to represent the copy number profile of each patient. The CNVs of each patient sample were then detected by a hidden Markov model (HMM) with hidden states corresponding to different CNVs (Supplementary Methods). Furthermore, a reliability score was calculated for each detected CNV to evaluate the reliability of DeAnnCNV results (Supplementary Methods). To assess the performance of DeAnnCNV, we simulated 10 samples, each containing a distinct complement of 10 CNVs. Results of this simulation are presented in Supplementary Table S2. Three measurements, precision, recall and F-measure, were calculated to evaluate the CNV detection performance of DeAnnCNV (Supplementary Methods). DeAnnCNV achieved a high F-measure (≥0.96) across all the simulated samples, attesting to its performance for the detection of CNVs from WES data. In addition, we investigated the ability of DeAnnCNV to distinguish between different copy numbers, and the results in Supplementary Table S3 demonstrated that DeAnnCNV can precisely estimate the copy number of each segment. Supplementary Methods contain the detailed description of the simulation procedure and the performance evaluation strategy. 
Following the detection of CNVs in each sample, CNVs covering overlapping genomic regions were extracted as shared CNVs (for a single sample, the share number was set to 1).
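
The normalization steps described above can be sketched as follows. This is a minimal illustration rather than the server's GPHMM-based implementation: the function names, the base-2 logarithm, the pseudocount `eps` and the median-per-GC-bin correction are all assumptions chosen for clarity, since the actual procedure is specified in the Supplementary Methods.

```python
import math
from statistics import median

def normalize(counts):
    """Divide each exon's read count by the sample's total read count."""
    total = sum(counts)
    return [c / total for c in counts]

def reference_profile(controls):
    """Average the normalized read counts of all control samples, exon by exon."""
    normalized = [normalize(c) for c in controls]
    return [sum(exon) / len(exon) for exon in zip(*normalized)]

def log_ratio_profile(patient, controls, gc, bin_width=0.05, eps=1e-9):
    """Log2 ratio of patient to reference, centered within GC-content bins.

    The binning-based GC correction and the pseudocount `eps` are
    assumptions of this sketch, not the published method.
    """
    ref = reference_profile(controls)
    pat = normalize(patient)
    ratios = [(p + eps) / (r + eps) for p, r in zip(pat, ref)]
    # Group exon indices by GC-content bin.
    bins = {}
    for i, g in enumerate(gc):
        bins.setdefault(int(g / bin_width), []).append(i)
    # Divide each ratio by the median ratio of its GC bin.
    corrected = ratios[:]
    for idx in bins.values():
        m = median(ratios[i] for i in idx)
        for i in idx:
            corrected[i] = ratios[i] / m
    return [math.log2(x) for x in corrected]
```

Under these assumptions, an exon with no copy number change yields a log ratio near 0, and the resulting profile would serve as the observation sequence for the HMM.
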
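
One way to picture the extraction of shared CNVs is to count, for every CNV, the distinct samples that carry an overlapping CNV. The sketch below assumes inclusive coordinates and ignores the copy number state when testing overlap; the `CNV` record and function names are hypothetical and do not reflect the server's internal data model.

```python
from typing import NamedTuple

class CNV(NamedTuple):
    sample: str
    chrom: str
    start: int
    end: int  # inclusive end coordinate (assumption)

def overlaps(a, b):
    """True when two CNVs lie on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

def share_numbers(cnvs):
    """Map each CNV to the number of distinct samples carrying an overlapping CNV."""
    result = {}
    for c in cnvs:
        samples = {o.sample for o in cnvs if overlaps(c, o)}
        result[c] = len(samples)  # counts the CNV's own sample, so at least 1
    return result

def shared_cnvs(cnvs, min_samples=2):
    """Keep CNVs whose share number reaches the chosen threshold."""
    return [c for c, n in share_numbers(cnvs).items() if n >= min_samples]
```

With a single sample every CNV receives a share number of 1, which matches the convention stated above.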

Annotation of CNVs

The information used for the annotation of the CNVs was collected from multiple sources and stored in MySQL. The gene location information was downloaded from Ensembl (GRCh37 and GRCh38) (26). The information on reported CNVs was downloaded from dbVar (22). The information regarding genetic variants of CNV-associated genes in human disease was collected from ClinVar. The GO, pathway and protein domain information used for enrichment analysis was retrieved from DAVID (27). The enrichment P values were calculated by Fisher's exact test, and Bonferroni correction was applied to obtain adjusted P values. The mRNA expression data were downloaded from The Human Protein Atlas (28). The PPI information was integrated from several major public databases, including HPRD (29), BioGRID (30), DIP (31), MINT (32), IntAct (33) and STRING (34), with redundant PPIs removed. The Cytoscape web application (35) was used to visualize the retrieved PPI network.
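
The enrichment statistics can be illustrated with a short sketch. It assumes a one-sided (over-representation) Fisher's exact test, computed as the upper tail of the hypergeometric distribution, and a fold enrichment defined as the term frequency in the gene list divided by the background frequency; both choices, as well as the function names, are assumptions rather than the server's documented implementation.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test P value for over-representation.

    k: CNV-associated genes annotated with the term
    n: all CNV-associated genes
    K: background genes annotated with the term
    N: all background genes
    Returns P(X >= k) under the hypergeometric distribution.
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def enrichment_fold(k, n, K, N):
    """Term frequency in the gene list divided by the background frequency."""
    return (k / n) / (K / N)
```

Bonferroni correction is conservative when many terms are tested, which is one reason to show both raw and adjusted P values.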

UTILITY AND WEB INTERFACE

Preparation of uploaded files

Using the PreprocessFile package (provided by our server), files (*.count and *.gc) containing the read counts for each sample and the GC content of target regions can be generated from WES data and compressed into a single file (*.tar.gz) automatically. For the convenience of users, a shell script named ‘run.sh’ is included in the package to generate the files supported by DeAnnCNV.

Data analysis

The compressed file (*.tar.gz) can be directly uploaded to the DeAnnCNV server (Figure 1A). After the uploaded data are decompressed (Figure 1B), users are guided to pages where they can either keep the default parameters or set their own parameters for CNV detection. To initiate the analysis, users should assign each sample as patient or control. If no control is assigned, DeAnnCNV will use a default control (generated from in-house WES data of normal healthy males) as the reference. Users then need to select the genome version. Currently, our tool supports GRCh37 and GRCh38. Following this, users have the option to set the threshold score (default 80), which describes the strength of the evidence for the detected CNVs. In addition, users can modify the parameter ‘Number of patients sharing the same CNV’ to fetch CNVs shared among multiple samples (for a single sample, the share number is set to 1). Users can also adjust the percentage of a gene that must be covered by a CNV for the gene to be defined as CNV associated. Once all the parameters are set, users should click the ‘Finish’ button to start the analysis (Figure 1C). A page will then display the status of the job and link to the detailed results (Figure 1D).
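
The gene coverage parameter can be made concrete with a small sketch. It assumes that coverage is measured over the full gene span with inclusive coordinates; the function names are hypothetical, and the 90% default simply mirrors the cutoff used in the case study rather than a documented server default.

```python
def gene_coverage(gene_start, gene_end, cnv_start, cnv_end):
    """Percentage of a gene's span (inclusive coordinates) covered by a CNV."""
    overlap = min(gene_end, cnv_end) - max(gene_start, cnv_start) + 1
    if overlap <= 0:
        return 0.0
    return 100.0 * overlap / (gene_end - gene_start + 1)

def associated_genes(genes, cnv, min_coverage=90.0):
    """Names of genes on the CNV's chromosome covered at or above the cutoff.

    genes: iterable of (name, chrom, start, end) tuples
    cnv: (chrom, start, end) tuple
    """
    chrom, start, end = cnv
    return [name for name, g_chrom, g_start, g_end in genes
            if g_chrom == chrom
            and gene_coverage(g_start, g_end, start, end) >= min_coverage]
```

Raising the cutoff restricts the gene list to genes that are almost entirely deleted or duplicated, while lowering it also admits partially affected genes.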

Figure 1.

Input page of DeAnnCNV and parameters for the CNVs detection.

Results and description

Once a job is completed, the results page will display the parameters used for the detection of CNVs and provide links to the detailed results. These results include CNV-associated information, enrichment analysis and the PPI network.

CNV associated information

This page displays detailed information on CNVs and associated genes. For CNVs, the information includes: the chromosome location of each CNV, copy number, copy number gain (copy number > 2) or loss (copy number < 2), score of the CNV, sample carrying the CNV, number of samples sharing the same CNV and whether the CNV is reported in dbVar (Figure 2A). By clicking the sample ID or share number, figures illustrating all the CNVs detected in this sample or all the samples carrying a specific CNV will be displayed, respectively (Figure 2B and C). For genes involved in the CNVs, the information includes: chromosome location of genes, percentage of each gene covered by a given CNV (coverage), association with human disease reported in ClinVar, phenotypes of deficient mice in MGI and mRNA expression in human tissues and cell lines (Figure 2D). By clicking the names of tissues or cell lines, the corresponding histogram will be displayed to illustrate the mRNA expression (Figure 2E and F).
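
The gain/loss labelling and score filtering shown in the results table can be sketched as below. The record layout (dictionaries with `copy_number` and `score` keys), the `neutral` label and the strict `>` comparison against the threshold are assumptions made for illustration only.

```python
def cnv_type(copy_number, normal=2):
    """Label a CNV as gain or loss relative to the normal copy number of 2."""
    if copy_number > normal:
        return "gain"
    if copy_number < normal:
        return "loss"
    return "neutral"

def select_cnvs(cnvs, kind, min_score=80):
    """Keep CNV records of the requested type whose score exceeds the threshold."""
    return [c for c in cnvs
            if cnv_type(c["copy_number"]) == kind and c["score"] > min_score]
```

Selecting `kind="loss"` corresponds to sorting on the gain/loss column and keeping only the deletions.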

Figure 2.

An example output page of DeAnnCNV.

Enrichment analysis

This page displays the enrichment analysis results for CNV-associated genes, including the enriched chromosomes, GO terms, pathways and protein domains. The enrichment fold, P values (Fisher's exact test) and Bonferroni-adjusted P values are displayed for each annotated term (Figure 2G). Users can select the CNV type and annotation categories from the drop-down box and adjust the P value and enrichment fold cutoffs to refine the results. The detailed information for each enriched term can be viewed by clicking the term, which opens a page from the resource providing the annotation. Users can save the results of the enrichment analysis by clicking the download button in the footer navigation bar of the Enrichment analysis page.

PPI network

This page displays the PPI network for the genes involved in the detected CNVs, so that users can determine whether previously reported disease-causing genes interact with them. CNV-associated genes are presented as diamonds and their interacting genes as circles. Colored nodes represent genes with disease information reported in ClinVar. By clicking a colored node, the disease information will be displayed (Figure 2H).
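
The network view can be approximated by merging interaction pairs from several sources, removing redundant pairs, and tagging each node with the display attributes described above. The sketch treats interactions as undirected and drops self-interactions; these choices and all function names are assumptions, and the gene identifiers in any usage would be placeholders.

```python
def merge_ppis(*sources):
    """Union PPI pairs from several sources, treating A-B and B-A as one edge."""
    edges = set()
    for source in sources:
        for a, b in source:
            if a != b:  # self-interactions are skipped in this sketch
                edges.add(frozenset((a, b)))
    return edges

def cnv_subnetwork(edges, cnv_genes, disease_genes):
    """Keep edges touching a CNV gene and describe how each node would be drawn."""
    cnv_genes = set(cnv_genes)
    kept = [e for e in edges if e & cnv_genes]
    nodes = {}
    for edge in kept:
        for gene in edge:
            nodes[gene] = {
                "shape": "diamond" if gene in cnv_genes else "circle",
                "colored": gene in disease_genes,  # disease information available
            }
    return kept, nodes
```

The returned node dictionary could then be handed to any graph viewer; the choice of viewer is independent of this filtering step.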

CASE STUDY AND DISCUSSION

We performed WES for male infertile patients with the same clinical phenotype; however, no potential disease-causing candidate variants (SNVs or indels) were found in the exome regions. Thus, we explored whether any CNVs might be related to male infertility. After the files generated from WES data of four patients were uploaded to the DeAnnCNV server, a total of 45 CNVs (score > 80) and 256 associated genes (coverage > 90%) were detected in these patients. The detailed results can be accessed at http://mcg.ustc.edu.cn/db/cnv/check.php?rand_num=7235007981. By sorting on the ‘gain/loss’ column of the results table, 13 CNVs with copy number loss were selected for further analysis (Supplementary Figure S1A). Using the phenotype information presented by DeAnnCNV, we focused on the CNVs encompassing genes that carry reported infertility-associated variations in ClinVar or genes for which the deficient mice are infertile or subfertile. Interestingly, by searching for ‘infertility’ in the ‘MGI-KO’ column, we found that two patients, patients 3 and 4, carried two CNVs (copy number = 1) that shared a gene, PABPN1L, localized at human 16q24.3. According to MGI annotation, male mice with hemizygous deletion of Pabpn1l (localized at mouse chromosome 8) are infertile (Supplementary Figure S1B). The presence of the two CNVs in the patients was confirmed experimentally using the ΔΔCT method as previously described (36,37). A reduction in the amount of PABPN1L gDNA was observed in both patients compared with the control (Figure 3). Consistently, in the ‘mRNA Expression’ column of the results table, PABPN1L was found to be expressed in human testis (Supplementary Figure S1B). Moreover, in patients 3 and 4, we did not detect any other CNVs encompassing genes that are highly expressed in human testis or that impair human or mouse fertility when mutated. 
Taken together, these results indicate that loss of one copy of PABPN1L is the most probable cause of male infertility in these two patients, although functional research on how haploinsufficiency of PABPN1L results in male infertility is needed. This case study indicates that DeAnnCNV can not only detect CNVs based on WES data but also help users identify potential disease-associated CNVs through comprehensive annotation of the detected CNVs and associated genes.

Figure 3.

Candidate CNVs detected by DeAnnCNV were confirmed by qPCR. Schematic representation of CNVs in the genomic region (A). Control samples with different gDNA amounts (5, 10 and 20 ng) were used to imitate the ΔCT generated by one copy, two copies and four copies of PABPN1L. Ten nanograms of patient gDNA were used as template. The ΔΔCT value for 10 ng of control gDNA was set as 0. All samples were then normalized to the calibrator to determine ΔΔCT values (B).
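
The relation between ΔΔCT and copy number used in this calibration can be sketched as follows. The sketch assumes ideal amplification (template doubling every cycle) and the convention that a positive ΔΔCT means less template than the calibrator; the function names and the `efficiency` parameter are illustrative assumptions, not part of the published protocol.

```python
def relative_quantity(ddct, efficiency=2.0):
    """Template amount relative to the calibrator for a given ΔΔCT value."""
    return efficiency ** (-ddct)

def estimated_copies(ddct, calibrator_copies=2):
    """Copy number implied by ΔΔCT when the calibrator carries two copies."""
    return calibrator_copies * relative_quantity(ddct)
```

Under these assumptions, the 5 ng and 20 ng controls would be expected near ΔΔCT values of +1 and -1, corresponding to one and four copies, and a patient with a hemizygous deletion would fall near +1.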

CONCLUSION

In conclusion, we have described DeAnnCNV, a web-based tool capable of systematic detection and annotation of CNVs from WES data. The server integrates two separate modules, one for the detection and one for the annotation of CNVs. With the annotation of genes involved in CNVs, users can conveniently screen for potential disease-associated CNVs.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

35 in total

1. Exome sequencing-based copy-number variation and loss of heterozygosity detection: ExomeCNV.

Authors: Jarupon Fah Sathirapongsasuti; Hane Lee; Basil A J Horst; Georg Brunner; Alistair J Cochran; Scott Binder; John Quackenbush; Stanley F Nelson
Journal: Bioinformatics Date: 2011-08-09 Impact factor: 6.937

2. Discovery and statistical genotyping of copy-number variation from whole-exome sequencing depth.

Authors: Menachem Fromer; Jennifer L Moran; Kimberly Chambert; Eric Banks; Sarah E Bergen; Douglas M Ruderfer; Robert E Handsaker; Steven A McCarroll; Michael C O'Donovan; Michael J Owen; George Kirov; Patrick F Sullivan; Christina M Hultman; Pamela Sklar; Shaun M Purcell
Journal: Am J Hum Genet Date: 2012-10-05 Impact factor: 11.025

3. wANNOVAR: annotating genetic variants for personal genomes via the web.

Authors: Xiao Chang; Kai Wang
Journal: J Med Genet Date: 2012-06-20 Impact factor: 6.318

4. Autism genome-wide copy number variation reveals ubiquitin and neuronal genes.

Authors: Joseph T Glessner; Kai Wang; Guiqing Cai; Olena Korvatska; Cecilia E Kim; Shawn Wood; Haitao Zhang; Annette Estes; Camille W Brune; Jonathan P Bradfield; Marcin Imielinski; Edward C Frackelton; Jennifer Reichert; Emily L Crawford; Jeffrey Munson; Patrick M A Sleiman; Rosetta Chiavacci; Kiran Annaiah; Kelly Thomas; Cuiping Hou; Wendy Glaberson; James Flory; Frederick Otieno; Maria Garris; Latha Soorya; Lambertus Klei; Joseph Piven; Kacie J Meyer; Evdokia Anagnostou; Takeshi Sakurai; Rachel M Game; Danielle S Rudd; Danielle Zurawiecki; Christopher J McDougle; Lea K Davis; Judith Miller; David J Posey; Shana Michaels; Alexander Kolevzon; Jeremy M Silverman; Raphael Bernier; Susan E Levy; Robert T Schultz; Geraldine Dawson; Thomas Owley; William M McMahon; Thomas H Wassink; John A Sweeney; John I Nurnberger; Hilary Coon; James S Sutcliffe; Nancy J Minshew; Struan F A Grant; Maja Bucan; Edwin H Cook; Joseph D Buxbaum; Bernie Devlin; Gerard D Schellenberg; Hakon Hakonarson
Journal: Nature Date: 2009-04-28 Impact factor: 49.962

5. Copy number variation detection and genotyping from exome sequence data.

Authors: Niklas Krumm; Peter H Sudmant; Arthur Ko; Brian J O'Roak; Maika Malig; Bradley P Coe; Aaron R Quinlan; Deborah A Nickerson; Evan E Eichler
Journal: Genome Res Date: 2012-05-14 Impact factor: 9.043

6. MINT, the molecular interaction database: 2012 update.

Authors: Luana Licata; Leonardo Briganti; Daniele Peluso; Livia Perfetto; Marta Iannuccelli; Eugenia Galeota; Francesca Sacco; Anita Palma; Aurelio Pio Nardozza; Elena Santonico; Luisa Castagnoli; Gianni Cesareni
Journal: Nucleic Acids Res Date: 2011-11-16 Impact factor: 16.971

7. cn.MOPS: mixture of Poissons for discovering copy number variations in next-generation sequencing data with a low false discovery rate.

Authors: Günter Klambauer; Karin Schwarzbauer; Andreas Mayr; Djork-Arné Clevert; Andreas Mitterecker; Ulrich Bodenhofer; Sepp Hochreiter
Journal: Nucleic Acids Res Date: 2012-02-01 Impact factor: 16.971

8. STRING v9.1: protein-protein interaction networks, with increased coverage and integration.

Authors: Andrea Franceschini; Damian Szklarczyk; Sune Frankild; Michael Kuhn; Milan Simonovic; Alexander Roth; Jianyi Lin; Pablo Minguez; Peer Bork; Christian von Mering; Lars J Jensen
Journal: Nucleic Acids Res Date: 2012-11-29 Impact factor: 16.971

9. DbVar and DGVa: public archives for genomic structural variation.

Authors: Ilkka Lappalainen; John Lopez; Lisa Skipper; Timothy Hefferon; J Dylan Spalding; John Garner; Chao Chen; Michael Maguire; Matt Corbett; George Zhou; Justin Paschall; Victor Ananiev; Paul Flicek; Deanna M Church
Journal: Nucleic Acids Res Date: 2012-11-27 Impact factor: 16.971

10. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists.

Authors: Da Wei Huang; Brad T Sherman; Richard A Lempicki
Journal: Nucleic Acids Res Date: 2008-11-25 Impact factor: 16.971

8 in total

1. Clinical Utility of Next-Generation Sequencing for Developmental Disorders in the Rehabilitation Department: Experiences from a Single Chinese Center.

Authors: Yun Liu; Xiaomei Liu; Dongdong Qin; Yiming Zhao; Xuanlan Cao; Xiaoli Deng; Yu Cheng; Fuping Liu; Fang Yang; Tiesong Zhang; Xiu-An Yang
Journal: J Mol Neurosci Date: 2020-09-21 Impact factor: 3.444

2. iCopyDAV: Integrated platform for copy number variations-Detection, annotation and visualization.

Authors: Prashanthi Dharanipragada; Sriharsha Vogeti; Nita Parekh
Journal: PLoS One Date: 2018-04-05 Impact factor: 3.240

3. A Novel Role for CSRP1 in a Lebanese Family with Congenital Cardiac Defects.

Authors: Amina Kamar; Akl C Fahed; Kamel Shibbani; Nehme El-Hachem; Salim Bou-Slaiman; Mariam Arabi; Mazen Kurban; Jonathan G Seidman; Christine E Seidman; Rachid Haidar; Elias Baydoun; Georges Nemer; Fadi Bitar
Journal: Front Genet Date: 2017-12-18 Impact factor: 4.599

4. Accurate Inference of Tumor Purity and Absolute Copy Numbers From High-Throughput Sequencing Data.

Authors: Xiguo Yuan; Zhe Li; Haiyong Zhao; Jun Bai; Junying Zhang
Journal: Front Genet Date: 2020-04-30 Impact factor: 4.599

5. A verified genomic reference sample for assessing performance of cancer panels detecting small variants of low allele frequency.

Authors: Wendell Jones; Binsheng Gong; Natalia Novoradovskaya; Dan Li; Rebecca Kusko; Todd A Richmond; Donald J Johann; Halil Bisgin; Sayed Mohammad Ebrahim Sahraeian; Pierre R Bushel; Mehdi Pirooznia; Katherine Wilkins; Marco Chierici; Wenjun Bao; Lee Scott Basehore; Anne Bergstrom Lucas; Daniel Burgess; Daniel J Butler; Simon Cawley; Chia-Jung Chang; Guangchun Chen; Tao Chen; Yun-Ching Chen; Daniel J Craig; Angela Del Pozo; Jonathan Foox; Margherita Francescatto; Yutao Fu; Cesare Furlanello; Kristina Giorda; Kira P Grist; Meijian Guan; Yingyi Hao; Scott Happe; Gunjan Hariani; Nathan Haseley; Jeff Jasper; Giuseppe Jurman; David Philip Kreil; Paweł Łabaj; Kevin Lai; Jianying Li; Quan-Zhen Li; Yulong Li; Zhiguang Li; Zhichao Liu; Mario Solís López; Kelci Miclaus; Raymond Miller; Vinay K Mittal; Marghoob Mohiyuddin; Carlos Pabón-Peña; Barbara L Parsons; Fujun Qiu; Andreas Scherer; Tieliu Shi; Suzy Stiegelmeyer; Chen Suo; Nikola Tom; Dong Wang; Zhining Wen; Leihong Wu; Wenzhong Xiao; Chang Xu; Ying Yu; Jiyang Zhang; Yifan Zhang; Zhihong Zhang; Yuanting Zheng; Christopher E Mason; James C Willey; Weida Tong; Leming Shi; Joshua Xu
Journal: Genome Biol Date: 2021-04-16 Impact factor: 13.583

6. Benchmarking germline CNV calling tools from exome sequencing data.

Authors: Veronika Gordeeva; Elena Sharova; Konstantin Babalyan; Rinat Sultanov; Vadim M Govorun; Georgij Arapidi
Journal: Sci Rep Date: 2021-07-13 Impact factor: 4.379

7. cnvScan: a CNV screening and annotation tool to improve the clinical utility of computational CNV prediction from exome sequencing data.

Authors: Pubudu Saneth Samarakoon; Hanne Sørmo Sorte; Asbjørg Stray-Pedersen; Olaug Kristin Rødningen; Torbjørn Rognes; Robert Lyle
Journal: BMC Genomics Date: 2016-01-14 Impact factor: 3.969

8. Anaconda: AN automated pipeline for somatic COpy Number variation Detection and Annotation from tumor exome sequencing data.

Authors: Jianing Gao; Changlin Wan; Huan Zhang; Ao Li; Qiguang Zang; Rongjun Ban; Asim Ali; Zhenghua Yu; Qinghua Shi; Xiaohua Jiang; Yuanwei Zhang
Journal: BMC Bioinformatics Date: 2017-10-03 Impact factor: 3.169
