Francesco Lescai, Silvia Bonfiglio, Chiara Bacchelli, Estelle Chanudet, Aoife Waters, Sanjay M Sisodiya, Dalia Kasperavičiūtė, Julie Williams, Denise Harold, John Hardy, Robert Kleta, Sebahattin Cirak, Richard Williams, John C Achermann, John Anderson, David Kelsell, Tom Vulliamy, Henry Houlden, Nicholas Wood, Una Sheerin, Gian Paolo Tonini, Donna Mackay, Khalid Hussain, Jane Sowden, Veronica Kinsler, Justyna Osinska, Tony Brooks, Mike Hubank, Philip Beales, Elia Stupka.
Abstract
Recent advances in genomics technologies have spurred unprecedented efforts in genome and exome re-sequencing, aiming to unravel the genetic component of rare and complex disorders. While in rare disorders this has allowed the identification of novel causal genes, the missing heritability paradox in complex diseases remains unresolved. Despite rapid advances in next-generation sequencing, both the technology and the analysis of the data it produces are in their infancy. At present there is abundant knowledge pertaining to the role of rare single nucleotide variants (SNVs) in rare disorders and of common SNVs in common disorders. Although the 1000 Genomes Project has clearly highlighted the prevalence of rare variants and more complex variants (e.g. insertions, deletions), their role in disease is as yet far from elucidated.
We set out to analyse the properties of sequence variants identified in a comprehensive collection of exome re-sequencing studies performed on samples from patients affected by a broad range of complex and rare diseases (N = 173). Given the known potential for Loss of Function (LoF) variants to be false positives, we performed an extensive validation of the common, rare and private LoF variants identified, which indicated that most of the private and rare variants identified were indeed true, while common novel variants had a significantly higher false positive rate. Our results indicated a strong enrichment of very low-frequency insertion/deletion variants, so far under-investigated, which might be difficult to capture with low coverage and imputation approaches and for which most study designs would be under-powered. These insertions and deletions might play a significant role in disease genetics, contributing specifically to the underlying rare and private variation predicted to be discovered through next-generation sequencing.
Year: 2012 PMID: 23251486 PMCID: PMC3522676 DOI: 10.1371/journal.pone.0051292
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Rationale.
We processed 162 exomes from different diseases (comprising 22 familial cases with Mendelian inheritance and 140 sporadic/complex disease cases) and 11 samples from a different set of familial rare disorders with Mendelian inheritance, to be used for validation. Both sets were processed with our pipeline, which is based on Novoalign, Dindel and our own annotation script built on the ENSEMBL API. With the same pipeline we processed 21 samples from the exome dataset of the 1000 Genome Consortium, to be used for comparison. The 1000 Genome Consortium INDEL release of October 2010 was also annotated with the same script, and the annotation data were compared. INDELs called in the two UCL datasets were compared to identify common and private ones, and to select a representative set for validation with Sanger sequencing.
Figure 2. INDELs characteristics.
Figure 2A shows a comparison of the length of INDEL variants called in our patients and those available in the same capture regions in the ENSEMBL database and in the 1000 Genome Consortium release. The plot shows a higher presence of 1 bp insertions/deletions in ENSEMBL, and an increased size detection capability in 1000 Genome data, obtained from whole genome sequencing. For the INDELs already described in ENSEMBL, Figure 2B shows the correlation between the size of the variant sequenced in our samples and the length reported in the database (r² = 0.9221 for insertions and 0.4213 for deletions, both with p value < 2 × 10⁻¹⁶). Figure 2C shows the distribution of the distance (i.e. the difference between start positions) between the INDELs as they were called by Dindel on our data, or as released by 1000 Genome Consortium, and the corresponding ones present in ENSEMBL.
Figure 3. INDELs frequency and validation.
Figure 3A plots the non-reference allele frequency of INDELs called in our samples, divided into three categories: those already described in ENSEMBL, those described only in the 1000 Genome release, and those completely novel to our dataset, most of which are rare. Figure 3B shows the counts of validated INDELs according to the following frequency categories: common (non-reference allele frequency equal to or higher than 0.05), rare (frequency lower than 0.05) and private. The validation rate is significantly different in the three groups (Chi-squared = 44.4844, p-value = 2.189 × 10⁻¹⁰).
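The frequency categories used in Figure 3B can be sketched in a few lines of Python. This is only an illustration: the function name and its parameters are hypothetical, and the caption does not define "private" precisely, so the sketch assumes it means a variant seen in exactly one sample. Only the 0.05 cut-off comes from the caption.

```python
COMMON_THRESHOLD = 0.05  # cut-off stated in the Figure 3B caption


def frequency_category(alt_allele_count, total_alleles, carrier_samples):
    """Classify an INDEL as 'private', 'rare' or 'common'.

    Assumption (not from the source): 'private' means the variant was
    observed in exactly one sample of the dataset.
    """
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    if carrier_samples == 1:
        return "private"
    # Non-reference allele frequency within the dataset.
    freq = alt_allele_count / total_alleles
    return "common" if freq >= COMMON_THRESHOLD else "rare"
```

For a cohort of 173 diploid samples, `total_alleles` would be 346 at a fully covered site; in practice the denominator depends on how many samples have a call at that position.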
INDELs counts.
| | described in ENSEMBL | | | described in 1000 Genome only | | | novel | | | | total per consequence | |
| consequence | average (std.dev) | average rare (std.dev) | Ratio rare/total | average (std.dev) | average rare (std.dev) | Ratio rare/total | average (std.dev) | average rare (std.dev) | average rare complex | Ratio rare/total | average | rare |
| ESSENTIAL_SPLICE_SITE | 5.25 (1.43) | 0.24 (0.44) | 0.045714286 | 0.34 (0.52) | 0.07 (0.26) | 0.205882353 | 0.61 (0.84) | 0.43 (0.68) | 0.38 (0.65) | 0.704918033 | 6.2 | 0.74 |
| STOP | 0 (0) | 0 (0) | | 0 (0) | 0 (0) | | 0.03 (0.21) | 0.03 (0.21) | 0.02 (0.19) | 1 | 0.03 | 0.03 |
| COMPLEX_INDEL | 3.13 (1.01) | 0.17 (0.41) | 0.054313099 | 0.36 (0.52) | 0.09 (0.29) | 0.25 | 0.47 (0.63) | 0.24 (0.43) | 0.21 (0.41) | 0.510638298 | 3.96 | 0.51 |
| FRAMESHIFT_CODING | 69.4 (7.38) | 3.24 (2) | 0.046685879 | 3.28 (1.73) | 1.15 (1.05) | 0.350609756 | 23.49 (5.43) | 15.73 (4.89) | 13.81 (5.51) | 0.670940171 | 96.17 | 20.11 |
| NON_SYNONYMOUS_CODING | 68.33 (6.57) | 2.59 (1.8) | 0.037920937 | 2.24 (1.11) | 0.47 (0.67) | 0.209821429 | 6.69 (3.19) | 6.08 (3.13) | 5.2 (3.07) | 0.908819133 | 77.26 | 9.13 |
| SPLICE_SITE | 98.42 (7.53) | 3.35 (2.15) | 0.034044715 | 4.69 (2.01) | 0.97 (0.99) | 0.206823028 | 7.05 (3) | 5.21 (2.74) | 4.84 (2.68) | 0.739007092 | 110.16 | 9.53 |
| 5PRIME_UTR | 22.78 (3.18) | 0.66 (0.92) | 0.02907489 | 0.53 (0.64) | 0.14 (0.35) | 0.264150943 | 1.37 (1.12) | 1.37 (1.12) | 1.28 (1.08) | 1 | 24.69 | 2.17 |
| 3PRIME_UTR | 49.45 (5.3) | 1.32 (1.28) | 0.026720648 | 3.08 (1.45) | 0.4 (0.59) | 0.12987013 | 2.33 (1.68) | 2.18 (1.67) | 2.05 (1.63) | 0.935622318 | 54.85 | 3.9 |
| INTRONIC | 616.76 (32.95) | 16.44 (9.63) | 0.026623377 | 32.93 (5.55) | 6.33 (2.71) | 0.192401216 | 37.69 (11.25) | 27.6 (10.81) | 25.2 (10.8) | 0.734042553 | 687.38 | 50.37 |
| OTHER | 12.91 (2.58) | 0.6 (0.74) | 0.046511628 | 0.36 (0.53) | 0 (0) | | 2.02 (1.32) | 0.85 (0.96) | 0.78 (0.93) | 0.420792079 | 15.29 | 1.45 |
| total per category | 946.42 | 28.61 | 0.030232558 | 47.81 | 9.61 | 0.201046025 | 81.76 | 59.72 | 53.77 | 0.730722154 | | |
The table summarises the average counts (with standard deviations) of INDELs according to their predicted consequence, in the full dataset and in the subgroup of rare variants (i.e. MAF<0.05). For the novel calls it also reports the rare variants obtained by excluding the familial cases from the dataset and re-computing the MAF. The adjusted p value indicates the result of a Wilcoxon test between the rare variants in the entire dataset and those rare in the dataset without familial cases.
Validation of variants.
| A) Common INDELs | | | |
| | Validated | Not Validated | Validation Rate |
| Novel | 24 | 13 | 64.86% |
| Newly released | 42 | 3 | 93.33% |
| Total | 66 | 16 | 80.49% |
The table provides a summary of the validation results, both for the INDELs common to the two sequencing datasets used, and the private ones. All INDELs sent for validation were classified as “novel” according to dbSNP 131 and the 1000 Genome Consortium release of October 2010. During the validation phase new data were released by 1000 Genome (November 11th 2011); INDELs are categorised here according to this latest release, which can be considered an independent confirmation.
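The validation rates in panel A follow directly from the counts: validated calls divided by all calls sent for validation. A minimal Python sketch of that arithmetic, using the counts from the table (the function and variable names are illustrative, not taken from the authors' pipeline):

```python
def validation_rate(validated, not_validated):
    """Return the percentage of tested INDELs confirmed by validation."""
    total = validated + not_validated
    if total == 0:
        raise ValueError("no variants were tested")
    return 100.0 * validated / total


# Counts copied from panel A (common INDELs): (validated, not validated).
common_indels = {
    "Novel": (24, 13),
    "Newly released": (42, 3),
}

rates = {
    name: round(validation_rate(ok, failed), 2)
    for name, (ok, failed) in common_indels.items()
}

# Pool both rows to reproduce the "Total" line of the table.
total_ok = sum(ok for ok, _ in common_indels.values())
total_failed = sum(failed for _, failed in common_indels.values())
rates["Total"] = round(validation_rate(total_ok, total_failed), 2)

print(rates)
```

Rounded to two decimals, the three values match the percentages reported in the table.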
Figure 4. INDELs consequences comparison.
This boxplot compares the per-sample distributions of consequence proportions between our disease exomes and the 1000 Genome exomes. Significant differences, calculated with a non-parametric Wilcoxon test for independent samples, are marked with a star.
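The comparison behind this figure can be outlined in two steps: convert each sample's INDEL counts into proportions per consequence, then compare the two groups of proportions with a rank-based statistic. The sketch below is an assumption-laden illustration and not the authors' code. The function names are hypothetical, and it computes only the Mann-Whitney U statistic (the form the Wilcoxon rank-sum test takes for independent samples) without a p value. SciPy's `scipy.stats.mannwhitneyu` provides a full test if that library is available.

```python
def consequence_proportions(counts):
    """Turn one sample's {consequence: count} dict into proportions."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no INDEL calls")
    return {consequence: n / total for consequence, n in counts.items()}


def mann_whitney_u(x, y):
    """U statistic of sample x against sample y.

    Counts the pairs (a, b) with a from x and b from y where a > b;
    ties contribute 0.5. Quadratic in sample size, which is fine for
    a few hundred exomes.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u
```

To test one consequence class, collect its proportion from every disease exome into `x` and from every 1000 Genome exome into `y`. A U value close to `len(x) * len(y)` or to 0 indicates that one group's proportions are consistently higher than the other's.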