Andrew Stubbs, Elizabeth A McClellan, Sebastiaan Horsman, Saskia D Hiltemann, Ivo Palli, Stephan Nouwens, Anton Hj Koning, Frits Hoogland, Joke Reumers, Daphne Heijsman, Sigrid Swagemakers, Andreas Kremer, Jules Meijerink, Diether Lambrechts, Peter J van der Spek.
Abstract
BACKGROUND: Next generation sequencing provides clinical research scientists with direct read out of innumerable variants, including personal, pathological and common benign variants. The aim of resequencing studies is to determine the candidate pathogenic variants from individual genomes, or from family-based or tumor/normal genome comparisons. Whilst the use of appropriate controls within the experimental design will minimize the number of false positive variations selected, this number can be reduced further by using high quality whole genome reference data to filter out false positive variants prior to candidate gene selection. In addition, the use of platform related sequencing error models can help in the recovery of ambiguous genotypes from lower coverage data.
DESCRIPTION: We have developed a whole genome database of human genetic variations, Huvariome, determined by whole genome deep sequencing data with high coverage and low error rates. The database was designed to be sequencing technology independent but is currently populated with 165 individual whole genomes consisting of small pedigrees and matched tumor/normal samples sequenced with the Complete Genomics sequencing platform. Common variants have been determined for a Benelux population cohort and represented as genotypes alongside the results of two sets of control data (73 of the 165 genomes): Huvariome Core, which comprises 31 healthy individuals from the Benelux region, and the Diversity Panel, consisting of 46 healthy individuals representing 10 different populations and 21 samples in three pedigrees. Users can query the database by gene or position via a web interface, and the results are displayed as the frequency of the variations as detected in the datasets. We demonstrate that Huvariome can provide accurate reference allele frequencies to disambiguate sequencing inconsistencies produced in resequencing experiments.
Huvariome has been used to support the selection of candidate cardiomyopathy related genes which have a homozygous genotype in the reference cohorts. This database allows users to see which selected variants are common variants (>5% minor allele frequency) in the Huvariome Core samples, thus aiding in the selection of potentially pathogenic variants by filtering out common variants that are not listed in one of the other public genomic variation databases. The no-call rate and the accuracy of allele calling in Huvariome provide the user with the possibility of identifying platform dependent errors associated with specific regions of the human genome.
Year: 2012 PMID: 23164068 PMCID: PMC3549785 DOI: 10.1186/2043-9113-2-19
Source DB: PubMed Journal: J Clin Bioinforma ISSN: 2043-9113
Standardized parameters for meta-data
| Parameter | Description | Limit |
|---|---|---|
| Age | Age | Variable |
| Gender | Male, Female, other | Fixed |
| Ethnicity | HapMap groups | Fixed |
| Country of origin (if known) | UK, Netherlands, … | Variable |
| Study Type | Cancer, Family, Reference | Fixed |
| Biomaterial | Blood, Tissue, Cell line | Fixed |
| Biomaterial Subtype | PBMC, WBC, Heart, B-cell, … | Variable |
| Biomaterial Source | Peripheral vein, … | Variable |
| Biomaterial Modification | EBV transformed, … | Variable |
The Description defines the options that are available for each Parameter. A parameter for which the Limit is fixed means that there are a fixed number of possible values whereas one that is Variable is stored as any value (free text or numeric).
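The Fixed/Variable distinction can be illustrated with a small validation sketch. This is not the Huvariome implementation: the function name `validate_metadata` and the dictionary layout are hypothetical, and only the Fixed value lists that the table spells out are checked. Ethnicity is Fixed in the table, but its HapMap group list is not enumerated here, so the sketch leaves it unchecked.

```python
# Allowed values for Fixed parameters, copied from the table above.
FIXED_VALUES = {
    "Gender": {"Male", "Female", "other"},
    "Study Type": {"Cancer", "Family", "Reference"},
    "Biomaterial": {"Blood", "Tissue", "Cell line"},
}

# Parameters accepted without a value check: the Variable ones (free text or
# numeric), plus Ethnicity, whose fixed HapMap group list is not given here.
UNCHECKED_PARAMETERS = {
    "Age",
    "Ethnicity",
    "Country of origin",
    "Biomaterial Subtype",
    "Biomaterial Source",
    "Biomaterial Modification",
}


def validate_metadata(record):
    """Return a list of problems found in a metadata record (a dict)."""
    problems = []
    for name, value in record.items():
        if name in FIXED_VALUES:
            if value not in FIXED_VALUES[name]:
                allowed = ", ".join(sorted(FIXED_VALUES[name]))
                problems.append(f"{name}: {value!r} is not one of: {allowed}")
        elif name not in UNCHECKED_PARAMETERS:
            problems.append(f"{name}: unknown parameter")
    return problems


if __name__ == "__main__":
    sample = {"Gender": "Female", "Age": 42, "Study Type": "Twin"}
    for problem in validate_metadata(sample):
        print(problem)
```

A record passes when the returned list is empty; a Fixed parameter with an unlisted value, such as the `Study Type` above, produces one message.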
Summary of Huvariome genomes
| Gross mapping yield (Gb) | 206 (160–280) | 202 (168–249) | 206 (160–280) |
| Fully called genome fraction | 95.13% (91%/97%) | 95.89% (94%/97%) | 95.13% (91%/97%) |
| Partially called genome fraction | 0.96% (0%/2%) | 0.69% (0%/1%) | 0.96% (0%/2%) |
| No-called genome fraction | 3.91% (3%/7%) | 3.42% (2%/5%) | 3.91% (2%/7%) |
| SNP total count | 3257897 (2966002–3396520) | 3275018 (3056972–3495143) | 3257897 (2966002–3495143) |
| SNP novel rate | 6.71% (6%/7%) | 6.78% (6%/9%) | 6.71% (6%/9%) |
| Synonymous SNP | 9239 (8668–9564) | 9229 (8503–9821) | 9239 (8503–9821) |
| Missense SNP | 9046 (8348–9456) | 9037 (8380–9574) | 9046 (8348–9574) |
| Nonsense SNP | 96 (86–117) | 95 (77–110) | 96 (77–117) |
| Nonstop SNP | 24 (19–29) | 22 (17–26) | 24 (17–29) |
| INS total count | 180040 (152082–208451) | 190177 (160473–209226) | 180040 (152082–209226) |
| INS novel rate | 21.31% (19%/23%) | 21.58% (19%/23%) | 21.31% (19%/23%) |
| Frame-shifting INS | 134 (113–157) | 130 (95–152) | 134 (95–157) |
| Frame-preserving INS | 116 (96–132) | 117 (98–130) | 116 (96–132) |
| DEL total count | 192550 (157937–217782) | 202085 (166805–227228) | 192550 (157937–227228) |
| DEL novel rate | 23.85% (23%/25%) | 23.75% (22%/26%) | 23.85% (22%/26%) |
| Frame-shifting DEL | 117 (96–144) | 110 (86–126) | 117 (86–144) |
| Frame-preserving DEL | 120 (100–138) | 115 (106–128) | 120 (100–138) |
| SUB total count | 68020 (56179–75040) | 69396 (59319–76699) | 68020 (56179–76699) |
| SUB novel rate | 34.07% (31%/38%) | 33.96% (31%/37%) | 34.07% (31%/38%) |
| Frame-shifting SUB | 21 (11–27) | 19 (14–26) | 21 (11–27) |
| Frame-preserving SUB | 259 (208–320) | 252 (225–279) | 259 (208–320) |
The data represent the average counts from the 31 genomes of Huvariome Core. The novel rates (SNP, INS, DEL, SUB novel) are the fractions of heterozygous SNPs, insertions, deletions or substitutions that are not found in dbSNP. The SNP categories are: the number of loci where a coding SNV did not result in a protein sequence change (Synonymous SNP); the number of loci where a coding SNV resulted in a protein sequence change, with no change in size of protein (Missense SNP); the number of loci where the single nucleotide change in coding sequence resulted in a STOP codon (TGA, TAG, or TAA), causing an early termination of protein translation (Nonsense SNP); the number of loci where the single nucleotide change in coding sequence changed a STOP codon into a codon that codes for an amino acid, resulting in the continuation of the translation for this protein (Nonstop SNP); and the number of loci where the single nucleotide change in coding sequence changed a START codon into a codon for something other than a start codon, likely resulting in a non-functional gene (Misstart SNP). The insertion, deletion and substitution categories are: the number of loci where the change in coding sequence resulted in a frameshift for the encoded protein (Frame-shifting INS, DEL, SUB); and the number of loci where there is a change in coding sequence and the length of the insertion is a multiple of 3, resulting in the insertion of amino acids in the encoded protein in-frame (Frame-preserving INS, DEL, SUB).
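The category definitions above can be sketched in code. This is an illustrative reading of the caption, not the annotation pipeline used for Huvariome: the function names are hypothetical, the standard genetic code is assumed, and the frame test uses the net length change (a multiple of 3 is treated as frame-preserving).

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_snv(ref_codon, alt_codon, is_start_codon=False):
    """Classify a coding SNV from the reference and variant codons."""
    if is_start_codon and ref_codon == "ATG" and alt_codon != "ATG":
        return "misstart"
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"  # new STOP codon, early termination
    if ref_aa == "*":
        return "nonstop"  # STOP codon lost, translation continues
    return "missense"


def classify_length_change(ref_length, alt_length):
    """Classify an insertion, deletion or substitution by its net length change."""
    if (alt_length - ref_length) % 3 == 0:
        return "frame-preserving"
    return "frame-shifting"


if __name__ == "__main__":
    print(classify_snv("CGA", "TGA"))    # Arg codon changed to a STOP codon
    print(classify_length_change(2, 0))  # 2 bp deletion
```

The two-base deletions listed for MYBPC3 elsewhere in this record (P955fs, F412fs) are examples of the frame-shifting case.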
Figure 1. Huvariome database schema. The part of the database schema representing the relationship between the core variation and annotation tables based on the content delivered by Complete Genomics.
Figure 2. Huvariome database access screen. Users can access the system from a central page (http://huvariome.erasmusmc.nl) in which a genome is chosen and variants to be searched are uploaded as tab or space delimited search requests such that there is one variation per line. A region is searched by including an end position with the chromosome and start position.
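The search request format described in this caption could be parsed along the following lines. This is a minimal sketch under stated assumptions: the field order (chromosome, start, optional end) is inferred from the caption, the `Query` type and `parse_queries` function are hypothetical names, and queries by gene symbol are not handled.

```python
from typing import List, NamedTuple, Optional


class Query(NamedTuple):
    chromosome: str
    start: int
    end: Optional[int]  # None for a single position, set for a region


def parse_queries(text: str) -> List[Query]:
    """Parse tab or space delimited search requests, one variation per line."""
    queries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()  # splits on any run of tabs or spaces
        if not fields:
            continue  # skip blank lines
        if len(fields) not in (2, 3):
            raise ValueError(f"line {line_no}: expected 2 or 3 fields, got {len(fields)}")
        chromosome = fields[0]
        start = int(fields[1])
        end = int(fields[2]) if len(fields) == 3 else None
        if end is not None and end < start:
            raise ValueError(f"line {line_no}: end position precedes start position")
        queries.append(Query(chromosome, start, end))
    return queries


if __name__ == "__main__":
    for query in parse_queries("2\t227829345\n6 71029714 71029800\n"):
        print(query)
```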
Figure 3. Huvariome genotype frequency page. The output page giving the distribution of allelic variations in the Diversity Panel of genomes for the European (CEU) and African (ASW) populations. In this example the first ten positions that were queried from the cardiomyopathy data set from Meder et al. 2011 [25] are shown. Each variant is returned per row with the frequency of each genotype highlighted by the size of the associated blue bar. Abbreviations: chromosome (chr); 0-based location (pos); reference allele (ref); variant alleles 1 and 2 (a1, a2); indels (ins, del); substitutions (sub); no-call (unkn.); no-call rate (nc rate); external reference (xref); predicted amino acid change (impact); gene symbol (gsym); gene component (comp), e.g. exon, intron; and variant from database of genomic variants (dgv).
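As a rough illustration of the quantities shown on this page, the sketch below derives genotype frequencies, a no-call rate and a minor allele frequency from a list of per-genome calls at one position. The input format (strings such as `"T/C"`, with `None` for a no-call) and the function name are assumptions made for the example, not the Huvariome data model. The 5% threshold is the common-variant cut-off quoted earlier in this record.

```python
from collections import Counter

COMMON_MAF_THRESHOLD = 0.05  # ">5% minor allele frequency"


def summarize_position(calls):
    """Summarize genotype calls at one position across a set of genomes."""
    total = len(calls)
    called = [c for c in calls if c is not None]
    if not called:
        raise ValueError("no called genotypes at this position")
    # Alleles are sorted so that T/C and C/T count as the same genotype.
    genotype_counts = Counter("/".join(sorted(c.split("/"))) for c in called)
    allele_counts = Counter(a for c in called for a in c.split("/"))
    n_alleles = sum(allele_counts.values())
    allele_freqs = sorted((n / n_alleles for n in allele_counts.values()), reverse=True)
    # Minor allele frequency: frequency of the second most common allele.
    maf = allele_freqs[1] if len(allele_freqs) > 1 else 0.0
    return {
        "genotype_freq": {g: n / len(called) for g, n in genotype_counts.items()},
        "no_call_rate": (total - len(called)) / total,
        "minor_allele_freq": maf,
        "is_common": maf > COMMON_MAF_THRESHOLD,
    }


if __name__ == "__main__":
    example_calls = ["T/T"] * 8 + ["T/C", None]
    print(summarize_position(example_calls))
```

Genotype frequencies are computed over called genomes only, while the no-call rate is computed over all genomes; whether the web interface uses the same denominators is not stated here.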
Quality measures for Huvariome genomes
| Region | ti/tv | hetero/homo |
|---|---|---|
| CDS | 2.99 | 1.13 |
| INTERGENIC | 2.07 | 1.15 |
| INTRON | 2.22 | 1.15 |
| UTR | 2.15 | 1.17 |
| DONOR | 2.81 | 0.97 |
| TSS | 2.08 | 1.16 |
| ACCEPTOR | 2.42 | 1.21 |
| SYNONYMOUS | 5.20 | 1.40 |
| MISSENSE | 2.14 | 1.28 |
| NONSENSE | 1.93 | 1.37 |
The transition (ti) to transversion (tv) and heterozygous (hetero) to homozygous (homo) ratios for SNVs were calculated for several regions of the genome: coding regions (CDS), intergenic regions (INTERGENIC), introns (INTRON), untranslated regions (UTR), splice sites (DONOR and ACCEPTOR) and transcription start sites (TSS). They were also calculated for different impacts on the resultant protein sequence: no change in sequence (SYNONYMOUS), a change in the protein sequence with no change in size of protein (MISSENSE) and an early termination of protein translation (NONSENSE).
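For readers unfamiliar with these two quality measures, the following sketch shows how they are conventionally computed. It assumes SNVs are given as (reference, variant) base pairs and variant genotypes as (allele 1, allele 2) pairs; the function names are hypothetical and the per-region grouping used in the table is not reproduced.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine-to-purine or pyrimidine-to-pyrimidine changes."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)


def ti_tv_ratio(snvs):
    """Ratio of transitions to transversions for (ref, alt) pairs."""
    transitions = sum(1 for ref, alt in snvs if is_transition(ref, alt))
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions, ratio undefined")
    return transitions / transversions


def het_hom_ratio(genotypes):
    """Ratio of heterozygous to homozygous variant genotypes."""
    heterozygous = sum(1 for a1, a2 in genotypes if a1 != a2)
    homozygous = len(genotypes) - heterozygous
    if homozygous == 0:
        raise ValueError("no homozygous genotypes, ratio undefined")
    return heterozygous / homozygous


if __name__ == "__main__":
    print(ti_tv_ratio([("T", "C"), ("G", "A"), ("C", "T"), ("A", "T")]))
    print(het_hom_ratio([("G", "A"), ("A", "A"), ("T", "C")]))
```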
Figure 4. Similarity of HVC to Diversity Panel. The output from AWClust [23] is shown in a multi-dimensional scaling plot. The genomes of the HVC samples (HuVar), the Chinese and Japanese (CHB, JPT), the African (MKK, ASW, YRI, LWK), the Mexican and Indian (MXL, GIH) and the European (CEU) populations are shown in this plot. The HVC samples (HuVar, red) overlap with the CEU population, indicating that they are more similar to the European samples than to the others.
Confirmation of genotypes from 26 ambiguous variation calls
| | | |||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 227829345 | G | ASP,TYR | COL4A3 | G | G | G | G | G | G | G | G:T | | G:T | ||
| 6 | 71029714 | T | GLU,GLY | COL9A1 | T | T | T | T | T | T | T | |||||
| 14 | 70059717 | T | TYR,CYS | ADAM20 | T | T | T | T | T | T | T | |||||
| 16 | 28511156 | T | ASN,THR | SULT1A2 | T | T | T | T | T | T | T | T:G | T:G | T:G | ||
| 9 | 100837151 | C | PRO,LEU | COL15A1 | C | C | C | C | C | C | C | C:C | | | ||
| 14 | 69994396 | T | TYR,HIS | ADAM21 | T | T | T | T | T | T | T | |||||
| 16 | 28514697 | G | PRO,LEU | SULT1A2 | G | G | G | G | G | G | G | G:A | G:A | | ||
| 2 | 227632472 | G | PRO,LEU | COL4A4 | G | G | G | G | G | G | G | | G:A | | ||
| 2 | 237945216 | G | ARG,TRP | COL6A3 | G | G | G | G | G | G | G | | G:A | | ||
| 6 | 32286548 | C | GLY,ARG | NOTCH4 | C | C | C | C | C | C | C | | C:T | | ||
| 10 | 123233227 | C | ARG,GLN | FGFR2 | C | C | C | C | C | C | C | |||||
| 15 | 72802259 | A | ILE,THR | CYP1A1 | A | A | A | A | A | A | A | | A:G | | ||
| 15 | 82491107 | T | MET,THR | ADAMTSL3 | T | T | T | T | T | T | T | | T:C | | ||
| 19 | 46210061 | T | ILE,THR | CYP2B6 | T | T | T | T | T | T | T | |||||
| 4 | 73407648 | C | GLY,ARG | ADAMTS3 | C | C | C | C | C | C | C | | C:T | | ||
| 6 | 46728211 | C | ARG,PRO | CYP39A1 | C | C | C | C | C | C | C | C:G | C:G | | ||
| 7 | 99283117 | G | ARG,GLN | CYP3A43 | G | G | G | G | G | G | G | | G:A | | ||
| 10 | 96474119 | G | VAL,LEU | CYP2C18 | G | G | G | G | G | G | G | | G:T | | ||
| 10 | 96698964 | A | HIS,ARG | CYP2C9 | A | A | A | A | A | A | A | | A:G | | ||
| 15 | 76845583 | G | PRO,LEU | ADAMTS7 | G | G | G | G | G | G | G | | G:A | | ||
| 1 | 120269806 | C | GLY,ARG | NOTCH2 | C | C | C | C | C | C | C | |||||
| 8 | 24420887 | C | PRO,LEU | ADAM7 | C | C | C | C | C | C | C | |||||
| 17 | 38961570 | G | PRO,LEU | ETV4 | G | G | G | G | G | G | G | | | G:A | ||
| 21 | 46365928 | C | PRO,THR | COL6A2 | C | C | C | C | C | C | C | | | C:A | ||
| 2 | 189622367 | G | PRO,LEU | COL5A2 | G | G | G | G | G | G | G | | | G:A | ||
| 5 | 129100767 | C | THR,ILE | ADAMTS19 | C | C | C | C | C | C | C | C:T | ||||
Results from Huvariome analysis of 26 selected coding SNVs to disambiguate genetic variation determined by Ng et al. 2009 [24]. The first 7 columns display the normal variations and their proposed functional impact, as determined by Ng et al. 2009. The base changes are presented as IUPAC codes per sample (e.g. NA12156), grouped by population: CEU (NA12156, NA12878), YRI (NA18507, NA18517, NA19129, NA19240) and Asian (NA18555, NA18956). The impacted bases are denoted with bold letters and in the column titled “change”. The last three columns contain the genotypes called for the three populations present in Huvariome Core and the Diversity Panel (European, African, and Asian). The Huvariome genotypes highlighted in bold are cases where Huvariome calls homozygous reference while the genotypes reported by Ng et al. are heterozygous reference.
Confirmation of known population variation
| Gene | Chromosome | Reference Position | Reference Allele | Variant Allele | Sanger Confirmed | Huvariome Alleles | dbSNP |
|---|---|---|---|---|---|---|---|
| LMNA | 1 | 154372809 | T | C | Yes | T/C | dbsnp.83:rs505058 |
| SMYD2 | 1 | 212558909 | G | A | Yes | G/A | dbsnp.86:rs1134647 |
| TTN | 2 | 179163877 | G | A | Yes | G/A | dbsnp.130:rs72646845 |
| TTN | 2 | 179170739 | A | G | Yes | A/G | dbsnp.126:rs35833641 |
| TTN | 2 | 179329196 | C | T | Yes | C/T | dbsnp.116:rs7585334 |
| TTN | 2 | 179337706 | C | T | Yes | C/T | dbsnp.100:rs2291311 |
| TTN | 2 | 179352280 | G | A | Yes | G/A | dbsnp.88:rs1552280 |
| HDAC2 | 6 | 114372280 | C | Yes | T/C | dbsnp.121:rs13204445 | |
| TMEM2 | 9 | 73549916 | C | T | Yes | C/T | dbsnp.72:rs25689 |
| TMEM2 | 9 | 73550029 | C | Yes | G/C | dbsnp.107:rs3739783 | |
| MYPN | 10 | 69603927 | G | A | No | G/A | dbsnp.120:rs10997975 |
| LDB3 | 10 | 88483707 | A | T | Yes | A/T | dbsnp.127:rs45567939 |
| TRAF6 | 11 | 36473064 | T | C | Yes | T/C* | |
| MYBPC3 | 11 | 47326019 | G | A | Yes | G/A | dbsnp.120:rs11570058 |
| MYBPC3 | 11 | 47326617 | T | C | Yes | T/C | dbsnp.107:rs3729989 |
| MYH6 | 14 | 22931651 | A | G | Yes | A/G | dbsnp.80:rs365990 |
| MYH7 | 14 | 22968900 | G | A | No | G/A | dbsnp.86:rs735712 |
| DICER1 | 14 | 94626500 | A | T | Yes | A/T | dbsnp.52:rs13078 |
| ACTC1 | 15 | 32868460 | G | C | Yes | G/C | dbsnp.116:rs8037241 |
| TPM1 | 15 | 61138893 | C | A | Yes | C/A | dbsnp.86:rs1071646 |
| TCAP1 | 17 | 35075837 | A | C | Yes | A/C | dbsnp.86:rs1053651 |
| DSC2 | 18 | 26903040 | T | C | Yes | T/C** | |
| DSG2 | 18 | 27365107 | G | A | Yes | G/A | |
| DSG2 | 18 | 27376616 | G | A | Yes | G/A | |
| TNNI3 | 19 | 60357396 | A | C | Yes | A/C | dbsnp.116:rs7252610 |
| PARVB | 22 | 42726784 | T | C | Yes | T/C | dbsnp.86:rs1007863 |
| PARVB | 22 | 42821201 | T | C | Yes | T/C | dbsnp.92:rs1983609 |
| PARVB | 22 | 42821229 | T | C | Yes | T/C | dbsnp.86:rs738479 |
| DMD | X | 31406271 | C | T | Yes | C/T | dbsnp.89:rs1800280 |
| DMD | X | 32413115 | T | C | Yes | T/C | dbsnp.79:rs228406 |
Genomic nucleotide positions 1-based (Reference Position), nucleotides (Reference and Variant Alleles), and Confirmation by Sanger Sequencing are determined by Meder et al. 2011 [25]. The Variant Alleles in bold are the reference alleles in NCBI build 36. Huvariome Alleles are represented with the NCBI build 36 reference allele first in the pair (e.g. T/C with T from NCBI build 36). The T/C variant labeled with * is not found in the HVC, but in the CEU and GIH population; the T/C variant labeled with ** is not found in the HVC, but in the YRI and JPT population.
Variations in candidate cardiomyopathy genes
| LMNA | 1 | 154372340 | R>Stop | C | T | Yes | C/C | R321ter | + | Cardiomyopathy,_dilated\|961C>T |
| TNNT2 | 1 | 199599130 | E163fs | C | -- | No | C/C | | -- | Cardiomyopathy,_hypertrophic\|487G>A |
| SMYD2 | 1 | 212558105 | H>Y | C | T | Yes | C/C | | | |
| DSP | 6 | 7525794 | R>G | C | G | No | C/C | | | Arrhythmogenic_right_ventricular_dysplasia/cardiomyopathy\|4372C>G |
| TMEM2 | 9 | 73505380 | T>T | C | T | Yes | C/C | | | |
| ILK | 11 | 6585971 | P>L | C | T | No | C/C | | | |
| MYBPC3 | 11 | 47324447 | R>Q | C | T | Yes | C/C | R326Q | -- | |
| MYBPC3 | 11 | 47313209-47313210 | P955fs | CT | -- | Yes | AG/AG | P955fs | | Cardiomyopathy,_hypertrophic\|2864_2865delCT |
| MYBPC3 | 11 | 47321263-47321264 | F412fs | TT | -- | Yes | AA/AA | F412fs | | Cardiomyopathy,_hypertrophic\|1235_1236delTT |
| MYH7 | 14 | 22963165 | C905fs | G | -- | No | A/A | | | |
| MYH7 | 14 | 22968054 | R>C | G | A | Yes | G/G | R453C | -- | Cardiomyopathy,_hypertrophic\|1357C>A |
| MYH7 | 14 | 22971706 | Y>H | A | G | Yes | A/A | | | |
| MYH7 | 14 | 22971762 | R>Q | C | T | Yes | C/C | R143Q | -- | Cardiomyopathy,_hypertrophic\|428G>A |
Genomic nucleotide positions 1-based (Reference Position), nucleotides (Reference and Variant Alleles), and Confirmation by Sanger Sequencing are determined by Meder et al. 2011 [25]. Huvariome alleles are represented with the NCBI build 36 reference allele first in the pair (e.g. T/C with T from NCBI build 36). Variants that have previously been found to be associated with cardiomyopathy are denoted by Known Pathological Variant [25]. Cardiomyopathy variations derived from the professional edition of the Human Gene Mutation Database (HGMD) were supplied by Biobase. The HGMD descriptions in bold are those whose variants were first described by Meder et al. 2011 [25] as related to dilated or hypertrophic cardiomyopathy.
Rate of reference genome variation in candidate cardiomyopathy genes
| Gene | Ref var count | Cumulative exon length (bp) | Rate (negative log) | Var count |
|---|---|---|---|---|
| TNNT2 | 2 | 1153 | 6.4 | 1 |
| LMNA | 5 | 3225 | 6.5 | 1 |
| SMYD2 | 2 | 1685 | 6.7 | 1 |
| TMEM2 | 6 | 6523 | 7.0 | 1 |
| MYH7 | 5 | 6030 | 7.1 | 4 |
| DSP | 7 | 9730 | 7.2 | 1 |
| ILK | 1 | 1797 | 7.5 | 1 |
| MYBPC3 | 2 | 4218 | 7.7 | 5 |
This table displays the number of variations in the genes that are used for searching Huvariome (Var count), the expected number of variants in the 31 genomes of HVC (Ref var count), the cumulative exon length in bp and the resultant rate of variants within the exons (negative log).
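The caption does not state which logarithm is used. The tabulated rates are, however, consistent with the negative natural logarithm of the first numeric column divided by the cumulative exon length. The sketch below checks that reading against the table values; it is an inference from the numbers and not a formula given in the source. It takes the first numeric column to be the reference variant count, and the function name is hypothetical.

```python
import math

# (gene, first numeric column, cumulative exon length in bp, tabulated rate),
# copied from the table above.
ROWS = [
    ("TNNT2", 2, 1153, 6.4),
    ("LMNA", 5, 3225, 6.5),
    ("SMYD2", 2, 1685, 6.7),
    ("TMEM2", 6, 6523, 7.0),
    ("MYH7", 5, 6030, 7.1),
    ("DSP", 7, 9730, 7.2),
    ("ILK", 1, 1797, 7.5),
    ("MYBPC3", 2, 4218, 7.7),
]


def negative_log_rate(variant_count, exon_length_bp):
    """Negative natural log of variants per exonic base pair (assumed formula)."""
    return -math.log(variant_count / exon_length_bp)


if __name__ == "__main__":
    for gene, count, length, tabulated in ROWS:
        computed = negative_log_rate(count, length)
        print(f"{gene}: computed {computed:.2f}, tabulated {tabulated}")
```

Under this reading, a higher value means fewer reference variants per exonic base pair.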