267 Spanish Exomes Reveal Population-Specific Differences in Disease-Related Genetic Variation.

Joaquín Dopazo1, Alicia Amadoz2, Marta Bleda3, Luz Garcia-Alonso2, Alejandro Alemán3, Francisco García-García2, Juan A Rodriguez4, Josephine T Daub4, Gerard Muntané4, Antonio Rueda5, Alicia Vela-Boza5, Francisco J López-Domingo5, Javier P Florido5, Pablo Arce5, Macarena Ruiz-Ferrer6, Cristina Méndez-Vidal7, Todd E Arnold8, Olivia Spleiss9, Miguel Alvarez-Tejado10, Arcadi Navarro11, Shomi S Bhattacharya12, Salud Borrego7, Javier Santoyo-López5, Guillermo Antiñolo13.   

Abstract

Recent results from large-scale genomic projects suggest that allele frequencies, which are highly relevant for medical purposes, differ considerably across different populations. The need for a detailed catalog of local variability motivated the whole-exome sequencing of 267 unrelated individuals, representative of the healthy Spanish population. As in other studies, a considerable number of rare variants were found (almost one-third of the described variants). There were also relevant differences in allelic frequencies in polymorphic variants, including ∼10,000 polymorphisms private to the Spanish population. The allelic frequencies of variants conferring susceptibility to complex diseases (including cancer, schizophrenia, Alzheimer disease, type 2 diabetes, and other pathologies) were overall similar to those of other populations. However, the trend is the opposite for variants linked to Mendelian and rare diseases (including several retinal degenerative dystrophies and cardiomyopathies), which show marked frequency differences between populations. Interestingly, a correspondence between differences in allelic frequencies and disease prevalence was found, highlighting the relevance of frequency differences in disease risk. These differences are also observed in variants that disrupt known drug binding sites, suggesting an important role for local variability in population-specific drug resistances or adverse effects. We have made publicly available the Spanish population variant server web page, which contains population frequency information for the complete list of 170,888 variant positions we found (http://spv.babelomics.org/). We show that it is fundamental to determine population-specific variant frequencies to distinguish real disease associations from population-specific polymorphisms.
© The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Keywords:  disease variants; exome sequencing; pharmacogenomic variants; population variability

Year:  2016        PMID: 26764160      PMCID: PMC4839216          DOI: 10.1093/molbev/msw005

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


Introduction

Recent large-scale population genomic projects (Durbin et al. 2010; Fu et al. 2013) have revealed the existence of an enormous amount of rare variation at the genome level in human populations (Coventry et al. 2010; Li et al. 2010; Gravel et al. 2011; Marth et al. 2011; Keinan and Clark 2012; Nelson et al. 2012; Tennessen et al. 2012). In addition to the anticipated neutral variation, apparently normal healthy individuals present a considerable number of deleterious variants with a putative effect on the function of human protein-coding genes (Kryukov et al. 2007; MacArthur and Tyler-Smith 2010; Marth et al. 2011; Nelson et al. 2012; Tennessen et al. 2012; Xue et al. 2012; Garcia-Alonso et al. 2014) and functional noncoding genomic elements, including miRNAs (Carbonell et al. 2012) and other regulatory regions (Dunham et al. 2012; Spivakov et al. 2012). Moreover, recent studies have described a remarkable local component (Kryukov et al. 2007; Marth et al. 2011; Nelson et al. 2012) and a high stratification level (Mathieson and McVean 2012; Moreno-Estrada et al. 2013) in many rare variants with uncertain functional consequences. It is likely that these rare variants help explaining the differential risk of many diseases in distinct human populations (Corona et al. 2013; Fernandez et al. 2013). All of these observations highlight the need for population-specific catalogs of genetic variation (Bustamante et al. 2011). Despite being systematically studied at the single nucleotide polymorphism (SNP) level (Bustamante et al. 2011), sample sizes for individual European populations (amounting to only 50–90 individuals per population) in large-scale genome sequencing projects (Durbin et al. 2010) limit the precision at which their genetic variation at nucleotide resolution level can be assessed. Extensive variability surveys of European populations based on SNPs show a clear correspondence between genetic and geographic distances (Novembre et al. 2008). 
On the other hand, previous studies reported that the proportion of damaging substitutions is appreciably higher in individuals with European ancestry than in those with African ancestry (Barreiro et al. 2008; Lohmueller et al. 2008; Vasseur and Quintana-Murci 2013), which reinforces the need to gain insight into the degree of genetic variation at the population level. To our knowledge, only two initiatives that produced population-specific catalogs of genetic variation have been published to date: a whole-genome sequencing (WGS) study of 100 Malays (Wong et al. 2013) and the recent Genome of the Netherlands with low-resolution (∼13×) WGS data of 250 trio-families from across the entire country (The Genome of the Netherlands Consortium 2014). Another recent study of 109 exomes from French-Canadians, descendants of a small number of French settlers, shows how the frequency of rare variants increases after population bottlenecks (Casals et al. 2013).
Genomic data are providing an increasingly detailed perspective of the landscape of human variability at the nucleotide resolution level (Goldstein et al. 2013). Precision medicine, an emerging paradigm of medicine oriented to maximizing individual care and disease prevention rather than merely treating disease (Hood and Friend 2011), requires knowledge about the genetic structure of human populations, particularly in its preventive aspects (Khoury et al. 2012). Such knowledge is also essential for identifying genetic factors contributing to variation in disease risk as well as to drug pharmacokinetics, treatment efficacy, and adverse drug reactions (Dopazo 2014). Attempts to map the genetic basis of any of these disease traits will likely produce spurious associations if the genetic structure of the population is not properly accounted for (Price et al. 2006; Goldstein et al. 2013; Boomsma et al. 2014). Despite the extensive use of targeted exome sequencing to discover disease genes in Mendelian disorders (Ng et al. 2010; Bamshad et al. 2011) or cancer (Garraway and Lander 2013; Vogelstein et al. 2013), lack of information about local genetic variation can severely hinder the discrimination of real disease variants from local polymorphisms and rare variants. The classical consensus about the existence of relative homogeneity within European populations started to be questioned by genome-wide association studies (GWAS) reporting a correspondence between genetic and geographic distances (Novembre et al. 2008). Our observations support this view and suggest that the level of variation in local populations, even in the absence of geographic barriers, is higher than expected from previous variability studies based on polymorphisms (Novembre et al. 2008; Bustamante et al. 2011). A large proportion of this variability corresponds to private variants, as previously reported in several studies (Coventry et al. 2010; Li et al. 2010; Marth et al. 2011; Keinan and Clark 2012; Nelson et al. 2012; Tennessen et al. 2012).
Here, we have analyzed whole-exome sequencing (WES) data from a sample of 267 healthy individuals of Spanish origin, which allowed us to carry out an exhaustive study of variability in the healthy Spanish population. The individuals were used as controls in the Medical Genome Project (MGP; http://www.medicalgenomeproject.com, last accessed January 28, 2016), a public–private partnership which aims to discover disease genes and mutations. Our analysis discovered a high degree of local variability private to the Spanish population, and a considerable difference in allele frequencies in polymorphic variants relative to geographically close populations (e.g., Tuscans from Italy). These observations extend to disease susceptibility or causal disease variants.
Although allele frequencies for complex diseases in the Spanish population seem to be similar to other populations, the scenario for Mendelian and rare diseases seems to be the opposite, with significant differences in allele frequencies in the Spanish population (at least for several paradigmatic cases, including hereditary cardiomyopathies, degenerative retinal dystrophies, and others). Interestingly, differences in the allelic distributions of variants affecting known drug binding sites seem to be frequent, which suggests that drug resistance, adverse effects, etc., may have an important population-specific component. Finally, we report an example illustrating the importance of local variation to distinguish disease variants from local polymorphisms which could otherwise be taken as rare variants.

Results

Collection of Healthy Individuals

Blood samples were collected from a total of 267 unrelated individuals of Spanish origin (obtained mainly in the northwest, in Galicia, and in the south, in Andalusia) who were phenotyped as healthy, i.e., with no known diseases or genetic conditions in the family history (see Materials and Methods).

Sequencing Results

The pipeline for primary data analysis (quality control and mapping) is described in Materials and Methods. The observed average coverage was satisfactory in all the samples analyzed (above 40×, which approximately corresponds to the expected coverage). In heterozygous variant calls, the alternative allele was supported by over 30% of the reads. These results ensured the quality of the variant calling process in this population. Sequence data have been deposited at the European Genome-phenome Archive (EGA, see http://ega.crg.eu, last accessed January 28, 2016), under accession number EGAS00001000938.

Variability Distribution in the Spanish Population

A summary of the variability corresponding to the exonic regions of the Spanish population is shown in table 1. Almost one-third of the variants found had not been previously described in dbSNP (Sherry et al. 2001), 1000 Genomes populations (1000G) (Durbin et al. 2010), or the National Heart, Lung, and Blood Institute Exome Sequencing Project (Fu et al. 2013). This level of discovery is similar to that previously observed in other sequencing projects (Fu et al. 2013). A large proportion of variants were found in only one individual in the Spanish population (85%), which also agrees with previous observations of rare variant frequencies in different human populations (Coventry et al. 2010; Li et al. 2010; Marth et al. 2011; Keinan and Clark 2012; Nelson et al. 2012; Tennessen et al. 2012; Casals et al. 2013). The average number of variants per individual in the coding regions of the genome analyzed was ∼19,000. Among these variants, the observed number of nonsynonymous changes per individual was 9,194. In particular, an average of 95.8 stop gains and 29.4 stop losses per individual was observed. There was also an average of 417.2 variants which affect splicing. As observed in other large-scale genomic projects (MacArthur and Tyler-Smith 2010; Xue et al. 2012), there was an average of 352 likely deleterious nonsynonymous single nucleotide variants (SNVs) (those which meet at least two of the three pathogenicity indexes indicative of a potential deleterious effect; see Materials and Methods) per individual. Among these, more than 50 variants per individual were homozygotes, therefore representing the presence of a considerable amount of potentially deleterious variation in the Spanish population. Only 27.5% of the variants were already present in the IBS population of 16 individuals, making 60.3% of the variability we describe here new. 
The Spanish population variant server web page contains the complete list of 170,888 variant positions found in the Spanish MGP population sequenced in this study, which can be interactively queried (http://spv.babelomics.org/, last accessed January 28, 2016).
Table 1.

Variants in the Exonic Regions of the MGP Spanish Population.

Columns 2 to 4 refer to All Variants; columns 5 to 7 refer to Private MGP Variants.

Category | All: Total Variants | All: Average Variants per Individual | All: Average Variants per Individual (homozygous) | Private MGP: Total Variants | Private MGP: Average Variants per Individual | Private MGP: Average Variants per Individual (homozygous)
Exome positions with SNV | 170,888 | 18,875.8 | 6,906 | 63,243 | 835.8 | 59.4
Exome monoallelic positions | 170,370 | 18,871.6 | 6,906 | 63,143 | 835.9 | 59.4
Exome multiallelic positions | 518 | 4.2 | 0 | 100 | 0.8 | 0
Exome SNV | 171,406 | 18,880.1 | 6,906 | 63,343 | 836.7 | 59.4
Singletons | 54,214 | 202 | 59.4 | 54,214 | 202 | 59.4
Nonsynonymous SNV | 97,589 | 9,193.7 | 3,335.5 | 40,564 | 538.6 | 41
Synonymous SNV | 73,011 | 9,734 | 3,596.5 | 21,857 | 287.2 | 18
Stop gain SNV | 1,852 | 95.8 | 22 | 1,060 | 15.9 | 0.4
Stop loss SNV | 178 | 29.4 | 12 | 71 | 0.6 | 0.1
Splicing SNV | 4,217 | 417.2 | 154.8 | 1,842 | 25.1 | 2
LoF SNV | 32,736 | 1,163.8 | 211.2 | 17,314 | 141.8 | 3.3
LoF strict (a) SNV | 12,639 | 352.6 | 51.4 | 7,136 | 51 | 0.3

(a) All three pathogenicity predictors (SIFT, Polyphen, and conservation score) reported these SNVs as pathogenic, in contrast with loss-of-function (LoF) in which only two pathogenicity predictions were required to consider the variant as pathogenic.
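The difference between the LoF and LoF strict rows can be illustrated as a vote over the three predictors. The following is a minimal sketch, assuming each predictor has already been reduced to a yes/no call; the function names and boolean inputs are hypothetical, and the score thresholds actually used are those given in Materials and Methods, not anything encoded here.

```python
def classify_deleterious(sift_damaging: bool, polyphen_damaging: bool, conserved: bool) -> str:
    """Label a nonsynonymous SNV by how many predictors flag it (hypothetical helper)."""
    votes = sum([sift_damaging, polyphen_damaging, conserved])
    if votes == 3:
        return "LoF strict"  # all three predictors agree
    if votes == 2:
        return "LoF"         # exactly two of the three agree
    return "unflagged"


def counts_as_lof(label: str) -> bool:
    """Assumption: a strict call also satisfies the looser two-of-three criterion."""
    return label in ("LoF", "LoF strict")


print(classify_deleterious(True, True, False))
print(classify_deleterious(True, True, True))
```

Treating the strict set as a subset of the looser set is an assumption of this sketch, based on the "at least two of the three" wording in the text.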

Figure 1 depicts the extent of the variability of the Spanish population captured by this study. The total number of new variants present only in the Spanish population grew linearly with the number of individuals analyzed and seemed to be far from reaching a plateau in our study. However, when new variants were decomposed into rare variants (singletons) and polymorphic variants (those shared by several individuals), it was apparent that the main contribution to the private Spanish variability comes from rare variants, while polymorphic variants soon reached a plateau. This suggests that most of the polymorphisms within coding regions observed only in the Spanish population were apparently discovered in this work and seem to be restricted to ∼10,000 positions. Approximately one-third of the variants found in the Spanish population are homozygous. This proportion decreases to 7% if only Spanish-specific variants are considered. The heterogeneity in the population can be viewed in supplementary figure S1, Supplementary Material online, and probably corresponds to different geographic locations. Unfortunately, anonymization and randomization of the samples (see Materials and Methods) preclude the assignment of specific samples to precise geographic locations.
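The decomposition of new variants into singletons and polymorphic variants can be sketched as an accumulation curve. This is an illustrative sketch only, assuming each individual is represented as a set of variant identifiers and that `known` holds the variants already present in reference catalogs; the function name and the data layout are hypothetical.

```python
from collections import Counter


def accumulation_curve(individuals, known):
    """Cumulative variant counts as individuals are added one at a time.

    individuals: list of sets of variant IDs (one set per individual).
    known: set of variant IDs already described in reference catalogs.
    """
    carriers = Counter()  # variant ID -> number of individuals carrying it so far
    curve = []
    for variants in individuals:
        carriers.update(variants)
        new = [v for v in carriers if v not in known]
        singletons = sum(1 for v in new if carriers[v] == 1)
        curve.append({
            "total": len(carriers),
            "known": len(carriers) - len(new),
            "new": len(new),
            "new_singleton": singletons,                # seen in one individual only
            "new_polymorphic": len(new) - singletons,   # seen in more than one individual
        })
    return curve


toy = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
for point in accumulation_curve(toy, known={"a"}):
    print(point)
```

In this toy input the singleton count stays flat while the polymorphic count grows, which is the reverse of the pattern reported here; the point of the sketch is only the bookkeeping.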
Fig. 1. Cumulative number of new variants contributed by individuals. The red line represents the number of variants found as the number of sequenced individuals increases. The green line represents the number of already known variants among all the variants found. The blue line represents the number of new variants not present in the 1000G populations. New variants are decomposed into polymorphic variants (present in more than one individual in the MGP population), represented by the blue dashed line, and rare variants (present in only one MGP individual), represented by the blue dotted line.

The distribution pattern of homozygotes and heterozygotes is consistent with a scenario in which most of the variants are in Hardy–Weinberg equilibrium (Stern 1943). Thus, at low allelic frequencies of the alternative allele, heterozygotes are prevalent, while the situation is the opposite at high allelic frequencies, where many alternative alleles are fixed in the population. In summary, we observed an excess of low-frequency nonsynonymous coding variants, most of them heterozygotes, thus confirming the observations made in other populations (Coventry et al. 2010; Li et al. 2010; Marth et al. 2011; Keinan and Clark 2012; Nelson et al. 2012; Tennessen et al. 2012; Casals et al. 2013).
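As a worked illustration of the Hardy–Weinberg expectation, the sketch below derives expected genotype counts from observed counts and computes a simple chi-square statistic. The input genotype counts (225 R/R, 40 R/A, 2 A/A) are taken from one row of table 2; the function names are hypothetical, and this check is an assumption for illustration, not the test used in the study. The chi-square approximation is also rough when an expected count is this small.

```python
def hardy_weinberg_expected(n_rr, n_ra, n_aa):
    """Return the alternative allele frequency and expected (R/R, R/A, A/A) counts."""
    n = n_rr + n_ra + n_aa
    q = (n_ra + 2 * n_aa) / (2 * n)  # alternative allele frequency
    p = 1.0 - q                      # reference allele frequency
    return q, (p * p * n, 2 * p * q * n, q * q * n)


def chi_square(observed, expected):
    """Pearson chi-square statistic over the three genotype classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = (225, 40, 2)
q, expected = hardy_weinberg_expected(*observed)
print(round(q, 4), [round(e, 1) for e in expected])
print(chi_square(observed, expected))
```

At low alternative allele frequency nearly all carriers are expected to be heterozygous, in line with the pattern described in the text.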

The Relationship of Spanish Populations to Other Populations

Variants located in coding regions with a minor allele frequency (MAF) > 0.01 were used to carry out a principal components analysis (PCA) using the SNPRelate program. Supplementary figure S2, Supplementary Material online, represents the two main axes of the PCA, which depicts the relationship between the MGP population and the different 1000G populations. As expected, the Spanish population is closely related to other European populations, with the Italian (from Tuscany) population (TSI) being the closest. Labels in the plot are located at the average of the coordinates for each individual. Also as expected, the location of the Spanish population in the plot coincides with the Spanish individuals included in the 1000G project (IBS population).
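The general idea of this analysis can be sketched with a plain SVD-based PCA. The sketch below does not reproduce SNPRelate (an R/Bioconductor package); it assumes a genotype matrix coded as 0, 1, or 2 copies of the alternative allele, with individuals in rows and variants in columns, and the function name and defaults are hypothetical.

```python
import numpy as np


def pca_from_genotypes(genotypes, maf_threshold=0.01, n_components=2):
    """Project individuals onto the leading principal components.

    genotypes: array-like (individuals x variants) of alternative allele counts (0/1/2).
    """
    g = np.asarray(genotypes, dtype=float)
    alt_freq = g.mean(axis=0) / 2.0
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    kept = g[:, maf > maf_threshold]        # drop rare and monomorphic variants
    centered = kept - kept.mean(axis=0)     # center each variant column
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy input: two groups that differ at two variants, plus one monomorphic variant.
toy = np.array([[0, 0, 0]] * 5 + [[2, 2, 0]] * 5)
coords = pca_from_genotypes(toy)
print(coords.shape)
```

Averaging the resulting coordinates per population would give label positions of the kind described for the plot.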

Disease Variants and Disease Risk in the Spanish Population

All 170,888 variant positions found in the 267 exomes of the MGP population were screened for known disease variants present in the Human Genome Mutation Database (HGMD, commercial release 2011.4). We identified the presence of 3,069 variants annotated in the disease database. Among them, 193 had MAFs in the Spanish population which were at least 2-fold higher than those found in the 1000G populations (supplementary table S1, Supplementary Material online). When compared with the 1000G subpopulation with European ancestry (the TSI, FIN, GBR, and CEU populations), 69 disease variants still showed MAFs in the MGP population which were at least 2-fold larger than those observed in European ancestry populations (table 2). Some examples of familial diseases with variants with remarkably high (between 4- and 18-fold) allelic frequencies in Spain are Marfan syndrome, Von Willebrand syndrome, Ellis-van Creveld syndrome, Wilson disease, cystinuria, Crohn’s disease, and Charcot-Marie-Tooth disease, to cite just a few. In particular, several degenerative retinal dystrophies seem to have associated mutations at unusually high frequencies in the Spanish population, such as autosomal dominant cone dystrophy (heterozygous in 4 out of the 267 Spanish samples but absent in the 1000G populations), retinitis pigmentosa, Leber congenital amaurosis, and Bardet–Biedl syndrome, as well as other ocular diseases such as primary open angle glaucoma or Stargardt disease. All these diseases showed significant differences in allele frequencies among the populations compared (adjusted P values <0.05; supplementary table S1, Supplementary Material online).
Table 2.

Variants Associated with Diseases That Have Allele Frequencies in the Spanish Population at Least 2-Fold Higher Than in the 1000G Populations.

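chr | Start | CT | R | A | MGP R/R | MGP R/A | MGP A/A | MGP MAF | 1000G R/R | 1000G R/A | 1000G A/A | 1000G MAF | Ratio | Ratio E | HGMD disease
22 | 36688178 | ns | G | A | 263 | 4 | 0 | 0.0075 | 1,078 | 0 | 0 | 0 | NA | NA | Epstein syndrome?
6 | 42141500 | ns | C | T | 263 | 4 | 0 | 0.0075 | 1,078 | 0 | 0 | 0 | NA | NA | Cone dystrophy, autosomal dominant
22 | 36688178 | syn | G | A | 263 | 4 | 0 | 0.0075 | 1,078 | 0 | 0 | 0 | NA | NA | Epstein syndrome?
17 | 36104650 | ns | C | A | 264 | 3 | 0 | 0.0056 | 1,078 | 0 | 0 | 0 | NA | NA | Diabetes, Maturity onset diabetes of the young (MODY)
15 | 48704816 | ns | G | A | 262 | 5 | 0 | 0.0094 | 1,077 | 1 | 0 | 5.00E-04 | 18.8000 | NA | Marfan syndrome
6 | 162206852 | ns | G | A | 262 | 5 | 0 | 0.0094 | 1,077 | 1 | 0 | 5.00E-04 | 18.8000 | NA | Parkinsonism. juvenile. autosomal recessive
1 | 216420460 | ns | C | A | 263 | 4 | 0 | 0.0075 | 1,077 | 1 | 0 | 5.00E-04 | 15.0000 | NA | Retinitis pigmentosa. recessive. no hearing loss
17 | 41199716 | sg | A | T | 264 | 3 | 0 | 0.0056 | 1,077 | 1 | 0 | 5.00E-04 | 11.2000 | NA | Ovarian cancer
19 | 8436373 | ns | C | T | 263 | 4 | 0 | 0.0075 | 1,076 | 2 | 0 | 9.00E-04 | 8.33333 | NA | Lower plasma triglyceride level
11 | 18050850 | ns | C | T | 263 | 4 | 0 | 0.0075 | 1,076 | 2 | 0 | 9.00E-04 | 8.33333 | NA | Attention deficit hyperactivity disorder
3 | 123376066 | ns | C | T | 263 | 4 | 0 | 0.0075 | 1,076 | 2 | 0 | 9.00E-04 | 8.33333 | NA | Aortic dissections?
7 | 107329557 | ns | T | C | 262 | 5 | 0 | 0.0094 | 1,075 | 3 | 0 | 0.0014 | 6.71429 | NA | Pendred syndrome
7 | 99032559 | ns | G | A | 250 | 17 | 0 | 0.0318 | 1,064 | 13 | 1 | 0.007 | 4.54286 | 12.23076 | Complex I deficiency
18 | 2937867 | ns | C | A | 260 | 7 | 0 | 0.0131 | 1,075 | 3 | 0 | 0.0014 | 9.35714 | 10.07692 | Psoriasis
15 | 58957371 | ns | C | G | 261 | 6 | 0 | 0.0112 | 1,076 | 2 | 0 | 9.00E-04 | 12.4444 | 8.615384 | Alzheimer disease. late onset
15 | 42684875 | nc | C | T | 261 | 6 | 0 | 0.0112 | 1,072 | 6 | 0 | 0.0028 | 4.00000 | 8.615384 | Muscular dystrophy. limb girdle
19 | 4159747 | ns | G | A | 262 | 5 | 0 | 0.0094 | 1,076 | 2 | 0 | 9.00E-04 | 10.4444 | 7.230769 | Hypertriglyceridemia
12 | 6143978 | ns | C | T | 262 | 5 | 0 | 0.0094 | 1,075 | 3 | 0 | 0.0014 | 6.71429 | 7.230769 | Von Willebrand. Normandy variant
1 | 115221116 | ns | C | A | 257 | 10 | 0 | 0.0187 | 1,073 | 5 | 0 | 0.0023 | 8.13043 | 7.192307 | Adenosine monophosphate deaminase deficiency
12 | 32994073 | ns | G | A | 257 | 10 | 0 | 0.0187 | 1,073 | 5 | 0 | 0.0023 | 8.13043 | 7.192307 | Arrhythmogenic right ventricular dysplasia/cardiomyopathy
4 | 5627493 | syn | G | T | 258 | 9 | 0 | 0.0169 | 1,074 | 3 | 1 | 0.0023 | 7.34783 | 6.5 | Ellis-van Creveld syndrome
2 | 71825797 | ns | C | G | 263 | 4 | 0 | 0.0075 | 1,077 | 1 | 0 | 5.00E-04 | 15.0000 | 5.769230 | Muscular dystrophy. limb girdle 2B
13 | 52534410 | ns | C | T | 263 | 4 | 0 | 0.0075 | 1,077 | 1 | 0 | 5.00E-04 | 15.0000 | 5.769230 | Wilson disease
8 | 145699735 | ns | G | C | 263 | 4 | 0 | 0.0075 | 1,077 | 1 | 0 | 5.00E-04 | 15.0000 | 5.769230 | Congenital heart defects
1 | 196709833 | ns | C | T | 263 | 4 | 0 | 0.0075 | 1,076 | 2 | 0 | 9.00E-04 | 8.33333 | 5.769230 | Factor H deficiency
1 | 94568686 | ns | C | T | 263 | 4 | 0 | 0.0075 | 1,076 | 2 | 0 | 9.00E-04 | 8.33333 | 5.769230 | Stargardt disease
5 | 110454719 | ns | A | G | 255 | 12 | 0 | 0.0225 | 1,074 | 4 | 0 | 0.0019 | 11.8421 | 5.625 | Glaucoma. primary open angle
11 | 67799622 | ns | C | T | 260 | 7 | 0 | 0.0131 | 1,076 | 2 | 0 | 9.00E-04 | 14.5555 | 5.038461 | Complex I deficiency
8 | 100832259 | ns | A | G | 260 | 7 | 0 | 0.0131 | 1,074 | 4 | 0 | 0.0019 | 6.89474 | 5.038461 | Cohen syndrome
1 | 183532364 | ns | T | A | 260 | 7 | 0 | 0.0131 | 1,071 | 7 | 0 | 0.0032 | 4.09375 | 5.038461 | Chronic granulomatous disease
1 | 76211574 | ns | C | A | 264 | 3 | 0 | 0.0056 | 1,078 | 0 | 0 | 0 | NA | 4.307692 | Medium chain acyl CoA dehydrogenase deficiency
9 | 120475248 | ns | G | A | 261 | 6 | 0 | 0.0112 | 1,076 | 2 | 0 | 9.00E-04 | 12.444 | 4.307692 | Meningococcal disease?
22 | 45691554 | ns | C | T | 264 | 3 | 0 | 0.0056 | 1,077 | 1 | 0 | 5.00E-04 | 11.200 | 4.307692 | Renal adysplasia
11 | 88924465 | ns | C | A | 264 | 3 | 0 | 0.0056 | 1,077 | 1 | 0 | 5.00E-04 | 11.200 | 4.307692 | Albinism. oculocutaneous 1
14 | 21811213 | ns | A | G | 264 | 3 | 0 | 0.0056 | 1,077 | 1 | 0 | 5.00E-04 | 11.200 | 4.307692 | Leber congenital amaurosis
14 | 21811213 | ns | A | G | 264 | 3 | 0 | 0.0056 | 1,077 | 1 | 0 | 5.00E-04 | 11.200 | 4.307692 | Retinitis pigmentosa?
13 | 48939088 | ns | C | T | 264 | 3 | 0 | 0.0056 | 1,077 | 1 | 0 | 5.00E-04 | 11.200 | 4.307692 | Retinoblastoma
14 | 23862646 | ns | C | A | 264 | 3 | 0 | 0.0056 | 1,077 | 1 | 0 | 5.00E-04 | 11.200 | 4.307692 | Cardiomyopathy. dilated
5 | 70945029 | ns | T | C | 261 | 6 | 0 | 0.0112 | 1,074 | 4 | 0 | 0.0019 | 5.89474 | 4.307692 | Complex I deficiency
18 | 2925359 | ns | C | T | 245 | 22 | 0 | 0.0412 | 1,061 | 17 | 0 | 0.0079 | 5.21519 | 4.12 | Psoriasis
6 | 31729925 | ns | C | T | 225 | 40 | 2 | 0.0824 | 980 | 87 | 11 | 0.0506 | 1.62846 | 4.12 | Leukemia. risk. association with
17 | 56348226 | sp | T | G | 259 | 8 | 0 | 0.015 | 1,075 | 3 | 0 | 0.0014 | 10.7142 | 3.75 | Myeloperoxidase deficiency
1 | 203194834 | ns | C | T | 262 | 5 | 0 | 0.0094 | 1,076 | 2 | 0 | 9.00E-04 | 10.444 | 3.615384 | Chitotriosidase deficiency
16 | 16259579 | syn | G | A | 262 | 5 | 0 | 0.0094 | 1,075 | 3 | 0 | 0.0014 | 6.71429 | 3.615384 | Pseudoxanthoma elasticum
2 | 44513202 | ns | T | C | 262 | 5 | 0 | 0.0094 | 1,075 | 3 | 0 | 0.0014 | 6.71429 | 3.615384 | Cystinuria
2 | 71738977 | ns | G | A | 262 | 5 | 0 | 0.0094 | 1,074 | 4 | 0 | 0.0019 | 4.94737 | 3.615384 | Muscular dystrophy. limb girdle/Miyoshi myopathy
13 | 32914592 | ns | C | T | 262 | 5 | 0 | 0.0094 | 1,074 | 4 | 0 | 0.0019 | 4.94737 | 3.615384 | Breast and/or ovarian cancer?
1 | 2234791 | ns | C | T | 260 | 7 | 0 | 0.0131 | 1,073 | 5 | 0 | 0.0023 | 5.69565 | 3.275 | Cleft lip?
15 | 75012987 | ns | G | T | 233 | 33 | 1 | 0.0655 | 1,048 | 29 | 1 | 0.0144 | 4.54861 | 3.275 | Colorectal cancer. reduced risk. association with
17 | 73837042 | ns | T | C | 263 | 4 | 0 | 0.0075 | 1,076 | 2 | 0 | 9.00E-04 | 8.33333 | 2.884615 | Hemophagocytic lymphohistiocytosis. Familial
1 | 12064892 | ns | G | A | 261 | 6 | 0 | 0.0112 | 1,073 | 4 | 1 | 0.0028 | 4.00000 | 2.8 | Charcot-Marie-Tooth disease 2a
21 | 44317156 | ns | A | C | 254 | 13 | 0 | 0.0243 | 1,070 | 8 | 0 | 0.0037 | 6.56757 | 2.43 | Complex I deficiency
21 | 44317156 | sp | A | C | 254 | 13 | 0 | 0.0243 | 1,070 | 8 | 0 | 0.0037 | 6.56757 | 2.43 | Complex I deficiency
18 | 58038832 | ns | T | G | 254 | 13 | 0 | 0.0243 | 1,069 | 9 | 0 | 0.0042 | 5.78571 | 2.43 | Obesity. autosomal dominant?
12 | 6103650 | ns | G | A | 254 | 13 | 0 | 0.0243 | 1,069 | 9 | 0 | 0.0042 | 5.78571 | 2.43 | Von Willebrand disease 1?
17 | 33430313 | ns | T | C | 254 | 13 | 0 | 0.0243 | 1,065 | 13 | 0 | 0.006 | 4.05000 | 2.43 | Breast cancer. increased risk. association with
13 | 49281554 | ns | A | G | 254 | 13 | 0 | 0.0243 | 1,061 | 17 | 0 | 0.0079 | 3.07595 | 2.43 | Atopy. association with
12 | 6458350 | ns | A | G | 242 | 25 | 0 | 0.0468 | 1,055 | 22 | 1 | 0.0111 | 4.21622 | 2.34 | Ischemic cerebrovascular events. association with
17 | 42463054 | ns | G | C | 255 | 12 | 0 | 0.0225 | 1,068 | 10 | 0 | 0.0046 | 4.89130 | 2.25 | Glanzmann thrombasthenia
12 | 22017410 | sp | C | T | 255 | 12 | 0 | 0.0225 | 1,066 | 12 | 0 | 0.0056 | 4.01786 | 2.25 | Myocardial infarction. association with
12 | 22017410 | ns | C | T | 255 | 12 | 0 | 0.0225 | 1,066 | 12 | 0 | 0.0056 | 4.01786 | 2.25 | Myocardial infarction. association with
1 | 158624528 | ns | G | T | 232 | 34 | 1 | 0.0674 | 1,042 | 34 | 2 | 0.0176 | 3.82955 | 2.246666 | Spherocytosis. association with?
22 | 46614274 | ns | C | G | 224 | 40 | 3 | 0.0861 | 1,027 | 49 | 2 | 0.0246 | 3.50000 | 2.1525 | Elevated plasma lipid conc. assoc. in diabetes
5 | 82491674 | ns | T | C | 233 | 34 | 0 | 0.0637 | 1,042 | 36 | 0 | 0.0167 | 3.81437 | 2.123333 | Lung cancer. susceptibility to. association with
19 | 36341311 | ns | T | A | 256 | 11 | 0 | 0.0206 | 1,072 | 6 | 0 | 0.0028 | 7.35714 | 2.06 | Focal segmental glomerulosclerosis
5 | 151202476 | ns | C | T | 256 | 11 | 0 | 0.0206 | 1,066 | 12 | 0 | 0.0056 | 3.67857 | 2.06 | Hyperekplexia
5 | 110428060 | ns | T | C | 256 | 11 | 0 | 0.0206 | 1,064 | 14 | 0 | 0.0065 | 3.16923 | 2.06 | Glaucoma. primary open angle. association with?
13 | 78475230 | ns | C | T | 256 | 11 | 0 | 0.0206 | 1,064 | 14 | 0 | 0.0065 | 3.16923 | 2.06 | Hirschsprung disease
1 | 227170648 | syn | C | T | 257 | 9 | 1 | 0.0206 | 1,063 | 15 | 0 | 0.007 | 2.94286 | 2.06 | Ubiquinone deficiency with cerebellar ataxia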

Note: The first column indicates the chromosome; the second column indicates the position of the variant; the third column, labeled CT, contains the consequence type (ns, nonsynonymous SNV; syn, synonymous; sg, stop gain; sp, variant affecting splicing; nc, ncRNA_exonic); the fourth column, labeled R, contains the reference allele in the position; the fifth column, labeled A, contains the alternative allele; the three following columns (sixth, seventh, and eighth), labeled R/R, R/A, and A/A, contain the number of individuals in which a reference homozygote (R/R), heterozygote (R/A), or an alternative homozygote (A/A) is found in the Spanish population, respectively; the ninth column, labeled MAF, contains the alternative allele frequency in the Spanish population; the three following columns (10th, 11th, and 12th) contain the number of individuals in which a reference homozygote (R/R), heterozygote (R/A), or an alternative homozygote (A/A) is found in the 1000G populations; the 13th column, labeled MAF, contains the alternative allele frequency in the 1000G populations; the 14th column, labeled Ratio, contains the ratio between the Spanish and the 1000G MAFs; the 15th column, labeled Ratio E, contains the ratio between the Spanish MAF and the 1000G MAFs of populations with European ancestry only; and finally, the 16th column, labeled HGMD disease, contains the description of the disease caused by the variant, which can be a causal effect or an association (when the description ends in “association with”) and can also be uncertain (in which case the definition includes a question mark).
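The MAF and Ratio columns can be related to the genotype counts with a short calculation. The sketch below is illustrative: the function names are hypothetical, and the observation that the tabulated ratios match a division of the rounded frequencies (for example 0.0094 / 5.00E-04 = 18.8 in the Marfan syndrome row) is an inference from the table values, not a documented procedure.

```python
def alt_allele_frequency(n_rr, n_ra, n_aa):
    """Alternative allele frequency from genotype counts (two alleles per individual)."""
    return (n_ra + 2 * n_aa) / (2 * (n_rr + n_ra + n_aa))


def fold_ratio(freq_a, freq_b):
    """Ratio of two frequencies; None when the denominator is zero (NA in the table)."""
    return None if freq_b == 0 else freq_a / freq_b


# Marfan syndrome row: 262/5/0 in MGP and 1,077/1/0 in 1000G.
mgp = round(alt_allele_frequency(262, 5, 0), 4)
kg = round(alt_allele_frequency(1077, 1, 0), 4)
print(mgp, kg, fold_ratio(mgp, kg))

# A variant absent from the comparison population has no defined ratio.
print(fold_ratio(0.0075, 0))
```

A 2-fold filter like the one used to build the table would then keep variants whose ratio is at least 2, with the NA cases handled separately.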

In contrast, the frequencies were more similar across all the populations for variants which conferred susceptibility to common diseases.
There are several exceptions, in which a particular variant, among the many associated with the disease, displayed remarkably higher allelic frequency in the Spanish population when compared with the 1000G populations. Such cases include certain forms of diabetes, juvenile Parkinsonism or late-onset Alzheimer (table 2 and supplementary table S1, Supplementary Material online). Furthermore, a few variants which have been associated with different types of cancer also displayed comparatively high allelic frequencies in the Spanish population, including variants for ovarian cancer, breast and ovarian cancer, retinoblastoma, and increased melanoma risk, among others (table 2 and supplementary table S1, Supplementary Material online). Variants associated with other diseases with less severe symptoms also had comparatively high allelic frequencies in the Spanish population (e.g., psoriasis and a type of autosomal dominant obesity). There were also relatively underrepresented variants in the Spanish population. As an anecdotal example, a variant associated with red hair (CM003595, gene MC1R, which causes both a nonsynonymous change and simultaneously affects an exomic ncRNA) occurs at a low frequency (0.0037) in the Spanish population compared with the relatively higher frequency in the 1000G populations with European ancestry (0.07). Similarly, variants associated with rare diseases also showed remarkable differences in allelic frequencies in the opposite direction. For example, the allele for the 2L form of Charcot-Marie-Tooth disease is underrepresented in the Spanish population (MAF 0.0037 versus 0.052 in 1000G, see supplementary table S1, Supplementary Material online), while the 2a form is overrepresented (see earlier). This is in agreement with the observation that some diseases are caused by different alleles in different populations (Fernandez et al. 2013) that can be relevant for diagnosis. 
Variants associated with several cardiovascular pathologies are also underrepresented in the Spanish population (supplementary table S1, Supplementary Material online). In addition, there are 376 rare variants annotated to diverse diseases present only in one Spanish exome and absent in all the 1000G individuals (supplementary table S1, Supplementary Material online).

Allele Frequencies in Mendelian and Rare versus Complex Diseases

Population differences in allele frequencies behave differently for complex diseases than for Mendelian and rare diseases. Figure 2 and supplementary figure S3, Supplementary Material online, provide comparative information about allele frequencies of all the diseases with more than five variants recorded in HGMD and reveal an interesting trend. Most associations with complex diseases show an almost identical distribution of frequencies for all the variants in all the genes, including associations with Alzheimer disease (fig. 2), schizophrenia, myocardial infarction, type 2 diabetes, obesity, or essential hypertension, as well as most cancer associations (fig. 2B and supplementary fig. S3, Supplementary Material online). In sharp contrast, Mendelian and rare diseases such as Marfan syndrome (fig. 2), Wilson disease, phenylketonuria, several degenerative retinopathies such as age-related macular degeneration, and many others (fig. 2) have remarkably different allelic frequencies in the Spanish population compared with the 1000G populations.
Fig. 2. Comparison of allelic frequencies described in HGMD between the MGP Spanish population and the 1000 genomes populations in four diseases with more than five variants. Upper left panel shows the frequencies in the Spanish MGP samples found for all the variants associated with Alzheimer disease in HGMD (X axis) versus the corresponding frequencies observed in all the individuals of the 1000 genomes populations (Y axis). Upper right panel presents a similar plot for variants described in HGMD as associated to leukemia risk. Lower left and right panels depict the same relationship for two rare diseases, Marfan syndrome, and age-related macular degeneration, respectively.


Relationship between Variant Frequencies and the Prevalence of the Disease in the Population

To test whether population differences in frequencies of risk alleles (for both complex and mendelian diseases) result in differences in the prevalence of the corresponding diseases, we have collected data from the “Global Burden of Disease database” (see Materials and Methods). We have found data on “differences in disability-adjusted life years” (DALYs), a widely used proxy of disease prevalence (Murray et al. 2015), for several of the diseases analyzed here. Interestingly, when the relative differences in allele frequencies found in the Spanish population with respect to European populations (TSI, FIN, GBR, and CEU) are compared with the corresponding relative differences in DALYs between Spain and the Central and East European populations, a remarkable correspondence between both parameters is found. Figure 3 depicts these relationships for the diseases showing the most extreme differences in allelic frequencies (Alzheimer, Attention deficit hyperactivity disorder, Parkinson, Psoriasis, and Cardiovascular diseases). Observed increases or decreases in allele frequencies in the Spanish population relative to European populations correspond to increases or decreases in the prevalence of the diseases, respectively.
Fig. 3. Comparison of the relative prevalence and MAFs for several of the diseases showing the most extreme differences in allelic frequencies. The two first bars in each disease represent the log2 of the ratios of prevalence of the disease (DALYs) in Spain with respect to the corresponding prevalence in Central and East Europe, respectively, and the third bar represents the log2 of the ratio of the MAF of alleles of the disease in Spain and the corresponding MAF in the European populations of 1000G. The diseases are abbreviated as: Alzheimer (AD), Attention deficit hyperactivity disorder (ADHD), Parkinson (PD), Psoriasis (PSO), and Cardiovascular diseases (CAD).


Variants of Pharmacogenetic Relevance

The 267 exomes of the MGP population were screened for variants of pharmacogenetic relevance. In particular, we considered variants which affect drug binding sites, thus potentially disrupting the binding domain, without being deleterious to the protein. These variants will likely cause total or partial drug binding inhibition, with potential effects such as resistance to treatments, or may even cause adverse effects. There are 112 variants affecting well-defined drug binding domains (Hopkins and Groom 2002). Among these, 31 are predominant in the Spanish population, with MAFs 1.5-fold higher than those observed in the 1000G populations (supplementary table S2, Supplementary Material online). For example, the gene CYP11B2 from the Cytochrome P450 family is affected in the binding site of different drugs (Eplerenone, Etomidate, Hydrocortisone, Metoclopramide, Metyrapone) used to treat a variety of diseases (including heart failure or hypercortisolism) or symptoms (antiemetic), by a nonsynonymous mutation whose MAF in the Spanish population is 15 times higher than that observed in the 1000G populations. Binding sites for several statins, drugs for migraine treatment, analgesics, and others (up to a total of 31) were also found to be affected by nonsynonymous variants present in the Spanish population at higher frequencies (over 1.5 times higher) than in the 1000G populations. Binding sites for several natural substances were also comparatively more affected by nonsynonymous variants in the MGP population than in the 1000G populations, including Ursodeoxycholic acid, Vitamin A, Choline (B-vitamin complex), L-Tryptophan, Glycine, and Tetrahydrobiopterin, among others. 
On the other hand, there are also 46 drug and natural compound binding sites that were remarkably less affected by variants in the Spanish population than in the 1000G reference population (with MAFs which are less than one-half that observed in 1000G; see supplementary table S2, Supplementary Material online). The fact that different drug binding sites are affected at different frequencies in different populations could account for population-specific differences in sensitivity, efficiency, and even resistance to drugs or their adverse effects. Supplementary table S2, Supplementary Material online lists other natural substances of interest. Since the binding sites of these substances may have been under negative selective pressure we studied possible deviations from the Hardy–Weinberg equilibrium which could be caused by a deleterious allele. Only a few of them (32; see supplementary table S2, Supplementary Material online) deviated significantly from the equilibrium, which suggests that either the majority of variants do not deactivate the binding sites or that there are other binding sites which compensate for their putative loss. Among the variants that deviate from the Hardy–Weinberg equilibrium there is one that affects the binding site for several compounds, including Glutamic Acid, in the gene GRIN3A. This gene is a glutamate receptor known to be under geographically localized positive selection, and to be related to obesity, coronary artery calcification, and Thiazide-induced adverse metabolic effects in hypertensive patients (Colonna et al. 2014). Moreover, four variants which affect three binding sites for NADH display significantly lower frequencies in the Spanish population when compared with the 1000G populations and one of them, located in the SORD gene, significantly deviates from the Hardy–Weinberg equilibrium (adjusted P value = 0.00005). 
The same occurs with a binding site for L-Phenylalanine, L-Tyrosine, and Tetrahydrobiopterin, to which the antihypertensive drug Metyrosine (a tyrosine hydroxylase enzyme inhibitor), also binds (supplementary table S2, Supplementary Material online). Interestingly, a variant which affects the binding site of two diuretic drugs (Amiloride, Triamterene) also significantly deviates from the Hardy–Weinberg equilibrium (adjusted P value < 0.00005). In these three cases, the 1000G population presented significantly higher MAFs and did not deviate from the Hardy–Weinberg equilibrium. We also observed that the corresponding frequencies in the Exome Variant Server (Fu et al. 2013) are similar to the 1000G population.
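The Hardy–Weinberg check used above can be illustrated with a minimal sketch of a one-degree-of-freedom χ2 test on genotype counts. The function names and the example counts are hypothetical and are not taken from the study; the study's own implementation and its multiple-testing adjustment are not reproduced here.

```python
import math


def hwe_chi2(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Pearson chi-square statistic comparing observed genotype counts
    with the counts expected under Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Classes with an expected count of zero (monomorphic sites) are skipped.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def chi2_pvalue_1df(stat: float) -> float:
    """Upper-tail probability of a chi-square distribution with 1 d.f."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts: no heterozygotes at an allele frequency of 0.5,
# which is a strong departure from Hardy-Weinberg proportions.
stat = hwe_chi2(50, 0, 50)
print(stat, chi2_pvalue_1df(stat))
```

With small expected counts, as is common for rare variants, an exact test is generally preferable to the χ2 approximation.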

Selective Pressures

We used the McDonald–Kreitman test (MKT) (McDonald and Kreitman 1991) to search for signals of natural selection acting on genes. Although a total of 145 genes, corresponding to 365 different transcripts, showed events of positive selection (supplementary table S3, Supplementary Material online lists genes with a nominal P value < 0.05), only MUC4 was still significant after multiple testing correction (adjusted P value = 1.7 × 10⁻⁶). Interestingly, MUC4 has a variant associated with Ulcerative colitis (HGMD ID CM066583) which is significantly underrepresented in the MGP population (adjusted P value = 0.00208) (supplementary table S1, Supplementary Material online). To find signals of recent positive selection, FST values were calculated as a measurement of population differentiation between the MGP population and all the European 1000G populations (TSI, FIN, GBR, and CEU, excluding IBS). As expected, the mean FST between these two groups was low (0.007), although several loci showed extreme values (see supplementary fig. S4, Supplementary Material online, for a histogram of the FST distribution). SNPs with exceptionally high FST values (FST > 0.2) were considered candidates for selection (152 SNPs, 0.41% of the SNPs, see supplementary table S4, Supplementary Material online). Interestingly, a SNP (rs2550270) in the gene MUC4 presented an extreme FST value (FST = 0.3; see supplementary table S4, Supplementary Material online), but it is not the same one associated with Ulcerative colitis.

Increased Resolution Using Local Variability to Find Disease Genes

Knowledge of local variability can also have a practical application in clinical research. The systematic use of WES for finding disease genes has proven to be very successful in discovering new disease genes (Bamshad et al. 2011). Since exomes contain a vast number of mutations the number of candidate disease variants must be reduced in a process of prioritization involving a series of filters to exclude variants that are not likely to cause disease. One of the most stringent filters involves discarding variants which are present in the population at frequencies similar to or above the prevalence of the disease itself (Goldstein et al. 2013). Since local population frequencies are not typically available, general repositories, such as 1000G and others, are used. As a practical demonstration of the importance of knowledge about local variability, here we describe a specific example of a large family affected by autosomal-dominant retinitis pigmentosa (adRP; OMIM 268000). The use of the conventional consecutive filtering approach, implemented in the BiERApp tool (Aleman et al. 2014) which includes a specific filtering step for local variants, enormously increases the discovery power of the methodology. The family, of Spanish origin, comprises three generations and our study included seven affected members (fig. 4), who were clinically diagnosed with adRP, following ophthalmic criteria as previously described (Mendez-Vidal et al. 2013). All the affected individuals were derived from the Ophthalmology Department of the Genetic, Reproduction and Fetal Medicine Department at the Hospital Virgen del Rocio (Seville, Spain). The family did not present any known mutation for adRP and none of their members was included in the 267 MGP samples analyzed. The KING program (Manichaikul et al. 2010) was used to confirm the absence of any possible kinship between the family studied and any of the samples in the MGP population.
Fig. 4. Effect of filtering out variants with high MAFs using frequency data inferred either from the available databases (1000G) or from the MGP Spanish population sequenced here. (A) Pedigree of the family studied with seven members affected by adRP. (B) Segregation analysis across the family was carried out, followed by a step filtering out the variants found in a reference population with a MAF incompatible with the observed prevalence of adRP. The plot represents the number of candidate variants that segregate with the family as a growing number of affected members were used to select the variants (from one to seven) and when two reference populations (1000G, pale blue and the MGP Spanish local population, dark blue) were used to filter out variants with MAFs that were too high to be compatible with the prevalence observed for the disease (>0.001 in 1000G and >0.004 in the MGP population, respectively). The filtering effect on the local Spanish population was drastically more stringent than for the 1000G population.

WES of all the affected patients as well as the grandmother (with a genetic background common to all of them) was carried out as described in Materials and Methods. The selection of heterozygous variants segregating with the pedigree in figure 4 raised many possible candidates. Since the incidence of the disease is below 1 in 4,000 (Ayuso and Millan 2010), variants present in normal populations at frequencies higher than 0.001, the lowest frequency that can be obtained from 1000G (Durbin et al. 2010) populations, were discarded as putative disease-causing variants. The dark blue bars in figure 4B correspond to the number of variants which do not appear in the 1000G populations but that still segregate with the family when a growing number of affected individuals (from one to seven) are used to filter out variants. 
When we instead used the local Spanish population (MGP; sequenced here) to filter out variants present in the healthy population (pale blue bars in fig. 4B) the filtering power was strongly increased. We used the BiERapp tool (Aleman et al. 2014) to apply consecutive filters (segregation along the family pedigree, predicted pathogenicity, and population frequency) to select potential disease variants and genes. The analysis of variants shared by the seven affected members after filtering out those present in the Spanish population rendered a total of 7 possible variants corresponding to 7 candidate genes. We then performed cosegregation analysis using DNA samples from available family members which confirmed the presence of the variant in the family. The novel variant identified was subsequently screened in 200 healthy matched control subjects by Sanger sequencing, confirming its absence and thus validating the c.937-2_944del variant as a novel causal RP mutation. This variant produces the loss of a cryptic splice acceptor site in intron 4–5 of the RHO gene, and therefore the use of a cryptic splice site upstream of the normal acceptor splice site results in a truncated protein which might be subject to nonsense-mediated decay in these patients. Sanger sequencing was used to validate the mutation found. For this purpose, specific primers encompassing the RHO intron 4–exon 5 junction were designed using the Primer3 software (Rozen and Skaletsky 2000) with sense and antisense sequences TACAGAACACCCTTGGCACA and AGGTGTAGGGGATGGGAGAC, respectively, rendering an amplicon length of 424 bp and a Tm of 62° C.
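The consecutive filtering logic described here (segregation among affected members, then removal of variants that are too frequent in a reference population) can be sketched as follows. This is an illustrative toy, not the BiERapp implementation: the function names, variant identifiers, family labels, and frequencies are all invented for the example, and the pathogenicity filter is omitted.

```python
from typing import Dict, Iterable, Set


def shared_candidates(carriers: Dict[str, Set[str]],
                      affected: Iterable[str]) -> Set[str]:
    """Return the variants carried by every listed affected member."""
    members = list(affected)
    if not members:
        return set()
    shared = set(carriers[members[0]])
    for member in members[1:]:
        shared &= carriers[member]
    return shared


def filter_by_reference_maf(variants: Iterable[str],
                            reference_maf: Dict[str, float],
                            max_maf: float) -> Set[str]:
    """Keep variants that are absent from the reference population
    or whose reference MAF does not exceed max_maf."""
    return {v for v in variants if reference_maf.get(v, 0.0) <= max_maf}


# Hypothetical heterozygous calls for three affected relatives.
carriers = {
    "II-1": {"v1", "v2", "v3"},
    "II-2": {"v1", "v2"},
    "III-1": {"v1", "v2", "v4"},
}
# Hypothetical frequencies: v2 is rare globally but common locally.
global_maf = {"v2": 0.0005}
local_maf = {"v2": 0.03}

shared = shared_candidates(carriers, ["II-1", "II-2", "III-1"])
print(sorted(filter_by_reference_maf(shared, global_maf, 0.001)))
print(sorted(filter_by_reference_maf(shared, local_maf, 0.001)))
```

In this toy case the local reference removes a variant that the global reference lets through, which is the effect the family study illustrates.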

Discussion

Our work describes with precision the level of variation observed in the genomic coding regions of 267 unrelated healthy Spanish individuals, which makes it the largest study to date on local variation in a single population. Thanks to large sample size, the conclusions can be considered more significant than those obtained from general studies involving multiple populations with smaller numbers of individuals (Durbin et al. 2010; Li et al. 2010; Corona et al. 2013; Fu et al. 2013; Moreno-Estrada et al. 2013). However, it must also be taken into account that, while the population-specific results can be considered robust, the existence of a certain bias in the comparative analysis of the Spanish and the 1000G populations, derived from the fact that they are independent experiments, using different sequencing technologies and sampling strategies cannot be completely ruled out. Here, we document that while the polymorphic variants private to the Spanish population are almost completely described in this work by analyzing only 267 individuals (fig. 1), the rate of discovery of new rare variants with increasing numbers of sequenced individuals was still far from reaching a plateau. Although many Spanish rare variants remain to be discovered, the use of the population frequencies obtained in this work does already afford increased ability (relative to the use of 1000G project data) to filter candidate variants in a Spanish family which could otherwise be interpreted as possible disease variants (fig. 4). As expected given the magnitude of the rare variation discovered in the Spanish population, a significant number of variants were related to diseases (Kryukov et al. 2007; MacArthur and Tyler-Smith 2010; MacArthur et al. 2012; Xue et al. 2012). 
When the frequencies of disease variants or disease-risk variants in the Spanish population are compared with the corresponding frequencies observed in the 1000G populations an interesting trend emerges: complex disease variants seem to have similar allelic frequencies in both the MGP Spanish samples and 1000G populations. In contrast, mendelian or rare diseases tend to present dissimilar allelic frequency distributions (supplementary fig. S3, Supplementary Material online). In these, and other similar diseases, the most prevalent alleles are different in distinct populations. This observation agrees with the fact that, while high-frequency variants and variants underlying complex diseases tend to be shared across populations (Marigorta and Navarro 2013), low-frequency alleles tend to be private (Casals et al. 2013). This, together with recent discoveries of new population-specific variants causal of inherited diseases, such as retinopathies (Méndez-Vidal et al. 2014), strongly points at a crucial role of private mutations in the configuration of the mutational spectrum of certain diseases. In other words, geographic heterogeneity in the genetic architecture of disease, that is, the fact that different variants, often in different genes, can cause common multigenic diseases in different populations (Fernandez et al. 2013), may be more frequent than expected, again highlighting the need for local variation catalogs (Bustamante et al. 2011). Since the use of drugs is very recent in evolutionary terms, it is expectable that the observed differences in frequencies in the variants located in a number of drug binding sites are due to population founder effects rather than any selective processes. However, once an area of a protein’s surface turns out to be a drug binding site it becomes relevant from a clinical perspective. We observed a total of 121 variants affecting the binding sites of different drugs. 
We also observed differences in the frequencies of variants affecting the binding sites of some natural products, such as Vitamin A, Choline, L-Tryptophan, Glycine, etc. In this case, some selective effect against mutations in the binding sites could be hypothesized. It is known that genes related to xenobiotic metabolism (Arbiza et al. 2006), adaptation to pathogens (Karlsson et al. 2014), or dietary change (Luca et al. 2010) are under selective pressures due to their relationship to disease in modern humans (Babbitt et al. 2011; Engelken et al. 2014). However, the study of Hardy–Weinberg equilibrium does not support the existence of selective pressure against any of the alleles for most of the cases of variants affecting the binding sites of natural products. Therefore, either the power of the test is too low or many of the studied variations are unlikely to inactivate these binding sites. In this study only three specific binding sites, corresponding to NADH, L-Phenylalanine, and L-Tyrosine, were affected by variants that significantly deviate from the Hardy–Weinberg equilibrium, thus suggesting the existence of some type of selection against such variants in the Spanish population which has not been detected in the 1000G project. Our findings clearly highlight the importance of local variability in any study which attempts to relate genotype to phenotype, specifically when the phenotype is a disease. In the example given, the local variability filter discarded five times more candidate variants (false positives) than the filter based only on population allelic frequencies derived from foreign populations (1000G). Although the need for population-specific catalogs of genetic variation has been previously noted (26), our results clearly reveal the quantitative magnitude of the differences expected between the use of a general population and a local population, and its impact on clinically relevant human variation. 
To foster research about other pathologies, we have made publicly available through the Spanish population variant server web page (http://spv.babelomics.org/, last accessed January 28, 2016) all the relevant information on population frequencies for the 170,888 variant positions found in this study.

Materials and Methods

Human Subjects

Following informed consent, 267 unrelated samples of Spanish origin, which were phenotyped as healthy, were obtained and further anonymized and sequenced. The criteria followed for declaring them healthy were the absence of current known disease or genetic conditions in the family history, although diseases appearing at older ages cannot be completely ruled out. The samples were collected in 2004 and stored in the Biobank at the Hospital Virgen del Rocio (Seville, Spain), where they were routinely used as controls for genotype studies. Their geographical origin corresponds mainly to the North (Galicia and Catalonia), the center (Madrid), and the South of Spain (Andalucia). The sampling centers were the Camas and the Candelaria hospitals (Andalucia), the Dr Joan Vilaplana Hospital (Cataluña), the Hospital Clínico Universitario de Santiago (Galicia), and the Almodeva Hospital (Madrid). The number of individuals sampled in each location was approximately proportional to the populations of the corresponding regions. Because the samples were sequenced in the context of the Medical Genome Project, we called this population MGP. Samples were obtained in accordance with the approved protocols of the respective institutional review boards for the protection of human subjects. The study conformed to the tenets of the declaration of Helsinki.

Human Populations

A total of 13 human populations were used in this study which included: European populations TSI from Tuscany in Italy (98 samples), FIN Finnish from Finland (93 samples), GBR British from England and Scotland (89 samples), CEU residents of Utah (CEPH collection) with northern and western European ancestry (85 samples), and the IBS, from Spain (14 samples); Asian populations CHB Han Chinese in Beijing, China (97 samples), CHS southern Han Chinese (100 samples), and JPT Japanese in Tokyo, Japan (89 samples); American populations were MXL Mexican Ancestry in Los Angeles, CA (66 samples), PUR Puerto Rican in Puerto Rico (55 samples), and CLM Colombian in Medellin, Colombia (60 samples); and African populations were YRI Yoruba in Ibadan, Nigeria (88 samples), LWK Luhya in Webuye, Kenya (97 samples), and ASW African Ancestry in Southwest United States (61 samples). The exome sequences of all the individuals corresponding to the 13 populations were downloaded from the 1000 genomes web page (http://www.1000genomes.org/, last accessed January 28, 2016) in multisample variant calling format (VCF). Finally, we used the MGP Spanish samples (267), totaling 1,359 studied individuals.

Construction of DNA Libraries and Sequencing

Library preparation and exome capture were carried out according to a protocol based on the Baylor College of Medicine protocol version 2.1 (with several minor modifications). First, 5 µg of input genomic DNA is sheared, end-repaired and ligated with specific adaptors. A fragment size distribution ranging from 160 to 180 bp after shearing and 200–250 bp after adaptor ligation was verified using a Bioanalyzer (Agilent). The library is amplified using a precapture linker-mediated polymerase chain reaction (LM-PCR) using a FastStart High Fidelity PCR System (Roche) and barcoded primers. After purification, 2 µg of LM-PCR product are hybridized to NimbleGen SeqCap EZ Exome libraries V3. After washing, amplification is performed by postcapture LM-PCR using a FastStart High-Fidelity PCR System (Roche). Capture enrichment is measured by qPCR according to the NimbleGen protocol. The successfully captured DNA is measured by Quant-iT PicoGreen dsDNA reagent (Invitrogen) and subjected to standard sample preparation procedures for sequencing on the SOLiD 5500xl platform as recommended by the manufacturer. Emulsion PCR is performed on the E80 scale (about 1 billion template beads) using a concentration of 0.616 pM that contains 4 equimolecular pooled libraries of enriched DNA. After breaking and enrichment, ∼276 million enriched template beads are sequenced per lane on a 6-lane SOLiD 5500xl slide.

Sequencing Data Analysis

A customized pipeline for processing the raw sequences (FastQ files) was applied. First, sequence reads were aligned to the reference human genome build GRCh37 (hg19) by using the SHRiMP tool (Rumble et al. 2009). Correctly mapped reads were further filtered with SAMtools (Li et al. 2009), which was also used for sorting and indexing mapping files. Only high-quality sequence reads mapping to the reference human genome in unique locations were used for calling variants. The Genome Analysis Toolkit (GATK) (McKenna et al. 2010) was used to realign the reads around known indels and for base quality score recalibration. Identification of single nucleotide variants and indels was performed using GATK standard hard filtering parameters (DePristo et al. 2011). The result of this pipeline was a VCF file for each sequenced sample. All the VCF files were then scanned for disease variants (variants with a reported association with a disease). The program VARIANT (Medina et al. 2012), which contains annotations from the latest versions of Ensembl Variation (Flicek et al. 2012), Uniprot (Magrane and Consortium 2011), dbSNP (Sherry et al. 2001), and HGMD (Stenson et al. 2009), was used for this purpose. Positions that were not determined (because of any quality control problems or lack of coverage) for more than 75% of the samples analyzed were not considered in the study.

Tests for Selection

We used the MKT (McDonald and Kreitman 1991) to test for possible selective pressures in the MGP population. The MKT is based on the comparison of the ratio of nonsynonymous to synonymous SNPs between species (Dn/Ds) and within species (Pn/Ps). Assuming that synonymous mutations behave neutrally, a higher Dn/Ds than Pn/Ps ratio is expected in case of adaptive selection, because mutations that are positively selected in a population, and therefore rise quickly to fixation, would contribute more to divergence than to polymorphism. We estimated the per gene proportion of base substitutions fixed by natural selection, α (Smith and Eyre-Walker 2002) with: $$\alpha = 1 - \frac{D_s P_n}{D_n P_s}$$ where Pn and Ps are the total number of nonsynonymous and synonymous polymorphisms and Dn and Ds the number of nonsynonymous and synonymous divergent differences, respectively. Significant positive values of α (Dn/Pn > Ds/Ps) indicate an excess of fixation of nonneutral mutations suggesting that positive selection is driving a change in this gene. Measures of polymorphism in the MGP population and human–chimpanzee divergence were used to calculate α for all coding sequences. Sites diverging between panTro2 and hg19 were inferred on the Galaxy website (http://main.g2.bx.psu.edu/, last accessed January 28, 2016) using the regional variation/fetch substitutions from the pairwise alignments tool. For each identified nucleotide substitution, pairwise alignments of panTro2/hg19 were downloaded using the fetch alignments/fetch pairwise MAF blocks tool for our set of genomic intervals. In addition, to detect signals of recent positive selection we calculated the FST values (Weir and Cockerham 1984) between the MGP population and all European 1000G populations (TSI, FIN, GBR, and CEU, excluding IBS) as a measure of population differentiation between two populations. Only polymorphic positions with a maximum of 75% missing data and with a MAF of at least 10% were used. 
These SNPs were assigned to genes using the SNP Nexus webtool (Dayem Ullah et al. 2012). In the cases in which SNP Nexus could not assign genes to SNPs, they were assigned manually using the Ensembl Biomart (GRCh37 archive site) (Kinsella et al. 2011) and dbSNP (Sherry et al. 2001) websites. Finally, as additional evidence for possible selection against the alternative allele homozygote, deviations from the Hardy–Weinberg proportions were tested for all the variants with a χ2 test (Wigginton et al. 2005).
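As a worked illustration of the MKT summary statistic, the sketch below computes α with the estimator of Smith and Eyre-Walker (2002), α = 1 − (Ds·Pn)/(Dn·Ps), from the four per-gene counts. The function name and the counts are hypothetical; the significance testing applied in the study is not reproduced.

```python
def mkt_alpha(dn: int, ds: int, pn: int, ps: int) -> float:
    """Proportion of substitutions fixed by selection, estimated as
    alpha = 1 - (Ds * Pn) / (Dn * Ps).

    dn, ds: nonsynonymous and synonymous divergent sites between species.
    pn, ps: nonsynonymous and synonymous polymorphisms within the species.
    """
    if dn == 0 or ps == 0:
        raise ValueError("alpha is undefined when Dn or Ps is zero")
    return 1.0 - (ds * pn) / (dn * ps)


# Hypothetical gene with an excess of nonsynonymous divergence:
# Dn/Pn = 4 exceeds Ds/Ps = 1, so alpha is positive.
print(mkt_alpha(dn=20, ds=10, pn=5, ps=10))
```

A value of zero corresponds to the neutral expectation (Dn/Ds equal to Pn/Ps), and negative values indicate an excess of nonsynonymous polymorphism.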

Statistical Analysis

The program SNPRelate (Zheng et al. 2012) was used to carry out principal component analyses. A χ2 test was used to assess the significance of the differences in allele frequencies in the MGP population when compared with the 1000G populations. Multiple testing adjustments were performed using the False Discovery Rate method (Benjamini and Hochberg 1995). The KING program (Manichaikul et al. 2010), with the parameters --unrelated --degree 2 (which extracts a list of individuals with no first- or second-degree relationship between any pairs), was used to confirm that all the samples used in the study were unrelated individuals. The VCF file from the sequencing data was converted into the PLINK binary format required by KING via VCFtools (Danecek et al. 2011). Functional enrichment was assessed using the FatiGO (Al-Shahrour et al. 2004) as implemented in Babelomics (Alonso et al. 2015). We also applied a gene set enrichment analysis as described in Daub et al. (2013) using the sum of FST values as summary statistic of genes in a gene set.
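The False Discovery Rate adjustment mentioned above can be sketched with a plain implementation of the Benjamini–Hochberg step-up procedure. The function name and the input P values are hypothetical; the software actually used for the adjustment in the study is not stated here.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value to the smallest so that the
    # adjusted values stay monotone and never exceed 1.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical nominal P values for four variants.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```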

Disease Variants, Genes, and Definitions

Disease annotations were taken from HGMD commercial release 2011.4 (Stenson et al. 2009). A total of 76,128 annotations, including 69,965 unique variants (denoted by the chromosome and start position in chromosomal coordinates) are contained in the database. HGMD incorporates different types of variants that include not only single base pair substitutions and indels affecting coding regions but also variants with consequences for mRNA splicing and regulatory abnormalities. HGMD stores only disease-causing mutations and disease-associated/functional polymorphisms but also includes a number of SNPs from GWAS, although only if there is evidence of an effect on function. Since not all the HGMD identifiers were still public at the time of writing this article, we included the source from which the evidence of the mutation was taken (a publication) in the tables presented here. The DALYs were taken as an approximation to the prevalence of the diseases. DALYs from the “Global Burden of Disease” Study (Murray et al. 2015) were obtained from the repository of this study (http://www.healthdata.org/gbd, last accessed January 28, 2016).

Definition of Deleterious Variants

Variants with “synonymous” or “unknown” functional consequences were filtered out. Then, using the VARIANT software (Medina et al. 2012), the putative impact and damaging effect of the variants on protein function were predicted by computing both SIFT (Kumar et al. 2009) and Polyphen (Ramensky et al. 2002) damage scores and the phastCons (Siepel et al. 2005) conservation score. Since the conservation score is the only parameter applicable to any type of position, it was used as a primary filter. Variants with a phastCons conservation score higher than 200 were selected as damaging variants. SIFT and Polyphen scores were only used when available. SIFT scores lower than 0.05 and/or Polyphen scores higher than 0.95 indicate that a variant is most likely deleterious. Variants with at least two of the three pathogenicity predictors indicating a potential deleterious effect were considered deleterious.
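The two-of-three rule can be sketched as follows. This is one reading of the description above rather than the authors' implementation: the thresholds come from the text, but the function name `is_deleterious` and the treatment of missing SIFT or Polyphen scores (a missing score simply casts no vote) are assumptions.

```python
from typing import Optional

# Thresholds as stated in the text.
PHASTCONS_MIN = 200    # conservation score must be higher than this
SIFT_MAX = 0.05        # SIFT score must be lower than this
POLYPHEN_MIN = 0.95    # Polyphen score must be higher than this


def is_deleterious(phastcons: Optional[float],
                   sift: Optional[float] = None,
                   polyphen: Optional[float] = None) -> bool:
    """Return True when at least two of the three predictors flag the variant.

    Assumption: an unavailable score (None) counts as "no evidence", so a
    variant with only a phastCons score cannot reach two votes.
    """
    votes = 0
    if phastcons is not None and phastcons > PHASTCONS_MIN:
        votes += 1
    if sift is not None and sift < SIFT_MAX:
        votes += 1
    if polyphen is not None and polyphen > POLYPHEN_MIN:
        votes += 1
    return votes >= 2
```

For example, `is_deleterious(350, sift=0.01, polyphen=0.5)` is true (conservation and SIFT agree), whereas `is_deleterious(350, sift=0.2, polyphen=0.5)` is false because only the conservation score flags the variant.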

Drug Targets

A list of 130 drug binding domains was extracted from “The druggable genome” publication (Hopkins and Groom 2002). The domains were mapped to the corresponding proteins using InterPro (version 47.0, 20 May 2014) (Hunter et al. 2012). To identify domains affected by substitutions that disrupt the domain without being deleterious for the protein, only SNVs mapping to drug binding domains with SIFT and Polyphen values outside the deleteriousness range (i.e., SIFT scores higher than 0.05 and Polyphen scores lower than 0.95) were considered.
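A minimal sketch of this selection step is given below. It assumes each SNV has already been annotated with a flag saying whether it maps to one of the drug binding domains; the record layout, the variant identifiers, and the function name `is_tolerated_domain_snv` are hypothetical, and SNVs lacking either score are excluded here because the text does not say how they were handled.

```python
from typing import Optional

SIFT_DELETERIOUS_BELOW = 0.05      # SIFT < 0.05 is the deleterious range
POLYPHEN_DELETERIOUS_ABOVE = 0.95  # Polyphen > 0.95 is the deleterious range


def is_tolerated_domain_snv(in_drug_binding_domain: bool,
                            sift: Optional[float],
                            polyphen: Optional[float]) -> bool:
    """Keep SNVs in drug binding domains that both predictors call tolerated.

    Assumption: both scores must be present for the SNV to be considered.
    """
    if not in_drug_binding_domain or sift is None or polyphen is None:
        return False
    return sift > SIFT_DELETERIOUS_BELOW and polyphen < POLYPHEN_DELETERIOUS_ABOVE


# Hypothetical annotated SNVs, for illustration only.
snvs = [
    {"id": "snv_a", "in_domain": True, "sift": 0.40, "polyphen": 0.10},
    {"id": "snv_b", "in_domain": True, "sift": 0.01, "polyphen": 0.99},
    {"id": "snv_c", "in_domain": False, "sift": 0.40, "polyphen": 0.10},
    {"id": "snv_d", "in_domain": True, "sift": None, "polyphen": 0.10},
]

selected = [v["id"] for v in snvs
            if is_tolerated_domain_snv(v["in_domain"], v["sift"], v["polyphen"])]
```

With these example records only `snv_a` is retained: `snv_b` falls in the deleterious range, `snv_c` lies outside any drug binding domain, and `snv_d` has no SIFT score.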

Supplementary Material

Supplementary tables S1–S4 and supplementary figures S1–S4 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
  78 in total

1.  dbSNP: the NCBI database of genetic variation.

Authors:  S T Sherry; M H Ward; M Kholodov; J Baker; L Phan; E M Smigielski; K Sirotkin
Journal:  Nucleic Acids Res       Date:  2001-01-01       Impact factor: 16.971

2.  Primer3 on the WWW for general users and for biologist programmers.

Authors:  S Rozen; H Skaletsky
Journal:  Methods Mol Biol       Date:  2000

3.  The druggable genome.

Authors:  Andrew L Hopkins; Colin R Groom
Journal:  Nat Rev Drug Discov       Date:  2002-09       Impact factor: 84.694

4.  FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes.

Authors:  Fátima Al-Shahrour; Ramón Díaz-Uriarte; Joaquín Dopazo
Journal:  Bioinformatics       Date:  2004-01-22       Impact factor: 6.937

5.  A note on exact tests of Hardy-Weinberg equilibrium.

Authors:  Janis E Wigginton; David J Cutler; Goncalo R Abecasis
Journal:  Am J Hum Genet       Date:  2005-03-23       Impact factor: 11.025

6.  Principal components analysis corrects for stratification in genome-wide association studies.

Authors:  Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal:  Nat Genet       Date:  2006-07-23       Impact factor: 38.330

7.  Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes.

Authors:  Adam Siepel; Gill Bejerano; Jakob S Pedersen; Angie S Hinrichs; Minmei Hou; Kate Rosenbloom; Hiram Clawson; John Spieth; Ladeana W Hillier; Stephen Richards; George M Weinstock; Richard K Wilson; Richard A Gibbs; W James Kent; Webb Miller; David Haussler
Journal:  Genome Res       Date:  2005-07-15       Impact factor: 9.043

8.  Adaptive protein evolution in Drosophila.

Authors:  Nick G C Smith; Adam Eyre-Walker
Journal:  Nature       Date:  2002-02-28       Impact factor: 49.962

9.  Human non-synonymous SNPs: server and survey.

Authors:  Vasily Ramensky; Peer Bork; Shamil Sunyaev
Journal:  Nucleic Acids Res       Date:  2002-09-01       Impact factor: 16.971

10.  Positive selection, relaxation, and acceleration in the evolution of the human and chimp genome.

Authors:  Leonardo Arbiza; Joaquín Dopazo; Hernán Dopazo
Journal:  PLoS Comput Biol       Date:  2006-04-28       Impact factor: 4.475

  34 in total

1.  Four Years' Experience in the Diagnosis of Very Long-Chain Acyl-CoA Dehydrogenase Deficiency in Infants Detected in Three Spanish Newborn Screening Centers.

Authors:  B Merinero; P Alcaide; E Martín-Hernández; A Morais; M T García-Silva; P Quijada-Fraile; C Pedrón-Giner; E Dulin; R Yahyaoui; J M Egea; A Belanger-Quintana; J Blasco-Alonso; M L Fernandez Ruano; B Besga; I Ferrer-López; F Leal; M Ugarte; P Ruiz-Sala; B Pérez; C Pérez-Cerdá
Journal:  JIMD Rep       Date:  2017-07-29

2.  G534E Variant in HABP2 and Nonmedullary Thyroid Cancer.

Authors:  Macarena Ruiz-Ferrer; Raquel M Fernández; Elena Navarro; Guillermo Antiñolo; Salud Borrego
Journal:  Thyroid       Date:  2016-07       Impact factor: 6.568

3.  New variants in Spanish Niemann-Pick type c disease patients.

Authors:  Laura López de Frutos; Jorge J Cebolla; Luis Aldámiz-Echevarría; Ángela de la Vega; Sinziana Stanescu; Carlos Lahoz; Pilar Irún; Pilar Giraldo
Journal:  Mol Biol Rep       Date:  2020-02-14       Impact factor: 2.316

4.  Broad phenotypes in heterozygous NR5A1 46,XY patients with a disorder of sex development: an oligogenic origin?

Authors:  Núria Camats; Mónica Fernández-Cancio; Laura Audí; André Schaller; Christa E Flück
Journal:  Eur J Hum Genet       Date:  2018-06-11       Impact factor: 4.246

5.  Whole-Exome Sequencing Reveals Uncaptured Variation and Distinct Ancestry in the Southern African Population of Botswana.

Authors:  Gaone Retshabile; Busisiwe C Mlotshwa; Lesedi Williams; Savannah Mwesigwa; Gerald Mboowa; Zhuoyi Huang; Navin Rustagi; Shanker Swaminathan; Eric Katagirya; Samuel Kyobe; Misaki Wayengera; Grace P Kisitu; David P Kateete; Eddie M Wampande; Koketso Maplanka; Ishmael Kasvosve; Edward D Pettitt; Mogomotsi Matshaba; Betty Nsangi; Marape Marape; Masego Tsimako-Johnstone; Chester W Brown; Fuli Yu; Adeodata Kekitiinwa; Moses Joloba; Sununguko W Mpoloka; Graeme Mardon; Gabriel Anabwani; Neil A Hanchard
Journal:  Am J Hum Genet       Date:  2018-04-26       Impact factor: 11.025

6.  Unravelling the genetic basis of simplex Retinitis Pigmentosa cases.

Authors:  Nereida Bravo-Gil; María González-Del Pozo; Marta Martín-Sánchez; Cristina Méndez-Vidal; Enrique Rodríguez-de la Rúa; Salud Borrego; Guillermo Antiñolo
Journal:  Sci Rep       Date:  2017-02-03       Impact factor: 4.379

7.  Admixture Has Shaped Romani Genetic Diversity in Clinically Relevant Variants.

Authors:  Neus Font-Porterias; Aaron Giménez; Annabel Carballo-Mesa; Francesc Calafell; David Comas
Journal:  Front Genet       Date:  2021-06-16       Impact factor: 4.599

8.  A view on clinical genetics and genomics in Spain: of challenges and opportunities.

Authors:  Teresa Pàmpols; Feliciano J Ramos; Pablo Lapunzina; Ignasi Gozalo-Salellas; Luis A Pérez-Jurado; Aurora Pujol
Journal:  Mol Genet Genomic Med       Date:  2016-07-18       Impact factor: 2.183

9.  Improving the management of Inherited Retinal Dystrophies by targeted sequencing of a population-specific gene panel.

Authors:  Nereida Bravo-Gil; Cristina Méndez-Vidal; Laura Romero-Pérez; María González-del Pozo; Enrique Rodríguez-de la Rúa; Joaquín Dopazo; Salud Borrego; Guillermo Antiñolo
Journal:  Sci Rep       Date:  2016-04-01       Impact factor: 4.379

10.  Whole Exome Sequencing reveals new candidate genes in host genomic susceptibility to Respiratory Syncytial Virus Disease.

Authors:  Antonio Salas; Jacobo Pardo-Seco; Miriam Cebey-López; Alberto Gómez-Carballa; Pablo Obando-Pacheco; Irene Rivero-Calle; María-José Currás-Tuala; Jorge Amigo; José Gómez-Rial; Federico Martinón-Torres
Journal:  Sci Rep       Date:  2017-11-21       Impact factor: 4.379
