
Beyond Homozygosity Mapping: Family-Control analysis based on Hamming distance for prioritizing variants in exome sequencing.

Atsuko Imai¹, Akihiro Nakaya², Somayyeh Fahiminiya³, Martine Tétreault³, Jacek Majewski³, Yasushi Sakata⁴, Seiji Takashima⁵, Mark Lathrop³, Jurg Ott⁶.

Abstract

A major challenge in current exome sequencing in autosomal recessive (AR) families is the lack of an effective method to prioritize single-nucleotide variants (SNVs). AR families are generally too small for linkage analysis, and the length of homozygous regions is unreliable for identifying causative variants. Common filtering steps usually yield a list of candidate variants that cannot be narrowed down further or ranked. To prioritize shortlisted SNVs, we consider each homozygous candidate variant together with a set of SNVs flanking it. We compare the resulting array of genotypes between an affected family member and a number of control individuals and argue that, in a family, differences between the affected family member and controls should be larger for a pathogenic variant and the SNVs flanking it than for a random variant. We assess differences between the arrays of two individuals by the Hamming distance and develop a suitable test statistic, which is expected to be large for a causative variant and its flanking SNVs. We prioritized candidate variants based on this statistic, applied our approach to six patients with known pathogenic variants, and found these variants to lie in the top 2% to 10% of ranked candidates.

Year: 2015 PMID: 26143870 PMCID: PMC5155624 DOI: 10.1038/srep12028

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379

In current sequencing of patients in autosomal recessive (AR) families, candidate disease variants are generally prioritized by well-known filtering steps [1,2]. Homozygosity mapping is also often applied to identify long runs of homozygosity [3], which may be interpreted as harboring segments of DNA identical by descent (IBD), but length alone is known to be a poor statistic for this purpose [4]. Information from unaffected individuals and estimated haplotype frequencies for identifying ancestral haplotypes may aid in the identification of IBD segments [4]. Here we developed a novel method to prioritize candidate variants in AR families by directly comparing segments of sequence variants between an affected family member and control individuals from the same population; that is, our approach compares a single affected individual (from a small AR family) with a number of control individuals.
Consider a set of variants within a distance d of a candidate variant. In each individual, case and controls, we select homozygous variant sites from the sequence VCF files. We distinguish only two states, v/v and “not v/v” (anything other than v/v), where v is the variant allele (also called the alternate allele). For any two individuals, we want to measure how much their two arrays of variants differ. We do this with the Hamming distance [5,6], which is the number of elements that differ between two arrays. For our set of variants and selection criteria, two individuals A and B can exhibit the following numbers of genotype pairs (A, B): n1 (v/v, v/v), n2 (v/v, not v/v), and n3 (not v/v, v/v). The Hamming distance is then n2 + n3. In words, for a set of variants within distance d of a candidate locus, the Hamming distance between individuals A and B is the number of homozygous variants occurring only in A plus the number of homozygous variants occurring only in B.
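
The pair counts and the Hamming distance defined above can be illustrated with a minimal Python sketch. It assumes that each individual is represented simply by the set of positions at which it is homozygous for the variant allele (v/v) and that d is given in base pairs; the function names and the toy positions are hypothetical and are not taken from the HDR program.

```python
from typing import Iterable, Set, Tuple


def window_sites(hom_var_positions: Iterable[int], candidate_pos: int, d: int) -> Set[int]:
    """Keep the v/v positions lying within d base pairs of the candidate variant."""
    return {p for p in hom_var_positions if abs(p - candidate_pos) <= d}


def pair_counts(a: Set[int], b: Set[int]) -> Tuple[int, int, int]:
    """Return (n1, n2, n3) for individuals A and B."""
    n1 = len(a & b)  # v/v in both A and B
    n2 = len(a - b)  # v/v in A, not v/v in B
    n3 = len(b - a)  # not v/v in A, v/v in B
    return n1, n2, n3


def hamming_distance(a: Set[int], b: Set[int]) -> int:
    """Number of sites at which the two genotype arrays differ: n2 + n3."""
    _, n2, n3 = pair_counts(a, b)
    return n2 + n3


# Toy example (made-up positions): candidate at 500, window d = 450 bp.
case = window_sites([100, 250, 400, 900, 5000], candidate_pos=500, d=450)
ctrl = window_sites([250, 600, 900, 7000], candidate_pos=500, d=450)
print(pair_counts(case, ctrl), hamming_distance(case, ctrl))  # (2, 2, 1) 3
```

Sites at which neither individual is v/v never enter the counts, because only homozygous variant sites are selected in the first place.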
To allow for varying numbers of basepairs flanking a candidate variant, we define a relative Hamming distance, or Hamming Distance Ratio, HDR = (n2 + n3)/(n1 + n2 + n3), which is our measure of the distance between the sets of variant genotypes in two individuals. Individuals affected with a rare autosomal recessive trait tend to have parents who are related, so that the two disease alleles tend to be identical by descent (IBD), that is, they are likely to be copies of one ancestral allele. Because crossovers are rare very close to the disease locus, SNVs in its vicinity also tend to be IBD and, thus, homozygous [3]. For this reason, we want to see whether distances between affected and control individuals are larger for true disease variants than for other candidate variants. Various approaches may be taken for such comparisons. We found the following procedure appealing and powerful. For an affected family member and n control individuals, we form all possible pairs of individuals and distinguish the n pairs containing the affected individual (group 1) from the n(n−1)/2 pairs consisting only of control individuals (group 2). Then we compare mean HDR values between the two groups by a one-sided t statistic, in the expectation that, at a pathogenic variant, the group 1 mean exceeds the group 2 mean. We do this for a range of values, d = 100 kb through 1000 kb in steps of 100 kb, and retain the largest t value, tmax = max_d t_d, where t_d is the t statistic obtained for window size d. The limiting values of 100 and 1000 kb were chosen empirically; in our experience, the maximum t statistic often occurs inside this interval. In principle, the lower limit could be 0 and the upper limit could be the length of the given chromosome; the main difference would generally be increased computation time. Candidate variants are then ranked by their tmax value, with rank 1 corresponding to the largest tmax value.
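
As an illustration only, the procedure above might be sketched in Python as follows. The sketch assumes a pooled-variance two-sample t statistic (the text does not say which form of the t statistic is used), represents each individual by the set of positions of its v/v sites, sets HDR to 0 when no v/v site falls in the window, and needs at least three controls so that both groups have a sample variance. All function names are hypothetical; this is not the HDR program.

```python
from itertools import combinations
from math import copysign, inf, sqrt
from statistics import mean, variance


def hdr(a: set, b: set) -> float:
    """Hamming Distance Ratio (n2 + n3) / (n1 + n2 + n3) for two sets of v/v positions."""
    n1, n2, n3 = len(a & b), len(a - b), len(b - a)
    total = n1 + n2 + n3
    return (n2 + n3) / total if total else 0.0  # assumption: 0 when the window is empty


def t_statistic(group1, group2) -> float:
    """Pooled-variance two-sample t for mean(group1) - mean(group2) (an assumption)."""
    k1, k2 = len(group1), len(group2)
    diff = mean(group1) - mean(group2)
    pooled = ((k1 - 1) * variance(group1) + (k2 - 1) * variance(group2)) / (k1 + k2 - 2)
    se = sqrt(pooled * (1 / k1 + 1 / k2))
    if se == 0:  # degenerate case: no spread within either group
        return 0.0 if diff == 0 else copysign(inf, diff)
    return diff / se


def within(positions, candidate, d):
    """v/v positions within d base pairs of the candidate variant."""
    return {p for p in positions if abs(p - candidate) <= d}


def t_max(case_pos, controls_pos, candidate, windows=range(100_000, 1_000_001, 100_000)):
    """Largest one-sided t over the window sizes d (100 kb to 1000 kb by default)."""
    best = -inf
    for d in windows:
        case = within(case_pos, candidate, d)
        ctrls = [within(c, candidate, d) for c in controls_pos]
        group1 = [hdr(case, c) for c in ctrls]                   # n case-control pairs
        group2 = [hdr(x, y) for x, y in combinations(ctrls, 2)]  # n(n-1)/2 control pairs
        best = max(best, t_statistic(group1, group2))
    return best
```

Because the test is one-sided, only large positive values of the statistic count as evidence; ranking candidate variants then amounts to sorting them by the returned value in decreasing order.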
As the number of candidate variants may differ between affected individuals, we also compute the percentile (top %) of the rank of the disease variant. For the data described below, calculations were carried out with our HDR program. While our focus is on ranking (prioritizing) candidate variants, we also calculate empirical significance levels (p-values) associated with a given tmax statistic (candidate variant) as follows. We create null data by treating each of the n control individuals in turn as a pseudo-affected individual. All other individuals, including the affected, then represent pseudo-control individuals. Thus, we construct n null datasets, each consisting of 1 pseudo-affected and n pseudo-controls, in analogy to the 1 affected and n controls in the observed data. In each of the n null datasets, we perform the same analysis as on the observed data. For a given candidate variant, the p-value associated with the observed tmax statistic is computed as the proportion of null datasets with a null tmax value at least as large as the observed tmax value. To be conservative, we include the observed data with the null data, so the smallest possible significance level is p = 1/(n + 1). These statistical analyses can be performed with a suitably modified version of the maxstat program written in Pascal.
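
The construction of the null datasets can be sketched as below. This is a hedged illustration, not the maxstat or HDR code: the statistic is passed in as an arbitrary function `stat(affected, controls)` (a hypothetical interface), and the p-value is taken as (k + 1)/(n + 1), where k is the number of null datasets whose statistic is at least as large as the observed one, which is one reading of "including the observed data with the null data".

```python
from statistics import mean
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def empirical_p(affected: T, controls: Sequence[T],
                stat: Callable[[T, Sequence[T]], float]) -> float:
    """Empirical p-value from n null datasets, one per control used as pseudo-affected."""
    observed = stat(affected, controls)
    n = len(controls)
    exceed = 0
    for i in range(n):
        pseudo_affected = controls[i]
        # All other individuals, including the true affected, act as pseudo-controls.
        pseudo_controls = list(controls[:i]) + list(controls[i + 1:]) + [affected]
        if stat(pseudo_affected, pseudo_controls) >= observed:
            exceed += 1
    return (exceed + 1) / (n + 1)  # observed data counted with the null data


# Toy statistic on plain numbers: distance of the "affected" from the control mean.
def toy_stat(x, others):
    return x - mean(others)


print(empirical_p(10, [1, 2, 3], toy_stat))  # 0.25, the minimum 1/(n + 1) for n = 3
```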

Results

As a proof of concept, we applied our approach to six patients from five different small AR families (Fig. 1) for which the disease-causing mutation had previously been identified and published [7,8,9,10,11,12,13,14] (Table 1). Families F1, F4, F6, and OI are members of the French-Canadian population of Québec, which originates from approximately 8500 French settlers who immigrated more than 300 years ago [15]. Family L1 is of European ancestry. For our approach to be valid, it is important that control individuals be from the same ethnic background as the patients. For families F1, F4, and F6, we chose 30 members of this same population as controls; for family OI, we had 32 control individuals available; and for family L1 we used 30 European control individuals. As our approach currently works with one affected individual versus a number of controls, the two affected individuals in family L1 were considered separately. All control individuals had previously been investigated for reasons unrelated to the patients. All individuals, cases and controls, had been exome-sequenced at the McGill University and Genome Québec Innovation Center, Montreal, after obtaining approval from the Institutional Review Board of McGill University and informed consent from all individuals. The detailed sequencing protocol is given in the Methods section.

To narrow down the list of variants, we applied several filtering steps. We selected exonic (missense, nonsense, and indel), splicing, or UTR variants annotated as homozygous or possibly homozygous with an allele frequency <5% in the 1000 Genomes and EVS databases. We then removed variants with quality scores less than 50, mapping quality scores less than 20, or read depths less than 5, as well as variants seen in more than 10 individuals in our exome database of ~1200 samples. These filtering steps resulted in the m = 10 through 50 final candidate variants shown in Table 1.
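
For illustration, the filtering criteria listed above could be expressed as a single predicate. This is a sketch under stated assumptions: the `Variant` record and all of its field names are hypothetical (they do not correspond to ANNOVAR or VCF field names), and the allele-frequency criterion is read as requiring a frequency below 5% in both databases.

```python
from dataclasses import dataclass

# Variant classes retained by the filter (labels are hypothetical).
KEPT_EFFECTS = {"missense", "nonsense", "indel", "splicing", "UTR"}


@dataclass
class Variant:
    effect: str         # annotated variant class
    homozygous: bool    # annotated as homozygous or possibly homozygous
    af_1000g: float     # allele frequency in 1000 Genomes
    af_evs: float       # allele frequency in EVS
    qual: float         # variant quality score
    map_qual: float     # mapping quality score
    depth: int          # read depth
    n_in_house: int     # carriers in the in-house exome database (~1200 samples)


def is_candidate(v: Variant) -> bool:
    """True if the variant survives every filtering step described in the text."""
    return (
        v.effect in KEPT_EFFECTS
        and v.homozygous
        and v.af_1000g < 0.05
        and v.af_evs < 0.05
        and v.qual >= 50
        and v.map_qual >= 20
        and v.depth >= 5
        and v.n_in_house <= 10
    )


rare = Variant("missense", True, 0.01, 0.0, 80.0, 60.0, 12, 2)
common = Variant("missense", True, 0.20, 0.18, 80.0, 60.0, 12, 2)
print(is_candidate(rare), is_candidate(common))  # True False
```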

Figure 1

Pedigree drawings for families F1, F4, F6, OI, and L1.

For families F1, F4, and F6, genotypes are marked for individuals with DNA available and tested; the following abbreviations are used: +, normal (wild-type) variant; Δ, rare mutant variant. Family OI: The affected individual (P1) is indicated with a solid symbol, heterozygotes are shown with half-solid symbols.

Table 1

Results of our family-control analysis for prioritizing m final candidate variants in an affected individual from five families.

Family	Gene	Disease	rank	m	%	p
F1	TTC7A	Multiple intestinal atresia	1	10	10.0	0.0645
F4	TTC7A	Multiple intestinal atresia	1	14	7.1	0.0645
F6	TTC7A	Multiple intestinal atresia	1	18	5.6	0.0645
OI	BMP1	Osteogenesis imperfecta	1	14	7.1	0.0303
L1a	POLR3B	Leukodystrophy	1	50	2.0	0.0645
L1b	POLR3B	Leukodystrophy	4	44	9.1	0.0645

Rank = order of test statistic (largest tmax ranked 1) for pathogenic variant among the m candidate variants; % = top percentile rank, 100 × rank/m; p = empirical significance level. L1a and L1b refer to two affected individuals in family L1.
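
The percentile column of Table 1 can be reproduced directly from the legend, as the short sketch below shows. The second helper encodes one reading of the p-value definition, (k + 1)/(n + 1) with k null datasets reaching the observed statistic among n controls; that the tabulated values 0.0645 and 0.0303 coincide with 2/31 and 1/33 is an arithmetic observation made here, not a statement from the text.

```python
# (family, rank, m) as listed in Table 1
table1 = [("F1", 1, 10), ("F4", 1, 14), ("F6", 1, 18),
          ("OI", 1, 14), ("L1a", 1, 50), ("L1b", 4, 44)]


def top_percent(rank: int, m: int) -> float:
    """Top percentile rank, 100 * rank / m, rounded to one decimal as in the table."""
    return round(100 * rank / m, 1)


def p_from_counts(k: int, n: int) -> float:
    """Assumed form of the empirical p-value: (k + 1) / (n + 1)."""
    return (k + 1) / (n + 1)


percents = {fam: top_percent(rank, m) for fam, rank, m in table1}
print(percents)
print(round(p_from_counts(1, 30), 4), round(p_from_counts(0, 32), 4))  # 0.0645 0.0303
```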

Results of our family-control analysis show that the known disease variants were prioritized into the top 2% to 10% of candidate variants (Table 1); that is, the HDR method narrowed the original shortlist of candidate variants to lists between 10-fold and 50-fold smaller. The p-values for the test statistic of the true disease variant ranged from 0.0303 through 0.0645. Combining five independent p-values (using only one individual in family L1) by the Fisher method [16,17] (pvalues program) results in a final empirical significance level of 0.0013. Thus, we demonstrated significantly larger distances between case and control individuals for homozygous pathogenic variants than for non-pathogenic variants.
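
The combined significance level can be checked with a few lines of Python. The sketch below is not the pvalues program; it implements the standard Fisher combination, in which X = −2 Σ ln p follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and uses the closed-form upper tail available for even degrees of freedom. The five input values are the Table 1 p-values for F1, F4, F6, OI, and one member of family L1.

```python
from math import exp, log
from typing import Sequence


def fisher_combined_p(pvalues: Sequence[float]) -> float:
    """Fisher's method: upper tail of chi-square with 2k df at X = -2 * sum(ln p)."""
    k = len(pvalues)
    half_x = -sum(log(p) for p in pvalues)  # X / 2
    # For even df = 2k: P(chi2 > X) = exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_x / j
        total += term
    return exp(-half_x) * total


combined = fisher_combined_p([0.0645, 0.0645, 0.0645, 0.0303, 0.0645])
print(round(combined, 4))  # 0.0013
```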

Discussion

Our approach has several advantages over existing homozygosity mapping (HM) methods: (1) Our HDR method provides a ranking of homozygous regions based on a test statistic, while HM approaches rank on the basis of ROH length, which is less reliable. (2) By using relatively homogeneous control individuals from the same population as the family members, we can assess inherited regions specific to disease pedigrees more accurately than with heterogeneous populations. (3) Most HM approaches work with sliding windows of a given size and additional parameters such as the minimum number of SNPs in an ROH, the minimum length of an ROH, and the maximum number of heterozygous SNPs in an ROH. These settings may or may not be optimal; in contrast, our HDR method employs a single estimated parameter for prioritizing candidate variants. Thus, the HDR approach does not require any parameters that need to be fixed at the outset. A limitation of our approach is that control individuals are required, while HM may be carried out on single (affected) individuals. We used 30 control individuals as a compromise between cost and efficiency, (1) because our approach proved successful with the given numbers, and (2) to obtain a p-value potentially smaller than 0.05 given that we include the observed data in our null data. Applied to our six patients and corresponding control individuals, the HDR method narrowed down the original number of shortlisted candidate variants to more than tenfold smaller lists.

Methods

Pathogenic mutations

Families F1/F4/F6

A pathogenic variant for multiple intestinal atresia [18] was found in three affected individuals from three different families. It is a homozygous four-base intronic deletion on chromosome 2 at positions 47,221,651-47,221,654 in the TTC7A gene, immediately adjacent to a consensus GT splice donor site.

Family OI

A pathogenic variant for osteogenesis imperfecta [12] was observed in an affected individual. It is a homozygous missense mutation, T > C, at position 22,058,957 on chromosome 8 in the UTR3 region of the BMP1 gene.

Family L1

A pathogenic variant for leukodystrophy [13] was observed in an affected brother-sister pair in this small family. It is a missense mutation, rs138249161, at position 106,432,421 on chromosome 12 in the POLR3B gene.

Sequencing protocol

Whole exome library preparation, capturing, sequencing and bioinformatics analyses were performed at the Genome Québec Innovation Center, Montreal, Canada, as detailed in our previous publications (see main text). In brief, 3 micrograms of DNA from 65 individuals were used for exome capture and sequencing. For each exome, the Burrows-Wheeler Aligner (BWA) version 0.5.9 [19] was used to align the sequencing reads (100 bp paired-end) to the human reference sequence (hg19). Alignments were converted with SAMtools [19] from SAM format to sorted, indexed BAM files. Regions surrounding potential indels were realigned with the GATK IndelRealigner tool [20]. Picard-tools were used to remove invalid alignments and duplicate reads from the BAM files [19]. Single nucleotide variants (SNVs) and indels were called with SAMtools (v. 0.1.17) mpileup and were then quality filtered so that at least 20% (SNVs) or 15% (indels) of reads supported the variant calls. All called variants were annotated with the ANNOVAR program [21] to identify exonic, splicing, or UTR variants and to add allele frequencies in the 1000 Genomes Project [22], the Exome Variant Server (EVS, version 6500) and dbSNP (version 135), as well as SIFT, PolyPhen-2 and PHASTCONS scores.
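
The read-support criterion in the protocol above (at least 20% of reads for SNVs, 15% for indels) can be written as a small check. This is a hypothetical helper for illustration, not part of SAMtools or of the pipeline used by the authors; the argument names are assumptions, and the comparison uses integer arithmetic so that values exactly at the threshold are kept.

```python
def passes_read_support(alt_reads: int, total_reads: int, is_indel: bool) -> bool:
    """Keep a call if the variant-supporting reads reach 20% (SNV) or 15% (indel)."""
    if total_reads <= 0:
        return False
    threshold_pct = 15 if is_indel else 20
    # alt_reads / total_reads >= threshold_pct / 100, without floating point
    return alt_reads * 100 >= threshold_pct * total_reads


print(passes_read_support(4, 20, is_indel=False))  # True: 20% of reads
print(passes_read_support(3, 20, is_indel=False))  # False: 15% is below the SNV cutoff
print(passes_read_support(3, 20, is_indel=True))   # True: 15% meets the indel cutoff
```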

Web Resources

HDR program: http://www.gi.med.osaka-u.ac.jp/software/hdr/
ANNOVAR Software: http://www.openbioinformatics.org/annovar/
Exome Variant Server, NHLBI GO Exome Sequencing Project (ESP): http://evs.gs.washington.edu/EVS/
GATK Software: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
Maxstat program: http://lab.rockefeller.edu/ott/programs
PVALUES program: http://www.jurgott.org/linkage/util.htm#pvalues

Additional Information

How to cite this article: Imai, A. et al. Beyond Homozygosity Mapping: Family-Control analysis based on Hamming distance for prioritizing variants in exome sequencing. Sci. Rep. 5, 12028; doi: 10.1038/srep12028 (2015).

18 in total

1. Mutations in C5ORF42 cause Joubert syndrome in the French Canadian population.

Authors: Myriam Srour; Jeremy Schwartzentruber; Fadi F Hamdan; Luis H Ospina; Lysanne Patry; Damian Labuda; Christine Massicotte; Sylvia Dobrzeniecka; José-Mario Capo-Chichi; Simon Papillon-Cavanagh; Mark E Samuels; Kym M Boycott; Michael I Shevell; Rachel Laframboise; Valérie Désilets; Bruno Maranda; Guy A Rouleau; Jacek Majewski; Jacques L Michaud
Journal: Am J Hum Genet Date: 2012-03-15 Impact factor: 11.025

2. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

3. Recessive mutations in POLR3B, encoding the second largest subunit of Pol III, cause a rare hypomyelinating leukodystrophy.

Authors: Martine Tétreault; Karine Choquet; Simona Orcesi; Davide Tonduti; Umberto Balottin; Martin Teichmann; Sébastien Fribourg; Raphael Schiffmann; Bernard Brais; Adeline Vanderver; Geneviève Bernard
Journal: Am J Hum Genet Date: 2011-10-27 Impact factor: 11.025

Review 4. Population history and its impact on medical genetics in Quebec.

Authors: A-M Laberge; J Michaud; A Richter; E Lemyre; M Lambert; B Brais; G A Mitchell
Journal: Clin Genet Date: 2005-10 Impact factor: 4.438

5. Osteogenesis imperfecta type V: marked phenotypic variability despite the presence of the IFITM5 c.-14C>T mutation in all patients.

Authors: Frank Rauch; Pierre Moffatt; Moira Cheung; Peter Roughley; Liljana Lalic; Allan M Lund; Norman Ramirez; Somayyeh Fahiminiya; Jacek Majewski; Francis H Glorieux
Journal: J Med Genet Date: 2013-01 Impact factor: 6.318

6. The Sequence Alignment/Map format and SAMtools.

Authors: Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal: Bioinformatics Date: 2009-06-08 Impact factor: 6.937

7. Homozygosity mapping: a way to map human recessive traits with the DNA of inbred children.

Authors: E S Lander; D Botstein
Journal: Science Date: 1987-06-19 Impact factor: 47.728

8. A polyadenylation site variant causes transcript-specific BMP1 deficiency and frequent fractures in children.

Authors: Somayyeh Fahiminiya; Hadil Al-Jallad; Jacek Majewski; Telma Palomo; Pierre Moffatt; Paul Roschger; Klaus Klaushofer; Francis H Glorieux; Frank Rauch
Journal: Hum Mol Genet Date: 2014-09-11 Impact factor: 6.150

9. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

10. Exome sequencing identifies mutations in the gene TTC7A in French-Canadian cases with hereditary multiple intestinal atresia.

Authors: Mark E Samuels; Jacek Majewski; Najmeh Alirezaie; Isabel Fernandez; Ferran Casals; Natalie Patey; Hélène Decaluwe; Isabelle Gosselin; Elie Haddad; Alan Hodgkinson; Youssef Idaghdour; Valerie Marchand; Jacques L Michaud; Marc-André Rodrigue; Sylvie Desjardins; Stéphane Dubois; Francoise Le Deist; Philip Awadalla; Vincent Raymond; Bruno Maranda
Journal: J Med Genet Date: 2013-02-19 Impact factor: 6.318
