
GenRisk: A tool for comprehensive genetic risk modeling.

Rana Aldisi1, Emadeldin Hassanin1, Sugirthan Sivalingam1,2,3, Andreas Buness1,2,3, Hannah Klinkhammer1,3, Andreas Mayr3, Holger Fröhlich4,5, Peter Krawitz1, Carlo Maj1,6.   

Abstract

SUMMARY: The genetic architecture of complex traits can be influenced by both many common regulatory variants with small effect sizes and rare deleterious variants in coding regions with larger effect sizes. However, the two kinds of genetic contributions are typically analyzed independently. Here we present GenRisk, a Python package for the computation and integration of gene scores based on the burden of rare deleterious variants and common-variant-based polygenic risk scores. The derived scores can be analyzed within GenRisk to perform association tests or to derive phenotype prediction models by testing multiple classification and regression approaches. GenRisk is compatible with VCF input file formats.
AVAILABILITY AND IMPLEMENTATION: GenRisk is an open-source, publicly available Python package that can be downloaded or installed from GitHub (https://github.com/AldisiRana/GenRisk).
SUPPLEMENTARY INFORMATION: GenRisk documentation is available online at https://genrisk.readthedocs.io/en/latest/.
© The Author(s) 2022. Published by Oxford University Press.

Year:  2022        PMID: 35266528      PMCID: PMC9048672          DOI: 10.1093/bioinformatics/btac152

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.931


1 Introduction

In the past decade, genome-wide association studies (GWAS) have been used extensively to investigate the genetic architecture of complex traits and diseases (Uffelmann ). However, despite the identification of many disease-associated common variants, which has also led to the development of several accurate polygenic risk score (PRS) models, a substantial part of the genetic architecture of common traits remains unknown (Lee ). This is known as missing heritability: the difference between the heritability observed in twin studies and the measured heritability explained by common variants (Génin, 2020). Several studies have suggested that the missing heritability is mainly attributable to rare variants (Young, 2019). In line with this hypothesis, many studies have observed that rare variants play a role in complex phenotypes such as hypertension (Russo ), schizophrenia (John ) and autism (Havdahl ). Burden tests are among the most widely applied methods for investigating rare variant effects from sequencing data. These methods typically collapse rare variants in a genetic region (e.g. a gene) into a single burden variable and then regress the phenotype on the burden variable to test for the cumulative effects of rare variants (Bomba ). The genetic contribution of common variants, on the other hand, is typically analyzed by means of PRS, usually computed as the weighted sum of risk alleles with respect to a phenotype, where the risk alleles and the corresponding weights are derived from a reference GWAS (Choi ). Generally, gene-based burden tests are applied to exome or targeted sequencing data, while GWAS is performed on post-imputation chip-array data for the genotyping of high-frequency variants. In light of the increasing availability of whole-genome sequencing data, there is a need for bioinformatics solutions that integrate different methodological approaches into a single framework.
With this aim in mind, we developed GenRisk, a Python package that seamlessly combines different tools and libraries to analyze genotype–phenotype associations by considering both polygenic effects and the enrichment of rare deleterious variants at the gene level.

2 Implementation

The GenRisk pipeline contains multiple modules, which can be run using a command-line interface or within a Python environment. The modules can be run sequentially, so that the output of one module is the input of the next. In addition, each module can be used independently with data provided by the user, which increases the flexibility of the tool for custom analyses. Starting from a VCF, GenRisk computes gene scores based on variant annotations. Given a phenotype and potential covariates (possibly including PRS), the individual gene scores can be used to perform association analyses and to build phenotype prediction models. Furthermore, an interactive command implements PRS computation; the PRS model can either be provided by the user or be one available in the PGS Catalog (https://www.pgscatalog.org/). The workflow of the pipeline is summarized in Figure 1. The following sections describe the main features of GenRisk.
Fig. 1.

GenRisk pipeline workflow. A VCF file with functional annotations and frequencies can be used to calculate gene-based scores; alternatively, a VCF can be used to extract and calculate PRS. The scores can then be used with phenotypic data for association analysis or to develop prediction models.

2.1 Gene-based scoring system

The gene scores are derived as the weighted sum of the variants in a gene. Each allele count is weighted by the product of a deleteriousness score and a coefficient based on the allele frequency. Namely, a weighting function is applied to the variant frequency to potentially up-weight the biological importance of rare variants. Two weighting functions are implemented: –log10, as already applied in another gene-based score tool (Mossotto ), and the beta density function, which has two parameters, α and β, that can be adjusted for more flexible weight calculation, as implemented in the sequence kernel association test (Lee ). An adjustable minor allele frequency (MAF) threshold can also be applied to retain only rare variants.
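The scoring scheme described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GenRisk's actual code or API: the function names (`log10_weight`, `beta_weight`, `gene_scores`), the toy data and the default parameter values are all hypothetical, and allele frequencies are assumed to lie strictly between 0 and 1.

```python
import math
import numpy as np

def log10_weight(af):
    """-log10 weighting: rarer variants receive larger weights."""
    return -np.log10(af)

def beta_weight(af, alpha=1.0, beta=25.0):
    """Beta density evaluated at the allele frequency (assumes 0 < af < 1)."""
    # log of the normalising constant B(alpha, beta)
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return np.exp((alpha - 1) * np.log(af) + (beta - 1) * np.log1p(-af) - log_norm)

def gene_scores(allele_counts, deleteriousness, allele_freq,
                weight_fn=beta_weight, maf_threshold=0.01):
    """Weighted sum of allele counts over the rare variants of one gene.

    allele_counts: (samples x variants) matrix of 0/1/2 counts.
    deleteriousness, allele_freq: one value per variant.
    """
    counts = np.asarray(allele_counts, dtype=float)
    delet = np.asarray(deleteriousness, dtype=float)
    freq = np.asarray(allele_freq, dtype=float)
    rare = freq < maf_threshold                      # keep only rare variants
    variant_weights = delet[rare] * weight_fn(freq[rare])
    return counts[:, rare] @ variant_weights         # one score per sample

# Toy example (invented values): two samples, three variants in one gene.
counts = [[0, 1, 2], [1, 0, 0]]
deleteriousness = [2.0, 3.0, 1.0]
freqs = [0.001, 0.005, 0.2]       # the third variant is common and is filtered out
scores_beta = gene_scores(counts, deleteriousness, freqs)
scores_log = gene_scores(counts, deleteriousness, freqs, weight_fn=log10_weight)
print(scores_beta, scores_log)
```

Keeping the weighting function as a parameter mirrors the choice between the two schemes described above; the MAF filter is applied before weighting so that common variants never contribute to the score.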

2.2 Genetic risk scores analysis

Depending on the distribution of the scores, different statistical tests can be applied to analyze gene–phenotype associations starting from the derived individual-based gene scores. The association analysis results are generated as summary statistics and can be visualized via QQ-plots and Manhattan plots. Prediction models are computed using PyCaret, an open-source machine learning Python library (Ali, 2020). The models can be generated for both quantitative and binary traits. The gene-based scores, as well as PRS and covariates such as sex and age, can be used as features. The data given by the user can be divided into training and testing sets (with flexible size). Cross-validation is applied to different models, and the best performing model is selected, tuned and finalized. The model is then saved and can be further evaluated with external testing sets. Model evaluation reports and testing set labels are exported. Graphs such as feature importance, confusion matrix and prediction error plots are also generated to visualize the model performance.
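For a quantitative trait, one of the simplest association tests is an ordinary least-squares regression of the phenotype on a gene score plus covariates. The sketch below is illustrative only and is not GenRisk's implementation: `linear_association` is a hypothetical helper, the data are simulated, and the p-value uses a normal approximation to the t distribution, which is reasonable only for large samples.

```python
import math
import numpy as np

def linear_association(gene_score, phenotype, covariates):
    """OLS of phenotype on intercept + gene score + covariates.

    Returns the gene-score effect, its standard error and a two-sided
    p-value (normal approximation, assumed adequate for large n).
    """
    y = np.asarray(phenotype, dtype=float)
    n = y.shape[0]
    X = np.column_stack([np.ones(n), gene_score, covariates])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    dof = n - X.shape[1]
    sigma2 = resid @ resid / dof                    # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)           # coefficient covariance
    se = math.sqrt(cov[1, 1])                       # index 1 = gene score
    z = coef[1] / se
    p_value = math.erfc(abs(z) / math.sqrt(2))      # two-sided
    return coef[1], se, p_value

# Simulated data (invented): true gene-score effect of 0.5, two covariates.
rng = np.random.default_rng(0)
n = 500
score = rng.exponential(1.0, size=n)
covars = rng.normal(size=(n, 2))
pheno = 0.5 * score + covars @ np.array([1.0, -1.0]) + rng.normal(size=n)

effect, se, p_value = linear_association(score, pheno, covars)
print(effect, se, p_value)
```

In a gene-wide scan the same function would be called once per gene, and the resulting p-values are what QQ-plots and Manhattan plots summarize.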

3 Usage case

We applied the pipeline to ≈ 160 000 samples from UK Biobank (application number 81202). The gene-based scores were calculated by applying the beta weighting function (α = 1, β = 25) to up-weight rare variants, with CADD (Rentzsch ) raw scores used as the deleteriousness weight; only variants with MAF < 1% were included. The derived scores were used for association testing and prediction modeling with respect to alkaline phosphatase measurements (Field 30610), with the first four genotyping principal components, sex, BMI and age included as covariates. The association analysis, based on a linear regression model, detected significant associations for the ALPL, GPLD1 and ASGR1 genes, all of which have previously been associated with alkaline phosphatase (Nioi ; Yuan ). In addition, a stochastic gradient boosted decision tree algorithm was identified as the best prediction model when both gene scores and PRS (from Sinnott-Armstrong ) were taken into account, and it showed improved prediction performance compared with the PRS-only model. Detailed results, as well as comparisons with other methods, can be found in the Supplementary Material.
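To give a sense of scale for the weights in this setting, the beta weighting can be worked out explicitly for α = 1, β = 25. The derivation assumes the standard normalized beta density; whether GenRisk applies exactly this normalization is an assumption here.

```latex
w(\mathrm{MAF}) = \frac{\mathrm{MAF}^{\alpha-1}\,(1-\mathrm{MAF})^{\beta-1}}{B(\alpha,\beta)},
\qquad B(1,25) = \frac{1}{25}
\;\Longrightarrow\;
w(\mathrm{MAF}) = 25\,(1-\mathrm{MAF})^{24}

w(0.001) \approx 24.4, \qquad w(0.01) \approx 19.6, \qquad
\lim_{\mathrm{MAF}\to 0} w(\mathrm{MAF}) = 25
```

Under this parameterization the weight decreases smoothly from 25 for the rarest variants to about 19.6 at the 1% MAF cut-off, so within the retained range the rarest variants are up-weighted only moderately relative to those near the threshold.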

4 Conclusion

GenRisk is a Python package that processes input VCF files to generate both gene-based burden scores and PRS for association tests and the development of prediction models. GenRisk provides a framework to model the effects of rare functional variants while considering the polygenic background. Thus, it is suitable for the analysis of phenotypes characterized by a complex genetic architecture.

Funding

C.M. and E.H. were supported by the BONFOR-program of the Medical Faculty, University of Bonn (O-147.0002).

Conflict of Interest: none declared.

Data availability

Genome-wide genotyping data, exome-sequencing data, and phenotypic data from the UK Biobank are available upon successful project application (http://www.ukbiobank.ac.uk/about-biobank-uk/). Restrictions apply to the availability of these data, which were used under license for the current study (Project ID: 81202).
References

1.  Optimal unified approach for rare-variant association testing with application to small-sample case-control whole-exome sequencing studies.

Authors:  Seunggeun Lee; Mary J Emond; Michael J Bamshad; Kathleen C Barnes; Mark J Rieder; Deborah A Nickerson; David C Christiani; Mark M Wurfel; Xihong Lin
Journal:  Am J Hum Genet       Date:  2012-08-02       Impact factor: 11.025

2.  Tutorial: a guide to performing polygenic risk score analyses.

Authors:  Shing Wan Choi; Timothy Shin-Heng Mak; Paul F O'Reilly
Journal:  Nat Protoc       Date:  2020-07-24       Impact factor: 13.491

3.  The impact of rare and low-frequency genetic variants in common disease.

Authors:  Lorenzo Bomba; Klaudia Walter; Nicole Soranzo
Journal:  Genome Biol       Date:  2017-04-27       Impact factor: 13.583

4.  Advances in the Genetics of Hypertension: The Effect of Rare Variants.

Authors:  Alessia Russo; Cornelia Di Gaetano; Giovanni Cugliari; Giuseppe Matullo
Journal:  Int J Mol Sci       Date:  2018-02-28       Impact factor: 5.923

5.  GenePy - a score for estimating gene pathogenicity in individuals using next-generation sequencing data.

Authors:  E Mossotto; J J Ashton; L O'Gorman; R J Pengelly; R M Beattie; B D MacArthur; S Ennis
Journal:  BMC Bioinformatics       Date:  2019-05-16       Impact factor: 3.169

6.  Solving the missing heritability problem.

Authors:  Alexander I Young
Journal:  PLoS Genet       Date:  2019-06-24       Impact factor: 5.917

7.  Genetics of 35 blood and urine biomarkers in the UK Biobank.

Authors:  Nasa Sinnott-Armstrong; Yosuke Tanigawa; Manuel A Rivas; David Amar; Nina Mars; Christian Benner; Matthew Aguirre; Guhan Ram Venkataraman; Michael Wainberg; Hanna M Ollila; Tuomo Kiiskinen; Aki S Havulinna; James P Pirruccello; Junyang Qian; Anna Shcherbina; Fatima Rodriguez; Themistocles L Assimes; Vineeta Agarwala; Robert Tibshirani; Trevor Hastie; Samuli Ripatti; Jonathan K Pritchard; Mark J Daly
Journal:  Nat Genet       Date:  2021-01-18       Impact factor: 38.330

8.  Population-based genome-wide association studies reveal six loci influencing plasma levels of liver enzymes.

Authors:  Xin Yuan; Dawn Waterworth; John R B Perry; Noha Lim; Kijoung Song; John C Chambers; Weihua Zhang; Peter Vollenweider; Heide Stirnadel; Toby Johnson; Sven Bergmann; Noam D Beckmann; Yun Li; Luigi Ferrucci; David Melzer; Dena Hernandez; Andrew Singleton; James Scott; Paul Elliott; Gerard Waeber; Lon Cardon; Timothy M Frayling; Jaspal S Kooner; Vincent Mooser
Journal:  Am J Hum Genet       Date:  2008-10       Impact factor: 11.025

9.  Variant ASGR1 Associated with a Reduced Risk of Coronary Artery Disease.

Authors:  Paul Nioi; Asgeir Sigurdsson; Gudmar Thorleifsson; Hannes Helgason; Arna B Agustsdottir; Gudmundur L Norddahl; Anna Helgadottir; Audur Magnusdottir; Aslaug Jonasdottir; Solveig Gretarsdottir; Ingileif Jonsdottir; Valgerdur Steinthorsdottir; Thorunn Rafnar; Dorine W Swinkels; Tessel E Galesloot; Niels Grarup; Torben Jørgensen; Henrik Vestergaard; Torben Hansen; Torsten Lauritzen; Allan Linneberg; Nele Friedrich; Nikolaj T Krarup; Mogens Fenger; Ulrik Abildgaard; Peter R Hansen; Anders M Galløe; Peter S Braund; Christopher P Nelson; Alistair S Hall; Michael J A Williams; Andre M van Rij; Gregory T Jones; Riyaz S Patel; Allan I Levey; Salim Hayek; Svati H Shah; Muredach Reilly; Gudmundur I Eyjolfsson; Olof Sigurdardottir; Isleifur Olafsson; Lambertus A Kiemeney; Arshed A Quyyumi; Daniel J Rader; William E Kraus; Nilesh J Samani; Oluf Pedersen; Gudmundur Thorgeirsson; Gisli Masson; Hilma Holm; Daniel Gudbjartsson; Patrick Sulem; Unnur Thorsteinsdottir; Kari Stefansson
Journal:  N Engl J Med       Date:  2016-05-18       Impact factor: 91.245

10.  CADD: predicting the deleteriousness of variants throughout the human genome.

Authors:  Philipp Rentzsch; Daniela Witten; Gregory M Cooper; Jay Shendure; Martin Kircher
Journal:  Nucleic Acids Res       Date:  2019-01-08       Impact factor: 16.971
