Zikun Yang 1, Chen Wang 1, Stephanie Erjavec 2, Lynn Petukhova 3,4, Angela Christiano 2,4, Iuliana Ionita-Laza 1

1. Department of Biostatistics, Columbia University, New York, NY 10032, USA.
2. Department of Genetics & Development, Columbia University, New York, NY 10032, USA.
3. Department of Epidemiology, Columbia University, New York, NY 10032, USA.
4. Department of Dermatology, Columbia University, New York, NY 10032, USA.
Abstract
MOTIVATION: Predicting the regulatory effects of genetic variants is a challenging but important problem in functional genomics. Given the relatively low sensitivity of functional assays and the pervasiveness of class imbalance in functional genomic data, popular statistical prediction models can sharply underestimate the probability of a regulatory effect. We describe here the presence-only model (PO-EN), a type of semi-supervised model, to predict regulatory effects of genetic variants at sequence-level resolution in a context of interest by integrating a large number of epigenetic features and massively parallel reporter assays (MPRAs).

RESULTS: Using experimental data from a variety of MPRAs, we show that the presence-only model produces better-calibrated predicted probabilities and has higher accuracy than state-of-the-art prediction models. Furthermore, we show that predictions based on pre-trained PO-EN models are useful for prioritizing functional variants among candidate eQTLs and significant SNPs at GWAS loci. In particular, for the costimulatory locus, which is associated with multiple autoimmune diseases, we show evidence of a regulatory variant residing in an enhancer 24.4 kb downstream of CTLA4, with capture Hi-C evidence that the enhancer interacts with CTLA4. Moreover, the risk allele of the regulatory variant is on the same risk-increasing haplotype as a functional coding variant in exon 1 of CTLA4, suggesting that the regulatory variant acts jointly with the coding variant to increase disease risk.

AVAILABILITY: The presence-only model is implemented in the R package 'PO.EN', freely available on CRAN. A vignette with a detailed demonstration of the proposed PO-EN model can be found on GitHub at https://github.com/Iuliana-Ionita-Laza/PO.EN/.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.