Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution.

Meng Yang, Lichao Huang, Haiping Huang, Hui Tang, Nan Zhang, Huanming Yang, Jihong Wu, Feng Mu.

Abstract

Interpreting the non-coding genome remains an unsolved challenge in human genetics because it is impractical to exhaustively annotate biochemically active elements under all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention-based, contextualized, pre-trained language model. LOGO is a substantially light architecture, containing only two self-attention layers with 1 million parameters, that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for the sequence labelling task and further extended to the variant prioritization task via a special input encoding scheme for alternative alleles, followed by the addition of a convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised model DeepSEA and 1% of the parameters of a recent BERT-based DNA language model. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer the underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labelling and for variant prioritization at base resolution.
© The Author(s) 2022. Published by Oxford University Press on behalf of Nucleic Acids Research.
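
The self-supervision step mentioned in the abstract (learning bidirectional representations from the unlabelled genome) can be illustrated with a masked-token setup. The following is a minimal sketch and not the authors' implementation: the overlapping k-mer tokenization, the 15% masking rate, and the names `kmer_tokenize`, `mask_tokens`, and `MASK` are all assumptions introduced for illustration, since the abstract does not specify them.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def kmer_tokenize(seq, k=3):
    """Split a DNA string into overlapping k-mers with stride 1 (assumed scheme)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, rate=0.15, seed=0):
    """Hide a random subset of tokens behind MASK.

    Returns (masked_tokens, labels). `labels` maps each masked position to
    the original token, which a model would be trained to recover using
    context on both sides of the gap.
    """
    rng = random.Random(seed)
    n_mask = max(1, math.ceil(rate * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = MASK
    return masked, labels


tokens = kmer_tokenize("ACGTTGCAAGCT", k=3)
masked, labels = mask_tokens(tokens, rate=0.15, seed=0)
print(masked)
print(labels)
```

Note that with overlapping k-mers the neighbours of a masked token still reveal most of its bases, so practical schemes may mask contiguous spans instead; this sketch ignores that detail.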

Year: 2022   PMID: 35536244   PMCID: PMC9371931   DOI: 10.1093/nar/gkac326

Source DB: PubMed   Journal: Nucleic Acids Res   ISSN: 0305-1048   Impact factor: 19.160


