Integrating convolution and self-attention improves language model of human genome for interpreting non-coding regions at base-resolution.

Meng Yang^1,2, Lichao Huang^1, Haiping Huang^1, Hui Tang^1, Nan Zhang^1, Huanming Yang^3,4, Jihong Wu^5,6,7, Feng Mu^1.

Abstract

Interpreting the non-coding genome remains an unsolved challenge in human genetics because it is impractical to exhaustively annotate biochemically active elements under all conditions. Deep learning-based computational approaches have recently emerged to help interpret non-coding regions. Here, we present LOGO (Language of Genome), a self-attention-based, contextualized, pre-trained language model. LOGO is a substantially lightweight architecture, containing only two self-attention layers with 1 million parameters, that applies self-supervision techniques to learn bidirectional representations of the unlabelled human reference genome. LOGO is then fine-tuned for sequence labelling tasks and further extended to variant prioritization via a special input encoding scheme for alternative alleles, followed by an added convolutional module. Experiments show that LOGO achieves a 15% absolute improvement for promoter identification and up to a 4.5% absolute improvement for enhancer-promoter interaction prediction. LOGO exhibits state-of-the-art multi-task predictive power on thousands of chromatin features with only 3% of the parameters of the fully supervised model DeepSEA and 1% of the parameters of a recent BERT-based DNA language model. For allelic-effect prediction, the locality introduced by one-dimensional convolution improves sensitivity and specificity for prioritizing non-coding variants associated with human diseases. In addition, we apply LOGO to interpret type 2 diabetes (T2D) GWAS signals and infer underlying regulatory mechanisms. We draw a conceptual analogy between natural language and the human genome and demonstrate that LOGO is an accurate, fast, scalable, and robust framework for interpreting non-coding regions, both for global sequence labelling and for variant prioritization at base resolution.
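The self-supervision step mentioned in the abstract can be illustrated with a minimal masked-token sketch: some tokens of an unlabelled sequence are hidden, and a model would be trained to recover them from the context on both sides. This is not taken from the paper. The overlapping k-mer tokenisation, the value k=3, the 15% mask rate, and the helper names `kmer_tokens` and `mask_tokens` are all assumptions chosen for illustration.

```python
import random

MASK = "[MASK]"  # placeholder token; the name is an assumption


def kmer_tokens(seq, k=3):
    """Split a DNA string into overlapping k-mers (stride 1)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide a random subset of tokens and return (inputs, labels).

    labels[i] holds the original token where inputs[i] is masked and
    None elsewhere, so a model would be scored only on hidden positions.
    """
    rng = random.Random(seed)
    # Mask at least one token when possible, never more than exist.
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in masked else t for i, t in enumerate(tokens)]
    labels = [t if i in masked else None for i, t in enumerate(tokens)]
    return inputs, labels


# Hypothetical 16-base sequence, for illustration only.
tokens = kmer_tokens("ACGTTGCAAGCTTAGC", k=3)
inputs, labels = mask_tokens(tokens)
```

Because overlapping k-mers share bases with their neighbours, masking a single token leaves most of its content visible in adjacent tokens. A practical setup would likely mask contiguous spans instead. The sketch keeps single-token masking only for brevity.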

Year: 2022 PMID: 35536244 PMCID: PMC9371931 DOI: 10.1093/nar/gkac326

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 19.160
