
Embeddings of genomic region sets capture rich biological associations in lower dimensions.

Erfaneh Gharavi, Aaron Gu, Guangtao Zheng, Jason P Smith, Hyun Jae Cho, Aidong Zhang, Donald E Brown, Nathan C Sheffield.

Abstract

MOTIVATION: Genomic region sets summarize functional genomics data and define locations of interest in the genome such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis.
RESULTS: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods based on interval unions or term frequency-inverse document frequency and evaluated the methods in three ways: first, by classifying the cell line, antibody, or tissue type of the region set; second, by assessing whether similarity among embeddings can reflect simulated random perturbations of genomic regions; and third, by testing robustness of the proposed representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than a hundred thousand to 100 without significant loss in classification performance. The vector representation could identify cell line, antibody, and tissue type with over 90% accuracy. We also found that the vectors could quantitatively summarize simulated random perturbations to region sets and are more robust to subsampling the data derived from different peak calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in relatively lower-dimensional spaces. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data.
AVAILABILITY: https://github.com/databio/regionset-embedding.
© The Author(s) (2021). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
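
To make the term frequency-inverse document frequency baseline named in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the fixed-width bin tiling (`BIN_SIZE`), the function names, the smoothed IDF formula, and the toy region sets are all assumptions made for illustration, and the word2vec-based method itself is not shown. Each region set is treated as a "document" whose "words" are the genome bins its intervals overlap.

```python
import math
from collections import Counter

BIN_SIZE = 1000  # hypothetical tiling width; the paper's region universe may differ


def tokenize(regions, bin_size=BIN_SIZE):
    """Map half-open (chrom, start, end) intervals to the fixed-width bins they overlap."""
    tokens = []
    for chrom, start, end in regions:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            tokens.append(f"{chrom}:{b}")
    return tokens


def tfidf_vectors(region_sets, bin_size=BIN_SIZE):
    """Return one sparse TF-IDF vector (dict of token -> weight) per region set."""
    docs = [Counter(tokenize(rs, bin_size)) for rs in region_sets]
    n_docs = len(docs)
    # Document frequency: in how many region sets each bin appears.
    doc_freq = Counter(tok for doc in docs for tok in doc)
    vectors = []
    for doc in docs:
        total = sum(doc.values())
        vectors.append({
            # Smoothed IDF (an assumed variant) keeps every weight positive.
            tok: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in doc.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[tok] for tok, w in a.items() if tok in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy region sets (invented): a and b fall in the same bins, c is on another chromosome.
set_a = [("chr1", 100, 900), ("chr1", 5000, 5600)]
set_b = [("chr1", 200, 800), ("chr1", 5100, 5500)]
set_c = [("chr2", 100, 900)]

vec_a, vec_b, vec_c = tfidf_vectors([set_a, set_b, set_c])
print(cosine(vec_a, vec_b), cosine(vec_a, vec_c))
```

In this toy setup, region sets that occupy the same bins get near-identical vectors and sets with no shared bins get a similarity of zero. Each vector has as many potential dimensions as there are bins in the genome, which is the high-dimensional representation the abstract contrasts with 100-dimensional embeddings.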

Year:  2021        PMID: 34156475      PMCID: PMC8652032          DOI: 10.1093/bioinformatics/btab439

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.931

