Erfaneh Gharavi (1,2), Aaron Gu (1,3), Guangtao Zheng (3), Jason P Smith (1,4), Hyun Jae Cho (1,3), Aidong Zhang (3), Donald E Brown (2), Nathan C Sheffield (1,5,6,4,2)
1. Center for Public Health Genomics, University of Virginia.
2. School of Data Science, University of Virginia.
3. Department of Computer Science, University of Virginia.
4. Department of Biochemistry and Molecular Genetics, University of Virginia.
5. Department of Public Health Sciences, University of Virginia.
6. Department of Biomedical Engineering, University of Virginia.
Abstract
MOTIVATION: Genomic region sets summarize functional genomics data and define locations of interest in the genome, such as regulatory regions or transcription factor binding sites. The number of publicly available region sets has increased dramatically, leading to challenges in data analysis.

RESULTS: We propose a new method to represent genomic region sets as vectors, or embeddings, using an adapted word2vec approach. We compared our approach to two simpler methods, one based on interval unions and one on term frequency-inverse document frequency (TF-IDF), and evaluated the methods in three ways: first, by classifying the cell line, antibody, or tissue type of the region set; second, by assessing whether similarity among embeddings reflects simulated random perturbations of genomic regions; and third, by testing the robustness of the representations to different signal thresholds for calling peaks. Our word2vec-based region set embeddings reduce dimensionality from more than 100,000 to 100 without significant loss in classification performance. The vector representation identified cell line, antibody, and tissue type with over 90% accuracy. We also found that the vectors quantitatively summarize simulated random perturbations to region sets and are more robust to the subsampling of data that results from different peak-calling thresholds. Our evaluations demonstrate that the vectors retain useful biological information in a relatively low-dimensional space. We propose that vector representation of region sets is a promising approach for efficient analysis of genomic region data.

AVAILABILITY: https://github.com/databio/regionset-embedding.
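As an illustration of the simpler TF-IDF baseline mentioned above, the following is a minimal sketch rather than the authors' implementation. It assumes that each region set is first mapped onto a fixed universe of reference intervals, so that every universe interval acts as a "term" and every region set as a "document". The universe, the toy region sets, the helper names, and the smoothed IDF formula are all assumptions made for illustration; the abstract does not specify them.

```python
import math
from collections import Counter

def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def tokenize(region_set, universe):
    """Indices of the universe intervals that the region set overlaps."""
    return [i for i, u in enumerate(universe)
            if any(overlaps(r, u) for r in region_set)]

def tfidf_vectors(token_lists, n_terms):
    """One TF-IDF vector per region set, using a smoothed IDF (an assumption)."""
    n_docs = len(token_lists)
    # Document frequency: number of region sets that contain each token.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        vec = [0.0] * n_terms
        for t, count in Counter(tokens).items():
            idf = math.log((1 + n_docs) / (1 + df[t])) + 1.0
            vec[t] = count * idf
        vectors.append(vec)
    return vectors

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical universe and region sets (toy coordinates, not real data).
universe = [("chr1", 0, 100), ("chr1", 200, 300),
            ("chr2", 0, 100), ("chr2", 500, 600)]
set_a = [("chr1", 10, 50), ("chr1", 210, 250)]
set_b = [("chr1", 20, 80), ("chr1", 220, 290), ("chr2", 10, 20)]
set_c = [("chr2", 510, 590)]

tokens = [tokenize(s, universe) for s in (set_a, set_b, set_c)]
vec_a, vec_b, vec_c = tfidf_vectors(tokens, len(universe))

sim_ab = cosine(vec_a, vec_b)  # the two sets share universe intervals
sim_ac = cosine(vec_a, vec_c)  # the two sets share none
```

In this sketch each vector has one dimension per universe interval, which is why such representations become very high-dimensional on real data; the word2vec-based embeddings described above aim to compress that space to about 100 dimensions.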