Eugenia Galeota, Mattia Pelizzola.
Abstract
Public repositories of large-scale biological data currently contain hundreds of thousands of experiments, including high-throughput sequencing and microarray data. The potential of using these resources to assemble data sets combining samples previously not associated is largely unexplored. This requires the ability to associate samples with clear annotations and to relate experiments matched with different annotation terms. In this study, we illustrate the semantic annotation of Gene Expression Omnibus sample metadata using concepts from biomedical ontologies, focusing on the association of thousands of chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) samples with a given target, tissue and disease state. Next, we demonstrate the feasibility of quantitatively measuring the semantic similarity between different samples, with the aim of combining experiments associated with the same or similar semantic annotations, thus allowing the generation of large data sets without the need for additional experiments. We compared tools based on the Unified Medical Language System with tools that use topic-specific ontologies, showing that the second approach outperforms the first both in the annotation process and in the computation of semantic similarity measures. Finally, we demonstrated the potential of this approach by identifying semantically homogeneous groups of ChIP-seq samples targeting the Myc transcription factor, and expanding this data set with semantically coherent epigenetic samples. The semantic information of these data sets proved to be coherent with the ChIP-seq signal and with the current knowledge about this transcription factor.
Keywords: epigenetics; high-throughput sequencing; natural language processing; semantic annotation; semantic similarity; transcription factor
Year: 2017 PMID: 27142216 PMCID: PMC5429001 DOI: 10.1093/bib/bbw036
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Figure 1. Word cloud with the most frequent target, tissue and disease terms. (A) The 50 most frequent targets for the considered ChIP-seq experiments, excluding the control samples; target size and color shade are proportional to the number of samples. (B) As in (A) for the 50 most frequent disease terms identified from Metamap. (C) As in (B) for the tissue terms.
Evaluation of Metamap (MM) and Conceptmapper (CM) semantic annotations for tissue and disease terms
| Annotation set | TP | FP | TN | FN | Sensitivity | Specificity | PPV | NPV | F-score |
|---|---|---|---|---|---|---|---|---|---|
| MM full tissues | 267 | 257 | 2 | 82 | 0.76 | 0.01 | 0.67 | 0.05 | 0.61 |
| MM keyw tissues | 118 | 57 | 6 | 107 | 0.52 | 0.65 | 0.67 | 0.05 | 0.59 |
| MM full diseases | 106 | 112 | 113 | 1 | 0.99 | 0.5 | 0.49 | 0.99 | 0.65 |
| MM keyw diseases | 62 | 36 | 73 | 63 | 0.5 | 0.67 | 0.63 | 0.54 | 0.56 |
| CM tissues | 312 | 33 | 15 | 34 | 0.9 | 0.31 | 0.9 | 0.31 | 0.9 |
| CM diseases | 73 | 19 | 141 | 17 | 0.81 | 0.88 | 0.79 | 0.89 | 0.8 |
Note. For Metamap, the evaluation was performed using as input either the entire sentences (MM full) or only the parts of them matching keywords specific to the topic of interest (MM keyw).
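The relation between the counts and the ratios in the evaluation table can be sketched in a few lines of Python. This is a minimal illustration, assuming the nine numeric columns are, in order, true positives, false positives, true negatives, false negatives, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV) and F-score. That reading is inferred from the counts rather than quoted from the authors: the CM rows and the MM disease rows reproduce under it, while some printed values in the MM tissue rows do not, so it should be treated as tentative. The function name `confusion_metrics` is hypothetical.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive standard binary-classification metrics from raw counts.

    No guard against empty denominators is included; this is a sketch
    for rows where every count combination is non-zero.
    """
    sensitivity = tp / (tp + fn)  # recall: expected terms that were found
    specificity = tn / (tn + fp)  # non-terms correctly left unannotated
    ppv = tp / (tp + fp)          # precision: found terms that are correct
    npv = tn / (tn + fn)
    # F-score as the harmonic mean of precision and recall.
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f_score": f_score,
    }


# Counts copied from the "CM diseases" row of the table.
cm_diseases = confusion_metrics(tp=73, fp=19, tn=141, fn=17)
rounded = {name: round(value, 2) for name, value in cm_diseases.items()}
print(rounded)
```

Rounded to two decimals, the result matches the ratios printed for the CM diseases row (0.81, 0.88, 0.79, 0.89, 0.8).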
Figure 2. Matrix of pair-wise cophenetic correlations between the CUIs tissue dendrograms obtained with the indicated semantic similarity measures. The two groups of semantic similarity metrics discussed in the main text are highlighted.
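A cophenetic correlation between two dendrograms, as in the matrix described in this caption, can be sketched in plain Python. The sketch makes several assumptions that are not taken from the paper: average linkage is used for the clustering, distances are toy values standing in for 1 minus a semantic similarity, and the function names are hypothetical. In practice, `scipy.cluster.hierarchy.linkage` and `scipy.cluster.hierarchy.cophenet` provide the same building blocks.

```python
from itertools import combinations
from math import sqrt


def cophenetic_distances(dist):
    """Average-linkage clustering of a symmetric distance matrix.

    Returns {(i, j): height} with i < j, where height is the linkage
    distance at which items i and j first join the same cluster.
    """
    clusters = [[i] for i in range(len(dist))]
    coph = {}
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            d = sum(dist[i][j] for i, j in pairs) / len(pairs)
            if best is None or d < best[0]:
                best = (d, a, b)
        height, a, b = best
        # Every cross-cluster pair is joined for the first time at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = height
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return coph


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cophenetic_correlation(dist_a, dist_b):
    """Correlate the cophenetic distances of two dendrograms over the same items."""
    ca, cb = cophenetic_distances(dist_a), cophenetic_distances(dist_b)
    keys = sorted(ca)
    return pearson([ca[k] for k in keys], [cb[k] for k in keys])


# Toy distances between four terms under three hypothetical measures.
d1 = [[0, 0.2, 0.8, 0.9], [0.2, 0, 0.7, 0.8], [0.8, 0.7, 0, 0.3], [0.9, 0.8, 0.3, 0]]
d2 = [[0, 0.1, 0.6, 0.7], [0.1, 0, 0.6, 0.6], [0.6, 0.6, 0, 0.2], [0.7, 0.6, 0.2, 0]]
d3 = [[0, 0.8, 0.2, 0.9], [0.8, 0, 0.9, 0.3], [0.2, 0.9, 0, 0.8], [0.9, 0.3, 0.8, 0]]

print(cophenetic_correlation(d1, d2))  # same grouping of terms: close to 1
print(cophenetic_correlation(d1, d3))  # different grouping: much lower
```

Computing this value for every pair of similarity measures gives a symmetric matrix, in which measures that produce similar dendrograms appear as blocks of high correlation.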
Figure 3. Agreement between semantic similarity and ChIP-seq signal. (A) Hierarchical clustering of tissue annotations for selected ChIP-seq samples targeting the Myc TF, based on the Intrinsic Lin semantic similarity. (B) Heatmap showing the percentage of peaks shared by the samples on the rows with the samples on each column. The colorbars on the left of the heatmap denote samples having identical GEO Series id (GSE), tissue or Disease id, according to the legend. (C) Percentage of overlap of peaks of ChIP-seq samples targeting Myc, H3K4me3, H3K27ac and Pol2. A first set of samples (upper left block) was identified for these marks in a B-lymphoma cell line (tissue) and B-cell lymphoma (disease). Additional samples from different studies were added by relaxing the semantic similarity threshold to include samples for the same marks that are associated with similar tissues and diseases. For each block, the average overlap is reported, excluding same-to-same overlaps, and values in bold are discussed in the text.
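The row-versus-column percentage of shared peaks described for panel (B) can be illustrated with a short Python sketch. The overlap criterion used by the authors is not stated here; the sketch assumes that a peak counts as shared when it overlaps at least one peak of the other sample by one base or more on the same chromosome. Peaks are represented as hypothetical `(chrom, start, end)` tuples with half-open coordinates, and the function names and coordinates are illustrative only.

```python
def overlaps(p, q):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return p[0] == q[0] and p[1] < q[2] and q[1] < p[2]


def percent_shared(row_peaks, col_peaks):
    """Percentage of row_peaks that overlap at least one peak in col_peaks."""
    if not row_peaks:
        return 0.0
    shared = sum(1 for p in row_peaks if any(overlaps(p, q) for q in col_peaks))
    return 100.0 * shared / len(row_peaks)


def overlap_matrix(samples):
    """Nested dict: matrix[row][col] = percent of row's peaks shared with col."""
    return {
        row: {col: percent_shared(samples[row], samples[col]) for col in samples}
        for row in samples
    }


# Toy peak sets for two hypothetical samples.
samples = {
    "sample_a": [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150), ("chr2", 900, 1000)],
    "sample_b": [("chr1", 150, 250), ("chr2", 100, 120)],
}

matrix = overlap_matrix(samples)
print(matrix["sample_a"]["sample_b"])  # 2 of 4 peaks shared
print(matrix["sample_b"]["sample_a"])  # 2 of 2 peaks shared
```

The measure is not symmetric, because each row is normalised by its own number of peaks; this is why such a heatmap is read row by row. The nested scan is quadratic in the number of peaks, so real peak sets would call for sorted intervals or a dedicated tool.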