Lin Liu, Lin Tang, Wen Dong, Shaowen Yao, Wei Zhou.
Abstract
BACKGROUND: With the rapid accumulation of biological datasets, machine learning methods designed to automate data analysis are urgently needed. In recent years, so-called topic models that originated from the field of natural language processing have been receiving much attention in bioinformatics because of their interpretability. Our aim was to review the application and development of topic models for bioinformatics.

DESCRIPTION: This paper starts with the description of a topic model, with a focus on the understanding of topic modeling. A general outline is provided on how to build an application in a topic model and how to develop a topic model. Meanwhile, the literature on application of topic models to biological data was searched and analyzed in depth. According to the types of models and the analogy between the concept of document-topic-word and a biological object (as well as the tasks of a topic model), we categorized the related studies and provided an outlook on the use of topic models for the development of bioinformatics applications.
Keywords: Bioinformatics; Classification; Clustering; Probabilistic generative model; Topic model
Year: 2016 PMID: 27652181 PMCID: PMC5028368 DOI: 10.1186/s40064-016-3252-8
Source DB: PubMed Journal: Springerplus ISSN: 2193-1801
Fig. 1 The diagram of topic modeling
An example of a BoW
| Word | d1 | d2 | d3 | d4 | d5 | d6 |
|---|---|---|---|---|---|---|
| Gene | 2 | 0 | 3 | 0 | 0 | 0 |
| Protein | 0 | 5 | 0 | 0 | 0 | 0 |
| Pathway | 1 | 2 | 0 | 0 | 0 | 0 |
| Microarray | 0 | 0 | 3 | 6 | 0 | 0 |
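A matrix like the one above can be built by counting how often each vocabulary word occurs in each document. The following is a minimal sketch, not the authors' code; the function name `bag_of_words` and the toy documents are hypothetical and chosen only for illustration.

```python
from collections import Counter


def bag_of_words(documents, vocabulary):
    """Return {word: [count of word in each document]} for a fixed vocabulary.

    Tokenization here is a plain lowercase whitespace split, which is an
    assumption made for brevity; real pipelines usually do more.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    return {word: [c[word.lower()] for c in counts] for word in vocabulary}


# Hypothetical toy documents (not taken from the paper).
docs = [
    "gene pathway gene",
    "protein protein pathway",
]
vocab = ["Gene", "Protein", "Pathway"]

bow = bag_of_words(docs, vocab)
for word, row in bow.items():
    print(word, row)
```

Each row of the result corresponds to a word and each column to a document, which mirrors the layout of the table. Word order inside a document is discarded, which is the defining simplification of a BoW representation.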
The top five most frequent words from three topics
| Topics | Protein | Cancer | Computation |
|---|---|---|---|
| Words | Protein | Tumor | Computer |
| | Cell | Cancer | Model |
| | Gene | Diseases | Algorithm |
| | DNA | Death | Data |
| | Polypeptide | Medical | Mathematical |
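A "top words" listing of this kind is typically read off a fitted topic model by sorting each topic's word weights and keeping the largest few. The sketch below assumes the topic-word weights are already available as plain dictionaries; the helper `top_words` and all numbers are hypothetical illustrations, not values from the paper.

```python
def top_words(topic_word_probs, n=5):
    """Return the n highest-weight words for each topic, best first."""
    return {
        topic: [
            word
            for word, _ in sorted(
                probs.items(), key=lambda item: item[1], reverse=True
            )[:n]
        ]
        for topic, probs in topic_word_probs.items()
    }


# Hypothetical topic-word weights (made up for illustration only).
example = {
    "Protein": {"gene": 0.10, "protein": 0.30, "cell": 0.20, "dna": 0.05},
    "Cancer": {"cancer": 0.25, "tumor": 0.35, "death": 0.08},
}

print(top_words(example, n=2))
```

The topic labels themselves ("Protein", "Cancer", "Computation") are not produced by the model; they are usually assigned by a person after inspecting the top words.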
Fig. 2 The topic distribution of a document
Fig. 3 The graphical model of PLSA
Fig. 4 The graphical model of LDA
Fig. 5 The tasks of a topic model in bioinformatics
A summary of topic model types in the relevant studies (see “Topic models applied to bioinformatics” section)
| References | Types of topic model |
|---|---|
| Castellani et al. ( | PLSA |
| Caldas et al. ( | LDA |
| Rogers et al. ( | LPD |
| Liu et al. ( | Corr-LDA |
| Sinkkonen et al. ( | topic model for relational data |
| Perina et al. ( | BaLDA |
| Dawson and Kendziorski ( | survLDA |
| Fang et al. ( | Semi-parametric transelliptical topic model |
| Chen et al. ( | LDA-B |
A summary of the analogies between document-topic-word and a biological object in the relevant studies (see “‘Document-word-topic’ in biological data” section)
| Reference | Words | Topics | Documents | Biological dataset |
|---|---|---|---|---|
| Rogers et al. ( | Genes | Functional groups | Samples | Expression microarray data |
| Masseroli et al. ( | Ontological terms | Latent relationship | Proteins | Protein annotations |
| Chen et al. ( | K-mers of DNA sequences | Taxonomic category/components of the whole genome | DNA sequences | Genomic sequences |
| Caldas et al. ( | Gene sets | Biological process | Experiments | Gene expression dataset |
| Coelho et al. ( | Object classes | Fundamental patterns | Images | Fluorescence images |
| Konietzny et al. ( | A fixed-sized vocabulary of words based on the gene annotations | Functional modules of protein families | Genome annotations | A set of genome annotations |
| Bisgin et al. ( | Endpoint measurements | Diagnostic topics | Drugs | Expression of the HCS endpoints |
| Chen et al. ( | Functional elements (NCBI taxonomic level indicators, indicator of gene orthologous groups and KEGG pathway indicators) | Functional groups | Samples | Genome set |
| Pan et al. ( | Local sequential features | Latent topic features | Protein sequences | Protein–protein interaction dataset |
| Castellani et al. ( | Shape descriptors | Brain surface geometric patterns | Images | Magnetic resonance images |
| Pratanwanich and Lio ( | Genes | Pathways | Gene expression profiles | Gene expression data |
| Dawson and Kendziorski ( | Clinical events, treatment protocols, and genomic information from multiple sources | The category of patients | Patients | Patient’s text constructed from clinical and multidimensional genomic analyses |
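To make one of these analogies concrete, consider the row that treats DNA sequences as documents and k-mers as words. A sequence can be turned into "word" counts by sliding a window of length k along it. This is only a sketch of that mapping under simple assumptions (overlapping windows, uppercase normalization); the function name `kmer_counts` and the example sequence are hypothetical, and the cited study may have preprocessed sequences differently.

```python
from collections import Counter


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers in a DNA sequence (the 'words' of the 'document')."""
    seq = sequence.upper()
    # A sequence shorter than k yields an empty range and thus no k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical short sequence, for illustration only.
counts = kmer_counts("ACGTACG", k=3)
print(counts)
```

The resulting counts play the same role as a column of a BoW matrix, so a collection of sequences processed this way can be passed to a topic model in the same form as a text corpus.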