Duygu Dede Şener, Daniele Santoni, Giovanni Felici, Hasan Oğul.
Abstract
Finding similarities and differences between metagenomic samples within large repositories is a significant challenge for researchers. In recent years, various studies have proposed content-based retrieval from different perspectives. In this study, a content-based retrieval framework for identifying relevant metagenomic samples is developed. The framework consists of feature extraction and selection methods and similarity measures for whole-metagenome sequencing samples. The performance of the framework was evaluated on a given set of samples against a ground truth: a retrieved sample is labeled relevant (a positive sample) if it comes from a patient with the same disease as the query, and irrelevant otherwise. The experimental results show that relevant experiments can be detected using different fingerprinting approaches. We observed that Latent Semantic Analysis (LSA) is a promising fingerprinting approach for representing metagenomic samples and finding relevance among them. Source code and executable files are available at www.baskent.edu.tr/∼hogul/WMS_retrieval.rar.
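The retrieval idea described in the abstract can be illustrated with a minimal sketch: each sample is turned into a k-mer count vector, the counts are log-transformed, and database samples are ranked by Euclidean distance to the query. This is not the authors' implementation. The function names, the toy reads, and the choice of k = 3 are assumptions made for illustration, and the LSA and LDA fingerprinting steps are not shown.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(reads, k):
    """Count every overlapping k-mer across a sample's reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with ambiguous bases
                counts[kmer] += 1
    return counts


def log_vector(counts, vocabulary):
    """Map counts onto a fixed k-mer vocabulary and apply log(1 + x)."""
    return [math.log1p(counts.get(kmer, 0)) for kmer in vocabulary]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_samples(query_reads, database, k=3):
    """Return database sample names sorted by distance to the query."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    query_vec = log_vector(kmer_profile(query_reads, k), vocabulary)
    scored = []
    for name, reads in database.items():
        vec = log_vector(kmer_profile(reads, k), vocabulary)
        scored.append((euclidean(query_vec, vec), name))
    return [name for _, name in sorted(scored)]


# Toy, made-up reads purely for illustration (not data from the study).
database = {
    "sample_a": ["ACGTACGTAC", "CGTACGTACG"],
    "sample_b": ["TTTTTTTTTT", "TTTTGTTTTT"],
}
ranking = rank_samples(["ACGTACGTTC"], database)
```

With realistic data the vocabulary would be much larger (4^k entries), so a sparse representation and a feature selection step would likely be needed before computing distances.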
Keywords: Latent Dirichlet Allocation; Latent Semantic Analysis; Topic Model; Whole-metagenome; k-mer; sequence retrieval; sequence similarity
Year: 2018 PMID: 30367805 PMCID: PMC6348744 DOI: 10.1515/jib-2017-0067
Source DB: PubMed Journal: J Integr Bioinform ISSN: 1613-4516
Figure 1: An overview of the proposed retrieval framework.
Figure 2: LDA process steps in our framework.
Figure 3: MAP scores of the log-transformed and variance-stabilized Euclidean distances.
Figure 4: MAP scores of the LSA fingerprint extraction method for different d values.
Figure 5: MAP scores of the LDA fingerprint extraction method for different k values.
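The MAP scores reported in these figures summarize ranking quality over many queries. As a hedged sketch, assuming the standard definition of mean average precision (this record does not give the exact formula used in the study), the score can be computed from per-query relevance lists as follows. The relevance lists below are hypothetical.

```python
def average_precision(ranked_labels):
    """Average precision for one query; labels are True when relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-query average precisions."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Hypothetical relevance lists: True = same disease label as the query.
rankings = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2
    [False, True],        # AP = (1/2) / 1
]
map_score = mean_average_precision(rankings)
```

Each list is assumed to contain the full ranking of the database for one query, so every relevant sample appears somewhere in it.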
Top ten ranked 8-mers for latent topics generated by the LDA model.
| | Topic 1 | Topic 2 | Topic 3 | Topic 4 | Topic 5 |
|---|---|---|---|---|---|
| Seq1 | TCGGAAGA | AAAAAGAA | ATGAAGAA | ATGAAAAA | AAAAGAAA |
| Seq2 | CGGAAGAG | AAAAAGGA | GATGCTGA | AAAAGAAG | CAATGGCA |
| Seq3 | CCGATCTC | AATTTTTC | AAGAAGAA | TATCCGGA | CATCATCA |
| Seq4 | AAAAGAAA | AAAAGAAA | AAAAGAAA | CGGAAGAA | GGCATCAA |
| Seq5 | ATCGGAAG | GAAAAAGA | GATGGCAA | AAAGAAGA | CATTGCCA |
| Seq6 | AGAAAGAA | AGAAAAAG | CATCATCG | AAGAAGAA | AAAAATAA |
| Seq7 | GATCGGAA | TATGAAAA | GGCGATGA | CTTTTTCA | ATGCCATA |
| Seq8 | GAAAGAAA | CTTTTTCA | TATCATCA | ATGGAAAA | ACAAGCAA |
| Seq9 | GAAGGAAA | AAGAAAAA | GATGATGC | CCGGAAAA | CATCGACA |
| Seq10 | AGAAGAAA | ATGAAAAA | AGGAAGAA | TGGATGAA | AACAAAAA |
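A table like the one above is typically produced by sorting each topic's k-mer weights and keeping the highest-ranked entries. The sketch below assumes a topic is available as a mapping from k-mer to weight. The weights are made up for illustration and are not values from the study, and `top_kmers` is a hypothetical helper name.

```python
import heapq


def top_kmers(topic_weights, n=10):
    """Return the n k-mers with the largest weight in one topic."""
    return heapq.nlargest(n, topic_weights, key=topic_weights.get)


# Made-up weights for illustration only.
topic = {
    "TCGGAAGA": 0.05,
    "CGGAAGAG": 0.04,
    "AAAAGAAA": 0.02,
    "CCGATCTC": 0.03,
}
top_three = top_kmers(topic, n=3)
```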
Figure 6: Phylogenetic tree of the sequences in topic 1.
Figure 7: Comparison of the LSA and LDA fingerprint extraction methods with direct comparison using the log-transformed and variance-stabilized scores.