Hongyu Chen, Bronwen Martin, Caitlin M Daimon, Stuart Maudsley.
Abstract
Text mining is rapidly becoming an essential technique for the annotation and analysis of large biological data sets. Biomedical literature currently grows at a rate of several thousand papers per week, making automated information retrieval the only feasible way of managing this expanding corpus. With the increasing prevalence of open-access journals and the constant growth of publicly available repositories of biomedical literature, literature mining has become much more effective at extracting biomedically relevant data. In recent years, text mining of popular databases such as MEDLINE has evolved from basic term searches to more sophisticated natural language processing techniques, indexing and retrieval methods, structural analysis, and integration of literature with associated metadata. In this review, we focus on Latent Semantic Indexing (LSI), a computational linguistics technique increasingly used for a variety of biological purposes. It is noted for its ability to consistently outperform benchmark Boolean text searches and co-occurrence models at information retrieval, and for its power to extract indirect relationships within a data set. LSI has been used successfully to formulate new hypotheses, generate novel connections from existing data, and validate empirical data.
Keywords: computational linguistics; data mining; drug discovery; latent semantic indexing; molecular interactions
Year: 2013 PMID: 23386833 PMCID: PMC3558626 DOI: 10.3389/fphys.2013.00008
Source DB: PubMed Journal: Front Physiol ISSN: 1664-042X Impact factor: 4.566
Figure 1. (A) An example of a term-document matrix with a weighting function (tf-idf). M, D, and T refer to the term-document matrix, the set of all documents in the corpus, and the set of all terms in the corpus, respectively. T1 is an example of a common word that occurs frequently in documents, whereas T3, T4, and T6 are comparatively rarer words and receive a higher weight. (B) An illustration of the dimensionality-reduction step of LSI. U, Σ, and V^T are truncated and become U_k, Σ_k, and V^T_k, respectively. C, D, and T refer to the sets of LSI topics, documents, and terms, respectively. Here, we illustrate a reduction to three dimensions.
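The two steps described in Figure 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes NumPy is available, uses the common idf form log(N/df) (the review does not say which tf-idf variant is used), and runs on a made-up count matrix. The function names `tfidf` and `lsi` are hypothetical.

```python
import numpy as np

def tfidf(counts):
    """Weight a term-document count matrix (terms x documents) by tf-idf."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)      # number of documents containing each term
    idf = np.log(n_docs / np.maximum(df, 1))   # assumed variant: log(N / df)
    return counts * idf[:, np.newaxis]

def lsi(M, k):
    """Rank-k truncated SVD, so that M is approximated by U_k @ diag(s_k) @ Vt_k."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Toy corpus: 6 terms (rows T1..T6) x 5 documents (columns D1..D5).
# The counts are invented purely for illustration.
counts = np.array([
    [3, 2, 4, 1, 2],   # T1: common term, present in every document
    [1, 0, 1, 0, 1],   # T2
    [0, 2, 0, 0, 0],   # T3: rare term
    [0, 0, 0, 1, 0],   # T4: rare term
    [1, 1, 0, 0, 1],   # T5
    [0, 0, 3, 0, 0],   # T6: rare term
])

M = tfidf(counts)
U_k, s_k, Vt_k = lsi(M, k=3)              # reduction to three dimensions, as in panel B
doc_vectors = (np.diag(s_k) @ Vt_k).T     # one 3-dimensional topic vector per document
```

With this idf variant, a term that appears in every document (T1 here) receives a weight of zero, while terms confined to a single document receive the largest weights. Documents can then be compared by the cosine similarity of their rows in `doc_vectors` rather than by shared raw terms.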
Figure 2. An illustration of a “latent” link. The principle of Swanson discovery is analogous: two currently disjoint sets of literature, A and C, are bridged by introducing an intermediate literature set, B.
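The bridging idea in Figure 2 can be illustrated with a small sketch. It uses plain term co-occurrence rather than LSI, and everything in it is an assumption made for illustration: the function names `cooccurring` and `swanson_bridges` are hypothetical, and the documents are placeholder term sets rather than real literature.

```python
def cooccurring(term, documents):
    """Return every term that shares at least one document with `term`."""
    found = set()
    for doc in documents:
        if term in doc:
            found |= doc
    found.discard(term)
    return found

def swanson_bridges(a, c, documents):
    """Return intermediate terms B that link a and c when a and c never co-occur."""
    if any(a in doc and c in doc for doc in documents):
        return set()  # the link is already explicit, so nothing latent remains
    return cooccurring(a, documents) & cooccurring(c, documents)

# Placeholder corpus: each document is reduced to the set of terms it mentions.
documents = [
    {"A", "B1"},   # literature on A
    {"A", "B2"},
    {"B1", "C"},   # literature on C
    {"B3", "C"},
]

bridges = swanson_bridges("A", "C", documents)
print(bridges)  # {'B1'}
```

Here no document mentions both A and C, yet B1 appears with each of them, so B1 is returned as the candidate bridge. LSI targets the same kind of indirect relationship, but works from similarity in the reduced topic space instead of exact shared terms.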