Recent Advances and Emerging Applications in Text and Data Mining for Biomedical Discovery.

Graciela H Gonzalez, Tasnia Tahsin, Britton C Goodale, Anna C Greene, Casey S Greene.

Abstract

Precision medicine will revolutionize the way we treat and prevent disease. A major barrier to the implementation of precision medicine that clinicians and translational scientists face is understanding the underlying mechanisms of disease. We are starting to address this challenge through automatic approaches for information extraction, representation and analysis. Recent advances in text and data mining have been applied to a broad spectrum of key biomedical questions in genomics, pharmacogenomics and other fields. We present an overview of the fundamental methods for text and data mining, as well as recent advances and emerging applications toward precision medicine.

Entities: Chemical Disease Gene Species

Keywords: biomedical discovery; data mining; gene prioritization; pharmacogenomics; text mining; toxicology

Year: 2015 PMID: 26420781 PMCID: PMC4719073 DOI: 10.1093/bib/bbv087

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

Introduction

Technologies that resulted in the successful completion of the Human Genome project and those that have followed it afford an unprecedented breadth of data collection avenues (whole-genome expression data, chip-based comparative genomic hybridization and proteomics of signal transduction pathways, among many others) and have resulted in exceptional opportunities to advance the understanding of the genetic basis of human disease. However, high-throughput results are usually only the first step in a long discovery process, with subsequent and much more time-consuming experiments that, in the best of cases, culminate in the publication of results in journals and conference proceedings. Rather than stopping at the publication stage, the challenge for precision medicine is then to translate all of these research results into better treatments and improved health. To achieve this goal, a range of analytic methods and computational approaches have evolved from other domains and have been applied to an ever-growing set of specific problem areas. It would be impossible to enumerate the numerous biological questions targeted by computational approaches. We will focus here on an overview of text and data mining methods and their applications to discovery in a broad range of biomedical areas, including biological pathway extraction and reasoning, gene prioritization, precision medicine, pharmacogenomics and toxicology. The advances are plenty and the specific areas of application diverse, but the fundamental motivation is to aid scientists in analyzing available data to suggest a road to discovery, to precise predictions that lead to better health.

Background

Data mining

Data mining is the act of computationally extracting new information from large amounts of data [1], and the biological sciences are generating enormous quantities of data, ushering in the era of ‘big data’. Stephens et al. state that sequencing data alone constitutes ∼35 petabases/year and will grow to ∼1 zettabase/year by 2025 [2]. This creates a large opportunity for the development and deployment of novel mining algorithms; two recent reviews on data and text mining in the era of big data are found in Che et al. [3] and Herland et al. [4]. A wide variety of methods for extracting value from different types and models of data fall under the umbrella of ‘data mining’. Classification algorithms (decision trees, naïve Bayesian classification and other classifiers), frequent pattern algorithms (association rule mining, sequential pattern mining and others), clustering algorithms (including methods to cluster continuous and categorical data) and graph and network algorithms have all evolved to present a diverse landscape for research and an arsenal to deploy against the toughest data challenges. Most researchers consider some other areas, including text mining, as being under the data mining umbrella. For example, Piatetsky-Shapiro states: ‘Data Mining in my opinion includes: text mining, image mining, web mining, predictive analytics, and much of the techniques we use for dealing with massive data sets, now known as Big Data’ [5]. The methods applied to text mining, however, are specialized to such a degree that it is common to view it as a separate area of specialty. Data mining courses do not usually include any text mining material, but rather there are separate courses dedicated to it, and the same applies to textbooks. A complete coverage of data mining techniques is beyond the scope of this article, though we have included some important resources that cover this topic.
Kernel Methods in Computational Biology by Schölkopf, Tsuda and Vert [6] covers methods specific to Computational Biology. Introduction to Data Mining [7] and Data Mining: Concepts and Techniques, 3rd edn [8] are two popular textbooks in data mining and give an excellent overview of the field. A more concise presentation can be found in the paper by Xindong Wu et al., ‘Top 10 algorithms in data mining’ [9]. The ten algorithms, identified in December 2006, are C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes and CART; they cover clustering, classification and association analysis, which are among the most important topics in data mining research.
According to Jain et al. in ‘Data clustering: a review’, ‘Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters)’ [10]. Classification is akin to clustering because it segments data into groups called classes, but unlike clustering, classification analyses require knowledge and specification of how classes are defined.
According to Bousquet et al., statistical learning theory seeks ‘to provide a framework for studying the problem of inference, that is, of gaining knowledge, making predictions, making decisions or constructing models from a set of data’ [11]. A textbook on statistical learning expands on these notions [12].
Association analysis facilitates the unmasking of hidden relationships in large data sets. The discovered associations are then expressed as rules or sets of items that frequently occur together. Challenges for association analysis methods include the computational expense of discovering such patterns in a large input data set and the potential for many spurious associations to be ‘discovered’ that simply occur by chance. A well-known introduction to the topic is found in [13], and in particular, a seminal paper on mining association rules from clinical databases is found in Stilou et al. [14].
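The notions of frequently co-occurring items and chance associations can be illustrated with a minimal sketch. The toy records, the pair_rules helper and the min_support threshold are illustrative assumptions, not taken from the cited works; lift is included here as one common way to flag rules that are no stronger than chance (lift close to 1).

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.3):
    """Score every item pair by support, confidence and lift (hypothetical helper)."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of records containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]      # P(rhs | lhs)
            lift = confidence / (item_counts[rhs] / n)  # 1.0 means independence
            rules.append((lhs, rhs, support, confidence, lift))
    return sorted(rules, key=lambda r: r[4], reverse=True)


# Invented toy records, for illustration only.
records = [
    {"diabetes", "hypertension", "statin"},
    {"diabetes", "hypertension"},
    {"hypertension", "statin"},
    {"diabetes", "hypertension", "statin"},
    {"asthma"},
]
rules = pair_rules(records)
for lhs, rhs, support, confidence, lift in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

In this toy data every record containing diabetes also contains hypertension, so that rule has confidence 1.0, yet its lift of 1.25 shows the association is only modestly above what the item frequencies alone would predict. Enumerating all pairs is also where the computational cost grows quickly as the number of distinct items increases.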
Link analysis analyzes hyperlinks and the graph structure of the Web for the ranking of web search results. PageRank is perhaps the best-known algorithm for link analysis [15]. In a notable transition showing the power of new algorithms and data, data mining approaches are now being used to learn not just the primary features but also context-specific features. For example, initial data mining approaches that constructed gene–gene networks built a single network [16]. In contrast, recent approaches learn multiple context-specific networks, allowing the construction of process-specific [17] and tissue-specific networks [18-20]. An individual is made up of a personalized combination of such context-specific networks, so we anticipate that continued advances in the context specificity of data mining approaches will play an important role in the broad implementation of precision medicine.

Text mining

Text mining is a subfield of data mining that seeks to extract valuable new information from unstructured (or semi-structured) sources [21]. Text mining extracts information from within those documents and aggregates the extracted pieces over the entire collection of source documents to uncover or derive new information. This is the preferred view of the field that allows one to distinguish text mining from natural language processing (NLP) [22, 23]. Thus, given as input a set of documents, text mining methods seek to discover novel patterns, relationships and trends contained within the documents. Aiding the overall goal of discovering new information are NLP programs that go from the relatively simple text processing tasks at the lexical or grammatical levels (such as a tokenizer or a part-of-speech tagger), to relatively complex information extraction algorithms [like named entity recognition (NER) to find concepts such as genes or diseases, normalization to map them to their unique identifiers or relationship extraction and sentiment analysis systems, among others]. The greater the complexity of the task, the more likely it is to integrate methods from data mining (such as classification or statistical learning). Although there is no current textbook that can be considered the definitive guide on text mining as defined above, there are a couple of classic textbooks that cover fundamental NLP techniques and at least the first covers some of the analytics required to discover information: Speech and Language Processing by Jurafsky and Martin [24] and Foundations of Statistical Natural Language Processing by Manning and Schuetze [25]. The biomedical domain is one of the most interesting application areas for text mining, given both the potential impact of the information that can be discovered and the specific characteristics and volume of information available.
The textbook Text mining for biology and medicine [26] offers an overview of the fundamental approaches to biomedical NLP, emphasizing different sub-areas in each chapter, although overall it does not totally adhere to the definition of text mining as a means for discovery given by Hearst [23]. A good non-textbook review of the different sub-areas is the article ‘Frontiers of biomedical text mining: current progress’ [27]. For those just starting in the area, the article ‘Getting Started in Text Mining’ [28] is a good starting point. A more in-depth treatment of automated techniques applied to the biomedical literature and its contribution to innovative biomedical research can be found in ‘Text-mining solutions for biomedical research: enabling integrative biology’ [29]. Text mining sub-areas, briefly summarized, include the following.
Information Retrieval deals with the problem of finding relevant documents in response to a specific information need (query). An overview of tools for information retrieval from the biomedical literature can be found in [30].
NER is at the core of the automatic extraction of information from text and deals with the problem of finding references to entities (mentions) such as genes, drugs and diseases present in natural language text and tagging them with their location and type. NER is also referred to as ‘entity tagging’ or ‘concept extraction’. This is a basic building block for almost all other extraction tasks. NER in the biomedical domain is generally considered to be more difficult than in other domains, such as geography or news reports. This is owing to inconsistency in how known entities, such as symptoms or drugs, are named (e.g. nonstandard abbreviations and new ways of referring to them). An open-source NER engine, BANNER [31], is currently available with models to recognize genes and diseases mentioned in biomedical text, and LINNAEUS is available for species [32]. Rebholz-Schuhmann et al. [33] present an overview of the NER solutions for the second CALBC task, including protein, disease, chemical (drug) and species entities. Campos et al. [34] present a recent survey of tools for biomedical NER. A system assigning text to a wide range of semantic classes using linguistic rules is presented in [35], illustrating a task slightly different from standard NER because classes potentially overlap. Verspoor et al. [36] use the CRAFT corpus to improve the evaluation of gene NER (and some lower-level tasks like part-of-speech tagging and sentence segmentation). Recent work in [37] presents an NER system for extracting gene and protein sequence variants from the biomedical literature. For locating chemical compounds, Krallinger et al. [38] summarize the task that was part of BioCreative IV and give a short overview of some of the techniques used.
Named Entity Identification allows the linkage of objects of interest, such as genes, to information that is not detailed in a publication (such as their Entrez Gene identifier) [39]. Two open-source systems using largely dictionary-based approaches to normalize gene names appear in [39-41]. For normalizing disease names, [42] introduces DNorm, a new normalization framework using machine learning, with strong results.
Association extraction is one of the higher-level tasks still considered purely an information extraction application. It uses the output from the prior subtasks to produce a list of (binary or higher) associations among the different entities of interest. Catalysts for advances in this area have been the BioCreative and BioNLP shared tasks, with excellent teams from around the world putting their systems to the test against carefully annotated data sets. A survey of submissions to BioCreative III [43] and BioNLP [44, 45] shows a good overview of approaches responsive to the respective shared tasks.
Putting together associations into networks of molecular interactions that can explain complex biological processes is the next logical step, and one that still is considered the ‘holy grail’ of automatic biomolecular extraction. Ananiadou et al. [46] and Li et al. [47] present comprehensive surveys of methods for the extraction of network information from the scientific literature and the evaluation of extraction methods against reference corpora. Semantic-based approaches such as [48] will make their mark in the coming years.
Event extraction is similar to association extraction, but instead of separately extracting various relations between different entities in text, this task focuses on identifying specific events and the various players involved in them (arguments). For instance, the arguments of a transport event will include the molecule being transported, the cell to which it is being transported and the cell from which it is being transported. Event extraction was a key component of the BioNLP Shared Tasks in both 2011 [45] and 2013 [49], challenging the biomedical community to expand and cultivate their approaches in this area and leading to steadily improving results.
Pathway extraction is a budding branch of biomedical text mining closely following the footsteps of event extraction. It involves the automated construction of biological pathways through the extraction and ordering of pathway-related events from text. Although the majority of researchers in this domain, like [50] and [51], have been focusing their efforts on supporting pathway curation through event extraction rather than entirely automating the process, Tari et al. were able to achieve promising results for the automated synthesis of pharmacokinetic pathways by applying an automated reasoning-based approach for event ordering [52].
The first shared task on Pathway Curation was organized by BioNLP in 2013 [49] to establish the current state-of-the-art performance level for extracting pathway-relevant events such as phosphorylation and transport. In the end, a set of the different subtask solutions are used in a pipeline that allows information to be integrated and analyzed toward knowledge discovery. However, this multiplies the effects of errors down the pipeline, leaving systems highly vulnerable. An overarching challenge for biomedical text mining is to incorporate the many knowledge resources that are available to us into the NLP pipeline. In the biomedical domain, unlike the general text mining domain, we have access to large numbers of extensive, well-curated ontologies and knowledge bases. Biomedical ontologies provide an explicit characterization of a given domain of interest. The quality of data mining efforts would likely increase if existing ontologies (e.g. UMLS [53] and BioPortal [54]) were used as sources of terms in building lexicons, for figuring out what concept subsumes another, and as a way of normalizing alternative names to one identifier. For example, using ontologies as described enabled the use of unstructured clinical notes for generating practice-based evidence on the safety of a highly effective, generic drug for peripheral vascular disease [55]. Today, the data being generated is massive, complex and increasingly diverse owing to recent technological innovations. However, the impact of this data revolution on our lives is hampered by the limited amount of data that has been analyzed. This necessitates data mining tools and methods that can match the scale of the data and support timely decision-making through integration of multiple heterogeneous data sources. Finally, another area in which the field has fallen short is that of making text mining applications that are easily adaptable by end users. 
Many researchers have developed systems that can be adapted by other text mining specialists, but applications that can be tuned by bench scientists are mostly lacking.
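The NER and normalization steps summarized above can be sketched with a purely dictionary-based tagger. This is a minimal illustration under stated assumptions: the lexicon, the identifiers (GENE:0001 and so on) and the tag_entities helper are invented for the example and do not reproduce BANNER, LINNAEUS, DNorm or any other cited system, which rely on far richer models.

```python
import re

# Hypothetical mini-lexicon: lower-cased surface form -> (entity type, identifier).
# Synonyms map to the same placeholder identifier, which is the normalization step.
LEXICON = {
    "brca1": ("Gene", "GENE:0001"),
    "breast cancer 1": ("Gene", "GENE:0001"),
    "tocilizumab": ("Chemical", "CHEM:0002"),
    "actemra": ("Chemical", "CHEM:0002"),
    "rheumatoid arthritis": ("Disease", "DIS:0003"),
}


def build_pattern(lexicon):
    """Compile one case-insensitive pattern; longer terms first so they win."""
    terms = sorted(lexicon, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)


def tag_entities(text, lexicon=LEXICON):
    """Return mentions with location, type and normalized identifier."""
    mentions = []
    for match in build_pattern(lexicon).finditer(text):
        entity_type, identifier = lexicon[match.group(0).lower()]
        mentions.append({
            "start": match.start(),
            "end": match.end(),
            "text": match.group(0),
            "type": entity_type,
            "id": identifier,
        })
    return mentions


sentence = "Actemra (tocilizumab) is a rheumatoid arthritis drug."
mentions = tag_entities(sentence)
for mention in mentions:
    print(mention)
```

Both drug mentions resolve to the same placeholder identifier, which is what lets downstream association extraction aggregate evidence across synonyms. A lookup of this kind misses nonstandard abbreviations and new names entirely, which is one reason the systems cited above move beyond plain dictionaries.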

Application areas

Pathway extraction and reasoning

Analyzing the intricate network of biological pathways is an essential precursor to understanding the molecular mechanisms of complex diseases affecting humans. Without acquiring a deeper insight into the underlying mechanisms behind such diseases, we cannot advance in our efforts to design effective solutions for preventing and treating them. However, given the vast amount of data currently available on biological pathways in biomedical publications and databases and the highly interconnected nature of these pathways, any attempt to manually reason over them will invariably prove to be largely ineffective and inefficient. As a result, there is a growing need for computational approaches to address this demanding task through automated pathway analysis. Pathway analysis can be either quantitative or qualitative and is a key focus of the growing field of Systems Biology. Quantitative pathway analysis uses dynamic mathematical models for simulating pathways and can be especially useful in drug discovery and the development of patient-specific dosage guidelines [56]. Some examples of techniques used in this form of analysis include ordinary differential equations [57], Petri Nets [58], and π-calculus [59]. Qualitative pathway analysis uses static, structural representations of pathways to answer qualitative questions about them; for instance it may be used to explain why a certain phenomenon occurs in the pathway based on existing pathway knowledge. Artificial intelligence paradigms, such as symbolic (i.e. explicit representations) or connectionist (i.e. massively parallelized) approaches, can greatly inform this type of pathway analysis [60]. Although some of the techniques principally addressing quantitative pathway analysis, such as Petri Nets and π-calculus, may also be used to perform qualitative pathway analysis, they typically tend to provide limited functionality [61]. 
Therefore, richer languages such as Maude [62], BioCham [63] and action languages [52, 64, 65] are more popular in this domain. In recent years, hybrid approaches have been applied for qualitative pathway reasoning. For instance, [66] presents a qualitative pathway reasoning system that uses Petri net semantics as the pathway specification language and action languages as the query language. Pathway reasoning, as a technique, relies on either humans defining the pathway information needed or the development of new algorithms to extract, represent and reason over biological pathways, which is an area of growing interest.
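A qualitative question of the kind described above ("can this molecule ever reach the nucleus?") can be sketched with a toy Petri net. The places, transitions and helper names below are illustrative assumptions and are unrelated to the specification languages of the cited systems; the sketch only shows token firing plus a breadth-first search over reachable markings.

```python
from collections import deque

# Hypothetical two-step pathway: phosphorylation followed by nuclear transport.
# Each transition consumes and produces tokens in named places.
TRANSITIONS = {
    "phosphorylation": {
        "consume": {"substrate": 1, "kinase": 1},
        "produce": {"substrate_P": 1, "kinase": 1},  # the kinase is returned
    },
    "transport": {
        "consume": {"substrate_P": 1},
        "produce": {"substrate_P_nucleus": 1},
    },
}


def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= count
               for place, count in transition["consume"].items())


def fire(marking, transition):
    """Return the new marking after firing; the input marking is not modified."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, count in transition["consume"].items():
        new_marking[place] -= count
    for place, count in transition["produce"].items():
        new_marking[place] = new_marking.get(place, 0) + count
    return new_marking


def can_reach(marking, place, transitions, max_states=10_000):
    """Breadth-first search: can any firing sequence put a token in `place`?"""
    seen = set()
    queue = deque([marking])
    while queue and len(seen) < max_states:
        current = queue.popleft()
        key = frozenset((p, n) for p, n in current.items() if n > 0)
        if key in seen:
            continue
        seen.add(key)
        if current.get(place, 0) > 0:
            return True
        for transition in transitions.values():
            if enabled(current, transition):
                queue.append(fire(current, transition))
    return False


start = {"substrate": 2, "kinase": 1}
print(can_reach(start, "substrate_P_nucleus", TRANSITIONS))            # with kinase
print(can_reach({"substrate": 2}, "substrate_P_nucleus", TRANSITIONS))  # kinase absent
```

Removing the kinase from the initial marking makes the nuclear place unreachable, which is the kind of explanatory, structural answer that qualitative analysis targets. Exhaustive search over markings scales poorly, so the max_states cap is a pragmatic guard rather than part of the formalism.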

Gene prioritization and gene function prediction

Complex diseases present diverse symptoms because they are caused by multiple genes and environmental factors that differ for each individual and can diverge at different stages of the disease process. This complexity is reflective of epistatic effects where causative genes have an impact on the expression of many other genes. Because variant expression levels vary across the genome, it is difficult to determine true causative genes or distinguish key sets affected by the disease from high-throughput experiments. For example, the Affymetrix U133 Plus 2.0 microarray chip from the Repository of Molecular Brain Neoplasia Data shows >7500 2-fold differentially expressed genes in brain cancer tissue when compared with normal brain tissue [67]. The validation of a single causative gene is a long and expensive process [68], often taking a year or longer, which necessitates using gene prioritization to pare down the list of potential gene targets to a manageable size. Gene prioritization methods that suggest the most significant prospects for further validation are critically needed, and method development in this area would greatly facilitate discovery. Many gene prioritization algorithms have been developed to address this problem, such as GeneWanderer [69], GeneSeeker [70], GeneProspector [71], SUSPECTS [72], G2D [73] and Endeavour [74], among others [75, 76]. A comparative review of these methods can be found in Tranchevent et al. [77]. The general premise of these methods is to rank candidate genes by their similarity to genes already known to be associated with the disease (usually called the training set).
Similarity is established based on different parameters (depending on the specific method) and may include purely biological measures (such as cytogenetic location, expression patterns, patterns of pathogenic mutations or DNA sequence similarity), biological measures plus annotation of the genes using different protein databases (for example, UniProt [78] and InterPro [79]), or other vocabularies and ontologies (such as the Gene Ontology [80, 81], eVOC [82], MeSH [83] and term vectors from the literature). In these methods, the closer a gene in the candidate list coincides with the profile of the training genes, the higher it is ranked. Gene prioritization includes the areas of gene function prediction. The Critical Assessment of protein Function Annotation experiment was the first large community-wide evaluation of 54 methods that were compared on a core set of annotations using evaluation metrics to ascertain the top methods [84]. Earlier computational methods for prioritization were compared through a large-scale biological assay of yeast mitochondrial phenotypes and found to be effective [85, 86]. A related but distinct gene prioritization problem is the identification of genes with tissue-specific expression patterns [87]. Existing webservers such as GeneMANIA [88, 89] and IMP [90] allow biologists to perform gene prioritization by network connectivity, and servers such as PILGRM allow for prioritization directly by gene expression [91]. Predicted functions, in addition to curated functions, have also shown promise for interpreting the results of genome-wide association studies, which aim to pair genetic variants with associated genes and pathways [92].
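The ranking premise described above, in which candidates are scored by how closely they match the profile of the training genes, can be sketched as follows. The gene names, annotation terms and the choice of cosine similarity against a centroid profile are assumptions made for illustration; the cited tools each use their own features and scoring schemes.

```python
import math
from collections import Counter


def centroid(profiles):
    """Average a collection of sparse term-weight profiles."""
    profiles = list(profiles)
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return {term: weight / len(profiles) for term, weight in total.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def prioritize(candidates, training):
    """Rank candidate genes by similarity to the training-set profile."""
    reference = centroid(training.values())
    scores = {gene: cosine(profile, reference)
              for gene, profile in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented annotation profiles (term -> weight), for illustration only.
training = {
    "known1": {"DNA repair": 1, "cell cycle": 1},
    "known2": {"DNA repair": 1, "apoptosis": 1},
}
candidates = {
    "candA": {"DNA repair": 1, "apoptosis": 1},
    "candB": {"lipid metabolism": 1},
    "candC": {"cell cycle": 1, "lipid metabolism": 1},
}
ranking = prioritize(candidates, training)
for gene, score in ranking:
    print(f"{gene}: {score:.3f}")
```

The same skeleton accommodates other evidence types by swapping the profile contents (for example expression values or literature term vectors) and the similarity function.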

Precision medicine and drug repositioning

Precision medicine determines prevention and treatment strategies based on an individual’s predisposition in an effort to provide more targeted and therefore more effective treatments [93]. This area is poised for intense growth based on the ease of obtaining patient data and the development of computational methods with which to analyze this personalized data. While precision medicine is a nascent field, there have been many advances in the personalized treatment of cancer. Some hospitals are already using genetic data to direct treatment options for cancer patients (e.g. BRCA1 and BRCA2 [94], BRAF [95] testing), though drugs targeted to specific mutations lag behind; this is an area where computational drug repositioning will potentially have a strong impact [96]. On the clinical side of translational research, the demand for timely and accurate knowledge has the urgency of life itself. Emily Whitehead was the first child with acute lymphoblastic leukemia to be treated and cured with an experimental T cell therapy called CAR T cell therapy at the Children’s Hospital of Philadelphia [97]. The therapy enables the patient’s T cells to recognize and attack malignant B cells, but this treatment can also trigger an intense immune reaction, which Emily experienced. She suffered from a high level of the interleukin 6 protein, and her doctors suggested trying tocilizumab (Actemra), a rheumatoid arthritis drug, to combat the extraneous protein production [97, 98]. This drug returned Emily’s vital signs to normal. In this case, rather than relying on the serendipity of a team member knowing about the right drug, specialized text mining could have been used to mine the literature for the relevant drugs. In such a scenario, the literature would either be mined in advance, with the extracted relationships between drugs and genes or proteins stored in a database, or be searched in real time. As an example of this, Essack et al.
created a sickle cell disease knowledgebase by mining 419 612 PubMed abstracts related to red blood cells, anemia or this disease [99]. Some databases (such as PharmGKB) store such relationships, but are not the result of automatic extraction. Manual curation is still the current standard for such databases, with the value of text mining applications yet to be fully realized. Currently, despite notable advances in entity mention extraction and normalization, the use of text mining is mostly limited to aiding curators to speed up the process. Data and text mining methods are useful for biomedical predictions and can be successfully extended to biomedical discoveries as well. Sirota et al. used publicly available gene expression data for both drugs and diseases to ascertain if Food and Drug Administration-approved drugs could be repositioned for use in new diseases [100]. They discovered and experimentally validated the use of cimetidine, generally used for heartburn and peptic ulcers, as a treatment option for lung adenocarcinoma illustrating the use of a computational approach as an efficient, yet powerful, approach to drug discovery [100, 101]. Frijters et al. successfully found links between genes, drugs, pathways and diseases through their tool CoPub Discovery that mines the biomedical literature for the elucidation of new relationships between these concepts [102]. Based on their predictions, they validated two different drugs’ role in cell proliferation through a cell assay to illustrate the validity of their tool for finding novel associations. This tool may be useful in finding new connections between drugs and their targets, as well as the ability to repurpose drugs for disease treatment.

Data integration

Data integration represents a particularly important type of computational approach. Integrative analyses can identify patterns that are evident across many distinct experiments. Patterns from imperfectly matched experiments are likely to be general responses to a common environment as opposed to unique features of an experiment [103-106]. Integrative analyses, while they have substantial potential to identify general principles, also raise specific challenges, largely driven by potentially undesirable features of the data. For example, Huttenhower et al. [107] found that the mutual information between data sets was largely driven by the experimental platform and not relevant biological signals. To address this challenge, many integrative methods use either carefully curated and selected data sets [100, 101, 108, 109] or supervised machine learning methods [19, 90, 110–118]. As an example of carefully selected data sets, Sirota et al. [100] used a labeled compendium of gene expression experiments of disease state and drug treatment to identify drugs that induced an expression profile that was anti-correlated with disease. In addition, gene expression values were analyzed using rank-based statistics, which may also mitigate platform-specific noise. Supervised analyses can mitigate the effects of technical artifacts by grading each data set by how much information each provides about different aspects of biology. Many methods have been successfully applied to this challenge including Bayesian [90] and ridge regression [118] approaches. For example, Greene et al. [19] used a Bayesian approach to weigh each of approximately 1000 data sets by how well they captured tissue-specific functional relationships. 
This approach produced tissue-specific networks for 144 human tissues, and networks generated by the tissue-specific Bayesian integration of the complete compendium outperformed an approach that integrated only tissue-specific data sets on both coverage of tissues and overall network metrics. To combat platform-specific signals, Greene et al. [19] calculated the mutual information across data sets for non-related pairs of genes to identify and down-weight data set similarity that was independent of biology. In addition to approaches that rely on the curation of data sets or supervised methods, new techniques based on advances in deep learning are now also being applied to the challenge of data integration [119]. For example, Tan et al. [119] performed an analysis using denoising autoencoders of gene expression to extract features from a set of ∼2000 breast cancer biopsies. In this approach, a neural network model is trained to reconstruct the observed data from data where noise has been added. The identified features corresponded to subtype, estrogen receptor status and other features that had a well-documented role in the biology of breast cancer. Of particular note, the features generalized to an independent data set generated on a distinct platform without a loss in accuracy, suggesting that the model had identified these biological features without overfitting to the platform. Unsupervised methods capable of identifying biological signals without confounding technical artifacts present substantial opportunities for new algorithms that integrate large-scale data compendia where the curated knowledge required by supervised algorithms is limited or unavailable.
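The signature-reversal idea described above for Sirota et al. [100], where a drug is of interest when its expression profile is anti-correlated with the disease profile and values are compared through ranks, can be sketched as follows. Spearman correlation is used here as a stand-in rank-based statistic and is an assumption, not necessarily the statistic used in the cited study; the expression values and drug names are invented.

```python
import math


def ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Invented log fold-changes over the same five genes, for illustration only.
disease = [2.0, 1.5, -0.5, -1.0, 0.3]
drugs = {
    "drug_x": [-1.8, -1.2, 0.4, 1.1, -0.2],  # reverses the disease pattern
    "drug_y": [1.9, 1.0, -0.3, -0.9, 0.2],   # mimics the disease pattern
}

# Most negative correlation first: the strongest candidate for repositioning.
ranked = sorted(drugs, key=lambda name: spearman(disease, drugs[name]))
for name in ranked:
    print(name, round(spearman(disease, drugs[name]), 3))
```

Because only the ordering of genes within each profile matters, a platform that shifts or rescales all measurements monotonically leaves the score unchanged, which is the sense in which rank-based statistics may mitigate platform-specific noise.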

Pharmacogenomics

The field of pharmacogenomics has benefitted significantly from recent progress in text and data mining for biomedical discoveries. Pharmacogenomics studies the genetic basis of individual drug responses by exploring the relationship between drugs, genes and diseases and analyzing pharmacokinetic and pharmacodynamic pathways. Pertinent pharmacogenomics-related information is typically extracted through the manual curation of data from pharmacogenomics literature and stored in the freely accessible PharmGKB database. However, substantial advances in drug, gene and disease detection, along with the increased efficacy of methods for extracting relations between drugs, genes and diseases, have now made it possible to use automated systems to help with this curation process [120]. Two good reviews on pharmacogenetic text mining have recently been published [121, 122], and in 2014, Lakiotaki et al. presented design specifications for building an integrated information system for offering personalized drug recommendations using genotype-to-phenotype knowledge on pharmacogenomics [123]. Every year the field of pharmacogenomic text mining continues to expand in different novel directions, gradually turning the vision of personalized medicine into reality.

Toxicology

The field of toxicology has an increasing need for text and data mining approaches capable of predicting chemical–biological interactions of thousands of chemicals that humans are exposed to either intentionally (via pharmaceuticals, diet) or unintentionally (contaminated air, water and food). Substantial data are required to identify potential toxicological effects of each chemical, and for regulators, such as the US Environmental Protection Agency (EPA) and European Chemical Agency, to make decisions protective of human health. The EPA inventories chemicals in commerce under the Toxic Substances Control Act (TSCA) [124]. In a 2009 review, ‘The toxicity data landscape for environmental chemicals', Judson et al. reported 75 000 chemicals in the TSCA database and identified 9912 chemicals under prioritization for testing by the EPA [125]. A lack of toxicity data limits the ability of regulators to make informed decisions and for health agencies to assess risk and respond in the case of exposure. Judson et al. further report that evaluation of almost 10 000 chemicals under the current testing paradigm would be both cost and time prohibitive, as in vivo studies require 2–3 years and millions of dollars per chemical [125]. The urgent need for more data has prompted development of in vitro high-content and High-Throughput Screening (HTS) methods to evaluate many biological endpoints relatively inexpensively. Multiple data mining approaches will be required to use these data to address the large knowledge gaps in toxicology. These include both broad analyses that leverage HTS and in vivo data across chemicals to predict biological effects of new compounds, as well as deeper analysis of genome-wide data sets at multiple levels of biological organization to predict how chemicals disrupt biological processes. 
Regulatory agency research initiatives, combined with the increasing use of HTS and high-content approaches by independent researchers, are rapidly expanding the universe of toxicological data available to the public. A vast array of data is currently being collected through Tox21, a multi-agency collaborative HTS effort to identify chemical–biological interactions and the chemical concentrations that cause toxic effects [126]. The EPA ToxCast program is evaluating chemical toxicity with 700 biochemical and cell-based HTS assays and using this information to identify chemical signatures that predict potential toxicity and to prioritize chemicals for further testing. In parallel with the expansion of pharmacogenomic approaches to pharmaceutical development, computational toxicology has used data mining to identify features of environmental chemicals that mediate activity leading to potential adverse effects. Ekins et al. reviewed quantitative structure–activity relationship (QSAR) and machine learning models that have been developed to predict specific toxicity endpoints, such as hepatotoxicity, cardiotoxicity and genotoxicity, from HTS, molecular descriptor and literature data compilations [127]. The predictive power of models developed from the first ToxCast HTS data set (∼300 chemicals) was limited, potentially because of a lack of redundant chemicals with positive signal to cover the array of mechanisms that lead to toxic effects in vivo, or because of the large chemical space between training and test data sets [128, 129]. Prediction of whole-animal toxicity encompassing diverse biological endpoints presents a particular challenge because a chemical can disrupt multiple molecular pathways and have different effects depending on the biological context. For example, 2,3,7,8-tetrachlorodibenzo-p-dioxin is an activator of the aryl hydrocarbon receptor pathway and a tumor promoter [130]. Exposure during early development, however, leads to developmental abnormalities, including heart defects [131].
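One common way to quantify the distance in chemical space between training and test chemicals is the Tanimoto (Jaccard) similarity between binary structural fingerprints. The sketch below is a generic illustration of that idea, not the procedure used in the cited ToxCast analyses. Fingerprints are represented as sets of hypothetical feature indices, and the threshold is an arbitrary assumption.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0  # convention: two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


def nearest_training_similarity(test_fp, training_fps):
    """Similarity of a test chemical to its closest training chemical."""
    return max(tanimoto(test_fp, fp) for fp in training_fps)


# Hypothetical fingerprints: each integer stands for a structural feature that is present.
training = [{1, 2, 3, 4}, {2, 3, 5, 8}, {10, 11, 12}]
test_chemicals = {"chem_A": {1, 2, 3, 5}, "chem_B": {20, 21, 22}}

THRESHOLD = 0.5  # arbitrary cut-off, for illustration only
for name, fp in test_chemicals.items():
    sim = nearest_training_similarity(fp, training)
    status = "inside" if sim >= THRESHOLD else "outside"
    print(f"{name}: nearest-neighbour similarity {sim:.2f} ({status} the training domain)")
```

A test chemical with low similarity to every training chemical, such as `chem_B` here, lies far from the training data, and predictions for it would be expected to be less reliable.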
Accurate descriptors and classification of chemicals in training sets are essential, but depend on rich data sets as well as knowledge of biological pathways [132]. Transcriptomic, proteomic and metabolomic studies in the context of chemical exposures are beginning to provide biological pathway information that is critical to understanding mechanisms of chemical-induced toxicity. Several projects, led by Tox21 collaborators and others, aim to identify and classify signals of chemical exposure from transcriptome data [133]. Gusenleitner et al. used the National Toxicology Program DrugMatrix and the TG-GATEs (Toxicogenomics Project-Genomics Assisted Toxicity Evaluation System) databases, which contain over 5000 arrays of rat tissues and primary rat hepatocytes exposed to therapeutic, industrial and environmental chemicals, to develop a predictive model of genotoxicity from in vitro data [133]. An analysis of the same data by Tawa et al. identified gene modules associated with liver toxicity [134]. The ability to associate gene modules discovered in other tissues in these data sets with toxicological endpoints is limited by the available clinical pathology and histology annotation. Context-specific algorithms and unsupervised methods therefore have the potential to make great contributions to the field of toxicogenomics. In parallel with the expansion of pharmacogenomic approaches to improve the development of pharmaceuticals and personalized medicine, computational toxicology has used data mining to identify features of environmental chemicals that mediate activity and cause potential toxicity. Several efforts aimed at literature-based chemical annotation are underway, including the Comparative Toxicogenomics Database, which leverages text mining and manual curation to provide chemical–gene–disease interaction data [135]. Accurate classification of new chemicals depends on comprehensive annotation of previously studied chemicals with toxicity information.
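To show how chemical–gene–disease interaction data of this kind can be put to use, the sketch below links chemicals to diseases through the genes they share. The records are placeholders and the function `infer_chemical_disease` is a hypothetical helper. It shows only the join itself, under the assumption that interactions are available as simple pairs, and does not reproduce the schema or scoring of any particular database.

```python
from collections import defaultdict


def infer_chemical_disease(chemical_gene, gene_disease):
    """Link chemicals to diseases through shared genes.

    Returns a dict mapping (chemical, disease) to the sorted list of
    genes that connect the two.
    """
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    inferred = defaultdict(set)
    for chemical, gene in chemical_gene:
        # .get avoids creating empty entries for genes with no disease annotation.
        for disease in diseases_by_gene.get(gene, ()):
            inferred[(chemical, disease)].add(gene)
    return {pair: sorted(genes) for pair, genes in inferred.items()}


# Placeholder records, not real curated interactions.
chemical_gene = [("chem_X", "gene_1"), ("chem_X", "gene_2"), ("chem_Y", "gene_3")]
gene_disease = [("gene_1", "disease_P"), ("gene_2", "disease_P"), ("gene_2", "disease_Q")]

inferred = infer_chemical_disease(chemical_gene, gene_disease)
for (chemical, disease), genes in sorted(inferred.items()):
    print(f"{chemical} -> {disease} via {', '.join(genes)}")
```

The number of connecting genes gives a crude indication of how well supported an inferred link is. A chemical with no annotated gene interactions, or whose genes have no disease annotation (`chem_Y` here), yields no inference, which is one reason comprehensive annotation matters.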
Data mining across biological contexts will identify the chemical–pathway interactions that increase the sensitivity of certain individuals, such as the young or populations with particular genetic polymorphisms, to complex chemical/stressor exposures. Engagement of the broader scientific community is important for addressing the challenges of computational toxicology. With the release of data from 1800 ToxCast chemicals in 2013, the EPA hosted a series of challenges focused on method development for predicting chemical lowest effect levels from HTS data [136]. Data mining tools and methods that can integrate vast amounts of heterogeneous data will be needed to prioritize genes, pathways and chemicals for further investigation. A key component of the success of computational approaches in toxicology will be validation of model predictions by scientists at the bench. Centralized model repositories, databases such as the Comparative Toxicogenomics Database and the admetSAR structure–activity database [137], and web-based analysis tools are essential to facilitate research community access and to leverage existing data to inform future in vitro and in vivo toxicology research.
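The notion of a lowest effect level can be made concrete with a small example. Given concentration-response measurements from a single assay, the sketch below reports the lowest tested concentration whose mean response exceeds a threshold derived from control wells. The data, the three-standard-deviation rule and the function name are assumptions chosen for illustration. The sketch only defines the quantity on one assay and does not represent the prediction task or scoring used in the EPA challenges.

```python
from statistics import mean, stdev


def lowest_effect_concentration(control_responses, dose_responses, n_sd=3.0):
    """Return the lowest tested concentration whose mean response exceeds
    the control mean by more than n_sd control standard deviations.

    dose_responses maps concentration -> list of replicate responses.
    Returns None when no tested concentration exceeds the threshold.
    """
    threshold = mean(control_responses) + n_sd * stdev(control_responses)
    for concentration in sorted(dose_responses):
        if mean(dose_responses[concentration]) > threshold:
            return concentration
    return None


# Hypothetical assay readout (arbitrary units); concentrations in micromolar.
controls = [1.0, 1.1, 0.9, 1.0]
responses = {
    0.1: [1.0, 1.1],
    1.0: [1.1, 1.2],
    10.0: [1.9, 2.1],
    100.0: [3.8, 4.2],
}
lec = lowest_effect_concentration(controls, responses)
print(lec)
```

In practice a fitted concentration-response curve is often preferred to a threshold on individual tested concentrations, because the threshold approach can only return one of the concentrations that happened to be tested.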

Conclusion

We have reviewed recent advances in text and data mining in the context of emerging application domains in the biomedical sciences. Computational methods contribute to this field by bringing knowledge from the literature, either extracted or curated, together with high-throughput data sets to identify both known and new relationships between genes, pathways, drugs, environmental contaminants and diseases. Different approaches are often used for mining unstructured text and structured biomedical data. For this reason, integrating across unstructured and structured resources presents additional challenges, but combining these domains will also present new opportunities. Systems that can extract relationships from both literature and data simultaneously offer the opportunity to identify meaningful patterns in data, identify literature support for those patterns and, where warranted, identify relationships that are highly consistent in large-scale high-throughput data sets but absent from the literature. This makes it possible to develop computational algorithms that not only identify biological principles but also recognize when those principles may represent novel discoveries.

Key Points

The era of ‘big data’ presents biomedical researchers with unprecedented challenges and opportunities for discovery.
Automatic methods for text and data mining are essential tools that need to be deployed to deal with large data sets of highly heterogeneous, but complementary, data.
Key advances in data and text mining will empower bench scientists rather than replace them.
A major challenge in the big data era for text and data mining is the integration of different sources, such as curated databases, biomedical literature and results from assays, to answer questions or generate novel hypotheses.
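The integration of literature-derived and data-derived relationships discussed in the Conclusion can be illustrated with a simple set comparison. The sketch separates relationships that are consistent across data sets and supported by the literature from those that are consistent across data sets but absent from it. The identifiers are placeholders, and the `min_datasets` consistency criterion is an assumption made for this sketch rather than a method proposed in the review.

```python
from collections import Counter


def partition_relationships(literature, dataset_hits, min_datasets=2):
    """Split data-derived relationships by literature support.

    literature: set of (entity_a, entity_b) pairs mined or curated from text.
    dataset_hits: list of sets, one per data set, of pairs detected in that data set.
    A pair is 'consistent' if it is detected in at least min_datasets data sets.
    """
    support = Counter(pair for hits in dataset_hits for pair in hits)
    consistent = {pair for pair, n in support.items() if n >= min_datasets}
    return {
        "supported": consistent & literature,
        "candidate_novel": consistent - literature,
    }


# Placeholder identifiers, not real findings.
literature = {("chemical_1", "gene_A"), ("chemical_2", "gene_B")}
dataset_hits = [
    {("chemical_1", "gene_A"), ("chemical_3", "gene_C")},
    {("chemical_1", "gene_A"), ("chemical_3", "gene_C"), ("chemical_2", "gene_D")},
    {("chemical_3", "gene_C")},
]
result = partition_relationships(literature, dataset_hits)
print("supported by literature:", result["supported"])
print("candidate novel:", result["candidate_novel"])
```

Pairs in the `candidate_novel` group are hypotheses for follow-up at the bench, not discoveries in themselves. Absence from a mined literature set may simply reflect incomplete extraction.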

Funding

This work was supported by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative [GBMF4552 to C.S.G.], the National Institute of Environmental Health Sciences [Ruth L. Kirschstein Postdoctoral Fellowship Award F32ES025082 to B.C.G.] and the National Library of Medicine [R01LM011176 to G.H.G.].

93 in total

Review 1. Petri Net representations in systems biology.

Authors: J W Pinney; D R Westhead; G A McConkey
Journal: Biochem Soc Trans Date: 2003-12 Impact factor: 5.407

2. High-performance gene name normalization with GeNo.

Authors: Joachim Wermter; Katrin Tomanek; Udo Hahn
Journal: Bioinformatics Date: 2009-02-02 Impact factor: 6.937

3. PharmGKB: understanding the effects of individual genetic variants.

Authors: Katrin Sangkuhl; Dorit S Berlin; Russ B Altman; Teri E Klein
Journal: Drug Metab Rev Date: 2008 Impact factor: 4.518

4. Synthesis of pharmacokinetic pathways through knowledge acquisition and automated reasoning.

Authors: Luis Tari; Saadat Anwar; Shanshan Liang; Jörg Hakenberg; Chitta Baral
Journal: Pac Symp Biocomput Date: 2010

5. Precision medicine meets public health: population screening for BRCA1 and BRCA2.

Authors: Ephrat Levy-Lahad; Amnon Lahad; Mary-Claire King
Journal: J Natl Cancer Inst Date: 2014-12-30 Impact factor: 13.506

6. Biological network extraction from scientific literature: state of the art and challenges.

Authors: Chen Li; Maria Liakata; Dietrich Rebholz-Schuhmann
Journal: Brief Bioinform Date: 2013-02-22 Impact factor: 11.622

7. Assessment of NER solutions against the first and second CALBC Silver Standard Corpus.

Authors: Dietrich Rebholz-Schuhmann; Antonio Jimeno Yepes; Chen Li; Senay Kafkas; Ian Lewin; Ning Kang; Peter Corbett; David Milward; Ekaterina Buyko; Elena Beisswanger; Kerstin Hornbostel; Alexandre Kouznetsov; René Witte; Jonas B Laurila; Christopher Jo Baker; Cheng-Ju Kuo; Simone Clematide; Fabio Rinaldi; Richárd Farkas; György Móra; Kazuo Hara; Laura I Furlong; Michael Rautschka; Mariana Lara Neves; Alberto Pascual-Montano; Qi Wei; Nigel Collier; Md Faisal Mahbub Chowdhury; Alberto Lavelli; Rafael Berlanga; Roser Morante; Vincent Van Asch; Walter Daelemans; José Luís Marina; Erik van Mulligen; Jan Kors; Udo Hahn
Journal: J Biomed Semantics Date: 2011-10-06

8. Speeding disease gene discovery by sequence based candidate prioritization.

Authors: Euan A Adie; Richard R Adams; Kathryn L Evans; David J Porteous; Ben S Pickard
Journal: BMC Bioinformatics Date: 2005-03-14 Impact factor: 3.169

9. The Comparative Toxicogenomics Database's 10th year anniversary: update 2015.

Authors: Allan Peter Davis; Cynthia J Grondin; Kelley Lennon-Hopkins; Cynthia Saraceni-Richards; Daniela Sciaky; Benjamin L King; Thomas C Wiegers; Carolyn J Mattingly
Journal: Nucleic Acids Res Date: 2014-10-17 Impact factor: 16.971

10. Genomic models of short-term exposure accurately predict long-term chemical carcinogenicity and identify putative mechanisms of action.

Authors: Daniel Gusenleitner; Scott S Auerbach; Tisha Melia; Harold F Gómez; David H Sherr; Stefano Monti
Journal: PLoS One Date: 2014-07-24 Impact factor: 3.240
