Philippe Thomas1, Pawel Durek2, Illés Solt2, Bertram Klinger2, Franziska Witzel2, Pascal Schulthess2, Yvonne Mayer1, Domonkos Tikk1, Nils Blüthgen2, Ulf Leser1.
1, 2. Humboldt-Universität zu Berlin, Institute for Computer Science, Knowledge Management in Bioinformatics, 10099 Berlin, Germany; Institute of Pathology, Charité-Universitätsmedizin Berlin, Deutsches Rheuma Forschungszentrum, Charitéplatz 1, 10117 Berlin, Germany; Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary; and Integrative Research Institute for the Life Sciences, Humboldt-Universität zu Berlin, Philippstr. 13 Haus 18, 10115 Berlin, Germany.
Abstract
MOTIVATION: A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural language descriptions scattered over millions of scientific publications. Such data are hard to extract, and regulatory databases together currently contain only 503 regulatory relations between human TFs.
RESULTS: We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences using a machine-learning approach. The top 2500 sentences contained ∼900 sentences that encompass relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not previously present in any regulatory database. Full-text curation allowed us to obtain detailed information on the strength of the experimental evidence supporting each relationship.
CONCLUSIONS: We were able to increase curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state of the art.
AVAILABILITY AND IMPLEMENTATION: The web service is freely accessible at http://fastforward.sys-bio.net/.
CONTACT: leser@informatik.hu-berlin.de or nils.bluethgen@charite.de
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
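The triage step described under RESULTS (rank candidate sentences, keep the top of the ranking, set aside relations already in databases, curate the rest) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Candidate` type, the `triage` function, the scores and the placeholder names `TF_A` to `TF_F` are all hypothetical, and the score is assumed to come from some previously trained classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One sentence that may describe a TF-TF regulatory relation (hypothetical type)."""
    sentence: str
    regulator: str  # placeholder name of the regulating TF
    target: str     # placeholder name of the regulated TF
    score: float    # assumed classifier confidence; higher means more likely a true relation


def triage(candidates, known_relations, top_k):
    """Rank by score, keep the top_k, and split them into known vs. to-be-curated."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]
    known, to_curate = [], []
    for c in ranked:
        # A relation counts as known if its (regulator, target) pair is already in a database.
        if (c.regulator, c.target) in known_relations:
            known.append(c)
        else:
            to_curate.append(c)
    return known, to_curate


# Illustrative data only; these are not sentences or scores from the study.
candidates = [
    Candidate("TF_A induces expression of TF_B.", "TF_A", "TF_B", 0.91),
    Candidate("TF_C represses TF_D.", "TF_C", "TF_D", 0.84),
    Candidate("TF_E and TF_F were both measured.", "TF_E", "TF_F", 0.12),
    Candidate("TF_B binds the TF_C promoter.", "TF_B", "TF_C", 0.77),
]
known_relations = {("TF_A", "TF_B")}

known, to_curate = triage(candidates, known_relations, top_k=3)
for c in to_curate:
    print(f"{c.score:.2f}  {c.regulator} -> {c.target}: {c.sentence}")
```

In the study the same split was applied to the top 2500 sentences, with the to-be-curated portion then checked manually against the full text.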