'Sciencenet'--towards a global search and share engine for all scientific knowledge.

Dominic S Lütjohann, Asmi H Shah, Michael P Christen, Florian Richter, Karsten Knese, Urban Liebel.

Abstract

SUMMARY: Modern biological experiments create vast amounts of data which are geographically distributed. These datasets consist of petabytes of raw data and billions of documents. Yet to the best of our knowledge, a search engine technology that searches and cross-links all different data types in life sciences does not exist. We have developed a prototype distributed scientific search engine technology, 'Sciencenet', which facilitates rapid searching over this large data space. By 'bringing the search engine to the data', we do not require server farms. This platform also allows users to contribute to the search index and publish their large-scale data to support e-Science. Furthermore, a community-driven method guarantees that only scientific content is crawled and presented. Our peer-to-peer approach is sufficiently scalable for the science web without performance or capacity tradeoff.
AVAILABILITY AND IMPLEMENTATION: The free-to-use search portal and the downloadable client are available at http://sciencenet.kit.edu. The web portal for index administration is implemented in ASP.NET, the 'AskMe' experiment publisher is written in Python 2.7, and the backend 'YaCy' search engine is based on Java 1.6.

Year: 2011 PMID: 21493657 PMCID: PMC3106183 DOI: 10.1093/bioinformatics/btr181

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 MOTIVATION

The most widely known search engines (Bing, Google) are based on popularity ranking algorithms. However, scientific research has special requirements for search engines that popularity ranking cannot address in all cases. Specialized search engines [for example, Scirus (McKiernan, 2005), PubMed, Google Scholar, Web of Science and Scopus] concentrate on providing content from scientific journals and literature (Falagas et al.). Other meta search engines cross-link several centralized databases via a single search interface [for example Bioinformatic Harvester (Liebel et al.), EB-eye (Valentin ), Entrez (Schuler ), Ensembl (Flicek et al., 2010), STRING (Szklarczyk )].

Today's scientific search queries require searching across different data sources that are geographically distributed. Different data types, such as high-content screening (HCS) image data or sequence-based data (Birney et al.), often require special databases that present a challenge to the global search methods mentioned above. The latest developments in high-content/high-throughput screening microscopy (Pepperkok and Ellenberg, 2006) and next-generation sequencing technologies (Metzker, 2010) routinely produce experimental datasets in the terabyte (TB) range, resulting in millions of data files. To the best of our knowledge, there is no central database that encompasses all experiment datasets, because large-scale data handling is a challenge for any known data publication platform. Uploading all this data to a centralized database is currently too time-consuming and expensive (Schadt et al.). Maintaining a centralized infrastructure over the years is also costly (Ball ). Consequently, it is likely that no single library alone will be able to index the entire science web (Lewandowski and Mayr, 2006). Research strongly benefits from accessible data that provides a valuable resource for comparative and novel studies (Campbell, 2009). Thus, a decentralized search and publishing network that can handle multiple data types at different locations will significantly improve the scientific research process.

2 DESIGN AND IMPLEMENTATION

We designed Sciencenet, a distributed peer-to-peer search engine network that can incorporate many different scientific data types, such as text, large-scale image datasets (Swedlow and Eliceiri, 2009), DNA/protein sequences (Ansorge, 2009) and mass spectrometry (MS) data (Gstaiger and Aebersold, 2009), which are published on web servers. It facilitates linking search results to other related heterogeneous data sources. To ensure the scalability of the data space, documents are located via a Distributed Hash Table (DHT) (Balakrishnan ). This avoids having to query every peer to obtain a complete search result. Our DHT rule allows the index elements for a single search request to be stored on several peers. Because queries run concurrently, response time improves as more peers contribute. The distributed Sciencenet software platform has the following key elements:

(1) A large-scale index technology capable of handling billions of documents belonging to the scientific web. Based on KIT's 350 000 web pages and the 6471 scientific sites currently in the whitelist, we estimate that over 2 billion documents will need to be integrated. Our startup environment for Sciencenet consists of 30 commodity PCs, equipped with 2–24 cores, 4–64 GB of RAM and 500 GB hard disks, each capable of handling 15 million documents, which would require a total of only about 200 peers for the estimated data space. The operating system is standard Ubuntu 10.04 with Java. This architecture was chosen to mimic a global distributed search engine. These Sciencenet PCs (peers) are configured to crawl (load and analyze) distinct scientific web sites and to import repositories that provide an Open Archives Initiative (OAI) interface (Lagoze ). OAI is a standard for importing data sources in a fast and structured manner. Currently, 240 million web pages and documents are in the index of our machines. Furthermore, 1 TB of image-based data is available.
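The DHT-based lookup described at the start of this section can be illustrated with a minimal sketch. It is not the Sciencenet or YaCy implementation: the class name `ToyDHT`, the use of SHA-1, the ring layout and the redundancy of three peers are all assumptions made for illustration. The sketch only shows the general idea that the index entry for a term is placed on a few peers chosen by hashing, so a query contacts those peers rather than the whole network.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the hash ring (SHA-1 is an assumption)."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class ToyDHT:
    """Hypothetical term-partitioned index spread over named peers."""

    def __init__(self, peer_names, redundancy=3):
        # Sort peers by ring position so lookups can use binary search.
        self.ring = sorted((ring_position(name), name) for name in peer_names)
        self.positions = [pos for pos, _ in self.ring]
        self.redundancy = min(redundancy, len(self.ring))
        # In-memory stand-in for each peer's local index: peer -> term -> URLs.
        self.store = {name: {} for name in peer_names}

    def responsible_peers(self, term):
        """Return the peers that hold the index entry for `term`."""
        start = bisect.bisect_left(self.positions, ring_position(term))
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.redundancy)
        ]

    def index(self, term, url):
        """Store the (term, url) pair on every responsible peer."""
        for peer in self.responsible_peers(term):
            self.store[peer].setdefault(term, set()).add(url)

    def search(self, term):
        """Query only the responsible peers and merge their answers."""
        hits = set()
        for peer in self.responsible_peers(term):
            hits |= self.store[peer].get(term, set())
        return hits
```

With 30 hypothetical peers, `ToyDHT([f"peer{i}" for i in range(30)])`, a call to `index("zebrafish", url)` writes to three peers only, and `search("zebrafish")` reads from the same three. In a real network the per-peer lookups would run concurrently, which is where the response-time benefit mentioned above would come from.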
The scientific community can easily provide server capacity to expand the index and improve search performance.

(2) A community-driven method to manage the integration of institutional web sites, databases and journals to improve the quality of the scientific search index. Anyone can submit a scientific web site, and registered users can take part in accepting these suggestions to support the growth of the index.

(3) A simple ‘one-stop’ search interface for all users. The Sciencenet web site (http://sciencenet.kit.edu) provides a search portal without installation. The search results are presented along with a domain navigator and a tag cloud to refine the search. Alternatively, users can download the free open-source Sciencenet-YaCy client software package, which allows them to access the search network from their own machines, perform search queries and access scientific experiment data published by others. The result list can be exported via an Application Programming Interface for further processing in external tools. Because the index is preselected, we consider every search result to be relevant, so pre-computed ranking, such as PageRank (Brin and Page, 1998), is not used. The results are ranked using a default ‘ranking matrix’ consisting of a set of 28 statistical ranking criteria, such as ‘word distance’ or ‘appearance in title’ (see Supplementary Material). For each search query, users can customize the values of the ranking matrix with no increase in the overall complexity.

(4) An easy-to-use software tool that allows data publishing and sharing. Users are able to publish and share their own scientific data or web sites. We provide an example module (the ‘AskMe’ tool) for non-text based data integration in the downloadable client. The tool handles large-scale image datasets from HCS experiments by providing a dataset preview. All collected meta information is presented in corresponding experiment descriptor files in both human and computer readable form. For this purpose, we use RDFa, which embeds the Resource Description Framework in web pages (Birbeck and Adida, 2008). This data publication method follows the principle of a Linked Open Data architecture (Berners-Lee, 2006) and is already the foundation for a semantically enriched web (Jensen and Bork, 2010).
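To make the idea of an experiment descriptor that is both human and computer readable more concrete, the following sketch renders a small XHTML fragment with RDFa attributes. It is an illustration only: the function name `experiment_descriptor`, the choice of Dublin Core terms and the field set are assumptions, and the actual descriptor files produced by the 'AskMe' tool may look different.

```python
from html import escape

# Dublin Core element namespace; using it here is an assumption for illustration.
DC_NS = "http://purl.org/dc/elements/1.1/"


def experiment_descriptor(url, title, creator, description):
    """Render an XHTML fragment whose visible text doubles as RDFa metadata."""
    return (
        f'<div xmlns:dc="{DC_NS}" about="{escape(url, quote=True)}">\n'
        f'  <h1 property="dc:title">{escape(title)}</h1>\n'
        f'  <p>Published by <span property="dc:creator">{escape(creator)}</span></p>\n'
        f'  <p property="dc:description">{escape(description)}</p>\n'
        f'</div>'
    )
```

A browser shows the title, publisher and description as ordinary text, while an RDFa-aware crawler can read the same elements as statements about the resource named in the `about` attribute.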
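The customizable ranking matrix described under (3) can be pictured as a weighted sum of per-criterion scores. The sketch below is an assumption-laden toy: it implements only two of the 28 criteria named in the text, and the scoring formulas, the weight values and all function names are hypothetical rather than taken from YaCy or the Supplementary Material.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    url: str
    title: str
    text: str


def appearance_in_title(hit, terms):
    """Fraction of query terms that occur as words in the title."""
    title_words = hit.title.lower().split()
    return sum(term in title_words for term in terms) / len(terms)


def word_distance(hit, terms):
    """Higher when the first occurrences of the terms lie close together."""
    words = hit.text.lower().split()
    positions = []
    for term in terms:
        if term not in words:
            return 0.0  # a missing term earns no proximity score
        positions.append(words.index(term))
    return 1.0 / (1 + max(positions) - min(positions))


CRITERIA = {
    "appearance_in_title": appearance_in_title,
    "word_distance": word_distance,
}

# Hypothetical default weights; the real matrix has 28 entries.
DEFAULT_WEIGHTS = {"appearance_in_title": 2.0, "word_distance": 1.0}


def rank(hits, query, weights=None):
    """Sort hits by the weighted sum of criterion scores, best first."""
    terms = query.lower().split()
    if not terms:
        return list(hits)
    merged = {**DEFAULT_WEIGHTS, **(weights or {})}

    def score(hit):
        return sum(w * CRITERIA[name](hit, terms) for name, w in merged.items())

    return sorted(hits, key=score, reverse=True)
```

Passing a different `weights` dictionary per query changes the order without touching the index, which is one way to read the statement that users can customize the matrix with no increase in overall complexity.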

3 CONCLUSION AND OUTLOOK

The combination of the technologies mentioned above makes it possible to search thousands of heterogeneous data sources with billions of documents and datasets. Our decentralized peer-to-peer approach overcomes the performance and capacity limitations of centralized data repositories. Ideally, future modules would allow users to rank and comment on search results.

REFERENCES (16 in total)

1. 'Harvester': a fast meta search engine of human protein resources.

Authors: Urban Liebel; Bjoern Kindler; Rainer Pepperkok
Journal: Bioinformatics Date: 2004-02-26 Impact factor: 6.937

2. Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses.

Authors: Matthew E Falagas; Eleni I Pitsouni; George A Malietzis; Georgios Pappas
Journal: FASEB J Date: 2007-09-20 Impact factor: 5.191

3. Next-generation DNA sequencing techniques.

Authors: Wilhelm J Ansorge
Journal: N Biotechnol Date: 2009-02-03 Impact factor: 5.079

4. Applying mass spectrometry-based proteomics to genetics, genomics and network biology.

Authors: Matthias Gstaiger; Ruedi Aebersold
Journal: Nat Rev Genet Date: 2009-09 Impact factor: 53.242

5. Data's shameful neglect.

Authors:
Journal: Nature Date: 2009-09-10 Impact factor: 49.962

6. Sequencing technologies - the next generation.

Authors: Michael L Metzker
Journal: Nat Rev Genet Date: 2009-12-08 Impact factor: 53.242

7. Computational solutions to large-scale data management and analysis.

Authors: Eric E Schadt; Michael D Linderman; Jon Sorenson; Lawrence Lee; Garry P Nolan
Journal: Nat Rev Genet Date: 2010-09 Impact factor: 53.242

8. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project.

Authors: Ewan Birney; John A Stamatoyannopoulos; Anindya Dutta; et al.
Journal: Nature Date: 2007-06-14 Impact factor: 49.962

9. Ensembl 2011.

Authors: Paul Flicek; M Ridwan Amode; Daniel Barrell; Kathryn Beal; Simon Brent; Yuan Chen; Peter Clapham; Guy Coates; Susan Fairley; Stephen Fitzgerald; Leo Gordon; Maurice Hendrix; Thibaut Hourlier; Nathan Johnson; Andreas Kähäri; Damian Keefe; Stephen Keenan; Rhoda Kinsella; Felix Kokocinski; Eugene Kulesha; Pontus Larsson; Ian Longden; William McLaren; Bert Overduin; Bethan Pritchard; Harpreet Singh Riat; Daniel Rios; Graham R S Ritchie; Magali Ruffier; Michael Schuster; Daniel Sobral; Giulietta Spudich; Y Amy Tang; Stephen Trevanion; Jana Vandrovcova; Albert J Vilella; Simon White; Steven P Wilder; Amonida Zadissa; Jorge Zamora; Bronwen L Aken; Ewan Birney; Fiona Cunningham; Ian Dunham; Richard Durbin; Xosé M Fernández-Suarez; Javier Herrero; Tim J P Hubbard; Anne Parker; Glenn Proctor; Jan Vogel; Stephen M J Searle
Journal: Nucleic Acids Res Date: 2010-11-02 Impact factor: 16.971