Literature DB >> 31984253

Crowd-Sourced Chemistry: Considerations for Building a Standardized Database to Improve Omic Analyses.

Jaqueline A Picache¹, Jody C May¹, John A McLean¹.

Abstract

Mass spectrometry (MS) is used in multiple omics disciplines to generate large collections of data. This data enables advancements in biomedical research by providing global profiles of a given system. One of the main barriers to generating these profiles is the inability to accurately annotate omics data, especially small molecules. To complement pre-existing large databases that are not quite complete, research groups devote efforts to generating personal libraries to annotate their data. Scientific progress is impeded during the generation of these personal libraries because the data contained within them is often redundant and/or incompatible with other databases. To overcome these redundancies and incompatibilities, we propose that communal, crowd-sourced databases be curated in a standardized fashion. A small number of groups have shown this model is feasible and successful. While the needs of a specific field will dictate the functionality of a communal database, we discuss some features to consider during database development. Special emphasis is made on standardization of terminology, documentation, format, reference materials, and quality assurance practices. These standardization procedures enable a field to have higher confidence in the quality of the data within a given database. We also discuss the three conceptual pillars of database design as well as how crowd-sourcing is practiced. Generating open-source databases requires front-end effort, but the result is a well curated, high quality data set that all can use. Having a resource such as this fosters collaboration and scientific advancement.

Entities: Gene Species

Year: 2020 PMID： 31984253 PMCID： PMC6977078 DOI： 10.1021/acsomega.9b03708

Source DB: PubMed Journal: ACS Omega ISSN： 2470-1343

Introduction

Mass spectrometry serves as a foundational analytical technology in untargeted omics experiments.[1] In recent decades, MS has enabled the collection of big data in biomedical research. As more data is collected and the age of big data matures, many opportunities arise to gain insightful knowledge about biomedical systems that were not previously accessible.[2] However, many of these opportunities remain unseized due to challenges in annotating omics data, especially in the realm of small molecules.[1,3,4] Waldman and Terzic aptly describe why annotating big data is difficult: “While the goal is to extract insights from complex, noisy, and heterogeneous datasets, barriers have included the speed of data handling, curation and the veracity of the data, the sheer volume of data, and the heterogeneity of data to be integrated.”[5] To overcome such barriers, mass spectrometrists have turned to bioinformatic solutions which include curating data sets and building databases, data libraries, and/or data repositories. While these terms are often used interchangeably, their definitions have nuanced differences as described in Table .[6] Annotation of omics data relies heavily on matches from database queries.[4] Success in the annotation process is contingent upon the quality of the database being queried as well as the amount of unique information known about the omic compound in question. A few prominent, large-scale databases include the Human Metabolome Database, PubChem, and UniProt.[7−9]

Table 1

Data Collection Terms

term	description
data set	a collection of data[6]
database	an organized collection of records that is standardized to enable searching and retrieval of content[6]
data library	a collection of data materials in various formats with the purpose of providing information to a target group[6]
repository	a collection of digital documents stored for preservation and public access[6]

All three of these databases rely on crowd-sourced information. Generally, crowd-sourcing is an active solicitation of content, ideas, or services from a large community. When performed by scientific database curators, crowd-sourcing involves active parsing of the scientific literature to update and addend contents in an automated fashion. This automated crowd-sourcing process is necessary given that there are reports of >290,000 proteins and >25,000 endogenous metabolites in humans.[7,10] While databases such as those previously mentioned provide an important service to the biomedical research community, they remain incomplete, and in some cases, it is challenging to recognize where they are incomplete. As a result, research groups end up developing their own data libraries or databases. Consequences of building personalized libraries and databases include a loss of time and resources due to a redundancy of data acquisition and curation, limited scientific collaboration due to incompatibilities (e.g., informatics, jargon, etc.), and research opacity because raw data is often not referenced or otherwise available.[11] To alleviate these consequences, we propose that field experts build a crowd-sourced database that integrates into successful pre-existing workflows. It should be noted that contributors to the database (i.e., the crowd) will most likely also be field experts. Since database developing is an iterative process, open dialogue between the developer and crowd is encouraged to meet field specific needs. Two examples of successful crowd-sourced databases are the MassBank of North America (MONA) and the Unified Collision Cross Section Compendium. MONA was the first public repository for small molecule mass spectral data.[12] The Unified CCS Compendium is a database of drift-tube ion mobility mass spectrometry data of omic compounds.[13] Here, we discuss a model to create a crowd-sourced omics database including five pillars of database features that need to be considered. Further, we discuss design concepts and how crowd-sourcing is currently done within the research community.

Database Features

A generalized schematic of how an omic database is developed is shown in Figure . Specifically, an initial data set is processed via data curation, standardization, and annotation with metadata. Next, the curated data set is compiled into a database that gets disseminated to others within a field. These end users utilize the information within the database to gain knowledge about their own experiments, which leads to novel scientific conclusions and the formulation of future questions. These new conclusions become newly generated data sets which then undergo dereplication, validation, and standardization to be added to the pre-existing database. Even though this process only contains five general stages, much should be considered along each step. It is recommended that the following features be considered before data acquisition and curation as well as development of a database begins.

Figure 1

General Database Development. Databases start with an initial data set that undergoes standardization. This standardized database is disseminated through the research–peer review cycle. Subsequently, new data is added to the existing database and the cycle begins again.

Standardization Requirements

The overall goal of a database is to create a collection of data that end users can use with as few barriers as possible.[14] One way to minimize barriers that end users will face is to create a standardized system which includes a standard data type, reporting format, terminology, quality control process, metadata inclusion, and/or reference material information, as shown in Figure . Data type refers to what kind of data the database will contain. Will it be data from one specific technology or technique? Will this data be in the primary (raw) or secondary (processed) form? Primary data is preferred for scientific transparency. However, it is often larger and will require more computational storage space and data management resources. Secondary data is more common due to their smaller storage requirements and ease of use. Most end users prefer to look at conclusive or summative data.[15]

Figure 2

Database Features. To maximize the utility of their database, developers should consider the data type, standard terminology, included metadata, reference materials, and data management systems when designing their database. Reporting format and standard terminology must be considered. If a database contains primary data, how will that data be uploaded by the user? The database management system will need the capabilities to handle large data file transfers as well as automated indexing of addended data. Database management systems are discussed further in section . Databases that contain secondary data are easier to manage in terms of indexing and storage needs. However, the database developers need to create a standard format that is both informative and facile enough for end-users to comply with. Database specific terminology must be defined from the beginning so that users understand what is required of them and how to use the data within.[14] Such terminology should unambiguously convey experimental design, data acquisition, and data processing parameters. Furthermore, any information needed to provide a context for the reported results should also be included. This enables other users to fully understand the stated conclusions and compare studies from different research groups. One crucial aspect of databases is that each record within the database needs to have a unique identifier.[14] In metabolomics, this can be a compound’s InChI Key or molecular structure. In genomics, this could be a specific gene locus. This unique identifier enables universal indexing of records without ambiguity and quick data import and export from the database.

Metadata Documentation

Metadata is defined as “minimum information needed to ensure that submitted data are sufficient for clear interpretation and querying by other scientists.”[14] As previously mentioned, inclusion of contextual information as well as experimental procedures is imperative. Providing metadata maximizes a data set’s utility by allowing others to understand, reproduce, and build off of reported work. Database developers should provide guidelines about what type of information is needed for interpretation and querying for a specific field. Alternatively, a database can contain primary references for a given data set such that end users can obtain the metadata elsewhere. The Minimum Information for Biological and Biomedical Investigations (MIBBI) is a useful resource when deciding what metadata should be included.[16] MIBBI contains registries of reporting efforts for biological/biomedical studies as well as field specific recommendations which include the Minimum Information about a Genome Sequence (MIGS) and Minimum Information about a Metagenomic Sequence (MIMS) for genomic and transcriptomic data, the Minimum Information about a Proteomics Experiment (MIAPE) for proteomics data, and the Core Information for Metabolomics Reporting (CIMR) for metabolomics data.[16−19]

Reference Materials

The standardization process goes beyond informatics and reporting. It is recommended that database guidelines include a physical standard such that the reagent can be added as a control in experiments. This standard would serve as a stable reference point for data quality when compared to known experimental values for said reference standard.[14] Having a reference standard that is accessible to a breadth of end users enables data comparisons and quality checks between experiments, across platforms, and between research groups. This is particularly important when used in omics experiments in clinical/diagnostic settings. The reference standard choice is often decided by the users within the field, but standard materials and associated measurements are provided by the National Institute of Standards Technology (https://www.nist.gov/services-resources/standards-and-measurements) in the United States and the Laboratory of the Government Chemist (http://lgc.co.uk) in the United Kingdom.

Quality Assurance

The success of any database is contingent upon its quality assurance (QA). QA for a database is the process that ensures that data and informatic tools within meet a certain standard as dictated by the database design model and specifications.[20] QA is a twofold process: The first is during initial development of the database. It begins with the standardization procedures previously described as well as developing tools that can audit the database intermittently. These audits should ensure that all of the data is represented accurately and as planned, and that all of the database functionality is operating properly.[20] The second process is when new data is added to the database. Procedures should be in place to vet the quality of the incoming data such that it meets the standardization requirements previously set forth.[20] This ensures that the integrity of the database is maintained.

Design Concepts

Conceptual Design

The conceptual design of a database defines the data requirements of and the application of the database in question.[15] This includes the metadata requirements as well as the standardized data type and format as previously discussed. This pillar of the database design is field and data specific. Discussions about the aforementioned database features from section should be included during the conceptual design phase.

Logical Design

The logical design of a database involves the implementation and management systems of the database.[15] It can be thought of as the “back-end” design of a database. Particular attention should be paid to determining how and when data will be normalized and background/noise corrected and which, if any, further transformations will occur. Often, genomic and transcriptomic databases contain primary data that is normalized by the database infrastructure during the data submission process. Proteomic and metabolomic data is usually presented as secondary data that includes the larger context of the experiment performed.[15] Additionally, proteomic and metabolomic data have more variety in potential output, in terms of content and size, when compared to genomic and transcriptomic data. As a result, this type of data is normalized and background subtracted before submission to a database. Developers need to also consider if data sets are to be kept separate or merged. Data sets can be merged to save space and represent a crowd-sourced conclusion. Keeping them separate enables study comparisons. Both options are used in bioinformatics, and the overall aim of a given database will determine which is more suitable. Once data is addended into a database, it needs to be maintained. Several options exist to retain order and search capabilities of a database. For large data sets, SQL Server can be used. It works well with relational data, especially if individual records have many attributes associated with them.[21] SQL data sets can be transformed into the XML data format which is amenable to many informatic solutions and coding languages. For smaller data sets, developers can utilize spreadsheet-based solutions which are easily hosted online and can be transferred via CSV data format. Automated maintenance and quality checks are recommended for both options and should be determined before development begins. However, maintenance is an iterative process and should be adjusted as needed. These tendencies are general and individual developers should choose the appropriate logistical design for their specific type of data.

Physical Design

The physical design of a database involves determining the hardware necessary to support the database as well as the design of any graphical interfaces needed.[15] It can be thought of as the front-end design pillar in which the ease of use by experts in the field is the top priority. Developers should determine if their database will be hosted via an application or online. Additionally, developers should decide if and what to archive (i.e., should outdated results be kept?) as well as design tools that can query live and archived data.[20] Informatic tools such as statistical models and data visualization graphics are also designed during this phase.[15] This state of the design process is the most open-ended, and graphical output can vary widely. Furthermore, it is the most iterative stage, as databases are likely to change depending on their contents and new tools being added. Figure summarizes the three design concepts of planning a database.

Figure 3

Database Design Concepts. The three phases of database design include the conceptual, logical, and physical stages.

Crowd-Sourcing Data

The past decade has seen a push for data sharing and crowd-sourcing research.[11,22−24] The age of big data has matured alongside the ongoing improvements in computational power and data storage capacity. These concurrent movements allow researchers to gather more data than that which they could have collected independently and perform wide-scale studies not previously possible. There are two main crowd-sourcing techniques being used: (1) data-mining from publicly available large data sets and (2) crowd-sourcing data acquisition and/or analysis.[24] The first technique is often used in public health studies where large numbers of data points are needed and through which patient histories are sifted.[2,24] The second type of crowd-sourcing is used in multiple disciplines within biomedical research including computational chemistry, genomics, medicinal chemistry, natural product discovery, pharmacology, proteomics, and toxicology to name a few.[11,22,25−27] Crowd-sourcing data collection can reduce bias in data acquisition and concluded results.[23] On the other hand, crowd-sourcing data analysis increases the transparency of the experiment and results by having multiple groups reach a concordance about the study and its conclusions.[23] In both scenarios, there is an increase in constructive discourse about the results and conclusions of the study due to the egalitarian nature of crowd-sourcing.[23] While crowd-sourcing provides a wealth of information, it comes with conditions that should be considered. It often requires a lot of time, resources, and personnel to maintain large data sets. Additionally, there is less control over experimental conditions and data collection quality.[23] Despite these caveats, crowd-sourcing has still proven to be a powerful technique, especially when combined with machine learning and new bioinformatic strategies.[2,5,24] Companies like Amazon, Google, and IBM have shown the advantages of using machine learning to better understand the habits of their customers. Researchers are doing the same with techniques like self-organizing maps, neural networks, and classification algorithms.[13,28−30] All of these methods require a large data input, and researchers are using crowd-sourced data to perform them. Results may be more informative about the state of a given system when crowd-sourced databases are used, especially when integrated into omics analysis workflows.

Conclusions and Outlook

Biomedical research is moving toward using big data that is often crowd-sourced in order to make more general conclusions of observed phenomena. Using crowd-sourced data has many benefits including a large sample pool, increased transparency throughout the scientific method, and more constructive discourse within a field or project. However, crowd-sourced data has a variable level of quality which can compromise results. By creating databases with crowd-sourced data sets, quality assurance procedures can be put into place. While this process is laborious, the end result is a highly curated database that can be used for the foreseeable future. While there is no one metric of success for a database, one gauge can be how widely disseminated and utilized the database is. The Unified CCS Compendium is an example of a successful database given that it is used by field experts internationally in a variety of studies, both fundamental and biomedical based investigations. This success can be attributed to the Compendium being user-friendly and user-focused. Metadata standards as well as inclusion guidelines are explicitly provided to contributors. Furthermore, a standardized spreadsheet-based reporting format is provided along with guidelines about the quality control process. Further discussion and specific details on these attributes have been previously reported.[13] Two final considerations pertain to (1) funding crowd-sourced databases and (2) practical considerations for the longevity and ongoing maintenance of these databases once they are developed. Resolutions to both of these considerations are ongoing discussions within the informatics community, and there is no one solution. One potential funding resource is collaboration with other research groups, the private sector, and/or a government agency. However, we propose a governmental/private sector alliance to retain the open-access databases post development while accepting contributions from academic researchers. This provides for any practical resources needed to maintain these large databases as well as continued open dialogue between all three sectors. Ultimately, this facilitated collaboration will enable biomedical researchers to take advantage of the opportunities and discoveries that this wealth of data presents.

26 in total

1. Envisioning data sharing for the biocomputing community.

Authors: Enrico Riccardi; Sergio Pantano; Raffaello Potestio
Journal: Interface Focus Date: 2019-04-19 Impact factor: 3.906

2. Perspectives on Data Analysis in Metabolomics: Points of Agreement and Disagreement from the 2018 ASMS Fall Workshop.

Authors: Erin S Baker; Gary J Patti
Journal: J Am Soc Mass Spectrom Date: 2019-08-22 Impact factor: 3.109

3. Robust Chemistry: The Importance of Data and Methods Sharing.

Authors: Mattias Björnmalm; Frank Caruso
Journal: Angew Chem Int Ed Engl Date: 2017-12-01 Impact factor: 15.336

4. Advanced Multidimensional Separations in Mass Spectrometry: Navigating the Big Data Deluge.

Authors: Jody C May; John A McLean
Journal: Annu Rev Anal Chem (Palo Alto Calif) Date: 2016-03-30 Impact factor: 10.745

5. Big Data Transforms Discovery-Utilization Therapeutics Continuum.

Authors: S A Waldman; A Terzic
Journal: Clin Pharmacol Ther Date: 2016-03 Impact factor: 6.875

6. HMDB 4.0: the human metabolome database for 2018.

Authors: David S Wishart; Yannick Djoumbou Feunang; Ana Marcu; An Chi Guo; Kevin Liang; Rosa Vázquez-Fresno; Tanvir Sajed; Daniel Johnson; Carin Li; Naama Karu; Zinat Sayeeda; Elvis Lo; Nazanin Assempour; Mark Berjanskii; Sandeep Singhal; David Arndt; Yonjie Liang; Hasan Badran; Jason Grant; Arnau Serra-Cayuela; Yifeng Liu; Rupa Mandal; Vanessa Neveu; Allison Pon; Craig Knox; Michael Wilson; Claudine Manach; Augustin Scalbert
Journal: Nucleic Acids Res Date: 2018-01-04 Impact factor: 16.971

7. Collision cross section compendium to annotate and predict multi-omic compound identities.

Authors: Jaqueline A Picache; Bailey S Rose; Andrzej Balinski; Katrina L Leaptrot; Stacy D Sherrod; Jody C May; John A McLean
Journal: Chem Sci Date: 2018-11-27 Impact factor: 9.825

8. A draft map of the human proteome.

Authors: Min-Sik Kim; Sneha M Pinto; Derese Getnet; Raja Sekhar Nirujogi; Srikanth S Manda; Raghothama Chaerkady; Anil K Madugundu; Dhanashree S Kelkar; Ruth Isserlin; Shobhit Jain; Joji K Thomas; Babylakshmi Muthusamy; Pamela Leal-Rojas; Praveen Kumar; Nandini A Sahasrabuddhe; Lavanya Balakrishnan; Jayshree Advani; Bijesh George; Santosh Renuse; Lakshmi Dhevi N Selvan; Arun H Patil; Vishalakshi Nanjappa; Aneesha Radhakrishnan; Samarjeet Prasad; Tejaswini Subbannayya; Rajesh Raju; Manish Kumar; Sreelakshmi K Sreenivasamurthy; Arivusudar Marimuthu; Gajanan J Sathe; Sandip Chavan; Keshava K Datta; Yashwanth Subbannayya; Apeksha Sahu; Soujanya D Yelamanchi; Savita Jayaram; Pavithra Rajagopalan; Jyoti Sharma; Krishna R Murthy; Nazia Syed; Renu Goel; Aafaque A Khan; Sartaj Ahmad; Gourav Dey; Keshav Mudgal; Aditi Chatterjee; Tai-Chung Huang; Jun Zhong; Xinyan Wu; Patrick G Shaw; Donald Freed; Muhammad S Zahari; Kanchan K Mukherjee; Subramanian Shankar; Anita Mahadevan; Henry Lam; Christopher J Mitchell; Susarla Krishna Shankar; Parthasarathy Satishchandra; John T Schroeder; Ravi Sirdeshmukh; Anirban Maitra; Steven D Leach; Charles G Drake; Marc K Halushka; T S Keshava Prasad; Ralph H Hruban; Candace L Kerr; Gary D Bader; Christine A Iacobuzio-Donahue; Harsha Gowda; Akhilesh Pandey
Journal: Nature Date: 2014-05-29 Impact factor: 49.962

9. Data sharing in large research consortia: experiences and recommendations from ENGAGE.

Authors: Isabelle Budin-Ljøsne; Julia Isaeva; Bartha Maria Knoppers; Anne Marie Tassé; Huei-yi Shen; Mark I McCarthy; Jennifer R Harris
Journal: Eur J Hum Genet Date: 2013-06-19 Impact factor: 4.246

10. PubChem 2019 update: improved access to chemical data.

Authors: Sunghwan Kim; Jie Chen; Tiejun Cheng; Asta Gindulyte; Jia He; Siqian He; Qingliang Li; Benjamin A Shoemaker; Paul A Thiessen; Bo Yu; Leonid Zaslavsky; Jian Zhang; Evan E Bolton
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

4 in total

1. Combining Isotopologue Workflows and Simultaneous Multidimensional Separations to Detect, Identify, and Validate Metabolites in Untargeted Analyses.

Authors: James N Dodds; Lingjue Wang; Gary J Patti; Erin S Baker
Journal: Anal Chem Date: 2022-01-28 Impact factor: 6.986