Jaqueline A Picache1, Jody C May1, John A McLean1. 1. Department of Chemistry, Center for Innovative Technology, Vanderbilt Institute of Chemical Biology, Vanderbilt Institute for Integrative Biosystems Research and Education, Vanderbilt University, Nashville, Tennessee 37235, United States.
Abstract
Mass spectrometry (MS) is used in multiple omics disciplines to generate large collections of data. This data enables advancements in biomedical research by providing global profiles of a given system. One of the main barriers to generating these profiles is the inability to accurately annotate omics data, especially small molecules. To complement pre-existing large databases that are not quite complete, research groups devote efforts to generating personal libraries to annotate their data. Scientific progress is impeded during the generation of these personal libraries because the data contained within them is often redundant and/or incompatible with other databases. To overcome these redundancies and incompatibilities, we propose that communal, crowd-sourced databases be curated in a standardized fashion. A small number of groups have shown this model is feasible and successful. While the needs of a specific field will dictate the functionality of a communal database, we discuss some features to consider during database development. Special emphasis is made on standardization of terminology, documentation, format, reference materials, and quality assurance practices. These standardization procedures enable a field to have higher confidence in the quality of the data within a given database. We also discuss the three conceptual pillars of database design as well as how crowd-sourcing is practiced. Generating open-source databases requires front-end effort, but the result is a well curated, high quality data set that all can use. Having a resource such as this fosters collaboration and scientific advancement.
Mass spectrometry serves as a foundational analytical technology
in untargeted omics experiments.[1] In recent
decades, MS has enabled the collection of big data in biomedical research.
As more data is collected and the age of big data matures, many opportunities
arise to gain insightful knowledge about biomedical systems that were
not previously accessible.[2] However, many
of these opportunities remain unseized due to challenges in annotating
omics data, especially in the realm of small molecules.[1,3,4] Waldman and Terzic aptly describe
why annotating big data is difficult: “While
the goal is to extract insights from complex, noisy,
and heterogeneous datasets, barriers have included the speed of data
handling, curation and the veracity of the data, the sheer volume
of data, and the heterogeneity of data to be integrated.”[5]
To overcome such barriers,
mass spectrometrists have turned to
bioinformatic solutions which include curating data sets and building
databases, data libraries, and/or data repositories. While these terms
are often used interchangeably, their definitions have nuanced differences
as described in Table 1.[6] Annotation of omics data relies heavily
on matches from database queries.[4] Success
in the annotation process is contingent upon the quality of the database
being queried as well as the amount of unique information known about
the omic compound in question. A few prominent, large-scale databases
include the Human Metabolome Database, PubChem, and UniProt.[7−9]
Table 1. Data Collection Terms
term | description
data set | a collection of data[6]
database | an organized collection of records that is standardized to enable searching and retrieval of content[6]
data library | a collection of data materials in various formats with the purpose of providing information to a target group[6]
repository | a collection of digital documents stored for preservation and public access[6]
All three of these databases rely on crowd-sourced
information.
Generally, crowd-sourcing is an active solicitation of content, ideas,
or services from a large community. When performed by scientific database
curators, crowd-sourcing involves active parsing of the scientific
literature to update and append contents in an automated fashion.
This automated crowd-sourcing process is necessary given that there
are reports of >290,000 proteins and >25,000 endogenous metabolites
in humans.[7,10] While databases such as those previously
mentioned provide an important service to the biomedical research
community, they remain incomplete, and in some cases, it is challenging
to recognize where they are incomplete. As a result, research groups
end up developing their own data libraries or databases. Consequences
of building personalized libraries and databases include a loss of
time and resources due to a redundancy of data acquisition and curation,
limited scientific collaboration due to incompatibilities (e.g., informatics,
jargon, etc.), and research opacity because raw data is often not
referenced or otherwise available.[11] To
alleviate these consequences, we propose that field experts build
a crowd-sourced database that integrates into successful pre-existing
workflows. It should be noted that contributors to the database (i.e.,
the crowd) will most likely also be field experts. Since database
development is an iterative process, open dialogue between the developer
and crowd is encouraged to meet field specific needs. Two examples
of successful crowd-sourced databases are the MassBank of North America
(MONA) and the Unified Collision Cross Section Compendium. MONA was
the first public repository for small molecule mass spectral data.[12] The Unified CCS Compendium is a database of
drift-tube ion mobility mass spectrometry data of omic compounds.[13] Here, we discuss a model to create a crowd-sourced
omics database including five pillars of database features that need
to be considered. Further, we discuss design concepts and how crowd-sourcing
is currently done within the research community.
Database
Features
A generalized schematic of how an omic database is developed is
shown in Figure 1.
Specifically, an initial data set is processed via data curation,
standardization, and annotation with metadata. Next, the curated data
set is compiled into a database that gets disseminated to others within
a field. These end users utilize the information within the database
to gain knowledge about their own experiments, which leads to novel
scientific conclusions and the formulation of future questions. These
new conclusions become newly generated data sets which then undergo
dereplication, validation, and standardization to be added to the
pre-existing database. Even though this process only contains five
general stages, much should be considered along each step. It is recommended
that the following features be considered before data acquisition
and curation as well as development of a database begins.
Figure 1
General Database Development. Databases start with an
initial data
set that undergoes standardization. This standardized database is
disseminated through the research–peer review cycle. Subsequently,
new data is added to the existing database and the cycle begins again.
Standardization Requirements
The overall goal of a
database is to create a collection of data that end users can use
with as few barriers as possible.[14] One
way to minimize barriers that end users will face is to create a standardized
system which includes a standard data type, reporting format, terminology,
quality control process, metadata inclusion, and/or reference material
information, as shown in Figure 2. Data type refers to what kind of data the database
will contain. Will it be data from one specific technology or technique?
Will this data be in the primary (raw) or secondary (processed) form?
Primary data is preferred for scientific transparency. However, it
is often larger and will require more computational storage space
and data management resources. Secondary data is more common due to
its smaller storage requirements and ease of use. Most end users
prefer to look at conclusive or summative data.[15]
Figure 2
Database Features. To maximize the utility of their database, developers
should consider the data type, standard terminology, included metadata,
reference materials, and data management systems when designing their
database.
Reporting format and standard terminology must be considered. If
a database contains primary data, how will that data be uploaded by
the user? The database management system will need the capabilities
to handle large data file transfers as well as automated indexing
of newly added data. Database management systems are discussed further
in the Logical Design section. Databases
that contain secondary data are easier to manage in terms of indexing
and storage needs. However, the database developers need to create
a standard format that is both informative and facile enough for end-users
to comply with. Database specific terminology must be defined from
the beginning so that users understand what is required of them and
how to use the data within.[14] Such terminology
should unambiguously convey experimental design, data acquisition,
and data processing parameters. Furthermore, any information needed
to provide a context for the reported results should also be included.
This enables other users to fully understand the stated conclusions
and compare studies from different research groups. One crucial aspect
of databases is that each record within the database needs to have
a unique identifier.[14] In metabolomics,
this can be a compound’s InChI Key or molecular structure.
In genomics, this could be a specific gene locus. This unique identifier
enables universal indexing of records without ambiguity and quick
data import and export from the database.
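As a minimal sketch of identifier-based indexing, the Python below stores records in a dictionary keyed by InChIKey and refuses malformed or duplicate keys. The function name `add_record`, the record fields, and the example entry are illustrative assumptions rather than part of any existing database; the only rule assumed is the standard 27-character InChIKey layout (blocks of 14, 10, and 1 uppercase letters separated by hyphens).

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def add_record(index, inchikey, record):
    """Add a record under its InChIKey; refuse malformed or duplicate keys."""
    if not INCHIKEY_PATTERN.fullmatch(inchikey):
        raise ValueError(f"malformed InChIKey: {inchikey!r}")
    if inchikey in index:
        raise KeyError(f"duplicate record for {inchikey}")
    index[inchikey] = record


index = {}
# Illustrative entry; the field names are hypothetical.
add_record(index, "RYYVLZVUVIJVGH-UHFFFAOYSA-N", {"name": "caffeine"})
```

Because the key is checked before insertion, lookups, imports, and exports can rely on one unambiguous handle per record.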
Metadata
Documentation
Metadata is defined as “minimum
information needed to ensure that submitted data are sufficient for
clear interpretation and querying by other scientists.”[14] As previously mentioned, inclusion of contextual
information as well as experimental procedures is imperative. Providing
metadata maximizes a data set’s utility by allowing others
to understand, reproduce, and build off of reported work. Database
developers should provide guidelines about what type of information
is needed for interpretation and querying for a specific field. Alternatively,
a database can contain primary references for a given data set such
that end users can obtain the metadata elsewhere. The Minimum Information
for Biological and Biomedical Investigations (MIBBI) is a useful resource
when deciding what metadata should be included.[16] MIBBI contains registries of reporting efforts for biological/biomedical
studies as well as field specific recommendations which include the
Minimum Information about a Genome Sequence (MIGS) and Minimum Information
about a Metagenomic Sequence (MIMS) for genomic and transcriptomic
data, the Minimum Information about a Proteomics Experiment (MIAPE)
for proteomics data, and the Core Information for Metabolomics Reporting
(CIMR) for metabolomics data.[16−19]
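One way such guidelines might be enforced at submission time is a simple completeness check, sketched below in Python. The field names in `REQUIRED_METADATA` are hypothetical placeholders; an actual checklist would be drawn from the relevant field-specific recommendation.

```python
# Hypothetical minimum-information checklist (placeholder field names).
REQUIRED_METADATA = (
    "instrument",
    "acquisition_method",
    "processing_software",
    "primary_reference",
)


def missing_metadata(submission):
    """Return the required metadata fields that are absent or blank."""
    missing = []
    for field in REQUIRED_METADATA:
        value = submission.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Illustrative submission with two of the four fields supplied.
submission = {
    "instrument": "drift-tube IM-MS",
    "acquisition_method": "single-field",
}
print(missing_metadata(submission))
```

A submission would be accepted only when the returned list is empty, which gives contributors an explicit account of what is still needed.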
Reference Materials
The standardization
process goes
beyond informatics and reporting. It is recommended that database
guidelines include a physical standard such that the reagent can be
added as a control in experiments. This standard would serve as a
stable reference point for data quality when compared to known experimental
values for said reference standard.[14] Having
a reference standard that is accessible to a breadth of end users
enables data comparisons and quality checks between experiments, across
platforms, and between research groups. This is particularly important
when used in omics experiments in clinical/diagnostic settings. The
reference standard choice is often decided by the users within the
field, but standard materials and associated measurements are provided
by the National Institute of Standards and Technology (https://www.nist.gov/services-resources/standards-and-measurements) in the United States and the Laboratory of the Government Chemist
(http://lgc.co.uk) in the United
Kingdom.
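The comparison against a reference standard can be sketched as a tolerance check, shown below in Python. The function names, the tolerance, and the numeric values are hypothetical and chosen only to illustrate the idea; an appropriate tolerance would be set by the field.

```python
def percent_deviation(measured, reference):
    """Signed deviation of a measurement from its reference value, in percent."""
    return 100.0 * (measured - reference) / reference


def passes_reference_check(measured, reference, tolerance_percent):
    """True when the control measurement lies within the stated tolerance."""
    return abs(percent_deviation(measured, reference)) <= tolerance_percent


# Hypothetical numbers chosen only to exercise the functions.
print(passes_reference_check(measured=201.0, reference=200.0, tolerance_percent=1.0))
```

Running the same check on every batch would give a consistent quality signal between experiments, platforms, and research groups.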
Quality Assurance
The success of any database is contingent
upon its quality assurance (QA). QA for a database is the process
that ensures that data and informatic tools within meet a certain
standard as dictated by the database design model and specifications.[20] QA is a twofold process: The first is during
initial development of the database. It begins with the standardization
procedures previously described as well as developing tools that can
audit the database intermittently. These audits should ensure that
all of the data is represented accurately and as planned, and that
all of the database functionality is operating properly.[20] The second process is when new data is added
to the database. Procedures should be in place to vet the quality
of the incoming data such that it meets the standardization requirements
previously set forth.[20] This ensures that
the integrity of the database is maintained.
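The two QA stages described above can be sketched with one shared set of rules: `vet_incoming` screens a new record before it is added, and `audit` re-applies the same rules to everything already stored. Both function names, the `id` field, and the rules themselves are assumptions made for illustration; a real database would encode its own standardization requirements.

```python
def vet_incoming(record, existing_ids, required_fields):
    """Return a list of problems found in a new record (empty list = accept)."""
    problems = []
    for field in required_fields:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    if record.get("id") in existing_ids:
        problems.append("duplicate id")
    return problems


def audit(records, required_fields):
    """Re-check every stored record; map record id -> problems found."""
    report = {}
    seen = set()
    for record in records:
        problems = vet_incoming(record, seen, required_fields)
        if problems:
            report[record.get("id")] = problems
        seen.add(record.get("id"))
    return report
```

Sharing the rules between the two stages keeps the intermittent audit and the submission check from drifting apart over time.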
Design Concepts
Conceptual Design
The conceptual design of a database defines the data requirements and the intended application of the database
in question.[15] This includes the metadata
requirements as well as the standardized data type and format as previously
discussed. This pillar of the database design is field and data specific.
Discussions about the aforementioned database features from section should be included
during the conceptual design phase.
Logical Design
The logical design of a database involves
the implementation and management systems of the database.[15] It can be thought of as the “back-end”
design of a database. Particular attention should be paid to determining
how and when data will be normalized and background/noise corrected
and which, if any, further transformations will occur. Often, genomic
and transcriptomic databases contain primary data that is normalized
by the database infrastructure during the data submission process.
Proteomic and metabolomic data is usually presented as secondary data
that includes the larger context of the experiment performed.[15] Additionally, proteomic and metabolomic data
have more variety in potential output, in terms of content and size,
when compared to genomic and transcriptomic data. As a result, this
type of data is normalized and background subtracted before submission
to a database. Developers need to also consider if data sets are to
be kept separate or merged. Data sets can be merged to save space
and represent a crowd-sourced conclusion. Keeping them separate enables
study comparisons. Both options are used in bioinformatics, and the
overall aim of a given database will determine which is more suitable.

Once data is added to a database, it needs to be maintained.
Several options exist to retain order and search capabilities of a
database. For large data sets, SQL Server can be used. It works well
with relational data, especially if individual records have many attributes
associated with them.[21] SQL data sets can
be transformed into the XML data format which is amenable to many
informatic solutions and coding languages. For smaller data sets,
developers can utilize spreadsheet-based solutions which are easily
hosted online and can be transferred via CSV data format. Automated
maintenance and quality checks are recommended for both options and
should be determined before development begins. However, maintenance
is an iterative process and should be adjusted as needed. These tendencies
are general and individual developers should choose the appropriate
logical design for their specific type of data.
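The relational storage and export options described above can be sketched with Python's standard library. Note that this sketch substitutes the built-in `sqlite3` module for SQL Server so that it runs without a database server; the table name, column names, and rows are hypothetical.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# In-memory relational store; sqlite3 stands in for a server-based system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " identifier TEXT PRIMARY KEY,"  # one unique identifier per record
    " name TEXT NOT NULL,"
    " value REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [("ID-0001", "compound A", 1.5), ("ID-0002", "compound B", 2.5)],
)
rows = conn.execute(
    "SELECT identifier, name, value FROM records ORDER BY identifier"
).fetchall()

# CSV export, suitable for spreadsheet-based workflows.
csv_buffer = io.StringIO()
writer = csv.writer(csv_buffer)
writer.writerow(["identifier", "name", "value"])
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# XML export of the same rows.
root = ET.Element("records")
for identifier, name, value in rows:
    node = ET.SubElement(root, "record", identifier=identifier)
    ET.SubElement(node, "name").text = name
    ET.SubElement(node, "value").text = str(value)
xml_text = ET.tostring(root, encoding="unicode")
```

The primary-key constraint makes the database itself reject a second record with the same identifier, which is one form of the automated quality check recommended above.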
Physical Design
The physical design of a database involves
determining the hardware necessary to support the database as well
as the design of any graphical interfaces needed.[15] It can be thought of as the front-end design pillar in
which the ease of use by experts in the field is the top priority.
Developers should determine if their database will be hosted via an
application or online. Additionally, developers should decide if and
what to archive (i.e., should outdated results be kept?) as well as
design tools that can query live and archived data.[20] Informatic tools such as statistical models and data visualization
graphics are also designed during this phase.[15] This stage of the design process is the most open-ended, and graphical
output can vary widely. Furthermore, it is the most iterative stage,
as databases are likely to change depending on their contents and
new tools being added. Figure 3 summarizes the three design concepts of planning a database.
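One possible way to keep outdated results while still serving current ones is to flag superseded entries as archived and make them opt-in at query time. The Python sketch below assumes a hypothetical `Entry` record with a version number and an `archived` flag; it is an illustration, not a description of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    identifier: str
    value: float
    version: int
    archived: bool = False


def query(entries, identifier, include_archived=False):
    """Return matching entries, newest version first; archived ones are opt-in."""
    hits = [
        e for e in entries
        if e.identifier == identifier and (include_archived or not e.archived)
    ]
    return sorted(hits, key=lambda e: e.version, reverse=True)


entries = [
    Entry("ID-0001", 1.4, version=1, archived=True),  # superseded result
    Entry("ID-0001", 1.5, version=2),                 # live result
]
```

By default `query(entries, "ID-0001")` returns only the live entry, while passing `include_archived=True` also returns the superseded one for comparison.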
Figure 3
Database
Design Concepts. The three phases of database design include
the conceptual, logical, and physical stages.
Crowd-Sourcing Data
The past decade has seen
a push for data sharing and crowd-sourcing
research.[11,22−24] The age of big data
has matured alongside the ongoing improvements in computational power
and data storage capacity. These concurrent movements allow researchers
to gather more data than that which they could have collected independently
and perform wide-scale studies not previously possible. There are
two main crowd-sourcing techniques being used: (1) data-mining from
publicly available large data sets and (2) crowd-sourcing data acquisition
and/or analysis.[24] The first technique
is often used in public health studies where large numbers of data
points are needed and through which patient histories are sifted.[2,24] The second type of crowd-sourcing is used in multiple disciplines
within biomedical research including computational chemistry, genomics,
medicinal chemistry, natural product discovery, pharmacology, proteomics,
and toxicology to name a few.[11,22,25−27] Crowd-sourcing data collection can reduce bias in
data acquisition and concluded results.[23] On the other hand, crowd-sourcing data analysis increases the transparency
of the experiment and results by having multiple groups reach a concordance
about the study and its conclusions.[23] In
both scenarios, there is an increase in constructive discourse about
the results and conclusions of the study due to the egalitarian nature
of crowd-sourcing.[23] While crowd-sourcing
provides a wealth of information, it comes with conditions that should
be considered. It often requires a lot of time, resources, and personnel
to maintain large data sets. Additionally, there is less control over
experimental conditions and data collection quality.[23]

Despite these caveats, crowd-sourcing has still proven to be a
powerful technique, especially when combined with machine learning
and new bioinformatic strategies.[2,5,24] Companies like Amazon, Google, and IBM have shown
the advantages of using machine learning to better understand the
habits of their customers. Researchers are doing the same with techniques
like self-organizing maps, neural networks, and classification algorithms.[13,28−30] All of these methods require a large data input,
and researchers are using crowd-sourced data to perform them. Results
may be more informative about the state of a given system when crowd-sourced
databases are used, especially when integrated into omics analysis
workflows.
Conclusions and Outlook
Biomedical
research is moving toward using big data that is often
crowd-sourced in order to make more general conclusions of observed
phenomena. Using crowd-sourced data has many benefits including a
large sample pool, increased transparency throughout the scientific
method, and more constructive discourse within a field or project.
However, crowd-sourced data has a variable level of quality which
can compromise results. By creating databases with crowd-sourced data
sets, quality assurance procedures can be put into place. While this
process is laborious, the end result is a highly curated database
that can be used for the foreseeable future.

While there is
no one metric of success for a database, one gauge
can be how widely disseminated and utilized the database is. The Unified
CCS Compendium is an example of a successful database given that it
is used by field experts internationally in a variety of studies,
both fundamental and biomedical based investigations. This success
can be attributed to the Compendium being user-friendly and user-focused.
Metadata standards as well as inclusion guidelines are explicitly
provided to contributors. Furthermore, a standardized spreadsheet-based
reporting format is provided along with guidelines about the quality
control process. Further discussion and specific details on these
attributes have been previously reported.[13]

Two final considerations pertain to (1) funding crowd-sourced
databases
and (2) practical considerations for the longevity and ongoing maintenance
of these databases once they are developed. Resolutions to both of
these considerations are ongoing discussions within the informatics
community, and there is no one solution. One potential funding resource
is collaboration with other research groups, the private sector, and/or
a government agency. However, we propose a governmental/private sector
alliance to retain the open-access databases post development while
accepting contributions from academic researchers. This provides for
any practical resources needed to maintain these large databases as
well as continued open dialogue between all three sectors. Ultimately,
this facilitated collaboration will enable biomedical researchers
to take advantage of the opportunities and discoveries that this wealth
of data presents.