
PubChem BioAssay: A Decade's Development toward Open High-Throughput Screening Data Sharing.

Yanli Wang1, Tiejun Cheng1, Stephen H Bryant1.   

Abstract

High-throughput screening (HTS) is now routinely conducted for drug discovery by both pharmaceutical companies and screening centers at academic institutions and universities. Rapid advance in assay development, robot automation, and computer technology has led to the generation of terabytes of data in screening laboratories. Despite the technology development toward HTS productivity, fewer efforts were devoted to HTS data integration and sharing. As a result, the huge amount of HTS data was rarely made available to the public. To fill this gap, the PubChem BioAssay database ( https://www.ncbi.nlm.nih.gov/pcassay/ ) was set up in 2004 to provide open access to the screening results tested on chemicals and RNAi reagents. With more than 10 years' development and contributions from the community, PubChem has now become the largest public repository for chemical structures and biological data, which provides an information platform to worldwide researchers supporting drug development, medicinal chemistry study, and chemical biology research. This work presents a review of the HTS data content in the PubChem BioAssay database and the progress of data deposition to stimulate knowledge discovery and data sharing. It also provides a description of the database's data standard and basic utilities facilitating information access and use for new users.

Keywords:  PubChem BioAssay; data sharing; high-throughput screening; open access

Year:  2017        PMID: 28346087      PMCID: PMC5480605          DOI: 10.1177/2472555216685069

Source DB:  PubMed          Journal:  SLAS Discov        ISSN: 2472-5552            Impact factor:   3.341


Introduction

High-throughput screening (HTS) is a key technology for drug discovery that allows researchers to test hundreds of thousands of samples.[1] Screening of diverse libraries of small molecules has proven to be an essential method for identifying chemical starting points for early-stage drug discovery. Functional genomic screening is now often performed using an RNAi reagent library tailored toward a whole genome to identify genes critical to a biological process under study, answering fundamental biological questions and discovering novel therapeutic targets.[2] Despite being a relatively recent innovation, HTS technology is increasingly empowered by advances in many scientific and technical fields, such as instrumental automation, combinatorial chemical synthesis, and assay technology. It has also been spurred by the breakthrough in biological and genomic research for raising hypothesis and suggesting molecular targets as stimulated by the sequencing of the human genome.[3] In addition to the technical advances, new and exciting trends are emerging. One such trend is the growing HTS capacity in academic settings,[4,5] which used to be dominated by industry. As of 2016, more than 100 screening facilities at universities and academic institutions are registered at the Society for Laboratory Automation and Screening (SLAS; http://www.slas.org/resources/information/academic-screening-facilities/). Another trend is the call, support, and implementation for HTS data sharing,[6] which is widely accepted to be essential for research verification, data reuse, and knowledge discovery. Despite the expanded screening facilities in industry and academia, HTS data sharing was largely lacking. Fortunately, this situation started to change in 2003, given a breakthrough in the open-access movement toward enhancing public access to biomedical research supported by taxpayers. 
The open-access efforts were led by funding agencies and journal publishers, which took steps[7-9] to mandate the deposition of manuscripts in PubMed Central (PMC; https://publicaccess.nih.gov/policy.htm) and of research data in public repositories (https://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm). The PubChem project (https://pubchem.ncbi.nlm.nih.gov/)[10] started in 2004 at the National Center for Biotechnology Information (NCBI) in response to the open-access mandate. The PubChem BioAssay database was set up initially to archive the small-molecule HTS data from the National Institutes of Health’s (NIH) Molecular Libraries Program (MLP), which funded a U.S.-wide screening center network between 2004 and 2013 targeting chemical probe development.[11] It grew tremendously over the past decade in both data capacity and utility,[12] with assay data contributed by more than 80 organizations and research laboratories. In addition to participating in the MLP, PubChem collaborates with other initiatives funded by U.S. government agencies, the European Bioinformatics Institute (EBI), international functional genomics research consortia, pharmaceutical companies, and journal publishers. For instance, PubChem exchanges small-molecule bioactivity data with ChEMBL,[13] a chemical biology database hosted by EBI and based primarily on literature curation. PubChem also collaborates with several other chemical biology curation databases, such as Guide to PHARMACOLOGY,[14] BindingDB,[15] and PDBbind.[16] Another exemplary collaboration was with the RNAi Global Initiative, which put PubChem in contact with research groups conducting RNAi screening in the United States and the European biomedical community. This collaboration led to further development of the BioAssay data model and to more than 100 large RNAi datasets in PubChem associated with recent publications. 
Importantly, many of the RNAi datasets, as well as several small-molecule datasets, were submitted to PubChem as required by journals, representing an excellent demonstration of collective efforts from the funding agencies, screening community, and journal publishers to support open access and HTS data sharing. Warehousing the big HTS data with a great diversity of assay protocols and making them easily accessible to the public present a big challenge to the development of PubChem. It requires continuous development regarding archival capacity, data model flexibility, and search and analysis utilities to meet the evolving and changing needs from the community.[17-21] These development efforts were greatly acknowledged, as demonstrated by a recent comprehensive review on the community’s use of the PubChem resources,[12] which was based on more than a thousand research papers published before 2014 by worldwide researchers telling how the PubChem resource was used in support of their research. The review work showed that the large collection of bioactivity data and molecular target information in PubChem BioAssay had greatly facilitated a number of research areas, such as validating compound bioactivity and target, generating bioactivity profile, virtual screening, polypharmacology research, and drug repositioning. Additionally, it is interesting that a significant number of informatics resources and tools were developed by the community to analyze or annotate the PubChem data, as summarized in the supplementary data of that work.[12] The PubChem BioAssay resource continues to be explored by the community, as shown by the growing number of citations for the PubChem resource. Interesting and insightful work using PubChem BioAssay is more likely to be found by searching in PubMed or PMC with the keywords “PubChem BioAssay” or otherwise simply with “PubChem” in general. 
While most of the applications of PubChem BioAssay paid extensive attention to the bioactive compounds and associated targets, the compounds consistently reported with inactive results across the assay collection in PubChem were recently explored for developing good starting point chemicals with unique activity and clean safety profiles,[22] which illustrated the benefit of archiving inactive data in the public repository. As another example, Helal et al. developed the HTS fingerprints (PubChem HTSFPs) by taking advantage of a large compound library that was tested in hundreds of assays deposited in PubChem BioAssay across a wide panel of targets.[23] The growth of research development built in part on the PubChem BioAssay resource clearly showed researchers’ recognition of the resource and enthusiasm in data mining and knowledge discovery. Kim et al. recently reported the utilization of the HTS toxicity data in PubChem BioAssay for exploring mechanism profiling of hepatotoxicity.[24] Additionally, several previous studies using PubChem BioAssay for toxicity prediction were reviewed by Zhu et al.[25,26] The applications of the PubChem BioAssay data in supporting virtual screening against several biologically critical targets were recently reviewed.[27] While the reviews showed appreciation for PubChem’s effort, it should be emphasized that the success of PubChem should also be attributed to the community for making this possible by using the resource, providing feedback, and more importantly, sharing research data. This work focuses on the development of PubChem BioAssay regarding HTS data collection. The work reviews the community contribution for data sharing and the progress of data deposition, and summarizes the HTS data content in PubChem BioAssay to illustrate the need for efforts from both PubChem and the community toward building a stronger biomedical information resource. 
In particular, data generated by the multiple stages of the MLP are analyzed to facilitate access to the chemical probe developments funded by NIH. A brief description about data access and submission is also provided to familiarize new users and depositors with this information resource.
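The keyword search suggested above for finding work that uses PubChem BioAssay can also be scripted. The following is a minimal sketch, not part of the original article: it only builds a query URL for the NCBI E-utilities esearch endpoint and sends no request. The endpoint choice, the function name esearch_url, and the default of 20 returned identifiers are assumptions made for illustration.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint. Using it here is an assumption of this
# sketch; the article itself only mentions keyword searches in PubMed or PMC.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term: str, db: str = "pubmed", retmax: int = 20) -> str:
    """Build (but do not send) an esearch query URL for the given keywords."""
    if db not in ("pubmed", "pmc"):
        raise ValueError("this sketch only covers the two databases named in the text")
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{ESEARCH}?{query}"


if __name__ == "__main__":
    # Quoted phrase search in PubMed, as suggested in the text.
    print(esearch_url('"PubChem BioAssay"'))
    # Broader keyword search in PMC.
    print(esearch_url("PubChem", db="pmc"))
```

The returned URL can be opened in a browser or passed to any HTTP client; the response format and usage limits are governed by the E-utilities documentation rather than by anything stated in this review.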

HTS Data Content

PubChem BioAssay currently contains 1 million bioassay records, 30,000 protein and gene targets, 3 million tested substances, 2 million unique chemical structures, and 200 million bioactivity outcomes. Additional statistics can be found in Table 1. More than 95% of the data content in PubChem BioAssay is contributed by HTS projects of small molecules (Tables 2 and 3) or RNAi reagents (Table 4) from dozens of screening facilities worldwide at universities, academic institutions, and pharmaceutical companies. Initially, the majority of the HTS data in PubChem was submitted by specialist informatics staff at screening centers. More recently, however, “wet laboratory” researchers have started to submit their data to PubChem themselves, in response to the open-access requirements of journal publishers and funding agencies. A few of the HTS data contributors are described below to illustrate the efforts and progress made by the community toward data sharing. The entire list of assay depositors can be found at https://pubchem.ncbi.nlm.nih.gov/sources/.
Table 1.

PubChem BioAssay Statistics (as of October 10, 2016).

Description | Small-Molecule Assays | RNAi Assays
Assay records (AIDs) | 1,218,601 | 91
Substance samples (SIDs) | 3,224,025 | 352,044
Chemical structures (CIDs) | 2,283,536 |
Bioactivity outcomes | 230,270,094 | 1,033,519
Data points | 1,499,625,480 | 14,598,030
Species | 3,543 | 7
Protein targets | 10,182 |
Protein targets (human) | 4,784 |
Gene targets | | 55,714
Gene targets (human) | | 24,888
Gene targets with phenotype | | 15,866
Table 2.

Summary of MLP’s HTS Assay Projects.

Column groups: Assay Count[a], Compound Count[b]
Screening Center | Summary | Primary | Confirmatory | Tested | Active | Chemical Probe | Protein Target Count
Broad Institute103136950500,665129,54727233
Burnham Center for Chemical Genomics102206651419,794143,20036450
Columbia University Molecular Screening Center1910197,0929,0679
Emory University Molecular Libraries Screening Center22229348,78024,32620
Johns Hopkins Ion Channel Center25103106345,28137,359423
Molecular Libraries Program, Specialized Chemistry Center, University of Kansas2222,94131210
NIH Chemical Genomics Center (NCGC)17936976443,829244,06435255
New Mexico Molecular Libraries Screening Center (NMMLSC)30167206375,90140,5491569
Penn Center for Molecular Discovery (PCMD)2631224,3774,42416
Southern Research Specialized Biocontainment Screening Center141272355,23816,350511
Southern Research Molecular Libraries Screening Center (SRMLSC)14740224,57131,718211
Scripps Research Institute Molecular Screening Center150468703397,994136,87654574
University of Pittsburgh Molecular Library Screening Center13248223,27725,711116
Vanderbilt Screening Center for GPCRs, Ion Channels and Transporters131573222,81220,078694
Vanderbilt Specialized Chemistry Center10141251,75068363132

[a] AID count.
[b] CID count.

Table 3.

Summary of Small-Molecule HTS Screens (Excluding MLP).

Data Source | Assay Count | Compound Count: Tested[a] | Compound Count: Active | Protein Target Count
Abbott Labs27,5674,912
ChemBank1065,2011,629
Chemical genetic matrix213,0481,568
Cheminformatics & Chemogenomics Research Group (CCRG), Indiana University School of Informatics362,500970
Chen Lab, School of Medicine, Emory University11,947151
Circadian Research, Kay Laboratory, University of California at San Diego (UCSD)21,27615
UCLA Molecular Screening Shared Resource11,3855
NCI’s Developmental Therapeutics Program (DTP/NCI)173176,92925,036
GlaxoSmithKline (GSK)1514,03814,0382
Genomics Institute of the Novartis Research Foundation (GNF)/Scripps Winzeler Lab15,662274
Gregory J. Crowther613,4512276
ICCB–Longwood/NSRB Screening Facility, Harvard Medical School28528,89310,42615
Meiler Lab, Vanderbilt University1011,3853,2594
Milwaukee Institute for Drug Discovery1317,8081,2511
NCI’s Molecular Targets Development Program (MTDP)499,8588614
NINDS Approved Drug Screening Program341,033190
NIMH’s Psychoactive Drug Screening Program (PDSP)22,7306032
Southern Research Institute10361,1474,8714
Tox211058,7474,66120
UW Madison, Small Molecule Screening Facility169,794380
ChEMBL::Novartis Malaria Screening65,6145,014
ChEMBL::St. Jude Malaria Screening161,523

[a] Only HTS screens testing more than 1,000 samples are included.

Table 4.

Summary of RNAi HTS Projects.

Data Source | Assay Count | RNAi Reagent Count: Tested | RNAi Reagent Count: Show Phenotype | Gene Target Count
Cancer Research UK Cambridge Research Institute133133197
Department of Molecular Cell Biology, Weizmann Institute of Science1858520
Drosophila RNAi Screening Center (DRSC)3731,35614,2763,894
GE Healthcare Dharmacon RNAi Technologies18408405
Iain Fraser141,512252239
InfectX Consortium1115,37218,612
INSERM, Institut National de la Sante et de la Recherche Medicale222,950
Peterson Lab, Genentech115815733
siGENOME Human KINOME Library (BTR reporter screen)171471349
Genomics Institute of the Novartis Research Foundation (GNF)133,36417,453268
Victorian Centre for Functional Genomics, Peter MacCallum Cancer Centre1239,16034,6193,690
VTT Technical Research Centre of Finland (CSMA)11,380660422
As the first depositor to PubChem BioAssay, the Developmental Therapeutics Program at the National Cancer Institute (DTP/NCI)[28] shared its anticancer drug screening data on human tumor cell lines, yeast, and mouse models before PubChem made its first public release in 2004. This contribution greatly helped PubChem in setting up its initial infrastructure and data processing pipeline. The pioneering work of DTP/NCI was followed by more than a dozen screening centers within the MLP,[11] the NIH initiative aimed at developing small-molecule chemical probes for studying the functions of a broad range of proteins and genes. A network of screening facilities at universities and research institutes across the United States, most of which are also listed at SLAS, was funded through two phases of the MLP: the Molecular Libraries Screening Centers Network (MLSCN) and the Molecular Libraries Probe Production Centers Network (MLPCN). Although it has now ended, the MLP is still by far the largest HTS data contributor to the PubChem BioAssay database. To facilitate the community’s use of the research data generated by the 10-year-long HTS campaign, HTS data generated from the multiple stages of the MLP are summarized in the “MLP’s HTS Data” section. The Tox21 program (https://www.epa.gov/chemical-research/toxicology-testing-21st-century-tox21), a collaboration between NIH, the Environmental Protection Agency (EPA), and the Food and Drug Administration (FDA), has had more than 100 datasets from about 30 HTS projects deposited in PubChem BioAssay since 2012. 
The program tests a library of 10,000 compounds covering a broad range of chemicals found in industrial processes, consumer products, food additives, and human and veterinary drugs. It aims to provide evaluation of the chemicals collected regarding their potential and extent for disrupting biological processes in the human body that may lead to adverse health effects.[29-31] The data generated by the program contains rich information for toxicity evaluation. Novel agonists and antagonists were identified for various biological pathways, such as the retinoic acid receptor (RAR) signaling pathway, NFkB signaling pathway, and endoplasmic reticulum stress response signaling pathway. The Tox21 datasets provide a great opportunity for a comprehensive evaluation of the collected chemicals via the bioactivity and toxicity profile, given a common library that was tested in various pathways similarly to the capacity enabled by the MLP, as discussed later. The ICCB–Longwood Screening Facility at the Harvard Medical School has led the way in the academic sector supporting HTS data sharing.[32] It has remained an active PubChem contributor since 2010 and has deposited data from about 30 HTS projects, which cover a wide range of biological targets, as published in recent years. Datasets from several legacy screening programs supported by NIH, such as the NINDS Approved Drug Screening Program, also found PubChem as their home once the program was finalized. The open-access calling was also applauded by pharmaceutical companies. 
As an example, GlaxoSmithKline (GSK) contributed its antimalaria drug screening data to PubChem early in 2010, and data from another inhibition activity against kinetoplastid parasites, including Leishmania donovani, Trypanosoma cruzi, and Trypanosoma brucei, in 2015.[33] It is worth noting that a few datasets associated with recent publications were submitted to PubChem lately by researchers who were making the submissions either to meet the open-access requirement by journals or to support data sharing as a voluntary effort. These datasets cover a study reporting inhibitors against human phosphogluconate dehydrogenase (6PGD) published in Nature Cell Biology,[34] a screen of more than 10,000 compounds against five kinases from Plasmodium falciparum published in PLoS One, and a research paper published in the Journal of Biomolecular Screening reporting an HTS strategy for identifying inhibitors of protein–protein interactions with a library of 60,000 compounds.[35] The RNAi Global Initiative Consortium (http://www.rnaiglobal.org/) pioneered the effort of sharing RNAi research via the PubChem system by depositing a viability screen of human kinase and cell cycle genes in 2009. The second milestone was set by the Drosophila RNAi Screening Center (DRSC),[36] a member of the above consortium, which made its first submission in 2011 and since then has remained the largest contributor of RNAi data, with nearly 40 RNAi datasets deposited in PubChem BioAssay. Many of these datasets are primarily associated with publications in prestigious journals such as Nature, Science, Proceedings of the National Academy of Sciences of the United States of America, and Nature Genetics. The exemplary role of DRSC was quickly followed by others. 
The Victorian Centre for Functional Genomics at the Peter MacCallum Cancer Centre, also a member of the RNAi Global Initiative Consortium, joined forces and has contributed about a dozen datasets starting in 2014, mostly associated with publications in open-access journals.[37] Among the development for RNAi data sharing, the third exciting milestone was the deposition of a siRNA circadian assay by researchers at the Genomics Institute of the Novartis Research Foundation (GNF) in 2009.[38] That submission was made in response to the journal Cell’s recommendation of open access to the dataset, which was the first RNAi data deposition in PubChem by researchers in the course of the publication process. This initial step in response to the request by Cell has been followed by other international peer-reviewed journals and researchers complying with open-access policies. As a result, about 40 RNAi datasets have been submitted to PubChem, including several genome-wide screens. These datasets are primarily associated with publications () in journals promoting the sharing of valuable scientific datasets, such as Science Signaling, a weekly journal by the American Association for the Advancement of Science, and Scientific Data, an open-access journal from the Nature Publishing Group. Although the RNAi datasets are small in volume compared with the small-molecule datasets in PubChem, such submissions for functional genomic studies by far surpass the efforts from the chemical biology and medicinal chemistry research community with respect to early engagement, continuity, scale, and journal coverage. The greater response to RNAi data sharing by the community is not surprising given the historical and steady contributions from biologists to the growth of biological and genomic public databases, such as GenBank, GEO, and Expression Atlas. 
To further encourage and ease RNAi data submission, PubChem coordinates with vendors of siRNA reagents, such as GE Healthcare Dharmacon RNAi Technologies, Qiagen, Life Technologies, Applied Biosystems, and Ambion, for registering their catalogs in PubChem so that assay data can be referenced with the RNAi products. This effort enabled across-assay comparison for an RNAi sample, which is critical for identifying and confirming gene functionality and evaluating off-target effects of the reagent. As an example, the product M-012023-02 from GE Healthcare Dharmacon RNAi Technologies is shown to have been tested in five assays deposited in PubChem with data associated with five publications retrievable using the tool at https://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.html?sid=152150429. PubMed links are provided by the tool via the PubMed icon, which can be followed to further retrieve all the samples and assay data reported in an article. The RNAi vendors could take a further step to link to such PubChem tools from the catalog at the vendor’s website to validate products, aggregate research data, and promote data sharing. Functional genomics plays a crucial role for understanding the dynamic properties of an organism at the cellular level, which is complementary to chemical genomics for drug development by deciphering the responsible biological pathways for a given disease status and suggesting novel drug targets. Whole-genome-based high-throughput RNAi screening is able to rapidly examine each gene in a genome for its potential effect on the phenotype of interest. Having access to both small-molecule and genome-wide screening allows the data integration from both research disciplines and helps to bring together genomic scientists, chemical biologists, and medicinal chemists to synergize discovery efforts. 
These joint efforts are critical for exploring biological and chemical space effectively, accelerating the identification and validation of drug targets, and understanding the mode of action of a small molecule, as exemplified by the work of Sundaramurthy et al.[39] While some screening facilities possess both small-molecule and RNAi screening capabilities and others do not, PubChem BioAssay collects, archives, and integrates both types of research results and enables simultaneous access to functional genomic and chemical genomic HTS data to stimulate cross-disciplinary discovery. Using tools provided at PubChem, one can aggregate the RNAi results for a gene that suggest its potential cellular function and, at the same time, access small-molecule bioactivity information to search for drugs, chemical probes, agonists, and antagonists that target the same gene (e.g., https://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.html?geneid=659). Starting from a drug molecule, one can combine, compare, and analyze its bioactivity against various protein targets (e.g., https://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.html?cid=9809715) for drug repositioning. Both small-molecule and RNAi HTS data can be browsed using the PubChem BioAssay Classification Tree (https://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?p=classification). For example, as shown in Figure 1, one may click to expand the “HTS Projects” node in the tree and then access RNAi data or link to the more than 6,000 datasets from the MLP by clicking on the count. One may further explore the subtree nodes to browse HTS projects from a particular data source, such as the RNAi data deposited by DRSC under “RNAi HTS,” or the small-molecule data from the Tox21 program under “Small-molecule HTS.”
Figure 1.

Browse HTS projects using the PubChem BioAssay Classification Tree. A subtree node can be expanded by clicking on the triangle icon. The count of BioAssay records associated with each node is shown and is clickable, linking to the corresponding list of BioAssay records in Entrez.
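The example links to the bioactivity tool quoted in this section (by sample, by compound, and by gene) share one URL pattern: the tool address followed by a single query parameter. A minimal Python sketch of that pattern follows, assuming only what those example URLs show. The helper name bioactivity_url is hypothetical, and whether the tool still accepts these parameters is not verified here.

```python
from urllib.parse import urlencode

# Base address copied from the example links in the text.
BIOACTIVITY_TOOL = "https://pubchem.ncbi.nlm.nih.gov/assay/bioactivity.html"

# Query keys that appear in the text's examples:
# sid (substance sample), cid (chemical structure), geneid (gene).
ALLOWED_KEYS = ("sid", "cid", "geneid")


def bioactivity_url(key: str, identifier: int) -> str:
    """Hypothetical helper: build a bioactivity-tool link for one identifier."""
    if key not in ALLOWED_KEYS:
        raise ValueError(f"unsupported key {key!r}; expected one of {ALLOWED_KEYS}")
    return f"{BIOACTIVITY_TOOL}?{urlencode({key: identifier})}"


if __name__ == "__main__":
    # The three identifiers below are the ones used as examples in the text.
    print(bioactivity_url("sid", 152150429))  # RNAi reagent sample
    print(bioactivity_url("geneid", 659))     # gene-centered view
    print(bioactivity_url("cid", 9809715))    # compound-centered view
```

Restricting the key to the three parameters seen in the text keeps the sketch from implying support for query options the article does not mention.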


MLP’s HTS Data

Between fiscal years 2005 and 2014, the MLP carried out more than 600 assay projects and yielded more than 300 chemical probes, along with 6,000 datasets deposited in PubChem BioAssay. Compounds from the NIH’s Molecular Libraries Small Molecule Repository (MLSMR, https://mlsmr.evotec.com/MLSMR_HomePage) were screened across the network of projects. The MLSMR grew from 60,000 small molecules in 2005 to 350,000 by the end of 2009. The library collected compounds from four classes: specialty sets with known bioactivities, such as drugs, toxins, and metabolites; natural products; targeted libraries with bioactive compounds for proteases, kinases, G protein–coupled receptors (GPCRs), and so forth; and diversity compounds, each associated with as many as four close analogs. The structural complexity and diversity of the small molecules in the MLSMR library were analyzed in several studies.[40,41] At the closure of the MLP, 380,000 of the 400,000 compounds in the MLSMR had been tested in multiple assays and associated with biological data in PubChem. Figure 2a summarizes the association with biological data in PubChem for the compounds in the MLSMR; two-thirds of the small-molecule samples in the MLSMR are reported in more than 400 PubChem BioAssay submissions (AIDs). Figure 2b shows a similar summary but counts only the active biological test results, providing an estimate of compounds’ “promiscuity” by gathering all active assay data from the MLP. The public availability of this large-scale screening campaign using a common library enabled the generation of biological activity profiles and the systematic investigation of biological target space for a large and diverse chemical collection.
Figure 2.

Summary of compounds in the MLSMR library that are associated with biological data. The x axis provides a count of BioAssay accessions (AIDs); the y axis provides the percentage of the substance samples in MLSMR that are tested across multiple assays at a given count of AIDs. x axis for (a) counts of all tested assays and for (b) counts of only active assays.
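The quantity plotted in Figure 2 can be read as a cumulative share: for a given AID count on the x axis, the percentage of library samples reported in more than that many assays. The sketch below illustrates that reading under stated assumptions. The function name and the per-sample assay counts are invented for illustration and are not taken from PubChem; the "more than" comparison follows the wording of the text ("in more than 400 PubChem BioAssay submissions").

```python
from typing import Sequence


def percent_tested_in_more_than(aid_counts: Sequence[int], threshold: int) -> float:
    """Percentage of samples whose assay (AID) count exceeds the threshold.

    aid_counts holds one integer per substance sample: the number of
    BioAssay records in which that sample appears. For panel (b) of the
    figure, the same function would be fed counts of active results only.
    """
    if not aid_counts:
        raise ValueError("aid_counts must not be empty")
    above = sum(1 for count in aid_counts if count > threshold)
    return 100.0 * above / len(aid_counts)


if __name__ == "__main__":
    # Hypothetical per-sample AID counts, for illustration only.
    example_counts = [450, 500, 120, 410, 30, 600]
    share = percent_tested_in_more_than(example_counts, 400)
    print(f"{share:.1f}% of samples appear in more than 400 assays")
```

Evaluating the function over a range of thresholds would trace out a curve of the kind the caption describes, one point per AID count.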

The growth of the HTS data from the MLP is shown in Figure 3, including datasets, tested samples, unique chemical structures, bioactivity outcomes, data points, assay targets, and species. More than 4,000 of the 6,000 MLP datasets contain a biological target specification, while those without molecular target data were either cell based or organism based. Most of the MLP projects started with primary screens using the whole MLSMR library or the subset of it available at the time of testing. These screens were then followed by multiple dose–response assays for hit confirmation, as well as counterscreens monitoring aspects such as solubility, cytotoxicity, target selectivity, and artifacts. Selectivity screens were often performed against biologically related targets, while toxicity screens were conducted with multiple cell lines. Counterscreens using various assay detection methods were provided as well to rule out false positives. Solubility profiles were generated for the common compound library. Because the MLP required immediate data deposition, an HTS assay project was often associated with multiple assay submissions as the project advanced and new data were generated. An MLP assay project is represented by a summary AID in the PubChem BioAssay database, which links all the related datasets deposited over time to report the stages of the project, as shown in . Datasets within an assay project may be designated as a group, which is important for interpreting the ultimate outcomes. 
Such a group of datasets can be accessed from any single assay record within the group through the “Same-Project BioAssays” section on the BioAssay record page (). Users are strongly encouraged to combine the dataset group information with other types of related assays for data analysis. A summary of the MLP assay projects and their outcomes is provided in Table 2. The summary is given for each screening center that participated in the network and shows a wide range in the number of assay projects (represented by summary AIDs in Table 2) carried out by the screening centers. The highest productivity was seen for four centers: the Broad Institute, the National Center for Advancing Translational Sciences (NCATS) (formerly NCGC), the Burnham Center for Chemical Genomics, and the Scripps Research Institute Molecular Screening Center. This in part reflects the funding mechanism, under which these four screening centers were selected as the “comprehensive centers” in the MLPCN phase to conduct larger-scale HTS projects covering broader research areas.
Figure 3.

Growth of the MLP’s HTS data, including BioAssay records, tested substances, unique chemical structures, bioactivity outcomes, data points, protein targets, and species.

The hit rates of the primary screens testing more than 100,000 small-molecule samples are given for each MLP center in Figure 4. While the plot shows similar distributions for the four comprehensive screening centers, it is interesting to note the relatively higher hit rates of several specialized centers, such as the Vanderbilt Screening Center for GPCRs, Ion Channels and Transporters, and the Southern Research Molecular Libraries Screening Center (SRMLSC). This is presumably because these specialized centers focused on specific research areas and tended to screen targeted compound libraries. For example, the Vanderbilt Screening Center for GPCRs, Ion Channels and Transporters provided only 15 primary screening datasets to PubChem BioAssay, each with no more than 120,000 small-molecule samples; that is, at most a third of the MLSMR was screened by this center.
Figure 4.

Hit rates for MLP centers. The red dot shows the median of hit rates for each center. Only primary assays that screened more than 100,000 substance samples were included.
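
The per-center summary described in this caption can be illustrated with a minimal Python sketch. It assumes that a hit rate is the number of active samples divided by the number of tested samples, which the text does not define explicitly; the function names and the tuple layout are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import median

# Threshold taken from the figure caption: only primary assays that
# screened more than 100,000 substance samples are included.
MIN_SAMPLES = 100_000


def hit_rate(n_active, n_tested):
    """Fraction of tested samples reported as active (assumed definition)."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    return n_active / n_tested


def median_hit_rate_by_center(screens, min_samples=MIN_SAMPLES):
    """Return {center: median hit rate} over sufficiently large screens.

    `screens` is an iterable of (center, n_active, n_tested) tuples,
    one tuple per primary screening dataset (hypothetical layout).
    """
    rates = defaultdict(list)
    for center, n_active, n_tested in screens:
        # Smaller screens are skipped, mirroring the caption's filter.
        if n_tested > min_samples:
            rates[center].append(hit_rate(n_active, n_tested))
    return {center: median(values) for center, values in rates.items()}
```

Calling `median_hit_rate_by_center` on a list of per-dataset counts yields one median value per center, which corresponds to the red dot described in the caption. The counts themselves would have to come from the deposited assay records.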

The MLP has shown great productivity and diversity in its coverage of assay targets and species and in its yield of chemical probes, especially in the MLPCN phase, as shown in and . A total of 931 unique protein targets, including the primary targets and the biologically related ones in selectivity counterscreens, were tested by the MLP, covering a broad range of protein classes such as enzymes, membrane receptors, and ion channels. The numbers of available datasets and chemical probes developed for each protein target class are shown in for the entire MLP period, as well as partitioned by the MLSCN versus MLPCN phases. The 20 most studied species (of 134 in total) are given in . More information regarding access to the MLP assay projects, datasets, and chemical probe structures is summarized in , which offers tips on Entrez search queries for identifying specific information about the resources generated by the MLP. During the course of chemical probe development, a number of key assay technologies were developed by the MLP. Some of them were recently reviewed, together with a summary of the MLP’s chemical probes,[42] while others, including the quantitative high-throughput screening (qHTS) technology, which had been employed in almost all the screens carried out by the NIH Chemical Genomics Center (NCGC), were previously described.[43-45]
Figure 5.

A summary of the MLP assay records (AID count) and chemical probes (probe count) among classes of assay targets. The number of assay records at the two phases of MLP are indicated by MLSCN and MLPCN; the count of chemical probes is indicated by “Probe.”


Data Presentation, Access, and Submission

PubChem BioAssay implements a one-stop data model with the flexibility needed to accommodate data diversity. An assay record is presented in two parts: metadata and assay results. The metadata section describes the essential information for an assay, including protocol, molecular target, and cross-references, and the assay result section reports experimental data linked to the tested samples registered in the PubChem Substance database. The PubChem BioAssay data model allows as many readouts to be reported as needed. At the same time, it requires a “summary result” for each tested substance sample, consisting of a bioactivity outcome (e.g., active vs. inactive) and an activity score for ranking hits in a screen. For dose–response tests, PubChem requires a readout of active concentration, such as IC50 or EC50 (in micromolar units), as part of the summary result for small-molecule data. For RNAi data, the gene target of an RNAi reagent must be specified. This data standard allows the development of computer tools for data integration and comparison across assays, compounds, targets, and cell lines.[19-21]

As with the MLP datasets, datasets relating to the same assay project can generally be submitted separately to PubChem, making it critical to combine all related datasets for the best interpretation of the underlying data. PubChem BioAssay designates an assay project via the “Summary” assay model, which provides a comprehensive description of the entire project and links to all individual submissions under it. As described above for the MLP data, such related datasets are presented in the “Same-Project BioAssays” section of an assay record page (), prompting data integration. In general, PubChem supports the designation of related BioAssay records regardless of data source, which allows screens from different laboratories to be linked and compared for research result validation.
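
The two-part record layout and the summary-result requirement described above can be made concrete with a small Python sketch. It is an illustration only, not PubChem's actual schema: all class and field names are hypothetical, and the check assumes that an active concentration is expected only for samples reported as active in a dose–response assay.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubstanceResult:
    """Summary result for one tested sample (hypothetical field names)."""
    sid: int                                 # PubChem Substance identifier
    outcome: str                             # e.g. "active" or "inactive"
    score: int                               # activity score for ranking hits
    active_conc_um: Optional[float] = None   # IC50/EC50 in micromolar
    readouts: dict = field(default_factory=dict)  # any further readouts


@dataclass
class AssayRecord:
    """Metadata plus results, mirroring the two-part presentation."""
    aid: int
    protocol: str
    target: str
    dose_response: bool = False
    cross_references: list = field(default_factory=list)  # related AIDs
    results: list = field(default_factory=list)

    def missing_concentrations(self):
        """SIDs of active samples lacking an active concentration.

        Only meaningful for dose-response assays; the restriction to
        active samples is an assumption made for this sketch.
        """
        if not self.dose_response:
            return []
        return [r.sid for r in self.results
                if r.outcome == "active" and r.active_conc_um is None]

    def ranked_hits(self):
        """Active samples ordered by descending activity score."""
        hits = [r for r in self.results if r.outcome == "active"]
        return sorted(hits, key=lambda r: r.score, reverse=True)
```

A depositor-side tool built along these lines could run `missing_concentrations()` before submission and use `ranked_hits()` to list the top-scoring samples in a screen.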
Through this cross-referencing mechanism, a new assay submission can specify one or multiple AIDs as cross-references; in turn, PubChem shows a reciprocal relationship on the BioAssay record page of every AID involved. In addition, PubChem BioAssay derives relationships between assays based on protein and gene target, common screening library, and shared publication.[19-21] These computational efforts allow users to search for assays with biologically related targets, for example, to find assays whose targets share protein sequence similarity or whose targets interact in a biological pathway. They also allow rapid hit evaluation, for example, identifying false positives by using related assays from counterscreens, or filtering out nonspecific hits by looking at common hits across assays. The links between small-molecule datasets and RNAi screening data allow one to combine and accelerate research from multiple scientific disciplines: discovering novel targets for small-molecule drug development, providing chemical tools for further validation of functional genomics studies, and deciphering mechanisms of action for small molecules by integrating RNAi profiling data.

PubChem BioAssay (https://www.ncbi.nlm.nih.gov/pcassay) can be accessed through Entrez, the NCBI information retrieval system. It is cross-linked to other databases in Entrez, such as PubMed, which enables users to access the datasets from the PubMed abstract pages. PubChem BioAssay FTP (ftp://ftp.ncbi.nlm.nih.gov/pubchem/Bioassay/) provides access to all deposited records and derived information. PubChem BioAssay also provides a suite of integrated services (https://pubchem.ncbi.nlm.nih.gov/assay/) enabling users to search, collect, compare, and analyze biological test results. The BioAssay record service provides access to the metadata and the entire dataset given the assay accession (AID).
As an example, AID 1284, submitted as a dose–response biochemical screen reporting inhibitors of c-Jun N-terminal kinase 3 (JNK3), can be accessed at https://pubchem.ncbi.nlm.nih.gov/bioassay/1284. Additional examples of BioAssay records illustrating various data types, with the respective URLs, are provided in .

Assay data may be submitted via PubChem Upload (https://pubchem.ncbi.nlm.nih.gov/upload/), the PubChem deposition gateway, which provides an extensive set of wizards, in-line help tips, and guided tutorials to assist data submission. Checkpoints for common mistakes are implemented in submission validation to ensure data integrity. PubChem allows depositors to update records and version the changes, so that information can be added, removed, or replaced with all changes archived. PubChem also implements a flexible on-hold mechanism to embargo Substance and BioAssay data to meet special needs of researchers, such as completing the peer review and publishing process for a journal manuscript or waiting for the approval of a patent application. Depositors and collaborators have full access to the on-hold data via a secure URL. URLs for important PubChem Upload documents, including login, submission help, FAQs, submission sample files, and guidance for accessing on-hold data, are summarized in .
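
For programmatic use, the access routes described in this section can be sketched as simple URL builders in Python. The record-page pattern follows the AID 1284 example given in the text. The Entrez search URL uses the NCBI E-utilities `esearch` endpoint with the database name `pcassay`; that endpoint is not described in this article and is included here as an assumption about how an Entrez query might be issued from a script. The function names are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Record-page pattern, following the AID 1284 example in the text.
RECORD_BASE = "https://pubchem.ncbi.nlm.nih.gov/bioassay/"

# NCBI E-utilities search endpoint (assumption: not described in the article).
ESEARCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def bioassay_record_url(aid):
    """Build the BioAssay record page URL for an assay accession (AID)."""
    if not isinstance(aid, int) or aid <= 0:
        raise ValueError("AID must be a positive integer")
    return f"{RECORD_BASE}{aid}"


def entrez_bioassay_search_url(term, retmax=20):
    """Build an Entrez search URL against the BioAssay database."""
    if not term.strip():
        raise ValueError("search term must not be empty")
    # urlencode escapes spaces and special characters in the query term.
    query = urlencode({"db": "pcassay", "term": term, "retmax": retmax})
    return f"{ESEARCH_BASE}?{query}"


if __name__ == "__main__":
    print(bioassay_record_url(1284))
    print(entrez_bioassay_search_url("JNK3"))
```

Keeping URL construction separate from the actual HTTP request makes the helpers easy to test and lets the caller choose a client and a request rate appropriate for a shared public service.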

Summary

PubChem BioAssay (https://www.ncbi.nlm.nih.gov/pcassay/) serves as a public repository for archiving biological test results of small molecules and RNAi reagents, which for the first time enabled public access to and sharing of large-scale HTS data among the drug discovery and screening community. The complex nature of HTS data requires a robust information system for tracking data submissions, updates, cross-references, and relationships among datasets. With 12 years’ development and the community’s support, including use of the resource, sharing of research data, and provision of annotations, PubChem has become a widely used public information platform supporting drug development, medicinal chemistry research, chemical and functional genomics, and bioinformatics and cheminformatics.[12] The scalable infrastructure built by PubChem as a public archival system is still far from being fully utilized by the community for stimulating discovery and supporting data validation, reuse, and interpretation. Researchers’ submission of RNAi data to PubChem shows the screening community’s support for data sharing. However, progress has been slow and inconsistent, which is in line with a recent finding that data sharing is largely lacking in many research fields for NIH-funded research projects.[46] On the other hand, there are positive changes: funding agencies are tightening mandatory data sharing policies by explicitly requiring data deposition in a public repository when awarding a grant. Meanwhile, awareness among journals and researchers, as well as their support for a data sharing requirement, is increasing, and many open-access journals created in recent years call for data sharing via public repositories. Now designated as a public repository by a growing list of journals and publishers, PubChem anticipates continued growth of data deposition in the era of open science.
Additionally, a few other areas are under development involving collaborations between PubChem and the community, which include, but are not limited to, the provision of annotations for assay metadata, the validation of assay results, and the development of software tools for annotating assay submissions. A stronger public repository for chemical biology and functional genomics research would also require collaborations among biologists, medicinal chemists, laboratory screeners, and informaticians to further develop ontologies and guidelines for describing assay technology, to enhance metadata annotation, and to define practical criteria for hit identification and readout reports. PubChem welcomes and encourages contributions from the SLAS community to use the resource, provide guidance and suggestions, and share research results.