Emily Boja¹, Živana Težak², Bing Zhang³, Pei Wang⁴, Elaine Johanson⁵, Denise Hinton⁶, Henry Rodriguez⁷
1. Office of Cancer Clinical Proteomics Research, Center for Strategic Scientific Initiatives, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA. emily.boja@nih.gov
2. Office of In Vitro Diagnostics and Radiological Health, Center for Devices and Radiological Health, US Food and Drug Administration, Silver Spring, MD, USA. zivana.tezak@fda.hhs.gov
3. Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA.
4. Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
5. Office of Health Informatics, Office of the Chief Scientist, Office of the Commissioner, US Food and Drug Administration, Silver Spring, MD, USA.
6. Office of the Chief Scientist, Office of the Commissioner, US Food and Drug Administration, Silver Spring, MD, USA.
7. Office of Cancer Clinical Proteomics Research, Center for Strategic Scientific Initiatives, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
Although genomics has shaped the current scope of precision medicine, it is becoming increasingly clear that molecular phenotypes, such as DNA and RNA profiles and, in particular, protein abundance profiles, are essential to our understanding of biology and for enhancing our ability to achieve the promise of precision medicine for patients. Hence, simultaneous generation and integration of multidimensional multi-omics datasets from a large set of tumor samples, such as those used in the National Cancer Institute’s (NCI) The Cancer Genome Atlas (TCGA; https://cancergenome.nih.gov) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC; https://proteomics.cancer.gov) projects[1-4], is becoming a powerful approach to understanding the molecular basis of diseases and speeding the translation of new discoveries to patient care. This development has been largely enabled by the rapid technological advancement, standardization and harmonization in tumor molecular profiling in recent years. Consequently, several initiatives have been launched to leverage this development for application to clinical practice, including the International Cancer Proteogenome Consortium[5] and the Applied Proteogenomics Organizational Learning and Outcomes[6] programs. These efforts promise to revolutionize our understanding of cancer biology and change the way cancer is treated.
The value of multi-omics technologies and datasets lies in the possibility of accurately extracting rich information to help understand the molecular complexities specific to individual patients through use of sophisticated integrative computational algorithms. Such information can be used to reach a deeper understanding of a disease, which then can be applied clinically, for example, to elucidate the relationship between the genome and proteome of a patient’s tumor or to deconvolute tumor heterogeneity associated with clinical outcome.
Ideally, individual and population data would ultimately serve to inform a physician and a patient and to help determine the most appropriate treatment options. Furthermore, the comprehensive information obtained on the same sample in multiple dimensions can add value in pinpointing and correcting problems that can be encountered, such as sample mislabeling by accidental swapping of patient samples or data mislabeling (accidental swapping of patient omics data), which could lead to multiple patients receiving the wrong medical treatment, resulting in severe, irreversible consequences.
Sample mislabeling that contributes to irreproducible results and invalid conclusions is known to be one of the obstacles in basic and translational research[7]. This is also prevalent in data-rich large-scale omics studies[8,9], in which human errors could arise anywhere in the data production and analysis pipeline, either as sample mislabeling (early in the pipeline) or as data mislabeling (later in the pipeline).
The Food and Drug Administration (FDA) and NCI-CPTAC, with a history of collaboration[10], also have experience in building challenges, such as the precisionFDA Challenges (https://precision.fda.gov/challenges) and NCI–CPTAC DREAM Proteogenomics Challenge (https://www.synapse.org/#!Synapse:syn8228304/wiki/413428), to solve complex problems. Now they are joining forces to launch a Multi-omics Enabled Sample Mislabeling and Correction Challenge (https://precision.fda.gov/mislabeling) in September 2018. The objective of this challenge is to encourage development and evaluation of computational algorithms that can accurately detect and correct mislabeled samples using rich multi-omics datasets, enhancing the assurance that the right data is attributed to the right patient.
Challenge design
The challenge comprises two subchallenges to be conducted sequentially. In Subchallenge 1, participants will be asked to detect mislabeled samples. Participants will be presented with a training dataset and a test dataset, comprising real-world clinical and proteomics data. Mislabeled samples will be known in the training dataset and not known in the test dataset. Using the training dataset, participants will develop computational models to distinguish samples of matched and nonmatched clinical and proteomics data. The computational models will then be used to identify mislabeled samples in the test dataset.
In Subchallenge 2, participants will be asked to correct mislabeled samples in richer data. Participants will be presented with real-world RNA profiling data for all samples in both the training and test datasets. Similar to the clinical and proteomics data, newly introduced RNA profiling data will also include mislabeled samples. As with Subchallenge 1, this information will be known in the training dataset, but not in the test dataset. Participants will develop computational algorithms to model the relationships among the three data types in the training dataset and then will apply the computational model to identify and correct instances of single data type sample mislabeling among the trio of data types in the test dataset. Subchallenge results will be independently evaluated (Fig. 1).
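To make the idea of matching data types concrete, the following is a minimal sketch of one possible baseline, not the challenge's prescribed method or any participant's algorithm. It assumes only two data types (RNA and protein), that both matrices share the same gene columns, and that RNA and protein abundance are positively correlated across samples; in real tumor data that correlation is often modest and the clinical data would also need to be modeled. All function names and the simulated data are hypothetical.

```python
import numpy as np


def zscore_columns(x):
    """Standardize each gene (column) across samples."""
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant genes
    return (x - mu) / sd


def cross_sample_correlation(rna, prot):
    """Pearson correlation between every RNA profile and every protein profile.

    Both inputs are (n_samples, n_genes) arrays with matching gene columns.
    """
    a = zscore_columns(rna)
    b = zscore_columns(prot)
    # Standardize each sample profile so the scaled dot product is a correlation.
    a = (a - a.mean(axis=1, keepdims=True)) / a.std(axis=1, keepdims=True)
    b = (b - b.mean(axis=1, keepdims=True)) / b.std(axis=1, keepdims=True)
    return a @ b.T / a.shape[1]


def find_mismatches(rna, prot):
    """Map each RNA sample index to its best-matching protein row.

    Only samples whose best match is NOT the row with the same label are
    returned; these are candidate mislabeled samples, and the mapped value
    is the suggested correction.
    """
    corr = cross_sample_correlation(rna, prot)
    best = corr.argmax(axis=1)
    return {i: int(j) for i, j in enumerate(best) if i != j}


def simulate_pair(n_samples=20, n_genes=200, swap=None, noise=0.3, seed=0):
    """Simulated (hypothetical) data: a shared signal plus independent noise."""
    rng = np.random.default_rng(seed)
    signal = rng.normal(size=(n_samples, n_genes))
    rna = signal + noise * rng.normal(size=signal.shape)
    prot = signal + noise * rng.normal(size=signal.shape)
    if swap is not None:
        i, j = swap
        prot[[i, j]] = prot[[j, i]]  # introduce one accidental label swap
    return rna, prot


if __name__ == "__main__":
    rna, prot = simulate_pair(swap=(3, 7))
    print(find_mismatches(rna, prot))
```

A sketch like this only flags that two labels disagree between two data types. Deciding which data type carries the wrong label, as Subchallenge 2 requires, would need the third data type as a tiebreaker.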
Fig. 1
Challenge design and timelines.
Anticipated outcome and impact
An immediate outcome envisioned is a flagship challenge manuscript that gives an overview of the challenge data, questions, design, and outcomes[11]. Additionally, the algorithms that the participants propose will be aggregated with the aim of refining a final open-source product to be incorporated into an analysis pipeline and, ultimately, into a quality-management system to reduce errors. This could help speed the translation of multidimensional omics technologies and datasets to the clinic. Meanwhile, NCI and FDA hope to build and expand a community of scientists who will collaborate to solve important problems that hinder the translation of multi-omics data to clinical laboratories.