| Literature DB >> 31506093 |
Kyle Ellrott, Alex Buchanan, Allison Creason, Michael Mason, Thomas Schaffter, Bruce Hoff, James Eddy, John M Chilton, Thomas Yu, Joshua M Stuart, Julio Saez-Rodriguez, Gustavo Stolovitzky, Paul C Boutros, Justin Guinney.
Abstract
Challenges are achieving broad acceptance for addressing many biomedical questions and enabling tool assessment. But ensuring that the methods evaluated are reproducible and reusable is complicated by the diversity of software architectures, input and output file formats, and computing environments. To mitigate these problems, some challenges have leveraged new virtualization and compute methods, requiring participants to submit cloud-ready software packages. We review recent data challenges with innovative approaches to model reproducibility and data sharing, and outline key lessons for improving quantitative biomedical data analysis through crowd-sourced benchmarking challenges.
Year: 2019 PMID: 31506093 PMCID: PMC6737594 DOI: 10.1186/s13059-019-1794-0
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Fig. 1 Challenge cycle overview. For each challenge, participants can form teams of one or more individuals. Challenge teams work together to develop a model (depicted as an open box), train their model on training data (purple cylinders) provided by the challenge organizers, containerize their model (closed box with outline), and submit their model to the challenge container repository. Submitted models are run on validation data (green cylinders) on a cloud computing system by the challenge organizers. Once predictions produced by the models are evaluated and scored, results are made available to the challenge teams. Teams can use this information to make improvements to their model and resubmit their optimized model.
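To make the containerized step of this cycle concrete, the sketch below shows what a submission's entrypoint script might look like: a program that reads whatever data the organizers mount into the container and writes predictions to a collection point for scoring. The mount paths (/input, /output), file naming, and CSV output format are illustrative assumptions; each challenge defines its own input/output contract.

```python
#!/usr/bin/env python3
"""Illustrative container entrypoint for a challenge submission.

The mounted paths (/input, /output) and the CSV prediction format are
hypothetical; each challenge specifies its own I/O contract.
"""
import csv
import pathlib

INPUT_DIR = pathlib.Path("/input")    # validation data mounted read-only by the organizers
OUTPUT_DIR = pathlib.Path("/output")  # predictions collected from here for scoring


def predict(sample_path: pathlib.Path) -> float:
    """Placeholder inference: replace with the team's trained model."""
    return 0.5


def main() -> None:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    with open(OUTPUT_DIR / "predictions.csv", "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample_id", "score"])
        for sample in sorted(INPUT_DIR.glob("*")):
            writer.writerow([sample.name, predict(sample)])


if __name__ == "__main__":
    main()
```

Because the entrypoint only touches mounted directories, the same image can be retrained, resubmitted, and rerun by the organizers without any change to the evaluation harness.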
Fig. 2 Challenge features. Challenges used cloud computing services for running and evaluating models, including Google Cloud Platform, OpenStack, Amazon Web Services, and IBM Cloud. Models were designed to run using either CPUs or GPUs. The data used to run and evaluate models were either real (obtained from patients or cell lines) or simulated using a computer algorithm. Challenges used genomic data, such as DNA sequencing, RNA sequencing, and gene expression; clinical phenotypes; and/or images. Models could be submitted to a challenge in the form of a Galaxy workflow, Docker image, or CWL (Common Workflow Language) workflow.
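On the organizer side, a Docker-image submission is typically executed against the held-out data inside the challenge's cloud environment. The sketch below illustrates one way such an evaluation run could be launched; the image name, mount points, and isolation flags are assumptions for illustration, not the exact harness used by these challenges.

```python
"""Illustrative harness for running a submitted Docker image on held-out data.

The image name (docker.synapse.org path), mount points, and flags are
assumptions for this sketch, not the challenges' actual evaluation setup.
"""
import subprocess

SUBMISSION_IMAGE = "docker.synapse.org/syn123/team-model:latest"  # hypothetical submission

cmd = [
    "docker", "run", "--rm",
    "--network", "none",                  # cut network access so held-out data cannot leave the run
    "-v", "/data/validation:/input:ro",   # validation data mounted read-only
    "-v", "/data/results:/output",        # predictions written here for scoring
    SUBMISSION_IMAGE,
]
subprocess.run(cmd, check=True)
```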
Challenge data characteristics
| Challenge | Data types | Data cohorts | Samples | Size | Open |
|---|---|---|---|---|---|
| Digital Mammography | Human clinical; imaging | Kaiser Permanente | 80k patients (640k images) | 13 TB | No |
| | | MSSM | 1k (15k) | 0.3 TB | No |
| | | Karolinska | 69k (663k) | 13.2 TB | No |
| | | UCSF | 42k (500k) | 10 TB | No |
| | | CRUK | 7k | | No |
| | | Total | 200k (1818k) | 36.5 TB | |
| Multiple Myeloma | Human clinical; gene expression; DNA-seq; cytogenetics | MMRF | 797 | 11 GB | Yes |
| | | PUBLIC | 1444 | 1 GB | Yes |
| | | DFCI | 294 | 76 GB | No |
| | | UAMS | 463 | 6 GB | No |
| | | M2Gen | 105 | 41 GB | No |
| | | Total | 3103 | 135 GB | |
| SMC-Het | | All | 76 | 22 GB | No |
| SMC-RNA | Simulated; human clinical; RNA-seq | Training | 31 | 290 GB | Yes |
| | | Test | 20 | 197 GB | Yes |
| | | Real | 32 | 265 GB | No |
Data cohorts describe the source of the data used in the challenge. MSSM Mount Sinai School of Medicine, UCSF University of California San Francisco, CRUK Cancer Research UK, MMRF Multiple Myeloma Research Foundation, DFCI Dana-Farber Cancer Institute, UAMS University of Arkansas for Medical Sciences, Training synthetically generated data provided to participants, Test synthetically generated held-out data, Real cell lines spiked in with known constructs. The number of samples in Digital Mammography includes the number of patients and, in parentheses, the number of images. Open indicates whether the data were publicly available to participants
Summary of models and teams for challenges
| Challenge | Cloud platforms | Model format | # of models | # of teams |
|---|---|---|---|---|
| Digital Mammography | AWS, IBM SoftLayer | Docker | 310 | 57 |
| Multiple Myeloma | AWS | Docker | 180 | 71 |
| SMC-Het | ISB-CGC (Google) | Galaxy, Docker | 58 | 31 |
| SMC-RNA | ISB-CGC (Google) | CWL, Docker | 141 | 16 |
| Proteogenomic | AWS | Docker | 449 | 68 |
Cloud platforms, model submission formats, and the number of submitted models and participating teams for each challenge
Fig. 3 a) Distribution of model run times across the model-to-data (M2D) Challenges. b) Comparison of CPU and disk usage across the M2D Challenges. CPU time is the total wall time for running a single entry against all test samples used for benchmarking. Disk usage is the size of the test set in GB. The diagonal line represents the point at which the cost of download egress fees and the cost of compute are equivalent; below the line, an M2D approach is theoretically cheaper.
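The break-even line in Fig. 3b can be reproduced with a back-of-the-envelope calculation: sending the data to participants costs roughly the test-set size times the per-GB egress fee, while running one submitted model in the cloud costs its wall time times the per-CPU-hour price. The prices and the 100 CPU-hour figure below are illustrative assumptions, not values from the paper; the 265 GB test-set size is taken from the SMC-RNA "Real" cohort in the data table above.

```python
"""Back-of-the-envelope comparison of data-download (egress) cost versus
in-cloud compute cost, mirroring the break-even line in Fig. 3b.

The per-GB egress fee and per-CPU-hour price are assumed values;
substitute the rates of the cloud provider actually used.
"""

EGRESS_FEE_PER_GB = 0.09    # assumed USD per GB downloaded out of the cloud
CPU_PRICE_PER_HOUR = 0.05   # assumed USD per on-demand CPU-hour


def download_cost(test_set_gb: float) -> float:
    """Cost of shipping the test data out of the cloud to a participant."""
    return test_set_gb * EGRESS_FEE_PER_GB


def compute_cost(cpu_hours: float) -> float:
    """Cost of running one submitted model in the cloud instead."""
    return cpu_hours * CPU_PRICE_PER_HOUR


if __name__ == "__main__":
    # Example: a 265 GB test set versus a model needing 100 CPU-hours (assumed).
    size_gb, hours = 265.0, 100.0
    print(f"egress:  ${download_cost(size_gb):.2f}")
    print(f"compute: ${compute_cost(hours):.2f}")
    # The M2D approach is theoretically cheaper whenever
    # compute_cost(hours) < download_cost(size_gb), i.e. below the diagonal.
```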