Caleb M Lindgren, David W Adams, Benjamin Kimball, Hannah Boekweg, Sadie Tayler, Samuel L Pugh, Samuel H Payne.
Abstract
Comprehensive cancer data sets recently generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) offer great potential for advancing our understanding of how to combat cancer. These data sets include DNA, RNA, protein, and clinical characterization for tumor and normal samples from large cohorts of many different cancer types. The raw data are publicly available at various Cancer Research Data Commons. However, widespread reuse of these data sets is also facilitated by easy access to the processed quantitative data tables. We have created a data application programming interface (API) to distribute these processed tables, implemented as a Python package called cptac. The package is designed so that users who prefer to work in R can use it for data access and then transfer the data into R for analysis. Our package distributes the finalized processed CPTAC data sets in a consistent, up-to-date format. This consistency makes it easy to integrate the data with common graphing, statistical, and machine-learning packages for advanced analysis. Additionally, consistent formatting across all cancer types promotes the investigation of pan-cancer trends. Because the data API streams data directly into a programming environment, it enhances reproducibility. Finally, with the accompanying tutorials, this package provides a novel resource for cancer research education. View the software documentation at https://paynelab.github.io/cptac/. View the GitHub repository at https://github.com/PayneLab/cptac.
Keywords: CPTAC; Python; R; cancer; data access; data dissemination; genomics; mass spectrometry; proteogenomics; proteomics; reproducibility
Year: 2021 PMID: 33560848 PMCID: PMC8022323 DOI: 10.1021/acs.jproteome.0c00919
Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466
Figure 1. CPTAC data API. (top) Cohorts in CPTAC have the same fundamental multiomics and clinical data. (bottom) The cptac Python module is simple to install and use within a Python programming environment. Image courtesy of Nathan Johnson and Darcy Zacchilli. Copyright 2021.
Figure 2. Getting data from the cptac API. (A) The data API makes accessing CPTAC data simple and returns data in a native pandas DataFrame. (B) Merging different data types is facilitated by a suite of join functions in the API. (C) Joined mutation and proteomics data from panel B are shown with a boxplot from the seaborn Python graphing module. The example is drawn from use case 2 in the cptac documentation (https://paynelab.github.io/cptac/usecase02_clinical_attributes.html).
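The kind of join described in panels B and C can be illustrated with plain pandas. This is a minimal sketch, not the cptac join functions themselves: the table contents, the gene name `GENE1`, the sample IDs, and the column names are all hypothetical, and the sketch assumes only that both tables are indexed by sample ID.

```python
import pandas as pd

# Hypothetical toy tables indexed by sample ID; real CPTAC tables are far larger.
proteomics = pd.DataFrame(
    {"GENE1_proteomics": [0.52, -1.10, 0.08, 1.31]},
    index=pd.Index(["S001", "S002", "S003", "S004"], name="Patient_ID"),
)
mutations = pd.DataFrame(
    {"GENE1_Mutation": ["Missense", "Nonsense"]},
    index=pd.Index(["S002", "S004"], name="Patient_ID"),
)

# A left join keeps every sample that has proteomics data.
joined = proteomics.join(mutations, how="left")

# Samples with no recorded mutation are labeled wild type.
joined["Status"] = joined["GENE1_Mutation"].notna().map(
    {True: "Mutated", False: "Wildtype"}
)

# Group sizes and medians, i.e. the numbers a boxplot would summarize.
summary = joined.groupby("Status")["GENE1_proteomics"].agg(["count", "median"])
print(summary)
```

A table shaped like `joined` can be passed to a plotting library to draw a boxplot of abundance by mutation status, as in panel C.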
User Documentation
| Tutorial 1: Data intro | Goes over the basics of how to install the package and access data |
| Tutorial 2: pandas | More in-depth description of how to work with the tables using pandas |
| Tutorial 3: Joining DataFrames | Shows how to use built-in functions from |
| Tutorial 4: MultiIndex | Some tables provided by the package use multilevel column indexes in cases where multiple keys are required to uniquely identify each column. This tutorial describes unique aspects of working with these tables |
| Tutorial 5: Updates | An explanation of how to access and work with data version updates and package version updates |
| Tutorial 6: Python and R | How to access the Python API within R |
| Use case 1: Multiomic integration | Data access and integration for multiple omics data types |
| Use case 2: Clinical covariates | Explores metadata for correlation between clinical attributes |
| Use case 3: Clinical and acetylation | Compares acetylation levels between tumor subtypes |
| Use case 4: Mutations and omics | Studies the effects of DNA mutations on protein abundance |
| Use case 5: Enrichment analysis | Uses the GSEApy |
| Use case 6: Derived molecular | Identifies correlation between proteomics and attributes derived from molecular data, for example, MSI status |
| Use case 7: Trans genetic effect | Studies the effect of DNA mutations on the expression of a different protein |
| Use case 8: Outliers | Uses the Blacksheep |
| Use case 9: Clinical outcomes | Uses patient follow-up data to look for correlations between clinical and molecular data and patient survival |
| Use case 10: Pathway overlay | Integrates quantitative molecular data with Reactome pathway maps |
A list of tutorials and use cases to help users explore the data API is available at https://paynelab.github.io/cptac/#documentation.
The GSEApy module is available at https://gseapy.readthedocs.io/en/latest/introduction.html.
The Blacksheep module is available at https://blacksheep.readthedocs.io/en/master/.
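Tutorial 4 in the table above concerns tables whose columns need more than one key. The following is a minimal sketch of how such a multilevel column index behaves in pandas. The level names (`Name`, `Site`), gene names, site labels, and values are hypothetical and are not taken from the cptac tables.

```python
import pandas as pd

# Hypothetical two-level column index: a gene name plus a modification site.
columns = pd.MultiIndex.from_tuples(
    [("GENE1", "S15"), ("GENE1", "T42"), ("GENE2", "Y7")],
    names=["Name", "Site"],
)
phospho = pd.DataFrame(
    [[0.4, -0.2, 1.1], [0.1, 0.3, -0.9]],
    index=pd.Index(["S001", "S002"], name="Patient_ID"),
    columns=columns,
)

# Select every site belonging to one gene by its top-level key.
gene1 = phospho["GENE1"]

# Select a single column with the full (Name, Site) key.
one_site = phospho[("GENE1", "T42")]

# Flatten to single-level names when a downstream tool expects plain strings.
flat = phospho.copy()
flat.columns = ["_".join(key) for key in phospho.columns]
print(list(flat.columns))
```

Flattening is a convenience choice here; keeping the MultiIndex preserves the ability to select all columns for a gene in one step.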
Figure 3. Survival analysis with CPTAC data. Using the lifelines Python module, we identify variables that impact patient survival in ovarian cancer. Kaplan–Meier curves separate time to death based on (A) the FIGO stage, (B) the protein expression of PODXL, and (C) the protein expression of RAC2. A Cox proportional hazard assessment is shown in panel D. The images were generated using the code in use case 9 of the cptac documentation (https://paynelab.github.io/cptac/usecase09_clinical_outcomes.html).
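For readers unfamiliar with the Kaplan–Meier estimate behind panels A to C, the following is a minimal plain-Python sketch of the product-limit calculation. It is not the code from use case 9, which uses lifelines. The function name `kaplan_meier` and the follow-up times are hypothetical.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed death.

    durations: follow-up time for each patient.
    events: True if death was observed at that time, False if censored.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # each patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who survive past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Deaths and censored patients both drop out after time t.
        at_risk -= leaving[t]
    return curve


# Hypothetical follow-up data (arbitrary time units); False marks a censored patient.
durations = [5, 8, 8, 12, 15, 20]
events = [True, True, False, True, False, True]
curve = kaplan_meier(durations, events)
for time, prob in curve:
    print(time, round(prob, 4))
```

Computing one such curve per group (for example, per FIGO stage or per high versus low protein abundance) gives the separated curves that a Kaplan–Meier plot displays.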