
Simplified and Unified Access to Cancer Proteogenomic Data.

Caleb M Lindgren¹, David W Adams¹, Benjamin Kimball¹, Hannah Boekweg¹, Sadie Tayler¹, Samuel L Pugh¹, Samuel H Payne¹.

Abstract

Comprehensive cancer data sets recently generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) offer great potential for advancing our understanding of how to combat cancer. These data sets include DNA, RNA, protein, and clinical characterization for tumor and normal samples from large cohorts of many different cancer types. The raw data are publicly available at various Cancer Research Data Commons. However, widespread reuse of these data sets is also facilitated by easy access to the processed quantitative data tables. We have created a data application programming interface (API) to distribute these processed tables, implemented as a Python package called cptac. We implement it such that users who prefer to work in R can easily use our package for data access and then transfer the data into R for analysis. Our package distributes the finalized processed CPTAC data sets in a consistent, up-to-date format. This consistency makes it easy to integrate the data with common graphing, statistical, and machine-learning packages for advanced analysis. Additionally, consistent formatting across all cancer types promotes the investigation of pan-cancer trends. The data API structure of directly streaming data within a programming environment enhances the reproducibility. Finally, with the accompanying tutorials, this package provides a novel resource for cancer research education. View the software documentation at https://paynelab.github.io/cptac/. View the GitHub repository at https://github.com/PayneLab/cptac.


Keywords: CPTAC; Python; R; cancer; data access; data dissemination; genomics; mass spectrometry; proteogenomics; proteomics; reproducibility

Year: 2021 PMID: 33560848 PMCID: PMC8022323 DOI: 10.1021/acs.jproteome.0c00919

Source DB: PubMed Journal: J Proteome Res ISSN: 1535-3893 Impact factor: 4.466

Introduction

Large consortia, like the Clinical Proteomic Tumor Analysis Consortium (CPTAC), drive science forward by creating coordinated and structured data sets on a scale that is typically not possible with individual investigators. They amass both a number of samples and a diversity of measurements that require a large collaborative effort. In addition to the primary analysis done by the consortium and published as flagship manuscripts,[1−7] these data sets are designed to be a resource to the scientific community to explore new questions or apply novel methodologies.[8−12] Proteogenomic cancer data are of interest to a wide interdisciplinary audience, and different scientists may want to interact with different data products, for example, raw instrument data versus summarized quantitative tables. Funding agencies have focused on building resources for the dissemination of voluminous raw data files, for example, NCI Genomic Data Commons.[13,14] These data warehouses address the logistical and technological challenges of storing and disseminating terabytes of sequencing and mass spectrometry data; however, raw data repositories have a limited audience, as the reanalysis of raw instrument files requires domain-specific knowledge and significant computational resources. A growing audience of scientists want to directly interact with quantitative proteogenomics data tables and not the raw data. Currently there is not a common method for projects, large or small, to share these data tables in an open and computable format. Although some quantitative data may be shared through the large data warehouses,[15] this mechanism has several drawbacks. First, the final data tables used in a publication are usually highly curated and processed by harmonization, batch correction, normalization, filtering, and so on. The detailed attention in these computational adjustments should not be lost; the public should have access to the exact data tables used in a publication. 
Second, the Data Commons model, as currently designed, separates multiomics data sets into different Data Commons instances, requiring users to have prior knowledge of which data sets belong together and where they are stored. Finally, computational convenience should be a driving factor in the data storage mechanism, meaning that data should be easy to programmatically access and quickly use in computation. Thus an alternative dissemination mechanism that facilitates accessing and utilizing these data tables is needed to serve a broader scientific community.

A convenient method for disseminating coordinated data sets is the data Application Programming Interface (API) model, where data are streamed directly into a programming environment.[16] As an illustration of the need for higher level data distribution methods, consider the widespread use of Jupyter Notebooks[17,18] and other similar programming environments[19] for sharing research. These tools are ideal for both explaining the context of an analysis and showing the exact methodology. However, for this method of sharing research to be successful, it needs an accompanying flexible data distribution method. If a shared notebook uses files that are stored on the original researcher’s computer or are too large to be conveniently streamed, then it will be much less useful to others who wish to replicate and extend that analysis. The data API model solves this problem by ensuring that the exact version of a data set is globally and universally accessible.

We present here a data API for proteogenomic cancer data generated by CPTAC, representing seven tumor types. We provide access to the finalized data tables that were associated with the flagship publication for each cancer type. All tumor samples are characterized with genomic, transcriptomic, proteomic, and clinical data. 
The API streams proteogenomic data directly into a pandas DataFrame within a Python programming environment, dramatically improving the simplicity of data access. By using native DataFrame variables, the proteogenomic data easily integrate with common Python libraries for machine learning, graphing, and statistics. By leveraging the interoperability of R and Python, the data and analysis methods of the API can also be easily accessed via R scripts. Along with the API, we released an extensive set of tutorials to demonstrate common data analysis methods in proteogenomic and pan-cancer studies.

Methods

Overview

When reading these methods, it is important to keep in mind the distinction between the data files and the software API used to access them. In our project, these two entities are entirely separate and can be independently changed or updated. The CPTAC Python package, called cptac, is a data API that facilitates access to and utilization of cancer proteogenomic data within Python scripts (Figure 1). Similar to other Data-as-a-Service applications, the cptac package gives users on-demand access to structured data. The API parses, loads, integrates, and manipulates the cancer proteogenomic data from the CPTAC consortium. When data are accessed via the API, they are presented to the user as pandas DataFrames. This data structure integrates directly with most major machine learning, graphing, and data analytics libraries in Python.

Figure 1

CPTAC data API. (top) Cohorts in CPTAC have the same fundamental multiomics and clinical data. (bottom) The cptac Python module is simple to install and use within a Python programming environment. Image courtesy of Nathan Johnson and Darcy Zacchilli. Copyright 2021.

The software code for the cptac package is open source and available at https://github.com/PayneLab/cptac. Formal versions are tagged on GitHub and released as software updates through the Python Package Index (PyPI; https://pypi.org/project/cptac/). Thus developers can work with the code of the API by forking the GitHub repository, but users of the data should install the package on their local computer with “pip install cptac”. See below for more detailed installation instructions. Users should follow the tutorials at https://paynelab.github.io/cptac/#documentation for explicit code demonstrating the use of the API. Developers should follow the instructions in the /devdocs folder on the GitHub repository for software design and implementation requirements.

Installation and Software Dependencies

We strive to make installation of the package as simple and easy as possible. Users must have Python 3.6 or greater installed to run the package. They can then install our package in a single step, using the pip program: pip install cptac. This downloads the package from where it is posted on PyPI and installs it on the user’s computer. The package is then ready to use. The package depends on several other Python libraries including numpy, pandas, requests, and others. pip will automatically install these dependencies when it installs cptac because they are all also available on PyPI. Thus the user does not have to worry about managing the package’s software dependencies. For additional installation details, consult the Installation section in the package documentation (https://paynelab.github.io/cptac/#installation).

Data Tables

The data accessible with the cptac package are the published final data tables for each CPTAC tumor type, namely, the published data tables associated with flagship publications of the CPTAC consortium for seven cancer types: colon,[5] serous ovarian,[4] breast,[6] endometrial,[2] renal clear cell,[1] head and neck squamous cell carcinoma,[7] and lung adenocarcinoma.[3] The data set for every tumor type contains genomic, transcriptomic, proteomic, and phosphoproteomic measurements on the tumor samples. Each patient in the data set is described by a variety of clinical data including demographic, treatment, and outcome.

Package Construction and Organization

The methods of tumor collection and data generation within CPTAC follow a consistent, organized structure (Figure 1A). Our package has taken advantage of this consistency by creating an abstract class “Dataset”, which defines common variables and get functions to access all data tables that a data set will have, for example, get_clinical(), get_CNV(), get_proteomics(), get_somatic_mutation(), and so on. In addition to simple data access, the cptac package assists users in merging data across DataFrames using native join functions as part of the abstract Dataset class. These work on any combination of omics data and metadata, for example, join_metadata_to_mutations(). The join functions facilitate all merging to ensure that the returned DataFrames are properly keyed and indexed. Each tumor type is coded as a class object that inherits from the abstract Dataset class, that is, a derived class. Thus each derived class only needs to parse its specific data files into pandas DataFrames matching the format required by the abstract Dataset container; then, the rest of the functionality (i.e., get and join functions) is automatically integrated. This construction greatly simplifies the process of adding new tumor types.

To minimize the number of DataFrames and keep data descriptors present in the same variables as data values, we have implemented a multilevel index for some omics DataFrames. For example, proteomics and transcriptomics DataFrames reference protein/RNA isoforms both by their common name and by their unique database identifier. Although common names are used as the primary indexing key, the unique database identifiers are used as the secondary key to differentiate between isoforms. A second data type that uses a multi-index is post-translational modification (PTM) proteomics data such as phosphorylation. Here the multi-index comprises the gene name, database identifier, modified amino acid residue(s), and the MS-identified peptide sequence. This is necessary because multiple peptides may be observed in a data set with the same phosphorylated residue, often arising because of incomplete tryptic digestion.

Finally, the cptac package contains extra functionality within the utils subpackage, accessed as “from cptac import utils” or “import cptac.utils”. The utils subpackage implements commonly used functions and is continually expanding. To help users identify interacting proteins, a set of utility functions accesses pathway information from BioPlex, STRING, UniProt, and WikiPathways. The utils subpackage also provides several wrappers for common statistical tests, like t tests and linear regression, with automatic correction of the p-value cutoff for multiple hypothesis testing. In addition, utils automates the identification of frequently mutated genes.
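The inheritance pattern described in this section can be illustrated with a short sketch. This is not the cptac source code: the class names DatasetSketch and ToyCancer, the generic join_tables method, and the use of plain dictionaries in place of pandas DataFrames are assumptions made only to keep the example free of dependencies. Only the get_clinical and get_proteomics names come from the text, and the data values are invented.

```python
from abc import ABC, abstractmethod


class DatasetSketch(ABC):
    """Hypothetical stand-in for an abstract Dataset class.

    Tables are plain dicts of {sample_id: {column: value}} rather than
    pandas DataFrames, purely to keep the sketch dependency-free.
    """

    def __init__(self):
        # The derived class supplies the parsed tables; the get and join
        # functions below are inherited unchanged.
        self._tables = self._parse_files()

    @abstractmethod
    def _parse_files(self):
        """Return {table_name: {sample_id: {column: value}}}."""

    def get_table(self, name):
        if name not in self._tables:
            raise KeyError(f"{name} table is not available for this data set")
        return self._tables[name]

    def get_clinical(self):
        return self.get_table("clinical")

    def get_proteomics(self):
        return self.get_table("proteomics")

    def join_tables(self, left, right):
        """Inner join of two tables on their shared sample IDs."""
        left_table = self.get_table(left)
        right_table = self.get_table(right)
        joined = {}
        for sample_id, left_row in left_table.items():
            if sample_id in right_table:
                # Prefix columns with the table name so keys never collide.
                row = {f"{left}.{col}": val for col, val in left_row.items()}
                row.update({f"{right}.{col}": val
                            for col, val in right_table[sample_id].items()})
                joined[sample_id] = row
        return joined


class ToyCancer(DatasetSketch):
    """Hypothetical derived class: only the parsing step is specific."""

    def _parse_files(self):
        # Invented values standing in for parsed data files.
        return {
            "clinical": {"S1": {"stage": "I"}, "S2": {"stage": "III"}},
            "proteomics": {"S1": {"TP53": 0.4}, "S3": {"TP53": -1.2}},
        }


dataset = ToyCancer()
merged = dataset.join_tables("clinical", "proteomics")
print(merged)
```

Under this design, supporting a new tumor type only requires writing another small derived class with its own parsing step.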

Streaming Data through the API

The cptac module gives users access to a large amount of data; the disk space for a single cancer type is 50–100 MB. Because PyPI restricts the package size and because a user may not want all of the data for all cancer types, the data files are not stored directly within the package or its GitHub repository. Initially, the package contains only the URLs needed to access the data. The user must request to download the data set for a particular cancer type, whereupon the package uses one of these URLs to download an index file for that data set. This index file contains a list of all versions of the data set and a list of files, URLs, and MD5 hashes contained in each version. After each file is downloaded, the package hashes it and checks the result against the hash in the index to make sure none of the data was corrupted in the download process. After a user has downloaded a data set, they can load the data into variables in their Python program, for example, “dataset = cptac.Colon()”. The current implementation of the cptac package utilizes Box as a remote storage server. However, the software architecture has isolated the code involving remote data streaming, which makes it trivial to change the remote location.
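The hash check described above can be sketched with the Python standard library. The index layout, the file names, the example.org URLs, and the function names md5_of_file and verify_download are hypothetical; the real index format used by cptac is not given in this article. The sketch writes two small files to a temporary directory in place of an actual download.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large tables need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(directory, index, version):
    """Return the names of files whose hash does not match the index.

    `index` uses an assumed layout:
    {version: {file_name: {"url": ..., "hash": ...}}}
    """
    corrupted = []
    for file_name, entry in index[version].items():
        if md5_of_file(Path(directory) / file_name) != entry["hash"]:
            corrupted.append(file_name)
    return corrupted


with tempfile.TemporaryDirectory() as tmp:
    intact = b"sample\tTP53\nS1\t0.4\n"
    (Path(tmp) / "proteomics.tsv").write_bytes(intact)
    # Simulate a file that was damaged during download.
    (Path(tmp) / "clinical.tsv").write_bytes(b"truncated")
    index = {
        "1.0": {
            "proteomics.tsv": {
                "url": "https://example.org/proteomics.tsv",
                "hash": hashlib.md5(intact).hexdigest(),
            },
            "clinical.tsv": {
                "url": "https://example.org/clinical.tsv",
                "hash": hashlib.md5(b"sample\tstage\n").hexdigest(),
            },
        }
    }
    corrupted_files = verify_download(tmp, index, "1.0")

print(corrupted_files)  # ['clinical.tsv']
```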

New Data or API Releases

Before each data set’s flagship publication is finalized, the finalized data tables change periodically as various pipelines are compared and optimized. This is tracked by the consortium as versioned data releases, for example, 1.0 or 2.0 and so on. The package keeps track of these internal release versions. When a manuscript is published, these previous data versions are not recommended but are served by the package as a historical record. The API automatically monitors whether a user is working with the most recent release as follows: Each time the user loads a data set within their Python code, the package checks to make sure that the user has the current index for that data set. If necessary, it downloads a new version of the index from the server. Then, it checks whether the user is using the latest version recorded in the index. If the user did not request the latest version, then the package warns them that they are using an out-of-date version. It then loads whichever data version the user requested. Note that when new data versions are released and downloaded, the package maintains any prior versions that were downloaded to a user’s computer. When a user loads a data set, they can tell the package to load one of these older versions. Thus if changes in a new version affect the results of a user’s analyses, then they can go back and look at the old version to explicitly compare the two outputs. The package also makes sure that the software API itself is up to date. Periodically we release new versions of the package to add or improve functionality. Each time the user imports the package, it downloads a small file from the server that contains the latest release version number. It then compares this with the version number of the installed copy of the package. If they do not match, then it informs the user that their copy of the package is out of date and tells them how to update it using pip. 
Because Internet connectivity is not universal, the package first checks to see whether there is a connection prior to comparing index files for API or data versions. If it is not possible to download the files needed to check whether the indices and package are up to date, then the package automatically skips this version check and works with whatever it has installed locally.
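The version logic in this section can be summarized in a few lines. The sketch below is an assumption-laden illustration, not the package's implementation: the index is reduced to a mapping from version string to file list, and the function names parse_version, choose_version, and refresh_index are hypothetical. It shows the three behaviors described above: defaulting to the latest version, warning when an older version is requested, and falling back to the local index when the server cannot be reached.

```python
import warnings


def parse_version(text):
    """Turn '2.1' into (2, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def choose_version(index, requested=None):
    """Pick the data version to load, warning if it is not the latest."""
    latest = max(index, key=parse_version)
    if requested is None:
        return latest
    if requested not in index:
        raise ValueError(f"Version {requested} is not listed in the index")
    if requested != latest:
        warnings.warn(
            f"Loading data version {requested}; the latest is {latest}.",
            UserWarning,
            stacklevel=2,
        )
    return requested


def refresh_index(local_index, fetch_remote):
    """Use the server's index when reachable, else keep the local copy."""
    try:
        return fetch_remote()
    except OSError:
        # No connection: skip the check and work with what is on disk.
        return local_index


def offline_fetch():
    # Stand-in for a download attempt without Internet access.
    raise ConnectionError("no network")


local = {"1.0": ["a.tsv"], "2.0": ["a.tsv"], "10.0": ["a.tsv"]}
index = refresh_index(local, offline_fetch)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_version = choose_version(index)
    older_version = choose_version(index, requested="2.0")

print(default_version, older_version, len(caught))  # 10.0 2.0 1
```

Comparing versions as integer tuples rather than strings avoids ranking "2.0" above "10.0".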

Error Handling

The table joining and other data manipulation functions in our package provide novel opportunities for users to make mistakes when working with data tables. If they request an operation that causes problems severe enough that the operation needs to be canceled, then the package raises an exception that informs them why it cannot do what they asked. If the operation can be completed but may cause issues for the user, then the package issues a warning to alert them of the potential problem. The package uses the standard Python methods for raising exceptions and issuing warnings. However, the standard way that Python prints warnings and exceptions, with the full stack trace and associated information, can be intimidating for users without a computer programming background. To make these messages easier to decipher, our package defines custom sys.excepthook and warnings.showwarning functions so that whenever the package raises an exception or issues a warning, it will be printed in a concise, approachable format for users. Warnings and exceptions from outside of the package are printed in the normal fashion. However, in an IPython notebook environment, we cannot control how warnings and exceptions are displayed, so this feature is a specific enhancement for users accessing the package through the command line or scripts.

Results

To promote reproducibility, transparency, and collaboration in the analysis of CPTAC proteogenomic data, we created a data API that explicitly links data access and data analysis. The API gives access to the CPTAC data from seven cancer types: colon,[5] serous ovarian,[4] breast,[6] endometrial,[2] renal clear cell,[1] head and neck squamous cell carcinoma,[7] and lung adenocarcinoma.[3] As new data sets are published, they will become publicly available through the API. Data for each cancer type contain information on DNA, RNA, and proteins as well as clinical information (Figure 1). DNA data are derived from the whole genome and whole exome sequencing of tumor and blood normal and are processed to yield somatic mutation calls and somatic copy number variation calls. RNA data are from RNA-seq and are processed to yield transcriptomics and, frequently, circular RNA and micro-RNA. Protein data are from mass spectrometry-based proteomics and contain global proteomics and phosphoproteomics. Clinical data contain patient demographic information, descriptive data for the tumor, and patient follow-up information. We emphasize that our software mirrors the data tables associated with the flagship publication for each CPTAC data set. Although the overall experimental design and data acquisition for each cancer type were consistent, the data processing pipeline used to produce quantitative data matrices was subtly distinct, for example, different protein quantification algorithms. Users interested in specific details of sample acquisition and data processing should consult the original publications. When performing pan-cancer analysis, users should design their analysis such that it is not affected by these differences between data sets, such as by looking for a particular trend in multiple cancer types but not comparing quantitative values. 
Access to the raw mass spectrometry and genomic data is provided by NCI’s Cancer Research Data Commons (i.e., Proteome Data Commons and Genome Data Commons) and the CPTAC’s Data Coordinating Center. The data API is designed to provide frictionless access to quantitative data tables; its goal is to provide the data in the most convenient format with the least effort. The focus on quantitative data tables and not raw instrument data has several important benefits. First, these relatively small files do not necessitate a voluminous server or special considerations for downloading, which enables the API to work on standard computers with standard Internet connectivity. Second, these processed tables are also completely publicly available and do not contain any private germline information, further facilitating data access. Third, quantitative tables can be provided directly within the programming environment as a simple matrix variable, meaning that a user’s first interaction with data is as a properly parsed and loaded Python/pandas DataFrame (Figure 2A). In addition to single table access, the API also has native join functions that merge multiomics data types (Figure 2B). This simple access dramatically improves a user’s ability for interaction, exploration, and visualization (Figure 2C). Fourth, quantitative data files are the starting point of hypothesis testing and data analysis. Providing direct access to these data improves transparency and reproducibility by ensuring that all analysts are accessing the exact same data. This type of universal synchronization is not typically seen when analysts keep their own local versions of data files, which are frequently altered and resaved as renormalized or filtered versions of the original. Moreover, independent file versions often lack a detailed provenance of data manipulation. 
However, with a data streaming API, all data access is within a programming environment, and any data manipulation is directly exposed in code following the data access. Finally, the data API represents a dramatically simpler method for users to access and analyze data. All data for the CPTAC cohorts are available in the same format from the same simple software interface. There is no need to visit multiple repositories, for example, sequencing and mass spectrometry data at multiple NCI Data Commons.

Figure 2

Getting data from the cptac API. (A) The data API makes accessing CPTAC data simple and returns data in a native pandas DataFrame. (B) Merging different data types is facilitated by a suite of join functions in the API. (C) Joined mutation and proteomics data from panel B are shown with a boxplot from the seaborn Python graphing module. The example is drawn from use case 2 in the cptac documentation (https://paynelab.github.io/cptac/usecase02_clinical_attributes.html)

Versions

The data API is designed to keep track of formal versions of data. This is accomplished through a strict separation of the software code to load the data, and the actual data files. As with many large-scale projects, the data tables used in CPTAC analyses are periodically updated. Such adjustments are common in data analysis, and using a formal data version helps to properly track changes. These updates are often prompted by new patient survival or treatment information from follow-up visits. Distributing these updates throughout the consortium in a coordinated manner is best done through an official data release. Other reasons that prompt a data release include when the consortium has refined a software pipeline involved in generating the quantitative data tables (e.g., a different algorithm for processing raw sequencing or mass spectrometry data). The data set versions used in the CPTAC manuscripts are considered the final versions, and previous versions, although accessible, are not recommended for use. The end user can access these different versions when loading data into their Python environment. (See the Methods.) When working with the data API, the default version is the current published version; however, a user can specify a different version.

Tutorials and User Support

One of the most important goals of the cptac package is to promote reanalysis of the consortium’s data sets. To facilitate this, we have created a comprehensive set of tutorials and use cases (Table 1). The tutorials cover the basic mechanics of accessing data through the API and manipulating and merging data tables. The use cases are short examples that use the package to investigate real, biologically meaningful questions; they demonstrate how to use the API for hypothesis-driven biological discovery. Because the CPTAC data sets contain a variety of diverse data types, use cases often explore scientific questions that utilize integrated multiomics data.

Table 1

User Documentation (a)

Tutorial 1: Data intro	Goes over the basics of how to install the package and access data
Tutorial 2: pandas	More in-depth description of how to work with the tables using pandas
Tutorial 3: Joining DataFrames	Shows how to use built-in functions from cptac to join tables of different data types
Tutorial 4: MultiIndex	Some tables provided by the package use multilevel column indexes in cases where multiple keys are required to uniquely identify each column. This tutorial describes unique aspects of working with these tables
Tutorial 5: Updates	An explanation of how to access and work with data version updates and package version updates
Tutorial 6: Python and R	How to access the Python API within R
Use case 1: Multiomic integration	Data access and integration for multiple omics data types
Use case 2: Clinical covariates	Explores metadata for correlation between clinical attributes
Use case 3: Clinical and acetylation	Compares acetylation levels between tumor subtypes
Use case 4: Mutations and omics	Studies the effects of DNA mutations on protein abundance
Use case 5: Enrichment analysis	Uses the GSEApy (b) module to find enriched pathways
Use case 6: Derived molecular	Identifies correlation between proteomics and attributes derived from molecular data, for example, MSI status
Use case 7: Trans genetic effect	Studies the effect of DNA mutations on the expression of a different protein
Use case 8: Outliers	Uses the Blacksheep (c) module to study outliers in expression values
Use case 9: Clinical outcomes	Uses patient follow-up data to look for correlations between clinical and molecular data and patient survival
Use case 10: Pathway overlay	Integrates quantitative molecular data with Reactome pathway maps

(a) A list of tutorials and use cases to help users explore the data API is available at https://paynelab.github.io/cptac/#documentation.

(b) GSEApy module is available at https://gseapy.readthedocs.io/en/latest/introduction.html.

(c) Blacksheep module is available at https://blacksheep.readthedocs.io/en/master/.

The tutorials and use cases are written in interactive Python notebooks (Jupyter Notebooks). This format allows explanatory text, code, and output to be seamlessly integrated into a single document. The notebooks can be viewed as static web pages on our GitHub site (https://paynelab.github.io/cptac/#documentation). That site also has directions for users who wish to download the original notebooks and run them interactively or access interactive web versions hosted by Binder. In addition to the Python-based notebooks, tutorial 6 demonstrates how to access the API within the R programming environment. Along with our tutorials and use cases, we also provide user support through our GitHub issues page. This page allows users to submit bug reports, feature requests, and other feedback about our package. Past questions and answers are publicly available for others to view for reference.

Integration with R

The API is natively written in Python; however, there are options that facilitate access in other software languages. Users who prefer to work in R can easily use cptac to access the data they want and then transfer that data into R for analysis. A familiar way to transfer the tables from Python into R would be to load the data in Python using our package, save the tables to tab-delimited plain text files, and then read those files into R. A more seamless method is to use the R package reticulate to interface with cptac. reticulate allows a user to load and interface with Python packages and objects from an R environment and then convert Python objects into native R objects for subsequent use. Users interested in this course of action should consult tutorial 6.
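The file-based route can be sketched as follows. With a real pandas DataFrame, the same result is normally obtained with DataFrame.to_csv(path, sep="\t"); the standard-library version below uses a plain dictionary as a stand-in table so that it runs without dependencies. The function name write_tsv, the index column name sample_id, and the data values are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def write_tsv(path, table, index_name="sample_id"):
    """Write {sample_id: {column: value}} as a tab-delimited text file."""
    columns = sorted({col for row in table.values() for col in row})
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([index_name] + columns)
        for sample_id, row in table.items():
            # Missing measurements are written as NA, which R's
            # read.delim treats as a missing value by default.
            writer.writerow([sample_id] + [row.get(col, "NA") for col in columns])


# Invented values standing in for a proteomics table.
table = {"S1": {"TP53": 0.4, "EGFR": 1.1}, "S2": {"TP53": -1.2}}

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "proteomics.tsv"
    write_tsv(out_path, table)
    exported = out_path.read_text()

print(exported, end="")
```

A file written this way can then be read in R with read.delim.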

Integrating with External Bioinformatics Tools

To enhance usability and functionality, the data API connects to several bioinformatics tools. The first category comprises tools that have a Python implementation. Working with these third-party packages is facilitated by our use of DataFrames to encapsulate omics data, as many Python packages are written for pandas. Gene set analysis is a frequent step in omics data analysis to find common biological functions among a specified set of genes. One of the first of these tools was GSEA,[20] which has been reimplemented in Python in the gseapy module (https://pypi.org/project/gseapy/). A tutorial demonstrating the use of gseapy is available in the cptac documentation. Another common task for CPTAC data is survival analysis. Several Python packages implement analyses like the Kaplan–Meier curves or the Cox proportional hazard test. These tests identify whether time to an event is affected by a specified variable, for example, protein abundance or tumor grade (Figure 3). The cptac tutorials demonstrate the lifelines package (https://pypi.org/project/lifelines/) using data from the API.
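For readers unfamiliar with the Kaplan–Meier method, the estimator itself is short enough to write out. The sketch below is not lifelines code and is not taken from the cptac tutorials; it implements the standard product-limit formula, S(t) = product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of events at t_i and n_i is the number of patients still at risk. The function name kaplan_meier and the follow-up data are invented for illustration.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed event time.

    durations: follow-up time for each patient
    events: True if the event (e.g., death) was observed, False if censored
    """
    observations = sorted(zip(durations, events))
    at_risk = len(observations)
    survival = 1.0
    curve = []
    i = 0
    while i < len(observations):
        time = observations[i][0]
        deaths = 0
        removed = 0
        # Group all patients who share this time point.
        while i < len(observations) and observations[i][0] == time:
            deaths += 1 if observations[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            # Patients censored at this time still count as at risk here.
            survival *= 1 - deaths / at_risk
            curve.append((time, survival))
        at_risk -= removed
    return curve


# Invented follow-up times (months) and event indicators.
durations = [5, 8, 8, 12, 15]
events = [True, True, False, True, False]

curve = kaplan_meier(durations, events)
print([(time, round(prob, 3)) for time, prob in curve])
# [(5, 0.8), (8, 0.6), (12, 0.3)]
```

In practice, a dedicated package such as lifelines is preferable because it also provides confidence intervals, plotting, and group comparisons.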

Figure 3

Survival analysis with CPTAC data. Using the lifelines Python module, we identify variables that impact patient survival in ovarian cancer. A Kaplan–Meier curve is shown to separate time to death based on: (A) the FIGO stage, (B) the protein expression of PODXL, and (C) RAC2. A Cox proportional hazard assessment is shown in panel D. The images were generated using the code in use case 9 of the cptac documentation (https://paynelab.github.io/cptac/usecase09_clinical_outcomes.html).

A second type of external tool that can be integrated with the data API is bioinformatics databases. Many large databases like UniProt have web services that allow for programmatic access to specific information. To integrate our CPTAC data with these, the API wraps REST calls to the database. For example, the API can give users a list of interacting proteins from both UniProt and STRING by wrapping a call to their respective web service. Other databases either lack a convenient REST API or are small enough to be loaded in full. One example is WikiPathways, which curates the proteins belonging to biological pathways. We have downloaded this open-source information and integrated it into the data API, which users can access through simple function calls.

Discussion

Frictionless data access has become an important goal for science, as it promotes collaboration, transparency, reproducibility, and data reuse. In the past decade, attitudes and expectations in the scientific community have changed with respect to data sharing. Because data reuse extends the value of the financial investment beyond the original grant holder, various funding agencies have begun to set benchmarks for program success based on these more inclusive definitions of impact, as opposed to simple citation metrics. Although the sharing of raw data has a robust infrastructure for many primary data types, processed data tables are frequently still not shared in a simple and convenient manner. Here we present a data API for cancer proteogenomic data associated with the CPTAC consortium. By adopting the goals of the Data-as-a-Service model, our API radically improves access to these data.

As a generalized adaptation of the Data-as-a-Service model, our API's guiding philosophy is that data sets should be accessible within a programming environment. Previous work with a similar on-demand goal has often relied on a REST API, which frequently returns data in a structured JSON format. Unfortunately, a JSON object requires parsing and reshaping before it yields a practical data matrix. Because a data matrix is the most common data type in many scientific domains, our API directly gives users a native DataFrame object. A second advantage of our data API implementation over a REST API is that it obviates the need to host a Web server, making it simpler to create and to maintain over the long term. In our implementation, data are hosted on a robust commercial server, and the code used to stream the data is minimal and compartmentalized. Therefore, moving files to a different location, for example, switching from Box to Amazon, would require minimal code changes.

Numerous scientific projects, large and small, could improve the dissemination of their data through this mechanism.
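The parsing and reshaping burden that a JSON response places on the user, discussed above, can be illustrated with a short sketch. The response layout, field names, sample identifiers, and values below are hypothetical and chosen only for illustration; they do not come from any CPTAC data set or real web service.

```python
import json

# Hypothetical REST response: one record per (sample, protein) measurement.
response_text = """
[
  {"sample": "S001", "protein": "PODXL", "value": 0.42},
  {"sample": "S001", "protein": "RAC2",  "value": -1.10},
  {"sample": "S002", "protein": "PODXL", "value": 0.05}
]
"""


def records_to_matrix(records):
    """Reshape long-format records into a samples x proteins matrix.

    Returns the row labels, the column labels, and a list of rows.
    Measurements absent from the records are filled with None.
    """
    samples = sorted({r["sample"] for r in records})
    proteins = sorted({r["protein"] for r in records})
    lookup = {(r["sample"], r["protein"]): r["value"] for r in records}
    matrix = [[lookup.get((s, p)) for p in proteins] for s in samples]
    return samples, proteins, matrix


# Step 1: parse the text. Step 2: reshape it into a labeled matrix.
samples, proteins, matrix = records_to_matrix(json.loads(response_text))
```

Every consumer of such a JSON service has to write and maintain a step like `records_to_matrix`, including the handling of missing measurements. An API that returns a DataFrame directly moves that work out of the user's code.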
