Q-omics: Smart Software for Assisting Oncology and Cancer Research.

Jieun Lee¹,², Youngju Kim¹,², Seonghee Jin¹, Heeseung Yoo¹, Sumin Jeong¹, Euna Jeong³, Sukjoon Yoon¹,³.

Abstract

The rapid increase in collateral omics and phenotypic data has enabled data-driven studies for the fast discovery of cancer targets and biomarkers. Thus, it is necessary to develop convenient tools for general oncologists and cancer scientists to carry out customized data mining without computational expertise. For this purpose, we developed innovative software that enables user-driven analyses assisted by knowledge-based smart systems. Publicly available data on mutations, gene expression, patient survival, immune score, drug screening and RNAi screening were integrated from the TCGA, GDSC, CCLE, NCI, and DepMap databases. The optimal selection of samples and other filtering options were guided by the smart function of the software for data mining and visualization on Kaplan-Meier plots, box plots and scatter plots of publication quality. We implemented unique algorithms for both data mining and visualization, thus simplifying and accelerating user-driven discovery activities on large multiomics datasets. The present Q-omics software program (v0.95) is available at http://qomics.sookmyung.ac.kr.

Keywords: Kaplan-Meier plot; biomarker; cancer bioinformatics; immune infiltrate; omics data mining; smart software

Year: 2021 PMID: 34819397 PMCID: PMC8627836 DOI: 10.14348/molcells.2021.0169

Source DB: PubMed Journal: Mol Cells ISSN: 1016-8478 Impact factor: 5.034

INTRODUCTION

Large collateral datasets, including those on mutations, gene expression, drug/RNAi screening and patient survival, are publicly available from diverse resources (Barretina et al., 2012; Cancer Genome Atlas Research Network et al., 2013; Ghandi et al., 2019; Guan et al., 2019; Iorio et al., 2016; Monks et al., 2018; Shi et al., 2021). Integrated analysis of the cross-association of these datasets provides useful clues for finding novel targets, predictive biomarkers and related mechanisms (Jeong et al., 2020; Shen et al., 2019). For example, many genes and mutations have been found to be associated with the patient survival rate via analyses of datasets from the TCGA database (Cao et al., 2020; Eckstein et al., 2020; Hong et al., 2017; Kitsou et al., 2020; Yang et al., 2011; Zhong et al., 2020). Cell line databases provide clues for the identification of predictive biomarkers of drug resistance and/or sensitivity (Garnett et al., 2012; He et al., 2014; Kim et al., 2016; Li et al., 2021; Yang et al., 2013). Novel targets against subtype-specific cancer mutations have also been suggested (Biswas et al., 2019; Li et al., 2019; Park et al., 2019). An explosive increase in these collateral datasets will provide important resources for diverse data-driven cancer research projects. However, systematic and integrated analyses of these datasets are still challenging for most oncologists and cancer researchers who have no computational background. Many web-based tools, such as Oncomine (Rhodes et al., 2004), cBioPortal (Cerami et al., 2012), and TIMER2.0 (Li et al., 2020), have been developed to improve the utility of public cancer datasets. Although these web-based applications are useful for quick data searches that return significant information, their server-provided functions generally offer only limited support for user-driven customized calculation and data filtering.
Thus, flexible and comprehensive software is required for cancer scientists to carry out customized data processing and computation on their local computers. Here, we attempted to develop innovative smart software that allows oncologists to easily start their own data mining projects without computational skills. We established two aims for this software. First, the process of data analysis and visualization should be simple and comprehensive, supported by a user-friendly graphical interface and an intuitive organization of menus. Second, we tried to implement smart functions that guide users to find optimal outputs, i.e., associated data pairs and graphs, via real-time communication with a server-side knowledge base harboring billions of pre-calculated data pairs. For these purposes, we simultaneously developed stand-alone software with data processing and computation abilities and a server-side knowledge base that can be connected to the local software. This report briefly presents the functions and utilities of this software, Q-omics v0.95. The smart system of the implemented knowledge base will be continuously updated, together with improved visualization options in the user interface. We expect that the present computer-aided, smart data mining system will be broadly useful across oncology and cancer research without requiring bioinformatics skills.

MATERIALS AND METHODS

Cell line data

Cell line-based large-scale data consisting of RNA sequencing data (Expression, ver. 20Q1), sgRNA sequencing data (CRISPR, ver. 20Q1), shRNA screening data (Achilles + DRIVE + Marcotte, DEMETER2), mutation data (Mutation Public, ver. 20Q4), and drug response data (Sanger GDSC1 and GDSC2) were obtained from the DepMap portal (https://depmap.org/portal/). RNA sequencing data represent log2-transformed transcripts per million (TPM) + 1 values using RSEM normalization. sgRNA and shRNA data are batch-corrected CERES gene knockout effects (Meyers et al., 2017) and DEMETER2 estimated gene knockdown effects (McFarland et al., 2018), respectively. Mutation data are provided as a MAF file of gene mutations. Drug response data are published as IC50 (nM) values, which we transformed to the logarithmic pIC50 scale (IC50 expressed in M). To analyze associations between datasets, 20 lineages with a sufficient number of common cell lines between RNA sequencing data and other data (sgRNA and drug response) were used in this study. Furthermore, the gene expression data of NCI60 cell lines treated with 15 drugs were obtained from the GEO database (GSE116436) (Monks et al., 2018). Details on the cell line, number of lineages, number of cell lines, and number of genes/drugs are shown in Table 1.
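The two value transformations mentioned above can be sketched as follows. This is a minimal illustration rather than the Q-omics code: the function names are hypothetical, and pIC50 is assumed to follow the usual definition, the negative base-10 logarithm of the IC50 expressed in molar units.

```python
import math


def ic50_nm_to_pic50(ic50_nm):
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in M).

    Since 1 nM = 1e-9 M, this equals 9 - log10(IC50 in nM).
    """
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    return 9.0 - math.log10(ic50_nm)


def log2_tpm_plus_one(tpm):
    """Return log2(TPM + 1); the +1 keeps zero expression at zero."""
    if tpm < 0:
        raise ValueError("TPM must be non-negative")
    return math.log2(tpm + 1.0)
```

On this scale a higher pIC50 means a more potent response, so an IC50 of 1,000 nM (1 µM) maps to a pIC50 of 6.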

Table 1

Numbers of data points integrated into Q-omics software

	No. of lineages	No. of cell lines/No. of samples	No. of genes/No. of drugs	Data type
Cell line data
Gene expression	20	1,061	19,137	RNA sequencing
sgRNA	20	741	18,110	CRISPR
shRNA	20	587	16,800	RNAi shRNA
Drug response	20	1,001	397	Drug response
Mutation	20	1,281	18,731	Exome sequencing
Drug-induced gene expression	13	60	12,305/15	DNA microarray
Tissue data
Tumor gene expression	33	9,951	38,311	RNA sequencing
Paired normal vs. cancer: gene expression	18	679	38,311	RNA sequencing
Mutation	33	9,100	20,850	Exome sequencing
Immune	33	8,954	64 (cell types)	Cell type enrichment score

Tissue data

Patient RNA sequencing data, clinical data, and mutation data were obtained from the Genomic Data Commons Data Portal (https://portal.gdc.cancer.gov/). In total, 33 cancer types were investigated in this study. For comparisons between normal and tumor data, paired normal and tumor tissue samples were collected from the 18 cancer types with more than 2 matched tissue samples. RNA sequencing data in FPKM (fragments per kilobase of transcript per million fragments mapped) values were transformed to log2 (TPM + 1) values after downloading. In addition, immune cell enrichment scores for TCGA data were obtained from the xCell portal (https://xcell.ucsf.edu/) (Aran et al., 2017). Details on the tissue type, number of lineages, number of samples, and number of genes are shown in Table 1.
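The FPKM-to-TPM step can be illustrated with the standard per-sample rescaling, in which each FPKM value is divided by the sample's FPKM total and multiplied by one million. The sketch below assumes this standard conversion; the text does not spell out the exact procedure used, and the function names and gene labels are hypothetical.

```python
import math


def fpkm_to_tpm(fpkm_by_gene):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm_by_gene.values())
    if total <= 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in fpkm_by_gene.items()}


def fpkm_to_log2_tpm(fpkm_by_gene):
    """Convert one sample's FPKM values to log2(TPM + 1)."""
    tpm = fpkm_to_tpm(fpkm_by_gene)
    return {gene: math.log2(value + 1.0) for gene, value in tpm.items()}
```

Because the rescaling uses the per-sample total, the conversion has to be applied to each sample separately and to all of its genes at once.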

Cross-association analysis

To analyze associations between two datasets, we performed a cross-association analysis between phenotypic efficacy and gene expression in our previous work (Jeong et al., 2020). In this study, we extended the concept of cross-association to analyze more diverse datasets, including those on gene expression, sgRNAs, shRNAs, the drug response, and mutations (Fig. 1). Cross-associations can be analyzed both between different data types, such as drug versus RNA-seq and shRNA versus mutation, and within the same data type, such as sgRNA versus sgRNA and mutation versus mutation.

Fig. 1

Overview of data integration in Q-omics software.

Public datasets from the TCGA, GDSC, CCLE, NCI and DepMap were integrated for the cross-association analysis (blue arrow) between any two datasets.

The two association measures are predictivity and descriptivity. Given any two datasets (X and Y), we assume that x and y are entries of X and Y, respectively. The predictivity of x measures the difference in x values between two sample groups that are separated at the median of y. In contrast, the descriptivity of x measures the difference in y values between two sample groups that are separated at the median of x. Significance was tested using Fisher’s exact test for categorical data (mutation) and Student’s t-test for numerical data (all other data).
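A minimal sketch of these measures is given below. It assumes that the two groups are formed by splitting samples at the median and that the group difference is a difference of means; the exact scoring used in Q-omics (Jeong et al., 2020) may differ. All function names are hypothetical, and the tests are written out with the standard library only so that the arithmetic is visible.

```python
import math
from statistics import mean, median, variance


def split_by_median(key_values, other_values):
    """Group other_values by whether the paired key value is <= or > the median."""
    cut = median(key_values)
    low = [o for k, o in zip(key_values, other_values) if k <= cut]
    high = [o for k, o in zip(key_values, other_values) if k > cut]
    return low, high


def predictivity(x, y):
    """Difference in mean x between samples with high and low y."""
    low, high = split_by_median(y, x)
    return mean(high) - mean(low)


def descriptivity(x, y):
    """Difference in mean y between samples with high and low x."""
    low, high = split_by_median(x, y)
    return mean(high) - mean(low)


def student_t(a, b):
    """Two-sample Student's t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return math.comb(row1, k) * math.comb(row2, col1 - k) / math.comb(n, col1)

    observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= observed * (1 + 1e-9))
```

For a numerical pair, the two lists returned by `split_by_median` can be passed directly to `student_t`; for mutation data, the counts of mutated and wild-type samples in each group fill the 2x2 table for `fisher_exact_two_sided`.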

Survival analysis

Survival data were analyzed using the Kaplan–Meier (KM) method, and the log-rank test was used to compare the survival outcomes of two groups as a test of statistical significance. Furthermore, the area under the curve (AUC) was calculated to provide an estimate of the size of the difference between two groups. In this study, overall survival (OS) and disease-free survival (DFS) were analyzed. The two analyses differed according to the definition of the primary endpoint: all causes of death during the study period were used to analyze OS, and a tumor event or death was used to analyze DFS. For the single-gene survival analysis, patients were divided into two groups based on high or low expression of the given gene or mutation status of the given gene. The association of two genes can be determined in advance to generate several subgroups for combined-gene survival analysis. Furthermore, for a more sophisticated survival analysis, a subset of patients was selected using clinical information such as sex, stage, or any combination of sex and stage.
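The Kaplan–Meier estimate and the two-group log-rank test described above can be sketched with the standard library alone. This is an illustrative implementation of the textbook methods, not the Q-omics code; the function names are hypothetical, and events are coded as 1 (event observed) or 0 (censored).

```python
import math


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time."""
    curve, surv = [], 1.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Return the log-rank chi-square statistic (1 df) and its P value."""
    pooled = list(zip(times_a, events_a)) + list(zip(times_b, events_b))
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({t for t, e in pooled if e}):
        n_a = sum(1 for u in times_a if u >= t)
        n_b = sum(1 for u in times_b if u >= t)
        d_a = sum(1 for u, e in zip(times_a, events_a) if u == t and e)
        d_b = sum(1 for u, e in zip(times_b, events_b) if u == t and e)
        n, d = n_a + n_b, d_a + d_b
        # Observed minus expected events in group A at this time point.
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / var
    # For one degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))
```

For a single-gene analysis, the two groups would correspond to patients with high and low expression (or mutated and wild-type status) of the queried gene.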

Smart search

Q-omics was designed to run locally on user computers. While Q-omics is running, time-consuming or data/memory-intensive analyses are performed on the server computer. For example, the cross-association analysis on the user side covers only a given lineage, while the smart search retrieves the most highly associated pairs in all 20 lineages from the server side. Similarly, for the survival analysis, the user side calculates the survival rate based on a single gene in a given lineage, while the smart search provides the significance of the survival rate based on a given gene in all 33 lineages.
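The kind of lookup that a smart search performs against precalculated pairs can be sketched as a simple indexed query. Everything in this example is an assumption made for illustration: the table name, its columns, the placeholder gene labels and the toy rows are hypothetical, and an in-memory SQLite database stands in for the server-side MySQL knowledge base, whose actual schema is not described in the text.

```python
import sqlite3


def build_demo_knowledge_base():
    """Create a toy table of precalculated associations (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE association ("
        "lineage TEXT, query TEXT, partner TEXT, p_value REAL)"
    )
    conn.executemany(
        "INSERT INTO association VALUES (?, ?, ?, ?)",
        [
            # Placeholder rows, not real results.
            ("lung", "cisplatin", "GENE_A", 0.001),
            ("breast", "cisplatin", "GENE_B", 0.20),
            ("skin", "cisplatin", "GENE_C", 0.004),
            ("lung", "other_drug", "GENE_A", 0.0001),
        ],
    )
    return conn


def smart_search(conn, query, p_cutoff=0.01, limit=10):
    """Return the most significant partners of a query across all lineages."""
    return conn.execute(
        "SELECT lineage, partner, p_value FROM association "
        "WHERE query = ? AND p_value < ? "
        "ORDER BY p_value ASC LIMIT ?",
        (query, p_cutoff, limit),
    ).fetchall()
```

Storing the pairs precalculated means that the client only issues a filtered, sorted lookup, while the expensive association statistics are computed once on the server side.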

Box plot analysis

Box plots in Q-omics can be used to visualize differences in the distribution of numerical data between different groups. The differences between two groups were analyzed by calculating the fold change and P value (Student’s t-test). Q-omics also provides a platform for comparisons between drug-induced changes in gene expression. Gene expression data from NCI60 cell lines treated with 15 anticancer agents contained the measured expression values of nine genes at three time points (2, 4, and 24 h) and at three doses (0 nM, low dose, and high dose) (Monks et al., 2018). The low and high doses used varied depending on each drug. For these data, box plots were generated to compare time- and dose-dependent gene expression. Groups were divided based on time points or doses, and the box plots display fold changes between time points (4 h vs 2 h and 24 h vs 2 h) or between doses (low dose vs 0 nM and high dose vs 0 nM), respectively, rather than raw gene expression values.
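The fold-change values behind such box plots can be sketched as follows. The example assumes that expression values are on a log2 scale, so that a log2 fold change is a simple difference from the reference condition (for example, 4 h minus 2 h for the same cell line); the text does not state the scale of the drug-induced expression data. The function names and the choice of quartile method are likewise assumptions.

```python
from statistics import quantiles


def log2_fold_changes(reference, treated):
    """Per-cell-line log2 fold change of treated versus reference.

    Both arguments map cell line names to log2-scale expression values;
    only cell lines present in both conditions are compared.
    """
    return {cell_line: treated[cell_line] - reference[cell_line]
            for cell_line in reference if cell_line in treated}


def box_stats(values):
    """Five-number summary used to draw one box of a box plot."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": q2,
            "q3": q3, "max": max(values)}
```

Each box then summarizes the fold changes of all cell lines for one comparison, such as 24 h vs 2 h or high dose vs 0 nM.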

Scatter plot analysis

Scatter plots were used to display relationships between two numeric variables, and the strength and direction of the linear relationships were assessed by Pearson’s correlation coefficient in Q-omics.
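For reference, Pearson's correlation coefficient for two paired numeric variables can be computed as in the short sketch below, which uses only the standard library; the function name is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

Values near 1 or -1 indicate a strong positive or negative linear relationship, and values near 0 indicate little linear relationship.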

Q-omics implementation

Q-omics was implemented in Python 3, and MySQL was used for the smart search.

RESULTS AND DISCUSSION

Q-omics software runs on the user’s computer, providing a graphical interface and computational/visualization modules together with its own local database (Fig. 2A). To assist in user data mining, Q-omics interacts with a server-side knowledge base and retrieves relevant information for analysis. The knowledge base harbors billions of precalculated, significantly associated data pairs with related information such as sample filters and calculation options. Smart algorithms in the knowledge base promptly select data pairs and information that is relevant to the user’s query and then returns it to Q-omics.

Fig. 2

Software workflow and user interface.

(A) The workflow of functional modules and databases between the local software and server-side knowledge base in Q-omics. (B) Main interface of Q-omics software. Search options are separated into “Browse smart data” and “Query-oriented analysis”. “Quick start examples” are comprehensive options for first-time users. Knowledge-based smart search is enabled for all of the search options.

As described in Fig. 1, users can start data mining with one query (i.e., gene expression, mutations, drugs, or sh/sgRNAs). The front page of Q-omics provides a graphical interface for selecting the analysis type, query and sample type (Fig. 2B). All analyses are divided into those with patient samples and those with cell lines. Available analyses with patient samples are as follows: (1) survival analyses (Kaplan–Meier plots) according to gene expression and mutations, (2) differential gene expression analyses between normal and cancer cells, and (3) scatter/box plot analyses of gene expression and/or mutation pairs. Available analyses with cell lines are as follows: (1) cross-association analyses between any pair of datasets according to gene expression, mutations, shRNA screening data, sgRNA screening data and drug screening data, (2) change (induction) analyses of gene expression before/after drug treatments, and (3) scatter/box plot analyses of pairs according to gene expression, mutations, shRNAs, sgRNAs and drugs. The menu “Quick start examples” is used to demonstrate graphical outputs and smart functions of the software using the preselected analysis type and user-selected queries. In all analyses, the resulting graphs and data can be saved for further use. Fig. 3A demonstrates the survival analysis module of the software. A Kaplan–Meier plot of BRCA patient data was generated by using user-selected options: CD24 gene expression with TP53 mutations. The graphical panel provides detailed information on selected samples and further filtering options such as sex and stage. Together with the panel of Kaplan–Meier plots, Q-omics software provides a panel of smart search results (Fig. 3B). This smart panel provides a list of genes that exhibit significant (P < 0.01) associations with the survival rate in combination with user-selected queries, i.e., CD24 gene expression.
Users can select one of the genes in the list and see the Kaplan–Meier plot in a new panel. This is very useful for the quick discovery of gene expression changes or mutations that are associated with the queried gene (the user’s gene of interest) in the patient survival analysis. This smart list is automatically generated from the server-side knowledge base by using information such as user-selected queries and lineages. The smart system in the server searches the knowledge base for genes or mutations that are significantly associated with the user’s query and sends them to the Q-omics user interface. Algorithms in the smart system are improved and updated continuously as the data in the knowledge base increase.

Fig. 3

Graphical interface of patient survival analysis and related smart search results.

(A) The panel of survival analyses included Kaplan–Meier (KM) plots, sample group information and advanced options for plotting. (B) The panel of gene lists retrieved by the smart algorithm from the server-side knowledge base. In this example, the list shows genes that are significantly (P < 0.01) associated with the user’s query in the KM plot.

Fig. 4A shows the Q-omics output panel of a cross-association analysis between the user-selected drug, cisplatin, and 17,795 sgRNAs in lung cancer cell lines. The present example shows that the responses of 136 sgRNAs exhibit a positive association (P < 0.05) with the cisplatin response (red circle in Fig. 4A), while 179 sgRNAs exhibit a negative association (blue circle in Fig. 4A) with the cisplatin response. A detailed list of hit sgRNAs is displayed on the right side of the panel. Hit selection can be optimized by changing the p-value cutoff or sample separation option (i.e., median or quartile). Specific association patterns between hit sgRNAs and cisplatin can be displayed as box plots or scatter plots (Figs. 4B and 4C).

Fig. 4

Graphical interface of cross-association analysis between datasets using cell lines.

(A) The panel of cross-associations displaying the predictivity and descriptivity scores of all data points. The list on the right side shows hits with significant P values. (B and C) Box plot and scatter plot of a selected hit from the cross-association panel. Box plots and scatter plots are also available for patient sample analyses.

The predictivity and descriptivity measures from the cross-association calculation were reported to be useful for the systematic evaluation of targets and biomarkers from multiomics data (Jeong et al., 2020). Q-omics software provides a simple and easy interface for calculating and analyzing the cross-association between any data pair, such as gene expression, mutations, sh/sgRNA screening data and drug screening data, from diverse resources. Q-omics also provides smart search results related to the user’s query in the cross-association analysis. The software retrieves from the knowledge base diverse association patterns that are statistically significant for the user’s query and assists users in the optimal selection of data pairs and visualization. In summary, Q-omics is an innovative software program that enables users to carry out data mining and customized visualization without computational skills. The smart system of the software assists in the real-time identification of new data pairs associated with the user’s interests. This software combines the advantages of stand-alone software and web-based applications. Several discovery projects using this software are ongoing, and the results will be published in the near future.

29 in total

1. Association of BRCA1 and BRCA2 mutations with survival, chemotherapy sensitivity, and gene mutator phenotype in patients with ovarian cancer.

Authors: Da Yang; Sofia Khan; Yan Sun; Kenneth Hess; Ilya Shmulevich; Anil K Sood; Wei Zhang
Journal: JAMA Date: 2011-10-12 Impact factor: 56.272

2. The Cancer Genome Atlas Pan-Cancer analysis project.

Authors: John N Weinstein; Eric A Collisson; Gordon B Mills; Kenna R Mills Shaw; Brad A Ozenberger; Kyle Ellrott; Ilya Shmulevich; Chris Sander; Joshua M Stuart
Journal: Nat Genet Date: 2013-10 Impact factor: 38.330

3. Next-generation characterization of the Cancer Cell Line Encyclopedia.

Authors: Mahmoud Ghandi; Franklin W Huang; Judit Jané-Valbuena; Gregory V Kryukov; Christopher C Lo; E Robert McDonald; Jordi Barretina; Ellen T Gelfand; Craig M Bielski; Haoxin Li; Kevin Hu; Alexander Y Andreev-Drakhlin; Jaegil Kim; Julian M Hess; Brian J Haas; François Aguet; Barbara A Weir; Michael V Rothberg; Brenton R Paolella; Michael S Lawrence; Rehan Akbani; Yiling Lu; Hong L Tiv; Prafulla C Gokhale; Antoine de Weck; Ali Amin Mansour; Coyin Oh; Juliann Shih; Kevin Hadi; Yanay Rosen; Jonathan Bistline; Kavitha Venkatesan; Anupama Reddy; Dmitriy Sonkin; Manway Liu; Joseph Lehar; Joshua M Korn; Dale A Porter; Michael D Jones; Javad Golji; Giordano Caponigro; Jordan E Taylor; Caitlin M Dunning; Amanda L Creech; Allison C Warren; James M McFarland; Mahdi Zamanighomi; Audrey Kauffmann; Nicolas Stransky; Marcin Imielinski; Yosef E Maruvka; Andrew D Cherniack; Aviad Tsherniak; Francisca Vazquez; Jacob D Jaffe; Andrew A Lane; David M Weinstock; Cory M Johannessen; Michael P Morrissey; Frank Stegmeier; Robert Schlegel; William C Hahn; Gad Getz; Gordon B Mills; Jesse S Boehm; Todd R Golub; Levi A Garraway; William R Sellers
Journal: Nature Date: 2019-05-08 Impact factor: 49.962

4. The NCI Transcriptional Pharmacodynamics Workbench: A Tool to Examine Dynamic Expression Profiling of Therapeutic Response in the NCI-60 Cell Line Panel.

Authors: Anne Monks; Yingdong Zhao; Curtis Hose; Hossein Hamed; Julia Krushkal; Jianwen Fang; Dmitriy Sonkin; Alida Palmisano; Eric C Polley; Laura K Fogli; Mariam M Konaté; Sarah B Miller; Melanie A Simpson; Andrea Regier Voth; Ming-Chung Li; Erik Harris; Xiaolin Wu; John W Connelly; Annamaria Rapisarda; Beverly A Teicher; Richard Simon; James H Doroshow
Journal: Cancer Res Date: 2018-10-24 Impact factor: 12.701

5. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity.

Authors: Jordi Barretina; Giordano Caponigro; Nicolas Stransky; Kavitha Venkatesan; Adam A Margolin; Sungjoon Kim; Christopher J Wilson; Joseph Lehár; Gregory V Kryukov; Dmitriy Sonkin; Anupama Reddy; Manway Liu; Lauren Murray; Michael F Berger; John E Monahan; Paula Morais; Jodi Meltzer; Adam Korejwa; Judit Jané-Valbuena; Felipa A Mapa; Joseph Thibault; Eva Bric-Furlong; Pichai Raman; Aaron Shipway; Ingo H Engels; Jill Cheng; Guoying K Yu; Jianjun Yu; Peter Aspesi; Melanie de Silva; Kalpana Jagtap; Michael D Jones; Li Wang; Charles Hatton; Emanuele Palescandolo; Supriya Gupta; Scott Mahan; Carrie Sougnez; Robert C Onofrio; Ted Liefeld; Laura MacConaill; Wendy Winckler; Michael Reich; Nanxin Li; Jill P Mesirov; Stacey B Gabriel; Gad Getz; Kristin Ardlie; Vivien Chan; Vic E Myer; Barbara L Weber; Jeff Porter; Markus Warmuth; Peter Finan; Jennifer L Harris; Matthew Meyerson; Todd R Golub; Michael P Morrissey; William R Sellers; Robert Schlegel; Levi A Garraway
Journal: Nature Date: 2012-03-28 Impact factor: 49.962

6. Systematic identification of genomic markers of drug sensitivity in cancer cells.

Authors: Mathew J Garnett; Elena J Edelman; Sonja J Heidorn; Chris D Greenman; Anahita Dastur; King Wai Lau; Patricia Greninger; I Richard Thompson; Xi Luo; Jorge Soares; Qingsong Liu; Francesco Iorio; Didier Surdez; Li Chen; Randy J Milano; Graham R Bignell; Ah T Tam; Helen Davies; Jesse A Stevenson; Syd Barthorpe; Stephen R Lutz; Fiona Kogera; Karl Lawrence; Anne McLaren-Douglas; Xeni Mitropoulos; Tatiana Mironenko; Helen Thi; Laura Richardson; Wenjun Zhou; Frances Jewitt; Tinghu Zhang; Patrick O'Brien; Jessica L Boisvert; Stacey Price; Wooyoung Hur; Wanjuan Yang; Xianming Deng; Adam Butler; Hwan Geun Choi; Jae Won Chang; Jose Baselga; Ivan Stamenkovic; Jeffrey A Engelman; Sreenath V Sharma; Olivier Delattre; Julio Saez-Rodriguez; Nathanael S Gray; Jeffrey Settleman; P Andrew Futreal; Daniel A Haber; Michael R Stratton; Sridhar Ramaswamy; Ultan McDermott; Cyril H Benes
Journal: Nature Date: 2012-03-28 Impact factor: 49.962

7. Computational correction of copy number effect improves specificity of CRISPR-Cas9 essentiality screens in cancer cells.

Authors: Robin M Meyers; Jordan G Bryan; James M McFarland; Barbara A Weir; Ann E Sizemore; Han Xu; Neekesh V Dharia; Phillip G Montgomery; Glenn S Cowley; Sasha Pantel; Amy Goodale; Yenarae Lee; Levi D Ali; Guozhi Jiang; Rakela Lubonja; William F Harrington; Matthew Strickland; Ting Wu; Derek C Hawes; Victor A Zhivich; Meghan R Wyatt; Zohra Kalani; Jaime J Chang; Michael Okamoto; Kimberly Stegmaier; Todd R Golub; Jesse S Boehm; Francisca Vazquez; David E Root; William C Hahn; Aviad Tsherniak
Journal: Nat Genet Date: 2017-10-30 Impact factor: 38.330

8. Anticancer Drug Response Prediction in Cell Lines Using Weighted Graph Regularized Matrix Factorization.

Authors: Na-Na Guan; Yan Zhao; Chun-Chun Wang; Jian-Qiang Li; Xing Chen; Xue Piao
Journal: Mol Ther Nucleic Acids Date: 2019-06-04

9. Genomics of Drug Sensitivity in Cancer (GDSC): a resource for therapeutic biomarker discovery in cancer cells.

Authors: Wanjuan Yang; Jorge Soares; Patricia Greninger; Elena J Edelman; Howard Lightfoot; Simon Forbes; Nidhi Bindal; Dave Beare; James A Smith; I Richard Thompson; Sridhar Ramaswamy; P Andrew Futreal; Daniel A Haber; Michael R Stratton; Cyril Benes; Ultan McDermott; Mathew J Garnett
Journal: Nucleic Acids Res Date: 2012-11-23 Impact factor: 16.971