A cancer graph: a lung cancer property graph database in Neo4j.

David Tuck.

Abstract

OBJECTIVES: A novel graph data model of non-small cell lung cancer clinical and genomic data has been constructed with two aims: (1) to provide a suitable model for facilitating graph analytics within the Neo4j framework or through tools that interact with existing Neo4j APIs; and (2) to provide a base model extensible to other cancer types and to additional datasets, such as those derived from electronic health records and other real-world sources. DATA DESCRIPTION: Clinical and genomic data from publicly available datasets and analyses based on The Cancer Genome Atlas lung cancer datasets are integrated in a novel property graph database schema and augmented with subgraphs representing a patient-patient social network derived from similarity and correlation, as well as individual-based biological networks.
© 2022. The Author(s).

Keywords:  Non-small cell lung cancer; Property graph database; The Cancer Genome Atlas

Year:  2022        PMID: 35164854      PMCID: PMC8842806          DOI: 10.1186/s13104-022-05912-9

Source DB:  PubMed          Journal:  BMC Res Notes        ISSN: 1756-0500


Objective

The pathobiology of cancer involves the coordinated dysregulation of multiple processes across molecular, cellular, tissue, and organism scales [1]. Somatic mutations and genomic aberrations are intertwined with an individual patient’s clinical, social, and medical histories. The complex interrelationships among all of these factors determine disease origin, trajectory, and outcomes of interventions [2]. Strategies that operate directly on the topology of the graph structures defined by these relationships are enabled by the development and growing maturity of native graph databases such as TigerGraph and Neo4j [3-5]. This note describes a representation of non-small cell lung cancer in Neo4j, a property graph database platform which natively stores and processes graph data models. A version is available in a GPL3-licensed open-source community edition [6]. Non-small cell lung cancer is the most common cause of cancer deaths worldwide [7, 8], and it has plentiful publicly available genomic, clinical, and molecular data [9-11]. The lung cancer graph database provides an analytic framework for integrating genome-scale modeling of disease mechanisms with clinical data from electronic health records, diagnostic studies, therapeutic interventions, and molecular assays. This project uses several publicly available open data sources and extends these with calculated variables defining relationships to create a novel graph schema and a nested set of subgraphs comprising the Neo4j database. Clinical, demographic, diagnostic, therapeutic, and multiple genomic measures are obtained from the TCGA LUAD and LUSC datasets [9, 10].
Multiple analyses have extended the available attributes with immunologic and biologic signaling pathway profiles [11-15], enabling the creation of a graph structure at different scales based on relations among cancer cases, relations among biological molecules, and relations of biological networks and processes within individual patients. By adopting graph database technology, this data resource aims to provide a platform to explore the utility of integrative graph-based systems biology analyses to decode the molecular and clinical underpinnings of complex diseases.

Data description

All data files and datasets are deposited in the Harvard Dataverse repository in the dataset “A Cancer Graph: A Lung Cancer property graph using Neo4j” [16]. A file containing the entire graph database is provided as a binary database dump (Data file 1 in Table 1). The schema for the property graph is provided as a graphic image (Data file 3) and as a JSON file (Data file 6) listing all entities, attributes, and the relationships among the different entities. The Cypher files (Data files 4 and 5) contain the commands for generating documentation of the schema and indexes and for loading the binary file into a Neo4j instance. They also provide example commands in Cypher, the query language used by Neo4j, that describe how the database was originally generated from input files. These individual input data source files are provided as comma-separated value files (Data set 1 in Table 1).
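As a rough illustration of how nodes can be generated from comma-separated input files, the sketch below assembles a Cypher `LOAD CSV` statement in Python. It is not taken from the deposited loader file: the file name `cases.csv` and the key property `case_id` are hypothetical, and only the `CancerCase` label is named in this note. The deposited Cypher files remain the authoritative description of how the database was built.

```python
def build_load_csv(csv_name: str, label: str, key: str) -> str:
    """Return a Cypher LOAD CSV statement that upserts one node per CSV row.

    All arguments are illustrative assumptions, not names from the
    deposited loader script.
    """
    # Labels and property keys are interpolated into the query text,
    # so restrict them to plain identifiers.
    for ident in (label, key):
        if not ident.isidentifier():
            raise ValueError(f"unsafe identifier: {ident!r}")
    # The file name sits inside a single-quoted Cypher string literal.
    if "'" in csv_name or "\\" in csv_name:
        raise ValueError(f"unsafe file name: {csv_name!r}")
    return (
        f"LOAD CSV WITH HEADERS FROM 'file:///{csv_name}' AS row\n"
        f"MERGE (n:{label} {{{key}: row.{key}}})\n"
        f"SET n += row"
    )


# Hypothetical usage: one CancerCase node per row of cases.csv.
statement = build_load_csv("cases.csv", "CancerCase", "case_id")
print(statement)
```

`MERGE` on a key property keeps the load idempotent, so re-running the statement updates existing nodes instead of duplicating them. An index or uniqueness constraint on that key would normally be created first.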
Table 1. Overview of data files/data sets

| Label | Name of data file/data set | File types (file extension) | Data repository and identifier (DOI or accession number) |
|---|---|---|---|
| Data file 1 | lung-cancer-graph-neo4j-2021-07-15T022024.bin | binary dump file (bin) | Harvard Dataverse [16] https://doi.org/10.7910/DVN/RIXLG8 |
| Data file 2 | README.md | text/markdown | Harvard Dataverse [16] https://doi.org/10.7910/DVN/RIXLG8 |
| Data file 3 | ACancerGraphSchema.png | image (png) | Harvard Dataverse [16] https://doi.org/10.7910/DVN/RIXLG8 |
| Data file 4 | makeIndexesConstraints.cql | cypher (cql) | Harvard Dataverse [16] https://doi.org/10.7910/DVN/RIXLG8 |
| Data file 5 | ACancerGraphLoader.cql | cypher (cql) | Harvard Dataverse [16] https://doi.org/10.7910/DVN/RIXLG8 |
| Data file 6 | schema.json | json | Harvard Dataverse [16] https://doi.org/10.7910/DVN/RIXLG8 |
| Data set 1 | Input files (csv format) to create database | comma separated values (csv) | Harvard Dataverse [16] https://doi.org/10.7910/DVN/RIXLG8 |
| Data set 2 | Input files (csv format) to create supportive data | comma separated values (csv) | Harvard Dataverse [16] https://doi.org/10.7910/DVN/RIXLG8 |

The property graph database consists of (a) publicly available open access data of patients with non-small cell lung cancer and (b) derived variables augmented by relationships defining different subgraphs. The database contains data from > 1000 patients from The Cancer Genome Atlas (TCGA), including clinical, diagnostic, and therapeutic data (chemotherapy, radiation, immunotherapy), as well as multiple genomic measures (gene expression, somatic mutations, copy number, epigenetics). Additional attributes are derived from independent published analyses based on these data, providing signatures related to immunologic, DNA repair, and molecular portrait subtypes, and profiles from a variety of biological pathways [11-15]. The dataset also incorporates relevant portions of precedent native graph representations of biological and biomedical systems, including Hetio [17, 18] and Reactome [19], both of which use the Neo4j platform to represent complex biological networks. This existing framework is supplemented by pathway, genomic, and various calculated variables, including graph kernels, embedded vector representations of somatic gene mutations, and computed pathway activations.

The primary value of the dataset comes from calculated relationships, which create subgraphs that serve as a substrate for the exploration and application of graph algorithms [20-22]. These occur primarily at two different scales: (1) a patient-patient network with direct relationships among patients (or tumor samples) based on similarity scores or correlation for genomic features or signatures; and (2) biological networks within single patient samples.

CancerCase (patient-based) networks: These provide graphs of the relationships between patients based on similarity and correlation scores calculated for molecular signatures such as immune scores or DNA repair profiles.

Intra-patient biological signaling activation networks: InFlo [14] is a robust systems biology approach for integrative analysis of multi-omics data which can characterize complex biological signaling network activities in any given biological sample. InFlo was applied to individual samples from TCGA, including the non-small cell lung cancer samples, thus calculating a complete biological network activation state for each individual tumor sample.

In summary, a novel graph data model has been constructed integrating clinical and molecular data of non-small cell lung cancer patients with two aims: (1) to provide a graph model for facilitating graph analytics within the Neo4j framework or through tools via the Neo4j application programming interface (API); and (2) to provide an exploratory basis for extension to other tumor types or clinical datasets derived from electronic health records.
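To make the patient-patient network concrete, the following sketch derives candidate edges between cases from the correlation of their signature vectors. It rests on several assumptions: the note says only that relationships are based on "similarity scores or correlation", so Pearson correlation is used here as one plausible choice, and the case identifiers, signature values, and the 0.8 threshold are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (sd_x * sd_y)


def similarity_edges(signatures, threshold=0.8):
    """Return (case_a, case_b, score) for every pair scoring >= threshold.

    `signatures` maps a case identifier to its signature vector; the
    threshold is an illustrative assumption, not a value from the source.
    """
    edges = []
    for a, b in combinations(sorted(signatures), 2):
        score = pearson(signatures[a], signatures[b])
        if score >= threshold:
            edges.append((a, b, score))
    return edges


# Hypothetical immune-signature vectors for three cases.
signatures = {
    "case_A": [1.0, 2.0, 3.0, 4.0],
    "case_B": [2.0, 4.0, 6.0, 8.0],
    "case_C": [4.0, 3.0, 2.0, 1.0],
}
edges = similarity_edges(signatures)
print(edges)
```

Each resulting tuple could then be written to the graph as a relationship between two CancerCase nodes, with the score stored as a relationship property. Graph algorithms such as community detection can then run on the resulting subgraph.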

Limitations

The database is limited in the number of variables, which may not satisfy specific needs. TCGA is rich in omics data but relatively poor in clinical details (comorbidity, frailty assessment, specific lab results, extended pharmacy, etc.). Incorporating other sources would require additional modifications to the schema of the database.

1.  Graph kernels combined with the neural network on protein classification.

Authors:  Jiang Qiangrong; Qiu Guang
Journal:  J Bioinform Comput Biol       Date:  2019-10       Impact factor: 1.122

2.  The Immune Landscape of Cancer.

Authors:  Vésteinn Thorsson; David L Gibbs; Scott D Brown; Denise Wolf; Dante S Bortone; Tai-Hsien Ou Yang; Eduard Porta-Pardo; Galen F Gao; Christopher L Plaisier; James A Eddy; Elad Ziv; Aedin C Culhane; Evan O Paull; I K Ashok Sivakumar; Andrew J Gentles; Raunaq Malhotra; Farshad Farshidfar; Antonio Colaprico; Joel S Parker; Lisle E Mose; Nam Sy Vo; Jianfang Liu; Yuexin Liu; Janet Rader; Varsha Dhankani; Sheila M Reynolds; Reanne Bowlby; Andrea Califano; Andrew D Cherniack; Dimitris Anastassiou; Davide Bedognetti; Younes Mokrab; Aaron M Newman; Arvind Rao; Ken Chen; Alexander Krasnitz; Hai Hu; Tathiane M Malta; Houtan Noushmehr; Chandra Sekhar Pedamallu; Susan Bullman; Akinyemi I Ojesina; Andrew Lamb; Wanding Zhou; Hui Shen; Toni K Choueiri; John N Weinstein; Justin Guinney; Joel Saltz; Robert A Holt; Charles S Rabkin; Alexander J Lazar; Jonathan S Serody; Elizabeth G Demicco; Mary L Disis; Benjamin G Vincent; Ilya Shmulevich
Journal:  Immunity       Date:  2018-04-05       Impact factor: 43.474

3.  Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas.

Authors:  Joshua D Campbell; Anton Alexandrov; Jaegil Kim; Jeremiah Wala; Alice H Berger; Chandra Sekhar Pedamallu; Sachet A Shukla; Guangwu Guo; Angela N Brooks; Bradley A Murray; Marcin Imielinski; Xin Hu; Shiyun Ling; Rehan Akbani; Mara Rosenberg; Carrie Cibulskis; Aruna Ramachandran; Eric A Collisson; David J Kwiatkowski; Michael S Lawrence; John N Weinstein; Roel G W Verhaak; Catherine J Wu; Peter S Hammerman; Andrew D Cherniack; Gad Getz; Maxim N Artyomov; Robert Schreiber; Ramaswamy Govindan; Matthew Meyerson
Journal:  Nat Genet       Date:  2016-05-09       Impact factor: 38.330

4.  Systematic integration of biomedical knowledge prioritizes drugs for repurposing.

Authors:  Daniel Scott Himmelstein; Antoine Lizee; Christine Hessler; Leo Brueggeman; Sabrina L Chen; Dexter Hadley; Ari Green; Pouya Khankhanian; Sergio E Baranzini
Journal:  Elife       Date:  2017-09-22       Impact factor: 8.140

5.  InFlo: a novel systems biology framework identifies cAMP-CREB1 axis as a key modulator of platinum resistance in ovarian cancer.

Authors:  N Dimitrova; A B Nagaraj; A Razi; S Singh; S Kamalakaran; N Banerjee; P Joseph; A Mankovich; P Mittal; A DiFeo; V Varadan
Journal:  Oncogene       Date:  2016-11-07       Impact factor: 9.867

6.  Exploiting graph kernels for high performance biomedical relation extraction.

Authors:  Nagesh C Panyam; Karin Verspoor; Trevor Cohn; Kotagiri Ramamohanarao
Journal:  J Biomed Semantics       Date:  2018-01-30

7.  An overview of graph databases and their applications in the biomedical domain.

Authors:  Santiago Timón-Reina; Mariano Rincón; Rafael Martínez-Tomás
Journal:  Database (Oxford)       Date:  2021-05-18       Impact factor: 3.451

8.  Whole-genome characterization of lung adenocarcinomas lacking alterations in the RTK/RAS/RAF pathway.

Authors:  Jian Carrot-Zhang; Xiaotong Yao; Siddhartha Devarakonda; Aditya Deshpande; Jeffrey S Damrauer; Tiago Chedraoui Silva; Christopher K Wong; Hyo Young Choi; Ina Felau; A Gordon Robertson; Mauro A A Castro; Lisui Bao; Esther Rheinbay; Eric Minwei Liu; Tuan Trieu; David Haan; Christina Yau; Toshinori Hinoue; Yuexin Liu; Ofer Shapira; Kiran Kumar; Karen L Mungall; Hailei Zhang; Jake June-Koo Lee; Ashton Berger; Galen F Gao; Binyamin Zhitomirsky; Wen-Wei Liang; Meng Zhou; Sitapriya Moorthi; Alice H Berger; Eric A Collisson; Michael C Zody; Li Ding; Andrew D Cherniack; Gad Getz; Olivier Elemento; Christopher C Benz; Josh Stuart; J C Zenklusen; Rameen Beroukhim; Jason C Chang; Joshua D Campbell; D Neil Hayes; Lixing Yang; Peter W Laird; John N Weinstein; David J Kwiatkowski; Ming S Tsao; William D Travis; Ekta Khurana; Benjamin P Berman; Katherine A Hoadley; Nicolas Robine; Matthew Meyerson; Ramaswamy Govindan; Marcin Imielinski
Journal:  Cell Rep       Date:  2021-02-23       Impact factor: 9.995

9.  Comprehensive genomic characterization of squamous cell lung cancers.

Authors:  Cancer Genome Atlas Research Network
Journal:  Nature       Date:  2012-09-09       Impact factor: 49.962

10.  Comprehensive molecular profiling of lung adenocarcinoma.

Authors:  Cancer Genome Atlas Research Network
Journal:  Nature       Date:  2014-07-09       Impact factor: 49.962
