
Network of Cancer Genes: a web resource to analyze duplicability, orthology and network properties of cancer genes.

Adnan S Syed, Matteo D'Antonio, Francesca D Ciccarelli.

Abstract

The Network of Cancer Genes (NCG) collects and integrates data on 736 human genes that are mutated in various types of cancer. For each gene, NCG provides information on duplicability, orthology, evolutionary appearance and topological properties of the encoded protein in a comprehensive version of the human protein-protein interaction network. NCG also stores information on all primary interactors of cancer proteins, thus providing a complete overview of 5357 proteins that constitute direct and indirect determinants of human cancer. With the constant delivery of results from the mutational screenings of cancer genomes, NCG represents a versatile resource for retrieving detailed information on particular cancer genes, as well as for identifying common properties of precompiled lists of cancer genes. NCG is freely available at: http://bio.ifom-ieo-campus.it/ncg.

Year: 2009 PMID: 19906700 PMCID: PMC2808873 DOI: 10.1093/nar/gkp957

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Cancer is a genetic disease caused by the accumulation of deleterious modifications within the genome of somatic cells (1). During tumorigenesis, genomic instability leads to the progressive acquisition of silent (‘passenger’) and selected (‘driver’) mutations (2). The latter provide cancer cells with selective growth advantages that initiate clonal expansion (3). The Cancer Genome Project (CGP) has the ambitious goal of identifying all genes that are implicated in the development of cancer (4). The Cancer Gene Census (CGC) is a part of CGP and collects information on more than 370 genes whose mutations are causally related to cancer (5). Recently, high-throughput mutational screenings of several cancer types have been promoted with the aim of identifying mutated genes, without any hypothesis-driven bias. So far, four of these high-throughput experiments have been delivered. Overall, they identified 380 Candidate Cancer Genes (CAN-genes) that are mutated in breast, colorectal, pancreatic cancers and glioblastoma (6–8). Furthermore, the pilot experiment from the Tumor Sequencing Project identified 26 genes (TSP-genes) mutated in lung adenocarcinoma (9). Altogether, these studies revealed that the number of cancer genes is surprisingly high and they are functionally more heterogeneous than previously thought. Despite this functional heterogeneity, cancer genes tend to share ‘systems-level properties’ (10), such as higher connectivity and lower duplicability when compared to the rest of human genes (11–13). The presence of shared properties, which are not strictly dependent on the gene function, indicates that cancer genes are fragile components of the human gene repertoire. A number of databases have been set up over the years to collect and organize several types of information related to cancer, such as somatic mutations of cancer genes (14), experimental evidence for their involvement in cancer (15,16) or modifications in gene expression levels (17,18). 
Other databases specialize in particular types of cancer (19,20), in single genes (21,22) or in specific genomic modifications (23,24). None of the available resources, however, focuses on properties of cancer genes that are not strictly dependent on their function but that could help in interpreting cancer as a ‘systems disease’. Here, we present the Network of Cancer Genes (NCG, http://bio.ifom-ieo-campus.it/ncg), a database that stores information on systems-level properties of a comprehensive dataset of more than 730 cancer genes. The collected features are duplicability, evolutionary appearance and topological properties in the human protein–protein interaction network. Protein interactions have been successfully used to infer functional links between proteins (25). In NCG, they are used to understand how the topological properties of cancer proteins inside the protein–protein interaction network influence their role in cancer. NCG can be used to retrieve information on specific cancer genes, as well as to identify groups of cancer genes with identical properties, thus providing a flexible tool for investigating the complex landscape of cancer genetic determinants. In this paper, along with a general description of the features of NCG, we also provide a specific example of how NCG can be used by reporting the properties of PTEN, a tumor suppressor gene coding for a phosphatidylinositol phosphatase that is impaired in several cancer types.

MATERIALS AND METHODS

Dataset of human cancer genes

We define cancer genes as a collection of 736 genes that are mutated in different cancer types and derive from two different data sources. A total of 375 genes come from the Cancer Gene Census (CGC-genes, December 2008), a manually curated list of genes with at least two independent reports of mutations in primary tumors (5). The census provides information on the tumor type, as well as on the genetic effect of the mutation, i.e. whether the mutation is dominant or recessive (Figure 1A and B). A further 396 genes derive from high-throughput mutational screenings performed in glioblastoma (7), breast and colorectal (6), pancreatic (8) cancers (CAN-genes, Figure 1C) and lung adenocarcinoma (TSP-genes) (9). Because some genes appear in both sources, the non-redundant total is 736 rather than the sum of the two lists. CAN- and TSP-genes result from the effort of massively sequencing the cancer gene repertoire (26), and provide the first unbiased mutational screenings in different cancer types. The lists of literature-curated and high-throughput derived cancer genes show poor overlap (Figure 1D), confirming the cancer-specificity of the mutational landscape (27). We gather the protein sequences associated with the 736 cancer genes from the RefSeq database [March 2009, (28)]. For eight genes no RefSeq entry is available and the Ensembl protein sequence (29) is used instead.

Figure 1.

Cancer genes collected in NCG. Venn diagrams of the different lists of cancer genes stored in NCG. The Cancer Gene Census provides information on the cancer type (A) and on the phenotypic effect of the mutation (B). The CAN-genes reported so far refer to four cancer types (C). The overlap among the different data sources used in this study is overall very poor (D).

Gene duplicability

We define gene duplicability as in Rambaldi et al. (11). In brief, we first align the protein sequences of all human genes to the human genome reference assembly (hg18), using BLAT (30). We then retrieve the best hit of each gene, defined as the locus on the genome with the highest score in terms of coverage. By default, all genes with additional genomic matches that cover at least 60% of the query length are considered duplicable, while genes with no additional hits above this threshold are considered singletons (11). In addition to the results at the default threshold of 60%, we also provide the possibility of inspecting additional hits of the same gene that cover a higher or lower percentage of the original protein length. For each duplicated locus, we refer to the genome annotation provided by the UCSC Table Browser (31) to assess whether it corresponds to a known gene or to a non-genic region.
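The thresholding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NCG pipeline: the `Hit` record, the function name and the toy coverage values are hypothetical, and the BLAT alignment step itself is assumed to have been run upstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    locus: str       # label of the genomic locus (hypothetical naming)
    coverage: float  # fraction of the query protein length covered, 0..1


def classify_duplicability(hits, threshold=0.60):
    """Return (label, best_hit, additional_hits) for one gene.

    The best hit is the locus with the highest coverage; any other hit
    covering at least `threshold` of the query makes the gene duplicable.
    """
    if not hits:
        raise ValueError("expected at least one hit (the gene's own locus)")
    ranked = sorted(hits, key=lambda h: h.coverage, reverse=True)
    best, rest = ranked[0], ranked[1:]
    extra = [h for h in rest if h.coverage >= threshold]
    label = "duplicable" if extra else "singleton"
    return label, best, extra


# Illustrative values only; they are not taken from the database.
toy_hits = [
    Hit("own_locus", 1.00),
    Hit("paralog_A", 0.97),
    Hit("fragment_B", 0.10),
]
label_60, best_hit, extra_60 = classify_duplicability(toy_hits)
label_10, _, extra_10 = classify_duplicability(toy_hits, threshold=0.10)
```

Lowering the threshold from 0.60 to 0.10 exposes the short additional hit as well, which mirrors the option of inspecting hits at coverage levels other than the default.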

Orthology assignment and evolutionary appearance

We derive the orthology relationships from the eggNOG database (32). Based on these relationships, we assign the evolutionary appearance of each cancer gene, defined as the deepest branch of the tree of life where an ortholog for that gene can be found. Overall, we divide the tree of life into seven main branches: Last Common Ancestor (LCA), which identifies the ancestral cellular organism, Eukaryotes, Opisthokonts, Metazoans, Vertebrates, Mammals and Primates. For example, a human gene whose orthologs are traceable in prokaryotes is considered to have appeared in the LCA, while a human gene with orthologs only in fungi and metazoans, but not in plants, is assumed to have originated with Opisthokonts. Depending on the number of paralogs of a given cancer gene at each branch, we also derive the corresponding orthology ratio, defined as the number of co-orthologs of that human gene in a given lineage. This ratio provides a useful indication of the number of intra-lineage duplications that the gene underwent during evolution. The orthology ratio can be 1 to 1, when no duplications occurred; 1 to N, indicating a one-to-many relationship; N to 1, corresponding to a many-to-one relationship; or N to N, when multiple duplications occurred during evolution.
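As a minimal sketch of the two definitions above, the following Python code assigns the evolutionary appearance and labels the orthology ratio. The function names are hypothetical, the mapping from species to branches is assumed to have been done beforehand, and treating the first count as the human side of the ratio is an assumption, since the text does not state the direction.

```python
# Branches ordered from deepest (oldest) to most recent, as listed in the text.
BRANCHES = ["LCA", "Eukaryotes", "Opisthokonts", "Metazoans",
            "Vertebrates", "Mammals", "Primates"]


def evolutionary_appearance(branches_with_orthologs):
    """Return the deepest branch in which an ortholog was found, or None."""
    found = set(branches_with_orthologs)
    unknown = found - set(BRANCHES)
    if unknown:
        raise ValueError(f"unknown branches: {sorted(unknown)}")
    for branch in BRANCHES:  # scan from the oldest branch onwards
        if branch in found:
            return branch
    return None


def orthology_ratio(n_human, n_other):
    """Label the co-orthology relationship between two lineages.

    Assumption: `n_human` counts the human co-orthologs and `n_other`
    the co-orthologs in the compared lineage.
    """
    if n_human < 1 or n_other < 1:
        raise ValueError("both counts must be at least 1")
    left = "1" if n_human == 1 else "N"
    right = "1" if n_other == 1 else "N"
    return f"{left} to {right}"


# Orthologs in fungi and metazoans but not in plants: appearance with Opisthokonts.
appearance = evolutionary_appearance({"Opisthokonts", "Metazoans", "Vertebrates"})
ratio = orthology_ratio(1, 2)
```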

Protein interaction network

In order to gather the most complete representation of the human protein–protein interaction network, we integrate information from five resources: the Human Protein Reference Database (HPRD) (33), BioGRID (34), IntAct (35), the Molecular INTeraction database (MINT) (36) and the Database of Interacting Proteins (DIP) (37). We only consider primary data on interactions between human proteins, i.e. putative interactions inferred from orthology relationships are discarded. The resulting non-redundant network is composed of 68 498 interactions among 11 988 proteins, derived from 19 886 independent literature reports (Table 1). Overall, we find 4621 human proteins that interact with cancer proteins. To provide a complete view of the network of cancer proteins, NCG also allows retrieving information on the systems-level properties for all these primary interactors.

Table 1.

Integration of protein–protein interaction data

Database	Version	Proteins	Interactions	Independent reports
HPRD (33)	1 September 2007	8697	34 938	17 770
BioGRID (34)	1 February 2009	7163	23 588	8815
IntAct (35)	23 January 2009	7066	22 119	1374
MINT (36)	5 February 2009	5151	12 653	1210
DIP (37)	26 January 2009	1108	1326	739
NCG	21 June 2009	11 988	68 498	19 886

Data from five different sources are integrated in NCG. To derive a non-redundant version of the network, proteins are counted as number of non-redundant Entrez IDs. The number of interactions refers to non-redundant primary interactions; the independent reports refer to the number of published papers that define the interactions.

Given the poor overlap between the five data sets (Table 1), their integration allows a more complete coverage of the real interactions for each human protein. For example, the protein TP53 has a total of 408 interactions in NCG, 237 of which derive from HPRD, 214 from BioGRID, 159 from IntAct, 122 from MINT and 38 from DIP. The primary interaction network for each cancer protein is visualized using Medusa (38) and all interactors are provided with information on their duplicability, orthology, evolutionary appearance and possible involvement in cancer.
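The integration step can be illustrated with a short Python sketch. It assumes that each source has already been reduced to pairs of Entrez IDs; the function names and the toy IDs are hypothetical, and the actual parsing of the five databases is not shown.

```python
from collections import defaultdict


def integrate(sources):
    """Merge interaction lists into a non-redundant set.

    `sources` maps a database name to an iterable of (entrez_a, entrez_b)
    pairs. Each interaction is stored once, whatever the order of its two
    partners, together with the set of databases that report it.
    """
    merged = defaultdict(set)
    for name, pairs in sources.items():
        for a, b in pairs:
            key = tuple(sorted((a, b)))  # (a, b) and (b, a) are the same edge
            merged[key].add(name)
    return dict(merged)


def interactions_of(merged, protein):
    """Non-redundant interactions that involve `protein`."""
    return [pair for pair in merged if protein in pair]


# Toy data with made-up Entrez IDs.
toy_sources = {
    "HPRD": [(1, 2), (2, 3)],
    "BioGRID": [(2, 1), (3, 4)],
    "IntAct": [(1, 2)],
}
merged = integrate(toy_sources)
degree_of_2 = len(interactions_of(merged, 2))
```

The TP53 figures quoted above show the same effect: the per-source counts add up to 770 (237 + 214 + 159 + 122 + 38), whereas the non-redundant total is 408, so many interactions are reported by more than one database.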

Database description

NCG is divided into four sections: (i) the gene summary table, which allows the conversion between different gene and protein identifiers, using the Entrez ID as primary key; (ii) the duplicability table, which includes all results of the BLAT alignments on the human genome; (iii) the orthology table, which stores the orthology relationships; and (iv) the network table, which includes the network properties for each protein. The data collected in NCG are stored in a MySQL database. The web interface to interrogate the database is built in Perl.
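To make the four-table layout concrete, here is a hedged sketch that uses Python's built-in `sqlite3` module as a stand-in for MySQL. All table names, column names and rows are hypothetical; the real NCG schema is not described in the text beyond the four sections and the use of the Entrez ID as primary key.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative, not NCG's own.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene_summary (
    entrez_id INTEGER PRIMARY KEY,   -- key shared by all four tables
    symbol    TEXT NOT NULL
);
CREATE TABLE duplicability (
    entrez_id INTEGER REFERENCES gene_summary(entrez_id),
    hit_locus TEXT,
    coverage  REAL                   -- fraction of the protein covered by the hit
);
CREATE TABLE orthology (
    entrez_id INTEGER REFERENCES gene_summary(entrez_id),
    branch    TEXT,
    ratio     TEXT
);
CREATE TABLE network (
    entrez_id INTEGER REFERENCES gene_summary(entrez_id),
    degree    INTEGER,
    is_hub    INTEGER                -- 1 for hubs, 0 otherwise
);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO gene_summary VALUES (?, ?)",
                 [(1, "GENE_A"), (2, "GENE_B"), (3, "GENE_C")])
conn.executemany("INSERT INTO duplicability VALUES (?, ?, ?)",
                 [(1, "locus_x", 0.97), (2, "locus_y", 0.30)])
conn.executemany("INSERT INTO network VALUES (?, ?, ?)",
                 [(1, 35, 1), (2, 40, 1), (3, 2, 0)])

# Combined criteria: hubs with no additional hit at or above 60% coverage.
query = """
SELECT g.symbol
FROM gene_summary AS g
JOIN network AS n USING (entrez_id)
WHERE n.is_hub = 1
  AND NOT EXISTS (
      SELECT 1 FROM duplicability AS d
      WHERE d.entrez_id = g.entrez_id AND d.coverage >= 0.60
  )
ORDER BY g.symbol
"""
rows = conn.execute(query).fetchall()
```

Keying every table on the same identifier is what lets a single query combine duplicability and network criteria in this way.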

RESULTS AND DISCUSSION

Information retrieval

NCG allows retrieving information on cancer genes in three ways: (i) by using different types of identifiers, such as gene symbols, Entrez IDs (39), RefSeq (40) or Ensembl IDs (29), for specific genes or groups of genes of interest; (ii) by selecting precompiled lists of cancer genes; and (iii) by combining different criteria to analyze genes with similar duplicability, orthology and network properties. The primary output of the query is a summary table that provides links to several external databases, such as Entrez (www.ncbi.nlm.nih.gov/Entrez/), HPRD (http://www.hprd.org/), OMIM (http://www.ncbi.nlm.nih.gov/omim/) (41), RefSeq (http://www.ncbi.nlm.nih.gov/RefSeq/) and Ensembl (http://www.ensembl.org/), as well as to detailed reports on duplicability, orthology and network properties.

Duplicability of cancer genes

In accordance with our previous report (11), at 60% coverage we find 104 duplicable cancer genes (14.1% of the total), which are associated with 336 duplicated loci. According to the available genome annotation, 44% of these additional hits correspond to known genes, 15% to more than one gene and 41% to non-genic regions. Only 22% of duplicable cancer genes duplicate in loci with no evidence of transcription, indicating that, although our measure of duplicability is based on direct genome comparison, it mostly detects transcribed paralogs. In the case of the tumor suppressor gene PTEN, we find an almost identical duplicate (97% coverage, 98% identity) corresponding to PTENP1 (Figure 2A). While the activity of PTEN as a repressor of the AKT pathway is well documented (42,43), PTENP1 is known to be transcribed as a processed pseudogene (44,45), but its involvement in cancer has never been reported. At 10% coverage, an additional hit is found, which involves the last 50 amino acids of PTEN and matches the intronic region between exons 3 and 4 of ANKFN1 (ankyrin-repeat and fibronectin type III domain containing 1, Figure 2A).

Figure 2.

Duplicability, orthology and network properties of the tumor suppressor gene PTEN. (A) Using the PTEN protein sequence as a query, three hits are found on the human genome. The best hit corresponds to the genomic locus of PTEN, while the two additional hits correspond to a recent duplication that gave rise to the processed pseudogene PTENP1 and to a short region of identity lying in an intron of ANKFN1, respectively. (B) The orthology ratio reflects the co-orthology relationships of human PTEN at different branching points of the tree of life. The only inparalogs of PTEN in eukaryotes are found in A. thaliana and D. rerio, indicating that this gene maintained a strict singleton status during eukaryotic evolution. (C) PTEN interacts with 35 other human proteins, four of which are cancer proteins and 22 are hubs. This makes PTEN a central node of the human protein-protein interaction network.

Orthologs and evolutionary appearance of cancer genes

NCG collects orthology information for 723 out of 736 cancer genes (98.2%), since 13 genes are not present in the eggNOG database. We find that 61% of cancer genes originated very early in evolution, because their orthologs can be traced back either to the LCA or to early Eukaryotes. As few as 2.5% of cancer genes appeared with Opisthokonts, 17.7% with Metazoans, 15% with Vertebrates and only the remaining 3.8% with Mammals and Primates. These results are consistent with previous reports, which show that disease genes are overall depleted in recent genes (46). As expected for an enzyme-coding gene, orthologs of PTEN are detectable in all branches of the tree of life, including prokaryotes, where they belong to the inclusive orthologous group of the tyrosine phosphatases. With the exception of Danio rerio and Arabidopsis thaliana, whose genomes underwent whole-genome duplications (47,48), orthologs of PTEN maintain a strict 1:1 relationship in all eukaryotic branches (Figure 2B). This suggests an early differentiation of PTEN and the maintenance of a strict singleton status during eukaryote evolution.

Network properties of cancer proteins

For each of the 579 cancer proteins with available network information (78.7% of the total), we calculate the degree, i.e. the number of interactions, the clustering coefficient, i.e. the number of interactions between primary interactors, and the betweenness, i.e. the number of shortest paths crossing the protein. These parameters return a measure of connectivity, interconnectivity and centrality, thus providing a glimpse of the protein topology in the network. On the basis of the network degree, we discriminate between hubs and non-hubs, where the former are defined as the top 5% most connected proteins in the network. Likewise, we identify the central nodes of the network, defined as the proteins with the top 5% values of betweenness. Overall, we find 619 human hubs, 78 of which are cancer proteins (13.4% of the total set). This result is comparable to previous reports and confirms that cancer proteins are enriched in hubs when compared to the rest of human proteins (11,12). We also observe that cancer proteins have higher betweenness than the rest of human proteins (P-value <2.2e-16, Wilcoxon test), confirming that they occupy a central position in the network. PTEN has 35 interactors overall, 21 of which are hubs. This, together with a high betweenness value, makes PTEN a central node that acts as a bypass between several hubs inside the human protein–protein interaction network (Figure 2C). Four of the PTEN interactors are themselves cancer proteins: PTEN interacts with TP53 through phosphatase-dependent and -independent mechanisms (49); it is involved in the phosphorylation of ADAM29 (50); it attenuates the activity of the tyrosine kinase receptor PDGFRB (51); and, finally, it is phosphorylated by the serine/threonine kinase STK11 (52). This confirms the tendency of cancer proteins to interact with other cancer proteins, indicating that different components of the key biological processes can contribute to tumorigenesis (11).
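The three network measures and the top 5% rule can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the code behind NCG: function names and the toy graph are hypothetical, the count of links between neighbours follows the wording of the text (the clustering coefficient as commonly defined additionally divides this count by the number of possible neighbour pairs), betweenness is computed with the Brandes algorithm for unweighted graphs, and keeping ties at the cutoff is an assumption.

```python
import math
from collections import defaultdict, deque
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets; self-interactions are ignored (assumption)."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def degree(adj):
    return {v: len(nb) for v, nb in adj.items()}


def neighbour_links(adj):
    """Number of interactions between the primary interactors of each node."""
    return {v: sum(1 for a, b in combinations(nb, 2) if b in adj[a])
            for v, nb in adj.items()}


def betweenness(adj):
    """Shortest-path betweenness (Brandes), unweighted and undirected."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return {v: b / 2 for v, b in bc.items()}  # each pair was counted twice


def top_fraction(scores, fraction=0.05):
    """Nodes ranking in the top `fraction`; ties at the cutoff are kept."""
    k = max(1, math.ceil(fraction * len(scores)))
    cutoff = sorted(scores.values(), reverse=True)[k - 1]
    return {v for v, s in scores.items() if s >= cutoff}


# Toy network: node "C" bridges "d" to the linked pair "a" and "b".
adj = build_adjacency([("C", "a"), ("C", "b"), ("C", "d"), ("a", "b")])
deg = degree(adj)
links = neighbour_links(adj)
btw = betweenness(adj)
hubs = top_fraction(deg)
central = top_fraction(btw)
```

In the toy network, "C" is both the only hub and the only central node, because every shortest path between "d" and the other two nodes has to cross it.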

FUTURE PERSPECTIVES

In the coming years, we will witness a continuous delivery of data from the Cancer Genome Project as well as from other large-scale mutational screenings of cancer genes. This massive quantity of information will require ad hoc tools for data organization and mining. NCG represents a first attempt in the direction of a systematic analysis of cancer genes, and it will be constantly updated and expanded with the delivery of new data.

FUNDING

Associazione Italiana Ricerca sul Cancro; Fondazione Cariplo (to F.D.C.). Funding for open access charge: Associazione Italiana Ricerca sul Cancro; Fondazione Cariplo.

Conflict of interest statement. None declared.

51 in total

1. The Catalogue of Somatic Mutations in Cancer (COSMIC).

Authors: S A Forbes; G Bhamra; S Bamford; E Dawson; C Kok; J Clements; A Menzies; J W Teague; P A Futreal; M R Stratton
Journal: Curr Protoc Hum Genet Date: 2008-04

2. PTEN tumor suppressor associates with NHERF proteins to attenuate PDGF receptor signaling.

Authors: Yoko Takahashi; Fabiana C Morales; Erica L Kreimann; Maria-Magdalena Georgescu
Journal: EMBO J Date: 2006-02-02 Impact factor: 11.598

3. Mapping the cancer genome. Pinpointing the genes involved in cancer will help chart a new course across the complex landscape of human malignancies.

Authors: Francis S Collins; Anna D Barker
Journal: Sci Am Date: 2007-03 Impact factor: 2.142

4. Patterns of somatic mutation in human cancer genomes.

Authors: Christopher Greenman; Philip Stephens; Raffaella Smith; Gillian L Dalgliesh; Christopher Hunter; Graham Bignell; Helen Davies; Jon Teague; Adam Butler; Claire Stevens; Sarah Edkins; Sarah O'Meara; Imre Vastrik; Esther E Schmidt; Tim Avis; Syd Barthorpe; Gurpreet Bhamra; Gemma Buck; Bhudipa Choudhury; Jody Clements; Jennifer Cole; Ed Dicks; Simon Forbes; Kris Gray; Kelly Halliday; Rachel Harrison; Katy Hills; Jon Hinton; Andy Jenkinson; David Jones; Andy Menzies; Tatiana Mironenko; Janet Perry; Keiran Raine; Dave Richardson; Rebecca Shepherd; Alexandra Small; Calli Tofts; Jennifer Varian; Tony Webb; Sofie West; Sara Widaa; Andy Yates; Daniel P Cahill; David N Louis; Peter Goldstraw; Andrew G Nicholson; Francis Brasseur; Leendert Looijenga; Barbara L Weber; Yoke-Eng Chiew; Anna DeFazio; Mel F Greaves; Anthony R Green; Peter Campbell; Ewan Birney; Douglas F Easton; Georgia Chenevix-Trench; Min-Han Tan; Sok Kean Khoo; Bin Tean Teh; Siu Tsan Yuen; Suet Yi Leung; Richard Wooster; P Andrew Futreal; Michael R Stratton
Journal: Nature Date: 2007-03-08 Impact factor: 49.962

5. Entrez Gene: gene-centered information at NCBI.

Authors: Donna Maglott; Jim Ostell; Kim D Pruitt; Tatiana Tatusova
Journal: Nucleic Acids Res Date: 2006-12-05 Impact factor: 16.971

6. IntAct--open source resource for molecular interaction data.

Authors: S Kerrien; Y Alam-Faruque; B Aranda; I Bancarz; A Bridge; C Derow; E Dimmer; M Feuermann; A Friedrichsen; R Huntley; C Kohler; J Khadake; C Leroy; A Liban; C Lieftink; L Montecchi-Palazzi; S Orchard; J Risse; K Robbe; B Roechert; D Thorneycroft; Y Zhang; R Apweiler; H Hermjakob
Journal: Nucleic Acids Res Date: 2006-12-01 Impact factor: 16.971

7. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971

8. Global topological features of cancer proteins in the human interactome.

Authors: Pall F Jonsson; Paul A Bates
Journal: Bioinformatics Date: 2006-07-14 Impact factor: 6.937

9. The BioGRID Interaction Database: 2008 update.

Authors: Bobby-Joe Breitkreutz; Chris Stark; Teresa Reguly; Lorrie Boucher; Ashton Breitkreutz; Michael Livstone; Rose Oughtred; Daniel H Lackner; Jürg Bähler; Valerie Wood; Kara Dolinski; Mike Tyers
Journal: Nucleic Acids Res Date: 2007-11-13 Impact factor: 16.971

10. eggNOG: automated construction and annotation of orthologous groups of genes.

Authors: Lars Juhl Jensen; Philippe Julien; Michael Kuhn; Christian von Mering; Jean Muller; Tobias Doerks; Peer Bork
Journal: Nucleic Acids Res Date: 2007-10-16 Impact factor: 16.971
