Genome wide identification of recessive cancer genes by combinatorial mutation analysis.

Stefano Volinia, Nicoletta Mascellani, Jlenia Marchesini, Angelo Veronese, Elizabeth Ormondroyd, Hansjuerg Alder, Jeff Palatini, Massimo Negrini, Carlo M Croce.

Abstract

We devised a novel procedure to identify human cancer genes acting in a recessive manner. Our strategy was to combine the contributions of the different types of genetic alterations to loss of function: amino-acid substitutions, frame-shifts and gene deletions. We studied over 20,000 genes in 3 Gigabases of coding sequences and 700 array comparative genomic hybridizations. Recessive genes were scored according to nucleotide mismatches under positive selective pressure, frame-shifts and genomic deletions in cancer. Four different tests were combined, yielding a cancer recessive p-value for each studied gene. One hundred and fifty-four candidate recessive cancer genes (p-value < 1.5×10⁻⁷, FDR = 0.39) were identified. Strikingly, the prototypical cancer recessive genes TP53, PTEN and CDKN2A all ranked in the top 0.5% of genes. The functions significantly affected by cancer mutations overlap exactly with those of known cancer genes, with the critical exception of tyrosine kinases, which are absent, as expected for a recessive gene-set.

Year:  2008        PMID: 18846217      PMCID: PMC2557123          DOI: 10.1371/journal.pone.0003380

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

A variety of approaches have been applied to the identification of cancer genes [1]. Procedures have been developed that allowed identification of genes causative of cellular transformation [2], [3], and of complex processes such as invasiveness and metastasis [4]. In vitro and in vivo methods, using cellular or animal models, led generally to the discovery of dominant cancer genes, or oncogenes. On the other hand, tumor suppressors have been discovered mainly by molecular genetics approaches. Such is the need of identifying additional tumor suppressors, or recessive cancer genes, that new tests for loss-of-function continue to be developed [5]. Many well-characterized cancer genes harbor somatic base substitutions or small insertion/deletions. For example, coding region frame-shifts and point mutations account for 75% of the somatic mutations in CDKN2A and TP53, two major tumor suppressor genes [6], [7], [8]. The oncogene B-raf, first described over 20 years ago, was also shown to be mutated in some human cancers [9], alongside PI3K and some tyrosine phosphatases [10]. Meanwhile, other cancer genes have been discovered through the phenomenon of inherited predisposition. Familial cancer is rare in comparison to non-hereditary cancer, but a number of recessive genes have been identified using linkage analysis [11], [12]. Large scale super-family sequencing projects, i.e. the kinome and phosphatome projects, followed and showed that, although missense mutations are found in some members of these two superfamilies, they are not a common ground for somatic cancer mutations. Greenman and co-workers [13] undertook comprehensive sequencing of 518 protein-kinase-encoding genes in 210 cancers. Kinases have been implicated in many aspects of tumorigenesis and several have now been validated as targets for drug therapy [14]. In their analysis of the collection of cellular kinases, the kinome, Greenman et al. [13] identified 1,000 mutations. 
Mutations were relatively common in cancers of the lung, stomach, ovary, colon and kidney, and rare in cancers of the testis and breast, and in carcinoid tumors, which are usually found in the gastrointestinal tract. Tumors with defects in DNA-mismatch repair harbored large numbers of mutations, whereas other types of tumor revealed no detectable mutations. To distinguish driver from passenger mutations, Greenman et al. used a statistical model comparing the observed-to-expected ratio of synonymous (no amino-acid change) mutations with that of non-synonymous (altered amino acid) mutations. An increased proportion of non-synonymous mutations implies selection pressure during tumorigenesis. Overall, they identified 158 predicted driver mutations in 120 kinase genes. In contrast to the recurrent mutations in BRAF in malignant melanomas [15], most kinase mutations identified across different tumor types were single hits. More recently, Wood and co-workers [16] used a different strategy, but reached similar conclusions, with the complete sequencing of 20,857 transcripts from 18,191 genes in a limited number of tumors (11 breast and 11 colon). The high number of automatically detected DNA mutations immediately raised the question of how to identify, among a potentially large number of sequence mismatches, those that are causative of cancer pathogenesis. A series of subsequent filters revealed that most of them were silent (did not result in an amino acid change) and a similar number were single nucleotide polymorphisms (SNPs). The final number of mutations that were defined as truly somatic affected more than 1000 genes. Interestingly, few common driver mutations were identified among the kinase genes in these studies. This is consistent, for example, with the finding that only 1 out of 18 members of the PI3K family had somatic mutations in cancer [17]. Interesting observations can be made from an accurate global study of the mutations reported in cancer.
Futreal et al. [18] conducted such an extended census of the literature, indicating that as many as 299 genes contribute to human cancer. However, 70% of these genes are associated with leukemias, lymphomas and mesenchymal tumors, which account for only 10% of cancer incidence. Furthermore, about 75% of those genes are associated with translocations, and at least 90% of listed cancer genes are dominant at the cellular level (i.e. activated oncogenes, fusion oncoproteins). Nevertheless, it is generally recognized that the vast majority of germline mutations resulting in cancer predisposition are recessive [18]. Thus it seems likely that most cancer genes are recessive and remain undiscovered. For these reasons we devised a novel method for the identification of candidate recessive cancer genes from genome-scale datasets. We applied this procedure to mine data from sequences and comparative genomic hybridizations. Our method takes account of the different gene inactivation modes, ranging from point mutations to whole gene deletions. The assumption underlying our investigation was that, by studying cancer genes from different mutational perspectives and combining the respective probabilities, sequencing noise and polymorphisms could be filtered out and bona fide recessive cancer genes would be identified.

Results

Harvesting candidate mutations from ESTs

In this paper, a novel method was applied to the identification of genes mutated in non-hereditary human cancers (Figure 1). The procedure gathered sequence information from the expressed sequence tag (EST) database and an appropriate algorithm was tailored to extract information from “low quality” sequence data. The procedure analyzed more than 3×10⁹ nucleotides of human coding sequence in over 5,600,000 ESTs derived from both healthy and cancerous tissues and cell lines. ESTs are potentially very valuable for mutation studies since they represent cloned single alleles, but they are also unverified sequences, with a high rate of sequencing errors [19], [20]. Therefore, in order to exploit the full potential of ESTs, we had to develop a method for the detection of bona fide “cancer” mutations in a context of frequent sequencing errors or, at best, polymorphisms. Although previous work [19] attempted to evaluate the sequencing error rate in ESTs, we followed an alternate route. Our procedure was based on the assumption that the rate of sequencing errors was constant for each human gene, at each nucleotide position. As a corollary, we assumed that the “gene/position-specific sequencing error rate” was constant across normal and cancer EST libraries. Since base composition, context and sequence are by definition constant within each human gene, we believed these assumptions were safe. The only exceptions would be tumors harboring DNA repair defects.
Figure 1

The rationale for selection of candidate recessive cancer genes.

The diagram shows the steps in the procedure for the evaluation of mutation probabilities and the data flow towards the identification of candidate recessive cancer genes. Molecular data were extracted from public databases (dbEST and GEO at NCBI, and Stanford Microarray Database). A very large number of alignments (over 4.5 million) was obtained for over 24,000 human genes from BLAST analysis of 3 Gbases of EST sequences. The alignments were parsed to extract mismatches which were deposited in the Cancer Mutome local SQL database. The mismatches were then evaluated by specific procedures to associate mutational p-values to each human gene. In parallel, almost 20,000 human genes were assayed from 744 array CGH to define their propensity to deletion in cancer. The specific mutational p-values were combined to produce a recessive cancer p-value. A genome subset of 154 genes, among which TP53, PTEN, CDKN2A and CDKN2B were present, was selected (cancer p-value < 1.5×10⁻⁷).

High sequencing noise was expected to be present in the heterogeneous EST database, and cancer is a complex multi-faceted genetic disease; therefore a single statistical test would not result in reliable selection of cancer genes. Furthermore, we wanted to focus on recessive genes, inactivated by the occurring events. Thus, to assay the different mutational modes of recessive cancer genes, we devised a number of mutational tests. The statistical tests were eventually combined to identify the genes that are often inactivated in cancer. Starting from the RefSeq human mRNA repository, 27,184 sequences (defined Queries) were aligned to more than 5.6 million human EST sequences, from 7574 different EST libraries, for a total of almost 3.0 Gbases of coding sequence. BLASTs [21] were run for each query versus the ESTs and 3,839,543 successful alignments were produced (stored in the Alignments SQL table of the Cancer Mutome database) for 24,932 human queries (Stats database table).
An average of 150 hits (high scoring pairs, HSPs, or sequences) was produced for each query (human gene or splicing variant). Quality control of the BLAST alignments was of foremost importance for our strategy. In order to minimize the mining of technical errors, we defined a stringent threshold for alignment quality (expect ≤ 1E-21) and the low quality ends of alignments were discarded. All (43,965,904) nucleotide mismatches, and gaps/insertions, were recorded in the database Mutations table. Amino acid (AA) substitutions and premature stops (33,614,754 mismatches) were then selected from the alignments (AA_Mutation table). To reduce the complexity, and the expected number of false positives, we decided to evaluate only those genes with a high number of mismatches (irrespective of the samples' cancer status). A pre-processing step based on inter-quartile range (IQR) was therefore applied and 8,972 genes (IQR higher than 0.5) were retained for further cancer mutation assays. These genes were sufficiently rich in putative mutations (mismatches) to fulfill the role of potential cancer gene candidates. The first component of our strategy was the identification of genes harboring inactivating point mutations. We evaluated the point mutations according to frequency, location, capacity to alter the amino acid sequence, and consequences on the reading frame. Our procedure was thus tailored to consider statistically all the above features of a point mutation.

Data mining for amino-acid substitutions and premature terminations

We defined pAA as the probability that a gene displays an excess of amino acid substitutions in cancer when compared to non-cancer samples. pNSSR, instead, indicates the probability that the significant amino acid substitutions in the cancer samples are under positive selection pressure. To detect short range clustering of cancer mutations, common in cancer recessive genes, and to balance out noise, i.e. sequencing errors, we chose a paired t test coupled to a sliding window. We normalized the counts of the mismatches in the two classes, cancer and control, by using a gene specific and position specific factor. Null mismatch counts were adjusted to unity, prior to normalization. The normalization values were obtained, for each gene and at each nucleotide position, as the local ratios of the sequenced nucleotides in the cancer and control samples. The paired t test (cancer vs. control, paired for codons) was applied to a sliding window with a length of 25 codons. To perform a robust assay, a codon was evaluated only when aligned at least 10 times in each class (cancer and control). Gene specific confidence limits for T scores were generated by bootstrap analysis and a threshold p-value of 0.05 was used to select the significant amino acid positions. For each human gene, a p-value (pAA) was finally associated with the sum of the peaks corresponding to the significant T scores. A sequence mismatch was recorded only once for each EST library. An over-estimation of pAA could be due to passenger mutations, such as those produced by altered DNA repair systems, prevalent in some cancers. Since passenger mutations should be randomly distributed over the genome, an additional test was therefore implemented to refine the pAA. The ratio of non-synonymous (NS) to synonymous (S) DNA mutations is a measure of the selective pressure during tumor progression, as synonymous alterations are unlikely to exert a growth advantage and will be selectively lost [17].
Furthermore, mismatches due to sequencing errors, as well as differential representation (cancer to normal differential expression), are all expected to be neutral with respect to the NS to S ratio. The codons significant for amino acid substitutions (p<0.05) were therefore assayed for positive pressure. As a proof-of-concept, the NS/S ratios in the TP53 mutated region were analyzed by paired t test (p<0.033, FDR = 0.092) and revealed higher values in cancer than in control. Thus we applied the NS to S ratio test to each gene, in cascade after that for the local mutation frequency (pAA) described above. Bootstrap was again used to define the p-values. The probability of a cancer protein having frequent amino acid changes (pAA) coupled to selective positive pressure in cancer (pNSSR), two events which are not independent, was defined as the average of the two respective p-values (pAA-NSSR).
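The windowed comparison described in this section can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the rescaling of control counts by the local coverage ratio, and the handling of zero-variance windows are hypothetical choices made for the example, and the bootstrap step that converts T scores into p-values is omitted.

```python
import math
from statistics import mean, stdev

WINDOW = 25        # codons per sliding window (value given in the text)
MIN_COVERAGE = 10  # minimum alignments per class for a codon to be evaluated


def normalized_pairs(cancer_mm, control_mm, cancer_cov, control_cov):
    """Return per-codon (cancer, control) mismatch counts on a common scale.

    Assumption: control counts are rescaled by the local coverage ratio
    cancer_cov / control_cov. Zero counts are first set to one, as in the text.
    Codons with insufficient coverage in either class yield None.
    """
    pairs = []
    for mc, mn, cc, cn in zip(cancer_mm, control_mm, cancer_cov, control_cov):
        if cc < MIN_COVERAGE or cn < MIN_COVERAGE:
            pairs.append(None)
            continue
        mc, mn = max(mc, 1), max(mn, 1)
        pairs.append((float(mc), mn * cc / cn))
    return pairs


def paired_t(pairs):
    """Paired t statistic for a list of (cancer, control) pairs.

    Simplification for this sketch: windows with fewer than two usable
    codons, or with zero variance of the differences, score 0.0.
    """
    diffs = [a - b for a, b in pairs]
    if len(diffs) < 2:
        return 0.0
    sd = stdev(diffs)
    if sd == 0:
        return 0.0
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def sliding_t_scores(pairs, window=WINDOW):
    """One T score per window start; codons marked None are skipped."""
    scores = []
    for start in range(len(pairs) - window + 1):
        usable = [p for p in pairs[start:start + window] if p is not None]
        scores.append(paired_t(usable))
    return scores


def p_aa_nssr(p_aa, p_nssr):
    """Average of the two non-independent p-values, as defined in the text."""
    return (p_aa + p_nssr) / 2
```

A positive score marks a window in which cancer libraries carry more normalized mismatches than control libraries. Turning such scores into pAA would still require the gene-specific bootstrap limits mentioned in the text.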

Data mining for frame-shifts in cancer ESTs

Having defined for each human gene a p-value for causal amino acid substitutions in sporadic cancers, we needed a corresponding index for gene inactivation due to open reading frame shifts in exons. Cancer genes can be disrupted by micro-insertions or -deletions in their coding sequence, resulting in an altered primary structure. A genome wide survey of our mismatch database indicated that single nucleotide alterations were by far the most common insertions/deletions in ESTs. We indicated with pFrameshift the probability that a gene had an excess of frame-shifts, due to single nucleotide deletions/insertions in cancer, compared to control tissues. We tested the hypothesis that these mutations were frequent in cancer genes, by studying again TP53. Our assay showed that single nucleotide frame-shifts associated to cancer were non-randomly enriched in TP53. When looking for frame-shifts induced by 1 nucleotide insertions/deletions, an analogous test to that for pAA was designed, as detailed in Experimental Procedures, to generate pFrameshift.
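As a rough illustration of what counting single-nucleotide frame-shifts involves, the Python sketch below scans one gapped pairwise alignment for gap runs of length one. The function name and the two-string alignment representation are assumptions made for the example. The statistical test that yields pFrameshift is described in Experimental Procedures and is not reproduced here.

```python
import re


def single_nt_frameshifts(query_aln, est_aln):
    """List (alignment position, kind) for every 1-nt gap run.

    query_aln and est_aln are the two rows of a gapped pairwise alignment
    (reference mRNA and EST), with '-' marking gaps. A gap in the query row
    means the EST carries an extra base (insertion); a gap in the EST row
    means the EST lacks a base (deletion). Longer gap runs are ignored here.
    """
    if len(query_aln) != len(est_aln):
        raise ValueError("aligned rows must have equal length")
    events = []
    for kind, row in (("insertion", query_aln), ("deletion", est_aln)):
        for match in re.finditer(r"-+", row):
            if match.end() - match.start() == 1:
                events.append((match.start(), kind))
    return sorted(events)
```

In a full pipeline each event would presumably be counted once per EST library, in line with the rule stated for sequence mismatches.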

Identification of deleted genes in cancer by high resolution array comparative genomic hybridization

Cancer genes can be affected in their genomic structure by large amplifications and deletions. Recessive cancer genes are expected to be deleted or otherwise inactivated and this component must be included in our mutational model. We therefore assigned to each human gene a p-value for deletion in cancer. To obtain such p-values, we compiled data from high resolution comparative genomic hybridizations of 744 tumors into the GeoSoft database. We used array CGH (aCGH), obtained from GEO (NCBI) and SMD (Stanford Microarray Database), with sufficiently high resolution to distinguish the human genes (information for samples and datasets in supplemental Table S1). Each tumor sample was compared to a healthy control sample on a two channel oligonucleotide-based platform. The human genes were evaluated in each sample by using the normalized log2 ratio (tumor over control). Different probes related to the same gene were averaged. Gene symbols were used as keys to unequivocally identify a gene within and across platforms. Data were normalized according to the providers. As a pre-processing step we reduced the assay complexity by retaining only those genes with high variability (standard deviation of log2 ratio > 0.2). Then, for each gene we computed the percentiles of the log2 ratios (only for genes measured in at least 300 samples). A gene affected by deletions in tumors would possess a low (negative) log2 ratio 5th percentile, while one with amplifications would display a high (positive) 95th percentile. Bootstrap analysis (random swap between the tumor and control channels) was used to simulate gene specific 5th and 95th percentiles. Gene specific p-values for deletions (pDeletion) were then calculated as the percentage of simulated 5th percentiles exceeding the real 5th percentiles. At this stage, we had to take into consideration two phenomena associated with aCGH but not linked to cancer: sex chromosomes and polymorphic structural copy number variations (CNVs).
The control sample in aCGHs was frequently from a male (more than 50% of aCGHs), while roughly half of the tumors were of female origin and thus lacked the Y-chromosome. Therefore the Y-chromosome genes were expected to appear as deleted, or better “pseudo-deleted”. Conversely, we expected the X chromosome genes, except for those belonging to the pseudo-autosomal region, to appear as “pseudo-amplified”. Genes located in the sex chromosomes indeed behaved as expected, as shown in detail for the pseudo-autosomal region 1 (PAR1) in Xp22 (supplemental Figure S1). Polymorphic CNVs, from normal population variability and not linked to cancer, should also lead to large fold-changes, resulting in high 95th or low 5th percentiles. However, we expected that polymorphic CNVs, not associated with cancer, would not display significant pDeletion values. In fact their 5th percentiles would not qualify as significant after the random swap simulation. CDKN2A and CDKN2B were identified as the most deleted genes in human cancers; PTEN, ATM, and TP53 were also identified as deleted (p-values < 0.001). In total, 3,374 genes were significantly deleted (p < 0.001).
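The random-swap simulation behind pDeletion can be sketched as follows. This is an illustrative Python reading of the description above, not the authors' code: swapping the tumor and control channels is modeled as flipping the sign of a sample's log2 ratio, "exceeding" is interpreted as "at least as low as the observed 5th percentile", and the function names, number of simulations and seed are hypothetical.

```python
import random
from statistics import quantiles


def percentile_5(values):
    """5th percentile: first of the 19 cut points that split data in 20 parts."""
    return quantiles(values, n=20, method="inclusive")[0]


def p_deletion(log2_ratios, n_sim=2000, seed=0):
    """Fraction of channel-swapped simulations whose 5th percentile is at
    least as low as the observed one (assumed reading of the text)."""
    rng = random.Random(seed)
    observed = percentile_5(log2_ratios)
    hits = 0
    for _ in range(n_sim):
        # Swapping tumor and control for a sample negates its log2 ratio.
        swapped = [x if rng.random() < 0.5 else -x for x in log2_ratios]
        if percentile_5(swapped) <= observed:
            hits += 1
    return hits / n_sim
```

Under this reading, a gene whose low tail comes from a consistent subset of tumors gets a small p-value, because swapping moves about half of those samples to the positive side. A gene with symmetric variation around zero does not.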

Combination of mutation analyses: the candidate recessive cancer genes

Cancer genes are affected by different types of point mutations and of chromosomal alterations. We defined a candidate cancer gene as recessive when affected by mutations potentially leading to loss of function, i.e. when it was frequently mutated in its coding region and frequently altered in its genomic structure, in particular deleted. The combination of the different genome wide tests produced a p-value for recessive cancer genes. The recessive cancer gene p-value (pRecessiveCancer) was defined as the product of the three p-values (pAA-NSSR, pFrameshift, pDeletion). One hundred and fifty-four human genes were included in the final candidate gene list after combinatorial mutation analysis was performed (pRecessiveCancer < 1.5×10⁻⁷). The number of cancer recessive genes obtained in a simulation by random association of the four mutation tests was 60.5 (false discovery rate of 0.39). The selection by the combinatorial approach appeared to be specific, since three classical recessive cancer genes, TP53 (16th position), PTEN (92nd) and CDKN2A (135th) were detected. When we compared the candidate gene-set to the whole genome, no major bias emerged towards gene size and structural polymorphisms, as expected from a well-behaved statistical procedure. The recessive cancer gene sizes did not differ significantly from that of the whole human genome (supplemental Figure S2). When we considered copy number variations, the cancer gene-set contained 15 polymorphic CNVs (15/154 or 10%) while 13.6% of all genes scored for pDeletion contained at least one CNV. This difference in proportion was not significant (p>>0.05), suggesting that there was no false enrichment for CNVs by our method, as expected by the design of the algorithm.
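The arithmetic of the combination step is simple enough to check by hand. The short Python sketch below recomputes pRecessiveCancer for TP53 and PTEN from the per-test p-values listed for them in Table 1, and the false discovery rate from the 60.5 genes expected by random association. The function and variable names are illustrative only.

```python
THRESHOLD = 1.5e-7  # selection cutoff for pRecessiveCancer given in the text


def p_recessive_cancer(p_deletion, p_aa, p_nssr, p_frameshift):
    """Product of pDeletion, pAA-NSSR and pFrameshift, where pAA-NSSR
    is the average of the two non-independent p-values pAA and pNSSR."""
    p_aa_nssr = (p_aa + p_nssr) / 2
    return p_deletion * p_aa_nssr * p_frameshift


# Per-test values as listed in Table 1 for two known tumor suppressors.
tp53 = p_recessive_cancer(5.00e-05, 0.022, 0.031, 0.0005)
pten = p_recessive_cancer(0.002, 0.0721, 0.017, 0.0005)

# Expected random selections divided by genes actually selected.
fdr = 60.5 / 154

print(f"TP53 {tp53:.3e}, PTEN {pten:.3e}, FDR {fdr:.2f}")
```

Both recomputed values fall below the cutoff and agree with the tabulated pRecessiveCancer entries after rounding.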

Gene ontology and functional analysis

The mechanisms and functional pathways associated with the cancer recessive genes were statistically evaluated. The enrichment in Gene Ontology (GO) terms was assessed by using EASE, at http://david.abcc.ncifcrf.gov. The biological processes significantly affected in the cancer gene set are listed in supplemental Table S2. The significant GO terms grouped by EASE functional clustering were: ATP/nucleotide binding, cell death/apoptosis, cell cycle, mitochondrion, RNA binding, methylation, tumor suppressor, DNA metabolism and DNA repair (EASE enrichment score > 2, EASE P-value < 1×10⁻⁴, Benjamini p-value < 0.01). A highly overlapping functional spectrum was obtained for the Cancer Census genes [18]. The most notable exceptions to the overlapping ontologies in the two cancer gene-sets were related to “protein tyrosine kinases”, absent from the candidate recessive list. These proteins are one of the most represented classes of oncogenes, or dominant cancer genes. A functional classification similar to that of EASE was obtained with BinGO and Cytoscape (data not shown), where some of the most significant cellular processes identified were involved in cancer pathogenesis, such as cell cycle and cell death/apoptosis (corrected p-value < 1×10⁻³). Finally, we generated a control set of human genes by randomly associating the p-values from the four mutation tests. When EASE and BinGO were applied to this control set, no significant GO terms were identified.

Discussion

We devised and applied a multi-tier genome-wide data mining assay towards the identification of genes prone to “recessive-type” mutations in cancer. The p-values resulting from each tier were combined to produce a “recessive cancer gene” p-value (Tables 1 and 2). Three of the most notable cancer recessive genes, i.e. TP53, PTEN and CDKN2A, ranked 16th, 92nd and 135th, respectively, among all tested human genes. The block diagram of our rationale and the data flow are shown in Figure 1. The tests can be subdivided into two groups: one for detection of point mutations (amino acid substitutions and frame-shifts) and one for structural alterations (large deletions). In principle we could have also used a test for partial gene deletions, but in ESTs intra-gene rearrangements can be confounded with alternative exon splicing.
Table 1

Mutation p-values for the candidate recessive cancer genes.

GENE SYMBOL	pDeletion	pAA	pNSSR	pAA-NSSR	pFrameshift	pRecessiveCancer
NASP	5.00E-05	0.0005	0.0005	0.0005	0.0005	1.25E-11
CCNB1	0.0001	0.0005	0.0005	0.0005	0.0005	2.50E-11
DDX21	0.0002	0.0005	0.0005	0.0005	0.001	1.00E-10
DHX9	5.00E-05	0.0005	0.0005	0.0005	0.004	1.00E-10
GANAB	5.00E-05	0.0005	0.0005	0.0005	0.004	1.00E-10
ILF3	0.0005	0.0005	0.0005	0.0005	0.0005	1.25E-10
AIPL1	5.00E-05	0.002	0.011	0.0065	0.0005	1.63E-10
NOLC1	5.00E-05	0.004	0.003	0.0035	0.001	1.75E-10
MYO1C	5.00E-05	0.004	0.014	0.009	0.0005	2.25E-10
NUDC	0.0012	0.0005	0.0005	0.0005	0.0005	3.00E-10
PGAM1	0.002	0.0005	0.0005	0.0005	0.0005	5.00E-10
IPO4	0.0003	0.003	0.0005	0.0018	0.001	5.25E-10
XRCC5	5.00E-05	0.0005	0.0005	0.0005	0.021	5.25E-10
MTO1	5.00E-05	0.0005	0.0443	0.0224	0.0005	5.60E-10
ANP32B	5.00E-05	0.006	0.0421	0.0241	0.0005	6.02E-10
TP53	5.00E-05	0.022	0.031	0.0265	0.0005	6.63E-10
AFG3L2	5.00E-05	0.013	0.002	0.0075	0.002	7.50E-10
FAF1	5.00E-05	0.0737	0.006	0.0398	0.0005	9.96E-10
CALR	0.002	0.0005	0.0005	0.0005	0.001	1.00E-09
SREBF2	0.004	0.0005	0.0005	0.0005	0.0005	1.00E-09
XRCC6	5.00E-05	0.007	0.002	0.0045	0.005	1.12E-09
ARMC8	5.00E-05	0.002	0.0005	0.0013	0.02	1.25E-09
GTPBP4	5.00E-05	0.005	0.002	0.0035	0.008	1.40E-09
HSPA4	0.0004	0.016	0.001	0.0085	0.0005	1.70E-09
HDAC1	5.00E-05	0.001	0.0005	0.0008	0.0486	1.82E-09
PGD	5.00E-05	0.075	0.0005	0.0378	0.001	1.89E-09
VCP	0.002	0.0005	0.0005	0.0005	0.002	2.00E-09
ATXN2L	0.0025	0.0005	0.0005	0.0005	0.002	2.50E-09
RPL6	0.001	0.001	0.009	0.005	0.0005	2.50E-09
SARS	5.00E-05	0.0952	0.007	0.0511	0.001	2.56E-09
NCL	0.0001	0.005	0.001	0.003	0.01	3.00E-09
PTPRC	0.012	0.0005	0.0005	0.0005	0.0005	3.00E-09
SMARCA4	0.012	0.0005	0.0005	0.0005	0.0005	3.00E-09
CCT3	0.0004	0.012	0.004	0.008	0.001	3.20E-09
NET1	5.00E-05	0.01	0.001	0.0055	0.013	3.58E-09
HNRPD	5.00E-05	0.011	0.0005	0.0057	0.013	3.74E-09
SQSTM1	0.01	0.001	0.0005	0.0008	0.0005	3.75E-09
TUBB2C	0.002	0.0005	0.007	0.0037	0.0005	3.75E-09
C1QBP	0.002	0.001	0.007	0.004	0.0005	4.00E-09
TRAP1	0.002	0.0005	0.0005	0.0005	0.004	4.00E-09
ALDOA	0.018	0.0005	0.0005	0.0005	0.0005	4.50E-09
RNASEH2A	5.00E-05	0.1183	0.0651	0.0917	0.001	4.59E-09
DDX24	0.002	0.002	0.0005	0.0013	0.002	5.00E-09
ILVBL	5.00E-05	0.019	0.001	0.01	0.01	5.00E-09
SERPINB3	5.00E-05	0.1205	0.2837	0.2021	0.0005	5.05E-09
UQCRC1	5.00E-05	0.016	0.0005	0.0083	0.016	6.60E-09
EEF2	0.028	0.0005	0.0005	0.0005	0.0005	7.00E-09
NUSAP1	5.00E-05	0.001	0.008	0.0045	0.033	7.43E-09
DNAJC11	0.0002	0.1653	0.008	0.0866	0.0005	8.66E-09
HSP90AA1	0.036	0.0005	0.0005	0.0005	0.0005	9.00E-09
MYH9	5.00E-05	0.0709	0.002	0.0365	0.005	9.12E-09
HK1	5.00E-05	0.01	0.001	0.0055	0.034	9.35E-09
IARS	0.01	0.003	0.001	0.002	0.0005	1.00E-08
YBX1	0.004	0.0005	0.0005	0.0005	0.005	1.00E-08
HDLBP	0.03	0.0005	0.001	0.0008	0.0005	1.12E-08
EWSR1	0.02	0.0005	0.002	0.0013	0.0005	1.25E-08
DHX15	5.00E-05	0.0456	0.0005	0.023	0.011	1.27E-08
SERPINB4	5.00E-05	0.3571	0.6667	0.5119	0.0005	1.28E-08
POLR2A	5.00E-05	0.038	0.005	0.0215	0.012	1.29E-08
ALG14	5.00E-05	0.098	0.1023	0.1002	0.003	1.50E-08
PRMT1	0.002	0.003	0.012	0.0075	0.001	1.50E-08
COX4NB	5.00E-05	0.001	0.005	0.003	0.1047	1.57E-08
SPTBN1	5.00E-05	0.026	0.004	0.015	0.021	1.58E-08
PTPRF	5.00E-05	0.0455	0.0005	0.023	0.014	1.61E-08
KHDRBS1	5.00E-05	0.117	0.013	0.065	0.005	1.62E-08
PABPC1	0.002	0.0005	0.003	0.0018	0.005	1.75E-08
CTNNA1	0.018	0.0005	0.0005	0.0005	0.002	1.80E-08
DDB1	0.018	0.0005	0.0005	0.0005	0.002	1.80E-08
GNB2L1	0.074	0.0005	0.0005	0.0005	0.0005	1.85E-08
WDR1	0.002	0.003	0.001	0.002	0.005	2.00E-08
AARS	0.024	0.003	0.0005	0.0018	0.0005	2.10E-08
NDE1	0.0001	0.012	0.002	0.007	0.03	2.10E-08
NQO1	0.002	0.002	0.0005	0.0013	0.009	2.25E-08
RUVBL2	5.00E-05	0.006	0.1778	0.0919	0.005	2.30E-08
ZWINT	5.00E-05	0.0496	0.003	0.0263	0.018	2.37E-08
HP1BP3	0.0007	0.002	0.006	0.004	0.009	2.52E-08
WDR79	5.00E-05	0.0501	0.002	0.026	0.02	2.60E-08
SLC25A6	0.002	0.0005	0.005	0.0027	0.005	2.75E-08
TYMS	5.00E-05	0.037	0.009	0.023	0.024	2.76E-08
SLC25A3	0.06	0.0005	0.0005	0.0005	0.001	3.00E-08
ACLY	5.00E-05	0.0798	0.02	0.0499	0.014	3.49E-08
ALDH3A1	0.14	0.0005	0.0005	0.0005	0.0005	3.50E-08
TTC8	5.00E-05	0.015	0.1626	0.0888	0.008	3.55E-08
YME1L1	5.00E-05	0.0403	0.015	0.0276	0.026	3.59E-08
ATP5A1	5.00E-05	0.029	0.008	0.0185	0.039	3.61E-08
MRPS2	5.00E-05	0.007	0.0915	0.0493	0.015	3.69E-08
HNRPH3	5.00E-05	0.0816	0.0005	0.0411	0.018	3.70E-08
IMMT	0.004	0.038	0.004	0.021	0.0005	4.20E-08
IMPDH2	0.006	0.014	0.0005	0.0073	0.001	4.35E-08
NCKAP1	5.00E-05	0.0417	0.0745	0.0581	0.015	4.36E-08
TTLL12	5.00E-05	0.019	0.01	0.0145	0.0606	4.39E-08
PTEN	0.002	0.0721	0.017	0.0445	0.0005	4.45E-08
WBSCR16	0.182	0.0005	0.0005	0.0005	0.0005	4.55E-08
XPNPEP1	5.00E-05	0.0926	0.0005	0.0465	0.02	4.65E-08
SREBF1	5.00E-05	0.0651	0.3175	0.1913	0.005	4.78E-08
CCDC5	5.00E-05	0.0907	0.005	0.0479	0.021	5.02E-08
DDX19B	5.00E-05	0.007	0.0005	0.0037	0.2685	5.03E-08
MAPK6	5.00E-05	0.0692	0.2286	0.1489	0.007	5.21E-08
MAP4	5.00E-05	0.0442	0.0005	0.0223	0.0469	5.24E-08
PHB2	0.22	0.0005	0.0005	0.0005	0.0005	5.50E-08
SAE1	0.016	0.0005	0.0005	0.0005	0.007	5.60E-08
TALDO1	5.00E-05	0.1008	0.063	0.0819	0.014	5.73E-08
AHCY	0.23	0.0005	0.0005	0.0005	0.0005	5.75E-08
GTF3C1	0.0001	0.0496	0.001	0.0253	0.023	5.82E-08
PRPF19	0.002	0.0549	0.005	0.0299	0.001	5.99E-08
LASP1	5.00E-05	0.0522	0.007	0.0296	0.0409	6.06E-08
TRIP10	5.00E-05	0.1418	0.01	0.0759	0.016	6.07E-08
HSPD1	0.244	0.0005	0.0005	0.0005	0.0005	6.10E-08
EIF4G2	0.016	0.0005	0.015	0.0077	0.0005	6.20E-08
SFN	0.17	0.0005	0.001	0.0008	0.0005	6.38E-08
TPM3	0.274	0.0005	0.0005	0.0005	0.0005	6.85E-08
ZNF259	5.00E-05	0.004	0.011	0.0075	0.1896	7.11E-08
MAD2L2	5.00E-05	0.0529	0.0713	0.0621	0.024	7.45E-08
GSK3B	5.00E-05	0.0969	0.2139	0.1554	0.01	7.77E-08
SH3BP5	0.003	0.0531	0.002	0.0276	0.001	8.27E-08
CNDP2	5.00E-05	0.0798	0.004	0.0419	0.0407	8.53E-08
PRKD2	5.00E-05	0.1108	0.1208	0.1158	0.015	8.69E-08
CAPG	0.142	0.0005	0.002	0.0013	0.0005	8.87E-08
CAPNS1	0.042	0.0005	0.008	0.0043	0.0005	8.93E-08
YY1	5.00E-05	0.2286	0.0988	0.1637	0.011	9.00E-08
ACSL5	5.00E-05	0.1375	0.0581	0.0978	0.019	9.29E-08
CCT6A	0.382	0.0005	0.0005	0.0005	0.0005	9.55E-08
RPUSD3	5.00E-05	0.1418	0.015	0.0784	0.025	9.80E-08
SBF1	0.006	0.008	0.0005	0.0043	0.004	1.02E-07
YWHAE	5.00E-05	0.031	0.024	0.0275	0.0739	1.02E-07
XPO1	0.274	0.0005	0.001	0.0008	0.0005	1.03E-07
CRELD2	5.00E-05	0.022	0.029	0.0255	0.0818	1.04E-07
PDCD10	5.00E-05	0.03	0.015	0.0225	0.0926	1.04E-07
HNRPF	5.00E-05	0.023	0.024	0.0235	0.0903	1.06E-07
RFT1	5.00E-05	0.03	0.005	0.0175	0.1231	1.08E-07
BAX	5.00E-05	0.3922	0.2326	0.3124	0.007	1.09E-07
EFTUD2	0.446	0.0005	0.0005	0.0005	0.0005	1.11E-07
EEF1D	0.448	0.0005	0.0005	0.0005	0.0005	1.12E-07
FDPS	0.032	0.001	0.013	0.007	0.0005	1.12E-07
CDKN2A	5.00E-05	0.012	0.0579	0.0349	0.0648	1.13E-07
PFKP	5.00E-05	0.03	0.001	0.0155	0.1476	1.14E-07
TACC3	5.00E-05	0.036	0.005	0.0205	0.117	1.20E-07
FPGS	0.0001	0.039	0.0659	0.0524	0.023	1.21E-07
WDR74	5.00E-05	0.1667	0.008	0.0873	0.028	1.22E-07
CDKN2B	5.00E-05	0.012	0.0667	0.0393	0.0632	1.24E-07
SFPQ	5.00E-05	1	0.0005	0.5002	0.005	1.25E-07
NARS	5.00E-05	0.4124	0.1465	0.2794	0.009	1.26E-07
TCOF1	5.00E-05	0.02	0.0475	0.0338	0.0756	1.28E-07
CHAF1A	0.0001	0.2581	0.0667	0.1624	0.008	1.30E-07
ALDH18A1	5.00E-05	0.2273	0.0629	0.1451	0.018	1.31E-07
MGAT4B	0.532	0.0005	0.0005	0.0005	0.0005	1.33E-07
CYP2C9	5.00E-05	0.7843	1	0.8922	0.003	1.34E-07
MRPL37	5.00E-05	0.0488	0.011	0.0299	0.0895	1.34E-07
TTBK2	5.00E-05	0.037	0.0879	0.0625	0.0438	1.37E-07
AP3D1	0.0008	0.026	0.001	0.0135	0.013	1.40E-07
PDCD6IP	5.00E-05	0.2597	0.039	0.1494	0.019	1.42E-07
CLTA	0.002	0.002	0.0693	0.0357	0.002	1.43E-07
CCNI	5.00E-05	0.03	0.0005	0.0152	0.1914	1.46E-07
ZFYVE19	5.00E-05	0.2516	0.0593	0.1555	0.019	1.48E-07

The top 154 recessive cancer genes have combined recessive cancer gene p-values lower than 1.5E-07. Alongside the gene symbol, the p-values for each one of the 3 independent mutational events, i.e. amino acid substitution (pAA-NSSR), frameshift (pFrameshift) and gene deletion (pDeletion), and the combined p-values are indicated. The pAA-NSSR p-value was first obtained as the average of pAA and pNSSR, two non-independent p-values. The global recessive cancer gene p-value (pRecessiveCancer) was then calculated by multiplying the three independent p-values.

Table 2

The candidate recessive cancer genes with genomic location and associated copy number variations.

| Gene symbol | Chromosomal location | pRecessiveCancer | Gene length (bp) | Copy number polymorphism |
|---|---|---|---|---|
| NASP | chr1:45822303-45857154 | 1.25E-11 | 34851 | |
| CCNB1 | chr5:68498668-68509822 | 2.50E-11 | 11154 | |
| DDX21 | chr10:70385897-70414285 | 1.00E-10 | 28388 | |
| DHX9 | chr1:181075073-181123505 | 1.00E-10 | 48432 | |
| GANAB | chr11:62148878-62170680 | 1.00E-10 | 21802 | |
| ILF3 | chr19:10625987-10664093 | 1.25E-10 | 38106 | |
| AIPL1 | chr17:6267783-6279243 | 1.63E-10 | 11460 | |
| NOLC1 | chr10:103901922-103913617 | 1.75E-10 | 11695 | |
| MYO1C | chr17:1314229-1335801 | 2.25E-10 | 21572 | |
| NUDC | chr1:27120810-27145474 | 3.00E-10 | 24664 | |
| PGAM1 | chr10:99176016-99183187 | 5.00E-10 | 7171 | |
| IPO4 | chr14:23719265-23727964 | 5.25E-10 | 8699 | |
| XRCC5 | chr2:216682376-216779248 | 5.25E-10 | 96872 | |
| MTO1 | chr6:74228208-74267896 | 5.60E-10 | 39688 | |
| ANP32B | chr9:99785309-99818043 | 6.02E-10 | 32734 | |
| TP53 | chr17:7512444-7531642 | 6.63E-10 | 19198 | |
| AFG3L2 | chr18:12319107-12367194 | 7.50E-10 | 48087 | cnp1251 |
| FAF1 | chr1:50679522-51198524 | 9.96E-10 | 519002 | |
| CALR | chr19:12910422-12916303 | 1.00E-09 | 5881 | |
| SREBF2 | chr22:40559051-40632319 | 1.00E-09 | 73268 | |
| XRCC6 | chr22:40347240-40389998 | 1.12E-09 | 42758 | |
| ARMC8 | chr3:139388837-139498909 | 1.25E-09 | 110072 | cnp270 |
| GTPBP4 | chr10:1024348-1053704 | 1.40E-09 | 29356 | |
| HSPA4 | chr5:132415560-132468607 | 1.70E-09 | 53047 | |
| HDAC1 | chr1:32530294-32571811 | 1.82E-09 | 41517 | |
| PGD | chr1:10381671-10402787 | 1.89E-09 | 21116 | cnp10 |
| VCP | chr9:35046560-35062564 | 2.00E-09 | 16004 | |
| ATXN2L | chr16:28741914-28756057 | 2.50E-09 | 14143 | cnp1177 |
| RPL6 | chr12:111327376-111331826 | 2.50E-09 | 4450 | |
| SARS | chr1:109558062-109582308 | 2.56E-09 | 24246 | |
| NCL | chr2:232027703-232037449 | 3.00E-09 | 9746 | |
| PTPRC | chr1:196874759-196993168 | 3.00E-09 | 118409 | |
| SMARCA4 | chr19:10932605-11033952 | 3.00E-09 | 101347 | |
| CCT3 | chr1:154545375-154574819 | 3.20E-09 | 29444 | |
| NET1 | chr10:5478545-5490424 | 3.58E-09 | 11879 | |
| HNRPD | chr4:83493490-83514173 | 3.74E-09 | 20683 | |
| SQSTM1 | chr5:179180502-179197681 | 3.75E-09 | 17179 | |
| TUBB2C | chr9:139255531-139257980 | 3.75E-09 | 2449 | |
| C1QBP | chr17:5276822-5283195 | 4.00E-09 | 6373 | |
| TRAP1 | chr16:3648038-3707599 | 4.00E-09 | 59561 | |
| ALDOA | chr16:29984544-29989235 | 4.50E-09 | 4691 | cnp1179 |
| RNASEH2A | chr19:12778427-12785462 | 4.59E-09 | 7035 | |
| DDX24 | chr14:93587021-93617311 | 5.00E-09 | 30290 | |
| ILVBL | chr19:15086786-15097577 | 5.00E-09 | 10791 | cnp1283 |
| SERPINB3 | chr18:59473411-59480094 | 5.05E-09 | 6683 | |
| UQCRC1 | chr3:48611435-48622102 | 6.60E-09 | 10667 | |
| EEF2 | chr19:3927054-3936461 | 7.00E-09 | 9407 | |
| NUSAP1 | chr15:39412360-39460537 | 7.43E-09 | 48177 | |
| DNAJC11 | chr1:6616817-6684460 | 8.66E-09 | 67643 | |
| HSP90AA1 | chr14:101616827-101675839 | 9.00E-09 | 59012 | |
| MYH9 | chr22:35007271-35113927 | 9.12E-09 | 106656 | |
| HK1 | chr10:70748628-70831641 | 9.35E-09 | 83013 | |
| IARS | chr9:94012445-94095859 | 1.00E-08 | 83414 | |
| YBX1 | chr1:42920652-42940604 | 1.00E-08 | 19952 | |
| HDLBP | chr2:241815351-241903927 | 1.12E-08 | 88576 | |
| EWSR1 | chr22:27994016-28026515 | 1.25E-08 | 32499 | |
| DHX15 | chr4:24138187-24195282 | 1.27E-08 | 57095 | |
| SERPINB4 | chr18:59455474-59462482 | 1.28E-08 | 7008 | |
| POLR2A | chr17:7328421-7358653 | 1.29E-08 | 30232 | |
| ALG14 | chr1:95220884-95311071 | 1.50E-08 | 90187 | |
| PRMT1 | chr19:54872307-54883516 | 1.50E-08 | 11209 | |
| COX4NB | chr16:84369736-84390601 | 1.57E-08 | 20865 | |
| SPTBN1 | chr2:54536957-54752086 | 1.58E-08 | 215129 | |
| PTPRF | chr1:43769133-43861929 | 1.61E-08 | 92796 | |
| KHDRBS1 | chr1:32252077-32282058 | 1.62E-08 | 29981 | |
| PABPC1 | chr8:101784319-101803491 | 1.75E-08 | 19172 | |
| CTNNA1 | chr5:138117005-138298621 | 1.80E-08 | 181616 | |
| DDB1 | chr11:60823494-60857242 | 1.80E-08 | 33748 | cnp921 |
| GNB2L1 | chr5:180596533-180603512 | 1.85E-08 | 6979 | |
| WDR1 | chr4:9685060-9727671 | 2.00E-08 | 42611 | cnp312 |
| AARS | chr16:68843797-68880913 | 2.10E-08 | 37116 | cnp1189 |
| NDE1 | chr16:15651604-15726490 | 2.10E-08 | 74886 | |
| NQO1 | chr16:68300805-68318034 | 2.25E-08 | 17229 | |
| RUVBL2 | chr19:54188967-54210994 | 2.30E-08 | 22027 | |
| ZWINT | chr10:57787204-57791040 | 2.37E-08 | 3836 | |
| HP1BP3 | chr1:20941757-20985768 | 2.52E-08 | 44011 | |
| WDR79 | chr17:7532519-7547544 | 2.60E-08 | 15025 | |
| SLC25A6 | chrY:1465044-1470998 | 2.75E-08 | 5954 | |
| TYMS | chr18:647650-663492 | 2.76E-08 | 15842 | |
| SLC25A3 | chr12:97511533-97519908 | 3.00E-08 | 8375 | |
| ACLY | chr17:37276706-37328798 | 3.49E-08 | 52092 | |
| ALDH3A1 | chr17:19581891-19592200 | 3.50E-08 | 10309 | |
| TTC8 | chr14:88360730-88414087 | 3.55E-08 | 53357 | |
| YME1L1 | chr10:27439390-27483327 | 3.59E-08 | 43937 | |
| ATP5A1 | chr18:41918107-41938197 | 3.61E-08 | 20090 | |
| MRPS2 | chr9:137532374-137536337 | 3.69E-08 | 3963 | |
| HNRPH3 | chr10:69761884-69772952 | 3.70E-08 | 11068 | |
| IMMT | chr2:86224565-86276404 | 4.20E-08 | 51839 | |
| IMPDH2 | chr3:49036771-49041879 | 4.35E-08 | 5108 | |
| NCKAP1 | chr2:183497850-183611474 | 4.36E-08 | 113624 | cnp194 |
| TTLL12 | chr22:41892572-41913051 | 4.39E-08 | 20479 | |
| PTEN | chr10:89613174-89718511 | 4.45E-08 | 105337 | |
| WBSCR16 | chr7:74094219-74127635 | 4.55E-08 | 33416 | cnp627 |
| XPNPEP1 | chr10:111614513-111673192 | 4.65E-08 | 58679 | |
| SREBF1 | chr17:17656110-17681050 | 4.78E-08 | 24940 | |
| CCDC5 | chr18:41938322-41962296 | 5.02E-08 | 23974 | |
| DDX19B | chr16:68890572-68925230 | 5.03E-08 | 34658 | |
| MAPK6 | chr15:50098738-50145751 | 5.21E-08 | 47013 | |
| MAP4 | chr3:47867189-48105715 | 5.24E-08 | 238526 | |
| PHB2 | chr12:6944777-6950152 | 5.50E-08 | 5375 | |
| SAE1 | chr19:52325983-52405371 | 5.60E-08 | 79388 | |
| TALDO1 | chr11:737431-755023 | 5.73E-08 | 17592 | cnp884 |
| AHCY | chr20:32331736-32354784 | 5.75E-08 | 23048 | |
| GTF3C1 | chr16:27379435-27468752 | 5.82E-08 | 89317 | |
| PRPF19 | chr11:60414782-60430632 | 5.99E-08 | 15850 | |
| LASP1 | chr17:34279893-34331540 | 6.06E-08 | 51647 | |
| TRIP10 | chr19:6690706-6702528 | 6.07E-08 | 11822 | |
| HSPD1 | chr2:198059554-198073243 | 6.10E-08 | 13689 | |
| EIF4G2 | chr11:10775169-10787158 | 6.20E-08 | 11989 | |
| SFN | chr1:27062219-27063534 | 6.38E-08 | 1315 | |
| TPM3 | chr1:152400913-152431233 | 6.85E-08 | 30320 | |
| ZNF259 | chr11:116154486-116163949 | 7.11E-08 | 9463 | |
| MAD2L2 | chr1:11657124-11663774 | 7.45E-08 | 6650 | |
| GSK3B | chr3:121028237-121295203 | 7.77E-08 | 266966 | |
| SH3BP5 | chr3:15271360-15357905 | 8.27E-08 | 86545 | |
| CNDP2 | chr18:70314576-70339336 | 8.53E-08 | 24760 | |
| PRKD2 | chr19:51869412-51912224 | 8.69E-08 | 42812 | |
| CAPG | chr2:85475381-85491187 | 8.87E-08 | 15806 | |
| CAPNS1 | chr19:41322757-41333094 | 8.93E-08 | 10337 | |
| YY1 | chr14:99774854-99814557 | 9.00E-08 | 39703 | |
| ACSL5 | chr10:114125945-114178127 | 9.29E-08 | 52182 | |
| CCT6A | chr7:56086871-56099176 | 9.55E-08 | 12305 | |
| RPUSD3 | chr3:9854533-9860676 | 9.80E-08 | 6143 | |
| SBF1 | chr22:49232101-49260320 | 1.02E-07 | 28219 | |
| YWHAE | chr17:1194594-1250267 | 1.02E-07 | 55673 | |
| XPO1 | chr2:61558573-61618922 | 1.03E-07 | 60349 | |
| CRELD2 | chr22:48698347-48707178 | 1.04E-07 | 8831 | |
| PDCD10 | chr3:168884389-168935345 | 1.04E-07 | 50956 | |
| HNRPF | chr10:43201070-43223305 | 1.06E-07 | 22235 | |
| RFT1 | chr3:53099850-53139503 | 1.08E-07 | 39653 | |
| BAX | chr19:54149928-54156867 | 1.09E-07 | 6939 | |
| EFTUD2 | chr17:40283804-40332289 | 1.11E-07 | 48485 | |
| EEF1D | chr8:144733040-144750726 | 1.12E-07 | 17686 | |
| FDPS | chr1:153546200-153557080 | 1.12E-07 | 10880 | cnp61 |
| CDKN2A | chr9:21957751-21984490 | 1.13E-07 | 26739 | |
| PFKP | chr10:3099751-3168995 | 1.14E-07 | 69244 | cnp816 |
| TACC3 | chr4:1693063-1716693 | 1.20E-07 | 23630 | cnp308 |
| FPGS | chr9:129605328-129616377 | 1.21E-07 | 11049 | |
| WDR74 | chr11:62356959-62364204 | 1.22E-07 | 7245 | |
| CDKN2B | chr9:21992905-21999312 | 1.24E-07 | 6407 | |
| SFPQ | chr1:35421789-35431322 | 1.25E-07 | 9533 | |
| NARS | chr18:53418891-53440175 | 1.26E-07 | 21284 | |
| TCOF1 | chr5:149717427-149760063 | 1.28E-07 | 42636 | |
| CHAF1A | chr19:4353659-4394393 | 1.30E-07 | 40734 | |
| ALDH18A1 | chr10:97355676-97406557 | 1.31E-07 | 50881 | |
| MGAT4B | chr5:179156710-179166547 | 1.33E-07 | 9837 | |
| CYP2C9 | chr10:96688429-96739137 | 1.34E-07 | 50708 | |
| MRPL37 | chr1:54438427-54456638 | 1.34E-07 | 18211 | |
| TTBK2 | chr15:40823837-41000299 | 1.37E-07 | 176462 | |
| AP3D1 | chr19:2051993-2102556 | 1.40E-07 | 50563 | |
| PDCD6IP | chr3:33814560-33886198 | 1.42E-07 | 71638 | |
| CLTA | chr9:36180891-36202055 | 1.43E-07 | 21164 | |
| CCNI | chr4:78188198-78216149 | 1.46E-07 | 27951 | |
| ZFYVE19 | chr15:38886565-38894059 | 1.48E-07 | 7494 | |

The top 154 genes have combined recessive cancer gene p-values lower than 1.5×10−7 (FDR = 0.39). Alongside the gene symbol, the genome coordinates, gene length, recessive cancer gene p-value and, where present, the overlapping copy number polymorphic site are reported.

The probability of a protein having both amino acid mutations and frame-shifts in cancer, two independent events, was defined as the product of the respective p-values. Using these two tests alone, the prototypical cancer genes TP53 and PTEN ranked 205th and 233rd out of 27,184 evaluated human transcripts (p-value<1×10−4). Two other well-known recessive cancer genes, CDKN2A and CDKN2B, also had significant p-values, albeit with lower rankings (p<0.0025 and FDR = 0.019, respectively). This behavior was expected for genes with small coding regions, which might be more commonly deleted than mutated [6]. Their presence in the significant point-mutation gene-set, even at this intermediate stage, reassured us of the selection capabilities of our algorithm. Nevertheless, this early classification was based entirely on point mutations and compiled from only two mutation tests, both relying on EST sequencing data; it was therefore not yet reliable according to our model, which incorporates an additional mutation mode.
It should be noted that we did not set out to identify translocations, alterations that are expected to be dominant at the cellular level and are therefore not suited to our search for recessive genes. The last test, based on aCGH analysis, confirmed that a very large portion of the human genome is frequently deleted in cancer. As expected for our 2-channel aCGH procedure, we correctly detected sex chromosome genes as differentially represented in the genome screens. In particular, owing to the resolution of our structural assay, the genes from the pseudo-autosomal region 1 were identified as normal diploid (supplemental Figure S1). Most importantly, we expected that polymorphic CNVs would not filter through the aCGH assay. Indeed, only a small percentage of cancer genes coincided with polymorphic CNVs, and this percentage is even smaller than expected by chance (Table 2). The number of deletions detected by aCGH in the cancer genome is very high (more than 10% of human genes were deleted in cancer). Notwithstanding this excess of deletions, when all mutation modes are included, the number of candidate genes is less than 0.5% of the analyzed human genome. The cancer gene products are involved in biological processes such as cell cycle, DNA repair and apoptosis, in agreement with the literature. The same functional terms are also associated with the genes in the COSMIC Cancer Census [18]. Strikingly, tyrosine kinases, which are dominant oncogenes present in the Cancer Census, were absent from our cancer gene-set, in agreement with the selection for recessive genes.
Some strong limitations are inherent to our approach. It is unlikely that the recorded frame-shifts are polymorphisms, since they alter the primary structure of the gene products. Conversely, they might very often result from sequencing errors. For this reason, we chose to filter out sequencing errors as far as possible by using a paired t test over a sliding window.
Another point of controversy concerns the somatic character of the detected mutations. Since there are virtually no germ-line sequences corresponding to the tumor libraries in the EST database, there cannot be any formal demonstration that the selected genes correspond to somatic mutation targets. We cannot establish how many of the detected mismatches are real mutations, nor how many of them are truly of somatic origin. We could only attach to each human gene a p-value for the excess of mismatches with gene-inactivating potential in cancer samples. The presence of TP53, PTEN and CDKN2A in the candidate gene-set, together with its functional characteristics, is evidence in favor of the hypothesis that we measured an excess of somatic cancer mutations. This hypothesis can be put to the test, and possibly refuted, using various experimental protocols. On the other hand, it is possible that some of the candidate genes bear germ-line mutations and thus constitute predisposition traits for cancer onset. When we compared our results to those of the recently published massive sequencing project, some differences emerged. We used a larger amount of sequencing data, albeit of lower quality, since we did not use second-pass sequencing data. We obtained from dbEST a number of mismatches roughly 5 times higher than that of the genome-wide sequencing screens. This excess could be due to the lower quality of the sequencing data in ESTs or to the higher sensitivity of our approach compared to PCR-based direct sequencing. Detection of under-represented mutations in often heterogeneous cancer biopsies can be a technical challenge for direct sequencing, but not for cloned ESTs. ESTs were used in previous attempts to identify cancer-related genes. Almost invariably these approaches were based on expression profiling, which in tumor samples probably captures correlates and late events among the steps leading to tumor development and progression.
In a very different data mining effort on EST sequences in cancer, Qiu and co-workers [20] measured SNP-tumor association. Their analysis was highly focused on single nucleotide mismatches, and restricted to known mutations described in the SNP database and present in at least 50 EST hits. They identified 4,865 SNPs frequent in tumors (p<0.05), of which 327 induced an amino acid substitution (cSNPs). Many major histocompatibility complex (MHC) class II molecules were present among these coding SNPs, while none was present in our recessive cancer gene-set. Most importantly, no landmark cancer genes, such as TP53, PTEN and CDKN2A, were present among the cSNPs. Finally, none of the SNP genes detected by Qiu et al. [20] was present in our candidate recessive cancer gene-set. The minute cancer recessive sub-genome (<0.5%) we identified might represent a milestone towards the identification of novel markers for early diagnosis and prognosis. Additionally, our mining strategy can be applied to the data that will become available upon the sequencing of cancer genomes [22]. Finally, our work might lead to a different equilibrium within the pool of cancer genes, currently unbalanced towards dominant oncogenes.

Materials and Methods

EST data mining

All human coding sequences were extracted from the RefSeq mRNA database at NCBI (27,184 sequences). The dbEST database contained more than 5.6 million human ESTs (exceeding 3,009 million nucleotides in length). The dbEST libraries (7574) were manually annotated according to the biomaterial of origin, and ESTs were subdivided into the following seven classes: cancer tissues and cell lines (Y, 4466 libraries), normal tissues (N, 2621), cell lines of uncertain origin (C, 193), hyperplasia (B, 32), normal tissues associated with cancer lesions (A, 33), matched normal controls from cancer patients (M, 70) and undetermined origin (U, 159). Only libraries of clear-cut origin were used, i.e. 4466 cancer tissues and cell lines (Y) vs. 2621 normal tissues (N). Tissues associated (A) or matched (M) to cancer, benign tumors (B) and other cell lines (C) were not used. The coding sequences for each RefSeq entry were aligned against the human dbEST database by using BLAST. The Cancer Mutome MySQL database was populated with a total of 43,965,904 mismatches and gaps extracted from 3,839,543 alignments. Perl was used to develop all the scripts and implement the system. BioPerl was used for the BLAST procedure and parsing. BLAST parameters were set to default (expect = 1E-21), with the exception that up to a maximum of 500 alignments were recovered for each query.

Detection of point mutations in ESTs

To attenuate the problem of the high sequencing error rate in ESTs, our procedure retrieved candidate mutations only in the region of maximum nucleotide identity to the query. Our assumption was that an identical error rate was present in the two EST populations, those derived from the control and those from the cancer cells. Therefore the frequencies of mismatches due to sequencing errors are expected to be comparable across ESTs for the same genes. The mismatches were considered for subsequent analysis only when present in the internal sequence (not in the first or last ten nucleotides of the BLAST alignments). Mismatches were then evaluated for their ability to change the amino acid residue in the corresponding codon. A single candidate mutation was considered only once for each dbEST library, to avoid bias due to RNA copy number. The 8972 human genes most variable in number of mismatches (IQR>0.5) were retained for further testing. Statistics for amino acid substitutions, non-synonymous to synonymous nucleotide exchange rate and frame-shifts were calculated for each human coding sequence. Gene-specific confidence limits for the respective paired t tests were calculated by bootstrap analysis. The two bootstrap classes were composed by randomly drawing the cancer or normal status from the library classes, 1000 times with replacement [23], [24]. In the first of three different measures, the frequencies of amino acid substitution in normal and cancerous tissues were compared for each gene by using a paired t test over a 25-residue protein window. Normalization of mismatches for the control and cancer classes was attained by using a gene-specific and local correction factor. The correction factor was derived by dividing the respective counts of ESTs in both classes at each nucleotide position of the query.
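The two filtering rules described above (keep only mismatches in the internal part of the alignment, and count a candidate mutation once per dbEST library) can be sketched as follows. This is a hedged illustration: the `Mismatch` record, its field names and the helper functions are hypothetical and do not reproduce the schema of the Cancer Mutome database.

```python
from typing import NamedTuple


class Mismatch(NamedTuple):
    """Hypothetical record for one mismatch found in a BLAST alignment."""
    library_id: str   # dbEST library of the aligned EST
    gene: str         # RefSeq query
    query_pos: int    # position on the query sequence
    aln_offset: int   # 0-based offset of the mismatch within the alignment
    aln_length: int   # total alignment length
    change: str       # e.g. "C>T"


EDGE = 10  # nucleotides ignored at each end of the alignment


def is_internal(m: Mismatch) -> bool:
    """True when the mismatch is outside the first and last ten nucleotides."""
    return EDGE <= m.aln_offset < m.aln_length - EDGE


def candidate_mutations(mismatches):
    """Keep internal mismatches, counting each one once per library."""
    seen = set()
    kept = []
    for m in mismatches:
        key = (m.library_id, m.gene, m.query_pos, m.change)
        if is_internal(m) and key not in seen:
            seen.add(key)
            kept.append(m)
    return kept
```

Keying the de-duplication on library, gene, position and change means that a highly expressed transcript sequenced many times in one library contributes a single observation, which is the stated purpose of the once-per-library rule.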
The score assigned to each human RefSeq gene corresponded to the sum of the T score values exceeding the gene-specific confidence limit (p<0.05) over the sliding window (i.e. the area of the peaks above the threshold). The second measure, linked to the amino acid substitution frequency, consisted of the evaluation of the selective pressure for amino acid changes. This filter was implemented to separate causal from bystander mutations and to further diminish the effects of sequencing errors. The ratios of non-synonymous (NS) to synonymous (S) nucleotide substitutions within the cancer and normal ESTs were calculated for each gene. A paired t test was used to compare the cancer and normal NS/S substitution ratios at different codons. When the number of synonymous substitutions in the denominator was zero, unity was added to both numerator and denominator. Only the amino acid positions significant for frequency of substitutions (in the amino acid substitution test above) were evaluated here. The gene-specific confidence limit at 5%, the p-values and the FDR were again computed by bootstrap, as described above. A third measure on point mutations concerned the frequency of frame-shifts, which can produce premature protein termination or other major alterations in primary structure. In a paired t test analogous to that for pAA, a procedure based on a 25-nt sliding window was applied to the number of frame-shifts induced in cancer by 1-nucleotide insertions or deletions. Longer DNA alterations were extremely rare and were not recorded. The gene p-values for such frame-shifts in cancer were again computed by bootstrap and defined as pFrameshift.
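The NS/S comparison described above can be illustrated with a short Python sketch. It applies the stated rule of adding unity to numerator and denominator when no synonymous substitution is present, then computes a standard paired t statistic across codons. The function names and the per-codon counts are assumptions made for the example; the bootstrap used to derive confidence limits is not shown.

```python
from math import sqrt
from statistics import mean, stdev


def ns_s_ratio(ns: int, s: int) -> float:
    """NS/S ratio; unity is added to both terms when S is zero."""
    if s == 0:
        ns, s = ns + 1, s + 1
    return ns / s


def paired_t(cancer, normal) -> float:
    """Paired t statistic for per-codon values (cancer minus normal)."""
    diffs = [c - n for c, n in zip(cancer, normal)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical (NS, S) counts at three codons of one gene.
cancer_counts = [(3, 0), (4, 1), (2, 1)]
normal_counts = [(1, 0), (1, 1), (1, 2)]

cancer_ratios = [ns_s_ratio(ns, s) for ns, s in cancer_counts]
normal_ratios = [ns_s_ratio(ns, s) for ns, s in normal_counts]
t_score = paired_t(cancer_ratios, normal_ratios)
print(cancer_ratios, normal_ratios, round(t_score, 3))
```

A positive t statistic indicates a higher NS/S ratio in the cancer ESTs at the tested codons; at least two codons with non-identical differences are needed for the statistic to be defined.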

ESTs P-value and false detection rate calculation

Procedures were devised for the calculation of gene-specific p-values and false detection rates in each of the described approaches. Bootstrap analysis was used to compute the adjusted probability that a human gene was affected in cancer but not in normal ESTs [23], [24]. The resampling test allowed us to define confidence limits for each gene and to effectively tackle local issues such as DNA composition, CpG occurrence, and protein or gene length. For the point mutation analyses, the resampling procedure was performed only on the protein residues found to be above the T threshold (p<0.05). Different numbers of bootstrap cycles were tried on a short gene list in order to choose the lowest number of resampling cycles yielding stable p-values, and 1000 cycles were found to be sufficient. The ESTs belonging to the cancer and normal classes were randomly subdivided to form two simulated classes of the same size as the original ones. The gene-specific p-value was defined as the frequency at which the resampling test scored equal to or better than the real test. Null p-values were set to half of the lowest p-value in the respective simulations.
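The resampling p-value defined above can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the observations are shuffled between two classes of the original sizes (a label permutation, not a draw with replacement), the test statistic is passed in as a function, and "half of the lowest p-value" is read as half of the smallest attainable non-zero frequency, 1/cycles. The names `resampling_p_value` and `mean_excess` are hypothetical.

```python
import random


def resampling_p_value(cancer, normal, statistic, cycles=1000, seed=0):
    """Frequency at which a resampled split scores >= the real split."""
    rng = random.Random(seed)
    real = statistic(cancer, normal)
    pooled = list(cancer) + list(normal)
    n_cancer = len(cancer)
    hits = 0
    for _ in range(cycles):
        rng.shuffle(pooled)
        # Two simulated classes with the same sizes as the original ones.
        simulated = statistic(pooled[:n_cancer], pooled[n_cancer:])
        if simulated >= real:
            hits += 1
    p = hits / cycles
    if p == 0.0:
        # Assumed reading: a null p-value becomes half of 1/cycles.
        p = 0.5 / cycles
    return p


def mean_excess(cancer, normal):
    """Toy statistic: excess of the cancer mean over the normal mean."""
    return sum(cancer) / len(cancer) - sum(normal) / len(normal)


p = resampling_p_value([5] * 20, [0] * 20, mean_excess)
print(p)
```

Under this reading, a gene never matched by any of 1000 resampled splits receives 0.0005 rather than zero, which keeps the later product of p-values from collapsing to zero.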

Detection of deletions in array CGH

A total of 744 comparative genomic hybridization arrays were studied (537 samples from GEO and 207 from SMD). All platforms were 2-channel based, data were downloaded as normalized values, and probes were indexed by gene symbol. Gene data and annotations were stored in the GeoSoft database. All normalized log ratios were converted to log2 ratios, with the cancer value as the numerator and the control value as the denominator. Genes were pre-filtered on standard deviation, to exclude those that did not show high variation in their genomic profiles (std dev<0.2). Genes were scored when measured in at least 300 tumors. Deleted cancer genes were expected to have log2 ratios lower than the 5th percentile of the bootstrapped log2 ratios; amplified genes were expected to have log2 ratios higher than the 95th percentile of the bootstrapped values. Bootstrap analysis (10,000 random swaps of tumor and control channels) was used to obtain gene-specific p-values and confidence limits for deletion and amplification.
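A rough Python sketch of the aCGH steps described above is given here. It is illustrative only: swapping the tumor and control channels is modeled as flipping the sign of a log2 ratio, and the mean log2 ratio across tumors is used as the per-gene summary, both of which are assumptions rather than details given in the text. All function names are hypothetical.

```python
import random
from math import log
from statistics import stdev

MIN_TUMORS = 300   # a gene must be measured in at least this many tumors
MIN_STD = 0.2      # genes with lower standard deviation are excluded


def to_log2(log_ratio: float, base: float) -> float:
    """Convert a log ratio expressed in another base to a log2 ratio."""
    return log_ratio * log(base) / log(2)


def passes_prefilter(log2_ratios) -> bool:
    """Apply the minimum-tumor and standard-deviation filters."""
    return len(log2_ratios) >= MIN_TUMORS and stdev(log2_ratios) >= MIN_STD


def deletion_p_value(log2_ratios, swaps=10_000, seed=0) -> float:
    """Frequency at which channel-swapped data look at least as deleted."""
    rng = random.Random(seed)
    n = len(log2_ratios)
    observed = sum(log2_ratios) / n
    hits = 0
    for _ in range(swaps):
        # Swapping tumor and control channels negates the log2 ratio.
        simulated = sum(r if rng.random() < 0.5 else -r for r in log2_ratios) / n
        if simulated <= observed:
            hits += 1
    return hits / swaps if hits else 0.5 / swaps


# Hypothetical gene with consistently negative log2 ratios in 300 tumors.
ratios = [-1.0, -0.4] * 150
if passes_prefilter(ratios):
    print(deletion_p_value(ratios, swaps=2000))
```

An amplification p-value would follow the same pattern with the comparison reversed (`simulated >= observed`).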

Point Mutation and aCGH combined p-values

Finally, the p-values obtained by the three different tests (pAA-NSSR, pFrameshift and pDeletion) were multiplied together to compute the global pRecessiveCancer p-value. This p-value was used to sort the human genes by their propensity to bear mutations in cancer. One hundred and fifty-four genes were selected with a p-value below 1.5×10−7. One hundred resampling cycles were performed by randomly associating p-values for each mutation test and yielded a false detection rate of 39%. EASE (http://david.abcc.ncifcrf.gov) and BinGO (Cytoscape plugin) were used for Gene Ontology analysis. The hypergeometric test with Benjamini and Hochberg false discovery rate correction was used in BinGO [25].

Genomic structures are correctly identified by the aCGH protocol. Track analysis in UCSC Genome Browser of Xp22 Pseudo-Autosomal Region 1 (PAR1). The Pseudo-Autosomal Region 1 is correctly identified as normal (diploid) by the array CGH analysis, while the rest of the X chromosome is reported, also as expected, as "pseudo-amplified". The chromosome X genes 3 prime of PAR1 appear as amplified because their DNA copy number is higher than expected when compared to the respective average DNA copy number in the whole, mixed sex, tumour population. (0.31 MB TIF)

Distribution of gene size in the candidate recessive cancer gene-set. The recessive cancer gene sizes do not differ significantly from the gene sizes in the human genome (most common genes range between 32 and 128 kb). (0.28 MB TIF)

Array CGH datasets. (0.04 MB DOC)

Functional (Gene Ontology, biological process) chart of the candidate cancer recessive genes, FDR<0.5. (0.22 MB DOC)
  23 in total

1.  Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences.

Authors:  K Irizarry; V Kustanovich; C Li; N Brown; S Nelson; W Wong; C J Lee
Journal:  Nat Genet       Date:  2000-10       Impact factor: 38.330

2.  Basic local alignment search tool.

Authors:  S F Altschul; W Gish; W Miller; E W Myers; D J Lipman
Journal:  J Mol Biol       Date:  1990-10-05       Impact factor: 5.469

Review 3.  Oncogenes and cancer.

Authors:  Carlo M Croce
Journal:  N Engl J Med       Date:  2008-01-31       Impact factor: 91.245

Review 4.  Genome resequencing and genetic variation.

Authors:  Michael Stratton
Journal:  Nat Biotechnol       Date:  2008-01       Impact factor: 54.908

5.  The IARC TP53 database: new online mutation analysis and recommendations to users.

Authors:  Magali Olivier; Ros Eeles; Monica Hollstein; Mohammed A Khan; Curtis C Harris; Pierre Hainaut
Journal:  Hum Mutat       Date:  2002-06       Impact factor: 4.878

6.  Pathology of familial breast cancer: differences between breast cancers in carriers of BRCA1 or BRCA2 mutations and sporadic cases. Breast Cancer Linkage Consortium.

Authors: 
Journal:  Lancet       Date:  1997-05-24       Impact factor: 79.321

7.  Hematologic and cytogenetic responses to imatinib mesylate in chronic myelogenous leukemia.

Authors:  Hagop Kantarjian; Charles Sawyers; Andreas Hochhaus; Francois Guilhot; Charles Schiffer; Carlo Gambacorti-Passerini; Dietger Niederwieser; Debra Resta; Renaud Capdeville; Ulrike Zoellner; Moshe Talpaz; Brian Druker; John Goldman; Stephen G O'Brien; Nigel Russell; Thomas Fischer; Oliver Ottmann; Pascale Cony-Makhoul; Thierry Facon; Richard Stone; Carole Miller; Martin Tallman; Randy Brown; Michael Schuster; Thomas Loughran; Alois Gratwohl; Franco Mandelli; Giuseppe Saglio; Mario Lazzarino; Domenico Russo; Michele Baccarani; Enrica Morra
Journal:  N Engl J Med       Date:  2002-02-28       Impact factor: 91.245

8.  The genomic landscapes of human breast and colorectal cancers.

Authors:  Laura D Wood; D Williams Parsons; Siân Jones; Jimmy Lin; Tobias Sjöblom; Rebecca J Leary; Dong Shen; Simina M Boca; Thomas Barber; Janine Ptak; Natalie Silliman; Steve Szabo; Zoltan Dezso; Vadim Ustyanksky; Tatiana Nikolskaya; Yuri Nikolsky; Rachel Karchin; Paul A Wilson; Joshua S Kaminker; Zemin Zhang; Randal Croshaw; Joseph Willis; Dawn Dawson; Michail Shipitsin; James K V Willson; Saraswati Sukumar; Kornelia Polyak; Ben Ho Park; Charit L Pethiyagoda; P V Krishna Pant; Dennis G Ballinger; Andrew B Sparks; James Hartigan; Douglas R Smith; Erick Suh; Nickolas Papadopoulos; Phillip Buckhaults; Sanford D Markowitz; Giovanni Parmigiani; Kenneth W Kinzler; Victor E Velculescu; Bert Vogelstein
Journal:  Science       Date:  2007-10-11       Impact factor: 47.728

9.  Patterns of somatic mutation in human cancer genomes.

Authors:  Christopher Greenman; Philip Stephens; Raffaella Smith; Gillian L Dalgliesh; Christopher Hunter; Graham Bignell; Helen Davies; Jon Teague; Adam Butler; Claire Stevens; Sarah Edkins; Sarah O'Meara; Imre Vastrik; Esther E Schmidt; Tim Avis; Syd Barthorpe; Gurpreet Bhamra; Gemma Buck; Bhudipa Choudhury; Jody Clements; Jennifer Cole; Ed Dicks; Simon Forbes; Kris Gray; Kelly Halliday; Rachel Harrison; Katy Hills; Jon Hinton; Andy Jenkinson; David Jones; Andy Menzies; Tatiana Mironenko; Janet Perry; Keiran Raine; Dave Richardson; Rebecca Shepherd; Alexandra Small; Calli Tofts; Jennifer Varian; Tony Webb; Sofie West; Sara Widaa; Andy Yates; Daniel P Cahill; David N Louis; Peter Goldstraw; Andrew G Nicholson; Francis Brasseur; Leendert Looijenga; Barbara L Weber; Yoke-Eng Chiew; Anna DeFazio; Mel F Greaves; Anthony R Green; Peter Campbell; Ewan Birney; Douglas F Easton; Georgia Chenevix-Trench; Min-Han Tan; Sok Kean Khoo; Bin Tean Teh; Siu Tsan Yuen; Suet Yi Leung; Richard Wooster; P Andrew Futreal; Michael R Stratton
Journal:  Nature       Date:  2007-03-08       Impact factor: 49.962

10.  Cancer proliferation gene discovery through functional genomics.

Authors:  Michael R Schlabach; Ji Luo; Nicole L Solimini; Guang Hu; Qikai Xu; Mamie Z Li; Zhenming Zhao; Agata Smogorzewska; Mathew E Sowa; Xiaolu L Ang; Thomas F Westbrook; Anthony C Liang; Kenneth Chang; Jennifer A Hackett; J Wade Harper; Gregory J Hannon; Stephen J Elledge
Journal:  Science       Date:  2008-02-01       Impact factor: 47.728

