Identifying cancer driver genes in tumor genome sequencing studies.

Abstract

MOTIVATION: Major tumor sequencing projects have been conducted in the past few years to identify genes that contain 'driver' somatic mutations in tumor samples. These genes have been defined as those for which the non-silent mutation rate is significantly greater than a background mutation rate estimated from silent mutations. Several methods have been used for estimating the background mutation rate.
RESULTS: We propose a new method for identifying cancer driver genes, which we believe provides improved accuracy. The new method accounts for the functional impact of mutations on proteins, variation in background mutation rate among tumors and the redundancy of the genetic code. We reanalyzed sequence data for 623 candidate genes in 188 non-small cell lung tumors using the new method. We found several important genes like PTEN, which were not deemed significant by the previous method. At the same time, we determined that some genes previously reported as drivers were not significant by the new analysis because mutations in these genes occurred mainly in tumors with large background mutation rates.
AVAILABILITY: The software is available at: http://linus.nci.nih.gov/Data/YounA/software.zip.

Year: 2010 PMID: 21169372 PMCID: PMC3018819 DOI: 10.1093/bioinformatics/btq630

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Major tumor sequencing projects (Ding et al.; Greenman et al.; Sjoblom et al.; Wood et al.) have been conducted and initiated in the past few years to identify genes that are frequently mutated and are therefore expected to have primary roles in tumor development. One of the challenges in interpreting these data is distinguishing driver mutations, which have a role in oncogenesis or in the cancer phenotype, from passenger mutations, which accumulate through DNA replication but are irrelevant to tumor development. To find driver genes, each gene is tested for whether its mutation rate is significantly higher than the background (or passenger) mutation rate. The background mutation rate is estimated from silent mutations, which do not change the encoded amino acid and are therefore considered to be passenger mutations.

All current methods for estimating the background mutation rate share a common approach in which the background non-silent mutation rate ρ_N is estimated as the product ρ_N = ρ_S R, where the background silent mutation rate ρ_S is obtained by dividing the observed number of silent mutations by the number of base pairs sequenced, and R is the average ratio of the number of potential non-silent mutations to the number of potential silent mutations. Having estimated ρ_N, each gene can be tested with a binomial test for whether its number of mutations is significantly greater than that expected under the background mutation rate. The methods used for calculating R vary. Ding et al. calculated R in the following way. They mutated each nucleotide of each codon in silico to determine whether the change results in a non-silent or silent mutation. They then calculated the average of each hypothetical non-silent or silent mutation by weighting it according to its relative frequency. Some investigators (Sjoblom et al.) further divide mutations into several types according to the nucleotide and the neighboring nucleotides of the mutations.
They estimate a separate background mutation rate for each mutation type by multiplying the relative frequency of each mutation type by the background rate. They then test each gene with a likelihood ratio test to assess whether the number of mutations occurring in the gene is unlikely under the background mutation rates.

There are three shortcomings in the approaches previously developed for identifying driver genes. First, previous approaches ignore the fact that different types of mutations can have different impacts on proteins. Non-silent mutations include missense mutations, which change an amino acid to another amino acid; nonsense mutations, which change an amino acid to a stop codon; mutations in splice sites; and insertions or deletions (indels). Indels can be further divided into two types, inframe indels and frameshift indels, according to whether they change the reading frame. Since frameshift indels and nonsense mutations change all the amino acids that come after the amino acid where the mutation occurred, they have the greatest impact on protein function. Mutations in splice sites also have a strong impact since they disrupt splicing. Different types of missense mutations may also have different impacts depending on how similar the chemical properties of the original and new amino acids are. Several studies have also shown that selection pressures vary by mutation type and sequence location in cancer mutation datasets. Greenman et al. developed tests to examine the significance of selection toward missense, nonsense and splice site mutations in somatic cancer mutation datasets. They found that the selection pressures for nonsense and splice mutations are much higher than those for missense mutations. Radivojac et al. also found that somatic cancer mutation datasets have a significant enrichment for mutations disrupting phosphorylation sites.
If two genes A and B have a similar number of mutations, but all mutations in gene A are expected to affect its protein function significantly while those in gene B are not, then gene A is more likely to be a driver gene than gene B. However, the current methods cannot differentiate genes A and B since they ignore the information on mutation types.

A second limitation of previous approaches is that they ignore the fact that different samples have different background mutation rates. Tumors differ substantially with regard to the number of somatic mutations accumulated. Samples are exposed to different levels of mutagens (for example, smoking) and some samples have mutations in genes that repair mutations. Therefore, some samples have a much higher background mutation rate than others. This can be seen in Figure 1 for the data from Ding et al., who sequenced 623 genes in 188 tumor samples to identify 1013 non-silent mutations and 108 silent mutations. Figure 1 shows the distribution of the number of mutations that occurred in each of the 188 samples. The number of mutations per sample ranges from 0 to 54.
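As a point of reference for the common approach described at the start of this section (ρ_N = ρ_S R followed by a binomial test), the calculation can be sketched as below. This is a minimal illustration and not the software of any cited study: the function names and arguments are hypothetical, the number of trials is taken to be the number of sequenced base pairs summed over samples, and the upper tail is computed by direct summation, which loses precision for extremely small tail probabilities.

```python
import math


def binomial_upper_tail(n, p, x):
    """Return P(X >= x) for X ~ Binomial(n, p), with 0 < p < 1.

    The lower tail P(X < x) is summed term by term in log space and
    subtracted from one.
    """
    if x <= 0:
        return 1.0
    below = 0.0
    for k in range(x):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1)
                    - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))
        below += math.exp(log_term)
    return max(0.0, 1.0 - below)


def baseline_gene_p_value(n_silent, bp_silent, ratio_r, gene_bp, gene_nonsilent):
    """Baseline test: one background rate shared by all samples and genes.

    n_silent       -- observed number of silent mutations
    bp_silent      -- base pairs screened for silent mutations (all samples)
    ratio_r        -- R, potential non-silent over potential silent mutations
    gene_bp        -- base pairs sequenced for the gene (all samples)
    gene_nonsilent -- observed non-silent mutations in the gene
    """
    rho_s = n_silent / bp_silent   # background silent rate
    rho_n = rho_s * ratio_r        # background non-silent rate
    return binomial_upper_tail(gene_bp, rho_n, gene_nonsilent)
```

Because a single rate is shared by every sample, this baseline cannot tell a gene mutated in lightly mutated tumors from one mutated only in heavily mutated tumors, which is the second limitation discussed above.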

Fig. 1.

Histogram of the number of mutations per sample. The data are from Ding et al., who sequenced 623 genes in 188 tumor samples.

If a gene has mutations only in the samples with a high background mutation rate, then those mutations are more likely to be byproducts of the high background mutation rate than a cause of cancer. In contrast, if a gene has mutations only in the samples with a low background mutation rate, then the gene is more likely to be a driver gene even if the number of mutations is small. If we assume the same background mutation rate across samples, the analysis will be biased toward falsely identifying as drivers those genes that have mutations in highly mutated samples and toward missing those genes with a small number of mutations in samples with low mutation rates.

Third, previous approaches ignore the fact that a different number of non-silent mutations can occur at each base pair according to the genetic code. For example, consider the codon TGG, which encodes the amino acid tryptophan. Since this is the only codon encoding tryptophan, any mutation at any nucleotide of the codon changes the amino acid. Therefore, any mutation results in a non-silent mutation. In contrast, six codons encode the amino acid arginine: AGA, AGG, CGA, CGG, CGC and CGT. Therefore, within the codon CGA, no non-silent mutations can occur at the third position of the codon and only two non-silent mutations can occur at the first position. If a protein A consists mostly of tryptophan and a protein B consists mostly of arginine, the gene encoding protein A is susceptible to more non-silent mutations than the gene encoding protein B. Thus, methods ignoring this difference will tend to misclassify as drivers genes rich in codons with limited redundancy and to misclassify as non-drivers genes rich in codons with substantial redundancy.

In this article, we propose and evaluate a new method for identifying driver genes.
In Section 2, we will define P-values for testing whether a gene is a driver gene. In Section 3, we will evaluate the new method using lung tumor genome sequences.
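The codon redundancy argument above (TGG versus CGA) can be checked with a short sketch that counts, for each position of a codon, how many of the three possible single-base substitutions change the encoded amino acid or introduce a stop codon. The helper name `nonsilent_counts` is hypothetical; the table is the standard genetic code.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def nonsilent_counts(codon):
    """Number of non-silent single-base substitutions at each codon position."""
    counts = []
    for pos in range(3):
        n = 0
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            # A substitution is non-silent if the amino acid (or stop) changes.
            if CODON_TABLE[mutant] != CODON_TABLE[codon]:
                n += 1
        counts.append(n)
    return counts


print(nonsilent_counts("TGG"))  # tryptophan: every substitution is non-silent
print(nonsilent_counts("CGA"))  # arginine: third position is fully redundant
```

For TGG all nine substitutions are non-silent, whereas for CGA the first position allows two non-silent substitutions and the third position allows none, matching the example in the text.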

2 METHODS

2.1 Definition of P-values for identifying driver genes

For each gene i, we test whether the number of samples with ‘driver-like’ non-silent mutations is higher than that expected under the background mutation model M0. Let Y_ij = 1 if gene i has at least one non-silent mutation in sample j and Y_ij = 0 otherwise; then Y_ij is a Bernoulli random variable. Define s_ij = P(Y_ij = 0|M0). Since we assume a different mutation rate for each sample, the probability s_ij varies across samples j = 1,…, J. It is calculated from the background mutation model described in Section 2.2.

We assign a score to every possible non-silent mutation according to its expected impact on protein function, with higher scores for mutations with stronger impact. As will be shown, the order of the scores rather than their actual values determines the test statistic. Therefore, one can assign any score to each non-silent mutation as long as it reflects the order of its impact on protein function. We assign scores so that they comply with the following order: missense < inframe indel < mutation in splice sites < frameshift indel = nonsense. We also assign different scores to different types of missense mutations based on the BLOSUM80 matrix, which is a matrix of scores for each of the 190 possible substitutions of the 20 standard amino acids.

Let T_ij be the maximum score of the non-silent mutations that occurred in sample j for gene i. If no mutation occurred, let T_ij = 0. Define F_ij(x) = P(T_ij < x|Y_ij = 1, M0). We can obtain the distribution F_ij from the background mutation model described in Section 2.2. Our test statistic Z_i is built from these quantities, and we define the P-value as P(Z_i < z_i|M0), where z_i is the observed value of Z_i. Z_i can be interpreted as a sum of the mutated-sample indicators Y_ij, each weighted by a quantity that depends on s_ij and F_ij(T_ij). Since s_ij is larger than 0.5, the weights are negative. The larger s_ij and F_ij(T_ij), the smaller the weight. Therefore, mutations with higher scores (stronger impact on protein function) occurring in samples with low mutation rates (samples with large s_ij) contribute more to decreasing Z_i and thus the P-value.

We can generate the distribution of Z_i under the background mutation model by simulating Y_ij from a Bernoulli distribution with success probability 1 − s_ij and T_ij from F_ij for j = 1,…, J and then calculating the resulting value of Z_i. We can then approximate the P-value P(Z_i < z_i|M0) by computing the tail area of this background distribution beyond the observed value z_i for each gene i.
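The simulation-based P-value described above can be sketched as follows. The explicit weight used in the test statistic is not reproduced in this text, so the weight is passed in as a function; `placeholder_weight` is a hypothetical stand-in chosen only because it has the stated properties (negative when s > 0.5, decreasing in both s and F), and it should not be read as the published formula. The per-sample score distributions are assumed to be given as dictionaries mapping a score to its probability conditional on a mutation being present.

```python
import math
import random


def cdf_below(score_dist, x):
    """F(x) = P(T < x | Y = 1) for a discrete score distribution."""
    return sum(p for score, p in score_dist.items() if score < x)


def placeholder_weight(s, f):
    # HYPOTHETICAL weight, not the paper's formula: negative for s > 0.5
    # and decreasing in both s and f.
    return math.log((1.0 - s) / s) - f


def statistic(y, t, s, score_dists, weight=placeholder_weight):
    """Sum of weights over the samples in which the gene is mutated."""
    return sum(weight(s[j], cdf_below(score_dists[j], t[j]))
               for j in range(len(y)) if y[j] == 1)


def monte_carlo_p_value(z_obs, s, score_dists, n_sim=10000,
                        weight=placeholder_weight, seed=0):
    """Estimate P(Z < z_obs | M0) by simulating Y and T for every sample."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        y, t = [], []
        for j in range(len(s)):
            mutated = rng.random() < 1.0 - s[j]   # P(Y = 1) = 1 - s
            y.append(1 if mutated else 0)
            if mutated:
                scores = list(score_dists[j])
                probs = [score_dists[j][v] for v in scores]
                t.append(rng.choices(scores, weights=probs)[0])
            else:
                t.append(0)                        # no mutation: T = 0
        if statistic(y, t, s, score_dists, weight) < z_obs:
            hits += 1
    return hits / n_sim
```

With a finite number of simulations the smallest non-zero estimate is 1/n_sim, so a reported P-value of 0 only means that no simulated statistic fell below the observed one.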

2.2 Background mutation model

The most distinguishing features of our background mutation model are that it does not assume separate mutation rates for non-silent and silent mutations and that it assumes separate mutation rates for different samples. We assume that each passenger mutation is generated from one background mutation rate process and that whether the mutation is non-silent or silent depends on the genetic code. There are six types of mutations: A:T → G:C, A:T → C:G, A:T → T:A, G:C → A:T, G:C → T:A and G:C → C:G. The transitions A:T → G:C and G:C → A:T change a purine to another purine or a pyrimidine to another pyrimidine. The transversions change a purine to a pyrimidine or vice versa. Because transitions occur more frequently than transversions, we assume separate mutation rates for transitions and transversions. It is also generally observed that mutations occur more often at C:G than at A:T and that a C:G in a CpG dinucleotide has a higher mutation rate than a C:G outside a CpG dinucleotide. Therefore, we assume a separate background mutation rate for each combination of base pair type and CpG dinucleotide context. We also assume that different tumor samples have different mutation rates. To keep the number of parameters manageable, we assume that the relative frequencies of the different types of mutations are the same for each sample. Thus, the mutation rate in sample j for mutation type m is defined as the product of p_m, the ratio of the mutation rate of type m to that of type 1 (A:T → G:C), and q_j, the mutation rate of sample j for mutation type 1 (Table 1).

Table 1.

Background mutation rates

Mutation type	Mutation type ID	Mutation rate
A:T → G:C	1	q_j p_1
A:T → C:G	2	q_j p_2
A:T → T:A	2	q_j p_2
C:G → T:A at non CpG	3	q_j p_3
C:G → A:T at non CpG	4	q_j p_4
C:G → G:C at non CpG	4	q_j p_4
C:G → T:A at CpG	5	q_j p_5
C:G → A:T at CpG	6	q_j p_6
C:G → G:C at CpG	6	q_j p_6
Inframe indels	7	q_j p_7
Frameshift indels	8	q_j p_8

*j is sample index.
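The classification in Table 1 can be expressed as a small lookup, sketched below for single-base substitutions (indel types 7 and 8 are not covered). The function names are hypothetical, and the reference and alternate bases are assumed to be read from the same strand, so that A→G and T→C both count as A:T → G:C.

```python
PURINES = {"A", "G"}


def is_transition(ref, alt):
    """True for purine-to-purine or pyrimidine-to-pyrimidine changes."""
    return (ref in PURINES) == (alt in PURINES)


def mutation_type_id(ref, alt, in_cpg=False):
    """Mutation type ID from Table 1 for a single-base substitution."""
    if ref not in "ACGT" or alt not in "ACGT" or ref == alt:
        raise ValueError("ref and alt must be two different bases")
    if ref in "AT":
        # A:T base pair: type 1 (transition) or type 2 (transversion).
        return 1 if is_transition(ref, alt) else 2
    # C:G base pair: types 3 and 4 outside CpG, types 5 and 6 inside CpG.
    first = 5 if in_cpg else 3
    return first if is_transition(ref, alt) else first + 1


def background_rate(q_j, p, type_id):
    """Rate q_j * p_m for sample j and mutation type m (p[1] is 1 by definition)."""
    return q_j * p[type_id]
```

For example, `mutation_type_id("C", "T", in_cpg=True)` falls in type 5, so its rate in sample j is q_j p_5.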

To estimate the parameters in the background mutation model, we could fit the model in Table 1 to the sequences for which silent mutations were identified. (Most previous projects have evaluated silent mutations for only a subset of the genes.) However, to estimate the background mutation rate for insertions and deletions (indels), which are non-silent, we also included in our estimation genes which have at most one non-silent mutation across all tumor samples; these genes are not likely to be related to tumorigenesis and thus the non-silent mutations in these genes are likely to be passenger mutations. Because we selected these genes based on the total number of mutations occurring in each gene, the estimated background rates for these genes may be biased. Since the selection was based on the total number of mutations, it is unlikely that the relative frequencies of different types of mutations are subject to the bias, but the sample-specific mutation rates may be. Let q′_j be the mutation rate of sample j for mutation type 1 in the selected genes. Then we assume q′_j = r · q_j, where r is the selection bias and q_j is the unbiased sample-specific mutation rate.

To estimate the parameters r, q_j and p_1,…, p_8, we first define the position of base pairs across all the sequenced genes. Since we assume that background mutation rates are independent of genes, we do not need to differentiate genes. Therefore, we concatenate all the sequenced genes and number the position of each base pair from 1 to N, the total number of base pairs sequenced. Let K denote the subset of positions of the base pairs belonging to the genes used for silent mutation detection and let L denote the subset of positions of the base pairs belonging to the genes which have at most one non-silent mutation across all samples.
For position k in genes for which silent mutations have been evaluated, the probability that a silent transition of type i (second column in Table 1) occurs in sample j equals q_j c_k p_i, where c_k is 1 if a transition at position k results in a silent mutation and 0 otherwise. The probability that a silent transversion of type i occurs at that position equals q_j d_k p_i, where d_k is the number of silent transversions possible at position k (0, 1 or 2). The full set of probability definitions is shown in Table 2 in terms of the indicators X_jk, which indicate the type of mutation occurring at position k in sample j. Since any mutation is either silent or non-silent, c_k + e_k = 1 (the number of possible transition mutations) and d_k + f_k = 2 (the number of possible transversion mutations). A mutation within a splice site is considered to be non-silent; therefore c_k = d_k = 0, e_k = 1 and f_k = 2 if k belongs to a splice site.

Table 2.

Definition of probabilities of X_jk

a_k = (c_k p_t + d_k p_v) I(k ∈ K) + r (e_k p_t + f_k p_v + p_7 + p_8) I(k ∈ L);

I(x), indicator function, 1 if x is true and 0 otherwise;

c_k, number of silent transitions possible at position k (0 or 1); d_k, number of silent transversions possible at position k (0, 1 or 2); e_k, number of non-silent transitions possible at position k (0 or 1); f_k, number of non-silent transversions possible at position k (0, 1 or 2); t, mutation type ID for the transition at position k (1, 3 or 5); v, mutation type ID for the transversion at position k (2, 4 or 6); non, no mutation; sts, silent transition; stv, silent transversion; nts, non-silent transition; ntv, non-silent transversion; iid, inframe indel; fid, frameshift indel.
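The constants c, d, e and f defined above can be derived from a codon and the standard genetic code, as in the following sketch. It assumes a single transcript with a known reading frame (the ambiguous multi-transcript case is not handled), and the function name `position_counts` and the `is_splice_site` flag are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
PURINES = {"A", "G"}


def position_counts(codon, pos, is_splice_site=False):
    """Return (c, d, e, f) for position pos (0, 1 or 2) of a codon.

    c, e -- silent and non-silent transitions possible (c + e = 1)
    d, f -- silent and non-silent transversions possible (d + f = 2)
    """
    if is_splice_site:
        # Every substitution in a splice site is treated as non-silent.
        return 0, 0, 1, 2
    c = d = e = f = 0
    ref = codon[pos]
    for alt in BASES:
        if alt == ref:
            continue
        mutant = codon[:pos] + alt + codon[pos + 1:]
        silent = CODON_TABLE[mutant] == CODON_TABLE[codon]
        transition = (ref in PURINES) == (alt in PURINES)
        if transition:
            c, e = (c + 1, e) if silent else (c, e + 1)
        else:
            d, f = (d + 1, f) if silent else (d, f + 1)
    return c, d, e, f
```

For the third position of CGA this gives c = 1, d = 2, e = 0, f = 0, so only silent mutations are possible there, whereas the first position of TGG gives c = 0, d = 0, e = 1, f = 2.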

All of the constants shown in Table 2 can be determined from the gene sequence and the genetic code. However, the values are ambiguous when genes have several alternative transcripts and some base pairs belong to different codons in the alternative transcripts. We describe how to determine the values of the constants in such cases in the Supplementary Material.

2.3 Estimation of parameters

We use the method of moments to estimate r and p_m. The process of obtaining the method of moments estimates r̂ and p̂ is described in the Supplementary Material. The estimation of q_j is more complex because the number of base pairs sequenced per sample is not sufficient to estimate the extremely small mutation rate q_j accurately. For example, no mutations were found for many samples in the data from Ding et al. Therefore, the maximum likelihood estimate of q_j would be zero for those samples, which is a problematic point estimate. To improve the accuracy of the estimates, we use empirical Bayes methods to estimate the distribution of q_j. Empirical Bayes methods borrow information from all the samples when estimating each q_j and therefore give more robust estimates (Casella, 1985). We assume that the prior distribution f of q_j is uniform on (α, β). As estimates of α and β, we use the values maximizing the marginal likelihood given the estimates r̂ and p̂.

The posterior distribution of q_j given the data X_jk for k ∈ K ∪ L, j = 1,…, J and the estimated parameters is proportional to the prior f(q_j) multiplied by the product of the probabilities P(X_jk|q_j, r̂, p̂), where the product is over all positions k, the probabilities are computed from the formulas in Table 2 and λ is the normalizing constant. We use the posterior distribution of q_j in calculating s_ij = P(Y_ij = 0|M0), rather than a point estimate of q_j, to take into account the uncertainty in the point estimate. Therefore, s_ij is obtained by integrating, over the posterior distribution of q_j, the probability that no non-silent mutation occurs at any position k ∈ G_i, where G_i is the subset of positions of the base pairs belonging to gene i. The integration with regard to the posterior distribution of q_j is performed numerically. The resulting values of s_ij are used as described in Section 2.1 for computing statistical significance.

The distributions F_ij(x) = P(T_ij < x|Y_ij = 1, M0) are also needed for the significance tests used to identify driver genes and are computed from the distributions of T′_jk, where T′_jk is the score of the mutation occurring at position k in sample j. The distribution P(T′_jk < x|X_jk = nts, ntv, iid or fid) can be easily calculated from the genetic code and the background mutation model. The derivation of F_ij(x) is explained in the Supplementary Material.
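The empirical Bayes step can be illustrated with a simplified numerical sketch. It is not the paper's likelihood: as a labeled assumption, the product of Table 2 probabilities for one sample is replaced by a Poisson-type approximation q^n exp(−q A), where n is the number of background mutations observed in the sample and A is the summed exposure (the sum of a_k over positions). The values of α and β are taken as given, α must be positive, and the integral is approximated by a midpoint sum on a uniform grid. All function names are hypothetical.

```python
import math


def posterior_on_grid(n_mut, exposure, alpha, beta, n_grid=2000):
    """Posterior of q on a uniform grid over (alpha, beta), uniform prior.

    Likelihood (ASSUMED approximation): q**n_mut * exp(-q * exposure).
    Returns the grid midpoints and normalized posterior weights.
    """
    step = (beta - alpha) / n_grid
    grid = [alpha + (i + 0.5) * step for i in range(n_grid)]
    log_w = [n_mut * math.log(q) - q * exposure for q in grid]
    top = max(log_w)                       # subtract max to avoid underflow
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return grid, [v / total for v in w]


def posterior_mean(grid, weights):
    """Posterior mean of q."""
    return sum(q * w for q, w in zip(grid, weights))


def prob_no_nonsilent(grid, weights, gene_exposure):
    """s = P(Y = 0 | M0), averaged over the posterior of q.

    gene_exposure is the summed non-silent rate coefficient over the
    positions of the gene; exp(-q * gene_exposure) ASSUMES the same
    Poisson-type approximation as above.
    """
    return sum(w * math.exp(-q * gene_exposure) for q, w in zip(grid, weights))
```

Under this sketch a sample with no observed mutations still receives a strictly positive posterior mean for q, which is the motivation given in the text for preferring the posterior over the maximum likelihood estimate.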

3 RESULTS

We applied our method to the data of Ding et al. They sequenced coding exons and splice donor/acceptor sites (dinucleotides at the 5′/3′ ends of introns) of 623 genes in 188 samples from patients with lung adenocarcinoma and identified 1013 non-silent mutations. They selected a subset of 250 genes in which they identified 108 silent mutations for measuring a background mutation rate. The table describing all the identified mutations is available in the paper of Ding et al., but the patient-specific gene sequences are not. Thus, we used the reference sequence of the coding exons and splice donor/acceptor sites of the 623 genes from Ensembl release 46.

3.1 Simulation study

We first performed a simulation study to evaluate our method. For the comparison with the method of Ding et al., we did not include the mutation score T_ij in the test statistic; that is, we used a version of the test statistic that omits the score term in place of the full statistic Z_i. We generated simulated data based on the data of Ding et al. We first generated passenger mutations by shuffling the locations of all observed non-silent and silent mutations across the genes sequenced. There are 1013 non-silent mutations observed in 623 genes and 108 silent mutations observed in 250 genes. For these mutations, we changed the base pair positions in which the mutations occurred as follows: we randomly sampled new positions from the base pair positions within the sequence of all genes that correspond to the same base pair type as the mutation. If a mutation occurred at base pair A, we sampled its new position from all the positions within the sequence of all genes corresponding to base pair A. If the base pair is G or C, we also restricted the sampling by the CpG dinucleotide context. We then determined which of these mutations are non-silent or silent according to the genetic code. Since we randomly sampled the base pair positions of all the mutations, they become evenly spread across all genes.

To see the effect of variation of mutation rates across samples, we changed the sample ID in which each mutation occurred by sampling a new sample ID under two different distributions, namely moderate sample variation and high sample variation. The first distribution, moderate sample variation, is estimated from the background mutations of the data from Ding et al. We sampled each sample ID with probability proportional to the number of passenger mutations (silent mutations and non-silent mutations observed in genes with at most one non-silent mutation) that occurred in the sample. For the second distribution, high sample variation, we increased the mutation rates of the 10 samples with the highest mutation rates by a factor of 10. Finally, we made 20 driver genes by adding five non-silent mutations to each of 20 selected genes.

In our simulations, we used the true expected ratio of non-silent to silent mutations (R) in applying the method of Ding et al. because we did not have their software for estimating R. This may somewhat overestimate the accuracy of their method. Each simulation was repeated for 200 replications. The average numbers of true and false positive driver genes claimed at P-value cutoffs of 0.005 and 0.01 are shown in Table 3. Our method finds more true positives and fewer false positives than the method of Ding et al. We performed a Wilcoxon signed-rank test of the null hypothesis that the distributions of the number of true or false positives from the two methods are the same and present one-sided P-values in the last column of Table 3. For moderate sample variation, the P-values for false positives are 0.0001 and 0.0008, and the P-values for true positives are less than 10^−16. For high sample variation, all the P-values are less than 10^−16. This shows that the difference in the number of true positives or false positives between the two methods is significant and becomes more significant as the sample variation grows larger.
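The two resampling steps of the simulation (new positions with the same base and CpG context, new sample IDs drawn in proportion to passenger mutation counts) can be sketched as follows. The function names are hypothetical, the sequence is assumed to be a single concatenated string of A, C, G and T, and no correction is made for artificial CpG dinucleotides created where two genes are joined.

```python
import random


def site_class(seq, k):
    """Base at position k plus a flag for CpG context (C before G, or G after C)."""
    base = seq[k]
    in_cpg = ((base == "C" and k + 1 < len(seq) and seq[k + 1] == "G")
              or (base == "G" and k > 0 and seq[k - 1] == "C"))
    return base, in_cpg


def shuffle_positions(seq, positions, rng):
    """Move each mutated position to a random site of the same class."""
    by_class = {}
    for k in range(len(seq)):
        by_class.setdefault(site_class(seq, k), []).append(k)
    return [rng.choice(by_class[site_class(seq, k)]) for k in positions]


def resample_sample_ids(n_mutations, passenger_counts, rng):
    """Draw sample IDs with probability proportional to passenger counts."""
    ids = list(range(len(passenger_counts)))
    return rng.choices(ids, weights=passenger_counts, k=n_mutations)
```

The high sample variation setting described above would correspond to multiplying the ten largest entries of `passenger_counts` by 10 before calling `resample_sample_ids`.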

Table 3.

Result for simulated data

Sample variation	Cutoff	Measure (average count)	Our method	Ding's method	P-value
Moderate	0.005	TP	12.9	9.9	<1e-16
Moderate	0.005	FP	1.3	1.7	1e-04
Moderate	0.01	TP	14.9	11.7	<1e-16
Moderate	0.01	FP	3	3.4	8e-04
High	0.005	TP	13.4	9.9	<1e-16
High	0.005	FP	0.2	2.0	<1e-16
High	0.01	TP	15.1	11.7	<1e-16
High	0.01	FP	0.6	3.9	<1e-16

TP, true positives; FP, false positives.

3.2 Results for the data of Ding

We identified 28 genes as driver genes with the false discovery rate (FDR) controlled at 5% using the Benjamini and Hochberg method. These include EGFR, CDKN2A, KRAS, STK11, TP53, EPHA3, NF1, ATM, RB1, APC, INHBA, ERBB4, PTPRD, FGFR4, PTEN, EPHA5, NTRK3, NTRK1, KDR, LRP1B, PAK3, NRAS, LTK, ZMYND10, EPHA7, MYO3B, NTRK2 and TFDP1. The P-values of the selected genes are given in Table 4.

Table 4.

Driver genes by new method

Gene name	P-value
EGFR	0
CDKN2A	0
KRAS	0
STK11	0
TP53	0
EPHA3	2e-06
NF1	2e-06
ATM	3e-06
RB1	4e-06
APC	1.3e-05
INHBA	6.8e-05
ERBB4	0.000109
PTPRD	0.000145
FGFR4	0.000146
PTEN	0.000210
EPHA5	0.000237
NTRK3	0.000298
NTRK1	0.000298
KDR	0.000319
LRP1B	0.000518
PAK3	0.000750
NRAS	0.000848
LTK	0.000876
ZMYND10	0.001091
EPHA7	0.001116
MYO3B	0.001151
NTRK2	0.001322
TFDP1	0.001404
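The Benjamini and Hochberg step-up procedure used to control the FDR can be sketched as below. The function name is hypothetical, and the input is assumed to contain the P-values of all tested genes (here, all 623), not only those listed in Table 4.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of hypotheses rejected at the given FDR level.

    Step-up rule: find the largest rank k such that the k-th smallest
    P-value is at most k * fdr / m, then reject the k smallest P-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / m:
            k_max = rank
    return sorted(order[:k_max])
```

Note that a gene can be rejected even if its own P-value exceeds its rank threshold, provided a larger P-value further down the sorted list meets its threshold.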

By the method of Ding et al., 22 genes were found to be significant at the 5% FDR level. These include two genes that we do not find significant. Our method finds eight genes that they do not find significant. We drew a map of the genes selected by each method versus tumor samples in Figure 2. The genes are ordered by the P-value obtained by our method and the samples are ordered according to the total number of genes with non-silent mutations (among all 623 genes). The genes indicated by the red bar are those which Ding et al. find significant but we do not. The genes indicated by the yellow bar are those which we find significant but they do not. The genes indicated by the blue bar are those which both methods find significant.

Fig. 2.

Map of the 30 selected genes versus tumor samples. Tumor samples with/without mutations in genes are labeled yellow/blue. The rows (genes) are ordered according to the P-value obtained by our method. The columns (samples) are ordered according to the total number of genes with non-silent mutations (among all 623 genes) in the corresponding sample. The red/blue/yellow banner across the left side of the map shows the difference between the genes selected by the two methods: our method and the method of Ding et al. The genes covered by the red bar are the additional genes found by the method of Ding et al. and those covered by the yellow bar are the additional genes found by our method. The genes covered by the blue bar are those which both methods find significant.

For most of the well-known oncogenes and tumor suppressor genes in the dataset we analyzed (EGFR, CDKN2A, KRAS, STK11, TP53, NF1, RB1, PTEN, NRAS), the mutations occurring in them have very high mutation scores. For example, in STK11, most mutations are frameshift indels, nonsense mutations or mutations in splice sites. Even the missense mutations represent poorly conserved amino acid changes. In RB1, all seven mutations that occurred in the gene are frameshift indels, nonsense mutations or mutations in splice sites. This is consistent with the premise of our scoring system that mutations in driver genes tend to have a strong impact on protein function. By incorporating mutation scores in calculating P-values, driver genes obtain smaller P-values and are thus better identified. For the well-known driver genes EGFR, CDKN2A, KRAS, STK11 and TP53, which already have computed P-values of zero owing to frequent mutations, scoring makes little difference. But for genes like NF1, RB1, APC, INHBA, ERBB4, FGFR4, PTEN and NRAS, the P-values are on average about one-third of the P-values calculated without incorporating the mutation scores.
Most of those genes are well-known cancer driver genes, and incorporating mutation scores helps in their identification. Not having mutations with high scores does not preclude a gene from being a potential driver gene, but such cases tend to be infrequently mutated genes whose mutations occur in samples with large mutation rates, as was the case for CDH11 and PDGFRA. These two genes are selected as significant by Ding's method, but not by our method. Figure 2 shows that the mutations in these genes are clustered in the highly mutated samples, except for one mutation in PDGFRA. However, most mutations in PDGFRA are missense mutations with low mutation scores, which offsets the effect of that low mutation rate on the test statistic of PDGFRA. Also, some of the mutations in both genes occur in the same sample; therefore our method, which is based on the number of samples with mutations rather than the total number of mutations, assigned larger P-values to them than the method of Ding et al.

There are eight genes beside the yellow bar. The gene with the smallest P-value is PTEN, a well-known tumor suppressor gene. Ding et al. did not find it significant because the total number of mutations is so small (four). However, since each of the four mutations occurred in a different sample with a low mutation rate and the score of each mutation is high (one nonsense mutation and three missense mutations with high scores), our method finds it significant. The gene with the second smallest P-value is NRAS, a well-known oncogene. Although there were only three mutations in total in this gene, all of them are the same missense mutation changing glutamine to leucine, which has a high score. Also, one of the mutations occurred in a sample with a low mutation rate, so we find the gene significant. The gene ZMYND10 is a candidate tumor suppressor gene whose association with carcinomas is suggested by Agathanggelou et al.; Cho (2007); Lerman and Minna (2000); Marsit et al.; Qiu et al.
The gene EPHA7 is a member of the ephrin receptor family and is known to be related to oncogenesis (Kiyokawa et al.). Two other members of the family, EPHA3 and EPHA5, are selected as significant by both methods, which supports the involvement of EPHA7 in oncogenesis. NTRK2 is a member of the neurotrophic tyrosine receptor kinase (NTRK) family, which phosphorylates members of the MAPK pathway. It is known to be potentially implicated in oncogenesis (Marchetti et al.), and the other two members of the NTRK receptor family, NTRK1 and NTRK3, are selected as significant by both methods, supporting the implication of NTRK2 in oncogenesis. TFDP1 is a transcription factor whose overexpression or amplification is known to be associated with carcinomas (Melchor et al.; Yasui et al.). The role of LTK and MYO3B in oncogenesis is not well known.

4 DISCUSSION

We have developed a new method for identifying driver genes that has several methodological advantages compared with the previously used methods. First, we assign scores to non-silent mutations according to their expected impacts on the protein function so that the genes with more ‘driver-like’ mutations will get smaller P-values. Second, we permit each sample to have a different background mutation rate. This has the effect of reducing the false positives and increasing true positives, which was confirmed by the simulation study. Third, instead of assuming separate background mutation rates for non-silent and silent mutations, we assume that each passenger mutation is generated from one background mutation rate process and that whether the mutation is non-silent or silent depends on the genetic code. Thus, our model accounts for the variable number of possible non-silent mutations that can occur at each base pair according to the genetic code. This takes into account the difference in the number of possible non-silent mutations between genes according to the codon usage within genes. Fourth, we take into account uncertainties in the background mutation rate by using empirical Bayes methods. These methodological advances contributed to identifying a different set of driver genes when compared with those identified by Ding ). First, we did not find the genes CDH11 and PDGFRA which Ding ) found significant. These genes are not selected by our method because they are mainly mutated in the highly mutated samples and the scores of the mutations are not high. Second, we found PTEN, NRAS, LTK, ZMYND10, EPHA7, MYO3B, NTRK2 and TFDP1, which Ding ) did not find significant. It shows that our method is more sensitive in finding genes whose total number of mutations is small. Although we believe that our method provides an improvement over the previous methods, there is room for improvement by extending our approach. 
First, we measure the functional impact of a mutation by the significance of the amino acid change that it causes. However, the functional impact also depends on the position at which a mutation occurs. For example, all three mutations in NRAS occurred at exactly the same base pair position, which implies that a mutation at that position is crucial to the function of the protein. If a score could be estimated for each position that measures the importance of the position to protein function, it could be used in our test statistics in the same way as the mutation score T. Second, the current scoring system, which assigns mutation scores in the order missense mutation < inframe indel < mutation in splice sites < frameshift indel = nonsense mutation, may be biased toward identifying tumor suppressor genes over oncogenes, because loss-of-function mutations such as frameshift indels or nonsense mutations occur more frequently in tumor suppressor genes than in oncogenes. Our use of the BLOSUM80 matrix to refine the scoring of missense mutations helps in the identification of new oncogenes. Alternative scoring systems could be used, however, to increase sensitivity for identifying oncogenes. For example, we could assign missense mutations the same scores as in the current method but assign reduced scores to indels, mutations in splice sites and nonsense mutations. Third, we could refine our background mutation model in Table 1 so that all six types of mutations, A:T → G:C, A:T → C:G, A:T → T:A, G:C → A:T, G:C → T:A and G:C → C:G, have separate mutation rates. Currently, we separate the mutation rates according to mutation type (transition or transversion), base pair type (A:T or G:C) and context (CpG or non-CpG dinucleotide). We therefore do not distinguish between the two transversions at each base pair: A:T → C:G and A:T → T:A at A:T, and G:C → T:A and G:C → C:G at G:C, whether in a CpG or a non-CpG context.
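The grouping of substitutions described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names `six_type` and `coarse_category` are hypothetical, the CpG status is assumed to be supplied by the caller, and the category labels are not taken from Table 1.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def six_type(ref, alt):
    """Map a substitution to one of the six strand-independent types, e.g. 'G:C->A:T'."""
    if ref not in COMPLEMENT or alt not in COMPLEMENT or ref == alt:
        raise ValueError("ref and alt must be two different bases from A, C, G, T")
    if ref in ("T", "C"):
        # Read the change from the opposite strand so the reference base is A or G.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}:{COMPLEMENT[ref]}->{alt}:{COMPLEMENT[alt]}"


def coarse_category(ref, alt, in_cpg=False):
    """Pool the two transversions at each base pair into a single category.

    Returns (base pair, 'transition' or 'transversion', context), where the
    context is 'CpG' or 'non-CpG' for G:C base pairs and 'any' for A:T.
    """
    source, target = six_type(ref, alt).split("->")
    # After strand normalization the reference base is a purine (A or G),
    # so the change is a transition exactly when the new base is also a purine.
    kind = "transition" if target[0] in ("A", "G") else "transversion"
    if source == "G:C":
        context = "CpG" if in_cpg else "non-CpG"
    else:
        context = "any"
    return source, kind, context


if __name__ == "__main__":
    for ref, alt in (("A", "C"), ("A", "T"), ("C", "T"), ("C", "A")):
        print(ref, alt, six_type(ref, alt), coarse_category(ref, alt))
```

In this sketch, A:T->C:G and A:T->T:A fall into the same coarse category, which is the pooling that a refined six-type model would undo by using the `six_type` label as the category instead.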
However, if the two transversions at a base pair have quite different mutation rates, pooling them may induce bias. We therefore evaluated a model in which each of them has a separate mutation rate, using the simulated data generated as described in Section 3.1. Over 200 repeated simulations, we calculated the average numbers of true and false positives for this model. Compared with the original model, the refined model increased both true positives and false positives (Supplementary Table S1). Fourth, for larger datasets, one could refine our background mutation model to differentiate between the coding and non-coding strands. Currently, we assume that the mutation rate at the base pair A:T, for example, is the same whether A or T is on the coding strand. Using separate mutation rates for the coding and non-coding strands, however, would increase the number of parameters almost 2-fold and would therefore be feasible only for large datasets. Fifth, we did not take into account correlations among mutations in identifying driver genes. Indeed, none of the existing methods for identifying driver mutations that we are aware of utilizes estimates of synergism or antagonism for pairs of mutations. However, strong positive or negative correlations between mutations in several pairs of genes have been observed. Therefore, one could attempt to utilize the correlation structure among mutations in identifying driver genes. Finally, one might combine copy number variation and sequencing data to identify driver genes. In this article, we used only genomic sequence changes. However, changes in protein function related to oncogenesis are frequently caused by copy number variation. It is therefore desirable to integrate copy number variation and sequence changes when both types of data are available.
Our method can be extended to include copy number variation in the test statistics: for each gene, we can test whether the number of samples with a ‘driver-like’ non-silent mutation or copy number variation is higher than expected under the background mutation model.

The analysis of tumor sequencing data is of key importance for understanding oncogenesis, identifying molecular targets and personalizing therapy. Learning to read the tumor genome is complex, however, and new methods of analysis are needed. We believe that methods such as those we have described, which account for the functional impact of mutations, sample variation in mutation rates and the redundancy of the genetic code, will be useful for identifying genes that drive the pathogenesis of cancer.
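The copy number extension mentioned above could take many forms. As one hedged sketch, suppose that for a given gene each sample j has a known background probability p_j of carrying at least one alteration (a non-silent mutation or a copy number change), with p_j allowed to differ between samples. The number of altered samples then follows a Poisson-binomial distribution under the background model, and its upper tail gives a P-value. This is a simplification introduced here for illustration: it ignores the mutation scores and the empirical Bayes step of our method, and the names `altered_samples` and `poisson_binomial_tail` as well as all numbers in the example are hypothetical.

```python
def altered_samples(mutated, cnv):
    """Count samples with a driver-like non-silent mutation or a copy number change."""
    return len(set(mutated) | set(cnv))


def poisson_binomial_tail(probs, k):
    """Return P(X >= k), where X is a sum of independent Bernoulli(p_j) variables."""
    dist = [1.0]  # dist[m] = probability of m altered samples among those processed so far
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for m, mass in enumerate(dist):
            new[m] += mass * (1.0 - p)  # current sample not altered
            new[m + 1] += mass * p      # current sample altered
        dist = new
    return sum(dist[max(k, 0):])


if __name__ == "__main__":
    # Hypothetical per-sample background probabilities for one gene; the two
    # highly mutated samples receive larger values.
    background = [0.01] * 8 + [0.10, 0.15]
    mutated = {"s1", "s4"}   # samples with a driver-like non-silent mutation
    cnv = {"s4", "s7"}       # samples with a copy number change
    k = altered_samples(mutated, cnv)
    print(k, poisson_binomial_tail(background, k))
```

Because the probabilities are sample-specific, alterations that occur only in samples with large background rates contribute less evidence than the same number of alterations in samples with small background rates, which mirrors the behavior of our method for genes such as CDH11 and PDGFRA.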

REFERENCES (15 in total)

1. Statistical analysis of pathogenicity of somatic mutations in cancer.

Authors: Chris Greenman; Richard Wooster; P Andrew Futreal; Michael R Stratton; Douglas F Easton
Journal: Genetics Date: 2006-06-18 Impact factor: 4.562

2. Epigenetic inactivation of the candidate 3p21.3 suppressor gene BLU in human cancers.

Authors: Angelo Agathanggelou; Ashraf Dallol; Sabine Zöchbauer-Müller; Catherine Morrissey; Sofia Honorio; Luke Hesson; Tommy Martinsson; Kwun M Fong; Michael J Kuo; Po Wing Yuen; Eamonn R Maher; John D Minna; Farida Latif
Journal: Oncogene Date: 2003-03-13 Impact factor: 9.867

3. The consensus coding sequences of human breast and colorectal cancers.

Authors: Tobias Sjöblom; Siân Jones; Laura D Wood; D Williams Parsons; Jimmy Lin; Thomas D Barber; Diana Mandelker; Rebecca J Leary; Janine Ptak; Natalie Silliman; Steve Szabo; Phillip Buckhaults; Christopher Farrell; Paul Meeh; Sanford D Markowitz; Joseph Willis; Dawn Dawson; James K V Willson; Adi F Gazdar; James Hartigan; Leo Wu; Changsheng Liu; Giovanni Parmigiani; Ben Ho Park; Kurtis E Bachman; Nickolas Papadopoulos; Bert Vogelstein; Kenneth W Kinzler; Victor E Velculescu
Journal: Science Date: 2006-09-07 Impact factor: 47.728

4. The 630-kb lung cancer homozygous deletion region on human chromosome 3p21.3: identification and evaluation of the resident candidate tumor suppressor genes. The International Lung Cancer Chromosome 3p21.3 Tumor Suppressor Gene Consortium.

Authors: M I Lerman; J D Minna
Journal: Cancer Res Date: 2000-11-01 Impact factor: 12.701

5. Overexpression of ERK, an EPH family receptor protein tyrosine kinase, in various human tumors.

Authors: E Kiyokawa; S Takai; M Tanaka; T Iwase; M Suzuki; Y Y Xiang; Y Naito; K Yamada; H Sugimura; I Kino
Journal: Cancer Res Date: 1994-07-15 Impact factor: 12.701

6. The candidate tumor suppressor gene BLU, located at the commonly deleted region 3p21.3, is an E2F-regulated, stress-responsive gene and inactivated by both epigenetic and genetic mechanisms in nasopharyngeal carcinoma.

Authors: Guo-Hua Qiu; Luke K S Tan; Kwok Seng Loh; Chai Yen Lim; Gopesh Srivastava; Sen-Tien Tsai; Sai Wah Tsao; Qian Tao
Journal: Oncogene Date: 2004-06-10 Impact factor: 9.867

7. The genomic landscapes of human breast and colorectal cancers.

Authors: Laura D Wood; D Williams Parsons; Siân Jones; Jimmy Lin; Tobias Sjöblom; Rebecca J Leary; Dong Shen; Simina M Boca; Thomas Barber; Janine Ptak; Natalie Silliman; Steve Szabo; Zoltan Dezso; Vadim Ustyanksky; Tatiana Nikolskaya; Yuri Nikolsky; Rachel Karchin; Paul A Wilson; Joshua S Kaminker; Zemin Zhang; Randal Croshaw; Joseph Willis; Dawn Dawson; Michail Shipitsin; James K V Willson; Saraswati Sukumar; Kornelia Polyak; Ben Ho Park; Charit L Pethiyagoda; P V Krishna Pant; Dennis G Ballinger; Andrew B Sparks; James Hartigan; Douglas R Smith; Erick Suh; Nickolas Papadopoulos; Phillip Buckhaults; Sanford D Markowitz; Giovanni Parmigiani; Kenneth W Kinzler; Victor E Velculescu; Bert Vogelstein
Journal: Science Date: 2007-10-11 Impact factor: 47.728

8. Hypermethylation of RASSF1A and BLU tumor suppressor genes in non-small cell lung cancer: implications for tobacco smoking during adolescence.

Authors: Carmen J Marsit; Duk-Hwan Kim; Mei Liu; Philip W Hinds; John K Wiencke; Heather H Nelson; Karl T Kelsey
Journal: Int J Cancer Date: 2005-03-20 Impact factor: 7.396

9. Patterns of somatic mutation in human cancer genomes.

Authors: Christopher Greenman; Philip Stephens; Raffaella Smith; Gillian L Dalgliesh; Christopher Hunter; Graham Bignell; Helen Davies; Jon Teague; Adam Butler; Claire Stevens; Sarah Edkins; Sarah O'Meara; Imre Vastrik; Esther E Schmidt; Tim Avis; Syd Barthorpe; Gurpreet Bhamra; Gemma Buck; Bhudipa Choudhury; Jody Clements; Jennifer Cole; Ed Dicks; Simon Forbes; Kris Gray; Kelly Halliday; Rachel Harrison; Katy Hills; Jon Hinton; Andy Jenkinson; David Jones; Andy Menzies; Tatiana Mironenko; Janet Perry; Keiran Raine; Dave Richardson; Rebecca Shepherd; Alexandra Small; Calli Tofts; Jennifer Varian; Tony Webb; Sofie West; Sara Widaa; Andy Yates; Daniel P Cahill; David N Louis; Peter Goldstraw; Andrew G Nicholson; Francis Brasseur; Leendert Looijenga; Barbara L Weber; Yoke-Eng Chiew; Anna DeFazio; Mel F Greaves; Anthony R Green; Peter Campbell; Ewan Birney; Douglas F Easton; Georgia Chenevix-Trench; Min-Han Tan; Sok Kean Khoo; Bin Tean Teh; Siu Tsan Yuen; Suet Yi Leung; Richard Wooster; P Andrew Futreal; Michael R Stratton
Journal: Nature Date: 2007-03-08 Impact factor: 49.962

Review 10. Nasopharyngeal carcinoma: molecular biomarker discovery and progress.

Authors: William Chi-Shing Cho
Journal: Mol Cancer Date: 2007-01-02 Impact factor: 27.401
