Exome sequencing reveals comprehensive genomic alterations across eight cancer cell lines.

Han Chang, Donald G Jackson, Paul S Kayne, Petra B Ross-Macdonald, Rolf-Peter Ryseck, Nathan O Siemers.

Abstract

It is well established that genomic alterations play an essential role in oncogenesis, disease progression, and response of tumors to therapeutic intervention. Advances in next-generation sequencing (NGS) technologies provide unprecedented capabilities to scan genomes for changes such as mutations, deletions, and alterations of chromosomal copy number. However, the cost of full-genome sequencing still prevents the routine application of NGS in many areas. Capturing and sequencing the coding exons of genes (the "exome") can be a cost-effective approach for identifying changes that result in alteration of protein sequences. We applied an exome-sequencing technology (Roche Nimblegen capture paired with 454 sequencing) to identify sequence variation and mutations in eight commonly used cancer cell lines from a variety of tissue origins (A2780, A549, Colo205, GTL16, NCI-H661, MDA-MB468, PC3, and RD). We showed that this technology can accurately identify sequence variation, providing ∼95% concordance with Affymetrix SNP Array 6.0 performed on the same cell lines. Furthermore, we detected 19 of the 21 mutations reported in the Sanger COSMIC database for these cell lines. We identified an average of 2,779 potential novel sequence variations/mutations per cell line, of which 1,904 were non-synonymous. Many non-synonymous changes were identified in kinases and known cancer-related genes. In addition, we confirmed that the read depth of exome sequence data can be used to estimate high-level gene amplifications and identify homozygous deletions. In summary, we demonstrate that exome sequencing can be a reliable and cost-effective way to identify alterations in cancer genomes, and we have generated a comprehensive catalogue of genomic alterations in coding regions of eight cancer cell lines. These findings could provide important insights into cancer pathways and mechanisms of resistance to anti-cancer therapies.

Year: 2011 PMID: 21701589 PMCID: PMC3118809 DOI: 10.1371/journal.pone.0021097

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240

Introduction

All cancer cells have somatic mutations in their genomes, such as single nucleotide mutations, insertions, deletions, and copy-number gain or loss. Genomic lesions in cancer cells disrupt normal functions and pathways such as proliferation and apoptosis, and are essential for tumorigenesis, growth, and metastasis. In addition, each tumor carries a unique combination of mutations in its genome, leading to heterogeneity in cancer prognosis and responses to therapeutic intervention. Our limited understanding of the more common mutations has already affected therapeutic regimens. For example, treatment with small molecule inhibitors of the epidermal growth factor receptor (EGFR) has been shown to primarily benefit lung cancer patients who carry certain somatic mutations in their EGFR gene [1], [2]. Similarly, certain antibody therapies directed against EGFR only show efficacy in the subset of colorectal cancer patients with a wild-type KRAS gene [3], [4]. Deep systematic characterization of somatic mutations in cancer genomes promises to be a powerful tool for both understanding cancer pathways and developing targeted therapeutics. Over the last two decades, focused studies on candidate genes have led to the identification of mutations occurring with high frequency in crucial cancer pathway genes such as TP53, KRAS, and PTEN [5]. In recent years, the coding regions of breast, lung, colon, and brain tumor genomes have been analyzed using capillary-based sequencing technologies. These efforts have led to the identification of causative mutations in previously unsuspected genes such as IDH1, highlighting the power and importance of unbiased, genomic-scale mutation discovery [6], [7], [8]. However, large-scale capillary-based sequencing technologies are time-consuming and expensive, and thus not feasible for wider use. Next-generation sequencing (NGS) technologies have increased the throughput and decreased the cost of DNA sequencing by several orders of magnitude.
A number of studies have applied NGS technologies to sequence cancer genomes, as summarized in recent reviews [9], [10]. However, sequencing the whole genome is still cost-prohibitive for many potentially valuable applications. One alternative to whole genome methods is exome sequencing, which captures and sequences only coding exons in the genome. Exome sequencing methods can deliver sequencing information for much of the functionally relevant genome at increased coverage and reduced cost. Recent studies have successfully applied exome sequencing to identify causal mutations of Mendelian diseases [11], [12]. Large cancer genome initiatives such as The Cancer Genome Atlas project also include exome sequencing as part of their strategy to characterize cancer genomes [13]. Protein kinases are the most ubiquitous family of signaling molecules in human cells and play essential roles in regulating most cellular functions [14]. Since the protein kinase family is one of the most frequently mutated gene families in cancers [5], it has been subjected to several focused genomic sequencing studies. Bardelli et al. conducted the first systematic screen of mutations in the receptor tyrosine kinase subfamily of protein kinases, in colorectal cancer samples [15]. Since then, studies in primary tissues and cell lines have identified many mutations in protein kinases across multiple tumor types [16], [17], [18]. The interest in mutations of kinases has continued with recent genome-wide mutation discovery studies [13], [19], [20]. Cell line models of human cancer have played a critical role in our understanding of cancer disease pathways, identification and validation of cancer target genes, and our ability to screen potential anticancer drugs. These cell lines carry genomic mutations inherited from their source tumor cells, although additional mutations can be acquired during the course of cell line development and passage. 
In general, comparisons between cell lines reveal substantial heterogeneity in genomic mutations and reflect cancer pathways similar to those found in primary tumors. For example, comparison of a panel of breast cancer cell lines with a collection of primary breast samples showed that gene expression and copy number profiles in cell lines mirror those found in the primary tumors [21]. Similarly, genomic mutations reported in the COSMIC database for cell lines have a similar spectrum to those in primary tumors [22]. As additional large-scale tumor genome sequencing results become available, there is a growing need for corresponding cell models to determine how novel variants affect protein function. Comprehensive characterization of genomic alterations in cancer cell lines will advance our understanding of cancer biology, and could also provide a basis for choosing relevant cell line models to study a particular aspect of cancer disease biology, or to screen for antagonists of certain cancer pathways. To evaluate NGS technologies and to characterize genomic mutations in cancer cell lines, we have analyzed data from the Roche Nimblegen exome capturing array and Roche 454 NGS technologies, applied to eight commonly used cell lines representing several major cancer types. We demonstrate that exome sequencing can be a reliable and cost-effective way to identify genomic alterations in cancer genomes, and we have generated a comprehensive catalogue of genomic alterations in coding regions of eight cancer cell lines.

Results

Exome capture and sequencing results

Exome capture and 454 sequencing technologies were applied to DNA samples from eight cancer cell lines (A2780, A549, COLO205, GTL16, NCI-H661, MDA-MB468, PC3, and RD), as described in Methods. The results of initial data processing are summarized in Table 1. For each cell line, about 1.9 million sequencing reads (688 million bases; 98.5% of total sequencing reads) could be successfully mapped to the human genome NCBI36/hg18 reference assembly (http://www.ncbi.nlm.nih.gov). The average read length across all cell lines is 364 bases, consistent with the long read length reported for the 454 sequencing technology. On average, 89.5% of the circa 180,000 exons on the Nimblegen 2.1 M human exome array (target regions) were covered with at least one sequencing read, and the average sequencing read depth for all cell lines is 7.3 in target regions. The exome capture and sequencing results are within the normal range of performance specified by the manufacturer and are comparable with published results using the same technology [23].

Table 1

Exome capture and sequencing results.

Cell-line	A2780	A549	COLO205	GTL16	NCI-H661	MDA-MB468	PC3	RD	Average
Cancer type	ovary	lung	colon	stomach	lung	breast	prostate	soft tissue
Number of mapped reads (% total reads)	2112926 (98.29%)	1906737 (98.87%)	1707216 (98.39%)	1932251 (98.42%)	1774006 (98.55%)	1843735 (98.36%)	1789248 (98.42%)	2150836 (98.76%)	1902119 (98.5%)
Number of mapped bases (% total bases)	751 Mb (99.14%)	730 Mb (99.53%)	613 Mb (99.11%)	715 Mb (98.42%)	641 Mb (99.41%)	665 Mb (99.3%)	657 Mb (99.4%)	735 Mb (98.76%)	688 Mb (99.1%)
Average read length	355	382	360	368	360	359	365	366	364
Target regions coverage	87.6%	90.8%	90.8%	89.4%	91.7%	85.6%	87.1%	92.6%	89.5%
Average read depth (target regions)	7.7	8.1	6.7	7.6	6.7	6.6	6.7	8.5	7.3
Total variant detected (target regions)	16036	14283	13768	14296	13966	14931	12701	14741	14340
Novel variant detected (target regions)	3563	2769	2075	3111	2759	3021	2150	2786	2779
Novel non-synonymous variant (target regions)	2243	1977	1463	2121	1974	1949	1538	1967	1904

We detected on average 14,340 sequence variants (differences from the human reference genome) per cell line. The majority of these differences are known polymorphisms in the normal human population (i.e. recorded in the NCBI dbSNP database, build 130). On average 2,779 variants per cell line are not found in the dbSNP database, and therefore represent novel sequence variations and/or somatic mutations. On average 1,904 of the 2,779 novel variants are non-synonymous, i.e. they alter codon specificity. These variants are more likely to change protein functions and impact cellular phenotypes.

Concordance with genotyping results

As another means to assess the accuracy of exome sequencing, we compared the data with genotyping results across the eight cell lines (Table 2). The Affymetrix Genome-Wide Human SNP Array 6.0 is designed to detect genotype information for about one million known SNP positions. It can therefore provide independent verification of variations observed in the exome sequence data. For each cell line, we identified SNP Array 6.0 positions with successful genotype calls that were also covered by at least two unique exome sequencing reads. The overlap yielded between 26,407 and 29,650 SNP positions (depending on cell line) for further analysis. Overall, there was an average of 91% concordance between genotype calls from SNP array 6.0/Birdseed and those determined by exome sequencing. In the RD cell line, for example, 26,154 (91.5%) out of 28,594 SNP positions have the same genotype call (i.e., AA, AB, or BB) by SNP array 6.0 and by exome sequencing (Table 2).

Table 2

Comparison of genotype calls by SNP 6 chip and exome sequencing.

	All SNP			homozygous SNP			heterozygous SNP
cell-line	all SNP	same genotype call	percent	homozygous SNP	same genotype call	percent	heterozygous SNP	same genotype call	percent
A2780	29193	25493	87%	22079	21559	98%	7114	3934	55%
A549	26407	24065	91%	22627	22188	98%	3780	1877	50%
Colo205	27638	24780	90%	23526	22797	97%	4112	1983	48%
GTL16	29650	27474	93%	26447	25691	97%	3203	1783	56%
NCI-H661	29117	26901	92%	26737	25756	96%	2380	1145	48%
MDA-MB468	29360	27088	92%	25914	25221	97%	3446	1867	54%
PC3	27914	25505	91%	24421	23729	97%	3493	1776	51%
RD	28594	26154	91%	24616	23870	97%	3978	2284	57%
Average	28484	25933	91%	24546	23851	97%	3938	2081	52%
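The percentages in Table 2 can be recomputed directly from the counts. The following is a minimal sketch in Python; the dictionary layout and the function names (`concordance`, `summarize`) are illustrative choices, not part of the original analysis, and the numbers are copied from Table 2.

```python
# Counts copied from Table 2: (SNPs compared, concordant genotype calls)
# for homozygous and heterozygous positions in each cell line.
TABLE2 = {
    "A2780":     {"hom": (22079, 21559), "het": (7114, 3934)},
    "A549":      {"hom": (22627, 22188), "het": (3780, 1877)},
    "Colo205":   {"hom": (23526, 22797), "het": (4112, 1983)},
    "GTL16":     {"hom": (26447, 25691), "het": (3203, 1783)},
    "NCI-H661":  {"hom": (26737, 25756), "het": (2380, 1145)},
    "MDA-MB468": {"hom": (25914, 25221), "het": (3446, 1867)},
    "PC3":       {"hom": (24421, 23729), "het": (3493, 1776)},
    "RD":        {"hom": (24616, 23870), "het": (3978, 2284)},
}


def concordance(total, same):
    """Percent of compared positions with identical genotype calls."""
    return 100.0 * same / total


def summarize(table):
    """Return per-cell-line concordance for all, homozygous and heterozygous SNPs."""
    rows = {}
    for line, counts in table.items():
        hom_n, hom_same = counts["hom"]
        het_n, het_same = counts["het"]
        rows[line] = {
            "all": concordance(hom_n + het_n, hom_same + het_same),
            "hom": concordance(hom_n, hom_same),
            "het": concordance(het_n, het_same),
        }
    return rows


summary = summarize(TABLE2)
for line, pct in summary.items():
    print(f"{line:10s} all={pct['all']:.1f}% hom={pct['hom']:.1f}% het={pct['het']:.1f}%")
```

Splitting the totals this way makes it visible that the overall figure is dominated by homozygous positions, which far outnumber heterozygous ones in every cell line.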

It is expected that the accuracy of genotype detection by sequencing will be influenced both by sequencing read depth and by heterozygosity at a given genomic location. We calculated concordance of genotype calls at different sequencing read depths, and separately for homozygous or heterozygous SNPs. As shown in Figure 1, concordance is high for homozygous SNPs (average 97%) regardless of sequencing read depth. Concordance for heterozygous alleles is lower, but increases with sequence read depth, starting with 31% concordance at a read depth of 3 and reaching >90% at a read depth of 10 or higher. In theory, sequencing DNA fragments from a region that contains a heterozygous SNP is a process of random sampling. At lower sequencing depth, there is a higher chance of missing one of the two alleles. We calculated the theoretical rate of detecting both alleles by sequencing at different read depths, assuming no error in sequencing (Figure 1, dashed line). At low read depths, our experimental observations are close to the theoretical rate, indicating that low concordance at low read depths is likely due to the random sampling process rather than poor quality of sequence data.
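The random-sampling argument can be made concrete with a short calculation. The sketch below is one possible reading of the model described in Methods, under stated assumptions: each read samples the reference allele A or the variant allele B with probability 0.5, at least two B reads are needed to declare B, and a site where every read carries B is called BB. The function name `p_detect_both` and its parameters are hypothetical, and the output is not claimed to reproduce the dashed line in Figure 1 exactly.

```python
from math import comb


def p_detect_both(n, min_variant_reads=2, p_variant=0.5):
    """Probability of calling AB at a heterozygous site covered by n reads.

    Assumes error-free reads, each drawn independently from allele B with
    probability p_variant (an assumption of this sketch).
    """
    def pmf(k):
        # Binomial probability of exactly k B-allele reads among n reads.
        return comb(n, k) * p_variant**k * (1 - p_variant) ** (n - k)

    # P1: fewer than min_variant_reads B reads, so the site is called AA.
    p1 = sum(pmf(k) for k in range(min(min_variant_reads, n + 1)))
    # P2: every read carries B, so the site is called BB.
    # When n < min_variant_reads this case is already counted in P1.
    p2 = pmf(n) if n >= min_variant_reads else 0.0
    return 1.0 - p1 - p2


for depth in (3, 5, 10, 20):
    print(depth, round(p_detect_both(depth), 3))
```

Under these assumptions the detection probability at a depth of 3 is 0.375 and rises above 0.98 at a depth of 10, which is in line with the qualitative trend described above.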

Figure 1

Sequencing depth and concordance between genotype.

The graph displays a plot of average concordance of genotype calls obtained from the Affymetrix SNP Array 6.0 and from exome sequencing, as a function of sequencing read depths. Square markers indicate concordance at homozygous positions, diamond markers indicate concordance at heterozygous positions. The dashed line shows the theoretical rate of detecting heterozygous positions by sequencing (as described in Methods). Triangle markers display average number of heterozygous SNP locations per cell-line as a function of sequencing read depths (Y-axis on the right).

Comparison of exome sequencing to the COSMIC database of cancer mutations

The protein-coding exons and immediate flanking intron sequences of 61 common cancer genes have previously been systematically determined in about 800 cell lines by the Wellcome Trust Sanger Institute, using capillary-based sequencing [22]. Of the eight cell lines in this study, all except one (GTL16) have been screened in that project. We compared somatic mutation information from the Sanger COSMIC database with our exome sequencing results for the seven cell lines. As shown in Table 3, exome sequencing re-discovered 19 of the 21 mutations reported in the COSMIC database, including point mutations and small insertion/deletions. The two missing cases are due to lack of sequence coverage in the locus of interest: the documented STK11 mutation in A549 is not measurable due to lack of STK11 gene coverage in the Nimblegen 2.1 M human exome arrays, and the TP53 gene is covered by the Nimblegen array but lacks sufficient reads in the PC3 line to verify in this study (there are sufficient reads for the TP53 gene in other lines, as in Table 3).

Table 3

Comparison of exome sequencing results with mutations reported in the COSMIC database.

Cell-line	Gene	COSMIC report	Exome sequencing (reference allele reads ∶ variant allele reads)	Notes
A2780	PTEN	KGR128–130 del (Hom)	KGR 128–130 del (0∶3)
A549	KRAS	G12S (Hom)	G12S (0∶5)
A549	SMARCA4	Q729fs (23 bp del) (Hom)	Q729fs (23 bp del) (0∶4)
A549	CDKN2A	large region deletion (Hom)	large region deletion (Hom)	zero read depth in 14 consecutive regions
A549	STK11	Q37* (Hom)	Not in exome capture array	gene not in exome capture array
Colo205	BRAF	V600E (Het)	V600E (1∶2)
Colo205	TP53	26 bp del (Hom)	26 bp del (0∶4)
Colo205	SMAD4	904 bp del (Hom)	large region deletion (Hom)	zero read depth in 4 consecutive regions
Colo205	APC	T1556fs (insertion A) (Hom)	T1556fs (insertion A) (6∶5)	within a stretch of 6 A
NCI-H661	TP53	R158L (Hom)	R158L (5∶2)
NCI-H661	TP53	S215I (Het)	S215I (2∶4)
NCI-H661	SMARCA4	L1161fs (deletion G) (Hom)	L1161fs (deletion G) (0∶6)	within a stretch of 6 G
NCI-H661	CDKN2A	chr9_21960900 G→T (Hom)	chr9_21960900 G→T (0∶7)	splicing site
MDA-MB468	PTEN	chr10_89680827 G→T (Hom)	chr10_89680827 G→T (0∶9)	splicing site
MDA-MB468	TP53	R273H (Hom)	R273H (0∶4)
MDA-MB468	RB1	large region deletion (Hom)	large region deletion (Hom)	zero read depth in 35 consecutive regions
MDA-MB468	SMAD4	large region deletion (Hom)	large region deletion (Hom)	zero read depth in 16 consecutive regions
PC3	PTEN	large deletion (Hom)	large region deletion (Hom)	zero read depth in 34 consecutive regions
PC3	TP53	K139fs (Hom)	no sequencing read	no sequencing read
RD	NRAS	Q61H (Hom)	Q61H (5∶10)
RD	TP53	R248H (Hom)	R248H (Hom)

Large homozygous deletions, such as the known deletions of the CDKN2A gene in A549 and SMAD4 in Colo205 cells, cannot be directly observed with exome sequencing. But a deletion of gene regions can be inferred where the read depth is zero for several consecutive exons (see next section for detailed discussion). All five genomic deletions reported in the COSMIC database are identifiable from exome sequencing results (Table 3). For example, in the A549 cell line we observed 14 consecutive regions around CDKN2A gene with a read depth of zero. In the Colo205 cell line, a documented 904-base deletion in the SMAD4 gene manifests as 4 consecutive target regions with a read depth of zero.
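The idea of inferring a deletion from consecutive zero-depth regions can be sketched as a simple run detector. This is an illustrative Python sketch, not the procedure used in the study: the function name `zero_depth_runs`, the `min_run` threshold, and the example depths are all hypothetical.

```python
def zero_depth_runs(depths, min_run=3):
    """Find runs of consecutive regions with zero read depth.

    depths: per-region read depths in genomic order along one chromosome.
    Returns a list of (start, end) index pairs, inclusive, for every run
    of at least min_run consecutive zero-depth regions.
    """
    runs = []
    start = None
    for i, depth in enumerate(depths):
        if depth == 0:
            if start is None:
                start = i  # a new run of zeros begins here
        elif start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last region.
    if start is not None and len(depths) - start >= min_run:
        runs.append((start, len(depths) - 1))
    return runs


# Hypothetical depths: an isolated zero, a run of four zeros, a run of two.
example = [8, 6, 0, 7, 0, 0, 0, 0, 9, 5, 0, 0]
print(zero_depth_runs(example))  # [(4, 7)]
```

Requiring a minimum run length is a design choice that guards against isolated zero-depth regions, which are common at modest coverage and do not by themselves indicate a deletion.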

Detecting gene amplification and deletion

Deletions or amplifications of chromosomal segments are common alterations in cancer genomes. In principle, the sequencing read depth in a region should be proportional to its copy number. However, the relatively modest read depth of the current study could give undue weight to random variations in read depth. Variability in read depth could also arise from technical aspects of the exome sequencing process. For example, the exome capturing array could vary in efficiencies for different exon regions due to diverse sequence composition. To assess the possibility of estimating copy number information from our exome sequencing data, we compared average sequence read depths with copy-number data estimated from the SNP6 platform. As shown in Figure 2, there is a positive correlation between sequence read depth and copy number, with a Pearson correlation coefficient of 0.41. The variation in read depth makes it challenging to accurately detect low-level copy-number changes. On the other hand, we find that accurate detection of high-level gene amplifications and homozygous deletions is possible.
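For readers who want to repeat this kind of comparison on their own data, the Pearson correlation between per-region read depth and array-derived copy number can be computed as follows. This is a generic sketch: the function name `pearson_r` and the toy values are hypothetical and do not come from the study's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Hypothetical per-region values: average read depth and log2 copy-number ratio.
read_depth = [2.0, 7.5, 5.0, 9.0, 30.0, 0.0, 6.5]
log2_ratio = [-0.4, 0.1, 0.0, 0.2, 1.8, -2.5, -0.1]
print(f"r = {pearson_r(read_depth, log2_ratio):.2f}")
```

A moderate coefficient such as the reported 0.41 indicates that read depth tracks copy number only loosely region by region, which is consistent with the statement that low-level changes are hard to call.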

Figure 2

Comparison of sequencing read depth with copy number data in MDA-MB468 cell-line.

Average sequencing read depths in capture regions were plotted against copy number data estimated from Affymetrix SNP 6.0 data as described in the methods section. The blue line shows the linear regression line. The Pearson correlation coefficient (r = 0.41) of sequencing read depth and copy number data is printed on the figure.

Homozygous deletion of the SMAD4 gene region has been reported in the MDA-MB468 cell line (Sanger COSMIC database) and is thus illustrative for comparing deletion detection methods. The sequencing read depths of exon regions in the SMAD4 gene and surrounding area were determined for MDA-MB468 and plotted according to their chromosomal location (Figure 3A). Sixteen consecutive exon regions on chromosome 18 have a read depth of zero in the data for MDA-MB468. The genomic locations of the 16 exon regions are from 46.75 Mb to 46.86 Mb, which spans the SMAD4 gene. For comparison, we performed copy-number analysis of the Affymetrix SNP array 6.0 data as described in the methods section. For MDA-MB468, this analysis indicated a homozygous deletion of genomic region 46.76–46.86 Mb on chromosome 18 (Figure 3B), in good agreement with results from read depth analysis.

Figure 3

Sequencing read depth around the SMAD4 gene on chromosome 18 in the MDA-MB468 cell line.

A. Plots of read depth data on consecutive exons around the SMAD4 gene region on chromosome 18. The blue line shows sequencing read depth data for MDA-MB468, and the pink line shows the median sequencing read depth of all eight cell lines. B. Copy-number data from Affymetrix SNP6 chip data around the SMAD4 gene region on chromosome 18. The black line shows the segmented copy-number data (log2 ratio to normal samples) generated by the aroma.affymetrix package in R as described in the methods section.

A read depth of zero could result from technical issues, such as probe design in the Nimblegen 2.1 M array. In fact, we identified 2,513 exon regions that have a read depth of zero for all 8 cell lines (Table S1). However, since the median read depth across all 8 cell lines is greater than zero for all of the 16 exon regions (Figure 3A), it is unlikely that the observed depth of zero in the MDA-MB468 cell line is due to a systematic failure of exome capture. Random variation in read depth is another reason for lack of sequencing coverage. In the MDA-MB468 cell line, there are 17,161 exon regions with a read depth of zero (from 194,706 total regions, excluding the 2,513 regions mentioned above). It is highly unlikely that 16 consecutive exon regions around the SMAD4 gene would have a read depth of zero due to random variation (p = 1.3e-17, calculated from the binomial distribution). We were also able to re-identify previously documented gene amplification events using the read depth data. For example, amplification of EGFR1 in the MDA-MB468 cell line has been documented by fluorescence in situ hybridization and by quantitative PCR [24]. We observed that the 53 exon regions around the EGFR gene on chromosome 7 have very high read depths in the MDA-MB468 data (Figure 4A; the exons between 55.58–55.73 Mb have an average read depth of 107). Our copy number analysis of the Affymetrix SNP array 6.0 data also indicated that the EGFR gene region is highly amplified in the MDA-MB468 line (Figure 4B, genomic region 55.48–55.81 Mb).
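The reported probability for the SMAD4 region can be reproduced with a short calculation. The sketch below assumes that each region independently has zero depth with probability equal to the overall fraction of zero-depth regions in MDA-MB468; that independence assumption and the function name `run_probability` are ours, while the counts are taken from the text.

```python
def run_probability(zero_regions, total_regions, run_length):
    """Chance that run_length given regions all have zero depth,
    assuming regions are independent (an assumption of this sketch)."""
    p_zero = zero_regions / total_regions  # per-region chance of zero depth
    return p_zero ** run_length


# Counts from the text: 17,161 zero-depth regions out of 194,706,
# and 16 consecutive zero-depth regions around SMAD4.
p_run = run_probability(17161, 194706, 16)
print(f"{p_run:.1e}")  # 1.3e-17
```

Roughly 9% of regions have zero depth by chance, so sixteen in a row is expected only about once in 10^17 trials under this model.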

Figure 4

Sequencing read depth and amplification around the EGFR gene on chromosome 7 in the MDA-MB468 cell line.

A. Plots of read depth data on consecutive exons around the EGFR gene region on chromosome 7. The blue line shows sequencing read depth data for MDA-MB468, and the pink line shows the median sequencing read depth of all eight cell lines. B. Copy-number data from Affymetrix SNP6 chip data around the EGFR gene region on chromosome 7. The black line shows the segmented copy-number data (log2 ratio to normal samples) generated by the aroma.affymetrix package in R as described in the methods section.

Novel non-synonymous variants in protein kinases

Since mutations in protein kinases have important roles in cancer biology, we chose to examine the sequence data for protein kinases and focus on non-synonymous variations, which produce amino acid substitutions that may have functional consequences. As noted above, exome sequencing revealed circa 2,000 novel non-synonymous variants in each of the eight cell lines. After applying a stringent filter (as described in Methods), between 199 and 479 genes have novel non-synonymous variants, depending on the cell line (Table S2). The Nimblegen 2.1 M capture array used in this study included exons for 440 of the 518 protein kinases in the human genome (Table S3) [25]. In each cell line, an average of 122 non-synonymous variations were detected in kinase genes. After removing likely germline variants (found in dbSNP) and applying the stringent filter described above, each cell line has an average of eight kinases with non-synonymous variations (Table 4). These sequence variations in protein kinases are listed in Table 5. Most of these sequence variations are reported neither in the COSMIC database nor in the literature, but several have independent confirmation. For example, we identified EGFR variant A1048V in the GTL16 gastric cell line. The same variant in EGFR has been reported in the MKN45 gastric cell line [26], which is the parental cell line of GTL16 [27]. A second example is the R796S variant of the insulin receptor gene (INSR) in the RD cell line (Table 5). We had previously identified this variant in the RD cell line using capillary sequencing technology (data not shown).
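The stringent filter referred to here (at least three reads showing the variant, and a variant allele frequency of at least 40%, per Methods) is simple to express in code. The sketch below is illustrative only: the function name `is_high_confidence` is hypothetical, and it assumes that the reads covering a position are just the reference-allele and variant-allele reads. The read counts used as inputs are the reference:variant ratios listed in Table 3, chosen purely as worked examples.

```python
def is_high_confidence(ref_reads, variant_reads,
                       min_variant_reads=3, min_fraction=0.40):
    """Apply the high-confidence criteria to one variant.

    Assumes total coverage = ref_reads + variant_reads (reads carrying
    any other allele are ignored in this sketch).
    """
    total = ref_reads + variant_reads
    if total == 0:
        return False
    enough_reads = variant_reads >= min_variant_reads
    enough_fraction = variant_reads / total >= min_fraction
    return enough_reads and enough_fraction


# (reference reads, variant reads) as listed in Table 3.
examples = {
    "A549 KRAS G12S": (0, 5),
    "Colo205 BRAF V600E": (1, 2),
    "Colo205 APC T1556fs": (6, 5),
    "NCI-H661 TP53 R158L": (5, 2),
    "RD NRAS Q61H": (5, 10),
}
for name, (ref, var) in examples.items():
    print(name, is_high_confidence(ref, var))
```

Note that a genuine heterozygous mutation such as BRAF V600E would not pass with only two variant reads, which illustrates how the filter trades sensitivity for specificity at modest read depth.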

Table 4

Number of protein kinase genes with non-synonymous variants in each cell-line.

	All	Novel (−dbSNP)	Novel (high confidence*)
A2780	125	49	9
A549	131	61	7
Colo205	107	42	10
GTL16	122	52	4
NCI-H661	124	55	15
MDA-MB468	121	50	8
PC3	113	40	6
RD	133	63	5

*At least 3 reads show the variation, and allele frequency for the variant ≥ 40%.

Table 5

High confidence* non-synonymous variants in protein kinase genes in each of 8 cell-lines.

A2780		A549		Colo205		GTL16
ALPK2	721fs	ALPK2	G286C	ADRBK1	443fs	EGFR	A1048V
EPHA2	D232G	BRD3	K508-	EPHA2	R315Q	MAP3K4	395fs
FLT3	Q771P	HIPK3	D713G	EPHA8	L559F	PSKH2	E227G
HIPK3	G586R	MKNK1	406fs	FRAP1	P1193L	STK31	S160T
HSPB8	E179Q	PAK6	302fs	LRRK2	G1213S
LATS2	D1013N	SGK196	169fs	NEK9	D84N
MAP3K5	M375V	ZAP70	I342T	NTRK2	A203T
MYO3A	248fs			RNASEL	G59S
TRPM6	C943G			TAOK2	A867V
				TTN	S597C

*At least 3 reads show the variation, and allele frequency for the variant ≥ 40%.

Discussion

Analysis of data from eight diverse cancer cell lines shows that Roche Nimblegen and 454 exome sequencing technologies can be successfully applied to identify variations in gene-coding regions. From sequencing data with an average of 7.3-fold coverage, differences from the NCBI36 reference genome were identified in about 8% (14,340 regions) of all target regions on the exome capture array. While the majority of these variants could be confirmed in the dbSNP database, on average about 1.6% (2,779) of total target regions carry a novel variant. A comparison of SNP genotype calls from exome sequencing with data generated on the Affymetrix Genome-Wide Human SNP Array 6.0 showed that there is high concordance between the two technology platforms. The concordance is 97% for homozygous sites, and ranges from 30% to >90% at heterozygous positions, with accuracy dependent on sequencing read depth. Our analysis of the relationship between read depth and power of detection suggested that a minimum of ten-fold read depth is required for reliably detecting both alleles at heterozygous sites. These results provide guidance in planning future genome sequencing projects. For the seven examined cell lines that are also present in the COSMIC database, we show that 19 of 21 known mutations can be re-discovered by exome sequencing. Two previously described mutations were missing due to lack of sequence coverage. In one case this was due to incomplete coverage of the human exome in the Nimblegen 2.1 M capture array, indicating a need for improvements in array design. By successful re-identification of the EGFR amplification and the SMAD4 homozygous deletion in the MDA-MB468 cell line, we demonstrate that copy number alterations can be inferred from the sequencing read depth data. However, because of the stochastic nature of sequencing read depth and likely unevenness in the exome capturing process, in general it is not possible to reliably estimate copy-number information from our data.
Applying the technology to more samples would help improve our ability to estimate and correct for systematic biases in the platform, and increasing the depth of sequencing reads would reduce the variance due to random fluctuation in read number. To bring context to the genomic variation identified in this study, we chose to focus on protein kinases as an illustrative class. In this work, we identified with high confidence at least four novel variant protein kinases in each cell line. Most of the novel sequence variations in protein kinases identified in this study have not previously been reported, and probably reflect the high diversity of genomic alteration in cancer. Our results expand the knowledge of sequence variations in protein kinases and other potential cancer-related genes. These novel variants could be either germline SNPs not yet reported in the dbSNP database, or somatic mutations in these cancerous cells. Several large-scale human genome sequencing projects currently in progress will expand identification of germline SNPs and help to categorize the nature of novel variants found in tumors. In conclusion, we showed that exome sequencing can be a reliable and cost-effective approach to identify genomic alterations in cancer cell lines, and suggest ways to further improve exome-sequencing technologies for applications in cancer genomics. A comprehensive catalogue of genomic alterations in the coding regions of eight cancer cell lines was generated, which should contribute not only to our knowledge of these models in particular, but also to our understanding of cancer genomics and cancer biology in general.

Materials and Methods

DNA Preparation

A2780, A549, Colo205, GTL16, NCI-H661, MDA-MB468, PC3, and RD cell lines were originally obtained from ATCC. Cell lines were grown in RPMI 1640 (Gibco) with 10% heat-inactivated Fetal Bovine Serum (FBS; CellGro) with the exception of RD (additional 25 mM HEPES) and A549 (Ham's F12 (Gibco), with 10% FBS). Genomic DNA (10 µg) was prepared by QIAamp DNA Mini Kit (Qiagen) using the manufacturer's protocols, and provided to the Roche 454 Sequencing Center.

Exome Capture and Next-Generation Sequencing

Exome capture and next-generation sequencing were performed by Roche NimbleGen and Roche 454 Life Science according to the manufacturer's protocols. Genomic DNA was captured on the Nimblegen Sequence Capture Human Exome 2.1 M Array, which has 197,218 total regions (capture regions) covering about 175,278 exons and miRNA regions (target regions; a large target region may consist of several capture regions). For each cell line, captured DNA was sequenced with two runs of the 454 GS FLX Titanium Sequencing technology.

Array-based Genotyping and Copy-number Analysis

Two aliquots of 250 ng genomic DNA per sample were digested by restriction enzymes NspI and StyI, respectively. The resulting products were ligated to the corresponding adaptors and PCR amplified. The labeled PCR products were hybridized to the Affymetrix Genome-Wide Human SNP Array 6.0 according to the manufacturer's recommendations. The Birdseed algorithm [28] implemented in Affymetrix Power Tools (APT) Software Package (version 1.10.0) was used for genotype determination. For copy-number analysis, the CEL files were processed using the aroma.affymetrix package [29] for the R-project. Segmentation of normalized raw copy number data was performed with the CBS algorithm [30] implemented in the aroma.affymetrix package.

Bioinformatics analysis

The human genome NCBI36/hg18 reference assembly (http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html#b36) was used as the framework for all analyses. Sequence data processing, mapping to the human genome, and initial calls of variation from the reference sequence were performed by Roche 454 Life Sciences using GS Reference Mapper software (Roche Inc.). To qualify as a variant from the reference genome sequence, a difference had to be supported by at least two independent reads that 1) show the difference, 2) have at least 5 bases on both sides of the difference, and 3) have few other isolated sequence differences in the read. Variants identified as 'high confidence' were subject to a more stringent filter, requiring at least three independent reads with the variant, comprising at least 40% of all independent reads covering the genomic position of the allele. To identify non-synonymous variants, the impact of each variant on the translated protein sequence was assessed by mapping its genomic coordinates back to genes in the RefSeq collection [31], release 37, and identifying changes in codon specificity.

We calculated the theoretical rate of detection at heterozygous positions as a function of read depth as follows. N sequencing reads covering a heterozygous position can be considered as random sampling of the two alleles repeated N times, and should therefore follow the binomial distribution. Assuming that allele A is reported in the human reference genome and allele B is the variant allele, we require at least two sequencing reads with the B allele to declare the detection of allele B. The probability of detecting both the A and B alleles at a heterozygous position can be calculated as P_AB = 1 − P_1 − P_2. P_1 is the probability of finding 0 or 1 read with the B allele in N sequencing reads according to the binomial distribution, which would lead to a genotype call of AA. P_2 is the probability of finding N reads with the B allele in N sequencing reads according to the binomial distribution, which would lead to a genotype call of BB.

Capture regions that have zero read depth in all 8 cell lines. (XLS)

All novel non-synonymous variants in eight cell lines. (XLS)

440 protein kinase genes covered by the Nimblegen 2.1 M capture array. (XLS)
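The heterozygous-detection calculation described under Bioinformatics analysis can be illustrated with a short Python sketch. This is not the authors' code. It assumes that each read samples the A or B allele with equal probability (0.5), and it takes P_1 as the probability of seeing fewer than two B reads, which is the reading consistent with the stated two-read detection requirement. The function names `binom_pmf` and `detection_probability` are hypothetical, chosen only for this illustration.

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n Bernoulli(p) trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def detection_probability(n_reads, min_variant_reads=2, p_variant=0.5):
    """Theoretical P_AB = 1 - P_1 - P_2 at a heterozygous position.

    P_1: fewer than `min_variant_reads` reads carry the B allele (call AA).
    P_2: all reads carry the B allele (call BB).
    Assumption: alleles are sampled independently with probability
    `p_variant` for B; 0.5 models an unbiased heterozygous site.
    """
    if n_reads < min_variant_reads:
        # Too few reads to ever satisfy the B-allele requirement.
        return 0.0
    p1 = sum(binom_pmf(k, n_reads, p_variant) for k in range(min_variant_reads))
    p2 = binom_pmf(n_reads, n_reads, p_variant)
    # Guard against tiny negative values from floating-point rounding.
    return max(0.0, 1.0 - p1 - p2)


if __name__ == "__main__":
    for depth in (2, 5, 10, 20, 30):
        print(depth, round(detection_probability(depth), 6))
```

Under these assumptions the expression reduces to 1 − (N + 2)/2^N for N ≥ 2, so at a read depth of 10 the theoretical detection rate is 1 − 12/1024, or about 0.988.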

27 in total

1. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing.

Authors: Murim Choi; Ute I Scholl; Weizhen Ji; Tiewen Liu; Irina R Tikhonova; Paul Zumbo; Ahmet Nayir; Ayşin Bakkaloğlu; Seza Ozen; Sami Sanjad; Carol Nelson-Williams; Anita Farhi; Shrikant Mane; Richard P Lifton
Journal: Proc Natl Acad Sci U S A Date: 2009-10-27 Impact factor: 11.205

2. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses.

Authors: Siân Jones; Xiaosong Zhang; D Williams Parsons; Jimmy Cheng-Ho Lin; Rebecca J Leary; Philipp Angenendt; Parminder Mankoo; Hannah Carter; Hirohiko Kamiyama; Antonio Jimeno; Seung-Mo Hong; Baojin Fu; Ming-Tseh Lin; Eric S Calhoun; Mihoko Kamiyama; Kimberly Walter; Tatiana Nikolskaya; Yuri Nikolsky; James Hartigan; Douglas R Smith; Manuel Hidalgo; Steven D Leach; Alison P Klein; Elizabeth M Jaffee; Michael Goggins; Anirban Maitra; Christine Iacobuzio-Donahue; James R Eshleman; Scott E Kern; Ralph H Hruban; Rachel Karchin; Nickolas Papadopoulos; Giovanni Parmigiani; Bert Vogelstein; Victor E Velculescu; Kenneth W Kinzler
Journal: Science Date: 2008-09-04 Impact factor: 47.728

3. An integrated genomic analysis of human glioblastoma multiforme.

Authors: D Williams Parsons; Siân Jones; Xiaosong Zhang; Jimmy Cheng-Ho Lin; Rebecca J Leary; Philipp Angenendt; Parminder Mankoo; Hannah Carter; I-Mei Siu; Gary L Gallia; Alessandro Olivi; Roger McLendon; B Ahmed Rasheed; Stephen Keir; Tatiana Nikolskaya; Yuri Nikolsky; Dana A Busam; Hanna Tekleab; Luis A Diaz; James Hartigan; Doug R Smith; Robert L Strausberg; Suely Kazue Nagahashi Marie; Sueli Mieko Oba Shinjo; Hai Yan; Gregory J Riggins; Darell D Bigner; Rachel Karchin; Nick Papadopoulos; Giovanni Parmigiani; Bert Vogelstein; Victor E Velculescu; Kenneth W Kinzler
Journal: Science Date: 2008-09-04 Impact factor: 47.728

4. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs.

Authors: Joshua M Korn; Finny G Kuruvilla; Steven A McCarroll; Alec Wysoker; James Nemesh; Simon Cawley; Earl Hubbell; Jim Veitch; Patrick J Collins; Katayoon Darvishi; Charles Lee; Marcia M Nizzari; Stacey B Gabriel; Shaun Purcell; Mark J Daly; David Altshuler
Journal: Nat Genet Date: 2008-09-07 Impact factor: 38.330

5. The genomic landscapes of human breast and colorectal cancers.

Authors: Laura D Wood; D Williams Parsons; Siân Jones; Jimmy Lin; Tobias Sjöblom; Rebecca J Leary; Dong Shen; Simina M Boca; Thomas Barber; Janine Ptak; Natalie Silliman; Steve Szabo; Zoltan Dezso; Vadim Ustyanksky; Tatiana Nikolskaya; Yuri Nikolsky; Rachel Karchin; Paul A Wilson; Joshua S Kaminker; Zemin Zhang; Randal Croshaw; Joseph Willis; Dawn Dawson; Michail Shipitsin; James K V Willson; Saraswati Sukumar; Kornelia Polyak; Ben Ho Park; Charit L Pethiyagoda; P V Krishna Pant; Dennis G Ballinger; Andrew B Sparks; James Hartigan; Douglas R Smith; Erick Suh; Nickolas Papadopoulos; Phillip Buckhaults; Sanford D Markowitz; Giovanni Parmigiani; Kenneth W Kinzler; Victor E Velculescu; Bert Vogelstein
Journal: Science Date: 2007-10-11 Impact factor: 47.728

6. Patterns of somatic mutation in human cancer genomes.

Authors: Christopher Greenman; Philip Stephens; Raffaella Smith; Gillian L Dalgliesh; Christopher Hunter; Graham Bignell; Helen Davies; Jon Teague; Adam Butler; Claire Stevens; Sarah Edkins; Sarah O'Meara; Imre Vastrik; Esther E Schmidt; Tim Avis; Syd Barthorpe; Gurpreet Bhamra; Gemma Buck; Bhudipa Choudhury; Jody Clements; Jennifer Cole; Ed Dicks; Simon Forbes; Kris Gray; Kelly Halliday; Rachel Harrison; Katy Hills; Jon Hinton; Andy Jenkinson; David Jones; Andy Menzies; Tatiana Mironenko; Janet Perry; Keiran Raine; Dave Richardson; Rebecca Shepherd; Alexandra Small; Calli Tofts; Jennifer Varian; Tony Webb; Sofie West; Sara Widaa; Andy Yates; Daniel P Cahill; David N Louis; Peter Goldstraw; Andrew G Nicholson; Francis Brasseur; Leendert Looijenga; Barbara L Weber; Yoke-Eng Chiew; Anna DeFazio; Mel F Greaves; Anthony R Green; Peter Campbell; Ewan Birney; Douglas F Easton; Georgia Chenevix-Trench; Min-Han Tan; Sok Kean Khoo; Bin Tean Teh; Siu Tsan Yuen; Suet Yi Leung; Richard Wooster; P Andrew Futreal; Michael R Stratton
Journal: Nature Date: 2007-03-08 Impact factor: 49.962

7. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins.

Authors: Kim D Pruitt; Tatiana Tatusova; Donna R Maglott
Journal: Nucleic Acids Res Date: 2006-11-27 Impact factor: 16.971

8. A single-array preprocessing method for estimating full-resolution raw copy numbers from all Affymetrix genotyping arrays including GenomeWideSNP 5 & 6.

Authors: Henrik Bengtsson; Pratyaksha Wirapati; Terence P Speed
Journal: Bioinformatics Date: 2009-06-17 Impact factor: 6.937

9. COSMIC (the Catalogue of Somatic Mutations in Cancer): a resource to investigate acquired mutations in human cancer.

Authors: Simon A Forbes; Gurpreet Tang; Nidhi Bindal; Sally Bamford; Elisabeth Dawson; Charlotte Cole; Chai Yin Kok; Mingming Jia; Rebecca Ewing; Andrew Menzies; Jon W Teague; Michael R Stratton; P Andrew Futreal
Journal: Nucleic Acids Res Date: 2009-11-11 Impact factor: 16.971

10. Exome sequencing of a multigenerational human pedigree.

Authors: Dale J Hedges; Dale Hedges; Dan Burges; Eric Powell; Cherylyn Almonte; Jia Huang; Stuart Young; Benjamin Boese; Mike Schmidt; Margaret A Pericak-Vance; Eden Martin; Xinmin Zhang; Timothy T Harkins; Stephan Züchner
Journal: PLoS One Date: 2009-12-14 Impact factor: 3.240
