Iouri Chepelev, Gang Wei, Qingsong Tang, Keji Zhao.
Abstract
Whole-genome resequencing is still a costly method to detect genetic mutations that lead to altered forms of proteins and may be associated with disease development. Since the majority of disease-related single nucleotide variations (SNVs) are found in protein-coding regions, we propose to identify SNVs in expressed exons of the human genome using the recently developed RNA-Seq technique. We identify 12 176 and 10 621 SNVs, respectively, in Jurkat T cells and CD4+ T cells from a healthy donor. Interestingly, our data show that one copy of the TAL-1 proto-oncogene has a point mutation in the 3' UTR and only the mutant allele is expressed in Jurkat cells. We provide a comprehensive dataset for further understanding the cancer biology of Jurkat cells. Our results indicate that this is a cost-effective and efficient strategy to systematically identify SNVs in the expressed regions of the human genome.
Year: 2009 PMID: 19528076 PMCID: PMC2760790 DOI: 10.1093/nar/gkp507
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Flow chart of single nucleotide variation identification in expressed exons using RNA-Seq.
Figure 2. Redundant Reads Filter and SNV probability calculation examples. (A) There are nine reads that map uniquely to the same genomic location (top box). Nucleotide mismatches with the reference sequence are highlighted in red. Filter 1 retains a single copy of each read. Thus, only five reads remain after Filter 1 is applied (middle box). There are two U1 reads, two U2 reads and one U0 read in the middle box. Filter 2 randomly selects one U1, one U2 and one U0 read. This leaves three reads at the same genomic location (bottom box). (B) Example of SNV probability calculation. Colored in red is a candidate SNV site. Seven short reads map uniquely to that site. The reference nucleotide is T. Five reads have nucleotides that differ from the reference nucleotide and two reads have nucleotide T at the candidate SNV site. Let the error rate estimated from the total number of U0, U1 and U2 nonredundant reads be q = 0.02. The binomial (random chance) probability to observe two matches and five mismatches at the same location is proportional to q⁵(1−q)². The P-value is given by the binomial probability of observing five or more mismatches in a seven-read alignment and it is equal to 6.5 × 10⁻⁸.
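The P-value in panel B can be written out term by term. The following is a worked version of that calculation using only the values stated in the caption (seven reads, five mismatches, q = 0.02); the summation index k is introduced here for illustration.

```latex
\begin{align*}
P &= \sum_{k=5}^{7} \binom{7}{k}\, q^{k} (1-q)^{7-k} \\
  &= 21\,q^{5}(1-q)^{2} + 7\,q^{6}(1-q) + q^{7} \\
  &\approx 6.45\times10^{-8} + 4.4\times10^{-10} + 1.3\times10^{-12} \\
  &\approx 6.5\times10^{-8} \qquad (q = 0.02)
\end{align*}
```

The first term dominates, which is why the caption describes the probability as proportional to q⁵(1−q)².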
Figure 3. Demonstration that the Redundant Reads Filter is necessary. (A) As described in the ‘Materials and methods’ section, application of the Redundant Reads Filter (Filter 1 + Filter 2) to uniquely mapped reads leaves at most three reads at a given genomic location: one U0, one U1 and one U2 read. By restricting the number of reads that can map to the same genomic location, we reduce the false-positive rate of SNV detection. The evidence for the presence of an SNV comes mainly from overlapping but noncoincident reads. There are many overlapping but noncoincident reads that can cover a single SNV. In fact, there can still be as many as 90 reads of length 30 bp that cover a single SNV after the filtering step. Thus, the statistical power to detect the SNV is not reduced by the filtering procedure. (B) The number of detected (P-value = 10⁻⁹) known, i.e. SNPs from the dbSNP database, and unknown (novel) SNVs using reads filtered using four different filters: Filter A is the Redundant Reads Filter; Filter B is Filter 1 followed by randomly selecting two reads each from the U1 and U2 categories; Filter C is Filter 1 followed by randomly selecting three reads each from the U1 and U2 categories; the last filter is an empty filter, i.e. no filtering of unique reads is done. The number of detected known SNVs is not sensitive to the filtering method used, confirming a very low false-positive rate among detected known SNVs. However, the number of detected unknown SNVs is much higher for Filters B, C and No filter than for Filter A, demonstrating the high false-positive rates resulting from the use of these alternative filters. Thus, Filter A is the best of the four filters.
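The two-stage filter described in Figures 2A and 3A can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the read representation as `(location, sequence, n_mismatches)` tuples, the function names, the toy 10-nt sequences and the `("chr1", 1000)` location are all assumptions made for the example.

```python
import random
from collections import defaultdict


def count_mismatches(read, reference):
    """Number of positions where an aligned read differs from the reference."""
    return sum(a != b for a, b in zip(read, reference))


def redundant_reads_filter(reads, seed=0):
    """Apply Filter 1 then Filter 2 to uniquely mapped reads.

    reads: iterable of (location, sequence, n_mismatches) tuples
    (a hypothetical representation chosen for this sketch).
    """
    rng = random.Random(seed)
    # Filter 1: retain a single copy of each distinct read at each location.
    distinct = sorted(set(reads))
    # Group the surviving reads by location and mismatch class (U0, U1, U2).
    by_class = defaultdict(list)
    for location, sequence, n_mismatches in distinct:
        by_class[(location, n_mismatches)].append(sequence)
    # Filter 2: randomly keep one read per mismatch class at each location.
    return [(loc, rng.choice(seqs), mm)
            for (loc, mm), seqs in sorted(by_class.items())]


# Toy version of Figure 2A: nine reads at one location, five of them distinct
# (one U0, two U1 and two U2 sequences).
reference = "ACGTACGTAC"
sequences = (["ACGTACGTAC"] * 3 + ["ACGTTCGTAC"] * 2 + ["ACGAACGTAC"]
             + ["ACGTTCGTAG"] * 2 + ["TCGTACGTAG"])
location = ("chr1", 1000)  # hypothetical location
reads = [(location, s, count_mismatches(s, reference)) for s in sequences]
kept = redundant_reads_filter(reads)
print(len(reads), "->", len(set(reads)), "->", len(kept))  # 9 -> 5 -> 3
```

The figure of 90 reads quoted in the caption is presumably the product of the 30 possible start positions of a 30-bp read covering a given site and the three reads that can survive per start position.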
Miscellaneous information about short sequence reads and single nucleotide variant sites detected in Jurkat and CD4+ T cell samples using RNA-seq data
| | Jurkat | CD4 |
|---|---|---|
| Number of unique genomic reads | 26 275 213 | 27 253 288 |
| Number of unique exon-junction reads | 2 451 140 | 2 267 059 |
| Number of unique and nonredundant genomic reads | 13 166 074 | 13 325 274 |
| Number of unique and nonredundant junction reads | 1 166 914 | 924 220 |
| Number of matched bases | 422 487 847 | 419 253 929 |
| Number of mismatched bases | 7 501 793 | 8 230 891 |
| Number of variant sites | 5 477 131 | 6 030 145 |
| Number of significant variant sites | 12 176 | 10 621 |
| Number of novel significant variant sites | 4703 | 2952 |
| Number of nonsynonymous mutations | 3206 | 1977 |
| Number of novel nonsynonymous mutations | 1770 | 747 |
| Number of nonsense mutations | 47 | 17 |
| Number of novel nonsense mutations | 41 | 9 |
| Number of splice-site mutations | 66 | 31 |
| Number of novel splice-site mutations | 60 | 23 |
The number of unique and nonredundant reads refers to the number of uniquely mapped reads that pass Redundant Reads Filter.
The number of mismatched bases refers to the total number of bases in unique nonredundant reads that mismatch with the reference sequence; it is given by the formula: number of U1 reads + twice the number of U2 reads.
The number of variant sites refers to the total number of genomic locations where mismatches with short sequence reads were observed.
‘Significant’ means significant at P-value < 1.0 × 10⁻⁹.
‘Novel’ means not present in dbSNP build 126 database.
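The base counts in the table above are enough to derive a per-base mismatch rate for each sample. The sketch below assumes that this ratio, mismatched bases over all aligned bases, is what serves as the error rate q; the source only states that the error rate is estimated from the U0, U1 and U2 nonredundant reads. The dictionary keys and the function name are illustrative, and the 30-bp read length is taken from the figure captions.

```python
READ_LENGTH = 30  # bp, as quoted in the figure captions

# Values copied from the table above.
TABLE = {
    "Jurkat": {"genomic_reads": 13_166_074, "junction_reads": 1_166_914,
               "matched": 422_487_847, "mismatched": 7_501_793},
    "CD4": {"genomic_reads": 13_325_274, "junction_reads": 924_220,
            "matched": 419_253_929, "mismatched": 8_230_891},
}


def per_base_mismatch_rate(matched, mismatched):
    """Fraction of aligned bases that disagree with the reference."""
    return mismatched / (matched + mismatched)


rates = {}
for sample, row in TABLE.items():
    total_bases = row["matched"] + row["mismatched"]
    n_reads = row["genomic_reads"] + row["junction_reads"]
    # Consistency check: every nonredundant read contributes READ_LENGTH bases.
    assert total_bases == n_reads * READ_LENGTH
    rates[sample] = per_base_mismatch_rate(row["matched"], row["mismatched"])
    print(sample, f"{rates[sample]:.4f}")
```

Under this assumption the rate comes out at about 0.017 for Jurkat and 0.019 for CD4, close to the q = 0.02 used in the Figure 2 example.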
Figure 4. Read coverage analysis and cost analysis of SNV detection. (A) Percentage of exonic sequences passing coverage threshold. Three curves correspond to different numbers of uniquely mapped nonredundant reads: 13 million (Jurkat), 7 million (random subsample of 50% of the Jurkat reads) and 26 million (Jurkat + CD4). For example, about 30% of exonic regions are covered at least 5-fold by nonredundant uniquely mapped reads in the Jurkat sample. In the combined Jurkat and CD4 sample, about 40% of exonic regions are covered at least 5-fold. (B) Two curves correspond to estimates of sequencing costs for homozygous (red curve) and heterozygous (blue dotted curve) SNV detection in the CD4+ sample. About 80% of all homozygous SNVs in expressed (RPKM ≥ 1) exons can be detected using 67 million 30-bp nonredundant unique reads (∼2000 Mbp). At this sequencing depth, about 55% of all heterozygous SNVs in expressed exons can be detected. (See the ‘Materials and methods’ section for details on the derivation of the cost curves.)
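The quantity plotted in panel A, the share of exonic positions covered at least N-fold, can be computed from a per-position depth vector. The following is a minimal sketch with made-up depths; the function name and the toy data are hypothetical and are not taken from the study. The last lines reproduce the sequencing-volume arithmetic quoted in panel B.

```python
def fraction_covered(depths, threshold):
    """Fraction of positions whose read depth is at least `threshold`."""
    depths = list(depths)
    if not depths:
        raise ValueError("no positions supplied")
    return sum(d >= threshold for d in depths) / len(depths)


# Hypothetical per-position depths over a short exonic stretch.
toy_depths = [0, 2, 5, 7, 1, 12, 5, 0, 3, 9]
covered_5x = fraction_covered(toy_depths, 5)
print(f"{covered_5x:.0%} of positions covered at least 5-fold")

# Sequencing volume quoted in panel B: 67 million reads of 30 bp each.
total_mbp = 67_000_000 * 30 / 1_000_000
print(total_mbp, "Mbp")
```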
Figure 5. Summary of results. (A) Venn diagram of single nucleotide variants (SNVs) detected in Jurkat and CD4 samples. (B) Summary table of SNVs detected in Jurkat and CD4 samples. Shown in the brackets are numbers of SNVs that are novel, i.e. not present in dbSNP Build 126 database.
Confirmation of selected Jurkat single nucleotide variants by Sanger sequencing of genomic DNA
| Gene | Chromosome | Position | Predicted allele | Reference allele | #A | #C | #G | #T | P-value | Known SNP | Amino acid change | Confirmed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LCP1 | chr13 | 45606292 | C | T | 0 | 58 | 0 | 0 | 1.0e-102 | Yes | K → E | Yes |
| LOC554226 | chr2 | 132729041 | C | T | 2 | 53 | 1 | 1 | 1.9e-97 | No | intronic | Yes |
| ECH1 | chr19 | 44013927 | G | T | 0 | 0 | 55 | 1 | 1.1e-95 | Yes | E → A | Yes |
| SEPT9 | chr17 | 73006300 | G | A | 0 | 1 | 50 | 0 | 2.1e-90 | Yes | M → V | Yes |
| POLR3K | chr16 | 43517 | C | A | 0 | 48 | 2 | 0 | 1.2e-88 | Yes | S → A | Yes |
| CYC1 | chr8 | 145222820 | G | A | 0 | 0 | 49 | 0 | 7.0e-87 | Yes | M → V | Yes |
| FLNA | chrX | 153235779 | A | G | 45 | 3 | 2 | 0 | 4.7e-82 | No | R → W | Yes |
| MYO1G | chr7 | 44983146 | T | C | 0 | 0 | 3 | 36 | 2.7e-69 | No | V → M | Yes |
| TAL1 | chr1 | 47456811 | T | C | 0 | 0 | 0 | 39 | 2.7e-69 | No | UTR | Yes |
‘Position’ shows the 1-based chromosomal location of the SNV.
‘Predicted allele’ shows the allele inferred from RNA-seq data using the Point Mutation Analyzer.
‘Reference allele’ shows the allele from the hg18 (NCBI Build 36) human genome sequence; both alleles refer to the forward strand of the genome sequence.
‘#X’ denotes the number of uniquely mapped nonredundant RNA-seq reads that have nucleotide X at the location of the SNV.
‘Known SNP’ status is based on dbSNP build 126 database.
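The per-site P-values in the table above can be recomputed from the read counts. The sketch below rests on two assumptions that the source does not spell out: that every read not carrying the reference allele counts as a mismatch, and that q is the Jurkat per-base mismatch rate derived from the matched and mismatched base totals reported for that sample. The function names and the `ROWS` layout are illustrative.

```python
from math import comb, log10

# Assumed error rate: Jurkat mismatched bases / all aligned bases.
Q_JURKAT = 7_501_793 / (422_487_847 + 7_501_793)


def binomial_tail(n, k, q):
    """P(at least k mismatches among n reads) for per-read error rate q."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k, n + 1))


def site_p_value(counts, reference_allele, q):
    """P-value for a site given per-nucleotide read counts."""
    n = sum(counts.values())
    mismatches = n - counts[reference_allele]
    return binomial_tail(n, mismatches, q)


# gene: (reference allele, read counts, P-value listed in the table)
ROWS = {
    "LCP1": ("T", {"A": 0, "C": 58, "G": 0, "T": 0}, 1.0e-102),
    "LOC554226": ("T", {"A": 2, "C": 53, "G": 1, "T": 1}, 1.9e-97),
    "ECH1": ("T", {"A": 0, "C": 0, "G": 55, "T": 1}, 1.1e-95),
    "SEPT9": ("A", {"A": 0, "C": 1, "G": 50, "T": 0}, 2.1e-90),
    "POLR3K": ("A", {"A": 0, "C": 48, "G": 2, "T": 0}, 1.2e-88),
    "CYC1": ("A", {"A": 0, "C": 0, "G": 49, "T": 0}, 7.0e-87),
    "FLNA": ("G", {"A": 45, "C": 3, "G": 2, "T": 0}, 4.7e-82),
    "MYO1G": ("C", {"A": 0, "C": 0, "G": 3, "T": 36}, 2.7e-69),
    "TAL1": ("C", {"A": 0, "C": 0, "G": 0, "T": 39}, 2.7e-69),
}

recomputed = {gene: site_p_value(counts, ref, Q_JURKAT)
              for gene, (ref, counts, _) in ROWS.items()}
for gene, (_, _, listed) in ROWS.items():
    print(f"{gene:10s} listed={listed:.1e} recomputed={recomputed[gene]:.1e}")
```

With these assumptions the recomputed values agree with the listed ones to within rounding, which suggests, but does not prove, that this is how the column was derived.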
Jurkat-specific nonsense mutations
| Chromosome | Position | Gene | Mutant allele | WT allele | Jurkat reads variant:WT | CD4 reads variant:WT | Confirmed |
|---|---|---|---|---|---|---|---|
| chr22 | 37212044 | DDX17 | A | G | 32:33 | 0:60 | Yes |
| chr16 | 29995227 | PPP4C | T | C | 28:34 | 0:22 | Yes |
| chr14 | 90841906 | CCDC88C | A | G | 18:17 | 1:37 | Yes |
| chr4 | 39921882 | RHOH | T | C | 19:21 | 0:47 | Yes |
| chr17 | 77118842 | C17orf70 | A | G | 17:19 | 0:19 | Yes |
| chr19 | 1426114 | C19orf25 | A | G | 16:11 | 0:13 | Yes |
| chr20 | 61840255 | LIME1 | T | G | 18:25 | 0:65 | Yes |
| chr19 | 12647448 | MORG1 | T | C | 15:12 | 0:11 | Yes |
| chr14 | 77004596 | AHSA1 | A | G | 14:49 | 0:25 | Yes |
| chr1 | 151902223 | ILF2 | A | G | 17:50 | 0:24 | Yes |
| chr8 | 144728307 | NAPRT1 | A | G | 13:14 | 0:11 | Yes |
| chr20 | 62175757 | RGS19 | A | G | 12:15 | 0:46 | Yes |
| chr9 | 130824412 | SH3GLB2 | A | G | 11:23 | 0:18 | Yes |
| chr12 | 119100858 | GCN1L1 | A | G | 10:16 | 0:16 | Yes |
| chr17 | 77642429 | FASN | A | G | 11:40 | 0:13 | — |
| chr12 | 8126211 | NECAP1 | A | C | 8:9 | 0:21 | Yes |
| chr7 | 99543029 | TAF6 | A | G | 9:24 | 0:24 | Yes |
| chr17 | 40583130 | HEXIM1 | T | C | 9:25 | 0:42 | Yes |
| chr5 | 79534492 | SERINC5 | T | G | 8:17 | 0:17 | No |
| chr9 | 33437492 | AQP3 | A | C | 10:26 | 0:65 | — |
‘Confirmed’ means confirmed by Sanger sequencing of cDNA.
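The ‘variant:WT’ columns are easier to compare across samples once converted to variant-allele fractions. The helper below is a small illustrative sketch; the function name is hypothetical and only three rows of the table are copied in.

```python
def variant_fraction(ratio):
    """Parse a 'variant:WT' read-count string into variant / (variant + WT)."""
    variant, wild_type = (int(x) for x in ratio.split(":"))
    total = variant + wild_type
    if total == 0:
        raise ValueError("no reads at this site")
    return variant / total


# gene: (Jurkat reads variant:WT, CD4 reads variant:WT), copied from the table.
rows = {
    "DDX17": ("32:33", "0:60"),
    "CCDC88C": ("18:17", "1:37"),
    "AHSA1": ("14:49", "0:25"),
}
for gene, (jurkat, cd4) in rows.items():
    print(gene, round(variant_fraction(jurkat), 2), round(variant_fraction(cd4), 2))
```

A Jurkat fraction near 0.5 alongside a CD4 fraction near zero is what one would expect for a heterozygous, Jurkat-specific variant, although read ratios from RNA-seq also reflect allele-specific expression.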