Yan Guo, Jiang Li, Chung-I Li, Jirong Long, David C Samuels, Yu Shyr.
Abstract
BACKGROUND: In Illumina high-throughput short-read data, the genotypes inferred from the positive strand and the negative strand are sometimes significantly different, with one homozygous and the other heterozygous. This phenomenon is known as strand bias. In this study, we used Illumina short-read sequencing data to evaluate the effect of strand bias on genotyping quality and to explore the possible causes of strand bias.
RESULTS: We collected 22 breast cancer samples from 22 patients and sequenced their exomes using the Illumina GAIIx machine. By comparing the genotypes inferred from these sequencing data with the genotypes inferred from SNP chip data, we found that SNPs with extreme strand bias did not have significantly lower consistency rates than SNPs with low or no strand bias. However, this result may be limited by the small subset of SNPs present in both the exome sequencing and the SNP chip data. We further compared the transition/transversion (Ti/Tv) ratio and the number of novel non-synonymous SNPs between the SNPs with low or no strand bias and those with extreme strand bias, and found that SNPs with low or no strand bias have better overall quality. We also found that strand bias occurs at random genomic positions, with no consistent pattern of strand bias location across samples. Results from two different aligners, BWA and Bowtie, showed very consistent strand bias patterns, so strand bias is unlikely to be caused by alignment artifacts. We successfully replicated our results using two additional independent datasets with different capture methods and Illumina sequencers.
Year: 2012 PMID: 23176052 PMCID: PMC3532123 DOI: 10.1186/1471-2164-13-666
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Strand bias examples from real data
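| Chr | Position | Depth | Forward ref (1) | Forward non-ref (2) | Reverse ref (3) | Reverse non-ref (4) | Forward strand genotype | Reverse strand genotype |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6 | 32975014 | 21 | 5 | 5 | 10 | 1 | Heterozygous | Homozygous |
| 1 | 81967962 | 38 | 20 | 11 | 7 | 0 | Heterozygous | Homozygous |
| 12 | 10215654 | 31 | 15 | 9 | 7 | 0 | Heterozygous | Homozygous |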
1. Forward strand reference allele.
2. Forward strand non reference allele.
3. Reverse strand reference allele.
4. Reverse strand non reference allele.
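To make the rows above concrete, the following is a minimal sketch of how per-strand allele counts can lead to conflicting genotype calls. The function `call_genotype` and its cut-offs `het_low` and `het_high` are hypothetical and chosen only for illustration; they are not the genotype caller or thresholds used in the study.

```python
def call_genotype(ref_count, alt_count, het_low=0.2, het_high=0.8):
    """Naive genotype call from the allele counts seen on one strand.

    het_low and het_high are hypothetical cut-offs on the non-reference
    allele fraction; they are illustrative assumptions, not values from
    the paper.
    """
    total = ref_count + alt_count
    if total == 0:
        return "No call"
    alt_fraction = alt_count / total
    if het_low <= alt_fraction <= het_high:
        return "Heterozygous"
    return "Homozygous"


# Counts from the table above, in footnote order:
# (forward ref, forward non-ref, reverse ref, reverse non-ref)
rows = [(5, 5, 10, 1), (20, 11, 7, 0), (15, 9, 7, 0)]

# Call each strand separately and compare the two calls.
calls = [(call_genotype(a, b), call_genotype(c, d)) for a, b, c, d in rows]
for forward_call, reverse_call in calls:
    print(forward_call, reverse_call)
```

Under these assumed cut-offs, each example site is called heterozygous on the forward strand and homozygous on the reverse strand, which is the disagreement the table illustrates.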
Figure 1. Consistency rates with SNP chip data using 3 strand bias scores and 4 processing pipelines.
SNP quality by MAF and subset (1)
| All Seq SNPs | 70825/2.35 | 28454/1.93 | 26074/1.94 | 19092/1.96 | 30532/1.85 | 174977/2.08 |
| Overlapped SNPs | 4123/2.99 | 2781/2.8 | 2799/2.73 | 2160/2.53 | 2504/2.69 | 14367/2.78 |
| Seq SNPs - Chip SNPs | 66702/2.31 | 25673/1.86 | 23275/1.87 | 16932/1.9 | 28028/1.79 | 160610/2.03 |
| dbSNP SNPs in Seq | 53520/2.43 | 23458/2.21 | 21798/2.23 | 16359/2.2 | 26152/2.05 | 141287/2.26 |
| dbSNP SNPs not on Chip | 49397/2.39 | 20677/2.15 | 18999/2.17 | 14199/2.16 | 23648/2.00 | 126920/2.21 |
1. Each cell gives the number of SNPs followed by the Ti/Tv ratio, separated by a slash.
Figure 2. Genotyping quality control through Ti/Tv ratio.
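The Ti/Tv ratio used for quality control is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other substitutions). The sketch below shows the calculation on a small made-up list of SNPs; the function name `titv_ratio` and the example data are illustrative assumptions, not taken from the study.

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snps):
    """Return the Ti/Tv ratio for an iterable of (ref, alt) base pairs."""
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        pair = (ref.upper(), alt.upper())
        if pair[0] == pair[1]:
            continue  # not a substitution, skip
        if pair in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("Ti/Tv is undefined without transversions")
    return transitions / transversions


# Made-up example: 4 transitions and 2 transversions.
example_snps = [("A", "G"), ("C", "T"), ("G", "A"),
                ("A", "C"), ("G", "T"), ("T", "C")]
ratio = titv_ratio(example_snps)
print(ratio)
```

A set of SNP calls contaminated with random errors tends to have a lower Ti/Tv ratio than a set of true variants, which is why the ratio can serve as an overall quality indicator for a SNP subset.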
SB and GATK-SB difference
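| Chr | Position | Forward ref (1) | Forward non-ref (2) | Reverse ref (3) | Reverse non-ref (4) | SB | GATK-SB | Fisher score | Genotype 1 | Genotype 2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | 43917013 | 11 | 2 | 20 | 0 | 2.54 | 0.16 | 0.85 | AG | AA |
| 14 | 95923670 | 16 | 2 | 10 | 0 | 1.56 | 0.12 | 0.48 | CT | TT |
| 19 | 57088850 | 8 | 2 | 16 | 0 | 2.6 | 0.21 | 0.86 | AC | AA |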
1. Forward strand reference allele.
2. Forward strand non reference allele.
3. Reverse strand reference allele.
4. Reverse strand non reference allele.
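The three score columns in the table above can be recomputed from the four strand counts. This excerpt does not state the formulas, so the definitions below are assumptions: with a, b, c, d denoting the counts in footnotes 1 to 4, SB is taken as |b/(a+b) - d/(c+d)| / ((b+d)/(a+b+c+d)), GATK-SB as the larger of (b/(a+b) · c/(c+d)) and (d/(c+d) · a/(a+b)), each divided by (a+c)/(a+b+c+d), and the third score as 1 minus the two-sided Fisher exact test p-value. The function names are hypothetical.

```python
from math import comb


def sb_score(a, b, c, d):
    """SB score (assumed definition). Requires reads on both strands
    and at least one non-reference read."""
    n = a + b + c + d
    return abs(b / (a + b) - d / (c + d)) / ((b + d) / n)


def gatk_sb_score(a, b, c, d):
    """GATK-SB score (assumed definition). Requires reads on both
    strands and at least one reference read."""
    n = a + b + c + d
    ref_fraction = (a + c) / n
    forward_term = (b / (a + b)) * (c / (c + d)) / ref_fraction
    reverse_term = (d / (c + d)) * (a / (a + b)) / ref_fraction
    return max(forward_term, reverse_term)


def fisher_score(a, b, c, d):
    """1 minus the two-sided Fisher exact p-value of the 2x2 count table."""
    forward, reverse, alt = a + b, c + d, b + d
    n_tables = comb(forward + reverse, alt)

    def prob(x):
        # Hypergeometric probability of x non-reference reads on the
        # forward strand, with row and column totals held fixed.
        return comb(forward, x) * comb(reverse, alt - x) / n_tables

    observed = prob(b)
    lowest, highest = max(0, alt - reverse), min(alt, forward)
    # Sum every table that is no more likely than the observed one.
    p_value = sum(prob(x) for x in range(lowest, highest + 1)
                  if prob(x) <= observed * (1 + 1e-9))
    return 1 - p_value


# Counts (a, b, c, d) from the three rows of the table above.
rows = [(11, 2, 20, 0), (16, 2, 10, 0), (8, 2, 16, 0)]
scores = [(sb_score(*r), gatk_sb_score(*r), fisher_score(*r)) for r in rows]
for sb, gatk_sb, fisher in scores:
    print(f"{sb:.3f} {gatk_sb:.3f} {fisher:.3f}")
```

Under these assumed definitions, the computed values agree with the tabulated SB, GATK-SB and Fisher columns to within 0.01. The rows also show how the scores can disagree: all non-reference reads fall on one strand, which gives a high SB score while GATK-SB stays low.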
Figure 3. Strand bias consistency across subjects.
Figure 4. Strand bias correlations between Bowtie and BWA.
Figure 5. Correlation of strand bias scores between different processing pipelines.
Figure 6. Scatter plot of strand bias scores between different processing pipelines.