Ye Du, Hui Jiang, Ying Chen, Cong Li, Meiru Zhao, Jinghua Wu, Yong Qiu, Qibin Li, Xiuqing Zhang.
Abstract
BACKGROUND: The restriction enzyme-based Reduced Representation Library (RRL) method is a relatively feasible and flexible strategy for Single Nucleotide Polymorphism (SNP) identification in different species. Its main advantage is that it reduces the complexity of the genome by orders of magnitude. However, a comprehensive evaluation of the actual efficacy of SNP identification by this method is still unavailable.
Year: 2012 PMID: 22340203 PMCID: PMC3305556 DOI: 10.1186/1471-2164-13-77
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Main workflow of library construction and data analysis. (A) The workflow summarizes the whole process, including enzyme selection, library construction and data analysis. (B) Gel image of the complete digestion of the YH genome by Tsp 45I (lane 1) and gel image after gel extraction (lane 2). Lane M shows a 50 bp molecular ladder with size indicators alongside.
Summary of in silico digestion results
| Restriction enzyme | Total selected fragments (200-700 bp) | Total length of target regions (a) | Coverage (%) | Length of repetitive sequence on target regions | Repetitive content (%) | Distance between two adjacent reads: mean (Mb) | Distance between two adjacent reads: median (Mb) | Distance between two adjacent reads: S.D. (Mb) | Putative SNPs (b) |
|---|---|---|---|---|---|---|---|---|---|
| | 65,734 | 11,832,120 | 0.38% | 4,642,275 | 39.24% | 22,732 | 442 | 131.63 | 48,250 |
| | 69,204 | 12,456,720 | 0.40% | 5,388,814 | 43.26% | 21,582 | 368 | 140.26 | 59,204 |
| | 114,374 | 20,587,320 | 0.67% | 8,376,468 | 40.69% | 13,027 | 425 | 95.84 | 79,308 |
| | 194,918 | 35,085,240 | 1.14% | 12,510,491 | 35.66% | 7,607 | 379 | 73.01 | 137,942 |
| | 442,338 | 79,620,840 | 2.59% | 33,483,887 | 42.05% | 3,303 | 348 | 47.28 | 319,623 |
| | 1,131,481 | 203,666,580 | 6.61% | 74,549,862 | 36.60% | 1,237 | 235 | 29.41 | 774,892 |
| | 1,479,019 | 266,223,420 | 8.64% | 104,133,241 | 39.12% | 926 | 224 | 25.64 | 1,074,049 |
| | 2,419,310 | 435,475,800 | 14.14% | 216,911,945 | 49.81% | 531 | 170 | 19.95 | 1,750,903 |
| | 3,308,660 | 595,558,800 | 19.33% | 251,315,078 | 42.20% | 365 | 148 | 17.04 | 2,298,087 |
The in silico digestion results of nine restriction enzymes, using the hg18 genome as reference, are shown. (a) Regions sequenced in the final corresponding library, calculated assuming paired-end sequencing with an average read length of 90 bp. (b) The number of putative SNPs is calculated based on dbSNP v129 data.
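The relationship between fragment counts and target-region length in the table can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the motif `GT[CG]AC` is an assumed stand-in for the Tsp 45I recognition site, `GENOME_SIZE` is an assumed approximate hg18 length, and all function names are hypothetical. The table values are consistent with each selected fragment contributing two 90 bp reads.

```python
import re

READ_LEN = 90                    # average read length given in the table note
MIN_FRAG, MAX_FRAG = 200, 700    # fragment size window from the table header
GENOME_SIZE = 3.08e9             # assumption: approximate hg18 length in bp


def digest(sequence, site_regex):
    """Cut at the start of every site match and return the fragment lengths."""
    # A lookahead is used so that overlapping sites would also be found.
    cuts = [m.start() for m in re.finditer(f"(?={site_regex})", sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]


def select_fragments(lengths):
    """Keep only fragments inside the 200-700 bp size window."""
    return [n for n in lengths if MIN_FRAG <= n <= MAX_FRAG]


def target_length(n_fragments, read_len=READ_LEN):
    """Each selected fragment is sequenced from both ends (paired-end)."""
    return n_fragments * 2 * read_len


def coverage_percent(target_bp, genome_size=GENOME_SIZE):
    return 100.0 * target_bp / genome_size


# Toy sequence with two assumed recognition sites (not real genomic data).
toy = "A" * 300 + "GTCAC" + "A" * 250 + "GTGAC" + "A" * 100
toy_fragments = digest(toy, "GT[CG]AC")
toy_selected = select_fragments(toy_fragments)

# Check against the first table row: 65,734 selected fragments.
row_target = target_length(65_734)
row_coverage = coverage_percent(row_target)
print(toy_fragments, toy_selected, row_target, round(row_coverage, 2))
```

With these assumptions, the first table row reproduces a target length of 11,832,120 bp; the coverage figure depends on the exact genome length used, so it is only expected to agree approximately.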
Summary of sequencing and alignment results
| Total reads | Total bases (Gb) | PF bases | Mapped bases (Gb) | On target region (Gb) | Target region with depth ≥ 1 (Mb) | Mean depth | Mismatch rate |
|---|---|---|---|---|---|---|---|
| 87,382,662 | 7.864 | 7.848(99.8%) | 6.644(84.7%) | 5.374(80.8%) | 255.34 (95.9%) | 20.40 | 0.33% |
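The percentages in the table can be re-derived from the reported values, as in the sketch below. It assumes a uniform read length of 90 bp (taken from the in silico table note) and assumes that the target region is the 266,223,420 bp row of the in silico table; both are assumptions rather than statements from this table. Because the inputs are already rounded, the results are only expected to agree with the table to within about 0.1 percentage points.

```python
def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


total_reads = 87_382_662
read_len = 90                          # assumption: uniform 90 bp reads
total_bases = total_reads * read_len   # compare with 7.864 Gb in the table

pf_bases = 7.848e9          # bases passing filter
mapped_bases = 6.644e9      # bases mapped to the reference
on_target_bases = 5.374e9   # mapped bases that fall in the target region
covered_bp = 255.34e6       # target region with depth >= 1
target_bp = 266_223_420     # assumption: matching row of the in silico table

pf_pct = pct(pf_bases, total_bases)
mapped_pct = pct(mapped_bases, pf_bases)
on_target_pct = pct(on_target_bases, mapped_bases)
covered_pct = pct(covered_bp, target_bp)

print(total_bases, pf_pct, mapped_pct, on_target_pct, covered_pct)
```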
Figure 2. Insert size and depth distribution of YH. (A) Depth distribution of the target region in the Tsp 45I RRL. The red dashed line shows the standard Poisson distribution. (B) Insert size distribution in the Tsp 45I RRL. Insert size was calculated from aligned paired-end reads in the Tsp 45I RRL sequencing data. Compared to the simulation results, fragments shorter than 400 bp were over-represented and longer fragments were under-represented. The peaks along the distribution indicate the accumulation of repeat sequences.
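A Poisson reference curve of the kind described for panel (A) can be sketched as follows. The sketch assumes that the Poisson mean equals the reported mean depth of 20.40, which the caption does not state explicitly; the function names are hypothetical. A dispersion index (variance divided by mean) near 1 is what a Poisson-distributed depth would give, and larger values indicate over-dispersion.

```python
import math
import statistics

MEAN_DEPTH = 20.40  # assumption: Poisson mean taken from the reported mean depth


def poisson_pmf(k, lam):
    """Poisson probability of observing depth k, computed in log space."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def dispersion_index(depths):
    """Variance-to-mean ratio of observed depths (1.0 for an ideal Poisson)."""
    return statistics.pvariance(depths) / statistics.mean(depths)


# Expected fraction of target bases at each depth from 0 to 100.
expected = {d: poisson_pmf(d, MEAN_DEPTH) for d in range(101)}
most_likely_depth = max(expected, key=expected.get)
print(most_likely_depth, round(sum(expected.values()), 6))
```

Observed per-base depths (for example, parsed from a depth file) could be passed to `dispersion_index` to quantify how far the real distribution departs from the dashed reference curve.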
Validation of SNP calling
| SNP dataset generated by WGS | On target regions of RRL: SNP | On target regions of RRL: Not SNP |
|---|---|---|
| SNP | 222,028 | 77,136 |
| Not SNP | 35,603 | --- |
SNP calling results generated by RRL sequencing and WGS were compared by counting the loci identified as SNPs by both methods or by only one of them, regardless of genotype concordance. Taking the WGS call set as the reference, the false discovery rate of the RRL method was about 13.82% (35,603/(222,028+35,603)) and the false negative rate was 25.78% (77,136/(77,136+222,028)).
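The two rates quoted above follow directly from the three counts in the table. The short sketch below reproduces them; the variable names are illustrative only, and the WGS call set is treated as the reference.

```python
# Counts from the validation table (loci on RRL target regions).
called_by_both = 222_028   # SNP in both RRL and WGS
wgs_only = 77_136          # SNP in WGS, not called by RRL (false negatives)
rrl_only = 35_603          # SNP in RRL, not called by WGS (false discoveries)

# False discovery rate: share of RRL calls not supported by WGS.
fdr = rrl_only / (called_by_both + rrl_only)

# False negative rate: share of WGS calls missed by RRL.
fnr = wgs_only / (called_by_both + wgs_only)

fdr_pct = round(100 * fdr, 2)
fnr_pct = round(100 * fnr, 2)
print(fdr_pct, fnr_pct)
```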
Detailed interpretations for high False Discovery Rate and False Negative Rate
| Intersection of the reasons | Number of loci (percentage) | In dbSNP v129 |
|---|---|---|
| False Discovery Rate (FDR) class | | |
| Reasonable interpretations for SNPs filtered out in YH WGS results | 29,687 (83.38%) | 23,348 (78.65%) |
| 1. Low depth (< 2) | 394 (1.11%) | --- |
| 2. Low quality (< 20) | 10,920 (30.67%) | --- |
| 3. High copy number (> 2) | 25,036 (70.32%) | --- |
| 4. High depth (> 200) | 538 (1.51%) | --- |
| Overcalled for unknown reasons in RRL sequencing | 5,916 (16.61%) | 615 (10.40%) |
| False Negative Rate (FNR) class | | |
| Reasonable interpretations for SNPs filtered out in RRL sequencing results | 57,169 (74.11%) | 45,216 (79.09%) |
| 1. Low depth (< 2) | 29,478 (38.22%) | --- |
| 2. Low quality (< 20) | 53,830 (69.79%) | --- |
| 3. High copy number (> 2) | 25,587 (33.17%) | --- |
| 4. High depth (> 200) | 43 (0.06%) | --- |
| Allele dropout in RRL sequencing | 19,967 (25.89%) | 10,724 (53.71%) |
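The per-reason percentages add up to more than the "reasonable interpretations" total because a single locus can fail several filters at once. The sketch below, using only counts from the table and illustrative names, makes that overlap explicit; it does not attempt to reconstruct which loci overlap.

```python
def pct(part, whole):
    """Percentage rounded to two decimals, as reported in the table."""
    return round(100.0 * part / whole, 2)


# False discovery class: loci called by RRL but not by WGS.
fdr_total = 35_603
fdr_explained = 29_687
fdr_reasons = {"low depth": 394, "low quality": 10_920,
               "high copy number": 25_036, "high depth": 538}

# False negative class: loci called by WGS but not by RRL.
fnr_total = 77_136
fnr_explained = 57_169
fnr_reasons = {"low depth": 29_478, "low quality": 53_830,
               "high copy number": 25_587, "high depth": 43}

fdr_explained_pct = pct(fdr_explained, fdr_total)
fnr_explained_pct = pct(fnr_explained, fnr_total)

# Reason memberships beyond one per explained locus: loci failing several filters.
fdr_extra_memberships = sum(fdr_reasons.values()) - fdr_explained
fnr_extra_memberships = sum(fnr_reasons.values()) - fnr_explained

print(fdr_explained_pct, fnr_explained_pct,
      fdr_extra_memberships, fnr_extra_memberships)
```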
Comparison of RRL sequencing and Illumina Beadchip genotyping results
| Illumina genotyping | Concordance | Discordance | | | | Total |
|---|---|---|---|---|---|---|
| HOM ref. | 55,435 (99.95%) | --- | 5 | 22 | 0 | 27 |
| HOM mut. | 19,847 (99.80%) | 21 | 1 | 18 | 0 | 40 |
| HET ref. | 21,244 (92.56%) | 1458 | 245 | 3 | 2 | 1708 |
| HET mut. | 0 (0.00%) | 4 | 0 | 0 | 0 | 4 |
| Total | 96,445 (98.19%) | 1483 | 251 | 43 | 2 | 1779 |
The alleles genotyped by Illumina platform and RRL sequencing were classified into four categories: HOM ref. (homozygotes where both alleles are identical to the reference), HOM mut. (homozygotes where both alleles differ from the reference), HET ref. (heterozygotes where only one allele is identical to the reference), and HET mut. (heterozygotes where both alleles differ from the reference and also differ from one another).
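The four genotype categories defined above, and the per-category concordance rates in the table, can be expressed compactly. This is a sketch with hypothetical function names; it assumes diploid, biallelic-or-triallelic calls given as a pair of alleles plus the reference allele.

```python
def classify(ref, genotype):
    """Assign a diploid genotype to one of the four categories in the legend."""
    a, b = genotype
    if a == b:
        return "HOM ref." if a == ref else "HOM mut."
    return "HET ref." if ref in (a, b) else "HET mut."


def concordance_pct(concordant, discordant):
    """Share of loci with identical calls on both platforms, in percent."""
    return round(100.0 * concordant / (concordant + discordant), 2)


# Illustrative genotypes at a locus whose reference allele is A.
examples = [classify("A", g) for g in (("A", "A"), ("G", "G"), ("A", "G"), ("C", "G"))]

# Per-category concordance recomputed from the table counts.
hom_ref_pct = concordance_pct(55_435, 27)
hom_mut_pct = concordance_pct(19_847, 40)
het_ref_pct = concordance_pct(21_244, 1_708)
print(examples, hom_ref_pct, hom_mut_pct, het_ref_pct)
```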
Figure 3. Density distribution of FNR loci along the reads. The density distribution of false negative SNPs was calculated and plotted. A large proportion of false negative SNPs were located in the first five bases of each read, indicating a strong effect of disruption of the recognition site. The inset shows the magnified distribution from position 6 to 90 along the read, and the dashed vertical line marks position 55, after which the number of false negative loci increased sharply.
Figure 4. Distribution of putative SNPs along chromosomes of the reference genome. The x-axis represents the relative position across each chromosome, and the y-axis represents chromosome coordinates of the reference genome. The colour scale from red to blue indicates increasing density of putative SNPs in each selected window across the chromosome.