Fritz Joachim Sedlazeck, Prabhavathi Talloji, Arndt von Haeseler, Andreas Bachmair.
Abstract
Identification of single nucleotide polymorphisms (SNPs) is a key element in sequence-based genetic analysis. Next-generation sequencing offers a cost-effective basis to generate the necessary large sequence data sets, and bioinformatic methods are being developed to process sequencing machine readouts. We were interested in the detection of SNPs in a 350 kb region of an EMS-mutagenized Arabidopsis chromosome 3. The region was selectively analyzed using PCR-generated, overlapping fragments for Solexa sequencing. The ensuing reads provided high coverage and were processed bioinformatically. In order to assess the SNP candidates obtained with a frequently used alignment program and SNP caller, we developed an additional method that allows the identification of high-confidence SNP loci. The method can easily be applied to complete genome sequence data of sufficient coverage.
Year: 2012 PMID: 23246509 PMCID: PMC3580289 DOI: 10.1016/j.ygeno.2012.12.001
Source DB: PubMed Journal: Genomics ISSN: 0888-7543 Impact factor: 5.736
Fig. 1. Fraction of reads aligned as a function of their matching bases. The number of reads aligned to the genomic region of interest is a function of the chosen identity threshold. At a threshold of 90% identity or lower, the number of reads that can be aligned to the reference sequence is almost constant. If a higher threshold of identity is chosen, more and more reads cannot be aligned. The steepest decline occurs between 97% and 100% identity and reflects a difference between one nonmatching position per read and exactly matching reads (length 40 bases). Black line, fraction of aligned reads; red arrow, 90% identity threshold; red dotted line, percentage of mapped reads at the 90% threshold; green arrow, 100% identity threshold; green dotted line, percentage of mapped reads at the 100% threshold.
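The relationship between identity threshold and aligned fraction can be illustrated with a minimal sketch. It assumes each read is summarized only by its number of matching bases. The function names and the toy data are hypothetical and not taken from the paper; only the read length of 40 bases comes from the caption. With that length, a single mismatch corresponds to 39/40 = 97.5% identity, which is why the curve drops most sharply just below 100%.

```python
from typing import Sequence

READ_LENGTH = 40  # read length given in the Fig. 1 caption


def identity(matching_bases: int, read_length: int = READ_LENGTH) -> float:
    """Fraction of read positions that match the reference."""
    return matching_bases / read_length


def fraction_aligned(matches_per_read: Sequence[int], threshold_percent: int,
                     read_length: int = READ_LENGTH) -> float:
    """Fraction of reads whose identity is at least threshold_percent."""
    if not matches_per_read:
        return 0.0
    # Integer comparison avoids floating-point rounding at the threshold.
    kept = sum(1 for m in matches_per_read
               if m * 100 >= threshold_percent * read_length)
    return kept / len(matches_per_read)


# Hypothetical toy data: matching bases for six 40-base reads.
toy_matches = [40, 40, 39, 38, 36, 30]

print(identity(39))                         # one mismatch in a 40-base read
print(fraction_aligned(toy_matches, 90))    # tolerant threshold
print(fraction_aligned(toy_matches, 100))   # exact matches only
```

In this toy set, the 90% threshold keeps every read with at most four mismatches, whereas the 100% threshold discards any read that overlaps a sequence difference.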
Fig. 2. C-90 and C-100 read frequencies at predicted SNP positions, numbered according to occurrence on the chromosome. The number of reads was plotted on the y-axis for the C-90 (red lines) and the C-100 (green lines) set, respectively. The x-axis depicts a window of 150 bases before and after the base of a predicted SNP, which is delineated by a black vertical line. Chromosome number, position on the chromosome and BOD score value are written above each graph.
Fig. 3. Prototypic read frequency diagrams and BOD score formula. Diagrams such as those depicted in Fig. 2 were idealized to pinpoint the parameters measured to obtain the BOD score value of a particular SNP candidate. A, idealized diagram of a high-scoring SNP candidate (BOD score close to 1). B, a one-sided steep drop in C-100 read coverage already reduces the score (BOD score ca. 0.7). C, idealized diagram of a position with either fluctuating coverage, or with a sequence deviation exceeding the mismatch tolerance of the tolerant alignment set C-90 (BOD score ca. 0.6). D, formula. The formula measures the read number of the locally aligned C-100 reads by comparing the SNP candidate position with a position at a 5 base distance in either direction. This slope is put in relation to the maximally possible slope, given by the average number of reads in the vicinity of the SNP candidate. The relative difference in read number (SNP position versus local average) is also determined for the tolerant alignment data set C-90. The final score lies between 0 and 1 and is high if the C-100 read numbers display a not-too-steep V shape at the position of interest, whereas the C-90 read number does not decrease at the same position.
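The formula in panel D is not reproduced in this text, so the following is not the published BOD score. It is a hypothetical toy score that only combines the three ingredients the caption names: the C-100 slope at a 5 base distance on either side, normalization by the local average coverage, and the relative C-90 coverage at the candidate position. The function name, the way the terms are combined, the `window` size, and the synthetic coverage tracks are all assumptions made for illustration.

```python
from statistics import mean
from typing import List, Sequence


def toy_dip_score(c100: Sequence[float], c90: Sequence[float], pos: int,
                  offset: int = 5, window: int = 150) -> float:
    """Illustrative score in [0, 1]; NOT the published BOD formula."""
    if len(c100) != len(c90):
        raise ValueError("coverage tracks must have equal length")
    if pos - offset < 0 or pos + offset >= len(c100):
        raise ValueError("candidate too close to the end of the track")
    lo, hi = max(0, pos - window), min(len(c100), pos + window + 1)
    avg100 = mean(c100[lo:hi])  # local average, used as maximal slope
    avg90 = mean(c90[lo:hi])
    if avg100 == 0 or avg90 == 0:
        return 0.0
    # C-100 slope on each side, relative to the local average (capped at 1).
    left = min(1.0, abs(c100[pos - offset] - c100[pos]) / avg100)
    right = min(1.0, abs(c100[pos + offset] - c100[pos]) / avg100)
    slope_term = 1.0 - (left + right) / 2.0
    # C-90 coverage at the candidate relative to its local average.
    c90_term = min(1.0, c90[pos] / avg90)
    return slope_term * c90_term


def v_track(length: int, pos: int, depth: float, read_len: int) -> List[float]:
    """Synthetic C-100 track: coverage falls linearly to 0 at pos."""
    return [min(depth, depth * abs(i - pos) / read_len) for i in range(length)]


pos = 150
ideal100 = v_track(301, pos, 100.0, 40)             # gradual V shape
flat90 = [100.0] * 301                              # C-90 unaffected
one_sided100 = ideal100[:pos + 1] + [100.0] * 150   # steep drop on one side
dip90 = list(flat90)
dip90[pos] = 50.0                                   # C-90 also decreases

print(toy_dip_score(ideal100, flat90, pos))
print(toy_dip_score(one_sided100, flat90, pos))
print(toy_dip_score(ideal100, dip90, pos))
```

On these synthetic tracks the gradual V shape scores highest, while a one-sided steep drop or a dip in C-90 coverage lowers the score, which mirrors the qualitative ordering of panels A to C. The numeric values are not comparable with the BOD scores quoted in the caption.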
Sequence verification of selected candidate single nucleotide polymorphism regions.
| BOD score | Coordinate on chromosome 3 | Sequence context | Confirmation process result | Annotation |
|---|---|---|---|---|
| 0.917023 | 16 161 395 | TGGAAA | Confirmed | Between annotated ORFs At3g44580 and At3g44590 |
| 0.628107 | 16 374 619 | CCGAAG TGACAAC | Confirmed | Before the stop codon of ORF At3g44850 |
| 0.617491 | 16 362 587 | CTACTA | Confirmed | Within ORF At3g44820 |
| 0.536839 | 16 339 238 | TTCAGA | Confirmed | Within ORF At3g44796 |
| 0.408319 | 16 051 815 | AGCTTC | Not confirmed | Between annotated ORFs At3g44400 and At3g44410 |
| 0.000000 | 16 373 512 | GAGGTC | Not confirmed | Within ORF At3g44840 |
Selected genomic regions were amplified by PCR from the mutated and progenitor lines and subjected to conventional sequencing, with the results as listed.
The reference sequence context is written at the top, the sequence of the EMS-treated plant line as determined by confirmatory (Sanger) sequencing is written below. The nucleotides in question are shown in bold.
The sequence change includes a small insertion/deletion compared to the reference sequence. However, the altered sequence of the mutant line was also present in the non-mutagenized progenitor line, suggesting that the nucleotide change cannot cause phenotypic differences between progenitor and mutant plant lines.
The SNP was confirmed, but the differing nucleotide is present both in the mutagenized line and in its progenitor, suggesting that the nucleotide change cannot cause phenotypic differences between progenitor and mutant plant lines.