Michael C Wendl, Richard K Wilson.
Abstract
BACKGROUND: Structural variations in the form of DNA insertions and deletions are an important aspect of human genetics and especially relevant to medical disorders. Investigations have shown that such events can be detected via tell-tale discrepancies in the aligned lengths of paired-end DNA sequencing reads. Quantitative aspects underlying this method remain poorly understood, despite its importance and conceptual simplicity. We report the statistical theory characterizing the length-discrepancy scheme for Gaussian libraries, including coverage-related effects that preceding models are unable to account for.
Year: 2009 PMID: 19656394 PMCID: PMC2748092 DOI: 10.1186/1471-2164-10-359
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Diagram of an insertion SV (ISV). Sequence inserts derived from the subject genome have a known average length, but appear shorter than average when their end-read pairs are aligned to a reference genome. This implies that a segment of length δ = |x2 - x1| is inserted in the subject genome. Deletion SV (DSV) is the complement of this phenomenon and can be visualized by swapping the "subject" and "reference" labels in the diagram.
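The length-discrepancy idea in Figure 1 can be illustrated with a small simulation. The following is a minimal sketch, not the statistical theory reported in the paper: it assumes a Gaussian library, a handful of inserts that all span the SV site, and an ordinary two-sided z-test on their mean aligned length. The function names and the library values (`mu`, `sigma`, `delta`, `k`) are hypothetical and chosen only for illustration.

```python
import math
import random


def apparent_length(true_length, delta, kind="ISV"):
    """Aligned length on the reference for an insert spanning an SV of length delta.

    An insertion in the subject (ISV) makes the aligned read pair look shorter;
    a deletion in the subject (DSV) makes it look longer.
    """
    return true_length - delta if kind == "ISV" else true_length + delta


def z_score(apparent_lengths, mu, sigma):
    """Standardized deviation of the sample mean from the library mean."""
    k = len(apparent_lengths)
    mean = sum(apparent_lengths) / k
    return (mean - mu) / (sigma / math.sqrt(k))


def normal_cdf(z):
    """Standard normal CDF written with the Gaussian error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))


if __name__ == "__main__":
    rng = random.Random(0)           # fixed seed so the run is repeatable
    mu, sigma = 40000.0, 2800.0      # hypothetical fosmid-like library, in bp
    delta = 5000.0                   # hypothetical insertion length, in bp
    k = 4                            # hypothetical number of spanning inserts

    true_lengths = [rng.gauss(mu, sigma) for _ in range(k)]
    observed = [apparent_length(x, delta, "ISV") for x in true_lengths]

    z = z_score(observed, mu, sigma)
    print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.3g}")
```

Averaging over the spanning inserts shrinks the standard error by a factor of sqrt(k), which is why more inserts over a site make a given discrepancy δ easier to distinguish from ordinary library variation.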
Table 1. Notation for Structural Variation (SV) statistics
| Variable | Meaning |
|---|---|
| Probabilistic Descriptors of SV Detection | |
| α | probability of false-positive errors |
| | probability of false-negative errors |
| Genomic and Project Parameters | |
| | average insert length in a Gaussian library |
| | standard deviation in a Gaussian library |
| δ | length of an instance of structural variation |
| | (constant) sequencing read length |
| | difference threshold specified for power analysis |
| | number of inserts processed |
| | haploid subject genome length |
| | minimum admissible insert size (Eq. 5) |
| | haploid physical coverage (≡ |
| Labels Defining Types of SV | |
| ISV | insertion SV |
| DSV | deletion SV |
| | heterozygous SV |
| | homozygous SV |
| Functions and Random Variables | |
| erf | Gaussian error function (see e.g. Ref. [ |
| exp | exponential function |
| | random number of inserts spanning an SV site |
| | random length of an individual insert |
| | random mean length of inserts spanning SV site |
Table 2. Representative insert types for discovery over the SV spectrum
| insert type | mean insert length (kb) | COV (%) | read length (bp) |
|---|---|---|---|
| Illumina GA short | 0.25 | 12 | 50 |
| Illumina GA intermediate | 3 | 12 | 50 |
| 454 | 3.2 | 25 | 250 |
| Illumina GA long | 10 | 12 | 50 |
| fosmid | 39.9 | 7 | 600 |
| BAC | 136.4 | 21 | 600 |

Illumina GA short: representative of < 1 kb inserts, see e.g. refs. [4,29]
Illumina GA intermediate: insert length representative of extended chemistry protocols for 2–5 kb inserts, see e.g. ref. [33]
454: library parameters reported in ref. [8]
Illumina GA long: experimental, not currently in routine use
fosmid: library parameters reported in ref. [6]
BAC: primary breast tumor library B421, see ref. [11]
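As a rough illustration of how the tabulated library parameters translate into an absolute spread of insert lengths, the sketch below converts each COV entry into a standard deviation. It rests on two assumptions that the extracted table does not state explicitly: that COV denotes the coefficient of variation (standard deviation divided by mean), and that the first numeric column is the mean insert length in kb. The names `LIBRARIES` and `sigma_bp` are hypothetical.

```python
# (insert type, mean insert length in kb, COV in percent), transcribed from the table.
# Assumption: COV is the coefficient of variation, sigma / mean.
LIBRARIES = [
    ("Illumina GA short", 0.25, 12),
    ("Illumina GA intermediate", 3.0, 12),
    ("454", 3.2, 25),
    ("Illumina GA long", 10.0, 12),
    ("fosmid", 39.9, 7),
    ("BAC", 136.4, 21),
]


def sigma_bp(mean_kb, cov_percent):
    """Standard deviation in bp implied by a mean (kb) and a COV (%)."""
    return mean_kb * 1000.0 * cov_percent / 100.0


if __name__ == "__main__":
    for name, mean_kb, cov in LIBRARIES:
        print(f"{name}: mean {mean_kb} kb, sigma ~ {sigma_bp(mean_kb, cov):.0f} bp")
```

Under these assumptions a low COV does not by itself imply a small absolute spread: the fosmid row has the lowest COV in the table but a standard deviation of a few kb, far larger than that of the short Illumina GA inserts.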
Figure 2. Heterozygous ISV and DSV false-positive trends for 250 bp Illumina GA inserts and 40 kb fosmids (Table 2) for selected values of physical coverage.
Figure 3. Heterozygous ISV power for short Illumina GA inserts and fosmids (Table 2) at […].
Figure 4. Curves of […]. A vertical reference line shows the 7% COV threshold, characteristic of the library in ref. [6].
Figure 5. Curves of […]. The 12% COV characteristic of this library is shown by a vertical line. A second line at 7% COV is given as a reference to the fosmid library in ref. [6].
Figure 6. Spectral curves for heterozygous ISV at a threshold of […]. Bold lines represent two feasible designs that leave no spectral gaps at α = 1% using the improved GA libraries.