Minh Duc Cao, Edward Tasker, Kai Willadsen, Michael Imelfort, Sailaja Vishwanathan, Sridevi Sureshkumar, Sureshkumar Balasubramanian, Mikael Bodén.
Abstract
Advances in high-throughput sequencing offer an unprecedented opportunity to study genetic variation. This is challenged by the difficulty of resolving variant calls in repetitive DNA regions. We present a Bayesian method to estimate repeat-length variation from paired-end sequence read data. The method makes variant calls based on deviations in sequence fragment sizes, allowing the analysis of repeats at lengths of relevance to a range of phenotypes. We demonstrate the method's ability to detect and quantify changes in repeat lengths from short read genomic sequence data across genotypes. We use the method to estimate repeat variation among 12 strains of Arabidopsis thaliana and demonstrate experimentally that our method compares favourably against existing methods. Using this method, we have identified all repeats across the genome that are likely to be polymorphic. In addition, our predicted polymorphic repeats included the only known repeat expansion in A. thaliana, suggesting an ability to discover potentially unstable repeats.
Year: 2013 PMID: 24353318 PMCID: PMC3919575 DOI: 10.1093/nar/gkt1313
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Insertions and deletions cause changes in how reads align to a reference sequence. A fragment with length l is sheared from the donor genome and the two ends are sequenced. The linked sequence reads are then mapped to the reference genome. An insertion (left) or a deletion (right) between the two reads in the donor genome will result in an increase or a decrease (respectively) of the observed fragment size x.
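The idea of reading a length change off the observed fragment sizes can be illustrated with a toy Bayesian estimate. The sketch below is not the model used by the method described here; it assumes, purely for illustration, that observed fragment sizes are normally distributed around the library mean plus an unknown shift, with a zero-mean normal prior on that shift. The function name `posterior_shift` and all numbers are hypothetical.

```python
from statistics import mean

def posterior_shift(observed, lib_mean, lib_sd, prior_sd):
    """Posterior mean and sd of the shift in observed fragment size.

    Assumed toy model (not taken from the paper):
      x_i ~ Normal(lib_mean + shift, lib_sd^2)
      shift ~ Normal(0, prior_sd^2)
    """
    n = len(observed)
    like_prec = n / lib_sd ** 2        # precision contributed by the reads
    prior_prec = 1.0 / prior_sd ** 2   # precision of the zero-mean prior
    post_var = 1.0 / (like_prec + prior_prec)
    sample_shift = mean(observed) - lib_mean
    # The prior mean is zero, so only the data term contributes.
    post_mean = post_var * like_prec * sample_shift
    return post_mean, post_var ** 0.5

# Hypothetical fragments spanning one repeat; library mean 200 nt, sd 20 nt.
sizes = [212, 208, 215, 209, 211, 214, 207, 213]
shift, sd = posterior_shift(sizes, lib_mean=200, lib_sd=20, prior_sd=10)
print(f"estimated shift: {shift:.2f} nt (posterior sd {sd:.2f})")
```

With few reads the prior pulls the estimate towards zero; as coverage grows, the estimate approaches the raw difference between the sample mean and the library mean.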
Summary of variation in the three synthesized genomes
| Genome | SNP rate | Indel rate | Number of varied STRs | Mean STR indel (bp) |
|---|---|---|---|---|
| Sim-1 | 0.03 | 0.005 | 2257 | 3.047 |
| Sim-2 | 0.06 | 0.010 | 3079 | 6.022 |
| Sim-3 | 0.09 | 0.015 | 3199 | 9.356 |
Columns 2 and 3 show the SNP and indel rates, column 4 shows the number of varied STRs (of 3685 in total) and column 5 shows the mean STR indel size in bp.
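As a quick reading aid, the varied-STR counts can be expressed as fractions of the 3685 STRs. The snippet below is only arithmetic on the numbers in the table; the variable names are illustrative.

```python
TOTAL_STRS = 3685  # total number of STRs, from the table note

# Number of varied STRs per synthesized genome, copied from the table.
varied = {"Sim-1": 2257, "Sim-2": 3079, "Sim-3": 3199}

fractions = {name: count / TOTAL_STRS for name, count in varied.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of STRs varied")
```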
Figure 2. Assessment of repeat-length estimation error (left) and variation-call detection (right) of STRViper, Samtools, lobSTR and Dindel. Top shows performance on Sim-1, middle shows Sim-2 and bottom shows Sim-3. Outcomes are reported for different read depths (coverage). STRViper estimates are provided for fragment size standard deviations of 10–25 nt, as labelled.
Figure 3. Assessment of repeat-length estimation error (left) and variation-call detection (right) of STRViper when using Dindel and Samtools predictions as prior, and of Dindel and Samtools themselves. Outcomes are reported for different read depths (coverage), on Sim-2 data with a fragment size standard deviation of 20 nt.
Average running times of STRViper, lobSTR, RepeatSeq, Samtools and Dindel
| Method | 10-fold | 40-fold | 100-fold |
|---|---|---|---|
| STRViper | 3.86 min | 15.50 min | 36.50 min |
| lobSTR | 8.70 min | 34.52 min | 1.55 h |
| RepeatSeq | 0.50 min | 1.02 min | 1.45 min |
| Samtools | 42.00 min | 3.37 h | 14.30 h |
| Dindel | 10.88 days | 46.24 days | 204.57 days |
Detection of experimentally observed STR variation by STRViper and other methods including Gan et al., 2011
| Method | TP | TN | FP | FN | F-score |
|---|---|---|---|---|---|
| STRViper | 142 | 316 | 47 | 43 | 0.742 |
| lobSTR | 24 | 349 | 8 | 167 | 0.212 |
| RepeatSeq | 0 | 354 | 0 | 194 | |
| Samtools | 38 | 344 | 16 | 150 | 0.306 |
| Dindel | 36 | 346 | 27 | 139 | 0.280 |
| Gan | 50 | 346 | 11 | 141 | 0.392 |
TP: number of true positives, TN: number of true negatives, FP: number of false positives and FN: number of false negatives.
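The counts above can be turned into precision and recall directly. The sketch below uses the standard F1 definition (harmonic mean of precision and recall); the source does not state how its F-score column was computed, and the standard F1 comes out close to, but not identical with, the tabulated values (about 0.76 for STRViper against the reported 0.742), so the formula here is an assumption. The helper name `precision_recall_f1` is illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); None where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    if precision is None or recall is None or precision + recall == 0:
        return precision, recall, None
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts copied from the table: method -> (TP, FP, FN)
counts = {
    "STRViper": (142, 47, 43),
    "lobSTR": (24, 8, 167),
    "RepeatSeq": (0, 0, 194),
    "Samtools": (38, 16, 150),
    "Dindel": (36, 27, 139),
    "Gan": (50, 11, 141),
}

scores = {method: precision_recall_f1(*c) for method, c in counts.items()}
for method, (p, r, f) in scores.items():
    print(method, p, r, f)
```

RepeatSeq makes no positive calls, so its precision is undefined (0/0) under this definition, which is consistent with the empty F-score cell in its row.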
The (Pearson) correlation (ρ) between STR variability and repeat purity, length, CG content, CG content in the flanking regions and distance to the nearest origin of replication
| Property | All STRs: ρ | All STRs: P-value | All TNRs: ρ | All TNRs: P-value |
|---|---|---|---|---|
| Purity | 0.435 | 3.2e-170 | 0.322 | 1.8e-26 |
| Length | 0.231 | 7.9e-46 | 0.242 | 1.9e-15 |
| CG-content repeats | −0.146 | 6.2e-19 | −0.197 | 1.4e-10 |
| CG-content flanks | −0.170 | 1.5e-25 | −0.161 | 1.7e-07 |
| Distance to ORC | −0.001 | 0.94 | −0.028 | 0.36 |
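For readers unfamiliar with the statistic, a Pearson correlation between a repeat property and its variability can be computed as follows. This is a minimal sketch on made-up toy values, not the data behind the table; the function name `pearson` and the lists `purity` and `variability` are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy values: repeat purity and a variability measure per STR.
purity = [0.70, 0.80, 0.85, 0.90, 0.95, 1.00]
variability = [0.5, 1.0, 0.8, 1.6, 1.4, 2.1]

rho = pearson(purity, variability)
print(f"rho = {rho:.3f}")
```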
Repeat-length variability associated with different genomic regions
| Genomic region | All STRs: number | All STRs: U-value | All STRs: P-value | All TNRs: number | All TNRs: U-value | All TNRs: P-value |
|---|---|---|---|---|---|---|
| Exon | 660 | −20.40 | 9.1e-93 | 463 | −10.00 | 1.1e-23 |
| Intron | 425 | 3.95 | 7.9e-05 | 72 | 2.71 | 6.6e-03 |
| 5′-UTR | 356 | −6.00 | 2.0e-09 | 152 | 2.36 | 1.8e-02 |
| 3′-UTR | 111 | −3.76 | 1.8e-04 | 45 | 0.03 | 9.8e-01 |
| Upstream | 510 | 2.09 | 3.6e-02 | 114 | 1.43 | 1.5e-01 |
| Downstream | 410 | 1.42 | 1.5e-01 | 112 | 1.93 | 5.4e-02 |
| OtherRNA | 12 | −1.45 | 1.5e-01 | 8 | −1.25 | 2.1e-01 |
| Non-functional | 1525 | 13.50 | 1.1e-41 | 227 | 5.63 | 1.8e-08 |
Absolute counts, Mann–Whitney U- and P-values are provided for each genomic annotation.
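A Mann–Whitney comparison of repeat variability inside and outside a genomic region could look like the sketch below. The signed U-values in the table look like a standardized statistic, but the source does not say so; the normal approximation used here (without tie correction of the variance) and its sign convention are therefore assumptions, and the two samples are hypothetical toy values. The function name `mann_whitney_z` is illustrative.

```python
from math import sqrt

def mann_whitney_z(a, b):
    """Mann-Whitney U of sample a versus b, plus a normal-approximation z.

    Ties receive average ranks; no tie correction is applied to the variance.
    """
    pooled = sorted((v, i) for i, v in enumerate(list(a) + list(b)))
    ranks = [0.0] * len(pooled)
    start = 0
    while start < len(pooled):
        end = start
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1   # 1-based average rank of the tie group
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        start = end + 1
    n1, n2 = len(a), len(b)
    rank_sum_a = sum(ranks[:n1])           # first n1 entries belong to sample a
    u = rank_sum_a - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return u, z

# Hypothetical variability scores for STRs inside and outside one region.
inside = [0.2, 0.1, 0.4, 0.3, 0.2]
outside = [0.9, 1.1, 0.5, 1.4, 0.8, 1.0]

u, z = mann_whitney_z(inside, outside)
print(f"U = {u}, z = {z:.2f}")
```

Under this convention a negative z means the first sample tends to have lower values than the second.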