Robert Kofler, Daniel Gómez-Sánchez, Christian Schlötterer.
Abstract
The evolutionary dynamics of transposable elements (TEs) are still poorly understood. One reason is that TE abundance needs to be studied at the population level, but sequencing individuals on a population scale is still too expensive to characterize TE abundance in multiple populations. Although sequencing pools of individuals dramatically reduces sequencing costs, a comparison of TE abundance between pooled samples has been difficult, if not impossible, due to various biases. Here, we introduce a novel bioinformatic tool, PoPoolationTE2, which is specifically tailored for the comparison of TE abundance among pooled population samples or different tissues. Using computer simulations, we demonstrate that PoPoolationTE2 not only faithfully recovers TE insertion frequencies and positions but, by homogenizing the power to identify TEs across samples, it provides an unbiased comparison of TE abundance between pooled population samples. We anticipate that PoPoolationTE2 will greatly facilitate the analysis of TE insertion patterns in a broad range of applications.
Keywords: Pool-Seq; bioinformatics; comparative genomics; comparative population genomics; next generation sequencing; transposable elements
Year: 2016 PMID: 27486221 PMCID: PMC5026257 DOI: 10.1093/molbev/msw137
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Fig. 1. Overview of PoPoolationTE2. (A) TE insertions (black arrow) result in paired ends (yellow), with one read mapping to a reference chromosome (X) and the other one to a TE (copia). One group of such discordantly mapped reads is located to the left of the insertion (forward signature) and one to the right (reverse signature). (B) The absence of TE insertions results in proper pairs spanning a putative insertion site (green). (C) Mapped paired end reads may be used to generate a base coverage track (gray) and a physical coverage track (green). The base coverage counts the positions covered by the reads themselves, whereas the physical coverage counts the region between the two reads of a pair. (D) TE insertions result in paired ends that support a TE insertion (yellow). This can be translated into an additional type of physical coverage (yellow track). The median distance of proper pairs is used to estimate the distance between such discordant pairs. (E) Increasing the inner distance between paired ends compared with panel D results in more reads supporting a TE insertion (copia) and a higher physical coverage. If paired ends overlap, the physical coverage of the individual paired ends is summed, contributing to the total height of the physical coverage track. Physical coverage supporting the presence (yellow) and absence (green) of a TE may overlap (central region). (F) Combining the information of all paired ends for each genomic position results in a physical coverage track. (G) To homogenize the power to identify TEs, the physical coverage is randomly sampled to equal levels for each genomic position. (H) The position of signatures of TE insertions is determined using a sliding window approach (black lines on top), and the window with the maximal physical coverage supporting a TE (the red line indicates the window with the highest copia coverage) is used for further analysis. (I) The population frequency of TE signatures is estimated from the ratio of average physical coverage supporting a TE (copia) to the total physical coverage in a window. (J) Matching pairs of TE signatures (forward and reverse) of the same TE family within a given distance are joined, yielding a final set of TE insertions. Final population frequency and position estimates are obtained by averaging the estimates for the forward and reverse signatures. (K) Accuracy of the population frequency estimates for 1,000 TEs in a simulated pooled population. PoPoolationTE2 has a slight upward bias for intermediate frequency TEs and a slight downward bias for high frequency TEs. (L) Accuracy of insertion position estimates for 1,000 TEs in a simulated pooled population.
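
The steps in panels C to I can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not PoPoolationTE2's actual implementation. All function names are hypothetical, the intervals are assumed to be precomputed half-open `(start, end)` regions between the two reads of a pair, and positions whose total coverage is below the sampling target are simply set to zero here, which is an assumption of this sketch.

```python
import random


def physical_coverage(intervals, length):
    """Sum half-open (start, end) intervals into a per-position track."""
    track = [0] * length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, length)):
            track[pos] += 1
    return track


def subsample(te_track, absence_track, target, rng):
    """Randomly sample each position down to `target` total coverage.

    Assumption of this sketch: positions with less than `target`
    coverage are zeroed rather than kept.
    """
    te_out, absence_out = [], []
    for te, absence in zip(te_track, absence_track):
        if te + absence < target:
            te_out.append(0)
            absence_out.append(0)
            continue
        # 1 marks a pair supporting the TE, 0 a pair supporting its absence.
        draws = rng.sample([1] * te + [0] * absence, target)
        kept_te = sum(draws)
        te_out.append(kept_te)
        absence_out.append(target - kept_te)
    return te_out, absence_out


def best_window(te_track, window):
    """Start of the first window with maximal TE-supporting coverage."""
    best_start, best_sum = 0, -1
    for start in range(len(te_track) - window + 1):
        current = sum(te_track[start:start + window])
        if current > best_sum:
            best_start, best_sum = start, current
    return best_start


def frequency(te_track, absence_track, start, window):
    """TE-supporting coverage divided by total coverage in the window."""
    te = sum(te_track[start:start + window])
    total = te + sum(absence_track[start:start + window])
    return te / total if total else 0.0


# Toy data: 3 pairs support the insertion, 7 pairs support its absence.
te = physical_coverage([(5, 15)] * 3, 20)
absence = physical_coverage([(5, 15)] * 7, 20)
te_s, absence_s = subsample(te, absence, target=10, rng=random.Random(1))
start = best_window(te_s, window=5)
print(start, frequency(te_s, absence_s, start, 5))
```

In this toy case the sampling target equals the available coverage, so nothing is discarded: the best window starts at position 5 and the estimated frequency is 0.3. Because the averages for the TE-supporting and total coverage are taken over the same window, the ratio of averages equals the ratio of sums used in `frequency`.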
Table 1. Performance of PoPoolationTE2 under optimal conditions such that, in principle, all TEs could be identified. We evaluated the influence of the sequencing error rate, the inner distance between paired ends (ID), the standard deviation of the inner distance (σ), the read length, and the product of read number and inner distance (keeping the physical coverage constant). Performance was assessed by the number of identified TEs (found), missed TEs, false positive TEs, TEs with the correct strand (strand), TEs with both signatures identified (both sign.), and TEs with a single signature identified (single sign.). Furthermore, we assessed the accuracy of the estimated insertion positions and of the estimated population frequencies, each reported as a mean and a standard deviation. The resulting average coverage and average physical coverage in the pool were estimated from the data.
| | Error rate | Error rate | σ | σ | Read length | Read length | Reads × ID | Reads × ID |
|---|---|---|---|---|---|---|---|---|
| Error rate | 0% | 10% | 0% | 0% | 0% | 0% | 0% | 0% |
| Reads [million] | 6.58 | 6.58 | 6.58 | 6.58 | 6.58 | 6.58 | 13.16 | 3.29 |
| ID | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 200 |
| σ | 20 | 20 | 0 | 75 | 20 | 20 | 20 | 20 |
| Read length | 100 | 100 | 100 | 100 | 50 | 200 | 100 | 100 |
| Average coverage | 394.8 | 317.1 | 395.1 | 395.1 | 198.0 | 780.5 | 790.3 | 197.6 |
| Average physical coverage | 193.0 | 109.2 | 199.9 | 187.8 | 188.0 | 191.8 | 191.1 | 196.1 |
| Found | 999 | 994 | 1,000 | 998 | 991 | 1,000 | 1,000 | 996 |
| Missed | 1 | 6 | 0 | 2 | 9 | 0 | 0 | 4 |
| False positive | 4 | 10 | 5 | 8 | 20 | 2 | 10 | 6 |
| Strand | 999 | 994 | 1,000 | 996 | 988 | 998 | 995 | 993 |
| Both sign. | 996 | 982 | 1,000 | 990 | 986 | 998 | 996 | 986 |
| Single sign. | 3 | 12 | 0 | 8 | 5 | 2 | 4 | 10 |
| Position accuracy, mean | 4.0 | 5.5 | 2.0 | 5.2 | 3.0 | 2.3 | 1.8 | 4.8 |
| Position accuracy, SD | 4.0 | 5.1 | 4.6 | 6.4 | 3.2 | 3.9 | 2.7 | 5.9 |
| Frequency accuracy, mean | 0.030 | 0.029 | 0.019 | 0.043 | 0.021 | 0.079 | 0.092 | 0.020 |
| Frequency accuracy, SD | 0.016 | 0.022 | 0.009 | 0.023 | 0.010 | 0.036 | 0.042 | 0.017 |
Table 2. Performance of different tools for identifying TEs with simulated Pool-Seq data. Randomly distributed paired end reads were simulated (2 × 100 bp; the inner distance was drawn from a normal distribution with a mean of 100 and a standard deviation of 20) with an error rate of 1% and 2% chimeric reads. We evaluated the performance of PoPoolationTE2 (Po.TE2), PoPoolationTE (Po.TE) (Kofler et al. 2012), and TEMP (Zhuang et al. 2014). For each tool, we used several minimum thresholds (either minimum count [mc] or minimum support [ms]). For an explanation of the evaluated parameters see table 1.
| | Po.TE2 | Po.TE2 | Po.TE | Po.TE | TEMP | TEMP | TEMP |
|---|---|---|---|---|---|---|---|
| Threshold | mc2 | mc3 | mc3 | mc4 | ms4 | ms7 | ms10 |
| Found | 999 | 994 | 999 | 995 | 994 | 992 | 983 |
| Missed | 1 | 6 | 1 | 5 | 6 | 8 | 17 |
| False positive | 49 | 5 | 41 | 4 | 407 | 193 | 148 |
| Strand | 998 | 993 | 0 | 0 | 980 | 978 | 969 |
| Both sign. | 993 | 985 | 993 | 986 | 991 | 990 | 981 |
| Single sign. | 6 | 9 | 6 | 9 | 3 | 2 | 2 |
| Position accuracy, mean | 7.2 | 7.2 | 17.8 | 17.8 | 4.3 | 4.1 | 4.0 |
| Position accuracy, SD | 7.6 | 7.6 | 13.1 | 13.0 | 14.0 | 13.0 | 13.0 |
| Frequency accuracy, mean | 0.025 | 0.025 | 0.021 | 0.021 | 0.018 | 0.019 | 0.019 |
| Frequency accuracy, SD | 0.019 | 0.019 | 0.016 | 0.016 | 0.032 | 0.032 | 0.033 |
| Time (min) | 4.0 | 3.9 | 15.5 | 15.6 | 228.4 | 228.4 | 228.4 |
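
The counts in the tool comparison above can be condensed into two summary ratios. The sketch below is an illustration added here, not part of the original analysis. It assumes that "Found" counts true positives, which is consistent with Found and Missed summing to 1,000 in every column, and the names `counts`, `recall`, and `precision` are hypothetical.

```python
# (found, missed, false positive) per tool and threshold, copied from the table.
counts = {
    "Po.TE2 mc2": (999, 1, 49),
    "Po.TE2 mc3": (994, 6, 5),
    "Po.TE mc3": (999, 1, 41),
    "Po.TE mc4": (995, 5, 4),
    "TEMP ms4": (994, 6, 407),
    "TEMP ms7": (992, 8, 193),
    "TEMP ms10": (983, 17, 148),
}


def recall(found, missed):
    """Fraction of simulated TEs that were identified."""
    return found / (found + missed)


def precision(found, false_positive):
    """Fraction of reported TEs that correspond to simulated TEs."""
    return found / (found + false_positive)


for name, (found, missed, false_positive) in counts.items():
    print(f"{name:11s} recall={recall(found, missed):.3f} "
          f"precision={precision(found, false_positive):.3f}")
```

Under these assumptions, raising the minimum threshold lowers recall slightly while raising precision for all three tools.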
Table 3. Evaluating different strategies to compare TE abundance in Pool-Seq samples. We simulated three populations with different numbers of low-frequency insertions (f = 0.01) and paired ends with varying inner distances (ID). An unbiased comparison should result in a stable ratio between observed and simulated TEs in the three populations (i.e., a low standard deviation of this ratio across populations). The best results were obtained when the physical coverage (p.c.) was sampled to equal levels in all three populations. Results are shown for two different minimum count thresholds (mc). The average coverage and the average physical coverage in the pool were directly estimated from the data. Footnote a: coverage after sampling.
| Population | A | B | C | A | B | C | A | B | C |
|---|---|---|---|---|---|---|---|---|---|
| Simulated TEs | 1,000 | 750 | 500 | 1,000 | 750 | 500 | 1,000 | 750 | 500 |
| ID | 100 | 150 | 200 | 100 | 150 | 200 | 100 | 150 | 200 |
| Reads (million) | 1.045 | 1.379 | 2.045 | 1.045 | 1.045 | 1.045 | 1.045 | 1.379 | 2.045 |
| Average coverage | 199.91 | 266.66 | 399.97 | 199.91 | 202.19 | 204.34 | 199.91 | 266.66 | 399.97 |
| Average physical coverage | 99.11 | 198.78 | 398.23 | 99.11 | 150.82 | 203.68 | 100.00a | 100.00a | 100.00a |
| Observed TEs (mc2) | 396 | 676 | 495 | 396 | 580 | 455 | 147 | 64 | 19 |
| Observed/simulated | 0.396 | 0.901 | 0.990 | 0.396 | 0.773 | 0.910 | 0.147 | 0.085 | 0.038 |
| SD of observed/simulated | 0.320 | | | 0.266 | | | 0.054 | | |
| Observed TEs (mc1) | 784 | 745 | 496 | 784 | 742 | 499 | 469 | 375 | 251 |
| Observed/simulated | 0.784 | 0.993 | 0.992 | 0.784 | 0.989 | 0.998 | 0.469 | 0.500 | 0.502 |
| SD of observed/simulated | 0.120 | 0.121 | 0.018 | | | | | | |
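
The single value reported per group of three populations can be reproduced as the sample standard deviation of the observed/simulated ratios. The sketch below recomputes it from the counts in the table. The grouping by column position and the names `groups` and `ratio_sd` are assumptions made for illustration. The results agree with the tabulated values to within about 0.001, which is the precision expected from rounding.

```python
from statistics import stdev

simulated = [1000, 750, 500]  # populations A, B, C

# Observed TE counts per group of three columns, copied from the table.
groups = {
    "mc2": [[396, 676, 495], [396, 580, 455], [147, 64, 19]],
    "mc1": [[784, 745, 496], [784, 742, 499], [469, 375, 251]],
}


def ratio_sd(observed):
    """Sample standard deviation of observed/simulated across populations."""
    ratios = [obs / sim for obs, sim in zip(observed, simulated)]
    return stdev(ratios)


for threshold, columns in groups.items():
    values = ", ".join(f"{ratio_sd(observed):.3f}" for observed in columns)
    print(threshold, values)
```

In both rows the third group, where the physical coverage was sampled to equal levels, has by far the smallest spread, which is the sense in which that strategy gives the most stable comparison.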