Yu Qian, Birte Kehr, Bjarni V Halldórsson.
Abstract
Alu elements are sequences of approximately 300 basepairs that together comprise more than 10% of the human genome. Due to their recent origin in primate evolution some Alu elements are polymorphic in humans, present in some individuals while absent in others. We present PopAlu, a tool to detect polymorphic Alu elements on a population scale from paired-end sequencing data. PopAlu uses read pair distance and orientation as well as split reads to identify the location and precise breakpoints of polymorphic Alus. Genotype calling enables us to differentiate between homozygous and heterozygous carriers, making the output of PopAlu suitable for use in downstream analyses such as genome-wide association studies (GWAS). We show on a simulated dataset that PopAlu calls Alu elements inserted and deleted with respect to a reference genome with high accuracy and high precision. Our analysis of real data of a human trio from the 1000 Genomes Project confirms that PopAlu is able to produce highly accurate genotype calls. To our knowledge, PopAlu is the first tool that identifies polymorphic Alu elements from multiple individuals simultaneously, pinpoints the precise breakpoints and calls genotypes with high accuracy.
Keywords: Alu elements; Mobile element insertion; Paired-end sequencing; Polymorphism genotyping; Structural variation
Year: 2015 PMID: 26417547 PMCID: PMC4582951 DOI: 10.7717/peerj.1269
Source DB: PubMed Journal: PeerJ ISSN: 2167-8359 Impact factor: 2.984
Figure 1. Example read alignments at an Alu deletion site.
Arrows show read directions. The blue part of the reads can be mapped to the reference outside of the Alu and the red part can be mapped to the Alu. (A) shows example reads from a haplotype that carries allele H1. (B) shows example reads from a haplotype that carries allele H0. A heterozygous individual can have reads of the kinds shown in both (A) and (B).
Figure 2. Example read alignment at an Alu insertion site.
Arrows show read directions. The blue part of the reads can be mapped to the reference and the red parts are clipped or mapped somewhere else in the reference. (A) shows example reads from a non-Alu haplotype. (B) shows example reads from an Alu insertion haplotype. A heterozygous individual can have reads of the kinds shown in both (A) and (B).
Figure 3. Example instance of our two-level voting system that determines the exact breakpoints of an Alu insertion.
At the first level, split-reads vote for a left and right breakpoint position within each individual. At the second level, individuals vote for the positions that received the largest numbers of votes at the first level to choose the final breakpoint positions AL and AR. In this example, position b is elected as AR and position e is elected as AL.
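The two-level vote described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not PopAlu's implementation: the input layout (`split_read_votes`, a dict from individual ID to the positions suggested by that individual's split reads), the function names, and the tie-breaking rule (the smallest position wins) are all assumptions made for the example.

```python
from collections import Counter


def most_voted(positions):
    """Return the position with the most votes.

    Ties are broken by taking the smallest position; this rule is an
    assumption for the sketch, not something stated in the source.
    """
    counts = Counter(positions)
    top = max(counts.values())
    return min(pos for pos, n in counts.items() if n == top)


def elect_breakpoint(split_read_votes):
    """Two-level vote for one breakpoint (left or right).

    split_read_votes: hypothetical dict mapping an individual ID to the list
    of breakpoint positions suggested by that individual's split reads.
    """
    # Level 1: within each individual, split reads vote for a position.
    per_individual = [most_voted(v) for v in split_read_votes.values() if v]
    if not per_individual:
        return None
    # Level 2: each individual casts one vote for its own winning position.
    return most_voted(per_individual)


# Made-up positions, chosen so that the two levels matter.
votes = {
    "ind1": [5, 5, 5, 5, 5, 5],  # six split reads, all at position 5
    "ind2": [6, 6],
    "ind3": [6, 6],
}
elected = elect_breakpoint(votes)
print(elected)
```

In this toy input, pooling all split reads would favour position 5 (six reads against four), whereas the two-level vote elects position 6 because two of the three individuals support it. Giving each individual one vote keeps a single deeply covered sample from dominating the result.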
Simulated Alu counts of 100 individuals.
The sum columns give the total number of simulated Alu elements; the min and max columns give the minimum and maximum number of Alu elements in a single simulated individual.
| Dataset | sum | min | max | sum | min | max |
|---|---|---|---|---|---|---|
| | 3,653 | 25 | 44 | 3,424 | 25 | 43 |
| | 3,753 | 26 | 47 | 3,366 | 26 | 44 |
Summary of predicted Alu counts.
C represents the number of polymorphic Alus predicted to have genotype p when the true underlying genotype is t. The counts C are further grouped into four types: TP (True Positive), FN (False Negative), FP (False Positive) and GE (Genotype-calling Error). The definitions of Sensitivity and False Discovery Rate (FDR) are given in the main text.
| Coverage | Dataset | Tool | TP | | FN | | FP | | GE | | Sensitivity | FDR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ∼10× | | PopAlu | 2,668 | 3,350 | 985 | 19 | 0 | 0 | 0 | 55 | 85.8% | 0% |
| | | PopAlu | 3,152 | 3,017 | 601 | 341 | 0 | 0 | 0 | 8 | 86.8% | 0% |
| | | RetroSeq | 530 | 2,505 | 1,347 | 490 | 999 | 1,119 | 1,876 | 371 | 74.2% | 28.6% |
| ∼25× | | PopAlu | 3,521 | 3,342 | 132 | 0 | 12 | 0 | 0 | 82 | 98.1% | 0.2% |
| | | PopAlu | 3,269 | 3,041 | 484 | 322 | 0 | 0 | 0 | 3 | 88.7% | 0% |
| | | RetroSeq | 2,302 | 261 | 1,191 | 520 | 1,913 | 79 | 260 | 2,585 | 76.0% | 26.9% |
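The formal definitions of Sensitivity and FDR are in the main text and are not reproduced here. As a rough check, the sketch below recomputes both from the counts in the table under two assumptions: that the eight count columns are two each for TP, FN, FP and GE, in that order, and that genotype-calling errors count as detected elements. With those assumptions the formulas reproduce the reported percentages for the rows tried; the function name and argument layout are hypothetical.

```python
def sensitivity_and_fdr(tp, fn, fp, ge):
    """Recompute summary metrics from one table row.

    Each argument is a pair of counts (the two sub-columns of that type).
    Assumed formulas, inferred from the table rather than quoted from the paper:
        sensitivity = (TP + GE) / (TP + GE + FN)
        FDR         = FP / (TP + GE + FP)
    """
    detected = sum(tp) + sum(ge)          # element found, genotype right or wrong
    sensitivity = detected / (detected + sum(fn))
    called = detected + sum(fp)           # everything the tool reported
    fdr = sum(fp) / called if called else 0.0
    return sensitivity, fdr


# Counts copied from the ~10x rows of the table.
popalu_10x = sensitivity_and_fdr(tp=(2668, 3350), fn=(985, 19), fp=(0, 0), ge=(0, 55))
retroseq_10x = sensitivity_and_fdr(
    tp=(530, 2505), fn=(1347, 490), fp=(999, 1119), ge=(1876, 371)
)

for name, (sens, fdr) in [("PopAlu", popalu_10x), ("RetroSeq", retroseq_10x)]:
    print(f"{name}: sensitivity {sens:.1%}, FDR {fdr:.1%}")
```

Counting GE as detected separates two questions: whether a tool finds the element at all (Sensitivity) and whether it assigns the right genotype (the GE columns).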
Predicted and validated Alu Insertion calls for the CEU trio.
The PCR (total) column gives the total number of PCR-validated Alu insertion calls by Stewart et al. (2011) for each sample, and the PCR columns give the number of validated calls that are also predicted by each program. The Distance columns show the average distance in basepairs between the predicted breakpoint and the breakpoint reported by PCR. For PopAlu, we calculated the distance from the mid-point of the reported interval. For Mobster, we calculated the distance based on the reported “Insert Point”.
| Sample | PCR (total) | PopAlu Total | PopAlu PCR | PopAlu Distance | RetroSeq Total | RetroSeq PCR | RetroSeq Distance | Mobster Total | Mobster PCR | Mobster Distance |
|---|---|---|---|---|---|---|---|---|---|---|
| NA12878 | 165 | 1,441 | 162 | 4.7 bp | 1,038 | 162 | 16.2 bp | 1,058 | 164 | 5.0 bp |
| NA12891 | 142 | 1,432 | 138 | 4.6 bp | 1,046 | 139 | 17.6 bp | 1,030 | 140 | 6.4 bp |
| NA12892 | 152 | 1,405 | 150 | 4.9 bp | 1,078 | 148 | 16.5 bp | 1,023 | 149 | 6.7 bp |
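The Distance values for an interval-reporting tool such as PopAlu can be illustrated with a short sketch that takes the mid-point of each reported interval and averages its absolute offset from the validated breakpoint. The coordinates below are made up for illustration, and the function name and input layout are assumptions.

```python
def mean_breakpoint_distance(predicted_intervals, validated_positions):
    """Average |interval mid-point - validated breakpoint| in basepairs.

    predicted_intervals: list of (start, end) pairs reported by the caller.
    validated_positions: list of validated breakpoints, in the same order.
    Both inputs are hypothetical stand-ins for matched call/validation pairs.
    """
    distances = [
        abs((start + end) / 2 - truth)
        for (start, end), truth in zip(predicted_intervals, validated_positions)
    ]
    return sum(distances) / len(distances)


# Made-up coordinates: mid-points 1005 and 2002, validated at 1003 and 2006.
avg = mean_breakpoint_distance([(1000, 1010), (2000, 2004)], [1003, 2006])
print(f"{avg:.1f} bp")
```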
Genotype calls of PCR validated Alu insertion calls for the CEU trio.
C represents the number of polymorphic Alus predicted to have genotype p when the true underlying genotype is t. The true genotype was determined by PCR validation (Stewart et al., 2011).
| Sample | PopAlu | | | RetroSeq | | |
|---|---|---|---|---|---|---|
| NA12878 | 124 | 38 | 0 | 124 | 1 | 37 |
| NA12891 | 95 | 41 | 2 | 95 | 0 | 44 |
| NA12892 | 107 | 41 | 2 | 106 | 0 | 42 |