Junwoo Bae, Kyeong Won Lee, Mohammad Nazrul Islam, Hyung-Soon Yim, Heejin Park, Mina Rho.
Abstract
BACKGROUND: Recent advances in sequencing technology have made it possible to investigate personal genomes for structural variations, which have been studied extensively to identify their association with the physiology of diseases such as cancer. In particular, mobile genetic elements (MGEs) are one of the major constituents of the human genome, and they cause genome instability through insertion, mutation, and rearrangement.
RESULTS: We have developed a new program, iMGEins, to identify such novel MGEs from the sequencing reads of individual genomes, and to explore the breakpoints together with the supporting reads and the MGEs detected. iMGEins is the first MGE detection program that integrates three algorithmic components: discordant read-pair mapping, split-read mapping, and insertion sequence assembly. Our evaluation showed outstanding performance in detecting novel MGEs from simulated genomes as well as real personal genomes. In detail, the average recall and precision rates of iMGEins are 96.67% and 100%, respectively, which are the highest among the programs compared. In tests on the real human genome of the NA12878 sample, iMGEins shows the highest accuracy in detecting MGEs within 20 bp of the annotated breakpoints.
Keywords: Long insertions; Mobile genetic elements; Paired-end sequencing; Structural variations
Year: 2018 PMID: 30563451 PMCID: PMC6299635 DOI: 10.1186/s12864-018-5290-9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 2 Overview of the iMGEins pipeline. iMGEins has four phases: (a) read classification, (b) breakpoint prediction, (c) MGE identification, and (d) de novo assembly
Fig. 1 Classification of reads to find and annotate MGEs. (a) When reads are aligned with local alignment, there are three types of reads: mapped reads (M, Reads 1, 2, 6, and 7); soft-clipped reads that span an MGE insertion (S, Reads 1, 3, 5, and 7); and unmapped reads that originate from MGE insertions (U, Reads 2, 3, 4, 5, and 6). (b) All read pairs are grouped into six classes: class 1 (M-U or U-M), class 2 (S-U or U-S), class 3 (M-S or S-M), class 4 (S-S), class 5 (M-M), and class 6 (U-U). Soft-clipped reads in classes 2, 3 and 5 are used to find breakpoints. The read pairs in classes 1 and 2 are considered one-end unmapped reads, which are anchored upstream or downstream of the breakpoints. These one-end unmapped reads are used to annotate the inserted MGEs. (c) Clipped subsequences (in yellow) at the breakpoints are used to check the integrity of the aligned reads. Breakpoints that align poorly with the clipped subsequences are discarded. (d) The breakpoint is estimated only to within a small vicinity when a few nucleotides on either side of it are identical. This situation can occur when the inserted fragment has a target site duplication: the bases of read 1 at the beginning of the MGE may be the same as those downstream of the breakpoint, or the bases of read 2 at the end of the MGE may be the same as those upstream
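The six-class grouping in panel (b) can be illustrated with a minimal Python sketch. It assumes each read's state is derived from a SAM-style CIGAR string (any `S` operation marks the read as soft-clipped); the function names `read_state` and `pair_class` are hypothetical and are not taken from iMGEins itself.

```python
import re

# Matches one CIGAR operation, e.g. "60M" or "40S" (SAM operation letters).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def read_state(cigar, is_unmapped=False):
    """Return 'U' (unmapped), 'S' (soft-clipped) or 'M' (mapped) for one read.

    Assumption: a read counts as soft-clipped if its CIGAR has any 'S' operation.
    """
    if is_unmapped or cigar in ("", "*"):
        return "U"
    ops = [op for _, op in CIGAR_OP.findall(cigar)]
    return "S" if "S" in ops else "M"

# Unordered pair of read states -> class number, following the caption of Fig. 1b.
PAIR_CLASS = {
    frozenset("MU"): 1,  # M-U or U-M
    frozenset("SU"): 2,  # S-U or U-S
    frozenset("MS"): 3,  # M-S or S-M
    frozenset("S"): 4,   # S-S
    frozenset("M"): 5,   # M-M
    frozenset("U"): 6,   # U-U
}

def pair_class(state1, state2):
    """Map the states of the two mates of a read pair to a class from 1 to 6."""
    return PAIR_CLASS[frozenset((state1, state2))]

if __name__ == "__main__":
    mate1 = read_state("60M40S")                # soft-clipped mate
    mate2 = read_state("*", is_unmapped=True)   # unmapped mate
    print(mate1, mate2, pair_class(mate1, mate2))
```

Using an unordered set as the lookup key means M-U and U-M fall into the same class without separate entries.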
Fig. 3 Assembly of inserted MGEs. Unmapped paired reads and one-end unmapped reads are collected in the read classification step and the MGE identification step. Reads are assembled into contigs, which include the fragments inserted at each breakpoint. One-end unmapped reads are used to find the contigs inserted at a specific breakpoint. The contigs are searched against the MGE library for annotation
Recall rates of breakpoint detection for MGEs and novel insertions by iMGEins, PoPoolationTE, TEMP, RetroSeq and MELT in simulation set 1
| Simulation type | MGE type | Ratio of SNVs | iMGEins | PoPoolationTE | TEMP | RetroSeq | MELT |
|---|---|---|---|---|---|---|---|
| Non variant (%) | LINE | | 100.00 | 22.00 | 90.00 | 2.00 | 98.00 |
| | SINE | | 94.00 | 84.00 | 96.00 | 24.00 | 98.00 |
| | LTR | | 96.00 | 86.00 | 100.00 | 62.00 | 100.00 |
| | DNA | | 98.00 | 86.00 | 98.00 | 26.00 | 98.00 |
| SNV (%) | LINE | 10% | 100.00 | 10.00 | 0.00 | 0.00 | 40.00 |
| | | 20% | 100.00 | 0.00 | 0.00 | 20.00 | 0.00 |
| | | 30% | 100.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| | | 40% | 100.00 | 0.00 | 0.00 | 10.00 | 0.00 |
| | | 50% | 90.00 | 0.00 | 0.00 | 25.00 | 0.00 |
| | SINE | 10% | 95.00 | 60.00 | 0.00 | 25.00 | 10.00 |
| | | 20% | 100.00 | 0.00 | 0.00 | 40.00 | 0.00 |
| | | 30% | 100.00 | 0.00 | 0.00 | 25.00 | 0.00 |
| | | 40% | 95.00 | 5.00 | 0.00 | 35.00 | 0.00 |
| | | 50% | 100.00 | 0.00 | 0.00 | 50.00 | 0.00 |
| | LTR | 10% | 90.00 | 70.00 | 0.00 | 45.00 | 55.00 |
| | | 20% | 100.00 | 0.00 | 0.00 | 55.00 | 0.00 |
| | | 30% | 85.00 | 0.00 | 0.00 | 50.00 | 0.00 |
| | | 40% | 95.00 | 0.00 | 0.00 | 40.00 | 0.00 |
| | | 50% | 95.00 | 0.00 | 0.00 | 60.00 | 0.00 |
| Random (%) | LINE | | 98.00 | 3.33 | 0.00 | 34.67 | 0.00 |
| | SINE | | 98.67 | 6.00 | 0.00 | 36.00 | 0.00 |
| | LTR | | 98.00 | 0.67 | 0.00 | 34.00 | 0.00 |
| | DNA | | 100.00 | 0.00 | 0.00 | 36.00 | 0.00 |
| True Positives | | | 975 | 183 | 192 | 328 | 218 |
| False Positives | | | 21 | 16,109 | 22 | 663 | 9 |
Fig. 4 Recall and precision rates of breakpoints identified for the second simulated data set (Additional file 1: Table S2) by iMGEins, PoPoolationTE, TEMP, RetroSeq, and MELT
Performance of MGE detection by iMGEins, PoPoolationTE, TEMP, RetroSeq and MELT in simulation set 2
| | Low coverage (30x) | | | | | High coverage (90x) | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | iMGEins | PoPoolationTE | TEMP | RetroSeq | MELT | iMGEins | PoPoolationTE | TEMP | RetroSeq | MELT |
| True positive | | | | | | | | | | |
| Human MGEs | 78 | 75 | 80 | 19 | 72 | 75 | 2 | 80 | 27 | 74 |
| Primate MGEs | 77 | 74 | 78 | 20 | 37 | 78 | 10 | 79 | 31 | 39 |
| Novel insertions | 77 | 0 | 0 | 1 | 0 | 77 | 0 | 0 | 0 | 0 |
| Total | 232 | 149 | 158 | 40 | 109 | 231 | 12 | 159 | 58 | 113 |
| False positive | | | | | | | | | | |
| Human MGEs | 0 | 0 | 31 | 121 | 6 | 0 | 1 | 56 | 117 | 5 |
| Primate MGEs | 0 | 1 | 10 | 188 | 21 | 0 | 1 | 40 | 289 | 21 |
| Novel insertions | 0 | 0 | 0 | 33 | 0 | 0 | 0 | 0 | 43 | 0 |
| Total | 0 | 1 | 41 | 342 | 27 | 0 | 2 | 96 | 449 | 26 |
| Recall (%) | | | | | | | | | | |
| Human MGEs | 97.50 | 93.75 | 100.00 | 23.75 | 90.00 | 93.75 | 2.50 | 100.00 | 33.75 | 92.50 |
| Primate MGEs | 96.25 | 92.50 | 97.50 | 25.00 | 46.25 | 97.50 | 12.50 | 98.75 | 38.75 | 48.75 |
| Novel insertions | 96.25 | –^a | – | 1.25 | – | 96.25 | – | – | 0.00 | – |
| Average | | | | | | | | | | |
| Complete^b | 96.67 | 62.08 | 65.83 | 16.67 | 45.42 | 95.83 | 5.00 | 66.25 | 24.17 | 47.08 |
| Precision (%) | | | | | | | | | | |
| Human MGEs | 100.00 | 100.00 | 72.07 | 13.57 | 92.31 | 100.00 | 66.67 | 58.82 | 18.75 | 93.67 |
| Primate MGEs | 100.00 | 98.67 | 88.64 | 9.62 | 63.79 | 100.00 | 90.91 | 66.39 | 9.69 | 65.00 |
| Novel insertions | 100.00 | – | – | 2.94 | – | 100.00 | – | – | 0.00 | – |
| Average | 100.00 | 99.33 | 79.40 | 10.47 | 80.15 | 100.00 | 85.71 | 62.35 | 11.44 | 81.29 |

^a These programs cannot find novel insertions
^b The rates for the entire test case
^c The rates without the novel insertion category
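The recall and precision rows above follow from the true-positive and false-positive counts. The sketch below reproduces the low-coverage iMGEins and TEMP figures under two assumptions that are inferred from the table rather than stated in it: each category contains 80 simulated insertions (78 true positives correspond to 97.50% recall), and the "Average" precision is pooled over all categories as total TP / (total TP + total FP).

```python
def recall(tp, n_true):
    """Recall in percent: detected true insertions over all simulated insertions."""
    return 100.0 * tp / n_true

def precision(tp, fp):
    """Precision in percent; undefined (NaN) when a program makes no calls."""
    total = tp + fp
    return 100.0 * tp / total if total else float("nan")

# Assumption inferred from the table: 78 TP <-> 97.50% recall implies 80 per category.
N_PER_CATEGORY = 80

# Low-coverage (30x) iMGEins true-positive counts, copied from the table.
imgeins_tp = {"Human MGEs": 78, "Primate MGEs": 77, "Novel insertions": 77}

per_category = {name: recall(tp, N_PER_CATEGORY) for name, tp in imgeins_tp.items()}
complete_recall = sum(per_category.values()) / len(per_category)

if __name__ == "__main__":
    for name, value in per_category.items():
        print(f"{name}: {value:.2f}")
    print(f"Complete average recall: {complete_recall:.2f}")
    # Pooled precision for TEMP at 30x: 158 TP and 41 FP in total.
    print(f"TEMP pooled precision: {precision(158, 41):.2f}")
```

The "Complete" recall is the unweighted mean of the three category recalls, which is why programs that cannot report novel insertions are capped near 66%.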
Comparison of breakpoint prediction on the NA12878 dataset by iMGEins, Tangram, Tea, TEMP, RetroSeq and MELT
| Proximity around breakpoints^a | iMGEins | Tangram | Tea | TEMP | RetroSeq | MELT |
|---|---|---|---|---|---|---|
| Breakpoints ±100 bp | 400 | 433 | 397 | 394 | 426 | 439 |
| Breakpoints ±20 bp | 397 | 421 | 393 | 394 | 305 | 421 |
| Breakpoints ±10 bp | 366 | 297 | 348 | 363 | 180 | 93 |

^a The distance from an annotated breakpoint within which a predicted breakpoint is counted as a true positive
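The proximity criterion in this table can be sketched as follows. This is an illustrative Python example, not the evaluation script used in the study: the function name `count_within` and the toy coordinates are hypothetical, and it assumes an annotated breakpoint counts once if any prediction on the same chromosome lies within the window.

```python
from bisect import bisect_left

def count_within(annotated, predicted, window):
    """Count annotated breakpoints that have a prediction within +/- window bp.

    Both inputs are iterables of (chromosome, position) tuples.
    """
    # Group predicted positions by chromosome and sort them for binary search.
    by_chrom = {}
    for chrom, pos in predicted:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    hits = 0
    for chrom, pos in annotated:
        positions = by_chrom.get(chrom, [])
        # First prediction at or after the left edge of the window.
        i = bisect_left(positions, pos - window)
        if i < len(positions) and positions[i] <= pos + window:
            hits += 1
    return hits

if __name__ == "__main__":
    # Toy coordinates for illustration only; not taken from NA12878.
    annotated = [("chr1", 1000), ("chr1", 5000), ("chr2", 300)]
    predicted = [("chr1", 1008), ("chr1", 5050), ("chr2", 900)]
    for window in (100, 20, 10):
        print(window, count_within(annotated, predicted, window))
```

Counting per annotated breakpoint, rather than per prediction, keeps several nearby calls from inflating the true-positive total.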