Feifei He1, Yang Li2, Yu-Hang Tang3, Jian Ma4,5, Huaiqiu Zhu6.
Abstract
BACKGROUND: Identifying inversions of DNA segments shorter than the read length (e.g., 100 bp), defined as micro-inversions (MIs), remains challenging with next-generation sequencing reads. MIs are acknowledged to be an important class of genomic variation and may play a role in causing genetic disease. However, current alignment methods are generally insensitive to MIs. Here we develop a novel tool, MID (Micro-Inversion Detector), to identify MIs in human genomes using next-generation sequencing reads.
Year: 2016 PMID: 26818118 PMCID: PMC4895285 DOI: 10.1186/s12864-015-2305-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Sensitivity (SN), Positive Predictive Value (PPV) and Standard Deviation (SD) for MIs simulated onto chr10
| Coverage | Dataset 1 MID SN | Dataset 1 MID PPV | Dataset 1 Gustaf SN | Dataset 1 Gustaf PPV | Dataset 2 MID SN | Dataset 2 MID PPV | Dataset 2 Gustaf SN | Dataset 2 Gustaf PPV |
|---|---|---|---|---|---|---|---|---|
| 2 | 0.692 | 0.900 | 0.423 | 0.688 | 0.741 | 0.976 | 0.333 | 0.783 |
| 4 | 0.732 | 0.911 | 0.339 | 0.760 | 0.840 | 0.932 | 0.506 | 0.872 |
| 6 | 0.782 | 0.871 | 0.359 | 0.718 | 0.782 | 0.935 | 0.427 | 0.810 |
| 8 | 0.870 | 0.918 | 0.403 | 0.689 | 0.784 | 0.916 | 0.302 | 0.824 |
| 10 | 0.771 | 0.857 | 0.239 | 0.703 | 0.877 | 0.932 | 0.351 | 0.800 |
| 20 | 0.767 | 0.904 | 0.233 | 0.667 | 0.849 | 0.914 | 0.267 | 0.705 |
| 30 | 0.859 | 0.886 | 0.188 | 0.514 | 0.821 | 0.912 | 0.254 | 0.627 |
| 40 | 0.781 | 0.860 | 0.250 | 0.588 | 0.820 | 0.907 | 0.230 | 0.603 |
| 50 | 0.781 | 0.856 | 0.211 | 0.654 | 0.837 | 0.906 | 0.266 | 0.616 |
| 60 | 0.835 | 0.857 | 0.264 | 0.563 | 0.863 | 0.900 | 0.231 | 0.509 |
| SD | 0.052 | 0.023 | 0.079 | 0.072 | 0.039 | 0.021 | 0.086 | 0.114 |
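The SN and PPV columns can be read against the standard definitions, SN = TP / (TP + FN) and PPV = TP / (TP + FP), which this excerpt does not spell out. The sketch below is a minimal illustration under that assumption; the function names and the TP/FP/FN counts are hypothetical and not taken from the study. The SD check uses the Dataset 1 MID SN column from the table, for which the population standard deviation (rather than the sample form) reproduces the reported 0.052.

```python
from statistics import pstdev


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true (simulated) MIs that were detected."""
    return tp / (tp + fn)


def ppv(tp: int, fp: int) -> float:
    """Fraction of reported MIs that are true MIs."""
    return tp / (tp + fp)


# Hypothetical counts for illustration only (not from the paper).
tp, fp, fn = 45, 5, 15
example_sn = sensitivity(tp, fn)
example_ppv = ppv(tp, fp)

# Dataset 1, MID, SN column, copied from the table (coverage 2 to 60).
mid_sn_dataset1 = [0.692, 0.732, 0.782, 0.870, 0.771,
                   0.767, 0.859, 0.781, 0.781, 0.835]

# Population standard deviation of the column, rounded as in the table.
sd_mid_sn_dataset1 = round(pstdev(mid_sn_dataset1), 3)
print(example_sn, example_ppv, sd_mid_sn_dataset1)
```

Whether the other SD entries were computed the same way is not stated here, so the population form should be treated as an inference from this one column.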
Overview of MIs detected in 638 individual samples from 1KGP
| population | sam-num | MI-num | MI-occ | mul-sup | read-num | occ/num | mul-sup/num | read/num |
|---|---|---|---|---|---|---|---|---|
| MXL | 48 | 79 | 153 | 23 | 175 | 1.94 | 29.11 % | 2.22 |
| PUR | 48 | 117 | 167 | 26 | 185 | 1.43 | 22.22 % | 1.58 |
| CLM | 26 | 60 | 83 | 14 | 91 | 1.38 | 23.33 % | 1.52 |
| PEL | 20 | 40 | 58 | 12 | 63 | 1.45 | 30.00 % | 1.58 |
| Total America | 142 | 206 | 296 | 55 | 514 | 1.44 | 26.70 % | 2.50 |
| CDX | 40 | 40 | 65 | 11 | 71 | 1.63 | 27.50 % | 1.78 |
| CHB | 26 | 42 | 52 | 7 | 57 | 1.24 | 16.67 % | 1.36 |
| CHS | 41 | 56 | 72 | 10 | 79 | 1.29 | 17.86 % | 1.41 |
| JPT | 58 | 160 | 325 | 55 | 356 | 2.03 | 34.38 % | 2.23 |
| KHV | 31 | 51 | 81 | 14 | 96 | 1.59 | 27.45 % | 1.88 |
| Total East Asia | 196 | 262 | 349 | 53 | 659 | 1.33 | 20.23 % | 2.52 |
| GIH | 11 | 22 | 29 | 6 | 34 | 1.32 | 27.27 % | 1.55 |
| Total South Asia | 11 | 22 | 22 | 0 | 34 | 1.00 | 0.00 % | 1.55 |
| CEU | 21 | 67 | 82 | 10 | 94 | 1.22 | 14.93 % | 1.40 |
| FIN | 41 | 72 | 87 | 6 | 93 | 1.21 | 8.33 % | 1.29 |
| GBR | 35 | 80 | 104 | 16 | 114 | 1.30 | 20.00 % | 1.43 |
| IBS | 24 | 56 | 80 | 14 | 86 | 1.43 | 25.00 % | 1.54 |
| TSI | 9 | 28 | 34 | 5 | 37 | 1.21 | 17.86 % | 1.32 |
| Total Europe | 130 | 223 | 303 | 39 | 424 | 1.36 | 17.49 % | 1.90 |
| YRI | 42 | 156 | 239 | 41 | 255 | 1.53 | 26.28 % | 1.63 |
| LWK | 54 | 113 | 180 | 28 | 193 | 1.59 | 24.78 % | 1.71 |
| ACB | 14 | 35 | 46 | 9 | 48 | 1.31 | 25.71 % | 1.37 |
| ASW | 49 | 127 | 262 | 49 | 286 | 2.06 | 38.58 % | 2.25 |
| Total Africa | 159 | 284 | 431 | 90 | 782 | 1.52 | 31.69 % | 2.75 |
Column definitions: “sam-num” is the number of samples in each category (population or population group); “MI-num” is the number of distinct MIs detected in that category; “MI-occ” is the total number of MI occurrences in that category; “read-num” is the number of reads supporting MIs. For population rows, “mul-sup” is the number of MIs supported by at least two individual samples in the population (“multiple-sample-supported MIs”), “occ/num” is the ratio of MI occurrences to MI number, i.e. the average number of samples supporting one MI in that population, and “mul-sup/num” is the ratio of multiple-sample-supported MIs to all MIs. For population-group rows (those starting with “Total”), “mul-sup” is the number of MIs supported by at least two populations in the group (“multiple-population-supported MIs”), “occ/num” is the ratio of MI occurrences to MI number, i.e. the average number of populations supporting one MI in that group, and “mul-sup/num” is the ratio of multiple-population-supported MIs to the total number of MIs. The last column, “read/num”, is the ratio of the number of reads containing MIs to the number of MIs, i.e. the average number of reads supporting each MI.
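The three ratio columns follow directly from the count columns. As a minimal sketch, the hypothetical helper below (its name and signature are illustrative) recomputes them for the MXL row of the table, using the counts exactly as listed there.

```python
def ratio_columns(mi_num: int, mi_occ: int, mul_sup: int, read_num: int) -> dict:
    """Derive the occ/num, mul-sup/num (percent) and read/num columns."""
    return {
        "occ/num": round(mi_occ / mi_num, 2),
        "mul-sup/num": round(100 * mul_sup / mi_num, 2),
        "read/num": round(read_num / mi_num, 2),
    }


# MXL row from the table: MI-num 79, MI-occ 153, mul-sup 23, read-num 175.
mxl = ratio_columns(mi_num=79, mi_occ=153, mul_sup=23, read_num=175)
print(mxl)
```

For this row the recomputed values agree with the listed 1.94, 29.11 % and 2.22.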
Fig. 1 MI examples in 1KGP data. a shows one MI supported by 46 samples that overlaps the 3’ UTRs of genes PREPL and SLC3A; b shows the distribution of this MI across populations; c shows one MI that changes 6 amino acids in an exon of gene OR51I1
Fig. 2 MI examples in Lung Squamous Cell Carcinoma WXS data from CCLE. a shows one MI that breaks the edge of the CDS region and changes 3 amino acids of gene PSRC1, together with an insertion next to the MI; b shows one MI that changes 5 amino acids of gene JMJD4 and overlaps the 5’ UTR of gene SNAP47
Fig. 3 The workflow of the MID program
Fig. 4 Methods overview of MID. a shows the construction of alignment regions. The head (A) and tail (G) substrings of the read serve as a pair of anchors. Segment C in the reference sequence denotes a deletion in the read, segment D denotes an inversion in the read, and segment F denotes a translocation. The set of segments (B, C, D, E, F) in the reference sequence, together with the set of segments (B, −D, F’, E) in the read sequence, form the target regions. b shows how a partition-and-recombine strategy converts a pair of overlapping MSs (A, B) into a set of non-overlapping MSs {(A, b), (a, B), (a, b)}. Segments A and B share an overlapping region c, and segments a and b are obtained by cutting region c off the original segments A and B. c shows the max-score path approach. The path starting at MSP [1] and ending at MSP [4] represents a short read to be examined, and the path in red is the max-score path. MSP [1], MSP [2], MSP [3] and MSP [4] are on the forward strand, and MSP [−3] is on the reverse strand. MID starts at MSP [1] and extends the path to MSP [2]. If MSP [3] and MSP [−3] can both be matched, MID records both path candidates and ends at MSP [4], giving the candidates {1, 2, 3, 4} and {1, 2, −3, 4}; after scoring, {1, 2, 3, 4} is chosen over {1, 2, −3, 4} because of the reverse penalty for MSP [−3] on the reverse strand. If instead only MSP [−3] is matched, the candidates are {1, 2, 4} and {1, 2, −3, 4}; after scoring, {1, 2, −3, 4} is chosen because of the gap penalty, since {1, 2, 4} contains many more gaps. MSP [−3] is then reported as a detected MI.
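The path selection described in panel c can be illustrated with a toy scoring scheme. This is only a sketch under stated assumptions: the excerpt does not give MID's actual scores or penalties, so the tuple representation of an MSP, the constants `REVERSE_PENALTY` and `GAP_PENALTY`, the segment lengths, and all function names below are hypothetical.

```python
from typing import List, Sequence, Tuple

# An MSP is modelled as (label, matched_length); a negative label marks a
# segment matched on the reverse strand. This representation is an assumption.
MSP = Tuple[int, int]

REVERSE_PENALTY = 2  # assumed cost per reverse-strand segment
GAP_PENALTY = 1      # assumed cost per read base left uncovered by the path


def score_path(path: Sequence[MSP], read_length: int) -> int:
    """Covered bases minus reverse-strand and gap penalties."""
    covered = sum(length for _, length in path)
    n_reverse = sum(1 for label, _ in path if label < 0)
    gap_bases = read_length - covered
    return covered - REVERSE_PENALTY * n_reverse - GAP_PENALTY * gap_bases


def max_score_path(candidates: Sequence[Sequence[MSP]], read_length: int) -> List[MSP]:
    """Return the candidate path with the highest score."""
    return list(max(candidates, key=lambda p: score_path(p, read_length)))


def detected_mis(path: Sequence[MSP]) -> List[int]:
    """Reverse-strand segments on the chosen path are reported as MIs."""
    return [label for label, _ in path if label < 0]


def labels(path: Sequence[MSP]) -> List[int]:
    return [label for label, _ in path]


READ_LENGTH = 100  # illustrative read length
m1, m2, m3, m3_rev, m4 = (1, 30), (2, 20), (3, 20), (-3, 20), (4, 30)

# Case 1: MSP[3] and MSP[-3] both match, so the forward path avoids the
# reverse penalty and wins.
both_match = max_score_path([[m1, m2, m3, m4], [m1, m2, m3_rev, m4]], READ_LENGTH)

# Case 2: only MSP[-3] matches, so skipping it leaves a 20-base gap and the
# path through MSP[-3] wins despite the reverse penalty.
only_reverse = max_score_path([[m1, m2, m4], [m1, m2, m3_rev, m4]], READ_LENGTH)

print(labels(both_match), detected_mis(both_match))
print(labels(only_reverse), detected_mis(only_reverse))
```

With these assumed penalties the reverse penalty acts as a tie-breaker against spurious inversions, while the gap penalty makes an inverted match preferable to leaving the segment unaligned, mirroring the two cases in the caption.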