Seyed Amir Malekpour, Hamid Pezeshk, Mehdi Sadeghi.
Abstract
BACKGROUND: Copy Number Variation (CNV) is considered a major source of large structural variation in the human genome. In recent years, many studies have applied Next Generation Sequencing (NGS) data to CNV detection. However, more accurate computational tools are still needed.
Keywords: Copy Number Variation (CNV); Expectation Maximization (EM) algorithm; Hidden Markov Models (HMMs); Next Generation Sequencing (NGS); mixture densities
Year: 2016 PMID: 27809781 PMCID: PMC5445519 DOI: 10.1186/s12859-016-1296-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Mate pairs taken as the observation for the 2nd genomic segment. A mate pair counts toward the observation for the 2nd segment when its reads flank the segment and its insertion region spans it. Reads that do not satisfy these conditions are discarded
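The selection rule in this caption can be sketched as a simple predicate. This is a minimal illustration, not the authors' code: the `MatePair` type, its field names, and the inclusive coordinate convention are assumptions made for the example.

```python
from typing import NamedTuple


class MatePair(NamedTuple):
    """Hypothetical mate-pair record using inclusive reference coordinates."""
    left_end: int     # last reference position covered by the upstream read
    right_start: int  # first reference position covered by the downstream read


def spans_segment(pair: MatePair, seg_start: int, seg_end: int) -> bool:
    """Return True when both reads flank the segment, so that the
    insertion region between them covers the whole segment."""
    return pair.left_end < seg_start and pair.right_start > seg_end


def observations_for_segment(pairs, seg_start, seg_end):
    """Keep only the mate pairs that count as observations for one segment."""
    return [p for p in pairs if spans_segment(p, seg_start, seg_end)]
```

For a segment covering positions 150 to 299, a pair whose upstream read ends at 140 and whose downstream read starts at 310 is kept, while a pair with a read that starts or ends inside the segment is dropped.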
Fig. 2 Mate pairs generated from a region with tandem duplications are mapped to the reference. Abnormalities in the insertion size and direction of a mate pair depend on whether it originates near a tandem duplication breakpoint. (a) A mate pair spanning the tandem duplication in the sample genome. After mapping to the reference genome, this mate pair shows a change in direction and an abnormal insertion size (the distance from point a to point b). (b) Two mate pairs that are not located around a breakpoint. These pairs map normally to the reference genome, without any change in insertion size or direction
Fig. 3 HMM structure, states and transition probabilities. In the diploid state each genomic segment has two copies. In the heterozygous deletion and homozygous deletion states each genomic segment appears in one copy and in no copies, respectively. The duplication state models genomic segments that have more than two copies in the sample genome, at least one of which is of the tandem duplication type
Expected distribution of the observation in different states
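| State | Distribution 1 | Distribution 2 | Distribution 3 |
|---|---|---|---|
| Diploid | [formula missing] | - | - |
| Heterozygous deletion | [formula missing] | [formula missing] | - |
| Homozygous deletion | - | [formula missing] | - |
| Tandem duplication | [formula missing] | - | [formula missing] |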
In the diploid and homozygous deletion states, the insertion sizes follow a unimodal distribution, while in the heterozygous deletion and tandem duplication states they follow a bimodal distribution
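The unimodal versus bimodal behaviour can be illustrated with a state-dependent mixture density. This is only a sketch under stated assumptions: the component formulas are not available here, so the Gaussian components, the equal default weights, and the assignment of components to states in `ACTIVE_COMPONENTS` are all hypothetical choices made for the illustration.

```python
import math

# Assumed layout (hypothetical): component 1 = normal insertion size,
# component 2 = insertion size enlarged by a deletion,
# component 3 = abnormal pairs around a tandem duplication breakpoint.
ACTIVE_COMPONENTS = {
    "diploid": (1,),                 # unimodal
    "heterozygous_deletion": (1, 2), # bimodal
    "homozygous_deletion": (2,),     # unimodal
    "tandem_duplication": (1, 3),    # bimodal
}


def normal_pdf(x, mean, std):
    """Density of a normal distribution (assumed component shape)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def state_density(x, state, means, std, weights=None):
    """Mixture density of insertion size x given a hidden state.

    means maps component index -> mean; weights default to equal shares
    of the components that are active in the given state.
    """
    components = ACTIVE_COMPONENTS[state]
    if weights is None:
        weights = [1.0 / len(components)] * len(components)
    return sum(w * normal_pdf(x, means[c], std)
               for w, c in zip(weights, components))
```

With well-separated component means, a one-component state has a single peak and a two-component state has two, which is the qualitative pattern the note above describes.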
Initial parameter vector (α1, α2, α3) and their estimation after several iterations of the EM algorithm
| Initial | Estimated |
|---|---|
| [values missing] | [values missing] |
PSE-HMM precision and recall are computed for a simulated dataset with 10× depth of coverage
| Predicted state | Real: heterozygous deletion | Real: diploid | Real: homozygous deletion | Real: tandem duplication | Sum | Precision | Recall |
|---|---|---|---|---|---|---|---|
| Heterozygous deletion | 1,146 | 660 | 97 | 0 | 1,903 | 0.60 | 1.00 |
| Diploid | 2 | 23,560 | 2 | 89 | 23,653 | 1.00 | 0.95 |
| Homozygous deletion | 0 | 279 | 1,221 | 1 | 1,501 | 0.81 | 0.93 |
| Tandem duplication | 3 | 363 | 0 | 2,577 | 2,943 | 0.88 | 0.97 |
| Sum | 1,151 | 24,862 | 1,320 | 2,667 | 30,000 | | |

Each row is a predicted state and each "Real" column is the real state of the genomic segments; each cell gives the number of segments. A total of 30,000 genomic segments (4.5 million bp) are evaluated in this analysis
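The precision and recall columns follow from the counts in this table: precision divides the diagonal count by the row sum (all segments predicted as that state), and recall divides it by the column sum (all segments truly in that state). A minimal sketch using the counts above; the variable and function names are chosen for illustration.

```python
STATES = [
    "heterozygous deletion",
    "diploid",
    "homozygous deletion",
    "tandem duplication",
]

# Rows are predicted states, columns are real states, in the order of STATES.
CONFUSION = [
    [1146, 660, 97, 0],
    [2, 23560, 2, 89],
    [0, 279, 1221, 1],
    [3, 363, 0, 2577],
]


def precision_recall(matrix):
    """Return a list of (precision, recall) pairs, one per state."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [
        (matrix[k][k] / row_sums[k], matrix[k][k] / col_sums[k])
        for k in range(n)
    ]


if __name__ == "__main__":
    for state, (p, r) in zip(STATES, precision_recall(CONFUSION)):
        print(f"{state}: precision={p:.2f} recall={r:.2f}")
```

The low precision for heterozygous deletions comes from the 660 diploid segments in that row, which make up about a third of the 1,903 segments predicted as heterozygous deletions.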
Precision and recall values of PSE-HMM are compared to m-HMM, Pindel, CNV-seq, and Delly
| State | Method | Precision mean/std (1×) | Recall mean/std (1×) | F-measure (1×) | Precision mean/std (5×) | Recall mean/std (5×) | F-measure (5×) | Precision mean/std (10×) | Recall mean/std (10×) | F-measure (10×) |
|---|---|---|---|---|---|---|---|---|---|---|
| Duplications | PSE-HMM | 0.91/0.03 | 0.79/0.02 | 0.85 | 0.92/0.02 | 0.95/0.01 | 0.93 | 0.88/0.01 | 0.97/0.02 | 0.92 |
| Duplications | m-HMM | 0.95/0.01 | 0.21/0.02 | 0.35 | 1.00/0.02 | 0.64/0.02 | 0.78 | 1.00/0.01 | 0.71/0.01 | 0.83 |
| Duplications | Pindel | 1.00/0.00 | 0.11/0.04 | 0.20 | 1.00/0.00 | 0.67/0.03 | 0.80 | 1.00/0.01 | 0.81/0.03 | 0.90 |
| Duplications | CNV-seq | 0.55/0.03 | 0.41/0.03 | 0.47 | 0.98/0.00 | 0.54/0.03 | 0.70 | 0.99/0.00 | 0.57/0.03 | 0.72 |
| Duplications | Delly | 1.00/0.00 | 0.80/0.05 | 0.89 | 1.00/0.00 | 0.99/0.05 | 0.99 | 1.00/0.00 | 1.00/0.00 | 1.00 |
| Deletions | PSE-HMM (heterozygous) | 0.43/0.03 | 0.37/0.03 | 0.40 | 0.54/0.04 | 0.97/0.01 | 0.69 | 0.60/0.02 | 1.00/0.02 | 0.75 |
| Deletions | PSE-HMM (homozygous) | 0.20/0.03 | 0.92/0.05 | 0.33 | 0.73/0.03 | 0.97/0.02 | 0.83 | 0.81/0.02 | 0.93/0.03 | 0.87 |
| Deletions | PSE-HMM (hetero + homo)ᵃ | 0.31/0.03 | 0.86/0.02 | 0.46 | 0.63/0.02 | 0.99/0.01 | 0.77 | 0.72/0.03 | 1.00/0.03 | 0.84 |
| Deletions | m-HMM (heterozygous) | 0.67/0.02 | 0.16/0.04 | 0.25 | 0.93/0.03 | 0.88/0.03 | 0.91 | 0.93/0.03 | 0.92/0.02 | 0.93 |
| Deletions | m-HMM (homozygous) | 0.95/0.02 | 0.65/0.02 | 0.77 | 0.99/0.02 | 0.62/0.02 | 0.77 | 0.99/0.01 | 0.62/0.03 | 0.77 |
| Deletions | m-HMM (hetero + homo)ᵃ | 0.93/0.02 | 0.43/0.03 | 0.59 | 0.99/0.01 | 0.78/0.02 | 0.87 | 0.99/0.02 | 0.80/0.03 | 0.88 |
| Deletions | Pindel | 0.93/0.15 | 0.02/0.01 | 0.04 | 0.91/0.03 | 0.36/0.02 | 0.52 | 0.87/0.06 | 0.45/0.05 | 0.59 |
| Deletions | CNV-seq | 0.72/0.05 | 0.75/0.02 | 0.73 | 0.98/0.00 | 0.91/0.01 | 0.94 | 0.98/0.00 | 0.95/0.01 | 0.96 |
| Deletions | Delly | 0.98/0.00 | 0.32/0.04 | 0.48 | 0.99/0.00 | 0.48/0.03 | 0.65 | 0.99/0.00 | 0.49/0.04 | 0.66 |
| Diploid | PSE-HMM | 0.96/0.00 | 0.79/0.01 | 0.87 | 0.99/0.00 | 0.93/0.00 | 0.96 | 1.00/0.00 | 0.96/0.00 | 0.98 |
| Diploid | m-HMM | 0.87/0.01 | 1.00/0.01 | 0.93 | 0.94/0.00 | 1.00/0.00 | 0.97 | 0.95/0.00 | 1.00/0.00 | 0.97 |
| Diploid | Pindel | 0.82/0.01 | 1.00/0.00 | 0.90 | 0.90/0.01 | 1.00/0.00 | 0.95 | 0.93/0.01 | 1.00/0.00 | 0.96 |
| Diploid | CNV-seq | 0.91/0.00 | 0.93/0.00 | 0.92 | 0.94/0.00 | 1.00/0.00 | 0.97 | 0.95/0.00 | 1.00/0.00 | 0.97 |
| Diploid | Delly | 0.91/0.01 | 1.00/0.00 | 0.95 | 0.94/0.01 | 1.00/0.00 | 0.97 | 0.94/0.01 | 1.00/0.00 | 0.97 |

For each method, each cell gives the average and standard deviation of the precision (or recall) values over five different runs of the whole simulation study. For each state, i.e. tandem duplication, deletion and diploid, evaluations are done at three coverage values: 1×, 5×, and 10×. The implanted CNVs are of length 1 kb, 1.5 kb, 2 kb, 2.5 kb, …, 4.5 kb, and 5 kb

ᵃ hetero + homo stands for copy loss
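The F-measure columns are consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). A small sketch, assuming that standard definition and using mean precision and recall values from the table; the printed F-measures may have been computed from unrounded inputs, so agreement is only expected to two decimals.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision, recall) means taken from the 1x coverage columns.
    examples = {
        "PSE-HMM duplications": (0.91, 0.79),
        "Delly duplications": (1.00, 0.80),
        "Pindel deletions": (0.93, 0.02),
    }
    for name, (p, r) in examples.items():
        print(f"{name}: F = {f_measure(p, r):.2f}")
```

Because the harmonic mean is dominated by the smaller input, a method with near-perfect precision but very low recall, such as Pindel on deletions at 1×, still ends up with an F-measure close to zero.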
Arithmetic and harmonic means of F-measures
| Mean | Method | 1× | 5× | 10× |
|---|---|---|---|---|
| Arithmetic mean | PSE-HMM | 0.72 | [missing] | [missing] |
| Arithmetic mean | m-HMM | 0.62 | 0.87 | 0.90 |
| Arithmetic mean | Pindel | 0.38 | 0.76 | 0.82 |
| Arithmetic mean | CNV-seq | 0.71 | 0.87 | 0.89 |
| Arithmetic mean | Delly | 0.77 | 0.87 | 0.87 |
| Harmonic mean | PSE-HMM | 0.66 | [missing] | [missing] |
| Harmonic mean | m-HMM | 0.53 | 0.87 | 0.89 |
| Harmonic mean | Pindel | 0.09 | 0.71 | 0.78 |
| Harmonic mean | CNV-seq | 0.66 | 0.85 | 0.87 |
| Harmonic mean | Delly | 0.71 | 0.84 | 0.84 |

For PSE-HMM, m-HMM, Pindel, CNV-seq, and Delly, arithmetic and harmonic means of F-measures are calculated over the different HMM states, i.e. tandem duplications, deletions (either heterozygous or homozygous), and genomic diploid states. The highest accuracies at coverages of 5× and 10× are indicated in bold
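The two summary statistics can be reproduced approximately from per-state F-measures. The sketch below assumes that the deletion state is represented by the combined hetero + homo F-measure and uses the two-decimal m-HMM values at 1× coverage from the precision and recall comparison; recomputing from rounded inputs can differ from the printed means in the last digit.

```python
from statistics import harmonic_mean, mean


def summarize(f_by_state):
    """Return (arithmetic mean, harmonic mean) of per-state F-measures."""
    values = list(f_by_state.values())
    return mean(values), harmonic_mean(values)


if __name__ == "__main__":
    # m-HMM at 1x coverage: duplications, deletions (hetero + homo), diploid.
    m_hmm_1x = {"duplications": 0.35, "deletions": 0.59, "diploid": 0.93}
    arithmetic, harmonic = summarize(m_hmm_1x)
    print(f"arithmetic={arithmetic:.2f} harmonic={harmonic:.2f}")
```

The harmonic mean penalizes a method more heavily for one weak state than the arithmetic mean does, which is why the two rankings in the table are not identical.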
PSE-HMM is compared to other tools in detecting genome-wide deletions and tandem duplications of size 1 kb, 3 kb, and 5 kb
| State | Method | Precision mean/std (1 kb) | Recall mean/std (1 kb) | F-measure (1 kb) | Precision mean/std (3 kb) | Recall mean/std (3 kb) | F-measure (3 kb) | Precision mean/std (5 kb) | Recall mean/std (5 kb) | F-measure (5 kb) |
|---|---|---|---|---|---|---|---|---|---|---|
| Duplications | PSE-HMM | 0.88/0.04 | 0.89/0.03 | 0.89 | 0.90/0.03 | 0.97/0.03 | 0.93 | 0.88/0.03 | 0.98/0.01 | 0.93 |
| Duplications | m-HMM | 1.00/0.00 | 0.71/0.1 | 0.83 | 1.00/0.00 | 0.76/0.04 | 0.86 | 1.00/0.00 | 0.75/0.03 | 0.86 |
| Duplications | Pindel | 1.00/0.00 | 0.95/0.06 | 0.98 | 1.00/0.00 | 0.82/0.11 | 0.90 | 1.00/0.00 | 0.87/0.1 | 0.93 |
| Duplications | CNV-seq | 0.99/0.01 | 0.53/0.06 | 0.69 | 0.99/0.00 | 0.53/0.02 | 0.69 | 0.99/0.00 | 0.59/0.09 | 0.74 |
| Duplications | Delly | 1.00/0.00 | 1.00/0.00 | 1.00 | 1.00/0.00 | 1.00/0.00 | 1.00 | 1.00/0.00 | 1.00/0.00 | 1.00 |
| Deletions | PSE-HMM (heterozygous) | 0.68/0.10 | 1.00/0.01 | 0.81 | 0.35/0.10 | 1.00/0.00 | 0.52 | 0.70/0.08 | 1.00/0.03 | 0.82 |
| Deletions | PSE-HMM (homozygous) | 0.70/0.11 | 0.89/0.05 | 0.78 | 0.77/0.03 | 0.80/0.00 | 0.78 | 0.75/0.03 | 1.00/0.00 | 0.86 |
| Deletions | PSE-HMM (hetero + homo)ᵃ | 0.69/0.02 | 0.96/0.02 | 0.80 | 0.64/0.02 | 1.00/0.02 | 0.78 | 0.72/0.02 | 1.00/0.02 | 0.84 |
| Deletions | m-HMM (heterozygous) | 0.97/0.03 | 0.50/0.31 | 0.66 | 0.99/0.01 | 1.00/0.00 | 0.99 | 0.99/0.01 | 1.00/0.01 | 0.99 |
| Deletions | m-HMM (homozygous) | 0.99/0.01 | 1.00/0.00 | 0.99 | 0.99/0.01 | 1.00/0.00 | 0.99 | 0.99/0.01 | 0.60/0.00 | 0.75 |
| Deletions | m-HMM (hetero + homo)ᵃ | 0.98/0.01 | 0.67/0.03 | 0.79 | 0.99/0.01 | 1.00/0.02 | 0.99 | 0.99/0.01 | 0.78/0.01 | 0.87 |
| Deletions | Pindel | 0.98/0.02 | 0.39/0.08 | 0.56 | 0.85/0.10 | 0.46/0.08 | 0.59 | 0.87/0.11 | 0.45/0.13 | 0.59 |
| Deletions | CNV-seq | 0.98/0.00 | 0.92/0.04 | 0.95 | 0.98/0.00 | 0.96/0.01 | 0.97 | 0.98/0.00 | 0.96/0.02 | 0.97 |
| Deletions | Delly | 0.99/0.01 | 0.47/0.11 | 0.64 | 0.99/0.00 | 0.51/0.04 | 0.68 | 0.99/0.01 | 0.45/0.12 | 0.62 |

The average and standard deviation of the precision (or recall) values are calculated over five different repeats of the whole simulation study with 10× sequencing coverage

ᵃ hetero + homo stands for copy loss
Fig. 4 Comparison of the overall accuracy of PSE-HMM, m-HMM, CNV-seq, Pindel and Delly in detecting genome-wide CNV regions. Accuracy is the number of nucleotides in CNV regions whose states are correctly predicted, divided by the total length of the genomic CNV regions
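The accuracy measure described in this caption can be sketched as a length-weighted fraction. The tuple layout, the state labels, and the choice to treat every non-diploid true state as a CNV region are assumptions made for the illustration.

```python
def cnv_base_accuracy(segments):
    """Fraction of CNV nucleotides whose state is predicted correctly.

    segments: iterable of (length_bp, true_state, predicted_state).
    Only segments whose true state is not "diploid" count as CNV regions.
    """
    cnv = [(length, true, pred) for length, true, pred in segments
           if true != "diploid"]
    total = sum(length for length, _, _ in cnv)
    if total == 0:
        raise ValueError("no CNV regions to evaluate")
    correct = sum(length for length, true, pred in cnv if true == pred)
    return correct / total


if __name__ == "__main__":
    toy = [
        (150, "diploid", "diploid"),  # ignored: not a CNV region
        (150, "tandem duplication", "tandem duplication"),
        (150, "homozygous deletion", "diploid"),
        (300, "heterozygous deletion", "heterozygous deletion"),
    ]
    print(cnv_base_accuracy(toy))  # 450 of 600 CNV bases are correct: 0.75
```

Weighting by length means a long miscalled CNV lowers the score more than a short one, unlike the segment counts used for precision and recall.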
Fig. 5 Deletion size distribution for CNVs detected by PSE-HMM in NA18507. The frequency of calls decreases exponentially as deletion size increases
Overlap of CNVs detected by PSE-HMM against DGV is given by calls and bases
| CNV type | Number of CNV calls | Overlap against DGV (by calls) | Overlap against DGV (by bases) |
|---|---|---|---|
| Deletion | 5,447 | 58 % | 64 % |
| Tandem duplication | 75 | 83 % | 75 % |
| Total | 5,522 | 58 % | 70 % |
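The difference between overlap "by calls" and "by bases" can be made concrete with interval arithmetic. The exact overlap criterion is not stated here, so this sketch assumes that a call counts as overlapping when it shares at least one base with a reference interval, that intervals are half-open `(start, end)` pairs on a single chromosome, and that the calls do not overlap each other.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_stats(calls, reference):
    """Return (fraction of calls overlapping, fraction of called bases covered)."""
    ref = merge_intervals(reference)
    calls_hit = bases_hit = bases_total = 0
    for start, end in calls:
        # Bases of this call covered by the merged reference intervals.
        covered = sum(max(0, min(end, r_end) - max(start, r_start))
                      for r_start, r_end in ref)
        calls_hit += covered > 0
        bases_hit += covered
        bases_total += end - start
    return calls_hit / len(calls), bases_hit / bases_total


if __name__ == "__main__":
    calls = [(0, 100), (200, 300), (500, 600)]
    reference = [(50, 150), (120, 260), (900, 1000)]
    by_calls, by_bases = overlap_stats(calls, reference)
    print(f"by calls: {by_calls:.2f}, by bases: {by_bases:.2f}")
```

The two fractions can diverge in either direction: many partially covered calls raise the by-calls figure, while a few long, fully covered calls raise the by-bases figure.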