Abstract
BACKGROUND: Personal genomics and comparative genomics are becoming more important in clinical practice and genome research. Both fields require sequence alignment to discover sequence conservation and variation. Although many methods have been developed, some are designed for small genome comparison and others are not efficient for large genome comparison. Moreover, most existing genome comparison tools have not been systematically evaluated for the correctness of their sequence alignments. A wrong sequence alignment would produce false sequence variants.
Keywords: Comparative genomics; Genome comparison; Personal genomics; Sequence alignment; Variation detection
Year: 2020 PMID: 32093618 PMCID: PMC7041101 DOI: 10.1186/s12864-020-6569-1
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 The flowchart of GSAlign. Each rectangle is an LMEM (simple pair), and its width is the size of the LMEM. LMEMs are clustered into similar regions, each of which consists of adjacent LMEMs and the gaps between them. Gapped or un-gapped alignment is then performed to close those gaps and build the complete alignment for each similar region
Fig. 2 An example illustrating simple-pair clustering and outlier removal. GSAlign clusters simple pairs and removes outliers according to PosDiff. Simple pairs in red are not unique. Simple pairs with gray backgrounds are considered outliers and are removed from the cluster
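The clustering and outlier removal described in this caption can be sketched as follows. This is a minimal illustration, not GSAlign's actual implementation: it assumes PosDiff is the reference position minus the query position of a simple pair, and the names `SimplePair`, `cluster_simple_pairs`, `remove_outliers`, and the thresholds `max_gap` and `max_shift` are hypothetical. Using the cluster median as the reference PosDiff is also an assumption.

```python
from statistics import median
from typing import List, NamedTuple


class SimplePair(NamedTuple):
    """An exact match (LMEM) between the two genomes. Hypothetical structure."""
    ref_pos: int   # start position in the reference genome
    qry_pos: int   # start position in the query genome
    length: int    # length of the exact match

    @property
    def pos_diff(self) -> int:
        # Assumed definition: reference position minus query position.
        return self.ref_pos - self.qry_pos


def cluster_simple_pairs(pairs: List[SimplePair], max_gap: int = 1000) -> List[List[SimplePair]]:
    """Group pairs that are adjacent on the reference (gap <= max_gap)."""
    clusters: List[List[SimplePair]] = []
    for pair in sorted(pairs, key=lambda p: p.ref_pos):
        if clusters:
            prev = clusters[-1][-1]
            if pair.ref_pos - (prev.ref_pos + prev.length) <= max_gap:
                clusters[-1].append(pair)
                continue
        clusters.append([pair])
    return clusters


def remove_outliers(cluster: List[SimplePair], max_shift: int = 25) -> List[SimplePair]:
    """Drop pairs whose PosDiff is far from the cluster's median PosDiff."""
    center = median(p.pos_diff for p in cluster)
    return [p for p in cluster if abs(p.pos_diff - center) <= max_shift]


# Made-up coordinates: the third pair maps far away in the query (an outlier),
# and the last pair is too distant on the reference to join the first cluster.
pairs = [
    SimplePair(100, 90, 20),
    SimplePair(130, 120, 15),
    SimplePair(160, 5000, 12),
    SimplePair(180, 171, 30),
    SimplePair(100000, 99000, 25),
]
cleaned = [remove_outliers(c) for c in cluster_simple_pairs(pairs)]
```

With these made-up coordinates, the pair at reference position 160 is dropped because its PosDiff differs from its neighbors by thousands of bases, while the three remaining pairs stay together as one similar region.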
Fig. 3 Simple pairs A and B overlap due to tandem repeats of “ACGT”. We remove the overlapped fragment from simple pair A (the preceding one)
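The overlap removal in this caption can be illustrated with a short sketch. It assumes each simple pair is a `(ref_pos, qry_pos, length)` tuple and that the overlap is measured on whichever genome overlaps more; the function name `trim_overlap` and the coordinates are hypothetical, not taken from GSAlign.

```python
from typing import Tuple

Pair = Tuple[int, int, int]  # (ref_pos, qry_pos, length), a hypothetical layout


def trim_overlap(a: Pair, b: Pair) -> Pair:
    """Shorten the preceding pair `a` so it no longer overlaps `b`."""
    a_ref, a_qry, a_len = a
    b_ref, b_qry, _ = b
    # Overlap on the reference and on the query; take the larger, never negative.
    overlap = max(a_ref + a_len - b_ref, a_qry + a_len - b_qry, 0)
    return (a_ref, a_qry, max(a_len - overlap, 0))


# Made-up example: the query carries one extra "ACGT" copy, so A and B
# overlap by 4 bases on the reference but not on the query.
pair_a = (0, 0, 12)
pair_b = (8, 12, 10)
trimmed_a = trim_overlap(pair_a, pair_b)
```

After trimming, A covers reference positions 0 to 8 and B starts at 8, which leaves a 4-base gap on the query side for the later gap-closing alignment to resolve.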
The synthetic datasets and the number of simulated sequence variations. The Average Sequence Identity (ASI) is estimated by the total mismatches divided by the number of nucleobases
| Dataset | Genome size | SNV | Small indel | Large indel | ASI |
|---|---|---|---|---|---|
| simHG-1X | 3,088,279,342 | 58,421,383 | 1,001,626 | 285,757 | 97.93% |
| simHG-3X | 3,088,292,247 | 175,100,939 | 962,721 | 275,584 | 93.86% |
| simHG-5X | 3,088,289,999 | 291,714,646 | 919,762 | 263,271 | 89.90% |
| NA12878 | 6,070,700,436 | 3,088,156 | 531,315 | NA | 99.84% |
The performance evaluation on the three GRCh38 synthetic data sets. The indexing time of each method is not included in the run time. Indexing takes 110 min for GSAlign (BWT), 129 min for MUMmer4 (suffix array), and 2.6 min for Minimap2 (minimizer)
| Dataset | Method | SNV precision | SNV recall | Indel precision | Indel recall | Local align# | Run time (min) |
|---|---|---|---|---|---|---|---|
| SimHG-1X | GSAlign | 1.000 | 1.000 | 0.999 | 0.999 | 250 | 11 |
| SimHG-1X | Minimap2 | 1.000 | 0.996 | 0.999 | 0.995 | 417 | 39 |
| SimHG-1X | MUMmer4 | 0.998 | 0.932 | 0.985 | 0.932 | 3111 | 869 |
| SimHG-1X | LAST | 1.000 | 0.992 | 0.992 | 0.947 | 1168 | 2524 |
| SimHG-3X | GSAlign | 1.000 | 0.998 | 0.994 | 0.997 | 366 | 18 |
| SimHG-3X | Minimap2 | 1.000 | 0.996 | 0.991 | 0.995 | 561 | 37 |
| SimHG-3X | MUMmer4 | 0.989 | 0.923 | 0.796 | 0.925 | 4925 | 289 |
| SimHG-3X | LAST | 1.000 | 0.990 | 0.809 | 0.950 | 1234 | 1185 |
| SimHG-5X | GSAlign | 1.000 | 0.993 | 0.958 | 0.992 | 587 | 24 |
| SimHG-5X | Minimap2 | 1.000 | 0.995 | 0.952 | 0.994 | 1058 | 40 |
| SimHG-5X | MUMmer4 | 0.986 | 0.907 | 0.486 | 0.912 | 5513 | 157 |
| SimHG-5X | LAST | 1.000 | 0.981 | 0.461 | 0.947 | 1636 | 458 |
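The precision and recall columns follow the standard definitions, which can be written out as a short sketch. The function name `precision_recall` and the counts in the example are hypothetical; they are not taken from the evaluation, and the matching rule used to decide whether a called variant counts as a true positive is not specified here.

```python
from typing import Tuple


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    called = true_pos + false_pos   # all variants reported by the aligner
    truth = true_pos + false_neg    # all simulated (ground-truth) variants
    precision = true_pos / called if called else 0.0
    recall = true_pos / truth if truth else 0.0
    return precision, recall


# Made-up counts for illustration only.
precision, recall = precision_recall(true_pos=990, false_pos=10, false_neg=110)
```

Read this way, a method with high precision but lower recall reports few false variants but misses some of the simulated ones.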
The performance evaluation on HG38 and the diploid sequence of NA12878. The performance on SNV and Indel detection implies that the diploid genome sequence and the reference variants are not fully compatible
| Dataset | Method | SNV precision | SNV recall | Indel precision | Indel recall | Run time (min) | Memory usage (GB) |
|---|---|---|---|---|---|---|---|
| NA12878 (Diploid) | GSAlign | 0.832 | 0.969 | 0.759 | 0.767 | 5 | 14 |
| NA12878 (Diploid) | Minimap2 | 0.830 | 0.970 | 0.754 | 0.768 | 65 | 23 |
| NA12878 (Diploid) | MUMmer4 | 0.752 | 0.946 | 0.711 | 0.749 | 3898 | 57 |
| NA12878 (Diploid) | LAST | 0.832 | 0.969 | 0.760 | 0.764 | 1305 | 28 |
The performance comparison on HG38 and the chimpanzee (PanTro4) genome
| Dataset | Method | Alignment length (Mbp) | SNV# | Indel# | Run time (min) |
|---|---|---|---|---|---|
| GRCh38 vs. PanTro4 | GSAlign | 2412 | 31,710,527 | 3,650,337 | 8 |
| GRCh38 vs. PanTro4 | Minimap2 | 2791 | 39,242,895 | 4,375,360 | 18 |
| GRCh38 vs. PanTro4 | MUMmer4 | 2661 | 41,545,986 | 5,450,956 | 1368 |
| GRCh38 vs. PanTro4 | LAST | 2717 | 35,815,610 | 4,483,929 | 884 |
Fig. 4 The dot-plot of the alignment between human chromosomes 2, 7, and 14 and mouse chromosome 12. The x-axis indicates positions on mouse chromosome 12, and the y-axis indicates positions on human chromosomes 2, 7, and 14. The orthologous landmarks are plotted based on the pairwise alignments between the three human chromosomes and mouse chromosome 12