Kaname Kojima, Yosuke Kawai, Kazuharu Misawa, Takahiro Mimori, Masao Nagasaki.
Abstract
BACKGROUND: Two strategies are mainly used to estimate repeat numbers in a short tandem repeat (STR) region from high-throughput sequencing data: counting the repeat patterns in sequence reads that span the region, and estimating the difference between the actual insert size and the insert size inferred from paired-end reads. The quality of sequence alignment is crucial, especially for the former strategy, yet standard alignment methods have difficulty in STR regions because variation in repeat numbers causes insertions and deletions.
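The first strategy, counting repeat patterns in a read that spans the region, can be illustrated with a minimal Python sketch. It is not the method of any tool evaluated here: the function name `count_repeat_units`, the exact-match flank search, and the toy sequences are assumptions made for illustration, and sequencing errors are ignored.

```python
def count_repeat_units(read, left_flank, right_flank, motif):
    """Count consecutive copies of `motif` between two flanking sequences.

    Returns None when the read does not span the region, i.e. when either
    flank is absent. Exact string matching is a simplifying assumption.
    """
    start = read.find(left_flank)
    if start < 0:
        return None
    start += len(left_flank)
    end = read.find(right_flank, start)
    if end < 0:
        return None
    region = read[start:end]
    count = 0
    pos = 0
    # Walk through the region one repeat unit at a time.
    while region.startswith(motif, pos):
        count += 1
        pos += len(motif)
    return count


# Hypothetical read: five copies of CAG between two flanks.
spanning_read = "GGTCA" + "CAG" * 5 + "TTGAC"
print(count_repeat_units(spanning_read, "GGTCA", "TTGAC", "CAG"))
```

A read that ends inside the repeat lacks one flank and returns `None`, which is why only spanning reads are informative for this strategy.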
Keywords: Alignment; High-throughput sequencing; Short tandem repeat
Year: 2016 PMID: 27912743 PMCID: PMC5135796 DOI: 10.1186/s12864-016-3294-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 An example of notations in STR-realigner. A subsequence with B = 1 can be used repeatedly in the realignment process
Fig. 2 A flowchart of the algorithms considered in STR-realigner. After the penalty and traceback information are initialized for the first query position with Algorithm 1, they are updated for the other query positions with Algorithm 2 in a dynamic programming manner. Then, a realignment with the best penalty is obtained from the traceback information with Algorithm 9
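The initialize, update, and traceback pattern described in the Fig. 2 caption can be sketched with a generic dynamic programming alignment. This is not STR-realigner's Algorithm 1, 2, or 9: it is a plain global alignment with unit penalties, it does not reuse repeat units, and the function name `realign` and the penalty values are assumptions.

```python
def realign(query, ref, mismatch=1, gap=1):
    """Global alignment by dynamic programming with traceback.

    Returns (best penalty, operation string) where M is a match or
    mismatch, I an insertion in the query, and D a deletion from it.
    """
    n, m = len(query), len(ref)
    penalty = [[0] * (m + 1) for _ in range(n + 1)]
    trace = [[""] * (m + 1) for _ in range(n + 1)]
    # Initialization: an empty query prefix against each reference prefix.
    for j in range(1, m + 1):
        penalty[0][j] = j * gap
        trace[0][j] = "D"
    # Update penalty and traceback information for each query position.
    for i in range(1, n + 1):
        penalty[i][0] = i * gap
        trace[i][0] = "I"
        for j in range(1, m + 1):
            sub = 0 if query[i - 1] == ref[j - 1] else mismatch
            candidates = [
                (penalty[i - 1][j - 1] + sub, "M"),
                (penalty[i - 1][j] + gap, "I"),
                (penalty[i][j - 1] + gap, "D"),
            ]
            penalty[i][j], trace[i][j] = min(candidates)
    # Traceback from the final cell to recover one best alignment.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        op = trace[i][j]
        ops.append(op)
        if op == "M":
            i, j = i - 1, j - 1
        elif op == "I":
            i -= 1
        else:
            j -= 1
    return penalty[n][m], "".join(reversed(ops))


# Hypothetical query carrying one extra AC unit relative to the reference.
print(realign("ACACAC", "ACAC"))
```

The extra repeat unit appears as two insertion operations, which is the kind of indel that makes alignment in STR regions difficult.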
Fig. 3 An alignment result around a homopolymer region. Most of the reads spanning the region contain soft-clipped parts due to drastic sequencing errors after the homopolymer region
Call rate of STR calling results with RepeatSeq using the original BAM file of 40× and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | Original BAM |
|---|---|---|---|---|---|
| 1 | 5345 |  |  | 0.873 | 0.872 |
| 2 | 1160 |  | 0.794 | 0.785 | 0.784 |
| 3 | 517 |  | 0.807 | 0.799 | 0.803 |
| 4 | 1433 |  | 0.766 | 0.771 | 0.783 |
| 5 | 668 |  | 0.811 | 0.819 | 0.819 |
| 6 | 472 |  | 0.850 | 0.852 | 0.856 |
| Total | 9595 |  | 0.841 | 0.838 | 0.840 |
Root mean squared error (RMSE) between true and estimated repeat numbers with RepeatSeq using the original BAM file of 40× and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for all the STR regions. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | Original BAM |
|---|---|---|---|---|---|
| 1 | 5345 | 3.726 |  | 8.273 | 8.461 |
| 2 | 1160 |  | 4.648 | 8.213 | 8.539 |
| 3 | 517 |  | 4.022 | 8.242 | 8.601 |
| 4 | 1433 |  | 5.726 | 9.523 | 9.597 |
| 5 | 668 |  | 6.701 | 10.156 | 10.325 |
| 6 | 472 |  | 5.976 | 10.185 | 10.437 |
| Total | 9595 |  | 4.162 | 8.705 | 8.900 |
Root mean squared error (RMSE) between true and estimated repeat numbers with RepeatSeq using the original BAM file of 40× and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for commonly called STR regions. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | Original BAM |
|---|---|---|---|---|---|
| 1 | 4659 | 3.858 |  | 8.694 | 8.891 |
| 2 | 900 |  | 4.815 | 8.735 | 9.084 |
| 3 | 410 |  | 3.265 | 8.548 | 8.930 |
| 4 | 1084 |  | 4.428 | 9.688 | 9.796 |
| 5 | 536 |  | 5.855 | 10.219 | 10.422 |
| 6 | 399 |  | 5.593 | 10.498 | 10.793 |
| Total | 7988 |  | 3.741 | 9.038 | 9.253 |
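The call rate and the RMSE over commonly called regions reported in the tables above could be computed along the following lines. This is a minimal sketch rather than the evaluation code behind these results: it assumes a single repeat number per region (real calls are diploid), marks uncalled regions with `None`, and uses hypothetical function names and toy values.

```python
import math


def call_rate(calls):
    """Fraction of regions for which a repeat number was called."""
    return sum(value is not None for value in calls.values()) / len(calls)


def commonly_called(*method_calls):
    """Regions that every method called (value is not None)."""
    regions = set(method_calls[0])
    for calls in method_calls:
        regions &= {r for r, value in calls.items() if value is not None}
    return regions


def rmse(truth, calls, regions):
    """Root mean squared error over the given regions."""
    if not regions:
        raise ValueError("no regions to evaluate")
    squared = [(calls[r] - truth[r]) ** 2 for r in regions]
    return math.sqrt(sum(squared) / len(squared))


# Toy data: region id -> repeat number (None means not called).
truth = {"r1": 10, "r2": 7, "r3": 12}
method_a = {"r1": 10, "r2": 9, "r3": None}
method_b = {"r1": 13, "r2": None, "r3": 12}

common = commonly_called(method_a, method_b)
print(call_rate(method_a), rmse(truth, method_a, common), rmse(truth, method_b, common))
```

Restricting the comparison to commonly called regions keeps a method from looking better simply because it declined to call the difficult regions.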
Call rate of STR calling results with allelotype using the original BAM file of 40× and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | --realign option | Original BAM |
|---|---|---|---|---|---|---|
| 1 | 5345 |  |  | 0.998 | 0.998 | 0.998 |
| 2 | 1160 |  |  | 0.984 | 0.984 | 0.984 |
| 3 | 517 |  |  | 0.988 | 0.988 | 0.988 |
| 4 | 1433 | 0.991 |  | 0.988 | 0.987 | 0.987 |
| 5 | 668 |  |  | 0.990 | 0.990 | 0.990 |
| 6 | 472 |  |  | 0.992 | 0.989 | 0.989 |
| Total | 9595 |  |  | 0.993 | 0.993 | 0.993 |
Root mean squared error (RMSE) between true and estimated repeat numbers with allelotype using the original BAM file of 40× and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for all the STR regions. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | --realign option | Original BAM |
|---|---|---|---|---|---|---|
| 1 | 5345 | 1.104 |  | 4.148 | 4.181 | 4.152 |
| 2 | 1160 |  | 2.679 | 5.454 | 5.762 | 5.477 |
| 3 | 517 | 2.778 |  | 4.818 | 4.768 | 4.818 |
| 4 | 1433 | 2.651 |  | 5.386 | 5.398 | 5.406 |
| 5 | 668 | 2.301 |  | 6.617 | 6.660 | 6.617 |
| 6 | 472 |  | 3.178 | 6.071 | 5.936 | 6.094 |
| Total | 9595 | 1.959 |  | 4.861 | 4.914 | 4.870 |
Root mean squared error (RMSE) between true and estimated repeat numbers with allelotype using the original BAM file of 40× and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for commonly called STR regions. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | --realign option | Original BAM |
|---|---|---|---|---|---|---|
| 1 | 5333 | 1.009 |  | 4.024 | 4.058 | 4.028 |
| 2 | 1141 |  | 2.489 | 5.086 | 5.420 | 5.111 |
| 3 | 511 | 2.472 |  | 4.490 | 4.435 | 4.490 |
| 4 | 1414 |  | 2.371 | 5.157 | 5.152 | 5.177 |
| 5 | 661 | 2.058 |  | 6.263 | 6.308 | 6.263 |
| 6 | 467 |  | 2.977 | 5.888 | 5.747 | 5.912 |
| Total | 9527 |  | 1.755 | 4.649 | 4.702 | 4.659 |
Call rate of STR calling results with RepeatSeq using the original BAM file of 10× and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | Original BAM |
|---|---|---|---|---|---|
| 1 | 5345 | 0.874 |  | 0.854 | 0.848 |
| 2 | 1160 |  | 0.788 | 0.760 | 0.756 |
| 3 | 517 |  | 0.803 | 0.778 | 0.774 |
| 4 | 1433 |  | 0.759 | 0.735 | 0.736 |
| 5 | 668 |  | 0.793 | 0.774 | 0.774 |
| 6 | 472 |  | 0.824 | 0.807 | 0.814 |
| Total | 9595 |  | 0.836 | 0.813 | 0.809 |
Root mean squared error (RMSE) between true and estimated repeat numbers with RepeatSeq using the original BAM file of 10× and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for all the STR regions. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | Original BAM |
|---|---|---|---|---|---|
| 1 | 5345 | 5.261 |  | 8.904 | 9.126 |
| 2 | 1160 |  | 5.984 | 8.879 | 9.162 |
| 3 | 517 |  | 5.409 | 8.629 | 8.880 |
| 4 | 1433 |  | 6.724 | 10.060 | 10.195 |
| 5 | 668 |  | 7.489 | 11.058 | 11.159 |
| 6 | 472 |  | 7.179 | 10.385 | 10.652 |
| Total | 9595 |  | 5.809 | 9.308 | 9.517 |
Root mean squared error (RMSE) between true and estimated repeat numbers with RepeatSeq using the original BAM file of 10× and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for commonly called STR regions. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | Original BAM |
|---|---|---|---|---|---|
| 1 | 4505 | 5.461 |  | 9.047 | 9.184 |
| 2 | 860 |  | 6.307 | 9.238 | 9.436 |
| 3 | 396 |  | 4.862 | 8.520 | 8.817 |
| 4 | 1023 |  | 5.954 | 9.828 | 9.899 |
| 5 | 497 |  | 6.835 | 10.504 | 10.638 |
| 6 | 371 |  | 6.771 | 10.451 | 10.773 |
| Total | 7652 |  | 5.712 | 9.323 | 9.474 |
Call rate of STR calling results with allelotype using the original BAM file of 10× and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | --realign option | Original BAM |
|---|---|---|---|---|---|---|
| 1 | 5345 | 0.999 |  | 0.992 | 0.992 | 0.992 |
| 2 | 1160 |  | 0.988 | 0.972 | 0.973 | 0.972 |
| 3 | 517 |  | 0.983 | 0.977 | 0.977 | 0.977 |
| 4 | 1433 |  |  | 0.973 | 0.973 | 0.973 |
| 5 | 668 |  | 0.991 | 0.969 | 0.969 | 0.969 |
| 6 | 472 |  | 0.985 | 0.979 | 0.979 | 0.979 |
| Total | 9595 |  |  | 0.984 | 0.984 | 0.984 |
Root mean squared error (RMSE) between true and estimated repeat numbers with allelotype using the original BAM file of 10× and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for all the STR regions. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | --realign option | Original BAM |
|---|---|---|---|---|---|---|
| 1 | 5345 |  | 2.695 | 6.017 | 6.009 | 6.009 |
| 2 | 1160 |  | 3.740 | 6.948 | 6.885 | 6.899 |
| 3 | 517 |  | 3.728 | 6.898 | 6.843 | 6.843 |
| 4 | 1433 |  | 3.818 | 7.476 | 7.417 | 7.431 |
| 5 | 668 |  | 4.469 | 8.760 | 8.762 | 8.762 |
| 6 | 472 | 4.319 |  | 8.232 | 8.217 | 8.233 |
| Total | 9595 |  | 3.309 | 6.752 | 6.727 | 6.732 |
Root mean squared error (RMSE) between true and estimated repeat numbers with allelotype using the original BAM file of 10× and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for commonly called STR regions. The best result is underlined
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | --realign option | Original BAM |
|---|---|---|---|---|---|---|
| 1 | 5304 |  | 2.656 | 5.740 | 5.732 | 5.732 |
| 2 | 1128 |  | 3.420 | 6.440 | 6.371 | 6.386 |
| 3 | 505 |  | 3.090 | 6.456 | 6.396 | 6.396 |
| 4 | 1394 |  | 3.479 | 7.045 | 6.980 | 6.995 |
| 5 | 647 |  | 4.199 | 8.015 | 8.017 | 8.017 |
| 6 | 462 |  | 3.872 | 7.907 | 7.892 | 7.909 |
| Total | 9440 |  | 3.099 | 6.363 | 6.336 | 6.341 |
The numbers of estimated repeat numbers matched and mismatched with parents in terms of Mendelian inheritance
| Period | STR-realigner #CR | STR-realigner #IR | ReviSTER #CR | ReviSTER #IR | IndelRealigner #CR | IndelRealigner #IR | Original BAM #CR | Original BAM #IR |
|---|---|---|---|---|---|---|---|---|
| 1 | 1305 | 533 |  | 540 | 1314 |  | 1298 | 531 |
|  | (1416) | (563) |  |  |  |  |  |  |
| 2 |  | 82 | 269 |  | 242 | 90 | 242 | 87 |
| 3 |  |  | 56 | 7 | 56 | 6 | 57 |  |
| 4 |  |  | 183 | 34 | 169 | 38 | 169 | 33 |
| 5 | 41 |  |  | 18 | 44 | 16 | 44 |  |
| 6 |  |  | 34 | 12 | 33 | 13 | 33 | 13 |
| Total |  |  | 1907 | 691 | 1858 | 694 | 1843 | 684 |
The number of consistent regions (#CR) and the number of inconsistent regions (#IR), based on repeat numbers estimated with RepeatSeq in a parent-offspring trio (NA12878, NA12891, and NA12892), are summarized for the original BAM files and for those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. Values in parentheses for STR-realigner are the results without filtering long homopolymer regions. The best result is underlined
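The consistency check behind #CR and #IR can be sketched as follows. This is an illustrative assumption about how such a check is typically written, not the procedure used to produce these counts: each genotype is taken to be a pair of repeat numbers, and a region counts as consistent when one allele of the offspring can come from each parent. The function names and toy genotypes are hypothetical.

```python
def is_mendelian_consistent(child, father, mother):
    """True if one child allele is found in each parent's genotype."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def count_regions(trio_calls):
    """Return (#CR, #IR) for an iterable of (child, father, mother) genotypes."""
    consistent = inconsistent = 0
    for child, father, mother in trio_calls:
        if is_mendelian_consistent(child, father, mother):
            consistent += 1
        else:
            inconsistent += 1
    return consistent, inconsistent


# Toy genotypes as (allele 1, allele 2) repeat numbers per region.
toy_calls = [
    ((10, 12), (10, 11), (12, 12)),  # consistent: 10 from father, 12 from mother
    ((10, 10), (10, 11), (12, 13)),  # inconsistent: mother carries no 10
    ((9, 12), (10, 11), (12, 12)),   # inconsistent: neither parent carries 9
]
print(count_regions(toy_calls))
```

An inconsistent region does not identify which of the three calls is wrong, so #IR is best read as an indirect indicator of calling errors.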
The numbers of estimated repeat numbers matched and mismatched with parents in terms of Mendelian inheritance
| Period | STR-realigner #CR | STR-realigner #IR | ReviSTER #CR | ReviSTER #IR | IndelRealigner #CR | IndelRealigner #IR | --realign option #CR | --realign option #IR | Original BAM #CR | Original BAM #IR |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1772 | 777 | 1773 |  | 1770 | 773 | 1621 | 897 |  | 776 |
|  | (1834) | (860) |  |  |  |  |  |  |  |  |
| 2 |  |  | 328 | 94 | 323 | 90 | 317 | 97 | 328 | 84 |
| 3 |  | 6 | 69 |  | 66 | 7 | 65 | 8 | 65 | 7 |
| 4 | 222 | 24 |  |  | 219 | 23 | 214 | 28 | 216 | 26 |
| 5 |  |  |  | 29 |  | 28 | 73 | 31 | 73 | 28 |
| 6 |  | 28 | 67 | 28 | 67 | 27 | 68 |  | 70 | 26 |
| Total |  |  | 2536 | 945 | 2520 | 948 | 2358 | 1083 | 2527 | 947 |
Fig. 4 Comparison of original and realigned BAM files for NA12892 in an STR region located at chr22:28045335-28045407. The top panel is a plot of sequencing data in a BAM file realigned with STR-realigner and the bottom one is a plot of sequencing data in the original BAM file
Root mean squared error (RMSE) between the gold standard and estimated STR sizes with RepeatSeq using the original BAM file for NA12878 from HiSeq 2000 and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for all the STR regions
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | Original BAM |
|---|---|---|---|---|---|
| 1 | 5345 | 2.429 |  | 2.431 | 2.430 |
| 2 | 1160 |  | 3.860 | 3.829 | 3.804 |
| 3 | 517 |  | 1.902 | 2.000 | 1.998 |
| 4 | 1433 |  | 2.540 | 2.709 | 2.710 |
| 5 | 668 | 3.134 |  | 3.021 | 3.039 |
| 6 | 472 |  | 2.823 | 2.890 | 2.890 |
| Total | 9595 |  | 2.660 | 2.724 | 2.721 |
For the gold standard, STR sizes estimated from high coverage PacBio sequencing data with allelotype are used. The best result is underlined
Root mean squared error (RMSE) between the gold standard and estimated STR sizes with allelotype using the original BAM file for NA12878 from HiSeq 2000 and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for all the STR regions
| Period | No. of regions | STR-realigner | ReviSTER | IndelRealigner | --realign option | Original BAM |
|---|---|---|---|---|---|---|
| 1 | 5345 | 2.298 |  | 2.354 | 2.296 | 2.298 |
| 2 | 1160 | 3.152 | 3.293 | 3.265 | 3.243 |  |
| 3 | 517 |  | 2.039 | 2.034 | 2.038 | 2.034 |
| 4 | 1433 |  | 2.582 | 2.687 | 2.600 | 2.454 |
| 5 | 668 | 3.033 |  | 3.271 | 3.008 | 3.031 |
| 6 | 472 |  | 2.739 | 2.765 | 2.739 | 3.090 |
| Total | 9595 |  | 2.542 | 2.607 | 2.538 | 2.521 |
For the gold standard, STR sizes estimated from high coverage PacBio sequencing data with allelotype are used. The best result is underlined
Comparison of computational time on simulated data for an individual with read coverage of 40× and on a real dataset for NA12878
| Method | Computational time, simulation data (s) | Computational time, real data (s) |
|---|---|---|
| STR-realigner | 2,928.90 | 1,186.77 |
| ReviSTER | 5,230.72 | 3,618.62 |
| GATK IndelRealigner | 357.46 | 294.13 |