Eugene V Korotkov, Anastasiya M Kamionskya, Maria A Korotkova.
Abstract
Currently, there is a lack of bioinformatics approaches to identify highly divergent tandem repeats (TRs) in eukaryotic genomes. Here, we developed a new mathematical method to search for TRs, which uses a novel algorithm for constructing multiple alignments based on the generation of random position weight matrices (RPWMs), and applied it to detect TRs 2 to 50 nucleotides long in the rice genome. The RPWM method could find highly divergent TRs in the presence of insertions or deletions. Comparison of the RPWM algorithm with other methods of TR identification showed that RPWM could detect TRs in which the average number of base substitutions per nucleotide (x) was between 1.5 and 3.2, whereas the T-REKS and TRF methods could not detect divergent TRs with x > 1.5. Applied to the rice genome, the RPWM method revealed that TRs occupied 5% of the genome and that most of them were 2 or 3 bases long. Using RPWM, we also found a correlation of TRs with dispersed repeats and transposons, suggesting that some transposons originated from TRs. Thus, the novel RPWM algorithm is an effective tool to search for highly divergent TRs in genomes.
Keywords: dynamic programming; rice genome; tandem repeats
Year: 2021 PMID: 33806152 PMCID: PMC8064497 DOI: 10.3390/genes12040473
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Dependence of statistical significance Z on the average number of substitutions per nucleotide x between any two repeats. A total of 50 insertions and 50 deletions of 1 nt were made in an artificial sequence of 3000 nt, which contained 100 repeats of 30 nt each.
Figure 2. Dependence of the proportion of identified tandem repeats (Y) on the number of introduced base substitutions (x) for the Tandem Repeats Finder (TRF), T-REKS, and random position weight matrix (RPWM) programs. The analyzed sequences contained 100 repeats 6 nt long; each sequence had a different number of base substitutions along with 5 insertions and 5 deletions at random positions.
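The artificial test sequences described in the two captions above can be sketched as follows. This is a minimal illustration, not the authors' generator: the function name `make_divergent_repeats`, the seed, and the way x is turned into substitution events (x times the sequence length, at random positions that may be hit repeatedly) are all assumptions.

```python
import random

BASES = "ACGT"

def make_divergent_repeats(unit_len, copies, x, n_ins, n_del, seed=0):
    """Build a perfect tandem array, then degrade it with substitutions and 1-nt indels."""
    rng = random.Random(seed)
    unit = [rng.choice(BASES) for _ in range(unit_len)]
    seq = unit * copies  # perfect tandem repeat, unit_len * copies nt long

    # Assumption: x * len(seq) substitution events hit random positions.
    # A position can be hit more than once, which is why x may exceed 1.
    for _ in range(round(x * len(seq))):
        i = rng.randrange(len(seq))
        seq[i] = rng.choice([b for b in BASES if b != seq[i]])

    for _ in range(n_ins):  # single-nucleotide insertions at random positions
        seq.insert(rng.randrange(len(seq) + 1), rng.choice(BASES))
    for _ in range(n_del):  # single-nucleotide deletions at random positions
        del seq[rng.randrange(len(seq))]
    return "".join(seq)

# Parameters from the Figure 1 caption: 100 repeats of 30 nt, 50 insertions, 50 deletions.
s = make_divergent_repeats(unit_len=30, copies=100, x=1.5, n_ins=50, n_del=50)
```

Because the numbers of insertions and deletions are equal, the degraded sequence keeps the original length of 3000 nt, matching the caption. Calling the function with `unit_len=6, copies=100, n_ins=5, n_del=5` gives a sequence of the kind described for Figure 2.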
The numbers of regions with TRs of different lengths detected in rice chromosomes.
| Chromosome | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of regions | 8277 | 6960 | 6679 | 6999 | 6073 | 6559 | 6234 | 6083 | 4925 | 4886 | 6397 | 6102 |
Figure 3. Distribution of regions with TRs in the rice genome according to the repeat length.
Figure 4. Dependence of Z(n) for the DNA region from 2,630,061 to 2,630,589 nt of the first rice chromosome. mF(11) = 610.3.
Alignment of sequences S2 and S for the DNA region from 2,630,061 to 2,630,589 nt of the first rice chromosome. Numbers 10 and 11 in sequence S2 are replaced with a and b.
| NO | 1234567...8....9a.b |
|---|---|
| 1 | CATTTGA...A....TC.G. |
| 2 | CAGGATT...G....AA.A. |
| 3 | AAACGGA...G....GA.A. |
| 4 | TAGGAAA...A....AC.A. |
| 5 | CAGGAAT...C....TG.A. |
| 6 | TAGGAAT...G....CA.A. |
| 7 | GTGTAAA........AC.A. |
| 8 | GAGGATT...GCAAAAC.A. |
| 9 | CAGGAAA...A....AC.A. |
| 10 | TAGGAAT...G....AC.C. |
| 11 | GTTTAATTGGA....CC.A. |
| 12 | CAGGAAA...A....AC.A. |
| 13 | CAGGAAT...C....AG.A. |
| 14 | TGAGAGA...G....AT.A. |
| 15 | GACTTAG...G....GC.C. |
| 16 | CCTTTGA...A....TC.A. |
| 17 | TAGGAAT...G....AA.A. |
| 18 | AAACGGA...G....GA.A. |
| 19 | TAGGAAA...A....AC.A. |
| 20 | TATGATT...A....TG.A. |
| 21 | CAGGAAT...G....TA.A. |
| 22 | GTGTAAA........AC.A. |
| 23 | GAGGATT...GCAAAAC.A. |
| 24 | CAGGAAA...A....AC.A. |
| 25 | TAGGAAT...G....AC.CG |
| 26 | TTTGATT...GGA..CC.A. |
| 27 | TAGGAAA...A....AC.A. |
| 28 | CAGGAAT...T....TG.A. |
| 29 | GGAGAGA...T....AA.A. |
| 30 | GACTCAA...A....GG.A. |
| 31 | TTTCTTC...C....AT.G. |
| 32 | AGGTTCT...A....CC.T. |
| 33 | CATGTTA...A....AA.T. |
| 34 | TCCTCCA...A....AA.C. |
| 35 | TTGTATG...G....GA.A. |
| 36 | GAGGCAT...T....CC.A. |
| 37 | TAGGAAT...T....TC.A. |
| 38 | TAAGATT...C....AATA. |
| 39 | GGGTTCA...T....TC.A. |
| 40 | TTTGATT...C....AA.A. |
| 41 | GGGCTTT...G....TA.G. |
| 42 | GAAAAAT...T....CCTA. |
| 43 | TAGGAAT...A....AA.A. |
| 44 | T |
Matrix mQ(11) for the TRs in the region from 2,630,061 to 2,630,589 nt of the first rice chromosome.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1.4 | −4.8 | −4.0 | −0.7 | −5.3 | 6.0 | −9.4 | 1.9 | −7.3 | −9.4 | 3.6 |
| T | −1.3 | 8.1 | −2.5 | −2.5 | −6.3 | −7.4 | 5.3 | −1.9 | −1.3 | −3.5 | −4.1 |
| C | 5.0 | −4.3 | −0.8 | 0.6 | 12.4 | −0.8 | 0.6 | −3.6 | −3.6 | −2.2 | −4.3 |
| G | 4.4 | −6.7 | 1.2 | −3.9 | −3.9 | −6.7 | −0.5 | −5.0 | 7.4 | 11.3 | −5.0 |
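To show how a position weight matrix like the one above is read, the sketch below sums the matrix entries for one 11-nt period. It is a simplified, ungapped scoring and not the dynamic-programming alignment with insertions and deletions that the RPWM method uses; the names `MQ11`, `ungapped_score`, and `best_period` are hypothetical.

```python
# Rows of the mQ(11) table, keyed by base; list index = period position - 1.
MQ11 = {
    "A": [1.4, -4.8, -4.0, -0.7, -5.3, 6.0, -9.4, 1.9, -7.3, -9.4, 3.6],
    "T": [-1.3, 8.1, -2.5, -2.5, -6.3, -7.4, 5.3, -1.9, -1.3, -3.5, -4.1],
    "C": [5.0, -4.3, -0.8, 0.6, 12.4, -0.8, 0.6, -3.6, -3.6, -2.2, -4.3],
    "G": [4.4, -6.7, 1.2, -3.9, -3.9, -6.7, -0.5, -5.0, 7.4, 11.3, -5.0],
}

def ungapped_score(period, matrix=MQ11):
    """Sum the weight of each base at its position (no insertions or deletions)."""
    n = len(matrix["A"])
    if len(period) != n:
        raise ValueError(f"expected a period of length {n}")
    return sum(matrix[base][i] for i, base in enumerate(period))

def best_period(matrix=MQ11):
    """Return the period that takes the highest-weight base in every column."""
    n = len(matrix["A"])
    return "".join(max("ATCG", key=lambda b: matrix[b][i]) for i in range(n))

top = best_period()
top_score = ungapped_score(top)
```

Positive entries mark bases that are favored at a position and negative entries mark bases that are penalized, so `best_period()` gives the single period that this matrix scores highest.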
Intersection between exons and regions containing tandem periodicity in rice chromosomes.
| Chromosome | Number of Exons | Number of Overlaps | Average Number of Overlaps at Random Locations |
|---|---|---|---|
| 1 | 28,791 | 854 | 2297 |
| 2 | 23,456 | 1665 | 1876 |
| 3 | 25,583 | 1819 | 1956 |
| 4 | 17,446 | 1458 | 1462 |
| 5 | 16,256 | 1259 | 1295 |
| 6 | 16,038 | 1316 | 1329 |
| 7 | 14,744 | 1167 | 1180 |
| 8 | 12,888 | 1016 | 1046 |
| 9 | 10,563 | 964 | 822 |
| 10 | 9992 | 910 | 815 |
| 11 | 11,092 | 1142 | 932 |
| 12 | 10,936 | 964 | 891 |
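The comparison in the table above, observed overlaps against the average at random locations, can be sketched as below. This section does not describe the counting or randomization procedure, so everything here is an assumption: an overlap is counted once per exon that touches at least one TR region, and the random baseline relocates each TR region uniformly along the chromosome while keeping its length. All function names are hypothetical.

```python
import random

def overlaps(a, b):
    """True if closed intervals a = (start, end) and b = (start, end) share a position."""
    return a[0] <= b[1] and b[0] <= a[1]

def count_overlapping(exons, regions):
    """Number of exons that overlap at least one region (simple quadratic scan)."""
    return sum(any(overlaps(e, r) for r in regions) for e in exons)

def random_expectation(exons, regions, chrom_len, trials=100, seed=0):
    """Average overlap count after relocating every region to a random position."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        shuffled = []
        for start, end in regions:
            length = end - start
            new_start = rng.randint(1, chrom_len - length)  # keep region on the chromosome
            shuffled.append((new_start, new_start + length))
        total += count_overlapping(exons, shuffled)
    return total / trials

# Toy coordinates, not taken from the rice annotation.
exons = [(10, 20), (30, 40)]
regions = [(15, 18), (50, 60)]
observed = count_overlapping(exons, regions)
expected = random_expectation(exons, regions, chrom_len=1000)
```

The quadratic scan keeps the sketch short; for chromosome-scale inputs a sorted sweep or an interval tree would be the usual choice.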
Intersection of regions with periodicity and known transposons and repeats. X is the normal distribution argument of the identified intersections calculated using the Monte Carlo method.
| Repeat Name | Number of Overlapping Repeats | Number of Expected Overlapping Repeats | X |
|---|---|---|---|
| DNAnona/Helitron | 4721 | 2222 | 53.0144 |
| DNAnona/unknown | 99 | 244 | −9.2827 |
| MITE/Tourist | 1567 | 2346 | −16.0832 |
| MITE/Stow | 1258 | 2112 | −18.5828 |
| DNAauto/MULE | 1662 | 962 | 22.5689 |
| DNAnona/MULE | 8850 | 2648 | 120.5238 |
| LINE/unknown | 2520 | 1218 | 37.3067 |
| LTR/Gypsy | 14,664 | 9534 | 52.5388 |
| DNAnona/hAT | 2273 | 1008 | 39.8438 |
| DNAnona/MULEtir | 468 | 486 | −0.8165 |
| DNAnona/Tourist | 292 | 144 | 12.3333 |
| LTR/Copia | 2830 | 1692 | 27.6657 |
| DNAauto/CACTA | 2141 | 688 | 55.3951 |
| SINE/unknown | 440 | 404 | 1.7911 |
| DNAnona/CACTA | 3144 | 688 | 93.6341 |
| DNAauto/hAT | 315 | 146 | 13.9865 |
| DNAnona/PILE | 231 | 120 | 10.1329 |
| DNAauto/PILE | 99 | 108 | −0.8660 |
| LTR/TRIM | 339 | 172 | 12.7336 |
| DNAauto/Helitron | 81 | 118 | −3.4061 |
| Evirus/ERTBV-C | 5 | 4 | 0.5000 |
| LTR/unknown | 175 | 50 | 17.6777 |
| DNAnona/CACTG | 1189 | 234 | 62.4303 |
| DNAauto/CACTG | 3367 | 958 | 77.8313 |
| LTR/Solo | 17 | 6 | 4.4907 |
| DNAauto/MLE | 68 | 72 | −0.4714 |
| Evirus/ERTBV-B | 9 | 30 | −3.8341 |
| Evirus/ERTBV-A | 11 | 28 | −3.2127 |
| Evirus/ERTBV | 3 | 12 | −2.5981 |
| DNAauto/POLE | 68 | 52 | 2.2188 |
| DNAnona/POLE | 81 | 64 | 2.1250 |
| DNAnona/MLE | 12 | 8 | 1.4142 |
| Centro/tandem | 1351 | 80 | 142.1021 |
| Satellite/rice | 84 | 26 | 11.3747 |
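The X values in the table above are numerically consistent with X = (N − E)/√E, where N is the number of overlapping repeats and E is the expected number. This formula is inferred from the tabulated numbers rather than stated in this section, so the sketch below is a consistency check under that assumption; `x_statistic` is a hypothetical name.

```python
import math

def x_statistic(observed, expected):
    """Normal-approximation argument, assuming the variance equals the expected count."""
    return (observed - expected) / math.sqrt(expected)

# (observed, expected, tabulated X) for three rows of the table.
rows = {
    "DNAnona/Helitron": (4721, 2222, 53.0144),
    "LTR/Solo": (17, 6, 4.4907),
    "Evirus/ERTBV-C": (5, 4, 0.5000),
}

for name, (obs, exp, tabulated) in rows.items():
    x = x_statistic(obs, exp)
    print(f"{name}: computed X = {x:.4f}, tabulated X = {tabulated:.4f}")
```

Large positive X (for example Centro/tandem or DNAnona/MULE) indicates many more overlaps with TR regions than expected by chance, and negative X (for example MITE/Stow) indicates fewer.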