Helen N Catanese, Kelly A Brayton, Assefaw H Gebremedhin.
Abstract
BACKGROUND: Short-sequence repeats (SSRs) occur in both prokaryotic and eukaryotic DNA, inter- and intragenically, and may be exact or inexact copies. When heterogeneous SSRs are present at a given locus, the pattern of different repeats can be used to genotype strains. Cataloguing and tracking these repeats can be difficult because diverse groups of researchers are involved in identifying them. Additionally, the task is error-prone when done manually.
Keywords: Anaplasma marginale; Genetic diversity; Genotyping; Knuth-Morris-Pratt algorithm; Msp1a; RepeatAnalyzer; Short sequence repeat (SSR); Software tool; Visualization
Year: 2016 PMID: 27260942 PMCID: PMC4891823 DOI: 10.1186/s12864-016-2686-2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 A flowchart of RepeatAnalyzer’s functionalities. Each main branch (a-e) represents a type of functionality; following a branch shows the program flow for that option
Fig. 2 A flowchart of the Knuth-Morris-Pratt (KMP) string searching algorithm. KMP is a computationally efficient algorithm for finding short text patterns in a long string of characters. offsets is a list of numbers whose length is the number of characters in pattern. Each entry in offsets gives the distance by which counters i and j are adjusted when a mismatch occurs. len(item) denotes the length of item
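The search described in the caption can be sketched as follows. This is a minimal textbook version of KMP, not RepeatAnalyzer's own code: the function names `build_offsets` and `kmp_find_all` are hypothetical, and `offsets` is built with the standard prefix-function convention, which may differ in detail from the convention used in the flowchart.

```python
def build_offsets(pattern):
    """Return the KMP failure table for pattern.

    offsets[k] is the length of the longest proper prefix of
    pattern[:k + 1] that is also a suffix of it (assumed convention).
    """
    offsets = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = offsets[k - 1]  # fall back to a shorter prefix
        if pattern[i] == pattern[k]:
            k += 1
        offsets[i] = k
    return offsets


def kmp_find_all(text, pattern):
    """Return the start index of every (possibly overlapping) match."""
    if not pattern:
        return []
    offsets = build_offsets(pattern)
    matches = []
    j = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while j > 0 and ch != pattern[j]:
            j = offsets[j - 1]  # reuse the partial match instead of restarting
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = offsets[j - 1]  # keep searching for overlapping matches
    return matches
```

Because the text index never moves backwards, the search takes time proportional to len(text) + len(pattern), which is what makes it suitable for locating many short repeats in a long sequence.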
Metrics used to calculate genetic diversity
| Metric Name | Formula | Significance |
|---|---|---|
| GD2 | | Defined in [ |
| GD2b | | Modified version of GD2, calculable in RepeatAnalyzer |
| GDM1-Local | | Ratio of unique repeats in each strain in the region |
| GDM1-Global | | Ratio of unique repeats across the region as a whole |
| GDM2-Local | | Variation in how often repeats occur within strains in the region |
| GDM2-Global | | Variation in how often repeats occur in the region as a whole |
Length(Genotype) = the number of SSRs in that genotype
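The formulas for these metrics are not shown here, so the following is only one plausible reading of the two GDM1 descriptions, offered as a sketch. It assumes that GDM1-Local is the mean, over genotypes, of (unique repeats in the genotype) / Length(Genotype), and that GDM1-Global is (unique repeats in the region) / (total repeats in the region). The function names and the toy genotypes are hypothetical, and the exact definitions used by RepeatAnalyzer may differ.

```python
def gdm1_local(genotypes):
    """Mean per-genotype ratio of unique repeats to genotype length.

    Assumed reading of "ratio of unique repeats in each strain".
    Each genotype is a list of repeat names; empty genotypes are skipped.
    """
    ratios = [len(set(g)) / len(g) for g in genotypes if g]
    return sum(ratios) / len(ratios)


def gdm1_global(genotypes):
    """Ratio of unique repeats to all repeats pooled across the region.

    Assumed reading of "ratio of unique repeats across the region as a whole".
    """
    pooled = [repeat for g in genotypes for repeat in g]
    return len(set(pooled)) / len(pooled)


# Toy region with two hypothetical genotypes (not real data).
region = [["A", "B", "B"], ["A", "C"]]
local_score = gdm1_local(region)    # mean of 2/3 and 2/2
global_score = gdm1_global(region)  # 3 unique repeats out of 5 in total
```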
Mistaken repeat sequence and mistaken repeat name found in the A. marginale literature
^a Each value in parentheses is the last two digits of an accession code where the first digits precede the parenthesized values
^b This repeat was listed as co-reported in the indicated papers
Diversity scores for Nayarit and Jalisco (Mexico), Kansas, and world data
| Metric | Nayarit | Jalisco | Kansas | World |
|---|---|---|---|---|
| GD2b | 70 | 145 | 56 | 77 |
| GDM1-Local^a | 0.692 | 0.814 | 0.395 | 0.739 |
| GDM1-Global^a | 0.148 | 0.345 | 0.114 | 0.177 |
| GDM2-Local^a | 0.095 | 0.074 | 0.103 | 0.092 |
| GDM2-Global^a | 0.027 | 0.028 | 0.196 | 0.010 |
^a These values are rounded; however, because they are constructed from counts rather than measurements, they do not have a strict number of significant digits. For the GDM scores, we have chosen to display numbers rounded to three significant digits to make them easier to read and compare. For the GD2b score, we round to whole numbers because the magnitudes are on a different scale
Fig. 3 Repeat frequency and genotype length distributions. The four plots in the figure are histograms produced by RepeatAnalyzer for Jalisco, Mexico and whole world data. Plots a and c show the distribution of SSR frequencies by the number of genotypes in which they occur in the region. Plots b and d show distributions of genotype lengths. The inset in plot c is zoomed in to show its middle segment in finer detail; RepeatAnalyzer automatically generates this type of inset when outlier values would make the indices on the table difficult to interpret
Edit distance analysis results for some common repeats
| SSR | # SSRs at ED 1 | # SSRs at ED 2 | # SSRs at ED 3 | # SSRs at max ED | Max ED | Reported in |
|---|---|---|---|---|---|---|
| α | 1 (108)^a | 1 (Ph1) | 0 | 1 (135) | 16 | Mexico, Brazil, Argentina, Taiwan, Venezuela |
| β | 6 | 6 | 10 | 2 (99, EV6) | 16 | Mexico, Brazil, Argentina, Taiwan, Venezuela, Philippines |
| Γ | 10 | 16 | 44 | 4 (99, 134, 135, EV6) | 12 | Italy, Mexico, Brazil, Argentina, Taiwan, Venezuela, Philippines |
| E | 10 | 20 | 31 | 1 (135) | 13 | United States, Puerto Rico, Israel, Venezuela, Mexico |
| 27 | 15 | 46 | 51 | 4 (133, 135, EV6, EV11) | 12 | Argentina, South Africa, Brazil, Philippines, Mexico, Venezuela |
| M | 11 | 34 | 63 | 1 (135) | 13 | United States, Brazil, Italy, Argentina, Israel, Mexico, Philippines, Venezuela, South Africa, Taiwan |

ED = edit distance. The columns "# SSRs at ED 1" through "# SSRs at max ED" give the number of SSRs reported at that edit distance from the SSR in the first column.
^a Parentheses contain the name of the referenced repeat
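A tally like the one in this table can be sketched as below. The sketch assumes that "edit distance" means the standard Levenshtein distance (single-character insertions, deletions and substitutions), which is not stated explicitly here. The function names and the short toy sequences are hypothetical and are not real repeat sequences.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def group_by_distance(query_seq, named_seqs):
    """Map each edit distance to the names of the sequences at that distance."""
    groups = defaultdict(list)
    for name, seq in named_seqs.items():
        groups[edit_distance(query_seq, seq)].append(name)
    return dict(groups)


# Toy catalogue with hypothetical names and sequences.
catalogue = {"x": "ACGA", "y": "AGGA", "z": "ACGT"}
groups = group_by_distance("ACGT", catalogue)
max_ed = max(groups)  # largest distance observed from the query
```

Counting the names in `groups[1]`, `groups[2]` and `groups[3]`, together with `max_ed` and `groups[max_ed]`, gives the kind of per-repeat summary shown in the table rows.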
Fig. 4 Geographic visualization of repeats. The figure shows the output of the query: Repeats: 10; 11; 12; 13; 14; 15; B; C; α; β; Γ, Strains: None, Location: Any, Scale: 1. A version of this same map zoomed in to show Venezuela in detail is available in Additional file 2. The size of a circle indicates the scope of the region it denotes. This is necessary because, while a location for a genotype must include a country, it may also (optionally) include a province and/or county. The larger the circle, the broader the scope: country-only markers have the largest circles, while markers for a specific county are the smallest
Fig. 5 Geographic visualization of repeats from Nayarit, Mexico. The figure shows the output of the query: Repeats: α; β; Γ; EV1; EV3; EV7; EV6, Strains: EV1 β β β Γ; α β β β Γ; EV1 β β Γ; EV3 EV7 β β EV6, Location: Nayarit, Mexico, Scale: 1.5. A version of this same map zoomed out to show the whole world is available in Additional file 2. Circles with grey outlines represent the positions of whole genotypes, rather than individual SSRs. Size still indicates the scope of the region represented, though genotype markers are strictly larger than SSR markers to allow both to be visible simultaneously at the same coordinate location
SSR genotypes of S. pneumoniae pspA
| Accession # | SSR repeat pattern^a |
|---|---|
| FQ312027 | 1 2 3 4 5 6 3 4 5 6 3 4 5 6 3 7 |
| ABJ54172 | 1 8 9 10 9 10 9 10 3 7 |
| U89711 | 1 8 9 9 11 11 12 13 14 |
| ACB89372 | 1 16 17 18 19 20 9 6 21 7 |
| AAK74303 | 15 16 3 4 11 11 10 6 3 7 |

^a Repeat sequences are presented in Additional file 3
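The genotypes in this table can be summarized in the same way as the histograms in Fig. 3: by genotype length and by the number of genotypes in which each repeat occurs. The sketch below uses the five patterns exactly as listed above; the function names are hypothetical, and this is an illustration rather than RepeatAnalyzer's own code.

```python
from collections import Counter

# Patterns copied from the table: accession -> space-separated repeat names.
PSPA_GENOTYPES = {
    "FQ312027": "1 2 3 4 5 6 3 4 5 6 3 4 5 6 3 7",
    "ABJ54172": "1 8 9 10 9 10 9 10 3 7",
    "U89711": "1 8 9 9 11 11 12 13 14",
    "ACB89372": "1 16 17 18 19 20 9 6 21 7",
    "AAK74303": "15 16 3 4 11 11 10 6 3 7",
}


def genotype_lengths(genotypes):
    """Length(Genotype): the number of SSRs in each genotype."""
    return {acc: len(pattern.split()) for acc, pattern in genotypes.items()}


def genotypes_per_repeat(genotypes):
    """Count, for each repeat name, how many genotypes contain it."""
    counts = Counter()
    for pattern in genotypes.values():
        counts.update(set(pattern.split()))  # count each repeat once per genotype
    return counts


lengths = genotype_lengths(PSPA_GENOTYPES)
occurrence = genotypes_per_repeat(PSPA_GENOTYPES)
```

Here `lengths` feeds a genotype-length histogram, and the values of `occurrence` feed a histogram of repeat frequency by number of genotypes.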