Marco Pellegrini, Maria Elena Renda, Alessio Vecchio.
Abstract
BACKGROUND: Tandem repetitions within protein amino acid sequences often correspond to regular secondary structures and form multi-repeat 3D assemblies of varied size and function. Developing internal repetitions is one of the evolutionary mechanisms that proteins employ to adapt their structure and function under evolutionary pressure. While there is keen interest in understanding such phenomena, detection of repeating structures based only on sequence analysis is considered an arduous task, since structure and function are often preserved even under considerable sequence divergence (fuzzy tandem repeats).
Year: 2012 PMID: 22536906 PMCID: PMC3402919 DOI: 10.1186/1471-2105-13-S3-S8
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. PTRStalker algorithm scheme.
UniProtKB/Swiss-Prot database: percentage of protein sequences that contain at least one tandem repeat of length ≥ 20
| | | | |
|---|---|---|---|
| BLOSUM 90 | 17.7 | 20.5 | 23.2 |
| BLOSUM 70 | 21.6 | 25.6 | 26.1 |
| BLOSUM 50 | 22.5 | 27.4 | 29.1 |
| Edit | 4.3 | - | - |
UniProtKB/Swiss-Prot database: percentage of protein sequences that contain at least one tandem repeat of length ≥ 30
| | | | |
|---|---|---|---|
| BLOSUM 90 | 5.7 | 6.4 | 7.8 |
| BLOSUM 70 | 7.0 | 7.0 | 8.7 |
| BLOSUM 50 | 7.1 | 8.4 | 8.8 |
| Edit | 2.2 | - | - |
UniProtKB/Swiss-Prot database: percentage of protein sequences that contain at least one tandem repeat of length ≥ 40
| | | | |
|---|---|---|---|
| BLOSUM 90 | 4.3 | 4.9 | 4.9 |
| BLOSUM 70 | 4.9 | 5.5 | 5.4 |
| BLOSUM 50 | 4.9 | 5.7 | 5.7 |
| Edit | 1.8 | - | - |
Analysis of the 12 proteins from the UniProtKB/Swiss-Prot database with a very long fuzzy tandem repeat
| Protein | | | Tandem repeats found by | | | | |
|---|---|---|---|---|---|---|---|
| Q8IVF2 | AHNK2_HUMAN | 5795aa | 165-x-24 [720-4666] | 165-x-23 [774-4617] | * | ** | 163-x-31 [289-5529] |
| Q9N4M4 | ANC1_CAEEL | 8545aa | 915-x-6 [3000-8491] | 903-x-4.27 [4342-8199] | 58-x-4 [2336-2567] | ** | *** |
| P08519 | APOA_HUMAN | 4548aa | 1495-x-3 [0-4486] | 114-x-37 [7-4220] | 114-x-24 [1501-4125] | 114-x-39 [18-4523] | 111-x-38 [17-4282] |
| P20930 | FILA_HUMAN | 4061aa | 1339-x-3 [32-4051] | 323-x-11 [268-3902] | * | 324-x-12 [82-3935] | *** |
| Q54CU4 | COLA_DICDI | 11103aa | 433-x-17 [1175-8554] | 430-x-17 [1257-8691] | * | ** | 424-x-22 [301-9409] |
| Q8R0W0 | EPIPL_MOUSE | 6548aa | 515-x-8 [2000-6548] | 515-x-8 [2067-6529] | * | ** | *** |
| Q9Y6R7 | FCGBP_HUMAN | 5405aa | 1367-x-3 [1000-5102] | 1201-x-3 [1100-4811] | * | 1201-x-5 [21-5405] | 394-x-13 [444-5382] |
| P05790 | FIBH_BOMMO | 5263aa | 1049-x-5 [1-5247] | 168-x-30 [152-5221] | 8-x-19 [3362-3495] | ** | *** |
| Q9UKN1 | MUC12_HUMAN | 5478aa | 1548-x-3 [74-4719] | 1557-x-2 [446-3569] | 25-x-8 [2049-2280] | 28-x-151 [215-5123] | *** |
| Q8WXI7 | MUC16_HUMAN | 22152aa | 156-x-61 [12038-21555] | 156-x-61 [12047-21567] | 156-x-17 [12420-15000] | ** | 153-x-61 [12046-21559] |
| Q6PZE0 | MUC19_MOUSE | 7524aa | 652-x-9.6 [1071-7372] | 163-x-36.4 [1281-7214] | * | ** | *** |
| Q8WZ42 | TITIN_HUMAN | 34350aa | 1082-x-4 [22186-26525] | 28-x-6 [11428-11596] | 10-x-26 [11445-11686] | ** | 395-x-28 [20001-29694] |
Analysis of the 12 proteins from the UniProtKB/Swiss-Prot database for which PTRStalker detected a tandem repeat (TR) of length ≥ 4000 aa. For each protein (row) and each algorithm that returned at least one result (column), we report the longest TR above the threshold of 100 aa found by that algorithm. For each TR we report the period -x- repeat number and the [interval spanned]. A failure to report is marked with "*". HHRep and HHRepID are not listed here because they report only pairs of homologous substrings and therefore cannot report multi-repeat units. Entries in the TRUST column marked "**" could not be completed because of excessive memory requirements (see Additional file 1). Entries in the RADAR column marked "***" indicate that the output contains no TR cluster, although many interspersed repeats may be found.
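As a reading aid for the "period -x- repeat number" and "[interval spanned]" notation, the following minimal Python sketch parses one table entry and compares the nominal repeat length (period times repeat number) with the spanned interval. The names `TandemRepeat` and `parse_entry` are hypothetical and are not part of PTRStalker. The span is computed as end minus start, which is an assumption, since the indexing convention of each tool is not stated here.

```python
import re
from typing import NamedTuple


class TandemRepeat(NamedTuple):
    """One table entry (hypothetical helper type, not from the paper)."""
    period: float   # length of one repeating unit, in residues
    copies: float   # repeat number; may be fractional, e.g. 9.6
    start: int      # first position of the interval spanned
    end: int        # last position of the interval spanned

    @property
    def nominal_length(self) -> float:
        # Length implied by the period and the repeat number alone.
        return self.period * self.copies

    @property
    def span(self) -> int:
        # Assumption: span = end - start (indexing convention unknown).
        return self.end - self.start


# Matches e.g. "165-x-24 [720-4666]" and tolerates stray spaces in brackets.
_ENTRY = re.compile(
    r"^\s*([\d.]+)-x-([\d.]+)\s*\[\s*(\d+)\s*-\s*(\d+)\s*\]\s*$"
)


def parse_entry(text: str) -> TandemRepeat:
    """Parse a 'period-x-copies [start-end]' string into a TandemRepeat."""
    m = _ENTRY.match(text)
    if m is None:
        raise ValueError(f"not a 'period-x-copies [start-end]' entry: {text!r}")
    return TandemRepeat(
        float(m.group(1)), float(m.group(2)), int(m.group(3)), int(m.group(4))
    )


# First result reported for Q8IVF2 (AHNK2_HUMAN):
# 165 * 24 = 3960 residues nominal, 4666 - 720 = 3946 residues spanned.
tr = parse_entry("165-x-24 [720-4666]")
print(tr.nominal_length, tr.span)
```

The two numbers need not coincide exactly: for fuzzy repeats the individual copies can differ in length, so the product is only an approximate check on the interval.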
Analysis of proteins belonging to the Urea Transporter (UT) family
| Protein | | | Tandem repeats found by | | | | |
|---|---|---|---|---|---|---|---|
| dvUT | ABM28909 | 337aa | [14-165], [177-323] | [11-91], [174-254] | [2-138], [141-286] | [11-139], [140-303] | 5-x-44 [23-269] |
| apUT | YP_001969475 | 300aa | [17-136], [153-284] | [2-140], [156-288] | [2-138], [156-286] | [21-128], [175-278] | [45-129], [199-277] |
| mUT-A1 | AAM00357 | 930aa | [4-452], [467-918] | [63-337], [532-800] | [65-493], [533-916] | [41-421], [422-866] | 5-x-52 [117-775] |
| hsUT-A1 | AAL08485 | 920aa | [4-320], [403-763] | [50-338], [519-800] | [87-490], [548-906] | [102-564], [565-916] | 6-x-24 [189-721] |
dvUT = [D. vulgaris DP4], apUT = [A. pleuropneumoniae AP76], mUT-A1 = [Mus musculus], and hsUT-A1 = [Homo sapiens]. For each protein (row) and for each algorithm that returned at least one result (column), we report the longest fuzzy repeated subsequence detected by that algorithm, giving the start and end positions of the homologous segments. Note that XSTREAM and T-REKS are not listed here because they are unsuitable for detecting long fuzzy dimeric structures.
Analysis of the CLC-0 protein belonging to the Chloride Channel family
| Protein | Tandem repeats found by | |||||
|---|---|---|---|---|---|---|
| CLC-0[CICH_TORMA] | P21564 | 805aa | [8-289] | [119-220] | [82-195] | [102-154] |
| [291-575] | [174-260] | [196-303] | [161-205] | |||
| [214-265] | ||||||
| [517-627] | [542-597] | [447-483] | [329-353] | |||
| [628-800] | [718-770] | [484-527] | [386-407] | |||
| [528-564] | ||||||
For each algorithm that returned at least one result, we report the fuzzy repeated subsequences detected, giving the start and end position of each. Note that XSTREAM, T-REKS, and HHrepID are not listed here because they could not find any repetitive structure.
Figure 2. Average length of the longest TR found in shuffled sequences when using edit distance.
P-values for the Wilcoxon signed rank significance test (one tailed test)
| Length Class | P-value (1-tail) |
|---|---|
| 300 | 1.5967199936708E-3 |
| 400 | 2.0342486747627283E-2 |
| 500 | 2.4673807804867205E-4 |
| 600 | 6.36671915517767E-7 |
| 700 | 1.8930737304578085E-5 |
| 800 | 1.8821140025488957E-5 |
| 900 | 7.568768556122764E-6 |
| 1000 | 5.4931640625E-4 |
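For readers unfamiliar with the test named in the table caption, here is a minimal pure-Python sketch of an exact one-tailed Wilcoxon signed rank test on paired samples. The function name `wilcoxon_one_tailed` is hypothetical. This record does not state which quantities were paired or whether an exact or approximate p-value was computed, so the sketch illustrates the test itself and does not reproduce the values above.

```python
def wilcoxon_one_tailed(x, y):
    """Exact one-tailed p-value for H1: values in x tend to exceed those in y.

    Zero differences are dropped; tied absolute differences get average ranks.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, averaging ranks within ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # Average ranks are multiples of 0.5, so doubling makes them integers.
    doubled = [int(round(2 * r)) for r in ranks]
    w_plus = sum(r for r, d in zip(doubled, diffs) if d > 0)
    total = sum(doubled)

    # counts[s] = number of sign assignments whose positive (doubled) rank
    # sum equals s; each of the 2**n assignments is equally likely under H0.
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in doubled:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]

    # P(W+ >= observed W+) under the null hypothesis.
    return sum(counts[w_plus:]) / 2 ** n


# Hypothetical paired measurements, for illustration only.
real = [31, 40, 28, 52, 45]
shuffled = [30, 38, 25, 48, 40]
print(wilcoxon_one_tailed(real, shuffled))
```

When SciPy is available, `scipy.stats.wilcoxon` with `alternative="greater"` performs the same kind of one-sided test and is likely the more practical choice for larger samples.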
Figure 3. Average length of the longest TR found in shuffled sequences when using BLOSUM 50 based distance.
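The shuffled-sequence baseline named in the figure captions can be sketched as follows. This is a simplified illustration under stated assumptions, not the PTRStalker procedure: it shuffles residues uniformly at random (preserving composition) and measures the longest exact tandem repeat by brute force, whereas the figures refer to fuzzy repeats under edit distance or a BLOSUM 50 based distance. The function names and the `trials` and `seed` parameters are hypothetical.

```python
import random


def longest_exact_tandem_repeat(seq: str) -> int:
    """Length of the longest substring that has some period p and spans
    at least two full copies (length >= 2p). Returns 0 if there is none."""
    n = len(seq)
    best = 0
    for period in range(1, n // 2 + 1):
        run = 0  # consecutive positions i with seq[i] == seq[i + period]
        for i in range(n - period):
            if seq[i] == seq[i + period]:
                run += 1
                if run >= period:
                    # A run of `run` matches covers run + period residues.
                    best = max(best, run + period)
            else:
                run = 0
    return best


def mean_longest_tr_in_shuffles(seq: str, trials: int = 100, seed: int = 0) -> float:
    """Average longest exact tandem repeat over random shuffles of seq."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    residues = list(seq)
    total = 0
    for _ in range(trials):
        rng.shuffle(residues)  # composition-preserving permutation
        total += longest_exact_tandem_repeat("".join(residues))
    return total / trials


# Toy sequence for illustration only.
toy = "ACDEFGACDEFGACDEFG"
print(longest_exact_tandem_repeat(toy))
print(mean_longest_tr_in_shuffles(toy, trials=50))
```

Comparing the value for the original sequence with the average over shuffles gives a rough sense of how much repeat length is expected by chance for a given composition and sequence length.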