
Understanding and identifying amino acid repeats.

Abstract

Amino acid repeats (AARs) are abundant in protein sequences. They have particular roles in protein function and evolution. Simple repeat patterns generated by DNA slippage tend to introduce length variations and point mutations in repeat regions. Loss of normal and gain of abnormal function owing to their variable length are potential risks leading to diseases. Repeats with complex patterns mostly refer to the functional domain repeats, such as the well-known leucine-rich repeat and WD repeat, which are frequently involved in protein–protein interaction. They are mainly derived from internal gene duplication events and stabilized by ‘gate-keeper’ residues, which play crucial roles in preventing inter-domain aggregation. AARs are widely distributed in different proteomes across a variety of taxonomic ranges, and especially abundant in eukaryotic proteins. However, their specific evolutionary and functional scenarios are still poorly understood. Identifying AARs in protein sequences is the first step for the further investigation of their biological function and evolutionary mechanism. In principle, this is an NP-hard problem, as most of the repeat fragments are shaped by a series of sophisticated evolutionary events and become latent periodical patterns. It is not possible to define a uniform criterion for detecting and verifying various repeat patterns. Instead, different algorithms based on different strategies have been developed to cope with different repeat patterns. In this review, we attempt to describe the amino acid repeat-detection algorithms currently available and compare their strategies based on an in-depth analysis of the biological significance of protein repeats.

Year: 2014 PMID: 23418055 PMCID: PMC4103538 DOI: 10.1093/bib/bbt003

Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622

INTRODUCTION

Amino acid repeats (AARs) are abundant in protein sequences either as periodic elements in structural proteins such as collagens, keratins, silk and cell wall proteins, or as structural modules in functional proteins such as transcription factors, receptors, ion channels, histones, ubiquitins and calcium storage proteins. Table 1 shows some well-known examples of human repeat-containing proteins (RCPs) gathered in the UniProt/Swiss-Prot Knowledgebase (http://www.uniprot.org/). For example, the major prion protein (PRIO_HUMAN) contains an N-terminal repeat region with several octamers (PHGGGWGQ); the extra-embryonic spermatogenesis homeobox 1 protein (ESX1_HUMAN) has a sequence motif PPxxPxPPx repeated nine times and the alpha-1 type I collagen protein contains a repeat of various lengths of the periodic tri-amino acid GPP. The giant muscle protein Titin composed of 34 350 amino acid residues (TITIN_HUMAN) contains several types of repeating domains. Single amino acid repeats (SAARs) are also common, such as the polyQ repeats in the Forkhead box protein P2 (FOXP2_HUMAN), the androgen receptor (ANDR_HUMAN) and the Huntington’s disease (HD) protein (HD_HUMAN). Other SAARs including polyL, polyA and polyH can also be found in many other proteins. RCPs are distributed in all life kingdoms, and especially abundant in eukaryotes [1].

Table 1: Some examples of AARs in human proteins

UniProt ID	Protein	AA	Repeat pattern
SECR_HUMAN	Secretin	121	polyL
PRIO_HUMAN	Major prion protein	253	(PHGGGWGQ)₄
ANKR1_HUMAN	Ankyrin repeat domain-containing protein 1	319	Ankyrin repeat
CASQ2_HUMAN	Calsequestrin-2	399	D/E-Rich
ESX1_HUMAN	Homeobox protein ESX1	406	(PPxxPxPPx)₉
WDR1_HUMAN	WD repeat-containing protein 1	606	WD repeat
UBC_HUMAN	Polyubiquitin-C	685	Ubiquitin
FOXP2_HUMAN	Forkhead box protein P2	715	polyQ
LRRN1_HUMAN	Leucine-rich repeat neuronal protein 1	716	Leucine Rich Repeat
ANDR_HUMAN	Androgen receptor	919	polyQ, polyG, polyP
SRBP2_HUMAN	Sterol regulatory element-binding protein 2	1141	polyS, (PQ)₄, (SGSS)₂
BRD4_HUMAN	Bromodomain-containing protein 4	1362	polyP, polyH, polyQ, K-Rich, S-Rich
CO1A1_HUMAN	Collagen alpha-1(I) chain	1464	(GPP)ₙ
CAC1A_HUMAN	Brain calcium channel I	2505	polyQ, polyH, polyG
HD_HUMAN	Huntington disease protein	3142	polyQ, polyP, polyT, polyE, HEAT domain
MLL2_HUMAN	Histone-lysine N-methyltransferase MLL2	5537	(S/P-P-P-E/P-E/A)₁₅
TITIN_HUMAN	Titin	34 350	Several types of repeating domains: TPR, WD, RCC1, PEVK, Kelch, Z and Ig repeats

It is known that some AARs, such as the leucine-rich repeats (LRRs), form the structural framework for protein–protein interaction, and the repeat fragment in zinc finger transcription factors binds to cis-elements of DNA promoters. AARs can also cause problems such as the mis-folding of prion proteins [2]. Furthermore, modification of repeat length may introduce abnormal function. A typical case is the expansion of polyQ, which results in several neurological disorders such as mental retardation, HD, inherited ataxias and muscular dystrophy.

Classification of amino acid repeat patterns at sequence level

Mathematical and statistical methodologies can be applied to study the particular functional and evolutionary background of an AAR. Several approaches have been proposed to classify AARs into different categories depending on the characteristics of repeat units, including the sequence similarity among repeat units, the distance between adjacent repeat units and the complexity of the sequence pattern of the repeat units. The first approach is to classify AARs according to the similarity among the repeat units. Based on this approach, AARs can be classified into two main groups: perfect repeats and imperfect repeats. The repeat units in perfect repeat fragments are identical, e.g. AAAAAAA and PQPQPQPQ, whereas the repeat units in imperfect repeat fragments are not exactly the same, e.g. AAWAAAA and QQQMLQQQFL. Imperfect repeats with highly variable, but still recognizable, repeat units are also called divergent repeats. The second approach for repeat classification is based on the distance between adjacent units. AARs can be classified as tandem repeats (TRs) or non-tandem repeats (NTRs). The units in TRs are contiguous in the repeat sequence, whereas the units in NTRs are interspersed along the sequence. The third approach takes the complexity of the sequence pattern of the repeat units into consideration. Based on this approach, AARs can be roughly classified as simple repeats or complex repeats. Simple repeats generally refer to the continuous or interrupted runs of single amino acid residues or short peptides. The regions in a protein sequence containing simple repeats are often called simple sequences (SSs) or low complexity regions (LCRs). On the other hand, most complex repeats have sophisticated patterns of repeat units with variable lengths ranging from 10 to >100 residues, and these complex repeat patterns are frequently recognized as repeated protein domains [3].
In practice, it is rather difficult to strictly distinguish the different classes owing to the complicated patterns of AARs. For example, some domain repeats also contain SSs, such as the abundant leucine residues found in an LRR domain. And in the case of point mutations or insertions/deletions (INDELs), the original perfectly repeated units in proteins could gradually evolve into non-perfect tandem repeats (NPTRs). The above approaches used to classify AARs are all based on the protein sequence. However, they are insufficient to reveal the biological significance of AARs, as proteins play their functional roles by folding into particular secondary and tertiary structures, which are difficult to deduce through amino acid patterns at sequence level. Data from several experiments show that proteins with similar tertiary structures may share low sequence identity [4, 5]. And similar functional domains of proteins do not necessarily correspond to recognizable sequence repeat patterns [3, 6–8]. Therefore, in-depth study of protein repeats requires better understanding of the correspondence of repeat sequences with their structures and functions. In addition, the acquisition of such biological knowledge is more sophisticated than simply classifying sequential repeat data.
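The distinction between perfect and imperfect repeats described above can be illustrated with a minimal Python sketch. The function name `classify_fragment` and the requirement that the caller supplies the unit length are assumptions made for illustration; they are not taken from any of the tools reviewed here.

```python
def classify_fragment(fragment: str, unit_len: int) -> str:
    """Label a tandem fragment 'perfect' or 'imperfect' for a given unit length.

    Hypothetical helper: the unit length is supplied by the caller, and the
    first unit is taken as the reference unit.
    """
    if unit_len <= 0 or len(fragment) < 2 * unit_len:
        raise ValueError("fragment must contain at least two full units")
    # Cut the fragment into consecutive units; the last one may be partial.
    units = [fragment[i:i + unit_len] for i in range(0, len(fragment), unit_len)]
    reference = units[0]
    full_units = [u for u in units if len(u) == unit_len]
    tail = units[-1] if len(units[-1]) < unit_len else ""
    # Perfect: every full unit is identical and a partial tail is a prefix
    # of the reference unit.
    if all(u == reference for u in full_units) and reference.startswith(tail):
        return "perfect"
    return "imperfect"


if __name__ == "__main__":
    for frag, k in [("AAAAAAA", 1), ("PQPQPQPQ", 2), ("AAWAAAA", 1), ("QQQMLQQQFL", 5)]:
        print(frag, k, classify_fragment(frag, k))
```

The four fragments in the usage loop are the examples given in the text. A real classifier would also have to infer the unit length and tolerate insertions and deletions, which this sketch deliberately ignores.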

Biological significance of different patterns of AARs

Biologically, different amino acid repeat patterns imply different functional and evolutionary backgrounds. Repeats with simple patterns, such as single AARs, mainly exist in intrinsically unstructured regions (IURs) of proteins [9, 10]. Such protein regions that do not fold into a 3D structure commonly have functions related to molecular recognition and molecular assembly [11, 12]. Single amino acid or trinucleotide repeats like polyQ are involved in neurodegenerative diseases such as HD [13], where their length variations often result in either loss of normal or gain of abnormal function [14, 15]. Most SAARs are presumed to be originally derived from replicative DNA slippage [16] in the coding region. Expansion of some SAARs might also result from unequal chromosomal crossover, such as the polyA in the human HOX13 gene [17]. In general, perfect amino acid runs are inherently mutable and are frequently interrupted by point mutations [18] to become simple sequences [19]. In addition to SAARs, sequential tandem repeats, both perfect tandem repeats (PTRs) and NPTRs, with highly similar units are prevalent in protein sequences. We have found that ∼13% of all proteins deposited in the public protein databases contain at least one tandem repeat fragment. More than 40% of the tandem repeats are PTRs, and ∼60% of PTRs are single amino acid runs [1]. Errors in sequencing and automatic annotation procedures might have introduced some false-positive PTRs into the public protein knowledgebase. However, this does not undermine the biological significance of frequently occurring PTRs in protein sequences, especially considering that functional PTRs are continuously being identified experimentally, and most of them are conserved among orthologous proteins [20–22]. Consistent with this scenario, conservation of amino acid tandem repeats is a strong indication of biological relevance. 
The phylogenetically conserved repeat fragments among orthologous proteins should have a conserved function, such as the conserved polyQ regions in primate FOXP2 proteins [23]. In contrast, however, variable repeat unit length in corresponding regions of orthologous proteins indicates a different scenario. These repeats are probably going through a rapid change driven by selection [24]. More interestingly, tandem repeats have been shown to play an important role in micro-evolution by catalysing the rapid production of genetic and phenotypic variation among organisms [25-28]. Repeats with complex patterns have comparatively stable structures and conserved functions, which are generally called domain repeats. Domain repeats are among the most common protein motifs in the Pfam database [29], such as LRRs, Zinc finger repeats, Ankyrin repeats and Tetratricopeptide repeats (TPRs) [30]. These domain repeats are mostly involved in transcription regulation, cell-cycle control and signal transduction [31-34] and widely spread in the proteomes of different species across different life kingdoms [35]. Many genes containing these domain repeats in the coding region are significant in certain diseases [36], as sequence identity increases the chance of protein aggregation [37] and mis-folding. Domain repeats are thought to have evolved through internal gene duplications arising from recombination events [3, 38], such as unequal crossing over [39] and exon shuffling [40]. The duplications may involve several domains at a time [3, 41]. In addition, a number of specific sequence-based signals such as the ‘gate-keeper’ residues [41] play a crucial role in preventing inter-domain aggregation. Therefore, these repeat patterns are generally obscure at sequence level, and a sophisticated search is required to detect them.

REPEAT DETECTION STRATEGIES

During the past decade, several strategies for the identification of AARs from protein sequences have been reported. Among these approaches, the three major ones are self-comparison, pattern recognition and complexity measurement. Table 2 shows the algorithms and publicly available tools including online resources that can be used to detect AARs of various types.

Table 2: Repeat detection algorithms

Method	Repeat type^a	Ref	Availability
Self-comparison
REP	Domain	[42]	http://www.embl.de/∼andrade/papers/rep/search.html
COACH	Domain	[43]	http://www.drive5.com/lobster/
TPRpred	Domain	[44]	http://tprpred.tuebingen.mpg.de/
REPRO	Domain	[45]	http://www.ibi.vu.nl/programs/reprowww/
TRUST	Divergent	[46]	http://www.ibi.vu.nl/programs/trustwww/
Internal Repeat Finder	Divergent	[47]	http://nihserver.mbi.ucla.edu/Repeats/
HHrep	Divergent	[48]	http://hhrep.tuebingen.mpg.de/hhrep/
RADAR	Divergent	[49]	http://www.ebi.ac.uk/Tools/Radar/
HHrepID	Divergent	[50]	http://toolkit.tuebingen.mpg.de/hhrepid/
Pattern recognition
REPETITA	Solenoid	[51]	http://protein.bio.unipd.it/repetita/
LSTM	Domain	[52]	http://www.bioinf.jku.at/software/LSTM_protein/
ARD	Alpha-Rod	[53]	http://www.ogic.ca/projects/ard/
Complexity measurement
SIMPLE	Simple	[19]	http://www.biochem.ucl.ac.uk/bsm/SIMPLE/
GBA	Simple	[54]	xli@cise.ufl.edu
Others
XSTREAM	NPTR	[55]	http://jimcooperlab.mcdb.ucsb.edu/xstream/
Apriod	PPP	[56]	hwan@mindgen.org
LocRepeat	PPP	[57]	http://www.cs.cityu.edu.hk/∼lwang/software/LocRepeat/
REPfind	NPTR	[58]	adebiyi@informatik.uni-tuebingen.de
Reptile	Perfect	[59]	http://reptile.unibe.ch/
SUFFIX	Perfect	[60]	http://www.cs.ucdavis.edu/∼gusfield/strmat.html

^a NPTR = non-perfect tandem repeat; PPP = pseudo-periodic partitions.

In the following section, we will give a brief introduction to the amino acid repeat-detection strategies, focusing on the general principles behind these strategies.

The self-comparison strategy

One of the most intuitive strategies to detect repeat patterns in protein sequences is the self-comparison method. The idea of this approach is rather simple, i.e. comparing a protein sequence to itself. Sequence comparison is a fundamental bioinformatics method that has been extensively used to search similar regions among biological sequences. The global sequence alignment method was first proposed in the 1970s [61] and focuses on finding the optimal alignment of two entire biological sequences using dynamic programming. Soon after, the Smith–Waterman local alignment algorithm [62] was developed to recognize the better aligned sub-regions between two sequences in order to show meaningful biological relevance. On aligning a sequence with itself for the purpose of identifying repeat patterns, the sub-optimal alignments become obscured by the best (and most obvious) alignment. This optimal alignment should be excluded from the initial search. The reliability of identifying sub-optimal alignments of protein sequences using the dynamic programming method has been evaluated [62]. A very distinguishing feature of this method is the use of a scoring system that gives scores to paired amino acids and penalties to unmatched gaps. Substitution matrices such as PAM [63] and BLOSUM [64] are the basis of the scoring system and represent the specific evolutionary relevance among different amino acids. More specifically tuned scoring matrices have also been proposed. These matrices take special features of amino acids such as polarity, electrostatic charge, structure, molecular volume and codon bias [65] into account. One of the greatest advantages of using a scoring system for identifying sub-optimal alignments is that statistical models can be applied to define reliable criteria [66, 67]. In principle, the self-alignment repeat-detection methods are the extension of an alignment-based homology-detection approach. 
Thus, they have inherited characteristics that are more suitable for detecting divergent internal repeats in protein sequences. The units of these repeats generally have low identities and ambiguous boundaries, but share evolutionarily conserved sites or motifs, which are presumed to have crucial functions. As such, the accurate definition of repeat length and repeat number according to substantial biological significance is a sophisticated problem. This is especially true for detecting repeat patterns without prior knowledge, also called ‘de novo’ repeat detection. On the other hand, the algorithms depending on prior knowledge, such as REP, COACH and TPRpred [42-44], generally search repeat patterns from sequence databases by profiles constructed with known repeat families using hidden Markov models (HMMs) [68]. Therefore, the repeat patterns identified by these programs are usually well-known, and some of them are experimentally studied functional protein domain repeats. It is generally believed that detecting repeat patterns with a self-alignment-based method is a feasible strategy. However, it also has some flaws and limitations. First, the computational complexity of performing self-alignment is high. The general complexity for a sequence with n amino acids is O(n²) for both time and space, so the cost grows quadratically with sequence length. Fortunately, this problem is not too serious for protein sequences, as their average length is around 320 AAs [69], and the computational capacity of current computer hardware is powerful enough to handle this problem within acceptable time and space. In addition, several optimization strategies have recently been applied to sequence alignments, such as the implementation of the Smith–Waterman algorithm on graphics processing units (GPUs) [70], and the parallel computing version of the REPRO [71] algorithm [72], which can handle much longer sequences within a reasonable time. 
One of the main purposes for detecting AARs is to find novel repeat patterns and infer their functional and evolutionary roles. As the majority of repeat patterns in protein sequences have not been well studied, de novo repeat-detection algorithms are more widely used, such as REPRO, Internal Repeat Finder, RADAR, TRUST, HHrep and HHrepID [45–50, 56, 57]. All of them identify repeats using the self-comparison strategy, but differ in some aspects. For example, Internal Repeat Finder assumes that the statistically significant sub-optimal alignment scores should have a Poisson distribution [47]. TRUST exploits transitivity among sub-optimal alignments, which could increase the chance and reliability of identifying divergent repeats [46]. HHrep [48] and its optimized version HHrepID [50] compare a sequence with itself by the HMM–HMM [73] strategy, which looks for the sub-optimal alignments using a profile HMM constructed by iterations of PSI-BLAST [74].
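The core idea of the self-comparison strategy, aligning a sequence with itself while excluding the trivial identity alignment, can be sketched as a Smith–Waterman recursion restricted to cells strictly above the main diagonal. This is a minimal illustration, not the method of any tool in Table 2: the function name `self_local_alignment` and the match, mismatch and gap scores are assumptions, whereas real tools use substitution matrices such as PAM or BLOSUM together with statistical significance estimates.

```python
def self_local_alignment(seq, match=2, mismatch=-1, gap=-2):
    """Best local alignment of seq against itself, excluding the main diagonal.

    Toy scoring (assumed values) with a linear gap penalty. Returns
    (score, (start1, end1), (start2, end2)) as 0-based half-open slices.
    """
    n = len(seq)
    # H[i][j] is only filled for i < j, so the identity alignment never appears.
    H = [[0] * (n + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            s = match if seq[i - 1] == seq[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    score, i, j = best
    end_i, end_j = i, j
    # Trace back from the best cell until the local alignment starts.
    while i > 0 and j > i and H[i][j] > 0:
        s = match if seq[i - 1] == seq[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score, (i, end_i), (j, end_j)


if __name__ == "__main__":
    seq = "ACDEFGHIKWWACDEFGHIK"
    score, (a, b), (c, d) = self_local_alignment(seq)
    print(score, seq[a:b], seq[c:d])
```

The full score matrix makes the O(n²) time and space cost discussed above explicit. Finding further repeat copies would require masking the reported alignment and repeating the search for the next sub-optimal alignment, which is omitted here.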

The pattern recognition strategy

The second strategy to detect AARs from protein sequences uses the conventional method of pattern recognition. The two main algorithms of this strategy are the discrete Fourier transform (DFT) and neural networks. DFT has been widely applied in the research area of signal processing. Generally, it can decompose signals into constituent frequencies, so that the cryptic patterns hidden in the signals can be analysed intuitively. Early studies showed that DFT can be used to detect periodic patterns in collagen protein [75], but that it also has some fundamental difficulties which limit its usage [45]. The accuracy of DFT-based methods is easily biased by the length variation of the repeat units caused by mutations or INDELs, as this weakens the periodic pattern of the transformed Fourier spectral amplitudes. Some recent algorithms make efforts to provide better discrimination on Fourier spectral amplitudes using newly developed methods. For example, REPETITA yields better accuracy than self-alignment methods in detecting solenoid repeats by introducing several optimized strategies of the DFT-based method [51]. In addition, the stationary wavelet packet transform has been widely used in bioinformatics and computational biology in recent years [76]. Regarded as a state-of-the-art refinement of the DFT approach [77], it has been shown to perform well in detecting protein repeat patterns [78]. The neural network-based method is another well-studied pattern-recognition strategy, which is also capable of identifying similar patterns in protein sequences [79]. A well-established neural network is able to associate homologous patterns in the protein sequence with the input patterns and can be trained to adapt to the patterns. Several neural network algorithms show good accuracy and time efficiency in protein homologue detection. LSTM is able to combine amino acid properties with patterns and does not rely on pre-defined scoring matrices for similarity measurements [52]. 
The ARD neural network is designed to identify specific alpha-rod repeat patterns and has been applied to the analysis of Huntingtin protein sequences [53].
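How a DFT exposes a periodic pattern can be shown with a small, self-contained sketch. It encodes a sequence as a 0/1 indicator signal for one residue and reports the period with the largest spectral power. The encoding, the function names and the choice of a single residue are assumptions made for illustration; DFT-based tools such as REPETITA use richer encodings and additional processing.

```python
import cmath


def indicator_power_spectrum(seq, residue):
    """Power spectrum of the 0/1 indicator signal of one residue.

    Returns a list of (k, power) for frequencies k = 1 .. len(seq) // 2.
    The mean is subtracted first so the zero-frequency term does not dominate.
    """
    x = [1.0 if aa == residue else 0.0 for aa in seq]
    n = len(x)
    mean = sum(x) / n
    x = [v - mean for v in x]
    spectrum = []
    for k in range(1, n // 2 + 1):
        # Direct O(n^2) evaluation of the DFT coefficient at frequency k.
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(x))
        spectrum.append((k, abs(coeff) ** 2))
    return spectrum


def dominant_period(seq, residue):
    """Period (in residues) of the frequency with the highest power."""
    k, _ = max(indicator_power_spectrum(seq, residue), key=lambda kp: kp[1])
    return len(seq) / k


if __name__ == "__main__":
    collagen_like = "GPP" * 10  # idealized collagen-style tri-peptide repeat
    print(dominant_period(collagen_like, "G"))
```

On an idealized GPP repeat the glycine signal has a single sharp peak at period 3. Inserting or deleting a residue in the middle of such a sequence shifts the phase of the downstream units and spreads the spectral power, which is the sensitivity to INDELs noted above.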

The complexity measurement strategy

The third approach to identifying AARs takes complexity measurement into consideration. LCRs are widely distributed in protein sequences. LCRs commonly contain particular repeat patterns that have continuous repetitions of very short units, such as the SAARs and cryptically simple sequences [19]. These repeats have special functional and evolutionary properties that differ from the repeats with more complex patterns and longer units. Their typically short unit length makes both the self-comparison- and the pattern recognition-based strategies less well suited to identifying LCR repeats efficiently. Fortunately, several algorithms have been introduced to detect repeats involved in LCRs, most of them using a strategy to measure the complexity of sequences within a sliding window. As for complexity measuring, SIMPLE [19] awards a simplicity score to the central amino acid of each window, and is most suitable for detecting short-unit cryptic repeats. SEG [80], DSR [81] and CARD [82] are based on Shannon entropy [83], which displays several limitations when decoding complex protein sequences (43). The main drawback of sliding window-based algorithms is that they all require a pre-specified window size, and repeats that are longer or shorter than the window are not detectable. On the other hand, non-sliding window algorithms show more flexibility in detecting repeats in LCRs. GBA [54] constructs a graph for each protein sequence, and finds short subsequences as LCR candidates through traversing. Coronado [84] introduces composition-modified scoring matrices to identify LCRs within cell wall proteins of fungi. These algorithms are an important complement to the sliding window-based algorithms.
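The sliding-window entropy idea can be sketched in a few lines of Python. This is an illustration of the general principle only, not a reimplementation of SEG, DSR or CARD: the function names, the window size of 12 and the entropy threshold are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    counts = Counter(window)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(seq, window=12, threshold=2.2):
    """Return (start, entropy) for every window whose entropy is <= threshold.

    `window` and `threshold` are assumed, user-tunable values.
    """
    hits = []
    for start in range(len(seq) - window + 1):
        h = shannon_entropy(seq[start:start + window])
        if h <= threshold:
            hits.append((start, h))
    return hits


if __name__ == "__main__":
    seq = "MKTAYIAKQR" + "Q" * 12 + "LVDESGTWKH"
    for start, h in low_complexity_windows(seq, window=12, threshold=1.0):
        print(start, seq[start:start + 12], round(h, 3))
```

The fixed `window` argument makes the drawback mentioned above concrete: a repeat much shorter than the window is diluted by its flanking residues, and a repeat much longer than the window is reported only as a series of overlapping hits that must be merged afterwards.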

Other strategies

As described above, the self-comparison strategy and the pattern recognition strategy are mostly suitable for detecting divergent repeats, whereas the complexity measurement strategy is mostly suitable for detecting simple unit repeats. In addition, dedicated and optimized strategies for sequential tandem repeats are also particularly useful. Sequential tandem repeats, i.e. amino acid fragments with tandem repeat patterns, are comparatively more explicit than divergent repeats. They are widely spread in many proteomes across wide taxonomic ranges, but are still insufficiently studied. Hamming distance [85] and edit distance, also called Levenshtein distance [86], are widely used for measuring the similarity of sequential tandem repeats [87-90]. Unlike Hamming distance, which only accounts for point mutations, edit distance-measuring algorithms also consider insertions and deletions. In addition, Apriod [56] and LocRepeat [57] focus on finding the ‘pseudo-periodic partitions’, which are gradually evolved patterns among repeat units. Given that NPTRs originally evolved from PTRs, XSTREAM [55] and REPfind [58] detect NPTRs based on the extension of exact repeat seeds, which could decrease the computational complexity of both time and space. Most of the repeat-detection algorithms can identify PTRs together with other repeat patterns incidentally. But as some of the PTRs are nested in larger NPTR fragments, which can hardly be distinguished by the common strategies, a dedicated algorithm for detecting PTRs is also necessary. For example, the suffix tree-based strategy can identify all PTRs in a protein sequence with linear time complexity [60]. Reptile uses a ‘brute-force’ strategy to detect PTRs from the proteins of parasite antigens [59]. 
Following the definition of statistically significant repeat runs in protein sequences [91], the cut-off sizes of five, four, three and two of the repeat unit repetitions are common criteria for identifying mono-amino, di-amino, tri-amino and all other repeats, respectively.
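The two distance measures and the copy-number cut-offs described above can be sketched together. The code below is a simple brute-force scan under stated assumptions, not the algorithm of XSTREAM, Reptile or the suffix tree method: the function names, the `max_unit` limit and the rule of skipping units that are themselves repetitions of a shorter unit are illustrative choices.

```python
def hamming(a, b):
    """Number of mismatching positions; only defined for equal lengths."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length units")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Edit distance counting substitutions, insertions and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


# Minimum copy numbers from the text: 5 (mono), 4 (di), 3 (tri), 2 (all others).
MIN_COPIES = {1: 5, 2: 4, 3: 3}


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit."""
    n = len(unit)
    return all(unit != unit[:d] * (n // d) for d in range(1, n) if n % d == 0)


def find_perfect_tandem_repeats(seq, max_unit=10):
    """Brute-force scan returning (start, unit, copies) for perfect tandem repeats."""
    hits = []
    for unit_len in range(1, max_unit + 1):
        need = MIN_COPIES.get(unit_len, 2)
        i = 0
        while i + unit_len * need <= len(seq):
            unit = seq[i:i + unit_len]
            copies = 1
            while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
                copies += 1
            if copies >= need and is_primitive(unit):
                hits.append((i, unit, copies))
                i += copies * unit_len  # jump past the reported repeat
            else:
                i += 1
    return hits


if __name__ == "__main__":
    print(find_perfect_tandem_repeats("PHGGGWGQ" * 4))
    print(hamming("PHGGGWGQ", "PHGGSWGQ"), levenshtein("PQPQPQ", "PQPPQ"))
```

The usage lines apply the scan to four copies of the prion octamer from Table 1 and compare one unit with a hypothetical point-mutated copy. Relaxing the exact unit comparison to a Hamming or edit distance threshold would be one way to extend such a scan towards NPTRs.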

SUMMARY AND PERSPECTIVE

Identifying repeat patterns in proteins is the first step towards the understanding of their physiological function and evolutionary mechanism. During the evolution process, these patterns become so intricate that no single algorithm is adequate to identify all of them. An in-depth investigation of their biological background is therefore required to choose proper algorithms for the identification of specific patterns. In general, self-comparison algorithms are suitable to detect de novo repeats with complex patterns. Pattern recognition-based algorithms are suitable to detect repeats with low sequence identities but high intrinsic biological similarities. Complexity measurement-based algorithms can be applied to detect repeats with simple patterns involved in LCRs. For the tandem repeats that have more sequentially repetitive patterns, one should consider the strategies that measure the similarity of repeat units by edit or Hamming distance. The biological significance of protein repeats has been discussed for years. Internal duplication in genomes is one of the most important evolutionary mechanisms for species to adapt to the environment [92-94]. As a result, repetitive patterns at the DNA level such as interspersed microsatellites and tandem tri-nucleotide repeats are prevalent. Intragenic repeats are presumed to have potential roles in generating functional variability [95, 96], and the repeats in coding regions corresponding to AARs are more likely to go through adaptive competition [24, 97, 98]. Therefore, the large number of repeats in proteins is unlikely to be mere ‘junk proteins’ [99] with only non-essential roles. At the same time, their variable characters and vulnerabilities to disorder and diseases have been a scientific puzzle for a long time. Frequently asked questions are: Are the characteristics of similar repeat patterns coherent in different proteomes across different life kingdoms? 
Could the functional and evolutionary roles of certain repeats correspond to their particular characters, such as position bias, GC content constraints and codon usage? How could the conserved functions of particular repeats have evolved by selection? And what are the structure- and sequence-based strategies to prevent repeats from aggregation? The insufficient understanding of protein repeats is not only due to the difficulty of identification, but also because of the lack of an integrated repository for large-scale investigation and comparison of repeats among a variety of proteomes across different kingdoms. To that end, we developed ProRepeat (http://prorepeat.bioinformatics.nl), which integrates non-redundant tandem repeats detected by several algorithms from the UniProt [69] and RefSeq [100] protein databases and offers powerful analysis tools for finding biologically interesting properties of query results. In addition, we also integrated ProRepeat with ProGMap, a tool we developed for the integration of annotation resources for protein orthology [101]. With this set-up, we will be making large-scale orthologous comparisons on protein repeats over a broad taxonomic range, especially eukaryotes, in the near future.

Amino acid repeats are abundant in protein sequences. They can be classified into different categories depending on the characters of the repeat units. Different amino acid repeat patterns imply different functional and evolutionary backgrounds. The three major approaches for detection of amino acid repeats are the self-comparison strategy, the pattern recognition strategy and the complexity measurement strategy.

90 in total

1. Protein Data Bank (PDB): database of three-dimensional structural information of biological macromolecules.

Authors: J L Sussman; D Lin; J Jiang; N O Manning; J Prilusky; O Ritter; E E Abola
Journal: Acta Crystallogr D Biol Crystallogr Date: 1998-11-01

2. Detecting cryptically simple protein sequences using the SIMPLE algorithm.

Authors: M Mar Albà; Roman A Laskowski; John M Hancock
Journal: Bioinformatics Date: 2002-05 Impact factor: 6.937

3. Evolutionary analysis of amino acid repeats across the genomes of 12 Drosophila species.

Authors: Melanie A Huntley; Andrew G Clark
Journal: Mol Biol Evol Date: 2007-06-29 Impact factor: 16.240

4. Comparative and functional characterization of intragenic tandem repeats in 10 Aspergillus genomes.

Authors: John G Gibbons; Antonis Rokas
Journal: Mol Biol Evol Date: 2008-12-04 Impact factor: 16.240

5. The mathematical theory of communication. 1963.

Authors: C E Shannon
Journal: MD Comput Date: 1997 Jul-Aug

6. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes.

Authors: S Karlin; S F Altschul
Journal: Proc Natl Acad Sci U S A Date: 1990-03 Impact factor: 11.205

Review 7. Statistical significance of sequence patterns in proteins.

Authors: S Karlin
Journal: Curr Opin Struct Biol Date: 1995-06 Impact factor: 6.809

8. Novel protein domains and repeats in Drosophila melanogaster: insights into structure, function, and evolution.

Authors: C P Ponting; R Mott; P Bork; R R Copley
Journal: Genome Res Date: 2001-12 Impact factor: 9.043

9. Ongoing and future developments at the Universal Protein Resource.

Authors:
Journal: Nucleic Acids Res Date: 2010-11-04 Impact factor: 16.971

10. The energy landscapes of repeat-containing proteins: topology, cooperativity, and the folding funnels of one-dimensional architectures.

Authors: Diego U Ferreiro; Aleksandra M Walczak; Elizabeth A Komives; Peter G Wolynes
Journal: PLoS Comput Biol Date: 2008-05-16 Impact factor: 4.475
