Ole K Tørresen, Bastiaan Star, Pablo Mier, Miguel A Andrade-Navarro, Alex Bateman, Patryk Jarnot, Aleksandra Gruca, Marcin Grynberg, Andrey V Kajava, Vasilis J Promponas, Maria Anisimova, Kjetill S Jakobsen, Dirk Linke.
Abstract
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with 'ready-to-use' deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others.
Year: 2019 PMID: 31584084 PMCID: PMC6868369 DOI: 10.1093/nar/gkz841
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Summary of proteins from UniProtKB/Swiss-Prot where the length of repetitive region has changed between different versions of the database
| Proteins | Proteins with different sequence between versions | Proteins with different repetitive region lengths | Average/standard deviation of the length of repetitive regions in the original version of the sequenceᵃ | Average/standard deviation of the length of repetitive regions in version 2018_06 | Average/standard deviation of the difference in lengths of repetitive regionsᵃ |
|---|---|---|---|---|---|
| 554 241 | 74434 | 1669 | 31.14/72.09 | 35.20/84.08 | 13.57/45.69 |
ᵃ Measured in amino acid residues.
Differences of repetitive region lengths in evolutionarily distinct groups of organisms
| Database name | Number of proteins | Number of proteins with STRs | % of proteins with STRs | Medianᵃ | Averageᵃ | Standard deviationᵃ | Number of clustersᵇ |
|---|---|---|---|---|---|---|---|
| UniProtKB/Swiss-Prot (total) | 554 241 | 28003 | 5.05% | 14.75 | 15.14 | 3.69 | 6237 |
| Archaea | 19 525 | 351 | 1.80% | 10.71 | 10.63 | 1.27 | 45 |
| Bacteria | 333 691 | 6794 | 2.04% | 17.38 | 17.45 | 2.66 | 1048 |
| Euk: Fungi | 33 613 | 3996 | 11.89% | 13.46 | 13.79 | 3.65 | 893 |
| Euk: Invertebrata | 27 607 | 3372 | 12.21% | 17.34 | 18.62 | 7.95 | 812 |
| Euk: Vertebrata | 18 292 | 1461 | 7.99% | 13.66 | 13.90 | 2.42 | 1801 |
| Euk: Plants | 42 101 | 3601 | 8.55% | 12.51 | 12.82 | 2.98 | 795 |
| Viruses | 16 852 | 889 | 5.28% | 14.07 | 14.15 | 2.57 | 203 |
ᵃ Repetitive region length, measured in amino acid residues.
ᵇ Clustering was used to define repeat classes. Should a protein contain three different, co-localized STRs, the clustering method will produce six clusters: three with regular STRs and three with fused repeats. See also the supplementary material for more information.
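
The "% of proteins with STRs" column is the number of proteins with STRs divided by the number of proteins in each group. As a quick consistency check, the following minimal Python sketch recomputes that column from the counts in the table. The names `TABLE`, `str_percentage` and `check_table` are illustrative assumptions and are not part of the original analysis.

```python
# Counts copied from the table above:
# group -> (number of proteins, number of proteins with STRs, reported %).
# All names here are hypothetical and chosen for illustration only.
TABLE = {
    "UniProtKB/Swiss-Prot (total)": (554241, 28003, 5.05),
    "Archaea": (19525, 351, 1.80),
    "Bacteria": (333691, 6794, 2.04),
    "Euk: Fungi": (33613, 3996, 11.89),
    "Euk: Invertebrata": (27607, 3372, 12.21),
    "Euk: Vertebrata": (18292, 1461, 7.99),
    "Euk: Plants": (42101, 3601, 8.55),
    "Viruses": (16852, 889, 5.28),
}


def str_percentage(n_proteins: int, n_with_strs: int) -> float:
    """Return the percentage of proteins that contain at least one STR."""
    return 100.0 * n_with_strs / n_proteins


def check_table(table, tolerance=0.005):
    """Return the groups whose recomputed percentage deviates from the
    reported value by more than the two-decimal rounding tolerance."""
    mismatches = []
    for group, (n_proteins, n_with_strs, reported) in table.items():
        if abs(str_percentage(n_proteins, n_with_strs) - reported) > tolerance:
            mismatches.append(group)
    return mismatches


if __name__ == "__main__":
    for group, (n_proteins, n_with_strs, reported) in TABLE.items():
        recomputed = str_percentage(n_proteins, n_with_strs)
        print(f"{group}: {recomputed:.2f}% (reported {reported:.2f}%)")
    print("Mismatches:", check_table(TABLE))
```

Under these assumptions, every row should agree with the reported percentage to two decimal places, so the list of mismatches is expected to be empty.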
Figure 1. DNA alignment of a ∼39 kb-long DNA region containing the yrIlm gene and flanking CDS in Y. ruckeri genomes deposited in GenBank. Each CDS is indicated by a yellow arrow, with the percentage of sequence identity to CSF007-82 reported inside the arrow. yrIlm consists of an array of tandemly repeated, identical Ig-like domains (in red) and, in addition, Ig-like domains of lower pairwise sequence similarity (in orange). It is usually capped by a C-type lectin domain (CTLD, in green). The dashed lines indicate gaps in the DNA alignment. In strain 150, the grey box indicates a contig break in the assembly. The asterisk (*) indicates assemblies generated through PacBio SMRT sequencing. Note that the other assemblies have significantly lower repeat numbers, suggesting that the repeats were not found using short-read sequencing technologies. Modified from Wrobel, A., Ottoni, C., Leo, J.C., Gulla, S. and Linke, D. (2018) The repeat structure of two paralogous genes, Yersinia ruckeri invasin (yrInv) and a ‘Y. ruckeri invasin-like molecule’ (yrIlm), sheds light on the evolution of adhesive capacities of a fish pathogen. Journal of Structural Biology, 201, 171–183, with permission from Elsevier.