Erik Arner, Ellen Kindlund, Daniel Nilsson, Fatima Farzana, Marcela Ferella, Martti T Tammi, Björn Andersson.
Abstract
BACKGROUND: Repeats are present in all genomes, and often have important functions. However, in large genome sequencing projects, many repetitive regions remain uncharacterized. The genome of the protozoan parasite Trypanosoma cruzi consists of more than 50% repeats. These repeats include surface molecule genes, and several other gene families. In the T. cruzi genome sequencing project, it was clear that not all copies of repetitive genes were present in the assembly, due to collapse of nearly identical repeats. However, at the time of publication of the T. cruzi genome, it was not clear to what extent this had occurred.
Year: 2007 PMID: 17963481 PMCID: PMC2204015 DOI: 10.1186/1471-2164-8-391
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Cassette estimation. Schematic description of the procedure for cassette estimation. The top of the figure shows the layout of a repeated region. The repeat unit can consist of a gene, as well as flanking, non-coding sequence, and the repeat array is flanked by unique sequence (indicated by dashed lines). The middle of the figure shows how, typically, the shotgun reads will form a single alignment that represents a merge of all repeat copies. The graph at the bottom of the figure shows the number of deviating bases (y-axis) along the alignment (x-axis). The number of bases deviating from consensus will typically be low within the alignment, and increase towards the ends due to the presence of shotgun reads partially sampling unique sequence.
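The deviation profile described in this caption can be illustrated with a minimal sketch. It assumes reads are given as hypothetical `(start, sequence)` pairs in alignment coordinates and takes the consensus at each column to be the most frequent base; the function name and input layout are illustrative and are not taken from the paper.

```python
from collections import Counter


def deviations_per_column(reads):
    """Count, for each alignment column, the bases that differ from consensus.

    reads: list of (start, sequence) tuples (hypothetical input layout).
    The consensus of a column is assumed to be its most frequent base.
    """
    length = max(start + len(seq) for start, seq in reads)
    columns = [Counter() for _ in range(length)]

    # Tally the bases observed in every column of the alignment.
    for start, seq in reads:
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1

    deviating = []
    for column in columns:
        if not column:
            deviating.append(0)  # no read covers this column
            continue
        consensus_count = column.most_common(1)[0][1]
        deviating.append(sum(column.values()) - consensus_count)
    return deviating


# Toy alignment: one read differs from the others at both ends.
toy_reads = [(0, "ACGT"), (0, "ACGT"), (0, "TCGA"), (1, "CGT")]
profile = deviations_per_column(toy_reads)
```

In this toy alignment the deviations fall at the first and last columns, mirroring the rise towards the ends of the alignment that the caption attributes to reads partially sampling unique sequence.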
Figure 2. Distribution of estimated copy numbers. The estimated copy number is calculated for each annotation by averaging the alignment depth along the annotation and dividing the average by 7, the average shotgun coverage. The distributions of all annotations (red), of all hypothetical genes (green) and of all trans-sialidase annotations (blue) are shown. Each distribution shows the number of annotations (y-axis) for each estimate (x-axis). The graph to the left (A) shows the peak of the first two distributions at 2. The graph to the right (B) is a zoomed-in version of the higher estimates. The average estimated copy number of the trans-sialidases is 16.
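The copy number estimate described in this caption is a simple ratio, sketched below. The divisor 7 (the average shotgun coverage) comes from the caption; the function name and the list-of-depths input are assumptions made for illustration.

```python
AVERAGE_SHOTGUN_COVERAGE = 7  # average shotgun coverage stated in the caption


def estimated_copy_number(depths, coverage=AVERAGE_SHOTGUN_COVERAGE):
    """Average the alignment depth along an annotation and divide by coverage.

    depths: alignment depth at each sampled position of the annotation
    (hypothetical input layout).
    """
    if not depths:
        raise ValueError("at least one depth value is required")
    mean_depth = sum(depths) / len(depths)
    return mean_depth / coverage


# An annotation with a uniform depth of 14 reads is estimated at 2 copies.
two_copies = estimated_copy_number([14, 14, 14])
```

Rounding to the nearest integer is left to the caller, since the caption does not say how fractional estimates were handled.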
Table 1. Draft genome copy numbers and estimated copy numbers of repeated genes
| Annotation | No. Genome copies | No. Genome copies w. depth ≥ 100 reads | Avg. No. Estimated copies |
| trans-sialidase | 1430 | 725 | 24 |
| MASP | 1377 | 913 | 23 |
| RHS | 752 | 166 | 26 |
| TcMUCII | 728 | 708 | 38 |
| DGF-1 | 565 | 231 | 57 |
| GP63 | 425 | 158 | 27 |
| protein kinase | 311 | 38 | 36 |
| elongation factor 1-gamma | 178 | 13 | 18 |
| ATPase | 119 | 5 | 119 |
| ATP-dependent DEAD/H RNA helicase | 99 | 51 | 28 |
| UDP-Gal or UDP-GlcNac-dependent glycosyltransferase | 66 | 7 | 21 |
| beta galactofuranosyl glycosyltransferase | 66 | 52 | 27 |
| TcMUC | 59 | 44 | 44 |
| N-acetyltransferase complex ARD1 subunit | 58 | 52 | 28 |
| TcMUCI | 57 | 55 | 41 |
| serine/threonine protein kinase | 47 | 6 | 31 |
| glycine dehydrogenase | 37 | 36 | 27 |
| mucin-like glycoprotein | 28 | 9 | 23 |
| receptor-type adenylate cyclase | 25 | 2 | 17 |
| tryptophanyl-tRNA synthetase | 23 | 21 | 44 |
| amino acid transporter | 22 | 7 | 38 |
| histone H4 | 21 | 21 | 79 |
| casein kinase | 20 | 12 | 123 |
| histone H2A | 18 | 15 | 100 |
| DnaJ chaperone protein | 18 | 4 | 23 |
| heat shock protein 70 | 18 | 3 | 100 |
| metallopeptidase | 17 | 4 | 19 |
| expression site-associated gene (ESAG-like) | 16 | 7 | 16 |
| cation transporter | 16 | 6 | 55 |
| myosin heavy chain | 13 | 1 | 19 |
| mannosyl-oligosaccharide 1, 2-alpha mannosidase 1B | 13 | 9 | 20 |
| tyrosine aminotransferase | 13 | 9 | 38 |
| cystathionine beta-synthase | 12 | 12 | 63 |
| elongation factor 1-alpha | 12 | 10 | 39 |
| amastin | 12 | 6 | 19 |
| TcSMUGS | 11 | 11 | 39 |
| zinc finger protein | 11 | 1 | 16 |
| histone H3 | 11 | 9 | 42 |
| tuzin | 10 | 8 | 39 |
| DNA-directed RNA polymerase subunit | 10 | 7 | 14 |
| cysteine peptidase | 9 | 12 | 59 |
| helicase | 8 | 1 | 18 |
| flagellar calcium-binding protein | 8 | 8 | 66 |
| TcSMUGL | 8 | 8 | 61 |
| glutamamyl carboxypeptidase | 7 | 6 | 144 |
| hexose transporter | 7 | 6 | 35 |
| clathrin coat assembly protein AP19 | 7 | 2 | 76 |
| metacaspase | 6 | 4 | 25 |
| elongation factor 2 | 6 | 3 | 15 |
| chaperonin HSP60, mitochondrial precursor | 5 | 5 | 32 |
| calcium-binding protein | 5 | 3 | 74 |
| folate/pteridine transporter | 5 | 5 | 20 |
| heat shock protein 85 | 5 | 5 | 54 |
| serine carboxypeptidase (CBP1) | 5 | 5 | 31 |
| cruzipain precursor | 5 | 4 | 81 |
| kinase | 4 | 1 | 32 |
| histone H2B | 4 | 2 | 63 |
| trichohyalin | 4 | 1 | 16 |
| D-isomer specific 2-hydroxyacid dehydrogenase protein | 4 | 3 | 29 |
| membrane-bound acid phosphatase | 3 | 1 | 20 |
| antigenic protein | 3 | 1 | 18 |
| prostaglandin F2-alpha synthase | 3 | 3 | 28 |
| mitochondrial RNA editing lipase 1 | 3 | 1 | 15 |
| microtubule-associated protein Gb4 | 3 | 1 | 15 |
| oligosaccharyl transferase subunit | 3 | 1 | 19 |
| ubiquitin | 3 | 1 | 17 |
| 69 kDa paraflagellar rod protein | 2 | 2 | 16 |
| clathrin assembly protein | 2 | 2 | 76 |
| glucose transporter | 2 | 1 | 27 |
| pteridine reductase | 2 | 1 | 18 |
| L-threonine 3-dehydrogenase | 2 | 2 | 24 |
| TcMUCIII | 2 | 2 | 38 |
| U3 small nuclear ribonucleoprotein snRNP | 2 | 2 | 25 |
| acetylornithine deacetylase | 1 | 1 | 153 |
| cytochrome oxidase assembly protein | 1 | 1 | 16 |
| heat shock protein 83 | 1 | 1 | 60 |
| monoglyceride lipase | 1 | 1 | 49 |
| paraxonemal rod protein PAR2 | 1 | 1 | 15 |
The table lists the 78 annotated genes with an average alignment depth of 100 or higher. The table does not include the hypothetical genes. The first column lists the name of the annotated gene. The second column lists the number of copies of that particular annotation in the draft genome. The third column lists the number of these annotated copies that have an average shotgun depth of 100 or higher, corresponding to an estimated copy number of 14 or higher. The fourth column lists the average estimated copy number of those annotated copies having an average shotgun depth of 100 or higher (listed in column three).
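The three numeric columns described above can be derived from the per-copy average depths of one annotation, as in the following sketch. The threshold of 100 reads and the coverage of 7 come from the text; the function name and input format are hypothetical.

```python
def summarize_annotation(copy_depths, depth_threshold=100, coverage=7):
    """Return the three numeric table columns for a single annotation.

    copy_depths: average shotgun depth of each annotated copy in the
    draft genome (hypothetical input layout).
    """
    # Column 3: copies whose average depth reaches the threshold.
    deep_copies = [d for d in copy_depths if d >= depth_threshold]

    # Column 4: average estimated copy number over those copies only.
    if deep_copies:
        avg_estimate = sum(deep_copies) / len(deep_copies) / coverage
    else:
        avg_estimate = None

    # Column 2 is simply the number of annotated copies.
    return len(copy_depths), len(deep_copies), avg_estimate


# Three annotated copies, two of which exceed the depth threshold.
n_copies, n_deep, avg_estimate = summarize_annotation([9, 103, 140])
```

Because the estimate is linear in depth, averaging the per-copy estimates gives the same value as dividing the mean depth by the coverage.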
Figure 3. Coverage of two trans-sialidases. A and B show two contigs with annotated putative trans-sialidases. C and D show the coverage every 100 bp along the genes. Tc00.1047053511875.20 (A, C) has an average shotgun depth of 9, indicating only one copy in the genome. Tc00.1047053511105.60 (B, D) has an average depth of 103, indicating that 15 copies are actually present in the genome. This example shows how trans-sialidases in T. cruzi can be either unique in terms of sequence similarity (Tc00.1047053511875.20) or closely resemble many others (Tc00.1047053511105.60).
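A coverage profile at 100 bp resolution, like the one plotted in panels C and D, could be computed roughly as follows. This sketch assumes the plotted value is the mean depth of each 100 bp window (the caption does not say whether depth was averaged or sampled), and the function name is illustrative.

```python
def window_coverage(per_base_depth, window=100):
    """Mean depth of consecutive windows along a gene.

    per_base_depth: read depth at every base of the gene (hypothetical input).
    The final window may be shorter than `window` bases.
    """
    means = []
    for start in range(0, len(per_base_depth), window):
        chunk = per_base_depth[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


# A 150 bp toy gene: depth 10 over the first 100 bp, then depth 20.
toy_profile = window_coverage([10] * 100 + [20] * 50)
```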
Table 2. Functional classification of repeated genes
| Function | No. of diff. annotations |
| Metabolism | 16 |
| Cell growth, division and DNA synthesis | 11 |
| Protein synthesis | 4 |
| Protein destination | 13 |
| Transport proteins | 8 |
| Signal transduction | 4 |
| Cellular organization | 4 |
| Surface antigens | 11 |
| Other | 8 |
The 78 annotations from Table 1 were clustered into functional groups. Note the remarkable spread of function of the repeated genes in T. cruzi.
Figure 4. Active and inactive copies of trans-sialidase. Comparison of two trans-sialidase repeat groups in DNPTrapper. Boxes indicate reads, colored dots indicate DNPs. Only part of the alignment is shown. The reads have been clustered in DNPTrapper based on their DNP content, with reads sharing similar DNP patterns being grouped together. The lower group contains a C – T base substitution (circled) that corresponds to a Tyr – His substitution in the protein, causing this repeat copy to lose its trans-sialidase activity.
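A single C/T difference is enough to switch between Tyr and His because, in the standard genetic code, the Tyr codons (TAT, TAC) and the His codons (CAT, CAC) differ only at the first position. The sketch below illustrates this with the codon pair TAC/CAC, which is chosen for illustration only; the caption does not state which codon occurs in the reads.

```python
# Subset of the standard genetic code covering the two residues involved.
CODON_TABLE = {"TAT": "Tyr", "TAC": "Tyr", "CAT": "His", "CAC": "His"}


def translate(codon):
    """Translate one codon using the small table above."""
    return CODON_TABLE[codon]


def differing_positions(codon_a, codon_b):
    """Zero-based positions at which two codons differ."""
    return [i for i, (a, b) in enumerate(zip(codon_a, codon_b)) if a != b]


# Illustrative codon pair (an assumption, not taken from the reads).
tyr_codon, his_codon = "TAC", "CAC"
residues = (translate(tyr_codon), translate(his_codon))
positions = differing_positions(tyr_codon, his_codon)
```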
Figure 5. Protein sequence alignments of gene with transmembrane regions. Protein sequence alignment of 17 good coverage groups, from EAN81429.1. The names of the amino acid sequences to the left represent the read group their consensus sequence was derived from. Boxes show a region of the sequences with two predicted transmembrane helices (TMH2 and TMH3). The arrows indicate positions inside the TMH regions where there is an amino acid change. Notably, most of the differences occur in TMH2 and fewer in TMH3. Identical residues are shaded. Numbers on the left show the sequence position.