Thomas Wicker, Apurva Narechania, Francois Sabot, Joshua Stein, Giang T H Vu, Andreas Graner, Doreen Ware, Nils Stein.
Abstract
BACKGROUND: Barley has one of the largest and most complex genomes of all economically important food crops. The rise of new short-read sequencing technologies such as Illumina/Solexa permits such large genomes to be effectively sampled at relatively low cost. Based on the corresponding sequence reads, a Mathematically Defined Repeat (MDR) index can be generated to map repetitive regions in genomic sequences.
Year: 2008 PMID: 18976483 PMCID: PMC2584661 DOI: 10.1186/1471-2164-9-518
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Previously published barley sequences used for re-annotation.
| sequence | size (bp) | genes* | TE content (bp) | Reference |
| AF474373 | 124052 | 9 | 70585 (56.9%) | [ |
| AF521177 | 211664 | 14 | 80220 (37.9%) | [ |
| AY268139 | 120562 | 2 | 86442 (71.7%) | [ |
| AY485643 | 114996 | 10 | 51173 (44.5%) | [ |
| AY642926 | 184425 | 5 | 88708 (48.1%) | [ |
| AY643842S2 | 129099 | 6 | 92563 (71.7%) | [ |
| AY643842S3 | 160856 | 5 | 128684 (80.0%) | [ |
| AY661558 | 439775 | 2 | 385242 (87.6%) | [ |
| EF067844 | 518343 | 5 | 422967 (81.6%) | [ |
| Total | 2003772 | | 1406590 (70.2%) | |
*Based on expert annotation; count includes gene fragments and pseudogenes and may differ from gene number given in the respective publication.
Previously published sequences from diploid wheat T. monococcum used for re-annotation and comparison with their MDR profiles
| sequence | size (bp) | genes* | TE content (bp) | Reference |
| AY491681 | 101082 | 8 | 43970 (43.5%) | [ |
| AY951944 | 190450 | 4 | 134076 (70.4%) | [ |
| AF459639 | 215222 | 5 | 157972 (73.4%) | [ |
| AF326781 | 211009 | 5 | 160577 (76.1%) | [ |
| AY146588 | 285425 | 4 | 232906 (81.6%) | [ |
| AY485644 | 438809 | 8 | 313748 (71.5%) | [ |
| AY188331 | 133606 | 1 | 112229 (84.0%) | [ |
| AY188332 | 95522 | 2 | 78041 (81.7%) | [ |
| AY188333 | 112309 | 1 | 85242 (75.9%) | [ |
| Total | 1783434 | | 1318766 (73.9%) | |
*Based on expert annotation; count includes gene fragments and pseudogenes and may differ from gene number given in the respective publication.
Figure 1. K-mer composition of the MDR index. The fraction of collapsed discrete and all 20-mers in the set is shown as a function of the repeat level up to 500 copies. The curve for the collapsed discrete 20-mers converges to 1 rapidly, indicating that most 20-mers in the set are relatively infrequent in the genome. The curve that plots all available 20-mers converges more slowly, reflecting a small fraction of high-frequency 20-mers in the set.
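The two curves described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names `kmer_counts` and `cumulative_fractions` are hypothetical, "collapsed discrete" is assumed to mean distinct k-mers, and reads are counted on the given strand only.

```python
from collections import Counter


def kmer_counts(reads, k=20):
    """Count every k-mer occurring in a collection of reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def cumulative_fractions(counts, max_level):
    """For each repeat level 1..max_level, return
    (fraction of distinct k-mers with count <= level,
     fraction of all k-mer occurrences contributed by those k-mers)."""
    if not counts:
        raise ValueError("empty k-mer index")
    # spectrum[level] = number of distinct k-mers seen exactly `level` times
    spectrum = Counter(counts.values())
    n_distinct = sum(spectrum.values())
    n_total = sum(level * n for level, n in spectrum.items())
    distinct_frac, total_frac = [], []
    seen_distinct = seen_total = 0
    for level in range(1, max_level + 1):
        n = spectrum.get(level, 0)
        seen_distinct += n
        seen_total += level * n
        distinct_frac.append(seen_distinct / n_distinct)
        total_frac.append(seen_total / n_total)
    return distinct_frac, total_frac
```

With a toy input such as `kmer_counts(["AAAAA", "ACGTA"], k=3)`, three of the four distinct 3-mers occur once, so the distinct curve starts at 0.75 while the all-occurrences curve starts at 0.5. This is the same qualitative gap between the two curves that the caption describes.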
Figure 2. MDR plots of publicly available sequences and their corresponding expert annotations. The MDR plots at the top of each panel indicate the coverage with 20-mers at each position of the sequence. Note that the scale for the MDR signal is logarithmic. The corresponding expert annotation is displayed underneath the plot. TEs are indicated as coloured boxes with each colour corresponding to a TE superfamily. Nested TEs are raised above those into which they have inserted. The highly abundant elements in (a) at positions 17 kb – 38 kb (and all others with the same MDR signal strength) represent BARE1 elements, the most abundant TE in barley. Panels (a) through (c) represent sequences from barley, while (d) and (e) are sequences from einkorn wheat Triticum monococcum. Note that the MDR signal is much weaker in the T. monococcum sequences.
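A per-position MDR signal of the kind plotted here can be approximated as follows. This is a sketch under stated assumptions: `counts` is any mapping from 20-mer to its frequency in the read set, the score is assigned to the start position of each 20-mer, and `log10(count + 1)` is one possible logarithmic scaling. None of these choices is confirmed by the source, and the function names are hypothetical.

```python
import math


def mdr_profile(sequence, counts, k=20):
    """Look up, for each start position, how often that k-mer occurs in the index.
    k-mers absent from the index score 0."""
    return [counts.get(sequence[i:i + k], 0)
            for i in range(len(sequence) - k + 1)]


def log_scale(profile):
    """Compress the dynamic range for plotting; +1 keeps zero counts defined."""
    return [math.log10(c + 1) for c in profile]
```

For example, `mdr_profile("ACGTT", {"ACG": 5, "CGT": 100}, k=3)` yields `[5, 100, 0]`. On a linear axis the middle value would dominate the plot, which is why a logarithmic scale is useful when abundant elements such as BARE1 sit next to low-copy regions.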
Comparison of the fractions that were identified as repetitive by manual annotation and through MDR analysis.
| Sequence | Exp (1) | MDR (2) | OL (3) | New (4) | Total (5) |
| AF474373 | 56.8 | 48.1 | 40.9 | 7.1 | 63.9 |
| AF521177 | 37.4 | 23.3 | 18.5 | 4.8 | 42.2 |
| AY268139 | 71.7 | 58.6 | 55.6 | 3.0 | 74.7 |
| AY485643 | 44.4 | 38.2 | 30.8 | 7.4 | 51.8 |
| AY642926 | 48.1 | 38.0 | 32.9 | 5.1 | 53.2 |
| AY643842S2 | 71.7 | 50.5 | 48.6 | 1.9 | 73.6 |
| AY643842S3 | 79.9 | 69.3 | 68.0 | 1.3 | 81.2 |
| AY661558 | 87.6 | 62.1 | 59.6 | 2.5 | 90.1 |
| EF067844 | 81.5 | 57.7 | 56.0 | 1.7 | 83.2 |
| Average | 64.3 | 49.5 | 45.6 | 3.9 | 68.2 |
All figures are in % of the total length of the sequence analysed.
(1) Repetitive fraction identified by expert annotation.
(2) Repetitive fraction identified by MDR analysis.
(3) Overlap, fraction covered by both.
(4) Fraction identified as repetitive exclusively through MDR.
(5) Expert annotation and MDR analysis combined.
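The columns of this table are related by simple set arithmetic: New is the MDR fraction minus the overlap, and Total is the expert fraction plus New (for AF474373, 56.8 + 7.1 = 63.9). A minimal sketch of that calculation on interval lists is given below. The interval representation (0-based, half-open) and the function names are assumptions made for illustration only.

```python
def coverage_mask(length, intervals):
    """Mark every base covered by at least one (start, end) interval, end exclusive."""
    mask = [False] * length
    for start, end in intervals:
        for i in range(max(start, 0), min(end, length)):
            mask[i] = True
    return mask


def compare_masks(length, expert, mdr):
    """Return the five table columns as percentages of the sequence length."""
    e = coverage_mask(length, expert)
    m = coverage_mask(length, mdr)
    n_exp, n_mdr = sum(e), sum(m)
    n_both = sum(1 for a, b in zip(e, m) if a and b)

    def pct(n):
        return 100.0 * n / length

    return {
        "Exp": pct(n_exp),                     # expert annotation
        "MDR": pct(n_mdr),                     # MDR analysis
        "OL": pct(n_both),                     # covered by both
        "New": pct(n_mdr - n_both),            # MDR only
        "Total": pct(n_exp + n_mdr - n_both),  # union of both
    }
```

Calling `compare_masks(100, [(0, 50)], [(40, 70)])` gives Exp 50, MDR 30, OL 10, New 20 and Total 70. Small deviations from these identities in the published rows (for example 48.1 - 40.9 versus the reported 7.1) are consistent with rounding of the individual columns.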
Figure 3. Detection of gene-containing portions of low-pass survey sequenced BAC clones. For each clone, the MDR plot is shown at the top. Underneath, the positions of gaps in the sequence are indicated as vertical bars on horizontal black lines. Regions that were repeat masked are indicated as light grey boxes. Candidate gene islands are indicated at the bottom as dark grey bars, with genes indicated as black boxes (CNS: conserved non-coding sequence).
Comparison of fractions of de novo sequenced BACs that were identified as repetitive by repeat masking and MDR analysis.
| BAC | size (bp) | BLASTN (1) | BLASTX (2) | combined (3) | MDR (4) | combined (5) | New (6) |
| 773C20 | 91561 | 67.5 | 32.6 | 71.4 | 52.3 | 74.4 | 3 |
| 333E11 | 96473 | 57.0 | 23.0 | 59.5 | 47.8 | 65.9 | 6.5 |
| 567E19 | 98005 | 69.3 | 25.8 | 71.2 | 60.1 | 76.5 | 5.3 |
| 89E23 | 104909 | 26.8 | 9.7 | 30.1 | 25.5 | 38.1 | 8 |
| 789F12 | 97547 | 73.9 | 36.7 | 79.7 | 57.2 | 81.7 | 2.1 |
| 318G23 | 93787 | 67.3 | 20.1 | 70.6 | 53.1 | 77.1 | 6.5 |
| 104J20 | 112610 | 72.9 | 33.8 | 77.4 | 55.5 | 80.5 | 3.2 |
| 297P14 | 78796 | 70.9 | 24.6 | 71.0 | 57.6 | 73.6 | 2.6 |
| Total | 773688 | 62.8 | 25.8 | 66.1 | 50.8 | 70.7 | 4.7 |
All figures are in % of the total length of the sequence analysed.
(1) Fraction masked based on BLASTN search against TREP.
(2) Fraction masked based on BLASTX search against PTREP.
(3) Fraction masked when information from BLASTN and BLASTX is combined.
(4) Repetitive fraction identified by MDR analysis.
(5) Fraction masked when information from BLAST and MDR is combined.
(6) Fraction identified as repetitive exclusively through MDR.
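To report a "repetitive fraction identified by MDR analysis", a per-position k-mer count profile has to be converted into masked regions. The text here does not state how that was done, so the following is only one plausible sketch: the cutoff `threshold` is a hypothetical parameter, each k-mer at or above it is assumed to mask all k bases it spans, and the function names are invented for illustration.

```python
def repetitive_intervals(profile, threshold, k=20):
    """Merge all k-mers with count >= threshold into half-open (start, end) intervals.
    profile[i] is the count of the k-mer starting at position i."""
    intervals = []
    for i, count in enumerate(profile):
        if count < threshold:
            continue
        start, end = i, i + k
        if intervals and start <= intervals[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            intervals[-1][1] = max(intervals[-1][1], end)
        else:
            intervals.append([start, end])
    return [tuple(iv) for iv in intervals]


def masked_percent(intervals, length):
    """Percentage of a sequence of the given length covered by the intervals."""
    return 100.0 * sum(end - start for start, end in intervals) / length
```

For a toy profile `[0, 5, 5, 0, 0, 0, 0, 9]` with `threshold=5` and `k=3`, the result is `[(1, 5), (7, 10)]`, which masks 70% of the underlying 10-base sequence. The resulting intervals could then be combined with BLAST-based masks by taking their union.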
Figure 4. MDR plots of a BARE1 and a Caspar element. The linear scale (top) illustrates the strong variation in relative abundance between the two elements and also between different regions within each element. The LTRs of BARE1 are roughly 3-fold over-represented, whereas a region containing tandem repeats in the Caspar element is at least 20 times more abundant than the rest of the element. The grey box indicates a region of low-complexity DNA. The logarithmic representation allows easy identification of variable regions (e.g. in the BARE1 LTR and between the two CDS in Caspar).