Chong Chu, Jingwen Pei, Yufeng Wu.
Abstract
BACKGROUND: Repeat elements are important components of most eukaryotic genomes. Most existing tools for repeat analysis rely on either high-quality reference genomes or existing repeat libraries. Repeat analysis therefore remains challenging for species with highly repetitive or complex genomes, which often lack good reference genomes or annotated repeat libraries. We recently developed a computational method called REPdenovo that constructs consensus repeat sequences directly from short sequence reads and outperforms an existing tool called RepARK. One major issue with REPdenovo is that it does not perform well for repeats with relatively high divergence rates or low copy numbers. In this paper, we present an improved approach for constructing consensus repeats directly from short reads. Compared with the original REPdenovo, the improved approach uses more repeat-related k-mers and improves repeat assembly quality using a consensus-based k-mer processing method.
Keywords: De novo genome assembly; Repeat elements; Sequence analysis
Year: 2018 PMID: 30367582 PMCID: PMC6101065 DOI: 10.1186/s12864-018-4920-6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 High-level procedure of improved repeat construction. Thick bars: genomic sequences. Yellow thick bars: repeat copies. Colored squares within thick bars: mutations (substitutions and indels) within repeats. Thin bars: k-mers. There are six main steps. (a) Count k-mers in the reads. (b) Find the highly frequent k-mers and the k-mers with intermediate frequencies according to a user-specified cutoff on k-mer frequency. (c) Find repeat-related k-mers by aligning the k-mers of intermediate frequency to the highly frequent k-mers. (d) Improve k-mer quality with a consensus-based approach. (e) Assemble the improved k-mers. (f) Merge contigs that have a reliable prefix-suffix overlap.
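Steps (a) to (c) of this procedure can be illustrated with a toy Python sketch. It is not the REPdenovo implementation: the function names, the two cutoffs `high_cutoff` and `mid_cutoff`, and the use of a Hamming-distance test in place of the alignment step are all assumptions made for illustration, and reverse complements are ignored.

```python
from collections import Counter


def count_kmers(reads, k):
    """Step (a): count every k-mer occurring in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_frequency(counts, high_cutoff, mid_cutoff):
    """Step (b): separate highly frequent k-mers from intermediate ones.

    Both cutoffs are hypothetical parameters chosen for this sketch.
    """
    high = {kmer for kmer, c in counts.items() if c >= high_cutoff}
    mid = {kmer for kmer, c in counts.items() if mid_cutoff <= c < high_cutoff}
    return high, mid


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def recruit_related(high, mid, max_mismatches=1):
    """Step (c), simplified: keep intermediate k-mers that lie within
    max_mismatches substitutions of some highly frequent k-mer.
    A mismatch-only comparison stands in for a real alignment here."""
    return {m for m in mid if any(hamming(m, h) <= max_mismatches for h in high)}


# Toy reads: five copies of one repeat, two divergent copies, one unrelated read.
reads = ["ACGTACGT"] * 5 + ["ACGAACGT"] * 2 + ["TTTTGGGG"]
counts = count_kmers(reads, 4)
high, mid = split_by_frequency(counts, high_cutoff=5, mid_cutoff=2)
related = recruit_related(high, mid)
print(sorted(high))     # ['ACGT', 'CGTA', 'GTAC', 'TACG']
print(sorted(related))  # ['AACG', 'ACGA', 'CGAA', 'GAAC']
```

In this toy input, the k-mers from the divergent copies fall below the high-frequency cutoff but are recovered because each differs from a frequent k-mer by one substitution, while the k-mers of the unrelated read are discarded.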
Fig. 2 Illustration of two example repeats that are not fully constructed by the original REPdenovo. Highly frequent 30-mers of one human individual, NA19230, are aligned to the human consensus repeats in Repbase. The left part (a) shows the alignments on repeat “LTR2B”. Two gaps form where 30-mers originating from highly divergent regions have low frequencies due to repeat copy divergence. The right part (b) shows the alignments on repeat “LTR10C”. The colored bars are variations among copies. The assembled contigs are fragmented because the 30-mers are highly divergent.
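The effect described in this caption can be reproduced on a small scale. The sketch below is a toy example with made-up sequences, k = 4 instead of 30, and hypothetical cutoffs. It counts k-mers over six copies of a short consensus, four of which carry a substitution, and reports which consensus positions are not covered by any k-mer at or above the frequency cutoff.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count all k-mers over a collection of sequences."""
    return Counter(s[i:i + k] for s in sequences for i in range(len(s) - k + 1))


def uncovered_positions(consensus, counts, k, cutoff):
    """Consensus positions not covered by any k-mer with count >= cutoff."""
    covered = [False] * len(consensus)
    for i in range(len(consensus) - k + 1):
        if counts[consensus[i:i + k]] >= cutoff:
            covered[i:i + k] = [True] * k
    return [pos for pos, ok in enumerate(covered) if not ok]


consensus = "ACGTTGCAAGCT"
copies = (
    [consensus] * 2            # two exact copies
    + ["ACGTTACAAGCT"] * 2     # substitution at position 5 (G -> A)
    + ["ACGTTGCCAGCT"] * 2     # substitution at position 7 (A -> C)
)
counts = count_kmers(copies, k=4)

strict_gap = uncovered_positions(consensus, counts, k=4, cutoff=5)
relaxed_gap = uncovered_positions(consensus, counts, k=4, cutoff=4)
print(strict_gap)   # [5, 6, 7]
print(relaxed_gap)  # []
```

With the stricter cutoff, the k-mers spanning the divergent positions drop out and leave a gap in the middle of the consensus. Lowering the cutoff, so that k-mers of intermediate frequency are also admitted, closes the gap in this toy case.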
Comparison between the two versions of REPdenovo and RepARK on Human, Arabidopsis thaliana, and Drosophila melanogaster data
| Species | Method | Repeats constructed | Repbase hits (≥85% similarity) | Repbase hits (≥0% similarity) | Avg. Repbase coverage | Avg. Repbase coverage by longest repeat |
|---|---|---|---|---|---|---|
| Human | REPdenovo* | 6192 | 108 | 332 | 0.61 | 0.49 |
| Human | REPdenovo | 4648 | 89 | 220 | 0.66 | 0.55 |
| Human | RepARK | 2046 | 1 | 168 | 0.34 | 0.21 |
| Arabidopsis | REPdenovo* | 808 | 24 | 102 | 0.42 | 0.31 |
| Arabidopsis | REPdenovo | 508 | 11 | 68 | 0.46 | 0.34 |
| Arabidopsis | RepARK | 632 | 8 | 59 | 0.33 | 0.21 |
| Drosophila | REPdenovo* | 3644 | 69 | 177 | 0.83 | 0.61 |
| Drosophila | REPdenovo | 3031 | 33 | 133 | 0.67 | 0.49 |
| Drosophila | RepARK | 2787 | 26 | 133 | 0.66 | 0.44 |
REPdenovo*: the new method. Repeats constructed: the total number of repeats constructed. Repbase hits (≥85% similarity) and Repbase hits (≥0% similarity): the number of hit Repbase repeats with at least 85% and 0% similarity, respectively. Avg. Repbase coverage: the average fraction of a Repbase repeat that is covered by the constructed repeats. Avg. Repbase coverage by longest repeat: the average Repbase coverage by the longest assembled repeat.
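The two coverage measures defined in this footnote can be sketched for a single Repbase repeat as follows. This is one plausible reading, not the authors' evaluation code: it assumes each constructed repeat has already been aligned to the Repbase repeat and reduced to a half-open interval `(start, end)` on it, and it takes "coverage by the longest assembled repeat" to mean the fraction covered by the single longest interval. The function names are hypothetical.

```python
def union_length(intervals):
    """Number of positions covered by a set of half-open intervals [start, end)."""
    covered = 0
    current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from everything seen so far.
            covered += end - start
            current_end = end
        elif end > current_end:
            # Overlaps the previous block; count only the new part.
            covered += end - current_end
            current_end = end
    return covered


def coverage_stats(repeat_length, hits):
    """Return (coverage by all hits, coverage by the longest single hit)."""
    all_cov = union_length(hits) / repeat_length
    longest = max((end - start for start, end in hits), default=0)
    return all_cov, longest / repeat_length


# Three constructed repeats aligned to a 1000 bp Repbase repeat.
hits = [(0, 300), (250, 600), (800, 1000)]
all_cov, longest_cov = coverage_stats(1000, hits)
print(all_cov, longest_cov)  # 0.8 0.35
```

Averaging these two values over all hit Repbase repeats would give figures comparable in form to the two coverage columns of the table.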
Fig. 3 Comparison of the fully constructed repeats in Repbase for the two versions of REPdenovo. Filled circles: hit Repbase repeats constructed by both versions of REPdenovo. Empty circles: hit Repbase repeats constructed only by the new version. The inset in the upper-right corner zooms in on the region inside the red rectangle. There are 154 filled circles (out of 220) and 57 empty circles. Most of these 57 fall in regions of higher divergence and lower copy number (the regions inside the blue rectangles).
Fig. 4 Length distribution of the selected constructed Hummingbird repeats
Masking information of the 1617 Hummingbird repeats validated by long reads
| Category | LINE | SINE | LTR | Retroposon | Satellite | Simple_repeat | Low_complexity | rRNA | Other |
|---|---|---|---|---|---|---|---|---|---|
| Unique | 371 | 0 | 139 | 0 | 10 | 98 | 31 | 0 | 5 |
| Dup. | 557 | 6 | 244 | 0 | 19 | 216 | 52 | 1 | 81 |
For one repeat, RepeatMasker may report several hits, depending on whether the repeat is composed of regions of different repeat types. “Unique” counts only those repeats with one unique masked repeat family, while “Dup.” allows one repeat to be counted more than once.
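The two counting rules can be made concrete with a short sketch. It assumes, as an interpretation of the footnote, that “Dup.” counts a repeat once for every distinct class among its RepeatMasker hits. The input mapping and the contig names are hypothetical, and parsing of RepeatMasker output is not shown.

```python
from collections import Counter


def tally_masking(hits_by_repeat):
    """Tally repeat classes under the "Unique" and "Dup." rules.

    hits_by_repeat maps a constructed repeat ID to the list of repeat
    classes reported for it (possibly empty, possibly with duplicates).
    """
    unique = Counter()
    dup = Counter()
    for classes in hits_by_repeat.values():
        distinct = set(classes)
        if len(distinct) == 1:
            # Exactly one masked class: counted under "Unique".
            unique[next(iter(distinct))] += 1
        for cls in distinct:
            # Assumption: one count per distinct class under "Dup.".
            dup[cls] += 1
    return unique, dup


hits = {
    "contig_1": ["LINE"],
    "contig_2": ["LINE", "LTR"],   # composite repeat, counted twice in "Dup."
    "contig_3": ["LTR", "LTR"],    # two hits of the same class
    "contig_4": [],                # not masked at all
}
unique, dup = tally_masking(hits)
print(dict(unique))  # {'LINE': 1, 'LTR': 1}
print(dict(dup))     # {'LINE': 2, 'LTR': 2}
```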
Fig. 5 Comparison between the two versions of REPdenovo on constructing one sample repeat, “LTR2B”. The old version generates three pieces of the repeat, while the new version constructs the whole repeat.