Giuseppe Narzisi, Michael C Schatz.
Abstract
Repetitive sequences are abundant in the human genome. Different classes of repetitive DNA sequences, including simple repeats, tandem repeats, segmental duplications, interspersed repeats, and other elements, collectively span more than 50% of the genome. Because repeat sequences occur in the genome at different scales, they can cause various types of sequence analysis errors, including in alignment, de novo assembly, and annotation. This mini-review highlights the challenges that small-scale repeat sequences, especially near-identical tandem or closely located repeats and short tandem repeats, introduce for discovering DNA insertion and deletion (indel) mutations from next-generation sequencing data. We also discuss the de Bruijn graph sequence assembly paradigm, which is emerging as the most popular and promising approach for detecting indels. The human exome is taken as an example to highlight how these repetitive elements can obscure these types of mutations or introduce errors while detecting them.
Keywords: indel mutation; next-generation sequencing; nucleic acid; repetitive sequences; sequence analysis; sequence assembly; variant detection
Year: 2015 PMID: 25674564 PMCID: PMC4306302 DOI: 10.3389/fbioe.2015.00008
Source DB: PubMed Journal: Front Bioeng Biotechnol ISSN: 2296-4185
Figure 1. Micro-assembly. (A) Typical workflow of the micro-assembly strategy: given a specific region of interest, the reads previously aligned to the reference are extracted and assembled, and the resulting contigs are aligned back to the reference to discover mutations. (B) Schematic representation of a bubble in a de Bruijn graph induced by a heterozygous mutation within a repetitive sequence composed of two near-identical copies (with 1 bp mismatch) that are 15 bp apart. Each block represents a sequence; sequences that have the same base pair composition have the same color; the length of each sequence (in base pairs) is reported inside each block. Each node of the graph is colored according to the order and composition of the sequences it contains. Simple (non-branching) paths are collapsed into a single node. Alt1 and Alt2 are two alternative alignments of the sequence contained in the bottom side of the bubble, where the dashed line represents the alignment gap. The top side of the bubble matches the reference. Any k-mer longer than the longest identical repeat (>49 bp) and shorter than the longest near-identical repeat (<69 bp) would create a bubble like the one depicted in (B), where a jump is allowed from the first copy of the near-identical repeat to the second copy.
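The bubble described in panel (B) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `build_de_bruijn` and `bubble_endpoints`, the toy sequences, and the choice of k = 4 are assumptions made for illustration, and whole sequences stand in for the sequencing reads that a real micro-assembly would use.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def bubble_endpoints(graph):
    """Return (nodes where paths diverge, nodes where paths merge)."""
    indegree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            indegree[node] += 1
    opens = sorted(n for n, succ in graph.items() if len(succ) > 1)
    closes = sorted(n for n, d in indegree.items() if d > 1)
    return opens, closes


# Hypothetical toy data: the variant is the reference with "GTA" deleted.
reference = "ATGGCGTACCTGA"
variant = "ATGGCCCTGA"
opens, closes = bubble_endpoints(build_de_bruijn([reference, variant], k=4))
print(opens, closes)
```

With these toy inputs the two haplotypes share the path up to node GGC, diverge there, and rejoin at node CCT, which is the bubble a heterozygous deletion is expected to produce. The toy sequences contain no repeated 3-mers; a repeat as long as or longer than k - 1 would add further branches and cycles, which is the complication the caption describes.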
Figure 2. Repeat content in the human exome. Repeat content distribution in the human exome target regions as a function of the k-mer size. The sequence of each target exon is analyzed to check for the presence of a repeat structure within the same region defining the exon. The y-axis reports the percentage of those exons that have been found to contain an identical or near-identical repeat of size k (up to three mismatches).
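The per-exon check summarized in Figure 2 could be sketched as follows. This is a rough illustration under stated assumptions, not the procedure used to produce the figure: the helper names are hypothetical, near-identity is measured here as Hamming distance between two k-mers at distinct start positions (overlapping windows are allowed), and the quadratic pairwise comparison is only practical for short sequences.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def has_near_identical_repeat(seq, k, max_mismatches=3):
    """True if two k-mers at distinct start positions differ in at most max_mismatches bases."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return any(
        hamming(kmers[i], kmers[j]) <= max_mismatches
        for i in range(len(kmers))
        for j in range(i + 1, len(kmers))
    )


def percent_exons_with_repeat(exons, k, max_mismatches=3):
    """Percentage of sequences that contain a (near-)identical repeat of size k."""
    if not exons:
        return 0.0
    hits = sum(has_near_identical_repeat(e, k, max_mismatches) for e in exons)
    return 100.0 * hits / len(exons)


# Hypothetical toy sequences, not real exons.
toy_exons = ["GATTACAGGGATTTCA", "ACGTTGCA"]
print(percent_exons_with_repeat(toy_exons, k=7, max_mismatches=1))
```

In the first toy sequence, GATTACA and GATTTCA differ at a single position, so it counts as containing a near-identical repeat when one mismatch is allowed but not when only exact copies are accepted. Sequences shorter than k contain no k-mers and are counted as repeat-free.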