Wentian Li, Jan Freudenberg.
Abstract
Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference sequence. Repeats in the human genome can be as long as 10^4 bases, or 10^5 to 10^6 bases when allowing for mismatches between repeat units. Sequence reads from these regions are therefore unmappable when the read length is in the range of 10^3 bases. With a read length of 1000 bases, slightly more than 1% of the assembled genome, and slightly less than 1% of the 1 kb reads, are unmappable, excluding the unassembled portion of the human genome (8% in GRCh37/hg19). The slow decay (long tail) of the power-law function implies a diminishing return in converting unmappable regions/reads into mappable ones as the read length increases, although increasing the read length always moves mappability toward 100%.
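The diminishing return described in the abstract can be illustrated with a toy calculation. The sketch below assumes a pure continuous power law P(D) ∝ D^(-alpha) for repeat lengths D ≥ d_min and treats a repeat as problematic when D is at least the read length. The function name, the exponent alpha = 2 and the cutoff d_min = 50 are hypothetical values chosen for illustration only; they are not estimates from the paper, and the model ignores copy number and genome coverage.

```python
def tail_fraction(read_len: float, d_min: float, alpha: float) -> float:
    """Fraction of repeats with length >= read_len under P(D) ~ D**(-alpha), D >= d_min."""
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for the distribution to be normalizable")
    if read_len <= d_min:
        return 1.0
    # Survival function of a continuous power law (Pareto) distribution.
    return (read_len / d_min) ** (1.0 - alpha)


# Hypothetical parameters, for illustration only (not fitted to any data).
ALPHA = 2.0
D_MIN = 50.0

read_lengths = [100, 200, 400, 800]
tails = [tail_fraction(r, D_MIN, ALPHA) for r in read_lengths]

# Gain from each doubling of the read length: the drop in the tail fraction.
gains = [before - after for before, after in zip(tails, tails[1:])]

for r, tail, gain in zip(read_lengths[1:], tails[1:], gains):
    print(f"read length {r}: tail fraction {tail:.4f}, gain from doubling {gain:.4f}")
```

Under these assumptions each doubling of the read length shrinks the remaining tail by the same factor, so the absolute gain per doubling keeps falling while the tail never reaches zero at any finite read length.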
Keywords: copy number variations; mappability; next-generation sequencing; power-law distribution; repeats
Year: 2014 PMID: 25426137 PMCID: PMC4226227 DOI: 10.3389/fgene.2014.00381
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Illustration of the problem caused by repeats in the alignment of reads to a reference genome. The length of the repeat unit is D basepairs, the length of a (typical) DNA fragment is F, and the length of a read in paired-end sequencing is R. If D < R, the fragment can be mapped to the genome uniquely. On the other hand, if D > F, the fragment is unmappable. If R < D < F, the fragment may or may not be mappable. When the whole fragment is sequenced, we consider R = F. Usually the fragment size is not fixed, whereas the read length is fixed. The distributions of fragment sizes and read lengths are P = P(x = F) and P = P(x = R), respectively. The distribution of maximum repeat length and copy number is P = P(x = D, y = C). The distribution of fixed repeat length and copy number is P = P(x = D0, y = C).
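The three cases in the Figure 1 caption can be written as a small decision rule. This is a minimal sketch: the function name and the returned labels are hypothetical, and the caption does not say how to treat the boundary cases D = R or D = F, so the sketch assumes they fall into the undetermined middle case.

```python
def classify_fragment(repeat_len: int, read_len: int, fragment_len: int) -> str:
    """Classify a fragment by repeat length D, read length R and fragment length F."""
    if read_len > fragment_len:
        raise ValueError("read length R cannot exceed fragment length F")
    if repeat_len < read_len:
        # D < R: a single read spans the repeat unit, so placement is unique.
        return "mappable"
    if repeat_len > fragment_len:
        # D > F: the whole fragment lies inside the repeat.
        return "unmappable"
    # R <= D <= F: may or may not be mappable (boundary handling is an assumption).
    return "ambiguous"


print(classify_fragment(repeat_len=100, read_len=150, fragment_len=500))
print(classify_fragment(repeat_len=300, read_len=150, fragment_len=500))
print(classify_fragment(repeat_len=10_000, read_len=150, fragment_len=500))
```

Setting read_len equal to fragment_len corresponds to the case where the whole fragment is sequenced (R = F), which leaves only the boundary D = F in the middle case.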
Figure 2. (A–D) are on exact repeats and (E–F) on approximate repeats, all in log-log scale. (A) The number of repeat types as a function of the fixed repeat unit length D0. The numbers of repeat types with exactly two (three) copies in the genome, C = 2 (C = 3), are shown separately. (B) The difference between the number of repeat types at D0 and at D0 + 1. This is an upper limit of the number of maximal repeat types at D. (C) The number of repeat types at fixed repeat unit lengths (D0 = 50, 150, 500, 1000) as a function of copy number C. (D) The difference between the number of repeat types at different D0's (e.g., between D0 = 50 and D0 = 150). This represents the sum of upper limits of the number of maximal repeat types at length D, summing over all D's between the two values (e.g., 50 and 150). (E) Number of appearances in the segmental duplication track from the UCSC Genome Browser with a certain size D as a function of D. The three power-law functions, 1/D, 1/D^2, and 1/D^3, are drawn for comparison. (F) Number of segmental duplication names as a function of copy number C (number of pairwise alignment lines plus one). The power-law function 1/C^3 is drawn for comparison.
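The differencing used in panels (B) and (D) can be sketched as follows. The counts are made-up toy numbers, not data from the paper, and the function name is hypothetical. The sketch also assumes the difference is taken as N(D0) minus N(D0 + 1), so that it is non-negative when the number of repeat types decreases with D0.

```python
from typing import Dict


def upper_bound_maximal(counts: Dict[int, int]) -> Dict[int, int]:
    """Return N(D0) - N(D0 + 1) for every D0 whose successor is also present."""
    return {d: counts[d] - counts[d + 1] for d in sorted(counts) if d + 1 in counts}


# Toy counts of repeat types at fixed unit length D0 (illustrative only).
toy_counts = {50: 1000, 51: 940, 52: 900, 53: 880}

# Panel (B) style: one upper bound per length.
bounds = upper_bound_maximal(toy_counts)
print(bounds)

# Panel (D) style: the per-length bounds telescope to a single difference
# between the counts at the two end points.
total = sum(bounds.values())
print(total, toy_counts[50] - toy_counts[53])
```

Because the sum telescopes, the panel (D) quantity needs only the counts at the two end points, which is why it can be computed from widely spaced D0 values such as 50 and 150.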