Colm T O'Dushlaine, Denis C Shields.
Abstract
BACKGROUND: Tandem repeat (TR) variants in the human genome play key roles in a number of diseases. However, current models predicting variability are based on limited training sets. We conducted a systematic analysis of TRs of unit lengths 2-12 nucleotides in Whole Genome Shotgun (WGS) sequences to define the extent of variation of 209,214 unique repeat loci throughout the genome.
Year: 2008 PMID: 18416815 PMCID: PMC2364633 DOI: 10.1186/1471-2164-9-175
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Summary of variants and non-variants detected in the WGS dataset divided according to unit length of the repeat.
| 58686 (76.2) | 18369 (23.8) | 5.53 | 6.05 | 0.40 | 0.26 | |
| 6928 (55.0) | 5676 (45.0) | 5.65 | 4.81 | 0.27 | 0.27 | |
| 27413 (54.1) | 23299 (45.9) | 5.71 | 5.05 | 0.27 | 0.27 | |
| 7478 (35.5) | 13603 (64.5) | 5.76 | 4.67 | 0.17 | 0.24 | |
| 2338 (21.3) | 8638 (78.7) | 5.66 | 6.66 | 0.09 | 0.19 | |
| 3400 (9.2) | 33386 (90.8) | 7.64 | 158.18 | 0.04 | 0.13 | |
| 106243 (50.8) | 102971 (49.2) | 5.98 | 66.53 | 0.26 | 0.28 | |
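The counts in this summary can be cross-checked with simple arithmetic. The sketch below assumes, from the caption order, that the first column holds variant counts and the second non-variant counts, and that the final row is the overall total. The row labels (unit-length classes) do not survive in the source, so the rows are left unlabeled; the names `rows`, `total_row` and `percent_variant` are illustrative only.

```python
# (variant, non-variant) counts copied from the rows above.
# Assumption: column 1 = variants, column 2 = non-variants (caption order).
rows = [
    (58686, 18369),
    (6928, 5676),
    (27413, 23299),
    (7478, 13603),
    (2338, 8638),
    (3400, 33386),
]
total_row = (106243, 102971)  # last row of the table, read here as the total


def percent_variant(variant, non_variant):
    """Percentage of loci in one row that are variant, to one decimal place."""
    return round(100.0 * variant / (variant + non_variant), 1)


# Column sums over the six per-class rows.
summed = (sum(v for v, _ in rows), sum(n for _, n in rows))
# Per-row variant percentages, comparable to the bracketed values in column 1.
percents = [percent_variant(v, n) for v, n in rows]

if __name__ == "__main__":
    print("column sums match total row:", summed == total_row)
    print("loci in total:", sum(summed))
    print("percent variant per row:", percents)
```

Under these assumptions the two column sums add up to the 209,214 unique repeat loci quoted in the abstract.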
Predictor variables derived from the repeat and flanking sequences whose impact on variability was tested by regression.
| Population size: number of unique sequences from which estimate of repeat variability was obtained | |
| Based on percentage composition: | |
| %T in the repeat, e.g. 50% in TGTGTGTG | |
| %G in the repeat, e.g. 50% in TGTGTGTG | |
| %C in the repeat, e.g. 50% in CACACACA | |
| %A in the repeat, e.g. 50% in CACACACA | |
| Tandem Repeat Finder (TRF) [16] program-derived overall score | |
| inferred consensus [16] | |
| % matches between actual repeat units and the inferred consensus [16] | |
| Length of tandem repeat unit, e.g. 2 for TGTGTGTG | |
| Length of the tandem repeat array, e.g. 8 for TGTGTGTG | |
| Number of copies of repeat unit, e.g. 4 for TGTGTGTG | |
| Observed/expected dinucleotide bias of 10 dimers in the tandem repeat array [24] | |
| Melting temperature of the sequence [26] | |
| Fraction of the sequence represented by the bases G or C, e.g. 0.5 for TGTGTGTG | |
| Free energy of the tandem repeat sequence RNA secondary structure [37] | |
| Melting temperature of the sequence [26] | |
| Fractional G+C content of the two 20 bp/500 bp flanking sequences | |
| Number of CpG (CG) dinucleotides in the two 20 bp/500 bp flanking sequences | |
| Total number of SNPs in the two 500 bp flanks | |
| Mean SNP minor allele frequency for SNPs in the two 500 bp flanks | |
| Distance in nucleotides from the nearest promoter | |
| Distance in nucleotides from the nearest gene/CDS | |
| Distance in nucleotides from the nearest Multi-species Conserved Sequences defined in [33] | |
| Distance in nucleotides from the nearest CpG island (defined by the UCSC genome browser) | |
| Distance in nucleotides from the nearest regulatory region (UCSC genome browser) | |
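Several of the simpler predictors listed above can be computed directly from sequence. The following is a minimal sketch, not the authors' pipeline: the function names `repeat_predictors` and `flank_predictors` are hypothetical, the unit length is supplied by the caller rather than inferred, and predictors that depend on external tools or data (TRF score, melting temperature, RNA free energy, SNPs, genomic distances) are deliberately left out.

```python
def repeat_predictors(array_seq, unit_length):
    """Composition and size predictors for one tandem repeat array.

    array_seq   -- the full repeat array, e.g. "TGTGTGTG"
    unit_length -- length of the repeat unit, e.g. 2 (supplied, not inferred)
    """
    seq = array_seq.upper()
    if not seq or unit_length <= 0:
        raise ValueError("need a non-empty sequence and a positive unit length")
    n = len(seq)
    return {
        "unit_length": unit_length,
        "array_length": n,
        "copy_number": n / unit_length,
        # Percentage of each base in the array, e.g. 50% T in TGTGTGTG.
        "percent": {base: 100.0 * seq.count(base) / n for base in "ACGT"},
        # Fraction of the array made up of G or C.
        "gc_fraction": (seq.count("G") + seq.count("C")) / n,
    }


def flank_predictors(left_flank, right_flank):
    """G+C fraction and CpG count pooled over the two flanking sequences."""
    flanks = (left_flank.upper(), right_flank.upper())
    total = sum(len(f) for f in flanks)
    if total == 0:
        raise ValueError("flanks must not both be empty")
    gc = sum(f.count("G") + f.count("C") for f in flanks)
    # "CG" cannot overlap itself, so str.count gives the exact CpG count.
    cpg = sum(f.count("CG") for f in flanks)
    return {"gc_fraction": gc / total, "cpg_count": cpg}


if __name__ == "__main__":
    print(repeat_predictors("TGTGTGTG", 2))
    print(flank_predictors("ACGT", "CCGG"))
```

Called on the table's own example, `repeat_predictors("TGTGTGTG", 2)` reproduces the quoted values: 50% T, 50% G, unit length 2, array length 8, 4 copies and a G+C fraction of 0.5.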
Univariate logistic predictors of whether or not a TR is polymorphic in the Whole Genome Shotgun datasets.
| unit_length | Repeat | 3.115 | 5.828 | -46.6 | 0.19 |
| score | Repeat | 84.569 | 61.891 | 36.6 | 0.14 |
| copy_number | Repeat | 20.440 | 11.724 | 74.3 | 0.08 |
| %match | Repeat | 94.261 | 89.711 | 5.1 | 0.05 |
| AC | Repeat | 2.252 | 1.396 | 61.3 | 0.04 |
| %indels | Repeat | 2.083 | 4.619 | -54.9 | 0.04 |
| CA | Repeat | 2.280 | 1.757 | 29.8 | 0.03 |
| SNP_allele_freq | Repeat flank | 0.166 | 0.167 | -0.6 | 0.02 |
| AA | Repeat | 1.025 | 1.183 | -13.4 | 0.01 |
| blocklength | Repeat | 57.130 | 48.430 | 18.0 | 0.01 |
| G+C_repeat | Repeat | 32.098 | 28.592 | 12.3 | < 0.01 |
| entropy | Repeat | 1.050 | 1.100 | -4.5 | < 0.01 |
| CC | Repeat | 1.052 | 1.150 | -8.5 | < 0.01 |
| CpG_flank500 | Repeat flank | 4.695 | 5.496 | -14.6 | < 0.01 |
| G+C_flank500 | Repeat flank | 41.427 | 42.363 | -2.2 | < 0.01 |
| tm_flank500 | Repeat flank | 53.014 | 53.418 | -0.8 | < 0.01 |
| G+C_flank20 | Repeat flank | 38.653 | 39.873 | -3.1 | < 0.01 |
| tm_flank20 | Repeat flank | 51.606 | 52.134 | -1.0 | < 0.01 |
| C | Repeat | 15.695 | 13.831 | 13.5 | < 0.01 |
| G | Repeat | 15.559 | 13.717 | 13.4 | < 0.01 |
| GC | Repeat | 1.010 | 1.088 | -7.2 | < 0.01 |
| TA | Repeat | 0.921 | 0.877 | 5.0 | < 0.01 |
| A | Repeat | 33.180 | 35.523 | -6.6 | < 0.01 |
| CpG_flank20 | Repeat flank | 0.190 | 0.226 | -15.9 | < 0.01 |
| tm_repeat | Repeat | 48.744 | 47.967 | 1.6 | < 0.01 |
| num_SNPs | Repeat flank | 2.380 | 2.220 | 7.2 | < 0.01 |
| RNA_free_energy | Repeat | -3.978 | -3.157 | 26.0 | < 0.01 |
| nearest_promoter | Distant repeat flank | 475996.5 | 439298.5 | 8.3 | < 0.01 |
| CG | Repeat | 0.930 | 0.973 | -4.4 | < 0.01 |
| AT | Repeat | 0.991 | 0.963 | 2.9 | < 0.01 |
| AG | Repeat | 1.313 | 1.354 | -3.0 | < 0.01 |
| T | Repeat | 34.591 | 35.767 | -3.3 | < 0.01 |
| pop_size | | 6.317 | 5.642 | 12.0 | < 0.01 |
| nearest_gene | Distant repeat flank | 143738.9 | 133874.5 | 7.4 | < 0.01 |
| nearest_CDS | Distant repeat flank | 151516.1 | 141369.5 | 7.2 | < 0.01 |
| GA | Repeat | 1.346 | 1.382 | -2.7 | < 0.01 |
| nearest_regulatory | Distant repeat flank | -3642189 | -3363138 | 8.3 | < 0.01 |
| nearest_MCS | Distant repeat flank | 66672.9 | 72695.2 | -8.3 | < 0.01 |
1 Significant with p < 0.00005 in both Mann-Whitney and t-test. Only variables with at least one significant p-value at the 5% level are shown, and dinucleotide biases in the flanking sequences of repeats were also excluded.
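The column headers of this table do not survive in the source, so the following is only a reading of the numbers, not a statement of the authors' definitions. The fifth column is numerically consistent with the percentage difference between the third and fourth columns, taken relative to the fourth. The sketch below recomputes it for a few rows; `percent_difference` and `examples` are illustrative names.

```python
def percent_difference(first, second):
    """Percentage difference of `first` relative to `second`."""
    return 100.0 * (first - second) / second


# (column 3, column 4, column 5 as printed) for a handful of rows above.
examples = {
    "unit_length": (3.115, 5.828, -46.6),
    "score": (84.569, 61.891, 36.6),
    "copy_number": (20.440, 11.724, 74.3),
    "%indels": (2.083, 4.619, -54.9),
    "RNA_free_energy": (-3.978, -3.157, 26.0),
}

# Recompute column 5 from columns 3 and 4, rounded to one decimal like the table.
recomputed = {
    name: round(percent_difference(first, second), 1)
    for name, (first, second, _printed) in examples.items()
}

if __name__ == "__main__":
    for name, (_first, _second, printed) in examples.items():
        print(name, "recomputed:", recomputed[name], "printed:", printed)
```

Because the denominator keeps its sign, a negative-valued variable such as RNA_free_energy gives a positive percentage when the third-column value is more negative than the fourth.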
Figure 1. Significant predictors of repeat variability from the generic logistic model, sorted by absolute value of the z-score.
Multivariate Analysis. Logistic regression coefficients for the 3 most predictive covariates are shown. A more detailed table, using all covariates, is given [see Additional file 6].
| Covariate | Coefficient | Std. error | z | p | 95% confidence interval |
| Score | 0.050 | < 0.001 | 294.85 | < 0.0001 | 0.0492083 → 0.0498669 |
| Unit length | -0.452 | 0.001 | -311.89 | < 0.0001 | -0.4545942 → -0.4489163 |
| %match | 0.066 | < 0.001 | 196.10 | < 0.0001 | 0.0657407 → 0.0670681 |
| Intercept | -7.585 | 0.035 | -214.63 | < 0.0001 | -7.654186 → -7.515656 |
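To illustrate how coefficients like these turn into a predicted probability of variability, here is a minimal sketch of the standard logistic function applied to the three covariates. It takes the last row (-7.585) as the model constant, which is an assumption about the table layout, and it uses the rounded coefficients as printed, so its outputs will not exactly match the authors' fitted model. The names `COEF`, `INTERCEPT` and `predicted_probability` are hypothetical.

```python
import math

# Rounded coefficients copied from the table above.
COEF = {"score": 0.050, "unit_length": -0.452, "percent_match": 0.066}
# Assumption: the final table row (-7.585) is the model constant.
INTERCEPT = -7.585


def predicted_probability(score, unit_length, percent_match):
    """Logistic-model probability that a repeat is variable."""
    logit = (
        INTERCEPT
        + COEF["score"] * score
        + COEF["unit_length"] * unit_length
        + COEF["percent_match"] * percent_match
    )
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Illustrative inputs only, taken from the two value columns listed for
    # score, unit_length and %match in the univariate table.
    print(predicted_probability(84.569, 3.115, 94.261))
    print(predicted_probability(61.891, 5.828, 89.711))
```

The signs behave as the coefficients suggest: a higher TRF score and a higher %match raise the predicted probability, while a longer repeat unit lowers it.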
Figure 2. Histogram of predictions from the generic logistic regression model, broken down according to whether or not the repeats were variable.
Figure 3. Distributions of predicted invariant (blue), predicted variant (red) and observed variant repeats (green) across the human genome. Counts are shown as repeats per 2.5 Mb (y-axis) against genomic position in Mb (x-axis). Gap regions (light blue) and centromeres (orange) are also highlighted as peaks or lines raised above the base level. These regions were not included in the frequency estimations.