Taane G Clark, Toby Andrew, Gregory M Cooper, Elliott H Margulies, James C Mullikin, David J Balding.
Abstract
BACKGROUND: We describe the distribution of small insertion and deletion polymorphisms (indels) in the 44 Encyclopedia of DNA Elements (ENCODE) regions (about 1% of the human genome) and evaluate their potential contribution to human genetic variation. We relate indels to known genomic annotation features and measures of evolutionary constraint.
Year: 2007 PMID: 17784950 PMCID: PMC2375018 DOI: 10.1186/gb-2007-8-9-r180
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Indel density (for all 44 ENCODE regions)
| ENCODE region | Chromosome | Number of indels | Indels (bp) | Size (bp) | Density (per 100 kb) | Density (bp per 100 kb) | Gene (bp%) | Val. SNP (per 100 kb) | SNP:indel (per 100 kb) | SNP:indel (bp/100 kb) |
| 1: ENm001 CFTR | 7 | 189 | 533 | 1,877,426 | 10.1 | 28.4 | 1.2 | 64.5 | 5.8 | 2.3 |
| 2: ENm002 Interleukin | 5 | 139 | 535 | 1,000,000 | 13.9 | 53.5 | 3.0 | 101.1 | 6.6 | 1.9 |
| 3: ENm003 ApoCluster | 11 | 59 | 187 | 500,000 | 11.8 | 37.4 | 2.1 | 93.2 | 8.4 | 2.5 |
| 4: ENm004 | 22 | 289 | 789 | 1,700,000 | 17.0 | 46.4 | 2.1 | 89.7 | 4.9 | 1.9 |
| 5: ENm005 | 21 | 368 | 982 | 1,695,985 | 21.7 | 57.9 | 2.4 | 108.1 | 4.3 | 1.9 |
| 6: ENm006 | X | 97 | 249 | 1,338,447 | 7.2 | 18.6 | 5.5 | 34.5 | 7.4 | 1.9 |
| 7: ENm007 | 19 | 207 | 711 | 1,000,876 | 20.7 | 71.0 | 4.9 | 151.6 | 7.8 | 2.1 |
| 8: ENm008 AlphaGlobin | 16 | 118 | 253 | 500,000 | 23.6 | 50.6 | 5.2 | 120.2 | 5.0 | 2.4 |
| 9: ENm009 BetaGlobin | 11 | 168 | 545 | 1,001,592 | 16.8 | 54.4 | 4.2 | 181.4 | 10.9 | 3.3 |
| 10: ENm010 HOXACluster | 7 | 95 | 317 | 500,000 | 19.0 | 63.4 | 2.4 | 89.4 | 4.5 | 1.4 |
| 11: ENm011 IGF2H19 | 11 | 62 | 228 | 606,048 | 10.2 | 37.6 | 2.1 | 102.3 | 13.4 | 2.7 |
| 12: ENm012 FOXP2 | 7 | 128 | 370 | 1,000,000 | 12.8 | 37.0 | 0.3 | 73.2 | 5.9 | 2.0 |
| 13: ENm013 | 7 | 139 | 483 | 1,114,424 | 12.5 | 43.3 | 1.0 | 105.7 | 7.7 | 2.4 |
| 14: ENm014 | 7 | 128 | 322 | 1,163,197 | 11.0 | 27.7 | 0.8 | 83.4 | 7.6 | 3.0 |
| 15: ENr111 | 13 | 96 | 364 | 500,000 | 19.2 | 72.8 | 0.3 | 128.2 | 5.7 | 1.8 |
| 16: ENr112 | 2 | 55 | 156 | 500,000 | 11.0 | 31.2 | 0.0 | 94.2 | 10.6 | 3.0 |
| 17: ENr113 | 4 | 56 | 152 | 500,000 | 11.2 | 30.4 | 0.1 | 104.0 | 9.6 | 3.4 |
| 18: ENr114 | 10 | 101 | 284 | 500,000 | 20.2 | 56.8 | 1.0 | 142.8 | 8.5 | 2.5 |
| 19: ENr121 | 2 | 108 | 270 | 500,000 | 21.6 | 54.0 | 0.8 | 140.0 | 6.1 | 2.6 |
| 20: ENr122 | 18 | 76 | 287 | 500,000 | 15.2 | 57.4 | 1.9 | 139.8 | 8.2 | 2.4 |
| 21: ENr123 | 12 | 65 | 136 | 500,000 | 13.0 | 27.2 | 2.5 | 122.0 | 9.2 | 4.5 |
| 22: ENr131 | 2 | 75 | 202 | 500,064 | 15.0 | 40.4 | 3.6 | 123.4 | 6.9 | 3.1 |
| 23: ENr132 | 13 | 43 | 169 | 500,000 | 8.6 | 33.8 | 1.9 | 123.8 | 14.5 | 3.7 |
| 24: ENr133 | 21 | 112 | 293 | 500,000 | 22.4 | 58.6 | 2.2 | 165.0 | 6.3 | 2.8 |
| 25: ENr211 | 16 | 68 | 251 | 500,001 | 13.6 | 50.2 | 0.1 | 114.8 | 8.8 | 2.3 |
| 26: ENr212 | 5 | 70 | 118 | 500,000 | 14.0 | 23.6 | 0.3 | 112.6 | 7.6 | 4.8 |
| 27: ENr213 | 18 | 74 | 165 | 500,000 | 14.8 | 33.0 | 0.6 | 91.8 | 6.0 | 2.8 |
| 28: ENr221 | 5 | 68 | 156 | 500,000 | 13.6 | 31.2 | 1.4 | 105.0 | 6.9 | 3.4 |
| 29: ENr222 | 6 | 73 | 201 | 500,000 | 14.6 | 40.2 | 0.9 | 104.0 | 6.4 | 2.6 |
| 30: ENr223 | 6 | 130 | 384 | 500,000 | 26.0 | 76.8 | 2.2 | 135.2 | 4.5 | 1.8 |
| 31: ENr231 | 1 | 91 | 178 | 500,000 | 18.2 | 35.6 | 4.8 | 94.6 | 4.9 | 2.7 |
| 32: ENr232 | 9 | 93 | 282 | 500,000 | 18.6 | 56.4 | 3.2 | 112.0 | 5.8 | 2.0 |
| 33: ENr233 | 15 | 47 | 126 | 500,000 | 9.4 | 25.2 | 7.3 | 59.8 | 7.4 | 2.4 |
| 34: ENr311 | 14 | 50 | 171 | 500,000 | 10.0 | 34.2 | 0.0 | 93.0 | 8.8 | 2.7 |
| 35: ENr312 | 11 | 54 | 176 | 500,000 | 10.8 | 35.2 | 0.0 | 142.0 | 12.2 | 4.0 |
| 36: ENr313 | 16 | 83 | 242 | 500,000 | 16.6 | 48.4 | 0.0 | 108.4 | 6.4 | 2.2 |
| 37: ENr321 | 8 | 84 | 257 | 500,000 | 16.8 | 51.4 | 0.4 | 94.0 | 4.6 | 1.8 |
| 38: ENr322 | 14 | 86 | 323 | 500,000 | 17.2 | 64.6 | 0.8 | 127.6 | 7.3 | 2.0 |
| 39: ENr323 | 6 | 77 | 176 | 500,000 | 15.4 | 35.2 | 0.7 | 78.8 | 4.7 | 2.2 |
| 40: ENr324 | X | 70 | 138 | 500,000 | 14.0 | 27.6 | 1.3 | 43.8 | 4.3 | 1.6 |
| 41: ENr331 | 2 | 67 | 204 | 500,000 | 13.4 | 40.8 | 6.4 | 118.8 | 8.4 | 2.9 |
| 42: ENr332 | 11 | 60 | 184 | 500,000 | 12.0 | 36.8 | 6.5 | 88.6 | 7.6 | 2.4 |
| 43: ENr333 | 20 | 89 | 226 | 500,000 | 17.8 | 45.2 | 6.1 | 77.2 | 4.0 | 1.7 |
| 44: ENr334 | 6 | 79 | 235 | 500,000 | 15.8 | 47.0 | 2.2 | 83.4 | 5.4 | 1.8 |
Manually selected (regions 1-14; each approx. 500 kb to 2 Mb) and randomly selected (regions 15-44; each approx. 500 kb) ENCODE regions are defined [12] as:
Manual: genomic regions with well studied genes and availability of comparative sequence
Random: selected randomly across the genome, stratified by gene density and non-exonic conservation
The ten Encyclopedia of DNA Elements (ENCODE) regions with in-depth single nucleotide polymorphism (SNP) discovery are ENm010, ENm013, ENm014, ENr112, ENr113, ENr123, ENr131, ENr213, ENr232, and ENr321. bp, base pairs; kb, kilobases.
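The density columns in the table above follow from the indel counts and the region sizes. The following is a minimal sketch in Python (function and variable names are illustrative, not from the paper) that recomputes them for three rows. It takes the bp-based SNP:indel ratio to be validated SNP density divided by indel bp density; that is an assumption, although it matches the tabulated values for these rows. The count-based SNP:indel column is not recomputed here.

```python
# Sketch: recompute density columns for three rows of the indel density table.
# All input values are copied from the table; the names below are illustrative.

def per_100kb(count, size_bp):
    """Scale a count (indels, or indel bp) to a rate per 100 kb of sequence."""
    return count / size_bp * 100_000

regions = {
    # region: (number of indels, indel bp, region size in bp, validated SNPs per 100 kb)
    "ENm001": (189, 533, 1_877_426, 64.5),
    "ENm002": (139, 535, 1_000_000, 101.1),
    "ENr223": (130, 384, 500_000, 135.2),
}

def summarise(name):
    """Return (indels per 100 kb, indel bp per 100 kb, bp-based SNP:indel ratio)."""
    n_indels, indel_bp, size_bp, snp_density = regions[name]
    density_n = per_100kb(n_indels, size_bp)
    density_bp = per_100kb(indel_bp, size_bp)
    # Assumption: the ratio is SNP density over indel bp density.
    ratio_bp = snp_density / density_bp
    return round(density_n, 1), round(density_bp, 1), round(ratio_bp, 1)

for name in regions:
    print(name, summarise(name))
```

For ENm001 this gives 10.1 indels per 100 kb, 28.4 bp per 100 kb, and a bp-based SNP:indel ratio of 2.3, in agreement with the first table row.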
Comparison of indel and SNP density by ENCODE experimental features
| | Indels | | Validated SNPs | |
| Feature | bp/100 kb | 99% CI | bp | bp/100 kb |
| Manual | 43.4 | 34.4 to 54.7 | 14,390 | 95.9 |
| Random | 43.4 | 37.5 to 50.2 | 16,343 | 109.0 |
| Overall | 43.4 | 38.3 to 49.1 | 30,733 | 102.4 |
| RNA transcription | ||||
| CDS | 0.7 | 0.1 to 8.6 | 421 | 62.4 |
| TSS | 3.3 | | 42 | 68.7 |
| RACEfrags | 6.6 | 1.3 to 33.9 | 278 | 65.4 |
| TARs/transfrags | 12.3 | 6.8 to 22.3 | 591 | 93.1 |
| Pseudo-exons | 19.1 | 5.8 to 63.3 | 132 | 96.9 |
| 3' UTR | 23.6 | 13.5 to 41.3 | 370 | 84.8 |
| 5' UTR | 27.4 | 3.8 to 198.7 | 97 | 83.2 |
| TUF | 36.9 | 20.2 to 67.6 | 423 | 97.6 |
| Open chromatin | ||||
| FAIRE-sites | 23.8 | 15.5 to 36.7 | 1,232 | 89.8 |
| DHS (NHGRI) | 19.7 | 8.3 to 46.9 | 297 | 95.9 |
| DHS (Regulome) | 27.0 | 13.4 to 54.4 | 450 | 90.1 |
| DNA-protein interaction/transcript regulation | ||||
| HisPolTAF | 32.4 | 22.5 to 46.5 | 850 | 79.0 |
| Seq_specific (all motifs) | 35.8 | 23.1 to 55.3 | 1,098 | 93.5 |
| SeqSp (sequence specific factors) | 42.5 | 20.1 to 89.5 | 421 | 79.4 |
| Ancestral repeats | 26.5 | 21.7 to 32.5 | 5,749 | 95.9 |
| Evolutionary constraint | ||||
| MCS strict | 4.1 | 1.6 to 10.4 | 229 | 30.6 |
| MCS moderate | 11.2 | 6.8 to 18.5 | 667 | 44.0 |
| MCS loose | 26.4 | 20.9 to 33.4 | 2,052 | 56.4 |
| Cell cycle | ||||
| EarlyRepSeg | 43.5 | 33.3 to 56.9 | 6,165 | 89.8 |
| MidRepSeg | 43.2 | 35.3 to 53.0 | 7,418 | 95.7 |
| LateRepSeg | 41.9 | 32.9 to 53.3 | 8,896 | 111.3 |
bp, base pairs; CDS, coding sequence; CI, confidence interval; DHS, DNAse hypersensitive sites; ENCODE, Encyclopedia of DNA Elements; FAIRE, formaldehyde assisted isolation of regulatory elements; kb, kilobases; MCS, multi-species conserved sequence; NHGRI, National Human Genome Research Institute; transfrag, transcribed fragment; RACEfrag, rapid amplification of cDNA ends fragment; SNP, single nucleotide polymorphism; TAR, transcriptionally active region; TSS, transcription start site; TUF, transcripts of unknown function; UTR, untranslated region.
Indel density for annotation features (across all 44 ENCODE regions)
| | Indels | | Rate (number per 100 kb) | | Rate (bp per 100 kb) | | |
| Feature | Number | bp | Rate | 99% CI | Rate | 99% CI | Feature length (kb) |
| Manual | 2,186 | 6,504 | 14.6 | 11.7 to 18.2 | 43.4 | 34.4 to 54.7 | 14,998 |
| Random | 2,300 | 6,506 | 15.3 | 13.6 to 17.3 | 43.4 | 37.5 to 50.2 | 15,000 |
| Overall | 4,486 | 13,010 | 15.0 | 13.4 to 16.7 | 43.4 | 38.3 to 49.1 | 29,998 |
| RNA transcription | |||||||
| CDS | 5 | 5 | 0.7 | 0.1 to 8.6 | 0.7 | 0.1 to 8.6 | 675 |
| TSS | 2 | 2 | 3.3 | | 3.3 | | 61 |
| RACEfrags | 9 | 28 | 2.1 | 0.8 to 5.4 | 6.6 | 1.3 to 33.9 | 425 |
| TARs/transfrags | 37 | 78 | 5.8 | 3.5 to 9.6 | 12.3 | 6.8 to 22.3 | 634 |
| Pseudo-exons | 9 | 26 | 6.6 | 2.6 to 16.6 | 19.1 | 5.8 to 63.3 | 136 |
| 3' UTR | 48 | 103 | 11.0 | 7.2 to 16.7 | 23.6 | 13.5 to 41.3 | 436 |
| 5' UTR | 7 | 32 | 6.0 | 1.6 to 22.3 | 27.4 | 3.8 to 198.7 | 117 |
| TUF | 53 | 160 | 12.2 | 7.8 to 19.2 | 36.9 | 20.2 to 67.6 | 433 |
| Open chromatin | |||||||
| FAIRE-sites | 106 | 327 | 7.7 | 5.6 to 10.6 | 23.8 | 15.5 to 36.7 | 1,372 |
| DHS (NHGRI) | 19 | 61 | 6.1 | 3.3 to 11.3 | 19.7 | 8.3 to 46.9 | 310 |
| DHS (Regulome) | 43 | 135 | 8.6 | 5.3 to 14.0 | 27.0 | 13.4 to 54.4 | 499 |
| DNA-protein interaction/transcript regulation | |||||||
| HisPolTAF | 141 | 348 | 13.1 | 10.0 to 17.2 | 32.4 | 22.5 to 46.5 | 1,076 |
| Seq_specific (all motifs) | 131 | 420 | 11.2 | 8.3 to 15.0 | 35.8 | 23.1 to 55.3 | 1,174 |
| SeqSp (sequence specific factors) | 54 | 225 | 10.2 | 6.2 to 16.7 | 42.5 | 20.1 to 89.5 | 530 |
| Ancestral repeats | 532 | 1,592 | 7.9 | 6.7 to 9.2 | 26.5 | 21.7 to 32.5 | 5,998 |
| Evolutionary constraint | |||||||
| MCS strict | 19 | 31 | 2.5 | 1.3 to 5.1 | 4.1 | 1.6 to 10.4 | 748 |
| MCS moderate | 78 | 170 | 5.1 | 3.5 to 7.6 | 11.2 | 6.8 to 18.5 | 1,515 |
| MCS loose | 356 | 960 | 9.8 | 8.2 to 11.7 | 26.4 | 20.9 to 33.4 | 3,637 |
| Cell cycle | |||||||
| EarlyRepSeg | 1,124 | 2,989 | 16.4 | 13.8 to 19.4 | 43.5 | 33.3 to 56.9 | 6,868 |
| MidRepSeg | 1,190 | 3,352 | 15.4 | 13.5 to 17.5 | 43.2 | 35.3 to 53.0 | 7,751 |
| LateRepSeg | 1,110 | 3,345 | 13.9 | 12.1 to 15.9 | 41.9 | 32.9 to 53.3 | 7,991 |
bp, base pairs; CDS, coding sequence; CI, confidence interval; DHS, DNAse hypersensitive sites; ENCODE, Encyclopedia of DNA Elements; FAIRE, formaldehyde assisted isolation of regulatory elements; kb, kilobases; MCS, multi-species conserved sequence; NHGRI, National Human Genome Research Institute; transfrag, transcribed fragment; RACEfrag, rapid amplification of cDNA ends fragment; TAR, transcriptionally active region; TSS, transcription start site; TUF, transcripts of unknown function; UTR, untranslated region.
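The rate columns in the table above are the indel count (or indel bp) divided by the feature length. The sketch below, in Python with illustrative names that are not from the paper, recomputes the rates for a few features and expresses each bp rate relative to the overall rate. The relative rate is an added convenience for reading the table, not a statistic reported in the source.

```python
# Sketch: recompute feature-level indel rates from counts and feature lengths.
# Input values are copied from the table; names are illustrative.

def rate_per_100kb(count, feature_length_kb):
    """Rate per 100 kb given a count and a feature length in kb."""
    return count / feature_length_kb * 100

features = {
    # feature: (number of indels, indel bp, feature length in kb)
    "Overall": (4486, 13010, 29998),
    "CDS": (5, 5, 675),
    "MCS strict": (19, 31, 748),
    "MCS moderate": (78, 170, 1515),
    "MCS loose": (356, 960, 3637),
}

def rates(name):
    """Return (indels per 100 kb, indel bp per 100 kb), rounded as in the table."""
    n_indels, indel_bp, length_kb = features[name]
    return (round(rate_per_100kb(n_indels, length_kb), 1),
            round(rate_per_100kb(indel_bp, length_kb), 1))

def relative_bp_rate(name):
    """Unrounded bp rate of a feature divided by the overall bp rate."""
    _, indel_bp, length_kb = features[name]
    _, overall_bp, overall_kb = features["Overall"]
    return rate_per_100kb(indel_bp, length_kb) / rate_per_100kb(overall_bp, overall_kb)

for name in features:
    print(name, rates(name), round(relative_bp_rate(name), 2))
```

Under this calculation the bp rate in MCS strict sequence is roughly a tenth of the overall rate, and the relative rate rises from strict to moderate to loose criteria, which is the ordering the table shows.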
Experimental feature definitions
| Feature | Term | Definition |
| RNA transcription (coding and noncoding) | CDS | Coding sequence: well characterized transcribed regions with an annotated protein-coding open reading frame (ORF) |
| | RACEfrags | 5' and 3' rapid amplification of cDNA ends (RACE), using polyA or total RNA to construct full-length cDNA. This technique has revealed previously unrecognized UTRs |
| | TARs/transfrags | Transcriptionally active regions/transcribed fragments as determined by analyses of cellular RNA (polyA or total) hybridizations to multiple microarray platforms. For the analyses reported here, portions of TARs/transfrags overlapping any CDS, 5' or 3' UTR annotations were removed from the dataset |
| | Pseudo-exons | A pre-mRNA sequence that resembles an exon but is not recognized as such by the splicing machinery |
| | TSS | Transcription start site |
| | 5' UTR | Untranslated region: portions of CDS-containing transcripts before the start codon. For the analyses reported here, 5' UTRs overlapping alternatively transcribed CDS annotations were removed from the dataset |
| | TUF | Transcripts of unknown function: noncoding transcripts |
| | 3' UTR | Untranslated region: portions of CDS-containing transcripts after the stop codon |
| Transcript regulation: open chromatin/DNA-protein interaction | DHS | DNAse I hypersensitive sites are short regions of DNA that are relatively easily cleaved by deoxyribonuclease. Regions of open chromatin detected by quantitative chromatin profiling and novel microarray-based methods. For the analyses reported here, regions that overlap repetitive sequence were removed. Measures of DHS are reported using two sources: the ENCODE Regulome group and the NHGRI |
| | FAIRE-sites | Formaldehyde assisted isolation of regulatory elements: a procedure used to isolate chromatin that is resistant to the formation of protein-DNA crosslinks. Data suggest that depletion of nucleosomes (the most basic organizational unit of chromatin) at active regulatory regions, such as promoters, is the primary underlying basis for FAIRE [38] |
| | HisPolTAF | Histone modifications, RNA polymerase II (PolII), and transcription regulator TAF250 |
| | Sequence specific factors | Regions of DNA determined to be bound by sequence-specific transcription factors through chromatin immunoprecipitation followed by microarray chip hybridization (so-called 'ChIP-Chip') analyses |
| | Sequence specific (all motifs) | Computationally identified short sequence motifs found to be over-represented in the sequence specific factors dataset |
| | Ancestral repeats | Mobile elements with well defined consensus sequences that inserted into the ancestral genome prior to mammalian radiation. These sequences are considered to be predominantly non-functional and are often used as models of neutrally evolving DNA |
| Cell cycle | EarlyRepSeg | Early replicating segments |
| | MidRepSeg | Mid replicating segments |
| | LateRepSeg | Late replicating segments |
| Evolutionary constraint | MCS strict | Multi-species conserved sequences: strict criteria |
| | MCS moderate | Multi-species conserved sequences: modest criteria |
| | MCS loose | Multi-species conserved sequences: loose criteria |
Figure 1. Indel rate versus MCS modest for human and 13 mammals. Indel rate and multi-species constrained sequences (MCS modest) are both expressed as base pairs (bp) per 100 kilobases (kb). The solid line represents the fit from a cubic smoothing spline, whereas the dashed line is the fit from a robust linear regression.
Figure 2. Indel rate versus GERP score comparing human and primates. Indel rate is expressed as base pairs (bp) per 100 kilobases (kb). The solid line represents the fit from a cubic smoothing spline, whereas the dashed line is the fit from a robust linear regression. GERP, genomic evolutionary rate profiling.
Figure 3. Indel rate versus all AR sequence rate. Indel rate and ancestral repeat (AR) sequence rate are both expressed as base pairs (bp) per 100 kilobases (kb). The solid line represents the fit from a cubic smoothing spline, whereas the dashed line is the fit from a robust linear regression. Note that the same relationship is observed for indel rate versus long AR bp per 100 kb.
Figure 4. AR sequence rate versus MCS modest. Ancestral repeat (AR) sequence rate and multi-species conserved sequences (MCS modest) are both expressed as base pairs (bp) per 100 kilobases (kb). The solid line represents the fit from a cubic smoothing spline, whereas the dashed line is the fit from a robust linear regression.
Figure 5. MCS modest versus GERP human-primate score. Multi-species conserved sequences (MCS modest) is expressed as base pairs (bp) per 100 kilobases (kb). The solid line represents the fit from a cubic smoothing spline, whereas the dashed line is the fit from a robust linear regression. GERP, genomic evolutionary rate profiling.
Figure 6. AR sequence rate versus GC content. Ancestral repeat (AR) sequence rate is expressed as base pairs (bp) per 100 kilobases (kb). The reduced local GC content observed in AR sequence reflects the process of deamination of methylated CpG to TpG dinucleotides in vertebrate sequence over long evolutionary periods of time [3]. The solid line represents the fit from a cubic smoothing spline, whereas the dashed line is the fit from a robust linear regression.
Figure 7. Indel rates versus GC content. Indel rate is expressed as base pairs (bp) per 100 kilobases (kb). The solid line represents the fit from a cubic smoothing spline, whereas the dashed line is the fit from a robust linear regression.
Comparison of ENCODE and Bhangale et al. (ten ENCODE regions) indel data
| | ENCODE (44 ENCODE regions/Baylor) | | | | Bhangale | | | |
| Feature | Indels (number) | Indels (bp) | Rate (number per 100 kb) | Rate (bp per 100 kb) | Indels (number) | Indels (bp) | Rate (number per 100 kb) | Rate (bp per 100 kb) |
| Manual | 2,186 | 6,504 | 14.6 | 43.4 | 362 | 1,122 | 13.0 | 40.4 |
| Random | 2,300 | 6,506 | 15.3 | 43.4 | 502 | 1,350 | 14.3 | 38.6 |
| Overall | 4,486 | 13,010 | 15.0 | 43.4 | 864 | 2,472 | 13.8 | 39.4 |
| RNA transcription | ||||||||
| CDS | 5 | 5 | 0.7 | 0.7 | 1 | 1 | 1.2 | 1.2 |
| TSS | 2 | 2 | 3.3 | 3.3 | 0 | 0 | 0.0 | 0.0 |
| RACEfrags | 9 | 28 | 2.1 | 6.6 | 0 | 0 | 0.0 | 0.0 |
| TARs/transfrags | 37 | 78 | 5.8 | 12.3 | 6 | 11 | 7.5 | 13.7 |
| Pseudo-exons | 9 | 26 | 6.6 | 19.1 | 2 | 10 | 9.7 | 48.7 |
| 3' UTR | 48 | 103 | 11.0 | 23.6 | 11 | 29 | 18.7 | 49.2 |
| 5' UTR | 7 | 32 | 6.0 | 27.4 | 4 | 8 | 37.3 | 74.6 |
| TUF | 53 | 160 | 12.2 | 36.9 | 4 | 18 | 8.1 | 36.4 |
| Open chromatin | ||||||||
| FAIRE sites | 106 | 327 | 7.7 | 23.8 | 17 | 72 | 5.6 | 23.6 |
| DHS (NHGRI) | 19 | 61 | 6.1 | 19.7 | 1 | 1 | 2.8 | 2.8 |
| DHS (Regulome) | 43 | 135 | 8.6 | 27.0 | 15 | 40 | 8.5 | 22.6 |
| DNA-protein interaction/transcript regulation | ||||||||
| HisPolTAF | 141 | 348 | 13.1 | 32.4 | 32 | 114 | 12.8 | 45.5 |
| Seq_specific (all motifs) | 131 | 420 | 11.2 | 35.8 | 28 | 122 | 33.4 | 145.3 |
| SeqSp (sequence specific factors) | 54 | 225 | 10.2 | 42.5 | 9 | 45 | 5.1 | 25.6 |
| Ancestral repeats | 532 | 1,592 | 7.9 | 26.5 | 110 | 280 | 8.7 | 22.1 |
| Evolutionary constraint | ||||||||
| MCS strict | 19 | 31 | 2.5 | 4.1 | 5 | 9 | 3.3 | 5.9 |
| MCS moderate | 78 | 170 | 5.1 | 11.2 | 17 | 36 | 5.4 | 11.4 |
| MCS loose | 356 | 960 | 9.8 | 26.4 | 63 | 136 | 8.4 | 18.1 |
| Cell cycle | ||||||||
| EarlyRepSeg | 1,124 | 2,989 | 16.4 | 43.5 | 161 | 495 | 16.4 | 50.4 |
| MidRepSeg | 1,190 | 3,352 | 15.4 | 43.2 | 270 | 797 | 16.4 | 48.3 |
| LateRepSeg | 1,110 | 3,345 | 13.9 | 41.9 | 300 | 819 | 11.3 | 31.0 |
Both datasets (Encyclopedia of DNA Elements [ENCODE] and that reported by Bhangale and coworkers [19]) are based on a subset of 8 African Americans (the Baylor samples). bp, base pairs; CDS, coding sequence; CI, confidence interval; DHS, DNAse hypersensitive sites; ENCODE, Encyclopedia of DNA Elements; FAIRE, formaldehyde assisted isolation of regulatory elements; kb, kilobases; MCS, multi-species conserved sequence; NHGRI, National Human Genome Research Institute; transfrag, transcribed fragment; RACEfrag, rapid amplification of cDNA ends fragment; SNP, single nucleotide polymorphism; TAR, transcriptionally active region; TSS, transcription start site; TUF, transcripts of unknown function; UTR, untranslated region.
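The Manual, Random, and Overall rates in the Bhangale columns can be cross-checked against the region sizes in the indel density table. The sketch below, in Python with illustrative names, assumes that the ten regions in the Bhangale comparison are the ten regions listed as having in-depth SNP discovery, and that the rates are normalised by the summed sizes of those regions. Both points are assumptions rather than statements from the source, although the recomputed values agree with the table.

```python
# Sketch: cross-check the Bhangale Manual/Random/Overall rates.
# Assumption: the ten regions are those listed with in-depth SNP discovery,
# and rates are normalised by their summed sizes (taken from the density table).

region_size_bp = {
    "ENm010": 500_000, "ENm013": 1_114_424, "ENm014": 1_163_197,
    "ENr112": 500_000, "ENr113": 500_000, "ENr123": 500_000,
    "ENr131": 500_064, "ENr213": 500_000, "ENr232": 500_000,
    "ENr321": 500_000,
}

# (number of indels, indel bp) from the Bhangale columns of the comparison table
bhangale = {"Manual": (362, 1122), "Random": (502, 1350)}

def total_size(prefix):
    """Summed size in bp of regions whose name starts with the given prefix."""
    return sum(size for name, size in region_size_bp.items() if name.startswith(prefix))

def rates(n_indels, indel_bp, size_bp):
    """Return (indels per 100 kb, indel bp per 100 kb), rounded to one decimal."""
    return (round(n_indels / size_bp * 100_000, 1),
            round(indel_bp / size_bp * 100_000, 1))

manual = rates(*bhangale["Manual"], total_size("ENm"))
random_ = rates(*bhangale["Random"], total_size("ENr"))
overall = rates(
    bhangale["Manual"][0] + bhangale["Random"][0],
    bhangale["Manual"][1] + bhangale["Random"][1],
    total_size("EN"),
)
print(manual, random_, overall)
```

The three pairs printed correspond to the Manual, Random, and Overall rows of the Bhangale columns (13.0 and 40.4, 14.3 and 38.6, 13.8 and 39.4).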