Abstract
The central goal of medical genomics is to understand the inherited basis of sequence variation that underlies human physiology, evolution, and disease. Functional association studies currently ignore millions of bases that span each centromeric region and acrocentric short arm. These regions are enriched in long arrays of tandem repeats, or satellite DNAs, that are known to vary extensively in copy number and repeat structure in the human population. Satellite sequence variation in the human genome is often so large that it is detected cytogenetically, yet because a reference assembly and informatics tools to measure this variability are lacking, contemporary high-resolution disease association studies are unable to detect causal variants in these regions. Nevertheless, recently uncovered associations between satellite DNA variation and human disease support the view that these regions represent a substantial and biologically important fraction of human sequence variation. Therefore, there is a pressing and unmet need to detect and incorporate this uncharacterized sequence variation into broad studies of human evolution and medical genomics. Here I discuss the current knowledge of satellite DNA variation in the human genome, focusing on centromeric satellites and their potential implications for disease.
Keywords: alpha satellite; centromere; genome assembly; human satellites; repeat; satellite DNA; sequence variation; structural variation
Year: 2019 PMID: 31072070 PMCID: PMC6562703 DOI: 10.3390/genes10050352
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Proportion of alpha satellite and human satellites 2,3 in the human population. Using 1KG [1] data representing 14 diverse populations (400 male individuals and 414 female individuals): (a) the frequency of 24-mers that have an exact match with alpha satellite [23]; (b) the frequency of 24-mers that have an exact match with human satellite 2,3 [31]. (c) Median frequencies (from panel (a) alpha and (b) HSat2,3) are listed relative to the observed frequency in the human reference genome assembly (GRCh38; GCA_000001405.15). (d) Evaluation of 300 Mb of DNA from the collective genomes of 910 people of African descent, previously determined to be missing or unaligned to GRCh38 [39]. Key for human subpopulations: CHB: Han Chinese in Beijing, China; JPT: Japanese in Tokyo, Japan; CHS: Southern Han Chinese; CEU: Utah Residents (CEPH) with Northern and Western European Ancestry; TSI: Toscani in Italia; FIN: Finnish in Finland; GBR: British in England and Scotland; IBS: Iberian Population in Spain; YRI: Yoruba in Ibadan, Nigeria; LWK: Luhya in Webuye, Kenya; GWD: Gambian in Western Divisions in the Gambia; ASW: Americans of African Ancestry in SW USA; MXL: Mexican Ancestry from Los Angeles USA; PUR: Puerto Ricans from Puerto Rico; CLM: Colombians from Medellin, Colombia.
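The 24-mer frequencies in panels (a) and (b) can be illustrated with a minimal sketch. It assumes that the frequency is the fraction of 24-mers in a sample's reads that exactly match a 24-mer drawn from a satellite reference set; the function names, the handling of reverse complements, and the skipping of k-mers containing ambiguous bases are illustrative assumptions, not the procedure used to produce the figure.

```python
# Hypothetical sketch: fraction of read k-mers that exactly match a satellite k-mer set.
# Assumptions (not from the source): both strands are indexed, and k-mers
# containing characters other than A, C, G, T are skipped.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k=24):
    """Yield every k-mer of seq that consists only of A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def build_index(reference_seqs, k=24):
    """Collect all reference k-mers and their reverse complements into a set."""
    index = set()
    for ref in reference_seqs:
        for kmer in kmers(ref, k):
            index.add(kmer)
            index.add(revcomp(kmer))
    return index


def satellite_kmer_frequency(reads, index, k=24):
    """Return matched k-mers divided by total k-mers across all reads."""
    total = 0
    matched = 0
    for read in reads:
        for kmer in kmers(read, k):
            total += 1
            if kmer in index:
                matched += 1
    return matched / total if total else 0.0
```

Dividing a sample's value by the same quantity computed on the reference assembly would give a relative frequency comparable in spirit to panel (c), again under the assumptions stated above.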
Figure 2. Intra-array satellite sequence variation. (a) All normal human centromeric regions contain at least one alpha satellite array, shown in grey, which is tandemly organized in a head-to-tail orientation with occasional transposable element interruptions (green) and shifts in directionality (black box). The fundamental alpha satellite repeat unit, or ~171 bp monomer, is shown in a variation of shaded colors to illustrate the heterogeneity in sequence identity. Multi-monomer repeat units, or ‘higher-order repeats’ (HORs), are shown by the larger grey arrows that encompass the collection of smaller repeats. In contrast to the individual monomers, these repeats are shown to be identical, or near-identical (98–100%). In addition to single nucleotide differences between the HORs, larger rearrangements (shown as a deletion of five monomers) are observed to occur and to expand and contract within the array. (b) Satellite array length predictions on the X chromosome (DXZ1) [7]; grey shading marks the previously observed PFGE Southern length range [25]. (c) Inversion detected using error-corrected PacBio reads [68]. (d) RP13-511L2 is an X-specific BAC that represents the transition from core alpha satellite to the edge of the array. HOR pair-wise repeat identity (MUSCLE alignment [72]) shows increased divergence approaching the chromosome arm (43,346 bp), as typically observed at the edge of the array.
Figure 3. Disease-associated variants in centromere-associated haplotypes. (a) Centromeres act as the primary constriction of chromosomes, and are historically defined by the reduction of meiotic recombination (indicated by blue). Therefore, sequences in these regions are commonly inherited in large linkage blocks, or cenhaps (shown in the linkage disequilibrium heat map) [70]. (b) Study of disease- and clinically associated single nucleotide variants (GWAS Catalog (green), ClinVar SNVs (yellow)) in the Xq cenhap region (with the linkage disequilibrium heat map from (a) enlarged) and a collection of annotated genes (RefSeq, white), variation in which has been attributed to a human disease (OMIM data, grey).
Description of centromere-adjacent single nucleotide polymorphisms (SNPs) identified by published Genome-Wide Association Studies (GWAS), collected in the NHGRI-EBI GWAS Catalog published jointly by the National Human Genome Research Institute (NHGRI) and the European Bioinformatics Institute (EMBL-EBI) [80]. SNPs are included if they are found within a two-megabase window of an alpha satellite reference model (GRCh38) and do not overlap with annotated genes or segmental duplications.
| Trait | SNPs | Centromere-adjacent regions (2 Mb window) | Citation |
|---|---|---|---|
| Cancer | rs930395, rs2241024, rs142427110, rs35951924, rs199501877, rs11146838, rs6490525, rs2050203, rs7278690, rs35505947 | 4p12; 5p12; 5q11; 10p11; 13q12; 18p11; 19q11; 20p11; 21q11 | [ |
| Cardiovascular disease | rs10132760, rs12186641, rs9367716, rs71566846, rs223290, rs144961578, rs3813127, rs1657346, rs1254531, rs10793514 | 5q11.2; 6p11.2; 6q11.1; 10q11.21; 14q11.2; 18q11.2 | [ |
| Neurodegenerative diseases | rs11826064, rs13168838, rs62365447, rs140996952, rs1480597, rs10783624, rs7989524, rs6822736, rs13110633, rs2424635 | 4p11; 4q12; 5p12; 5q11.1; 6q11.1; 10q11; 11p11; 12q12; 13q12; 20p11 | [ |
| Scoliosis/Bone Density (Spine) | rs8111296, rs11652527, rs1436931, rs6061081, rs17599071, rs10136383, rs9288898, rs10772040, rs4562194, rs810967, rs6050182, rs6511621, rs11229654, rs6551418, rs1006899 | 3p11.1; 3q11.2; 6q12; 7q11.21; 10q11.21; 11q11; 12p11.21; 14q11.2; 17q11.2; 19p12; 20p11.21; 21q11.2 | [ |
| Digestive system disease | rs4243971, rs2342002, rs4800353, rs6058869, rs6087990 | 6q11.1; 18q11.2; 20q11.21 | [ |
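
The inclusion rule stated in the table description (a SNP lies within a two-megabase window of an alpha satellite reference model and does not overlap an annotated gene or segmental duplication) can be sketched as a simple interval filter. This is an illustrative sketch only: the `Interval` type, the function names, the half-open coordinate convention, and the reading of "window" as 2 Mb on either side of the array (including the array itself) are assumptions, not the pipeline used to build the table.

```python
# Hypothetical sketch of the SNP inclusion rule; names and conventions are assumptions.
from dataclasses import dataclass

WINDOW_BP = 2_000_000  # two-megabase window, as stated in the table description


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def overlaps_any(chrom, pos, intervals):
    """True if the position falls inside any interval on the same chromosome."""
    return any(iv.chrom == chrom and iv.start <= pos < iv.end for iv in intervals)


def near_satellite(chrom, pos, arrays, window=WINDOW_BP):
    """True if the position is within `window` bp of any alpha satellite array."""
    return any(
        a.chrom == chrom and a.start - window <= pos < a.end + window
        for a in arrays
    )


def centromere_adjacent_snps(snps, arrays, genes, segdups, window=WINDOW_BP):
    """Keep SNP ids near a satellite array that avoid genes and segmental duplications.

    `snps` is an iterable of (snp_id, chrom, pos) tuples.
    """
    kept = []
    for snp_id, chrom, pos in snps:
        if not near_satellite(chrom, pos, arrays, window):
            continue
        if overlaps_any(chrom, pos, genes) or overlaps_any(chrom, pos, segdups):
            continue
        kept.append(snp_id)
    return kept
```

In practice the three interval lists would be loaded from GRCh38 annotation tracks; the linear scan here favors readability over speed and would likely be replaced by an interval tree or a sorted sweep for genome-scale input.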