Zhenhua Liu1, Guihu Zhao2, Yuhui Xiao3, Sheng Zeng4, Yanchun Yuan1, Xun Zhou1, Zhenghuan Fang2, Runcheng He1, Bin Li2, Yuwen Zhao1, Hongxu Pan1, Yige Wang1, Guoliang Yu3, I-Feng Peng3, Depeng Wang3, Qingtuan Meng5, Qian Xu1, Qiying Sun6, Xinxiang Yan1, Lu Shen1,2, Hong Jiang1,7, Kun Xia8, Junling Wang1, Jifeng Guo1, Fan Liang3, Jinchen Li2,6,8, Beisha Tang1,2,5,7.
Abstract
Background: Short tandem repeats (STRs) are highly variable elements that play a pivotal role in multiple genetic diseases and the regulation of gene expression. Long-read sequencing (LRS) offers a potential solution for genome-wide STR analysis. However, characterization of STRs in human genomes using LRS at a large population scale has not been reported.
Keywords: TRcards; brain tissue; database; highly variable STRs; long-read sequencing; short tandem repeats; synaptic function
Year: 2022 PMID: 35601492 PMCID: PMC9117641 DOI: 10.3389/fgene.2022.810595
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.772
FIGURE 1 Genome-wide profiling of STRs compared with published databases. (A–F) Distribution of repeat sizes in TRcards, the Hg19 reference, and the WebSTR database, with STRs stratified by genomic feature: the exonic region (A), the untranslated region (UTR) (B), the proximal intronic region (defined as a location within 1 kb of the nearby gene) (C), the distal intronic region (defined as a location more than 1 kb away from the nearby gene) (D), the upstream region (E), and the downstream region (F). (G–H) Distribution of repeat sizes for the CCG repeat unit (G) and the CAG repeat unit (H) in TRcards, WebSTR, and the Hg19 reference. CCG and CAG are common STR repeat units reported to cause genetic diseases. In all panels, blue = Hg19 reference; yellow = WebSTR; red = TRcards (our dataset). The x-axis shows the repeat size (repeat sizes above 30 are combined). The y-axis shows the percentage of total STR loci.
FIGURE 2 Distribution of repeat sizes for selected well-studied disease-associated STR loci. Loci with the CAG repeat unit include ATXN1 (Spinocerebellar Ataxia Type 1), ATXN2 (Spinocerebellar Ataxia Type 2), ATXN3 (Spinocerebellar Ataxia Type 3), CACNA1A (Spinocerebellar Ataxia Type 6), ATXN7 (Spinocerebellar Ataxia Type 7), ATXN8OS (Spinocerebellar Ataxia Type 8), HTT (Huntington’s disease), and AR (Spinal and Bulbar Muscular Atrophy). Loci with the CCG repeat unit include GIPC1 (oculopharyngodistal myopathy) and LRP12 (oculopharyngodistal myopathy). The TAAAA repeat unit is represented by SAMD12 (familial cortical myoclonic tremor with epilepsy). The AAAAG repeat unit is represented by RFC1 (cerebellar ataxia, neuropathy, and vestibular areflexia syndrome).
FIGURE 3 Genome-wide evaluation of STR variability. (A) The distribution of STR variability in the dSTR, eSTR, and FM-eSTR subsets. TRcards = our entire STR dataset, dSTR = disease-associated STRs, eSTR = expression STRs, FM-eSTR = fine-mapped eSTRs. (B) The distribution of STR variability in the different motif size subsets. (C) The distribution of STR variability in the different genomic region subsets. In all panels, the x-axis gives the normalized RDI value and the y-axis gives the percentage of STR loci. Normalized RDI = normalized repeat dynamic index. We refer to STRs with normalized RDI scores of 0–0.2, 0.2–0.4, 0.4–0.6, 0.6–0.8, and 0.8–1.0 as very lowly variable STR (vlSTR), lowly variable STR (lSTR), moderately variable STR (mSTR), highly variable STR (hSTR), and very highly variable STR (vhSTR), respectively. Colors denote different STR subsets. The brown dashed line in (B) and (C) shows the reference percentage in the entire dataset.
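The five variability classes defined in the Figure 3 caption amount to a fixed binning of the normalized RDI score. The following is a minimal sketch of that binning, not the authors' code. The function name `classify_nrdi` is hypothetical, and because the caption does not say which class a score exactly on a boundary (for example 0.2) belongs to, the sketch assumes that boundary values fall into the higher class.

```python
from bisect import bisect_right

# Interior bin edges and class labels taken from the Figure 3 caption.
EDGES = [0.2, 0.4, 0.6, 0.8]
LABELS = ["vlSTR", "lSTR", "mSTR", "hSTR", "vhSTR"]


def classify_nrdi(nrdi: float) -> str:
    """Map a normalized RDI score in [0, 1] to its variability class.

    Assumption (not stated in the caption): a score exactly on an edge,
    such as 0.2, is assigned to the higher class.
    """
    if not 0.0 <= nrdi <= 1.0:
        raise ValueError(f"normalized RDI must lie in [0, 1], got {nrdi}")
    # bisect_right counts how many edges are <= nrdi, which is the class index.
    return LABELS[bisect_right(EDGES, nrdi)]


if __name__ == "__main__":
    for score in (0.05, 0.35, 0.5, 0.75, 0.95):
        print(score, classify_nrdi(score))
```

If boundary scores should instead fall into the lower class, replacing `bisect_right` with `bisect_left` would be one way to express that.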
FIGURE 4 Tissue-specific expression profiles of vhSTRs and hSTRs. (A) Tissue-specific expression pattern of vhSTRs. (B) Tissue-specific expression pattern of hSTRs. Heatmaps show the expression patterns of different STR subsets across different tissues based on the normalized expression level. vhSTRs = very highly variable STRs, hSTRs = highly variable STRs. The rows represent the entire dataset of vhSTRs or hSTRs and their subsets stratified by different motif sizes, genomic regions, and repeat units. The columns represent the tissues.
FIGURE 5 Enrichment pathways of vhSTRs and hSTRs. (A) Significant GO terms of vhSTRs. (B) Significant GO terms of hSTRs. (C) Significant KEGG pathways of vhSTRs. (D) Significant KEGG pathways of hSTRs. vhSTRs = very highly variable STRs, hSTRs = highly variable STRs. GO = Gene Ontology, KEGG = Kyoto Encyclopedia of Genes and Genomes. In all panels, the rows represent the entire dataset of vhSTRs or hSTRs and their subsets stratified by different motif sizes, genomic regions, and eSTR repeat units. The columns represent the GO terms or KEGG pathways. Fisher’s exact test was used to calculate the p-value for each tissue.
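The caption names Fisher’s exact test but does not give the contingency table or say whether the test was one-sided or two-sided. As an illustration only, the sketch below computes a one-sided (enrichment) Fisher p-value as a hypergeometric upper tail using the Python standard library. The function name, the argument names, and the counts in the example are all hypothetical and are not taken from the paper.

```python
from math import comb


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact test p-value, P(X >= k).

    Hypothetical setup (an assumption, not from the source):
      N = genes in the background set
      K = background genes annotated to the term or pathway
      n = genes in the tested STR subset
      k = genes in the tested subset annotated to the term
    X follows a hypergeometric distribution with parameters (N, K, n).
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # Sum the probabilities of observing k or more annotated genes.
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible tables contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


if __name__ == "__main__":
    # Made-up counts, purely to show the call.
    print(fisher_enrichment_p(k=12, n=200, K=300, N=20000))
```

In practice a multiple-testing correction would normally follow when many terms are tested. The caption does not state which correction, if any, was applied.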
FIGURE 6 Snapshot of the TRcards web interface (http://www.genemed.tech/trcards/home). The “Home page” presents the introduction and motivation of TRcards. Specific STRs can be accessed through the “Search page” and “Browse page” using different input query types. The CACCC repeat in the FAM41C gene is shown as an example of the information provided for each STR locus, including the chromosome, the start position of the repeat, the end position of the repeat, the 5th and 95th percentiles of the repeat counts, the nRDI value, and a plot of the repeat size distribution.