Mayra M Bañuelos1,2,3, Yuómi Jhony A Zavaleta4, Alennie Roldan4, Rochelle-Jan Reyes4, Miguel Guardado1, Berenice Chavez Rojas4, Thet Nyein1, Ana Rodriguez Vega4, Maribel Santos4, Emilia Huerta-Sanchez2,3, Rori V Rohlfs4.
Abstract
A set of 20 short tandem repeats (STRs) is used by the US criminal justice system to identify suspects and to maintain a database of genetic profiles for individuals who have been previously convicted or arrested. Some of these STRs were identified in the 1990s, with a preference for markers in putative gene deserts to avoid forensic profiles revealing protected medical information. We revisit that assumption, investigating whether forensic genetic profiles reveal information about gene-expression variation or potential medical information. We find six significant correlations (false discovery rate = 0.23) between the forensic STRs and the expression levels of neighboring genes in lymphoblastoid cell lines. We explore possible mechanisms for these associations, showing evidence compatible with forensic STRs causing expression variation or being in linkage disequilibrium with a causal locus in three cases and weaker or potentially spurious associations in the other three cases. Together, these results suggest that forensic genetic loci may reveal expression levels and, perhaps, medical information.
Keywords: CODIS loci; STRs; data privacy; forensic genetics; gene expression
Year: 2022 PMID: 36166477 PMCID: PMC9546536 DOI: 10.1073/pnas.2121024119
Source DB: PubMed Journal: Proc Natl Acad Sci U S A ISSN: 0027-8424 Impact factor: 12.779
Fig. 1. Correlations between CODIS loci and the expression of neighboring genes. Associations of CODIS STR–gene pairs are shown as negative log p values. The red dotted line denotes the p value significance threshold. CODISeSTRs are shown in dark blue, and non-CODISeSTRs are shown in light blue.
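The significance threshold in Fig. 1 and the false discovery rate quoted in the abstract can be illustrated with a minimal sketch. It assumes a Benjamini-Hochberg step-up procedure, which this page does not confirm as the authors' method, and a base-10 logarithm for the plotted values. The function names and the p values in `toy_p` are hypothetical and are not taken from the study.

```python
import math


def benjamini_hochberg(p_values, fdr):
    """Return sorted indices of tests declared significant at the given FDR.

    Assumption: Benjamini-Hochberg step-up; the source does not name
    the multiple-testing procedure it used.
    """
    m = len(p_values)
    # Rank the tests from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p value is under its step-up bound.
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


def neg_log10(p):
    """Transform a p value to the negative log scale used for plotting."""
    return -math.log10(p)


# Illustrative p values only (hypothetical, not study data).
toy_p = [0.001, 0.20, 0.03, 0.75, 0.04]
significant = benjamini_hochberg(toy_p, fdr=0.23)
```

With a permissive rate such as 0.23, a sizable fraction of the pairs called significant may be false positives, which is consistent with the abstract describing some associations as potentially spurious.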
Genomic features of CODISeSTRs
| CODISeSTR | Location relative to genes | Distance to nearest TSS, bp (genomic percentile, %) | Distance to TSS of associated gene, bp | Distance to nearest DHS site, bp (genomic percentile, %) | Distance to nearest lymph DHS site, bp (genomic percentile, %) | Length, bp (genomic percentile, %) | Repeating unit |
|---|---|---|---|---|---|---|---|
| D3S1358 | Intronic to LARS2 | 31,194 (52.6) | 152,15 to LARS2 | 1,916 (72.3) | 4,651 (69.1) | 63 (96.7) | [AGAT]n |
| D2S441 | Intergenic | 41,143 (45.8) | 41,143 to C1D | 14,064 (28.1) | 19,514 (39.3) | 47 (93.6) | [TGCC]m[TTCC]n |
| CSF1PO | Intronic to CSF1R | 4,649 (88.6) | 4,649 to CSF1R | 0 (100.0) | 0 (100.0) | 51 (94.8) | [AGAT]n |
| D18S51 | Intronic to BCL2 | 36,928 (48.4) | 85,535 to KDSR | 13,230 (29.3) | 3,714 (73.2) | 71 (97.3) | [AGAA]n |
| FGA | Intronic to FGA | 2,922 (92.7) | 2,922 to FGA | 8,065 (40) | 27,933 (27.9) | 87 (98.1) | [TTTC]m TTTTTTCT[CTTT]n CTCC[TTCC]o |
Fig. 2. LARS2–D3S1358 CAVIAR and local LD landscape. Local LD and CAVIAR score landscapes in a 100-kb window centered on the LARS2 gene for the FIN subpopulation (A), GBR subpopulation (B), and TSI subpopulation (C). For each plot, Upper shows LD between the CODISeSTR D3S1358 versus each variant in the ρ causal set, and Lower shows CAVIAR scores for variants in the ρ causal set. Dark green circles enclose putative causal variants in both CAVIAR and LD panels. Chr, chromosome.
Putative mechanisms for observed CODISeSTR-expression associations
| CODISeSTR– | Association observed at subpopulation level | CODISeSTR fit to FMeSTR profile* | CODISeSTR LD with CAVIAR causal variants† | CODISeSTR LD with DHS sites active in lymphoblasts† |
|---|---|---|---|---|
| CSF1PO– | Yes | Strong | Low | Overlaps with DHS site |
| D18S51– | Yes | Moderate | Low | Low |
| D3S1358– | Yes | Weak | Moderate–High | Low–Moderate |
| CSF1PO– | No | Strong | N/A | Low–Moderate |
| D2S441– | No | Weak | N/A | Low |
| FGA– | No | Weak | N/A | Low |
*Strong fit is defined as satisfying most or all of the FMeSTR characteristics described in Table 1; moderate is defined as satisfying at least half; weak is defined as satisfying fewer than half.
†High LD is considered ≥ 0.7; moderate LD is between 0.4 and 0.69; low LD is <0.4. N/A values indicate STR–gene pairs that were not included in the CAVIAR analysis.
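The LD categories in the footnote above can be written as a small helper. This is a sketch under stated assumptions: the function name `classify_ld` is hypothetical, the LD statistic is taken to lie between 0 and 1, and values between 0.69 and 0.7, which the footnote leaves unassigned, are treated here as moderate.

```python
def classify_ld(ld):
    """Map an LD value in [0, 1] to the footnote's category labels.

    Assumption: the footnote's "0.4 to 0.69" band is read as
    0.4 <= ld < 0.7, so no value is left unclassified.
    """
    if not 0.0 <= ld <= 1.0:
        raise ValueError("LD value must lie between 0 and 1")
    if ld >= 0.7:
        return "high"
    if ld >= 0.4:
        return "moderate"
    return "low"


# Illustrative inputs only (hypothetical, not values from the study).
labels = [classify_ld(x) for x in (0.85, 0.55, 0.12)]
```

Table entries such as "Moderate–High" would then correspond to LD values that fall into more than one category, for example across the variants or subpopulations compared.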