Max Garzon, Sambriddhi Mainali, Maria Fernanda Chacon, Shima Azizzadeh-Roodpish.
Abstract
The current pandemic (COVID-19) has made evident the need to approach pathogenicity from a deeper and more systematic perspective, one that might lead to methodologies for quickly predicting which new strains of microbes could be pathogenic to humans. Here we propose a general and principled definition of pathogenicity that can be implemented operationally in a framework for characterizing and assessing the (degree of) potential pathogenicity of a microbe to a given host (e.g., a human individual) based on DNA biomarkers alone, to the point of predicting its impact on a host a priori with a meaningful degree of accuracy. The definition is based on basic biochemistry, namely the Gibbs free energy of duplex formation between oligonucleotides, and on some deep structural properties of DNA revealed by an approximation with certain properties. We propose two operational tests based on the nearest neighbor (NN) model of the Gibbs energy and an approximating metric (the h-distance). Quality assessments demonstrate that these tests predict pathogenicity with an accuracy of over 80%, and sensitivity and specificity over 90%. Other tests, obtained by training machine learning models on deep features extracted from DNA sequences, yield scores of 90% for accuracy, 100% for sensitivity and 80% for specificity. These results hint at the possibility of an operational, objective, and general conceptual framework for prior identification of pathogens and their impact without the cost of death or sickness in a host (e.g., humans). Consequently, a reasonable prediction of possible pathogens might eventually transform the way we handle and prepare for future pandemic events and mitigate the adverse impact on human health, while reducing the number of clinical trials needed to obtain similar results.
Keywords: Digital genomic signature; Gibbs energy; Hybridization; Machine learning; Pathogenic relationship; Pathogens/nonpathogens; h-distance
Year: 2022 PMID: 36125534 PMCID: PMC9486766 DOI: 10.1007/s00438-022-01951-w
Source DB: PubMed Journal: Mol Genet Genomics ISSN: 1617-4623 Impact factor: 2.980
The parameters used in the proposed definition of pathogenic relationship between microbes and humans based on the PNP-G test
| ID | Threshold | Radius |
|---|---|---|
| bacs20C-OnGrid200 | 20,545.0 | 7,925.0 |
| bacs40B-OnGrid100 | 19,558.5 | 428.5 |
| funs20C-OnGrid200 | 5,102.5 | 894.5 |
| funs40B-OnGrid100 | 15,891.0 | 545.0 |
The parameters used in the proposed definition of pathogenic relationship between microbes and humans based on the PNP-h test
| ID | | Threshold | Radius |
|---|---|---|---|
| bacs20COnGrid200 | 14 | 29,928.5 | 29,927.5 |
| bacs40BOnGrid100 | 27 | 19,460.5 | 123.5 |
| funs20COnGrid200 | 14 | 59,721.5 | 101.5 |
| funs40BOnGrid100 | 27 | 18,670 | 177 |
Machine learning models for prediction and assessment of pathogenicity tests of microbes in Homo sapiens based on genomic signatures
| Machine learning models | ID | Implementation |
|---|---|---|
| k-Nearest Neighbors | kNN | Python (Pedregosa et al.) |
| Support Vector Machines (SVMs) with radial basis kernel | RBF | Python (Pedregosa et al.) |
| Decision Trees | DT | Python (Pedregosa et al.) |
| Multilayer perceptrons | MLP | Python (Pedregosa et al.) |
| AdaBoost | AB | Python (Pedregosa et al.) |
The proxies for microbes (25 pathogens and 25 nonpathogens) and host species used in the assessment of the pathogenicity tests PNP-G and PNP-h
| ID | Target taxon | Length of oligos | No. of points in dataset | Approximation of hybridization affinity |
|---|---|---|---|---|
| bacs20C-G-OnGrid100 / bacs40B-G-OnGrid200 | Bacteria | 20 / 40 | 300*100 / 200*200 | Gibbs Energy |
| funs20C-G-OnGrid100 / funs40B-G-OnGrid200 | Fungi | 20 / 40 | 300*100 / 200*200 | |
| bacs20C- / bacs40B- | Bacteria | 20 / 40 | 300*100 / 200*200 | |
| funs20C- / funs40B- | Fungi | 20 / 40 | 300*100 / 200*200 | |
Fig. 1 A DNA sequence x is shredded into fragments of the same length n as the probes on an nxh basis, so that the number of fragments hybridizing with each probe can be counted to obtain a feature vector from x. The oligos for the basis are judiciously selected so that no cross-hybridization occurs among the probes in the basis itself and, moreover, every random fragment hybridizes to (ideally exactly) one probe. An ideal basis thus produces feature vectors that are fully reproducible and contain much of the information in the original sequence x.
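The shredding and counting described in the caption of Fig. 1 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the functions `shred`, `hybridizes` and `signature`, the made-up probes, the choice of non-overlapping fragments, and above all the toy hybridization rule (mismatches against the reverse complement of the probe), which stands in for the Gibbs energy or h-distance criteria and is not the authors' decision procedure.

```python
from typing import List

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(oligo: str) -> str:
    """Return the Watson-Crick reverse complement of a DNA oligo."""
    return "".join(COMPLEMENT[base] for base in reversed(oligo))


def shred(x: str, n: int) -> List[str]:
    """Cut x into consecutive, non-overlapping fragments of length n.

    Assumption: a trailing piece shorter than n is dropped.
    """
    return [x[i:i + n] for i in range(0, len(x) - n + 1, n)]


def hybridizes(fragment: str, probe: str, max_mismatches: int = 0) -> bool:
    """Toy hybridization decision (hypothetical stand-in).

    A fragment is taken to hybridize with a probe when it differs from the
    probe's reverse complement in at most max_mismatches positions.
    """
    target = reverse_complement(probe)
    mismatches = sum(a != b for a, b in zip(fragment, target))
    return mismatches <= max_mismatches


def signature(x: str, basis: List[str], max_mismatches: int = 0) -> List[int]:
    """Count, for each probe in the basis, the fragments of x that hybridize to it."""
    n = len(basis[0])
    fragments = shred(x, n)
    return [
        sum(hybridizes(f, probe, max_mismatches) for f in fragments)
        for probe in basis
    ]


# Made-up 4-mer probes and a made-up sequence, for illustration only.
basis = ["AAAA", "CCCC", "ACGT"]
x = "TTTTGGGGACGTTTTTAC"
print(signature(x, basis))  # [2, 1, 1]
```

In this toy run every fragment hybridizes to exactly one probe, which is the behavior the caption attributes to an ideal basis.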
Nxh bases used to extract predictor features for machine learning models to predict pathogenicity of microbes in Homo sapiens
| Basis | Length | Size | Avg | | Entropy |
|---|---|---|---|---|---|
| 3mE4b | 3 | 4 | 1.1 | 1.09 | 0.45 |
| 4mP3-3 | 4 | 3 | 2.1 | 1.0 | 0 |
| 8mP10 | 8 | 10 | 4.1 | 1.1 | 0.57 |
The quality of a basis can be quantified by the Shannon entropy (uncertainty) of the random variable that counts the number of random target oligos that hybridize to the probes in the basis. An ideal basis (such as 4mP3-3) has entropy 0 and leaves no uncertainty in the hybridization count.
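As a rough illustration of this quality measure, the sketch below computes the Shannon entropy of an empirical distribution of hybridization counts. It assumes that the input is, for each random target oligo, the number of probes it hybridizes to, and that the logarithm is taken in base 2; the source specifies neither, and the function name `count_entropy` is hypothetical. It is not expected to reproduce the entropy values in the table.

```python
import math
from collections import Counter
from typing import Iterable


def count_entropy(hits_per_oligo: Iterable[int]) -> float:
    """Shannon entropy (in bits) of the empirical distribution of hit counts.

    hits_per_oligo holds, for each target oligo, how many probes in the
    basis it hybridizes to (an assumed reading of the definition).
    """
    counts = Counter(hits_per_oligo)
    total = sum(counts.values())
    # Sum of p * log2(1/p) over the observed count values.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Ideal case: every oligo hybridizes to exactly one probe, so no uncertainty.
print(count_entropy([1, 1, 1, 1]))  # 0.0

# Non-ideal case: some oligos hit no probe or two probes.
print(count_entropy([1, 1, 0, 2]))  # 1.5
```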
Fig. 2 Performance assessment of the definition of pathogenicity of bacteria and fungi using thresholding methods, based on the decision about hybridization events between oligos in the proxies of a host and a microorganism (top: based on Gibbs energy; bottom: based on h-distance). The x-axis represents the different data sets for proxies and grids (IDs are in Table 4).
Fig. 3 Performance assessment of the definition of pathogenicity of bacteria (top), fungi (middle) and combined (bottom) obtained using machine learning models trained on genomic signatures.
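The performance scores reported for these tests are accuracy, sensitivity and specificity. A minimal sketch of how the three are computed from binary predictions follows; the label encoding (1 for pathogen, 0 for nonpathogen), the function name `binary_metrics` and the ten labels in the example are made up for illustration and are not the study's data.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for labels 1 (pathogen) and 0 (nonpathogen).

    Assumes both classes occur in y_true, so no denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # pathogens correctly flagged
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # nonpathogens correctly cleared
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # nonpathogens wrongly flagged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # pathogens missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Made-up example: 5 pathogens all detected, 1 of 5 nonpathogens misclassified.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.9, 'sensitivity': 1.0, 'specificity': 0.8}
```

Sensitivity depends only on how the pathogens are classified and specificity only on the nonpathogens, which is why the two can differ widely at a given accuracy.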
The average values of the Gibbs energies (kcal/mol) and/or h-distances between the shredded sequences of pathogens and hosts are large enough to conclude that the specimens selected for the sample data are sufficiently diverse, providing strong evidence of the scalability of the PNP-G and PNP-h tests to other pathogens in these taxa and to H. sapiens hosts
| Avg Gibbs (kcal/mol) / h-distance | bac20C | fun20C | bac40B | fun40B |
|---|---|---|---|---|
| bac20C | −4.4/11.9 | −3.9/11.9 | | |
| fun20C | −3.9/11.9 | −3.5/10.8 | | |
| bac40B | | | −8.9/25.5 | −8.0/25.4 |
| fun40B | | | −8.0/25.4 | −7.0/23.6 |
The sample of specimens from bacteria and fungi that are pathogenic/nonpathogenic to humans
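| Microorganism | Species | Accession ID | Category |
|---|---|---|---|
| | | CP001608.1 | Pathogens |
| | | CP002103.1 | Pathogens |
| | | CP003191.1 | Pathogens |
| | | CP002110.1 | Pathogens |
| | | CP001844.2 | Pathogens |
| | | FR872582.1 | Pathogens |
| | | CP002457.1 | Pathogens |
| | | CP002637.1 | Pathogens |
| | | CP003040.1 | Pathogens |
| | | CP002428.1 | Pathogens |
| | | AP011533.1 | Pathogens |
| | | CP002689.1 | Pathogens |
| | | CP002544.1 | Pathogens |
| | | CP002458.1 | Pathogens |
| | | CP001662.1 | Pathogens |
| | | CP001641.1 | Pathogens |
| | | CP001642.1 | Pathogens |
| | | CP002329.1 | Pathogens |
| | | CP002003.1 | Pathogens |
| | | CP002001.1 | Pathogens |
| | | CP002004.1 | Pathogens |
| | | CP002002.1 | Pathogens |
| | | FR687253.1 | Pathogens |
| | | AP009333.1 | Pathogens |
| | | AP011945.1 | Pathogens |
| | | CP002850.1 | Nonpathogens |
| | | CP002455.1 | Nonpathogens |
| | | FR873481.1 | Nonpathogens |
| | | CP002888.1 | Nonpathogens |
| | | CP003068.1 | Nonpathogens |
| | | AP010803.1 | Nonpathogens |
| | | CP002623.1 | Nonpathogens |
| | | CP003244.1 | Nonpathogens |
| | | CP002379.1 | Nonpathogens |
| | | CP002589.1 | Nonpathogens |
| | | FN995097.1 | Nonpathogens |
| | | CP002830.1 | Nonpathogens |
| | | FR668087.1 | Nonpathogens |
| | | CP002992.1 | Nonpathogens |
| | | CP002385.1 | Nonpathogens |
| | | CP003101.3 | Nonpathogens |
| | | CP002365.1 | Nonpathogens |
| | | CP002844.1 | Nonpathogens |
| | | CP002464.1 | Nonpathogens |
| | | CP002341.1 | Nonpathogens |
| | | CP000156.1 | Nonpathogens |
| | | CP002652.1 | Nonpathogens |
| | | CP000647.1 | Nonpathogens |
| | | CP002442.1 | Nonpathogens |
| | | CP002390.1 | Nonpathogens |
| | | MT815704.1 | Pathogens |
| | | AY955840.1 | Pathogens |
| | | NC_007935.1 | Pathogens |
| | | CP025773.1 | Pathogens |
| | | CP003834.1 | Pathogens |
| | | AY101381.1 | Pathogens |
| | | NC_018792.1 | Pathogens |
| | | NC_004336.1 | Pathogens |
| | | CP022335.1 | Pathogens |
| | | NC_015923.1 | Pathogens |
| | | AB568600.1 | Pathogens |
| | | AB568599.1 | Pathogens |
| | | NC_053321.1 | Pathogens |
| | | MT849287.1 | Pathogens |
| | | AP018713.1 | Pathogens |
| | | NC_005256.1 | Pathogens |
| | | AY347307.1 | Pathogens |
| | | KU761332.1 | Pathogens |
| | | KU761331.1 | Pathogens |
| | | KU761330.1 | Pathogens |
| | | KU761329.1 | Pathogens |
| | | NC_018046.1 | Pathogens |
| | | JQ864234.1 | Pathogens |
| | | JQ864233.1 | Pathogens |
| | | KC993188.1 | Pathogens |
| | | KY498478.1 | Nonpathogens |
| | | KY498477.1 | Nonpathogens |
| | | KY213951.1 | Nonpathogens |
| | | NC_026614.1 | Nonpathogens |
| | | KC683708.1 | Nonpathogens |
| | | KX657750.1 | Nonpathogens |
| | | NC_031515.1 | Nonpathogens |
| | | NC_012145.1 | Nonpathogens |
| | | EU852811.1 | Nonpathogens |
| | | NC_001326.1 | Nonpathogens |
| | | X54421.1 | Nonpathogens |
| | | NC_040930.1 | Nonpathogens |
| | | MK457734.1 | Nonpathogens |
| | | AF275271.2 | Nonpathogens |
| | | NC_004312.1 | Nonpathogens |
| | | MK618140.1 | Nonpathogens |
| | | MK618139.1 | Nonpathogens |
| | | MK618138.1 | Nonpathogens |
| | | MK618137.1 | Nonpathogens |
| | | MK618136.1 | Nonpathogens |
| | | MK618135.1 | Nonpathogens |
| | | MK618134.1 | Nonpathogens |
| | | MK618133.1 | Nonpathogens |
| | | MK618132.1 | Nonpathogens |
| | | MK618131.1 | Nonpathogens |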
The sample of host H. sapiens specimens used to build a grid
| Category | Species | Accession ID |
|---|---|---|
| | | NC_012920.1 |
| | | AP009475.1 |
| | | AP009474.1 |
| | | AP009473.1 |
| | | AP009472.1 |
| | | AP009471.1 |
| | | AP009470.1 |
| | | AP009469.1 |
| | | AP009468.1 |
| | | AP009467.1 |
| | | AP009466.1 |
| | | AP009465.1 |
| | | AP009463.1 |
| | | AP009462.1 |
| | | AP009461.1 |
| | | AP009460.1 |
| | | AP009459.1 |
| | | AP009458.1 |
| | | AP009457.1 |
| | | AP009456.1 |
| | | AP009455.1 |
| | | AP009454.1 |
| | | AP009453.1 |
| | | AP009452.1 |
| | | AP009451.1 |
| | | AP009450.1 |
| | | AP009449.1 |
| | | AP009448.1 |
| | | AP009447.1 |
| | | AP009446.1 |
| | | AP009445.1 |
| | | AP009444.1 |