Yi-Hui Zhou, George Sun.
Abstract
In the United States, colorectal cancer is the second leading cause of cancer death, and accurate early detection and identification of high-risk patients is a high priority. Although fecal screening tests are available, the close relationship between colorectal cancer and the gut microbiome has generated considerable interest. We describe a machine learning method for gut microbiome data to assist in diagnosing colorectal cancer. Our methodology integrates feature engineering, mediation analysis, statistical modeling, and network analysis into a novel unified pipeline. Simulation results illustrate the value of the method in comparison to existing methods. For predicting colorectal cancer in two real datasets, this pipeline showed 8.7% higher prediction accuracy and 13% higher area under the receiver operating characteristic curve than other published work. Additionally, the approach highlights important colorectal cancer-related taxa for prioritization, such as high levels of Bacteroides fragilis, which can help elucidate disease pathology. Our algorithms and approach can be widely applied for colorectal cancer prediction using either 16S rRNA or shotgun metagenomics data.
Keywords: 16S rRNA; colorectal cancer; feature engineering; machine learning; mediation analysis; prediction
Year: 2022 PMID: 36032686 PMCID: PMC9415616 DOI: 10.3389/fmolb.2022.921945
Source DB: PubMed Journal: Front Mol Biosci ISSN: 2296-889X
FIGURE 1. The Microbiome Host Trait Prediction (MHTP) pipeline for predicting colorectal cancer.
FIGURE 2. Two sample-size settings are used: n1 = n2 = 50 and n1 = n2 = 100. Two methods are compared: (1) our MHTP approach and (2) zero-inflated modeling with random forest. The left panel compares the area under the curve on simulated data across effect sizes. The right panel shows how the mean square error changes as the effect size increases.
The area under the curve (AUC), prediction accuracy, and mean square error (MSE) for CRC status prediction on the real data. In the first column, AUCRF denotes the feature engineering step; MHTP is our proposed method, which can be combined with the best-performing machine learning method.
| Method | AUC | Prediction accuracy | MSE |
|---|---|---|---|
| ZIBB + Random Forest | 0.777 | 0.732 | 0.191 |
| ZIBB + Bart | 0.782 | 0.741 | 0.182 |
| AUCRF + ZIBB + Random Forest | 0.804 | 0.757 | 0.168 |
| AUCRF + ZIBB + Bart | 0.802 | 0.743 | 0.180 |
| MHTP (with RF) | 0.867 | 0.792 | 0.142 |
| MHTP (with Bart) | 0.882 | 0.796 | 0.147 |
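The relative gains quoted in the abstract can be checked roughly against the table above. The sketch below is only illustrative: this record does not say which comparison the 8.7% and 13% figures come from, so the choice of ZIBB + Random Forest as the baseline and MHTP (with Bart) as the proposed method is an assumption, as are the helper name `relative_gain` and the `results` dictionary.

```python
# Values copied from the table above: (AUC, prediction accuracy, MSE).
results = {
    "ZIBB + Random Forest": (0.777, 0.732, 0.191),
    "ZIBB + Bart": (0.782, 0.741, 0.182),
    "AUCRF + ZIBB + Random Forest": (0.804, 0.757, 0.168),
    "AUCRF + ZIBB + Bart": (0.802, 0.743, 0.180),
    "MHTP (with RF)": (0.867, 0.792, 0.142),
    "MHTP (with Bart)": (0.882, 0.796, 0.147),
}


def relative_gain(new: float, old: float) -> float:
    """Percentage change of `new` relative to `old`."""
    return 100.0 * (new - old) / old


# Assumption: these two rows are picked for illustration only; the source
# does not state which baseline the reported gains refer to.
baseline = results["ZIBB + Random Forest"]
proposed = results["MHTP (with Bart)"]

auc_gain = relative_gain(proposed[0], baseline[0])
acc_gain = relative_gain(proposed[1], baseline[1])

# Rows with the highest AUC and the lowest MSE in the table.
best_auc = max(results, key=lambda method: results[method][0])
lowest_mse = min(results, key=lambda method: results[method][2])

print(f"AUC gain: {auc_gain:.1f}%, accuracy gain: {acc_gain:.1f}%")
print(f"Highest AUC: {best_auc}; lowest MSE: {lowest_mse}")
```

Under this assumed pairing, the accuracy gain comes out at about 8.7% and the AUC gain at about 13.5%, which is in line with the figures in the abstract. Note that the two MHTP variants split the metrics: the Bart variant has the highest AUC and accuracy, while the RF variant has the lowest MSE.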
FIGURE 3. Receiver operating characteristic (ROC) curves comparing the performance of Random Forest and our MHTP on the second real CRC dataset.
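For readers unfamiliar with how an ROC curve and its AUC are obtained from predicted scores, the following is a minimal, generic sketch. It is not the authors' code; the function names `roc_points` and `auc` and the toy labels and scores are hypothetical and do not come from the study's data.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per distinct score threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one control")
    # Sweep the threshold from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score,
        # so tied scores move the curve in a single diagonal step.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example (hypothetical): 1 = CRC case, 0 = control.
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_points(toy_labels, toy_scores))
print(f"toy AUC = {toy_auc:.3f}")
```

In the toy data, 8 of the 9 case-control pairs are ranked correctly, so the AUC equals 8/9. A classifier that assigns every sample the same score yields a single diagonal segment and an AUC of 0.5.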
FIGURE 4. The cluster of OTUs that are highly correlated with Bacteroides fragilis.