Ping Zhang, Nicholas P West, Pin-Yen Chen, Mike W C Thang, Gareth Price, Allan W Cripps, Amanda J Cox.
Abstract
BACKGROUND: Principal components analysis (PCA) is often used to find characteristic patterns associated with certain diseases by reducing the number of variables before a predictive model is built, particularly when some variables are correlated. Usually, the first two or three components from PCA are used to determine whether individuals can be clustered into two classification groups based on pre-determined criteria: a control group and a disease group. However, a combination of other components may exist that better distinguishes diseased individuals from healthy controls. Genetic algorithms (GAs) can be useful and efficient for searching for the best combination of variables to build a prediction model. This study aimed to develop a prediction model that combines PCA and a GA for identifying sets of bacterial species associated with obesity and metabolic syndrome (MetS).
Keywords: Biomarker; Genetic algorithm; Obesity; PCA
Year: 2019 PMID: 31823717 PMCID: PMC6904994 DOI: 10.1186/s12859-019-3001-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A diagram of the proposed method. s1, s2, ..., sn are the 16S rRNA sequences for this study (they can come from other sequencing). v1 to vn are the normalized abundances of each species detected in each individual. m = number of PCs created by PCA, n = number of individuals included in the sample. PCA is used to produce PC scores for each individual, and GA is used to select the best subset of PCs to distinguish obesity from healthy cases.
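The GA step in Fig. 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the function `run_ga`, its parameters, and the toy fitness are assumptions made for illustration. In the study's setting the fitness of a PC subset would be the cross-validated AUC of a classifier built on the selected PC scores; here a toy fitness that rewards matching a hypothetical set of informative PCs stands in for it.

```python
import random


def run_ga(n_components, fitness, pop_size=40, generations=100,
           mutation_rate=0.1, seed=0):
    """Search for the 0/1 mask over components with the highest fitness."""
    rng = random.Random(seed)
    # Each chromosome is a tuple of 0/1 flags: 1 means "keep this PC".
    population = [tuple(rng.randint(0, 1) for _ in range(n_components))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Return the fitter of two randomly drawn individuals.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best]  # elitism: the best mask so far always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, n_components)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each flag independently with a small probability.
            child = tuple(bit ^ 1 if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


# Hypothetical target: pretend PC1, PC2 and PC7 are the informative components.
target = (1, 1, 0, 0, 0, 0, 1, 0)


def toy_fitness(mask):
    # Stand-in for a cross-validated AUC: counts positions that match the target.
    return sum(m == t for m, t in zip(mask, target))


best_mask = run_ga(len(target), toy_fitness)
selected = [f"PC{i + 1}" for i, keep in enumerate(best_mask) if keep]
print(selected)
```

Replacing `toy_fitness` with a function that fits a classifier on the flagged PC scores and returns its cross-validated AUC would turn the sketch into the kind of search the diagram describes.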
GA selected PCs and the classification model performance (ROC)
| Data for creating PCA | Result | Model (6 PCs) | Model (5 PCs) | Model (4 PCs) | Model (3 PCs) | Model (2 PCs) | Model (1 PC) |
|---|---|---|---|---|---|---|---|
| All | PCs selected | PC1+, PC2–, PC7+, PC11+, PC15–, PC27– | PC1+, PC2–, PC7+, PC11+, PC27– | PC1+, PC2–, PC7+, PC27– | PC1+, PC2–, PC7+ | PC1+, PC7+ (or PC2–) | PC1+ |
| | AUC (CV) | 0.87 | 0.85 | 0.84 | 0.81 | 0.77 | 0.69 |
| Obesity | PCs selected | PC2–, PC4–, PC14+, PC16–, PC18+, PC19– | PC2–, PC4–, PC14+, PC18+, PC19– | PC2–, PC4–, PC14+, PC18+ | PC2–, PC14+, PC18+ | PC14+, PC18+ | PC14+ |
| | AUC (CV) | 0.92 | 0.92 | 0.90 | 0.87 | 0.84 | 0.80 |
| Healthy | PCs selected | PC1+, PC3+, PC5–, PC23+, PC28–, PC34+ | PC1+, PC3+, PC23+, PC28–, PC34+ | PC1+, PC23+, PC28–, PC34+ | PC1+, PC23+, PC34+ | PC1+, PC34+ | PC1+ |
| | AUC (CV) | 0.92 | 0.90 | 0.88 | 0.87 | 0.83 | 0.72 |
+ Positive correlation coefficient in the model
– Negative correlation coefficient in the model
Top species included in the GA selected 1, 2, 3, 4, 5 or 6 PCs produced with different data sets
High-contribution variables (high coefficients in the corresponding PC) included in the most selected components:
| Dataset for creating PCA | Comp1 | Comp2 | Comp3 | Comp4 | Comp5 | Comp6 |
|---|---|---|---|---|---|---|
| All | Prausnitzii–a | Gnavus+ | Eutactus+a | Moorei– | Eggerthii–a | Zeae+a |
| | Eutactus–a | Faecis–a | Prausnitzii+a | Obeum– | Dispar–a | Gnavus– |
| | Formicigenerans–a | Copri+ | Aerofaciens– | Lenta+a | Adolescentis+ | Stutzeri+a |
| | Catus–a | Muciniphila–a | Catus– | Animalis– | Mucilaginosa–a | Bromii+a |
| | Faecis–a | Adolescentis–a | Adolescentis– | Torques– | Aerofaciens+ | Fragilis+a |
| Obesity | Eutactus–a | Uniformis+ | Dolichum– | Producta– | Caccae+a | Formicigenerans+a |
| | Bromii+ | Catus–a | Lenta– | Prausnitzii+a | Parainfluenzae+a | Bromii– |
| | Adolescentis–a | Dispar+ | Aerofaciens+a | Aerofaciens– | Formicigenerans+a | Distasonis– |
| | Formicigenerans+ | Faecis+ | Producta– | Fragilis– | Adolescentis– | Eutactus+a |
| | Producta–a | Distasonis–a | Gnavus– | Faecis+a | Dispar– | Perfringens+a |
| Healthy | Prausnitzii–a | Stutzeri–a | Callidus–a | Ovatus– | Copri+ | Copri+a |
| | Eutactus–a | Zeae+ | Moorei+ | Longum+a | Muciniphila–a | Muciniphila+a |
| | Catus–a | Gnavus+ | Formigenes+ | Distasonis+a | Formigenes–a | Prausnitzii– |
| | Formicigenerans–a | Dispar+ | Prausnitzii+ | Fragilis– | Catus+ | Formigenes+a |
| | Faecis–a | Lenta–a | Catus–a | Aerofaciens– | Biforme+ | Eutactus+a |
Comp1, Comp2, Comp3, Comp4, Comp5 and Comp6 represent the 6 PCs selected by GA. For the experiment with the whole dataset, they are PC1, PC7, PC2, PC27, PC11 and PC15, respectively; for the experiment with the obesity sample, they are PC14, PC18, PC2, PC4, PC19 and PC16; for the experiment with the healthy sample, they are PC1, PC34, PC23, PC28, PC3 and PC5.
a Species has a positive correlation with the probability of having healthy body mass
+ Positive correlation with the corresponding PC
– Negative correlation with the corresponding PC
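The "a" marker in this table appears to follow from two signs: the sign of the PC's coefficient in the model (first table) and the sign of the species' correlation with that PC. The sketch below encodes that reading. It is an inference from the tabulated values, not a rule stated by the authors, and it assumes that the model's outcome is obesity and that the first row of species belongs to the whole-dataset experiment. The function name `marks_healthy` is hypothetical.

```python
def marks_healthy(pc_coefficient_sign, loading_sign):
    """Return True when a species would carry the "a" marker.

    Assumption: the model predicts obesity, so a species is associated with
    healthy body mass when the two signs disagree (their product is negative).
    """
    return pc_coefficient_sign * loading_sign < 0


# (species, sign of its correlation with the PC, sign of the PC's coefficient)
# taken from the first species row, read as the whole-dataset experiment.
examples = [
    ("Prausnitzii", -1, +1),  # Comp1 = PC1+,  negative correlation: marked "a"
    ("Gnavus", +1, +1),       # Comp2 = PC7+,  positive correlation: not marked
    ("Eutactus", +1, -1),     # Comp3 = PC2-,  positive correlation: marked "a"
    ("Moorei", -1, -1),       # Comp4 = PC27-, negative correlation: not marked
]

flags = {name: marks_healthy(pc_sign, loading)
         for name, loading, pc_sign in examples}
print(flags)
```

Under these assumptions, the four flags agree with the markers printed in the table for the same four entries.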
Fig. 2 ROC produced from the top PCs of PCA and from the PCs selected by GA. a: 1, PC1; 2, PC1 + PC2; 3, PC1 + PC2 + PC3; 4, PC1 + PC2 + PC3 + PC4; 5, PC1 + PC2 + PC3 + PC4 + PC5; 6, PC1 + PC2 + PC3 + PC4 + PC5 + PC6. b: 1, PC1; 2, PC1 + PC34; 3, PC1 + PC34 + PC23; 4, PC1 + PC23 + PC28 + PC34; 5, PC1 + PC3 + PC23 + PC28 + PC34; 6, PC1 + PC3 + PC5 + PC23 + PC28 + PC34
Sets of species selected by GA using the species abundance as the input variables of logistic regression models
| GA Selected Species | | | | | | AUC |
|---|---|---|---|---|---|---|
| Adolescentis | Catus | Eutactus | Gnavus | Muciniphila | Prausnitzii | 0.87 |
| Adolescentis | Distasonis | Eutactus | Gnavus | Muciniphila | Prausnitzii | 0.87 |
| Aerofaciens | Distasonis | Eutactus | Gnavus | Longum | Muciniphila | 0.88 |
| Anginosus | Distasonis | Eutactus | Gnavus | Muciniphila | Prausnitzii | 0.87 |
| Catus | Distasonis | Eutactus | Gnavus | Muciniphila | Prausnitzii | 0.88 |
| Catus | Eutactus | Gnavus | Longum | Muciniphila | Prausnitzii | 0.86 |
| Distasonis | Eutactus | Gnavus | Longum | Muciniphila | Prausnitzii | 0.88 |
GA Selected Species lists the sets of species selected by GA, one set per row. AUC is the area under the ROC curve produced by the corresponding logistic regression model with the selected set of species. The result was cross-validated with the same cross-validation setup as the earlier experiments
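As a reminder of what the AUC column measures, the following sketch computes the area under the ROC curve from predicted scores using the pairwise (Mann-Whitney) formulation: the fraction of obese/healthy pairs in which the obese individual receives the higher score. The labels and scores are made-up illustrative values, not data from the study, and the function name `auc` is an assumption.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (e.g. obese), 0 for the negative class.
    scores: predicted probabilities or any monotone score.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A correctly ordered pair counts 1, a tie counts 0.5, otherwise 0.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only: two healthy (0) and two obese (1) individuals.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of the 4 obese/healthy pairs are ordered correctly
```

In a cross-validated setting, the same function would be applied to scores predicted for held-out individuals only.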
Fig. 3 Frequencies of the species selected from the multiple runs of GA. The GA was run with logistic regression as the classification model and species abundance as the input. GA was used to select the combination of species for classifying obese individuals from the healthy group. The number on top of each bar is how many times out of 100 GA runs the corresponding species was selected
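The tally behind such a frequency plot can be sketched as follows. The input here is the seven species sets listed in the preceding table, used only as an illustration; these are not the 100 GA runs summarised in Fig. 3, so the counts will not match the figure.

```python
from collections import Counter

# The seven GA-selected species sets from the table above (illustrative input).
selected_sets = [
    ["Adolescentis", "Catus", "Eutactus", "Gnavus", "Muciniphila", "Prausnitzii"],
    ["Adolescentis", "Distasonis", "Eutactus", "Gnavus", "Muciniphila", "Prausnitzii"],
    ["Aerofaciens", "Distasonis", "Eutactus", "Gnavus", "Longum", "Muciniphila"],
    ["Anginosus", "Distasonis", "Eutactus", "Gnavus", "Muciniphila", "Prausnitzii"],
    ["Catus", "Distasonis", "Eutactus", "Gnavus", "Muciniphila", "Prausnitzii"],
    ["Catus", "Eutactus", "Gnavus", "Longum", "Muciniphila", "Prausnitzii"],
    ["Distasonis", "Eutactus", "Gnavus", "Longum", "Muciniphila", "Prausnitzii"],
]

# Count how many sets each species appears in.
counts = Counter(species for subset in selected_sets for species in subset)

for species, n in counts.most_common():
    print(f"{species}: selected in {n} of {len(selected_sets)} sets")
```

Within these seven sets, Eutactus, Gnavus and Muciniphila appear in every set.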