Tahir Mehmood, Jonas Warringer, Lars Snipen, Solve Sæbø.
Abstract
BACKGROUND: Multivariate approaches have been successfully applied to genome-wide association studies. Recently, a Partial Least Squares (PLS) based approach was introduced for mapping yeast genotype-phenotype relations, where background information, such as gene function classification, gene dispensability, recent or ancient gene copy number variations, and the presence of premature stop codons or frameshift mutations in reading frames, was used post hoc to explain the selected genes. One of the latest advancements in PLS, L-Partial Least Squares (L-PLS), where 'L' represents the shape of the data structure used, enables the use of background information at the modeling level. Here, a modification of L-PLS with variable importance on projection (VIP) was implemented using a stepwise regularized procedure for gene and background information selection. Results were compared to PLS-based procedures in which no background information was used.
Year: 2012 PMID: 23216988 PMCID: PMC3598729 DOI: 10.1186/1471-2105-13-327
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
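The abstract relies on variable importance on projection (VIP) as the selection criterion. As a rough illustration, the sketch below computes VIP for an ordinary single-response PLS model fitted with NIPALS in NumPy. It is not the L-PLS variant described in the paper; the function name `pls1_vip` and the simulated data are assumptions made for the example.

```python
import numpy as np

def pls1_vip(X, y, n_components):
    """VIP scores from a single-response PLS (NIPALS) fit.

    Illustrative sketch for ordinary PLS only, not the paper's L-PLS.
    """
    X = X - X.mean(axis=0)          # column-centre the predictors
    y = y - y.mean()                # centre the response
    n, p = X.shape
    W = np.zeros((p, n_components)) # unit-norm loading weights
    ss = np.zeros(n_components)     # response variance explained per component
    for a in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w                   # scores
        tt = t @ t
        p_load = X.T @ t / tt       # X-loadings
        q = (y @ t) / tt            # y-loading
        X = X - np.outer(t, p_load) # deflate X
        y = y - q * t               # deflate y
        W[:, a] = w
        ss[a] = q ** 2 * tt
    # VIP_j = sqrt(p * sum_a ss_a * w_ja^2 / sum_a ss_a)
    return np.sqrt(p * (W ** 2 @ ss) / ss.sum())

# Simulated example: only the first two predictors drive the response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng.normal(size=200)
vip_demo = pls1_vip(X_demo, y_demo, n_components=2)
```

Because each weight vector has unit norm, the squared VIP scores average to 1, which is why a cutoff near 1 is a common rule of thumb for separating important from unimportant variables.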
Figure 1. L-PLS algorithm. Steps for extracting the first set of v-vectors for the L-PLS algorithm are visualized by arrows. Deflated data matrices replace the initially centered matrices at subsequent steps of v-vector extraction. The mean-vectors used for data centering are also displayed.
Figure 2. An overview of the variable elimination in two stages in L-PLS. Stage 1 eliminates variables in X using the VIP criterion. After stage 1, the number of columns (genes) has been reduced in both X and Z. In stage 2, rows of Z are eliminated using the VIP criterion in a similar fashion.
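The two-stage elimination in the caption above can be sketched as follows. This is a simplified illustration under several assumptions: X is samples by genes, Z is background variables by genes, a toy covariance-based score stands in for the L-PLS VIP, f is read as the fraction of below-cutoff variables dropped per iteration, and a hypothetical `min_keep` floor replaces the cross-validated stopping rule governed by d.

```python
import math
import numpy as np

def toy_importance(M, target):
    """Stand-in for VIP (assumption): |covariance| of each column of M with
    target, rescaled so that the mean squared score is 1, as for VIP."""
    c = np.abs((M - M.mean(axis=0)).T @ (target - target.mean()))
    return c * np.sqrt(c.size / np.sum(c ** 2))

def stepwise_eliminate(score_fn, n_vars, f=0.5, cutoff=1.0, min_keep=1):
    """Repeatedly drop a fraction f of the variables scoring below cutoff."""
    keep = np.arange(n_vars)
    while keep.size > min_keep:
        scores = score_fn(keep)
        below = np.flatnonzero(scores < cutoff)
        if below.size == 0:
            break                                  # nothing left to eliminate
        n_drop = min(math.ceil(f * below.size), keep.size - min_keep)
        worst = below[np.argsort(scores[below])[:n_drop]]
        keep = np.delete(keep, worst)              # remove the lowest scorers
    return keep

def two_stage_elimination(X, y, Z, f=0.5, min_genes=6, min_background=2):
    # Stage 1: eliminate genes, i.e. columns of X (and the same columns of Z).
    genes = stepwise_eliminate(
        lambda keep: toy_importance(X[:, keep], y),
        X.shape[1], f, min_keep=min_genes)
    # Stage 2: eliminate background variables, i.e. rows of Z, scored
    # against a gene-level signal computed from X and y (toy choice).
    b = np.abs((X - X.mean(axis=0)).T @ (y - y.mean()))
    rows = stepwise_eliminate(
        lambda keep: toy_importance(Z[np.ix_(keep, genes)].T, b[genes]),
        Z.shape[0], f, min_keep=min_background)
    return genes, rows

# Simulated example: genes 0-2 are informative, and background row 0 marks them.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 12))
y_demo = 2 * X_demo[:, :3].sum(axis=1) + 0.5 * rng.normal(size=200)
Z_demo = rng.integers(0, 2, size=(6, 12)).astype(float)
Z_demo[0] = 0.0
Z_demo[0, :3] = 1.0
genes_demo, rows_demo = two_stage_elimination(X_demo, y_demo, Z_demo)
```

In a fuller implementation, each scoring call would refit the L-PLS model on the remaining variables, and the final subset would be chosen by cross-validated performance rather than by a fixed floor.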
Figure 3. An overview of the testing/training procedure used in this study. The rectangles illustrate the predictor matrix. At level 1, the data were split into a test set and a training set (25/75); this was repeated 10 times. Inside the suggested method, the stepwise elimination, there are two levels of cross-validation. First, a 10-fold cross-validation was used to optimize the selection parameters f and d. Second, a leave-one-out cross-validation was used to optimize the L-PLS parameters, such as α and the number of components.
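The nesting of the three resampling levels described above may be easier to follow as index generators. The sketch below only produces the index sets; model fitting is omitted, and the function names and the sample size of 40 are assumptions made for the example.

```python
import numpy as np

def repeated_holdout(n, n_repeats=10, test_frac=0.25, seed=0):
    """Level 1: repeated random test/training splits (25/75 by default)."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * n))
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        yield perm[n_test:], perm[:n_test]         # (train, test)

def k_fold(indices, k=10, seed=0):
    """Level 2: k-fold cross-validation, used here for tuning f and d."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(indices), k)
    for i in range(k):
        fit = np.concatenate([folds[j] for j in range(k) if j != i])
        yield fit, folds[i]                        # (fit, validation)

def leave_one_out(indices):
    """Level 3: leave-one-out, used here for alpha and the component count."""
    for i in range(len(indices)):
        yield np.delete(indices, i), indices[i:i + 1]

# Count how many innermost fits the nesting implies for 40 hypothetical samples.
n_fits = sum(
    1
    for train, _test in repeated_holdout(40)
    for fit, _val in k_fold(train)
    for _ in leave_one_out(fit)
)
```

With 40 samples, each training set holds 30 observations, each 10-fold fit set holds 27, and the leave-one-out loop therefore runs 27 times per fold, before multiplying by any grid over f, d, α and the number of components.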
Figure 4. Power analysis based on simulated data. The power of selecting the correct variables is presented as a function of α, the degree of information content of Z in L-PLS. α = 0 indicates that no background information is used in modeling, while higher values of α indicate a stronger influence of the background variables on the model.
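An overview of model parameters and complexity
| Phenotype | Step length f (mode) | Rejection level d (mode) | α (mode) | No. of components (mode) | RMSE on test data (mean) | No. of selected genes | No. of selected background variables |
|---|---|---|---|---|---|---|---|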
| Melibiose 2% Rate | 0.1 | 0.99 | 0.73 | 5 | 0.61 | 30 | 15 |
| Melibiose 2% Efficiency | 0.1 | 0.99 | 0.65 | 4 | 0.60 | 30 | 14 |
| Copper chloride 0.375mM Rate | 0.5 | 0.99 | 0.78 | 6 | 0.57 | 31 | 15 |
| Copper chloride 0.375mM Efficiency | 0.1 | 0.90 | 0.51 | 9 | 0.52 | 30 | 14 |
| NaCl 0.85M Rate | 0.1 | 0.95 | 0.62 | 1 | 0.61 | 31 | 14 |
| NaCl 1.25M Rate | 0.5 | 0.95 | 0.85 | 2 | 0.80 | 30 | 15 |
| NaCl 0.85M Efficiency | 0.1 | 0.95 | 0.77 | 4 | 0.71 | 30 | 15 |
| NaCl 1.25M Efficiency | 0.1 | 0.99 | 0.67 | 5 | 0.60 | 30 | 15 |
| Maltose 2% Rate | 0.1 | 0.99 | 0.63 | 5 | 0.60 | 31 | 15 |
| Maltose 2% Efficiency | 0.1 | 0.90 | 0.54 | 7 | 0.50 | 30 | 15 |
| Galactose 2% Rate | 0.1 | 0.95 | 0.75 | 7 | 0.50 | 30 | 16 |
| Galactose 2% Efficiency | 0.1 | 0.95 | 0.56 | 7 | 0.61 | 30 | 15 |
| Heat 37° Rate | 0.1 | 0.99 | 0.65 | 6 | 0.58 | 30 | 15 |
| Heat 40° Rate | 0.1 | 0.99 | 0.78 | 6 | 0.82 | 30 | 15 |
| Heat 37° Efficiency | 0.1 | 0.99 | 0.51 | 1 | 0.59 | 31 | 14 |
| Heat 40° Efficiency | 0.5 | 0.90 | 0.62 | 8 | 0.67 | 30 | 15 |
| Sodium arsenite oxide 3.5mM Rate | 0.1 | 0.90 | 0.59 | 8 | 0.54 | 30 | 15 |
| Sodium arsenite oxide 5mM Rate | 0.1 | 0.99 | 0.66 | 5 | 0.62 | 30 | 15 |
| Sodium arsenite oxide 3.5mM Efficiency | 0.1 | 0.99 | 0.73 | 3 | 0.55 | 31 | 14 |
| Sodium arsenite oxide 5mM Efficiency | 0.1 | 0.90 | 0.51 | 2 | 0.63 | 31 | 14 |
The suggested approach selects the model parameters at each level of the 10-fold cross-validation; hence, measures of central tendency are used to summarise the results. For each fitted model, the table lists the mode of the selected step length (f), the mode of the rejection level (d), the mode of the model complexity (α and the number of components), the mean RMSE on test data, the number of selected genes, and the number of selected background variables.
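The summary described above, with the mode for tuning parameters and the mean for the test error, can be sketched in a few lines. The per-fold records and their field names below are hypothetical values invented for the example, not results from the paper.

```python
from collections import Counter
from statistics import mean

def summarise_folds(fold_results):
    """Summarise per-fold selections: mode for parameters, mean for RMSE."""
    def mode(values):
        # most_common breaks ties in favour of the value seen first
        return Counter(values).most_common(1)[0][0]
    return {
        "f": mode(r["f"] for r in fold_results),
        "d": mode(r["d"] for r in fold_results),
        "alpha": mode(r["alpha"] for r in fold_results),
        "n_components": mode(r["n_components"] for r in fold_results),
        "rmse_test": mean(r["rmse_test"] for r in fold_results),
    }

# Hypothetical per-fold outcomes for one phenotype.
folds_demo = [
    {"f": 0.1, "d": 0.99, "alpha": 0.7, "n_components": 5, "rmse_test": 0.60},
    {"f": 0.1, "d": 0.95, "alpha": 0.7, "n_components": 5, "rmse_test": 0.62},
    {"f": 0.5, "d": 0.99, "alpha": 0.8, "n_components": 4, "rmse_test": 0.64},
]
summary_demo = summarise_folds(folds_demo)
```

The mode is a natural choice for f, d and the number of components because they are picked from a small discrete grid, whereas the test RMSE is continuous and is averaged.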
Figure 5. An example of the selection of variables in both stages. Number of X- and Z-variables remaining in the model after each iteration for the response ‘Melibiose 2% Rate’. X-variables are shown by the red curve and scaled on the left vertical axis, while Z-variables are shown by the blue curve and scaled on the right vertical axis. The two stages are shown separately on the x-axis.
Figure 6. The distribution of the degree of informativeness of Z in L-PLS and a comparison of model complexity, number of variables and RMSE. Results for the phenotype ‘Melibiose 2% Rate’ are presented for three models: M1 (two-stage stepwise variable elimination in L-PLS), M2 (stepwise variable elimination in PLS) and M3 (St-PLS). The upper left panel (a) shows the information content of the background information (α) in M1. For each model, the remaining panels compare the number of PLS components used (b, upper right), the number of selected variables (c, lower left), the RMSE on training data (d, lower center) and the RMSE on test data (e, lower right).
Figure 7. Selectivity score. The selectivity score, sorted in descending order, is presented for the X-variables (genes) for M1, M2 and M3. Only the first 50 values are shown from while all 51 values are shown from .
Selectivity score based selection of GO terms and gene variations
| Phenotype | Selected GO terms and gene variations |
|---|---|
| Melibiose 2% Rate | Paralog, frame shift variations, transport, stop codon variations, cellular protein catabolic process, transposition, copy number variations, response to stress, DNA metabolic process, mitochondrion organization, Essential.gene, RNA metabolic process, cellular amino acid and derivative metabolic process, response to chemical stimulus, response to chemical stimulus and metabolism |
| Melibiose 2% Efficiency | Copy number variations, transposition, Paralog, frame shift variations, stop codon variations, transport, cellular amino acid and derivative metabolic process, response to chemical stimulus, cell cycle, signal transduction, conjugation, RNA metabolic process, translation, mitochondrion organization, cellular carbohydrate metabolic process |
| Copper chloride 0.375 mM Rate | Stop codon variations, Paralog, transposition, frame shift variations, copy number variations, RNA metabolic process, transport, protein modification process, response to stress, generation of precursor metabolites and energy, cellular respiration, DNA metabolic process, transcription, response to chemical stimulus, chromosome organization |
| Copper chloride 0.375 mM Efficiency | Paralog, frame shift variations, transport, transposition, stop codon variations, Essential.gene, copy number variations, cellular amino acid and derivative metabolic process, RNA metabolic process, response to stress, protein modification process, chromosome organization, ribosome biogenesis, cell cycle, response to chemical stimulus |
| NaCl 0.85 M Rate | Generation of precursor metabolites and energy, cellular respiration, frame shift variations, stop codon variations, Paralog, copy number variations, transport, heterocycle metabolic process, sporulation resulting in formation of a cellular spore, transposition, transcription, cellular carbohydrate metabolic process, Essential.gene, RNA metabolic process, protein modification process |
| NaCl 1.25 M Rate | Cellular respiration, stop codon variations, frame shift variations, Paralog, generation of precursor metabolites and energy, Essential.gene, cellular lipid metabolic process, RNA metabolic process, transport, mitochondrion organization, cofactor metabolic process, transposition, response to chemical stimulus, transcription, DNA metabolic process |
| NaCl 0.85 M Efficiency | Paralog, transposition, transport, conjugation, frame shift variations, stop codon variations, signal transduction, RNA metabolic process, response to stress, chromosome organization, response to chemical stimulus, translation, ribosome biogenesis, mitochondrion organization, cellular amino acid and derivative metabolic process |
| NaCl 1.25 M Efficiency | Transposition, Paralog, copy number variations, frame shift variations, stop codon variations, response to stress, protein modification process, chromosome organization, transport, cellular amino acid and derivative metabolic process, translation, conjugation, RNA metabolic process, Essential.gene, mitochondrion organization |
| Maltose 2% Rate | Paralog, transposition, frame shift variations, stop codon variations, RNA metabolic process, response to chemical stimulus, transport, transcription, DNA metabolic process, copy number variations, response to stress, protein modification process, cellular amino acid and derivative metabolic process, heterocycle metabolic process, cellular aromatic compound metabolic process |
| Maltose 2% Efficiency | Stop codon variations, generation of precursor metabolites and energy, Paralog, cellular amino acid and derivative metabolic process, transposition, cellular respiration, Essential.gene, protein modification process, heterocycle metabolic process, cellular aromatic compound metabolic process, transport, frame shift variations, RNA metabolic process, ribosome biogenesis, response to chemical stimulus |
| Galactose 2% Rate | DNA metabolic process, stop codon variations, translation, generation of precursor metabolites and energy, cellular respiration, Paralog, mitochondrion organization, copy number variations, cellular amino acid and derivative metabolic process, frame shift variations, transport, Essential.gene, response to stress, chromosome organization, meiosis |
| Galactose 2% Efficiency | Paralog, DNA metabolic process, frame shift variations, RNA metabolic process, stop codon variations, transport, generation of precursor metabolites and energy, cellular respiration, copy number variations, mitochondrion organization, cellular amino acid and derivative metabolic process, heterocycle metabolic process, transposition, protein folding, chromosome organization |
| Heat 37° Rate | Frame shift variations, transport, Paralog, generation of precursor metabolites and energy, heterocycle metabolic process, Essential.gene, cellular protein catabolic process, DNA metabolic process, mitochondrion organization, cellular respiration, transposition, stop codon variations, copy number variations, RNA metabolic process, transcription |
| Heat 40° Rate | Paralog, transport, frame shift variations, generation of precursor metabolites and energy, cellular protein catabolic process, heterocycle metabolic process, copy number variations, transposition, stop codon variations, DNA metabolic process, translation, response to stress, protein modification process, conjugation, Essential.gene |
| Heat 37° Efficiency | Generation of precursor metabolites and energy, DNA metabolic process, cellular amino acid and derivative metabolic process, cellular respiration, heterocycle metabolic process, stop codon variations, Essential.gene, RNA metabolic process, cofactor metabolic process, vitamin metabolic process, transposition, translation, Paralog, frame shift variations, transport |
| Heat 40° Efficiency | Paralog, frame shift variations, copy number variations, transport, transposition, protein modification process, cellular carbohydrate metabolic process, cellular amino acid and derivative metabolic process, heterocycle metabolic process, RNA metabolic process, response to stress, generation of precursor metabolites and energy, cell cycle, signal transduction, conjugation |
| Sodium arsenite oxide 3.5 mM Rate | Stop codon variations, Paralog, frame shift variations, copy number variations, cellular amino acid and derivative metabolic process, transposition, transport, response to stress, RNA metabolic process, conjugation, translation, transcription, protein modification process, Essential.gene, chromosome organization |
| Sodium arsenite oxide 5 mM Rate | Stop codon variations, copy number variations, transposition, frame shift variations, transport, generation of precursor metabolites and energy, cellular respiration, Paralog, protein modification process, Essential.gene, transcription, cell cycle, DNA metabolic process, response to chemical stimulus, ribosome biogenesis |
| Sodium arsenite oxide 3.5 mM Efficiency | Paralog, stop codon variations, frame shift variations, transposition, copy number variations, RNA metabolic process, cellular amino acid and derivative metabolic process, transport, DNA metabolic process, translation, cellular carbohydrate metabolic process, peroxisome organization, response to stress, protein modification process, transcription |
| Sodium arsenite oxide 5 mM Efficiency | Paralog, frame shift variations, stop codon variations, transposition, copy number variations, transport, protein modification process, cellular carbohydrate metabolic process, generation of precursor metabolites and energy, cellular respiration, Essential.gene, RNA metabolic process, response to stress, response to chemical stimulus, translation |
Selected variables from the background information matrix that have a selectivity score above 0.2 for each phenotype obtained through the proposed model. Variables correspond to the presence or absence of specific gene amplifications, and the presence or absence of premature stop codons and frameshifts.