Gang Li, Jan Zrimec, Boyang Ji, Jun Geng, Johan Larsbrink, Aleksej Zelezniak, Jens Nielsen, Martin Km Engqvist.
Abstract
BACKGROUND: A challenge in developing machine learning regression models is that it is difficult to know whether maximal performance has been reached on the test dataset, or whether further model improvement is possible. In biology, this problem is particularly pronounced because sample labels (response variables) are typically obtained through experiments and therefore carry experimental noise. Such label noise places a fundamental limit on the performance metrics attainable by regression models on the test dataset.
Keywords: experiment noise; label noise; machine learning; regression models; upper bound
Year: 2021 PMID: 34262264 PMCID: PMC8243133 DOI: 10.1177/11779322211020315
Source DB: PubMed Journal: Bioinform Biol Insights ISSN: 1177-9322
Figure 1. Schematic diagram depicting the estimation of the upper bound of model performance based on experimental label noise. The label-noise variance can be approximated from the standard errors (se) of the samples in the dataset, and the total observed variance can be approximated as the variance of the target values. Data shown were randomly generated; se_i denotes the standard error of sample i. The upper bound shown is the expected upper bound for the coefficient of determination (R²) derived in this study (see section “Estimating the theoretical upper bound of regression model performance”).
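The estimation described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the upper bound takes the form 1 - (mean squared standard error) / (variance of the observed target values); this excerpt does not state the equation, so the formula, the function name `r2_upper_bound`, and the example data are all assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def r2_upper_bound(y_obs, se):
    """Estimate an upper bound on R^2 from label noise.

    Assumed form (not given in this excerpt):
        1 - mean(se_i^2) / var(y_obs)
    """
    y_obs = np.asarray(y_obs, dtype=float)
    se = np.asarray(se, dtype=float)
    noise_var = np.mean(se ** 2)       # average label-noise variance from standard errors
    total_var = np.var(y_obs, ddof=1)  # sample variance of the observed target values
    return 1.0 - noise_var / total_var

# Hypothetical data: five measured labels, each with a standard error of 0.5.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
se = [0.5, 0.5, 0.5, 0.5, 0.5]
print(r2_upper_bound(y, se))
```

Under this assumed form, noiseless labels (all standard errors zero) give a bound of 1, and the bound falls as the noise variance approaches the total variance of the labels.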
Figure 2. Monte Carlo simulations of the upper bound of R² assuming different levels of feature noise. Two analytical estimates of the expected upper bound for the coefficient of determination are shown, computed with the equations derived by Benevenuta and Fariselli and in this study, respectively. These are compared with the expected upper bound for R² obtained by Monte Carlo simulation, as described in the “Results” section, and with the R² obtained via 2-fold cross-validation with a support vector machine. Two real functions were tested: (A) linear and (B) nonlinear. Feature noise is reported as the variance of the noise added to the feature vector, with examples of the observed data distributions depicted in Figure S1. (C) Monte Carlo simulation of data cleaning by gradually removing the samples with the largest label noise. n/10 indicates that n features out of a complete set of 10 features were used to train and validate the model. Noise values are given as the average variance over all samples.
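A simplified sketch of a Monte Carlo check of this kind is given below. It is not the simulation used in the study: it covers label noise only (no feature noise and no support vector machine), uses a hypothetical linear "real" function, and assumes the analytical bound has the form 1 - (noise variance) / (variance of observed labels). It scores a perfect predictor, the noise-free function itself, against noisy labels and compares the average R² with the assumed analytical value.

```python
import numpy as np

def simulate_r2_bound(noise_sd=0.5, n_samples=2000, n_trials=200, seed=0):
    """Return (Monte Carlo mean R^2 of a perfect predictor, mean analytical bound)."""
    rng = np.random.default_rng(seed)
    mc_scores, analytic = [], []
    for _ in range(n_trials):
        x = rng.uniform(-1.0, 1.0, size=n_samples)
        y_true = 2.0 * x + 1.0  # hypothetical noise-free linear function
        y_obs = y_true + rng.normal(0.0, noise_sd, size=n_samples)  # noisy labels
        # R^2 of the perfect predictor (y_true) evaluated on the noisy labels
        ss_res = np.sum((y_obs - y_true) ** 2)
        ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
        mc_scores.append(1.0 - ss_res / ss_tot)
        # Assumed analytical bound: 1 - noise variance / observed label variance
        analytic.append(1.0 - noise_sd ** 2 / np.var(y_obs, ddof=1))
    return float(np.mean(mc_scores)), float(np.mean(analytic))

mc, an = simulate_r2_bound()
print(f"Monte Carlo: {mc:.3f}, analytical: {an:.3f}")
```

In this setting even a model that recovers the true function exactly cannot score above the simulated value, since the residual error is the label noise itself.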
Figure 3. Development of machine learning models for the prediction of enzyme optimal temperature (Topt). (A) Performance of classical models using iFeature and UniRep encoding feature sets, as well as a deep neural network with one-hot encoded protein sequences as input. (B, C) Comparison of model performance on the raw and clean datasets (B) with and (C) without optimal growth temperature (OGT) as one of the features. The features calculated by iFeature were grouped into 20 sub-feature sets as described in the “Methods” section. Error bars show the standard deviation of scores obtained in 5-fold cross-validation. The upper bound shown is the expected upper bound for the coefficient of determination (R²) derived in this study (see section “Estimating the theoretical upper bound of regression model performance”).
Figure 4. The amount of experimental noise affects estimates of the R² upper bound and model performance. Analysis of the effect of experimental noise in the response variables on the upper bound estimates (black) and the predictive performance of ML models (red), using a large yeast multi-experiment transcriptomics dataset. The noise level was varied by adjusting the number of data replicates through random sampling (inset figure). Lines and shaded areas depict the means and standard deviations of the 30 measurements, shown as points, for each number of replicates n. The upper bound shown is the expected upper bound for the coefficient of determination (R²) derived in this study (see section “Estimating the theoretical upper bound of regression model performance”); CV denotes cross-validation.
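The replicate-subsampling idea in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the study's pipeline: the data are synthetic, the function name `bound_vs_replicates` is hypothetical, the same randomly chosen replicate columns are used for every sample, and the bound is assumed to be 1 - mean(se²) / var(mean labels). Fewer replicates give larger standard errors of the mean, so the estimated bound drops.

```python
import numpy as np

def bound_vs_replicates(replicates, n, rng):
    """Estimate the assumed R^2 upper bound using n randomly sampled replicates.

    replicates: array of shape (n_samples, n_total_replicates).
    """
    idx = rng.choice(replicates.shape[1], size=n, replace=False)
    sub = replicates[:, idx]
    y_mean = sub.mean(axis=1)                  # label = mean over sampled replicates
    se = sub.std(axis=1, ddof=1) / np.sqrt(n)  # standard error of that mean
    return 1.0 - np.mean(se ** 2) / np.var(y_mean, ddof=1)

# Synthetic data: 500 samples, 10 replicates each, unit-variance measurement noise.
rng = np.random.default_rng(0)
true_values = rng.normal(0.0, 1.0, size=500)
data = true_values[:, None] + rng.normal(0.0, 1.0, size=(500, 10))

for n in (2, 5, 10):
    print(n, round(bound_vs_replicates(data, n, rng), 3))
```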
Figure 5. The R² upper bound is applicable even when the experimental noise is unknown. Analysis of 34 quantitative traits of S. cerevisiae from its pan-genome composition. (A) A random forest model applied to predict yeast phenotypes from genomic features. Genomes are represented as (B) a gene presence/absence table and (C) a copy number variance table in the pangenome. (D) R² scores obtained for 35 different phenotypes. Experimental trait values were taken from Peter et al. A detailed description of the labels can be found in Table S2 of Peter et al. Error bars show the standard deviation of scores obtained in 5-fold cross-validation. The upper bound shown is the expected upper bound for the coefficient of determination (R²) derived in this study (see section “Estimating the theoretical upper bound of regression model performance”).