Literature DB >> 35061733

Machine learning accurately predicts the multivariate performance phenotype from morphology in lizards.

Simon P Lailvaux1, Avdesh Mishra2, Pooja Pun3, Md Wasi Ul Kabir3, Robbie S Wilson4, Anthony Herrel5, Md Tamjidul Hoque3.   

Abstract

Completing the genotype-to-phenotype map requires rigorous measurement of the entire multivariate organismal phenotype. However, phenotyping on a large scale is not feasible for many kinds of traits, resulting in missing data that can also cause problems for comparative analyses and the assessment of evolutionary trends across species. Measuring the multivariate performance phenotype is especially logistically challenging, and our ability to predict several performance traits from a given morphology is consequently poor. We developed a machine learning model to accurately estimate multivariate performance data from morphology alone by training it on a dataset containing performance and morphology data from 68 lizard species. Our final, stacked model predicts missing performance data accurately at the level of the individual from simple morphological measures. This model performed exceptionally well, even for performance traits that were missing values for >90% of the sampled individuals. Furthermore, incorporating phylogeny did not improve model fit, indicating that the phenotypic data alone contained sufficient information to predict performance from morphology. This approach can both significantly increase our understanding of performance evolution and act as a bridge to incorporate performance into future work on phenomics.


Year:  2022        PMID: 35061733      PMCID: PMC8782310          DOI: 10.1371/journal.pone.0261613

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

A major goal of evolutionary biology is accurate prediction of the phenotype from the genotype. The emerging field of phenomics in particular aims to quantify every aspect of the phenotype of an organism–that is, every measurable trait–and ultimately to relate it back, through several intermediate levels of biological organization, to the genome itself [1, 2]. However, while our ability to sequence genomes has advanced enormously in recent years, our capacity to characterize entire phenomes has not kept pace, particularly for phenotypes that are time consuming or otherwise difficult to quantify. Because some phenotypes are easier to measure than others, certain types of traits are either entirely absent from existing phenomes, or are described only in the most general terms [3]. Prime among these are those traits that describe how organisms conduct dynamic, ecologically relevant tasks such as jumping, running, flying, or biting, referred to collectively as whole-organism performance traits [4, 5]. Performance traits are key predictors of both survival and reproductive success in animals and as such form a cornerstone of the study of adaptation [4-6]. Performance is typically studied within the context of the ecomorphological paradigm, a statistical framework which states that morphology determines performance, which in turn affects fitness [7]. This paradigm has guided performance research for nearly 40 years and has been successfully applied to understand variation in morphology, performance, and fitness in a variety of animal species and over multiple levels of biological organization [8]. However, properly measuring maximum performance is time consuming, and doing so for suites of multiple performance traits in the same animals has proven to be a significant challenge. 
Consequently, despite intense interest in performance over the last several decades [6, 9–11], the entire whole-organism performance phenotype, comprising all or even most of the performance abilities of which a given species is capable, has seldom been rigorously quantified [12]. Furthermore, even in cases where animals within a sample can be measured for multiple performance traits, the resulting datasets are rarely comprehensive, usually being limited to only two or three performance traits, and are typically characterized by missing data [e.g. 13]. Individual datapoints might fail to be collected for reasons ranging from logistical constraints and equipment failure to lack of cooperation of the subject being measured or even lack of continued availability of a given subject or species. These missing individual-level data cause further problems at the population and species levels for the analysis of evolutionary trends in particular. For example, comparative analyses of multiple phenotypic traits across a phylogeny are sensitive to missing data because even a single absent data point (i.e., mean value) for a given trait can force the exclusion of an entire species, reducing the overall power of the analysis. Approaches to incomplete comparative datasets based on imputing "placeholder" values, such as the PHYLOPARS method, do allow for the execution of an analysis that would not otherwise run with missing trait values [14, 15], but the accuracy of these methods is likely to be variable, frequently unverifiable, and, at worst, prone to error, particularly for situations with large amounts of missing data, or where missing data are not dispersed randomly across taxa. One approach to addressing these issues is to predict data rather than measure them. The deterministic relationship between morphology and performance in particular offers scope for the prediction of unmeasured performance from individual morphology [16]. 
However, despite both the utility of the ecomorphological paradigm and the clear general validity of the morphology-to-performance relationship, modeling performance as a function of morphology alone is not always straightforward. Performance expression can be moderated, enhanced, or constrained by a variety of factors, including behavior [17]; energetic costs [14, 18–20]; elastic storage mechanisms [21, 22]; and the often complex relationships among performance and other facets of the integrated organismal phenotype [23-26]. Such constraints are especially relevant when animals are required to conduct multiple, yet different performance tasks on a day-to-day basis, many of which have conflicting morphological bases that cannot be optimized simultaneously. This can lead to trade-offs among specific performance traits such that specialization for one trait precludes high levels of expression in another [27, 28]. For example, birds such as gannets that dive from great heights to capture prey up to 30m below the water surface are often poor fliers because the ideal requirements for deep diving (high mass) and flying (low mass) are opposite [29]. Although intuitive, similar trade-offs among suites of several performance traits have proven difficult to uncover due in part to individual variation in performance expression [13, 30, 31]. The existence of many-to-one mapping, whereby the same performance trait is produced by different morphological forms [32, 33], is a further complication for accurately predicting whole-organism performance. Consequently, the extension of this predictive scenario to a multivariate morphology-performance situation involving numerous, potentially conflicting performance traits is yet more challenging. Collectively, these constraints significantly limit our current ability to accurately predict multiple performance traits from a given underlying morphology. 
The requirement for large-scale performance phenotyping coupled with existing constraints on both the measurement and prediction of multivariate performance demands that we adopt a new perspective on either performance measurement or inference. In the present study, we develop a machine learning method to accurately predict the multivariate performance phenotype from incomplete morphological datasets. Machine learning approaches are increasingly used at the whole-organism level to identify and analyze patterns within extremely large and detailed datasets often collected on only a handful of individuals. For example, machine learning techniques are used to extract meaningful biological signals from “noisy” patterns of individual movements recorded over long time periods using GPS trackers [34, 35], and to connect behavioral phenotypes to genetic sequences in populations of laboratory mice [2]. Furthermore, these methods can also be used to fill in “gaps” in large, complex datasets by deriving appropriate decision-boundaries for extrapolation, and ultimately to produce accurate predictions from novel data [36]. Here we adopt the latter approach, using machine learning to build an application to predict unmeasured maximum performance values at the level of the individual animal from a large and fragmentary dataset on lizard morphology and performance drawn from 68 species representing 8 different lizard families. Lizards are model organisms for the study of performance in general, and locomotion in particular [37]. We therefore take advantage of the substantial existing data on various lizard morphologies and associated performance phenotypes to train, test, and validate a machine learning model for imputing the multivariate performance phenotype from existing data on morphology. 
Specifically, we built a “stacked” machine learning model combining the outputs of several distinct regressor layers into a best-performing model that accurately predicts 5 distinct performance traits, including one complex, multicomponent trait (jumping ability) from 14 simple morphological measures across a range of diverse lizard taxa. Furthermore, we show that the addition of phylogenetic information on the relatedness of taxa in the model does not enhance model performance.

Materials and methods

We built our machine learning model (hereafter termed MVPpred: “Multivariate Performance Phenotype Predictor”) in three steps: missing value imputation; feature selection and classification; and stacking. Below we describe the nature of the training dataset, and outline briefly the process of model development and validation.

Morphology and performance dataset

We assembled a training dataset comprising morphology and maximum performance data for nearly 2,000 individual lizards from 68 species. Data were sourced from the authors’ personal datasets, contributions from other lizard performance researchers, and from publicly available data [38]. Performance data collected by different individuals and research groups are likely to be comparable given that whole-organism performance has the benefit of having standard protocols for maximum performance measurement to ensure that maximum values are recorded for each trait [39]. Morphology data are also commonly collected in a standardized manner, and here comprise measurements of head dimensions (head length, head width, and head height); body size (snout-vent length [SVL] and body mass); individual limb elements (femur, tibia, metatarsus, longest hind toe, humerus, radius, metacarpal, longest fore toe); and tail length. We considered six commonly measured maximum performance traits that capture an array of diverse lizard terrestrial performance capacities: sprint speed (shortest time to traverse a set distance on a runway set at 45° or less to the horizontal because some lizard species tend to hop on horizontal surfaces [40]); endurance (longest time an individual is able to keep pace at a set, sub-VO2 max speed on a treadmill before becoming exhausted [41, 42]); climbing (shortest time to traverse a set distance on a vertical runway [43, 44]); stamina (longest distance an individual is able to run when chased at maximum speed around a circular racetrack before becoming exhausted [45, 46]); jumping (which is a composite variable comprising maximum distance, acceleration, velocity, and power of a jump at a given angle measured via a force plate or high-speed camera [47, 48]); and bite force (maximum force measured when a lizard is induced to bite down in a standardized manner on bite plates connected to a force transducer [49, 50]). 
However, because data were collected by different groups to test a variety of hypotheses, these data are, for many species, incomplete in terms of either the measured morphology or performance, or both. Furthermore, variation in the availability of data means that the machine learning training dataset is highly unbalanced in terms of both taxonomic representation and the amount of data available for each taxon (Fig 1); in particular, lizards of the genus Anolis are overrepresented relative to non-anoline lizards (see S23 Table for exact sample sizes in S1 File). Any extrapolations or inferences of morphology->performance relationships from such a sparse and fragmentary dataset using standard prediction methods such as model I or model II regression are likely to be highly inaccurate; however, these data represent an ideal test case for machine learning approaches, as well as being representative of real-world data that are available to functional morphologists.
Fig 1

Species names and sample size for each of the 68 taxa comprising the training and verification dataset.

Phylogenetic relationships

Comparative datasets comprising traits measured on multiple taxa must take into account the evolutionary relationships among those taxa because related species share an evolutionary history and thus are not independent data points [51]. Moreover, shared ancestry provides information that could be used to estimate missing data, as phylogenetically closely related species will resemble each other in terms of both morphology and function. We pruned the large squamate phylogeny of Pyron et al. [52] to include only the species used in the current dataset (Fig 2). The evolutionary relationships among species were included in the base machine learning model as a distance matrix. However, this inclusion did not improve the predictions (see section A, “Two-step process” in the S1 File). Our final model therefore does not incorporate phylogeny; rather, we used only the available morphology->performance dataset for training, and confirmed the prediction accuracy by cross-validation.
Fig 2

Phylogenetic relationships among the 68 lizard taxa from 8 families included in the final model.

Note that phylogeny had no effect on the predictive accuracy of the final, stacked model.


Handling missing values

We used the K-Nearest Neighbor (KNN) method to replace missing performance trait values in two steps. In the KNN method, the Euclidean distance between a target sample (S) belonging to a target feature (F; in this case, a performance phenotype or variable of interest) and all other samples is calculated. The missing value for S is then replaced by the average value of the K closest samples (i.e., those K with the lowest Euclidean distance). Initially, we applied the KNN method within each taxon using K = 5; however, since the dataset contained taxa with no values at all for certain features, we ultimately applied the method to the entire dataset. The optimal value of K differed among performance traits; we searched values of K from 3 to 200 and selected the appropriate value for each performance feature based on root mean square error (RMSE; see Table 1) (Table 2; the complete search outcome is presented in S1 File).
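As a concrete illustration of the imputation scheme just described, the following is a minimal Python sketch (not the authors' implementation; the sample data are hypothetical): the missing value for a sample is replaced by the mean of the target feature over the K complete samples nearest in Euclidean distance.

```python
from math import isnan, sqrt

def knn_impute(samples, target_idx, k=5):
    """Replace NaN entries in column `target_idx` with the mean of that
    column over the k nearest complete samples, where distance is the
    Euclidean distance computed over the remaining columns."""
    complete = [s for s in samples if not isnan(s[target_idx])]
    filled = [list(s) for s in samples]
    for s in filled:
        if isnan(s[target_idx]):
            # distance to each complete sample, ignoring the target column
            dists = sorted(
                (sqrt(sum((s[j] - c[j]) ** 2
                          for j in range(len(s)) if j != target_idx)),
                 c[target_idx])
                for c in complete)
            nearest = dists[:k]
            s[target_idx] = sum(v for _, v in nearest) / len(nearest)
    return filled
```

In this sketch, a lizard with a missing value for, say, bite force would receive the mean bite force of the k individuals with the most similar morphological measurements.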
Table 1

Derivation of indices used to evaluate model classification and prediction.

Name of Metric | Definition
P | True value of the performance feature
P_avg | Mean of true values
P_pred | Predicted value of the corresponding performance feature
P_pred_avg | Mean of predicted values
N | Number of samples
Pearson Correlation Coefficient (PCC) | Σ(P − P_avg)(P_pred − P_pred_avg) / √[Σ(P − P_avg)² · Σ(P_pred − P_pred_avg)²]
Mean Absolute Error (MAE) | (1/N) Σ_{i=0}^{N−1} |P − P_pred|
Root Mean Square Error (RMSE) | √[(1/N) Σ_{i=0}^{N−1} (P − P_pred)²]
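These three evaluation metrics are standard definitions and straightforward to compute; a short, self-contained Python sketch (illustrative only, not the authors' code):

```python
from math import sqrt

def pcc(p, p_pred):
    """Pearson correlation coefficient between true and predicted values."""
    n = len(p)
    p_avg, q_avg = sum(p) / n, sum(p_pred) / n
    num = sum((a - p_avg) * (b - q_avg) for a, b in zip(p, p_pred))
    den = sqrt(sum((a - p_avg) ** 2 for a in p) *
               sum((b - q_avg) ** 2 for b in p_pred))
    return num / den

def mae(p, p_pred):
    """Mean absolute error; same units as the performance trait."""
    return sum(abs(a - b) for a, b in zip(p, p_pred)) / len(p)

def rmse(p, p_pred):
    """Root mean square error; penalizes large deviations more than MAE."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, p_pred)) / len(p))
```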
Table 2

Optimum K-value search result (range 1 to 100), for various performance traits.

Feature | Optimum K (based on RMSE) | Root Mean Square Error (RMSE)
Jump power | 16 | 57.47
Jump acceleration | 29 | 1.98
Bite force | 57 | 4.53
Jump velocity | 16 | 0.07
Endurance | 15 | 432.08
Sprint speed | 84 | 0.65
Jump distance | 46 | 0.05
Stamina | 25 | 3.66
Angle | 34 | 1.84

Model performance evaluation

Our model consists of both a classification framework that predicts the taxon of a given sample and a regression framework that predicts a given sample’s performance capacity. We measured the performance of the overall model using standard 10-fold cross-validation, whereby the data are divided into 10 sets of samples, 9 of which are used to train the prediction model while the remaining set is used to test the prediction model. We evaluated model performance using the Pearson Correlation Coefficient (PCC) and Mean Absolute Error (MAE) for the regression component (Table 1).
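The 10-fold scheme described above can be sketched generically in Python (a standard illustration, not the authors' code): sample indices are shuffled, partitioned into ten disjoint folds, and each fold in turn serves as the test set while the remaining nine are used for training.

```python
import random

def k_fold_splits(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation:
    shuffle the indices, cut them into n_folds disjoint folds, and hold
    out one fold as the test set per iteration."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for fi, fold in enumerate(folds) if fi != i for j in fold]
        yield train, test
```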

Regression framework

We used the entire available dataset to both train and validate our regression model. For standard training, we used 10-fold cross-validation (10 FCV), whereby we shuffled the dataset and divided it into ten sub-datasets by sequentially selecting equal numbers of individual samples at random without replacement. We then evaluated cross-validation performance using the Extra Trees Regressor (ETR) [53]; Gradient Boosting Regressor (GBR) [54]; Random Forest Regressor (RFR) [55]; XGBoost Regressor (XGBR) [56]; and Support Vector Regressor (SVR) [57]. ETR: we constructed the model with 1,000 trees, with the quality of a split measured by the Gini impurity index. RFR: we used a bootstrapping approach to construct 1,000 trees in the forest. XGBR: the parameters max_depth, eta, n_estimators, min_child_weight, subsample, scale_pos_weight, tree_method, and max_bin were set to 6, 0.1, 100, 5, 0.9, 3, hist, and 500, respectively; all other parameters were left at their default values. SVR: the RBF kernel parameter, γ, and the cost parameter, C, were optimized via grid search to achieve the best 10-fold cross-validation accuracy.
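For reference, the reported hyperparameter settings can be collected as plain configuration dictionaries. The values below are taken from the text; the parameter names follow the usual scikit-learn/XGBoost conventions and are assumed here, as is the illustrative SVR search grid:

```python
# Extra Trees and Random Forest: 1,000 trees each (bootstrapping for RFR).
ETR_PARAMS = {"n_estimators": 1000}
RFR_PARAMS = {"n_estimators": 1000, "bootstrap": True}

# XGBoost: values as reported in the text; all other parameters default.
XGBR_PARAMS = {
    "max_depth": 6,
    "eta": 0.1,
    "n_estimators": 100,
    "min_child_weight": 5,
    "subsample": 0.9,
    "scale_pos_weight": 3,
    "tree_method": "hist",
    "max_bin": 500,
}

# SVR: RBF kernel, with gamma and C tuned by grid search under 10-fold CV.
# The grid values below are hypothetical placeholders, not from the paper.
SVR_GRID = {
    "kernel": ["rbf"],
    "gamma": [2.0 ** e for e in range(-8, 4)],
    "C": [2.0 ** e for e in range(-2, 9)],
}
```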

Stacking framework

We further enhanced the performance of MVPpred using the stacking technique [58]. Briefly, the “no free lunch” theorem states that no single machine learning algorithm is best suited to all scenarios and datasets due to the associated generalization error [58-60]: one machine learning method will learn certain information from a dataset, whereas another will learn something different, depending on the specific underlying statistical learning principle. Stacking is an ensemble technique that combines information from multiple predictive models to generate a new model, and generally improves prediction results by minimizing generalization error [61-63]. Here, the outputs (the difference between the predicted and original values) of the different regressors in the base layer, along with the dataset used to train the base layer, are passed as the training dataset for the regressor in the stacked meta layer. We explored different combinations of base and meta layers (see Table 3).
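The stacking idea can be sketched generically: each base model's output is appended to the original feature vector, and the meta model operates on this augmented input. Below is a minimal Python illustration with toy callables standing in for trained regressors (not the authors' pipeline):

```python
class StackedRegressor:
    """Minimal stacked model: base-layer outputs are appended as extra
    features before the meta-layer model produces the final prediction."""

    def __init__(self, base_models, meta_model):
        self.base_models = base_models  # callables: feature row -> float
        self.meta_model = meta_model    # callable on the augmented row

    def _augment(self, row):
        # append one extra feature per base model
        return row + [m(row) for m in self.base_models]

    def predict(self, X):
        return [self.meta_model(self._augment(row)) for row in X]

# Toy example: two "base regressors" and a meta model averaging their outputs.
base = [lambda r: sum(r), lambda r: max(r)]
meta = lambda r: (r[-2] + r[-1]) / 2  # uses only the appended predictions
model = StackedRegressor(base, meta)
```

In practice the base and meta models would each be fitted regressors (e.g., XGBR or GBR), and the meta layer would be trained on the base layer's outputs together with the original features, as described above.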
Table 3

Configurations of the five stacked models.

Model | Base Layer | Meta Layer
SM1 | XGBR, RFR, GBR, ETR | ETR
SM2 | XGBR, RFR, GBR, ETR | GBR
SM3 | XGBR, RFR, GBR, ETR | RFR
SM4 | XGBR, RFR, GBR, ETR | XGBR
SM5 | XGBR, RFR, GBR, ETR | SVR

Results

Outcome of the regression framework

Of the tested performance features, jump acceleration yielded the best PCC and MAE (see Table 1 for the metrics and Table 4 for the outcome), and was best predicted using the optimized Support Vector Regressor with RBF kernel. The PCC (a measure of the linear correlation between the predicted and actual values; see Table 1) is 0.97, and the MAE (the mean absolute difference between the predicted and actual values) is 0.36 m/s2. The results for jump acceleration using different regression methods are given in S7 Table of S1 File.
Table 4

Pearson correlation coefficient (PCC) and mean absolute error (MAE) of features.

Jump acceleration exhibited the highest prediction accuracy (bolded). To aid in the interpretation of MAE, we have also provided the mean value for each performance feature from the overall training dataset, as well as the associated standard errors. Note that MAE has the same units as the associated performance trait.

Feature | Regression method | Mean (±SE) | PCC | MAE
Jump power (W/kg) | SVR | 45.94 (±0.15) | 0.77 | 1.21
Jump acceleration (m/s2) | SVR | 32.17 (±0.05) | 0.97 | 0.36
Bite force (N) | GBR | 7.74 (±0.18) | 0.94 | 1.35
Jump velocity (m/s) | XGBR | 1.57 (±0.002) | 0.95 | 0.02
Endurance (s) | GBR | 213.71 (±0.65) | 0.28 | 6.70
Sprint (m/s) | RFR | 1.35 (±0.02) | 0.88 | 0.23
Jump distance (m) | ETR | 0.33 (±0.001) | 0.84 | 0.01
Stamina (m) | XGBR | 16.53 (±0.11) | 0.83 | 1.42
Angle | XGBR | 36.44 (±0.06) | 0.75 | 0.527


Outcome of the stacking framework

We chose regressors for the base and meta layers of the five stacked models (SM1, SM2, SM3, SM4, and SM5) based on Table 4. The base layer of all five stacking models comprised XGBR, RFR, GBR, and ETR (Table 3), these being among the regressors that exhibited the best PCC and MAE. The results from the different stacking models for different features are summarized in Table 5; detailed results are available in S5 to S21 Tables of S1 File. SM2 outperformed the other four stacking models in all cases. PCC for SM2 ranged from 0.93 for jump distance to 0.99 for bite force, jump acceleration, and jump velocity, whereas MAEs were as low as 0.003 m for jump distance (with a mean jump distance in the dataset of 0.33 m) and as high as 1.73 s for endurance (with a mean endurance value in the dataset of 213.71 s) (Tables 4 and 5). Because of this superior performance, we used the SM2 stacking model throughout.
Table 5

Pearson correlation coefficient (PCC) and mean absolute error (MAE) of different stacking models for various performance features.

Performance feature | Stacked configuration | PCC | MAE
Jump power (W/kg) | SM2 | 0.98 | 0.49
Jump acceleration (m/s2) | SM2 | 0.99 | 0.17
Bite force (N) | SM2 | 0.99 | 0.57
Jump velocity (m/s) | SM2 | 0.99 | 0.01
Endurance (s) | SM2 | 0.95 | 1.73
Sprint speed (m/s) | SM2 | 0.98 | 0.11
Jump distance (m) | SM2 | 0.93 | 0.003
Stamina (m) | SM2 | 0.98 | 0.63
Angle | SM2 | 0.97 | 0.20
Average | | 0.973 | 0.434

Final software

To predict a given performance feature, the final software uses the prediction of the other eight performance features along with the morphological features as input. Our model is highly accurate even in the absence of phylogenetic information describing the relatedness among species. The final, stacked MVPpred model allows researchers to enter simple and easily obtained morphological data for an individual lizard and obtain accurate predictions for each of the 9 performance features pertaining to that individual. Furthermore, researchers could conceivably do this for all individuals in a sample, yielding a population or species mean for each trait that could then be used in comparative analyses. The results are available in S22 Table of S1 File.

Discussion

Measuring every phenotype of a given organism on the scale that phenomics demands may not be possible, meaning that imputed or inferred data will be necessary to at least some extent [2]. This will require both a paradigm shift in how we view data that are inferred rather than measured from real organisms, and an accompanying advancement in the methods that we use to do so. We built, trained, and validated a machine learning model, which we call MVPpred, to accurately estimate unmeasured maximum performance data from a large dataset on lizard morphology and performance at the level of the individual animal. Our final stacked models predicted maximum multivariate performance with high accuracy, and cross-validation of our approach shows that the final, stacked MVPpred model significantly outperforms both simple statistical prediction methods such as ordinary least squares regression and single machine learning prediction methods in all cases. The prediction accuracy in terms of PCC of the stacked models ranged from 0.93 to 0.99, with low MAE in all cases (ranging from 0.003 to 1.73). Overall, our model was able to generate accurate predictions, even for performance traits that were poorly represented in the training dataset. In addition to imputing the most likely maximum values of relatively simple performance metrics such as sprint speed or bite force, we also successfully and accurately predicted a more complex performance capacity. Jumping ability is itself a multivariate performance trait that can be characterized in several different ways [16, 64]. 
Some researchers have assessed individual jumping ability through relatively simple metrics such as maximum jump distance [65], whereas others have focused on describing both the kinetics and kinematics of jumping ability through measurement not only of distance, but also the velocity, acceleration, power, and the take-off angle of a jump [64, 66], all of which are interrelated and can trade-off against each other to shape overall jump trajectories [47, 48]. Our model predicted missing data for five key aspects of maximum jump performance (power, distance, acceleration, velocity, and angle), and did so with > 95% accuracy in all cases, suggesting that these methods hold the potential to predict other complex performance traits in different taxa as well. Machine learning has been used to understand performance in the past. In particular, sports scientists have previously applied similar methods to the performance of individual athletes and events [36]; for example, Maszczyk et al. [67] used neural networks to predict the distance of javelin throws, and a similar approach was applied by Edelmann-Nusser et al. [68] to the women’s 200m backstroke. Our study extends this approach to non-human animals in two key ways. First, our model predicts multiple maximum performance abilities as opposed to only one, including the five components of one complex performance ability (i.e., jumping). Second, we do this across 68 different species from 8 different families of lizards comprising a diversity of morphologies and ecological contexts. Our dataset was necessarily opportunistic and consequently is highly unbalanced with regard to species representation, ranging from species represented by several hundred individuals (Anolis carolinensis), to others represented by only a handful of lizards (e.g., Cordylus melanotus; see Fig 1). 
Such datasets are typically not ideal for comparative studies aimed at identifying interspecific patterns [69, 70], making it all the more remarkable that our model was able to make accurate predictions even for sparsely sampled taxa. This combined multivariate and multispecies application will allow researchers to predict not only individual maximum performance for the traits of interest, but also for multiple traits across multiple species, granting increased flexibility in cases where missing performance data that cannot otherwise be obtained might compromise phenomic or comparative analyses. Although accurately predicting maximum performance variables relating to existing data is valuable in itself, our model goes further and also opens up potential new avenues of investigation. MVPpred produces accurate predictions even in the absence of a known phylogeny, hinting at the potential universality of form-function relationships that might be obscured by variation at different levels of biological organization [see [31, 71], and [13] for examples at the within-species level]. However, although the aim of the current paper was to produce a model that accurately predicts multiple performance capacities, and although we validated those predictions against real data, the MVPpred model in its current form offers little insight into the causality underlying several of the predicted morphology->performance relationships. For example, while traits such as sprint speed and the various jump performance variables have clear deterministic relationships between limb morphology and the magnitude of the performance phenotype that are based on simple mechanical principles (e.g. Bauwens and Garland [72]), relationships between morphology and endurance are less clear cut. 
Endurance capacity is a function of oxygen delivery and cardiovascular function, which are not directly reflected in simple limb dimensions, and distribution of mass across the organism is more important than mass itself in determining endurance capacity [73]. The biomechanical basis of our model to accurately predict endurance from these morphological data is therefore not immediately apparent, and likely stems from the ability of the model to compute and compare not only relationships between predictors and predicted variables, but relationships among predicted performance variables as well. An important next step is therefore to interrogate our model to uncover and understand these causal relationships as well as any latent predictors that might exist. As such, models such as MVPpred also offer the possibility of a more complete understanding of form-function relationships at the whole-organism level as well and, potentially, a new approach for testing and understanding such relationships. Yet another possibility presented by our model performance, particularly in its accuracy in predicting performance for novel morphologies, is that an expanded and appropriately trained version of MVPpred could in principle allow for the accurate prediction of performance abilities from the bones of extinct organisms that have no living analogues. Similarly, our model could potentially represent a foundation for expanding this predictive approach to encompass other taxa and modes of locomotion beyond terrestrial lizards. The accurate prediction of unmeasured data is a potentially valuable approach, but it also comes with some necessary caveats. Firstly, MVPpred predicts only maximum performance capacities. 
Although our focus on maximum performance here is consistent with much of the whole-organism performance literature, animals do not always perform to their maximum limits in nature [74], and there are many situations where it might be more useful or appropriate to use all of the available performance data, not just the maximum values, or to explicitly consider submaximal values [13, 75]. Second, despite both the power and generalizability of machine learning approaches and the lack of influence of phylogeny on our results, our model has only been formally validated with data from the 68 species represented in the training dataset (see Fig 1 and S23 Table for the full species list in S1 File). This model should therefore be applied to individuals from other lizard species with caution, if at all. Expansion of the MVPpred model to encompass other species could be achieved by incorporating morphology and performance data pertaining to those species of interest. In conclusion, MVPpred predicts multiple different whole-organism performance traits, including aspects of a multivariate performance trait (jumping ability) with a high degree of accuracy from even sparsely sampled data. Although we do not believe that this approach either is or should become a replacement for rigorous collection of real data where such collection is feasible, our model is nonetheless a clear improvement on existing imputation methods for missing performance data. The ability to accurately impute missing data across species is likely to enable further progress in integrating whole-organism performance and phenomics; understanding variation in form-function relationships; and ultimately in inferring unmeasured performance traits from novel morphologies.

This Word file presents the results of species-wise cross-validation using the best stacking model, wherein we test a given species' performance by training the model with data from all other species.

These results demonstrate good cross-species predictions where adequate training and testing data are available, suggesting that the model is useful even in the absence of phylogenetic information. (DOCX)
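The species-wise validation described above amounts to leave-one-group-out cross-validation with species as the grouping factor. A minimal sketch, using synthetic placeholder data and a generic regressor rather than the actual stacking model:

```python
# Species-wise cross-validation: hold out one species at a time and
# predict its individuals from a model trained on all other species.
# Data, species labels, and the regressor are illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)
n = 150
X = rng.normal(size=(n, 4))                       # morphology features
y = X @ np.array([1.0, 0.5, 0.0, 2.0]) + rng.normal(scale=0.2, size=n)
species = rng.integers(0, 10, size=n)             # species label per individual

maes = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=species):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
print(len(maes), round(float(np.mean(maes)), 3))
```

Each entry of `maes` is the error for one held-out species, so species with morphologies unlike anything in the training set will show up as outliers here.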

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

9 Jun 2021
PONE-D-21-11098
Machine Learning Accurately Predicts the Multivariate Performance Phenotype from Morphology in Lizards
PLOS ONE

Dear Dr. Lailvaux, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Jul 23 2021 11:59PM. We look forward to receiving your revised manuscript. Kind regards, Christopher Nice, Ph.D., Academic Editor, PLOS ONE.

Journal Requirements: 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. 2. Thank you for stating the following financial disclosure: "The author(s) received no specific funding for this work." Please clarify the sources of funding (financial or material support) for your study, state what role the funders took in the study, and, if you received no funding, state: "The authors received no specific funding for this work." Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Additional Editor Comments: Due to unavailability of reviewers for this manuscript, I am providing my own review to augment reviewer 1's comments. In any revision, please address both sets of comments.

AE comments: This new approach for imputing missing data is well presented and should be of high interest to readers involved with phenomics. My only real complaints are 1) that more details of the data used to train the model be presented and 2) that the authors are too succinct: some expansion of ideas and statistical issues or limitations would add value to this manuscript. Below are some elaborations on this theme and minor comments.

146: why "MVPpred"? The authors might justify the name of their approach.

153: change semicolons to commas.

195-203: this needs some clarification, I think, with respect to the value of K. Was K=5 the initial approach, but then, given extensive missing data, the range of K was expanded (3-200) and the appropriate K chosen on the basis of MAE? Or is it RMSE? The supplementary figures (using MAE) do not comport with Table 1. As an arbitrary example, for sprint speed, K=4 based on MAE from Fig. S6 but is reported in Table 1 as 84.

Tables 3 and 4 should be switched in order: the stacking combinations (methods, currently Table 4) should precede the results (Table 3). "PCC", "MAE", "Base Layer" and "Meta Layer" should be defined in the legends.

Presentation of results: overall, this seems overly succinct, and these three short paragraphs do not do justice to the work the authors have done here.
For example, in the regression results, jump power is reported because of highest predictive power, and jump acceleration is mentioned at the end (250), but what about overall conclusions? Range of results? In other words, it is difficult to comprehend what the authors' main point is here. In the next section on stacking results (257), do the authors mean that SM2 outperformed the others in ALL cases, as indicated by Table 5? No; according to the supplementary tables, the text is correct and the table is mistaken. I think readers would benefit from a fuller exploration of the results.

Discussion: as above with the results, the discussion could be broadened to help readers appreciate the full scope. Yes, MVPpred should not be a replacement for collecting data, but, as the authors state in the introduction and the first sentence of the Discussion, such data are logistically difficult to come by, or impossible. This model seems to be a valuable tool as a consequence. But readers might appreciate a discussion of the details. How and why are stacking methods improving results? Why is the SM1 configuration superior in all (or most) features? Are there limitations (beyond inferring causality), or inappropriate uses of this approach?

Reviewers' comments: Reviewer's Responses to Questions. Comments to the Author.

1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that support the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: No

2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: No

3. Have the authors made all data underlying the findings in their manuscript fully available?
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. If there are restrictions on publicly sharing data (e.g., participant privacy or use of data from a third party), those must be specified. Reviewer #1: No

4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Reviewer #1: Yes

5. Review Comments to the Author

Reviewer #1: This manuscript describes a model for predicting performance of lizards from morphological measurements. It makes two claims: one is that the model is extremely accurate at predicting these performances, and the second is that this is related to phenomics: high-throughput, multivariate phenotyping. The first claim is not supported by the kind of evidence one needs to evaluate it. The second is a non sequitur. If you don't measure the phenotype, you are not phenotyping. A method which successfully predicts performance would be an extremely useful tool, and this software pipeline may be such a tool.
The problem here is that the two measures reported here do not adequately summarize accuracy. High R^2 is certainly a good thing, but it is not enough without seeing the distribution of the data and the predictions used to validate the model. To take a really simple example, if we have two clusters of jump performances, with one cluster being "can't jump" and the other similar in jumping ability, you can get a very high R^2 without the model being accurate: all it has to do is say one is low and the other is high. The MAE values are essentially uninterpretable without knowing the mean performance and the units in which they are measured. The proudly reported MAE value of 1.48 has NO units attached!!! Proportional error of the predictions would be a very meaningful statistic.

There are excellent biological reasons to doubt that the repeatabilities of these performance characteristics could be high enough to produce the R^2 values reported. If you run or jump the same lizard on different occasions, I would be astonished if repeatabilities could ever approach 98%. The assay for sprint speed is described as done at an angle of 45 degrees or less. How can it not matter whether you sprint uphill or on the flat? If this is the case, how could an individual's morphology ever predict performance with such high accuracy? In short, the manuscript makes an extraordinary claim, but fails to back it up with meaningful statistical evidence.

What is absolutely essential to evaluate the accuracy is that we see the number of species actually measured in each case, as well as the error of each prediction when used in the test data set. Similarly, we really need to know whether the within-species variation is also explained by morphology, or whether this is just the mean that is explained. Since there is no enumeration of studies, I don't know whether they are predicting the performance of 10 species from that of 58, or predicting 58 from 10 for each performance trait. What are the within-species sample sizes? How are those 2000 individual lizards divided among each trait and species? Why is there no figure showing measurements and predictions with meaningful errors for each?

If this model is in fact highly accurate by some meaningful measurement, this would be a very important result. It has important implications for phenomics, suggesting the dimensionality of the phenotype as a whole is not so very high. Rather than BEING phenomics, such predictive ability would suggest that we do not really need phenomics. Trying to present the results AS phenomics is misguided, but the implications FOR phenomics are very interesting.

6. PLOS authors have the option to publish the peer review history of their article. Do you want your identity to be public for this peer review? Reviewer #1: Yes: David Houle
14 Aug 2021
Response to AE: Major Reviews

1. 146: why "MVPpred"? The authors might justify the name of their approach.
Response: It is standard in computer science to provide names for machine learning models. In this case, "MVPpred" is short for "Multivariate Performance Phenotype Predictor".

2. 153: change semicolons to commas.
Response: Changed as suggested.

3. Tables 3 and 4 should be switched in order: the stacking combinations (methods, currently Table 4) should precede the results (Table 3). "PCC", "MAE", "Base Layer" and "Meta Layer" should be defined in the legends.
Response: Thank you. This issue has been corrected. We also switched the order of Tables 1 and 2 for similar reasons.

4. 195-203: this needs some clarification, I think, with respect to the value of K. Was K=5 the initial approach, but then, given extensive missing data, the range of K was expanded (3-200) and the appropriate K chosen on the basis of MAE? Or is it RMSE? The supplementary figures (using MAE) do not comport with Table 1. As an arbitrary example, for sprint speed, K=4 based on MAE from Fig. S6 but is reported in Table 1 as 84.
Response: This is correct. The initial choice of K was 5; later, as the Editor noted, due to the extent of missing data, the search range for K was expanded to 3-200, and the appropriate value of K was selected based on the RMSE for each feature. With regard to the supplementary figures, we erroneously included the results of an older experiment in which K was compared with respect to MAE. In the revised supplementary document, we have corrected this error and included the correct figures, which plot K versus RMSE (please see revised Figs. S1-S8). Thank you for pointing this out.

5. Presentation of results: overall, this seems overly succinct, and these three short paragraphs do not do justice to the work the authors have done here.
For example, in the regression results, jump power is reported because of highest predictive power, and jump acceleration is mentioned at the end (250), but what about overall conclusions? Range of results? In other words, it is difficult to comprehend what the authors' main point is here. In the next section on stacking results (257), do the authors mean that SM2 outperformed the others in ALL cases, as indicated by Table 5? No; according to the supplementary tables, the text is correct and the table is mistaken. I think readers would benefit from a fuller exploration of the results.
Response: We have expanded on the results as requested. We have supplied the correct supplement in this revision, which is in agreement with the results reported in the manuscript. The confusion regarding SM1/SM2 was a result of uploading an earlier, incorrect supplement; SM2 is indeed the best performing model in all cases.

6. Discussion: as above with the results, the discussion could be broadened to help readers appreciate the full scope. Yes, MVPpred should not be a replacement for collecting data, but, as the authors state in the introduction and the first sentence of the Discussion, such data are logistically difficult to come by, or impossible. This model seems to be a valuable tool as a consequence. But readers might appreciate a discussion of the details. How and why are stacking methods improving results? Why is the SM1 configuration superior in all (or most) features? Are there limitations (beyond inferring causality), or inappropriate uses of this approach?
Response: We have expanded upon the results in the revised version as requested. With regard to stacking improving results, traditional machine learning methods such as K-NN, logistic regression, random forest, SVM, etc., do not perform as well as the stacking method because of the associated generalization error. Specifically, one machine learning method will learn certain information from the dataset, whereas another will learn different information. The type of information an individual method learns from the data depends entirely on the specific statistical learning principle on which it was designed. The SM2 (not SM1; see above) model has the greatest predictive power in all cases, although it is not possible to say why it works better, just that it does. The stacking approach allows us to layer algorithms that do one or more things particularly well on top of each other, allowing us to refine model outputs iteratively within each layer. Because the "no free lunch" theorem in computer science tells us that no single algorithm is best suited to all scenarios and datasets, choosing configurations that are better in some or most cases (in this case, algorithms that optimize several features better than others) is a necessity. As requested, we have also expanded upon the limitations of our approach in the revised manuscript.

Response to the Reviewer: Major Reviews

1. This manuscript describes a model for predicting performance of lizards from morphological measurements. It makes two claims: one is that the model is extremely accurate at predicting these performances, and the second is that this is related to phenomics: high-throughput, multivariate phenotyping. The first claim is not supported by the kind of evidence one needs to evaluate it. The second is a non sequitur. If you don't measure the phenotype, you are not phenotyping.
Response: We believe that a lack of clarity on our part in the original manuscript has led to some misunderstanding as to the accuracy of our model, based at least in part on confusion regarding the level at which our model operates and is validated. To be clear: our model interpolates missing data at the level of the individual animal for multiple different performance traits.
Furthermore, our model is also tested and validated at the level of the individual, not the species level. Thus, researchers can enter into the MVPpred model whatever morphological measures they have for an individual animal, and the model will accurately impute the resulting 9 maximum performance capacities for that individual. The validation of the model happens at the level of the individual as well. We provide more detail on this procedure below and in the revised paper, which we hope the reviewer finds satisfactory. The same problem underlies the evident confusion regarding phenotyping. Houle et al. (2010) define phenomics as "the acquisition of high-dimensional phenotypic data on an organism-wide scale". We believe our proposed approach is consistent both with this definition, and with the way in which machine learning has been used to bolster phenotyping in phenomics studies in the past, specifically in the application of this method at the level of the individual animal. Again, we address this point in more detail in our response to the final comment (5) below.

2. A method which successfully predicts performance would be an extremely useful tool, and this software pipeline may be such a tool. The problem here is that the two measures reported here do not adequately summarize accuracy. High R^2 is certainly a good thing, but it is not enough without seeing the distribution of the data and the predictions used to validate the model. To take a really simple example, if we have two clusters of jump performances, with one cluster being "can't jump" and the other similar in jumping ability, you can get a very high R^2 without the model being accurate: all it has to do is say one is low, and the other is high. The MAE values are essentially uninterpretable without knowing the mean performance and the units in which they are measured. The proudly reported MAE value of 1.48 has NO units attached!!!
Proportional error of the predictions would be a very meaningful statistic.
Response: Every individual in the dataset was used for both testing and training using K-fold cross-validation. Note: we were careful to ensure that the same sample being tested was not included in the training of the model. The R2 value reported is defined as (1 - u/v), where u is the residual sum of squares, ((y_true - y_pred) ** 2).sum(), and v is the total sum of squares, ((y_true - y_true.mean()) ** 2).sum(). Therefore, R2 here refers to the agreement between predicted and actual data using all individuals in the dataset. The value of R2 will only be high if and when the predicted values are very close to the actual values, which indicates that the model is highly accurate. We agree that if the imputed missing values were the same as the existing values present in the data sample, we would be introducing redundancy into the data, which could lead to a high R2 value. However, when applying the K-nearest neighbor algorithm in our missing-value imputation, we search for the K nearest neighbors in the entire dataset. Next, a unique value of K for each individual morphology and performance trait is identified by considering a large search space of 3-200. Then, the missing value is imputed by taking the average of the values from only the nearest-neighbor samples, regardless of species. These steps help us ensure that redundancy is not introduced into the dataset, indicating that the R2 scores are not biased. We have provided the average performance values and corresponding standard errors for each feature to facilitate interpretation of the MAEs. We also note that MAE here has the units of the performance feature in question. With regard to the distribution of the data among performance traits, while this would be too much to summarize in a manuscript, we note that we are making all of the data freely available for anyone to inspect.
Certainly, there are cases where performance data are sparse, but again the imputed data are tested and verified against all of the other data, including cases where those data are present.

3. There are excellent biological reasons to doubt that the repeatabilities of these performance characteristics could be high enough to produce the R^2 values reported. If you run or jump the same lizard on different occasions, I would be astonished if repeatabilities could ever approach 98%. The assay for sprint speed is described as done at an angle of 45 degrees or less. How can it not matter whether you sprint uphill or on the flat? If this is the case, how could an individual's morphology ever predict performance with such high accuracy? In short, the manuscript makes an extraordinary claim, but fails to back it up with meaningful statistical evidence.
Response: There seems to be some confusion here as well; the R2 values here refer to the accuracy of the model (i.e., predicted vs. actual values; see previous response), NOT the repeatabilities of the empirical data. It is important to note that over the last 40+ years, performance researchers have almost always striven to measure maximum performance and have developed standardized protocols based on measuring a given performance trait multiple times and selecting the maximum value (see Losos et al. 2002 for a detailed description). Consequently, we focus on maximum performance here as well. Although previous studies have shown that such performance measures are in fact often highly repeatable, the extraordinary accuracies we report here are a result of the machine learning approach being applied to the entire dataset, not of empirical data collection. With regard to angled tracks sometimes being used to collect sprint data, many lizards, especially those in the genus Anolis, tend to hop on horizontal surfaces rather than run. Performance researchers therefore found that angling the track as described elicits proper sprinting.
One might reasonably imagine that lizards sprinting up an angled track would be slower than those sprinting horizontally, because sprinting at an angle requires more power, but this is not the case; lizard sprinting is not limited by power output, and sprint speeds do not decrease as angle increases (see Farley 1997; Irschick et al. 2003). We have clarified the above issues in the revised manuscript.

4. What is absolutely essential to evaluate the accuracy is that we see the number of species actually measured in each case, as well as the error of each prediction when used in the test data set. Similarly, we really need to know whether the within-species variation is also explained by morphology, or whether this is just the mean that is explained. Since there is no enumeration of studies, I don't know whether they are predicting the performance of 10 species from that of 58, or predicting 58 from 10 for each performance trait. What are the within-species sample sizes? How are those 2000 individual lizards divided among each trait and species? Why is there no figure showing measurements and predictions with meaningful errors for each?
Response: The reviewer is correct that information on species composition and sample sizes should be provided. We have included a figure in the main document (now Figure 1) and a table in the supplement (Table S23) giving both the names and the sample sizes for each of the 68 species in our training dataset. We used K-fold cross-validation, whereby the dataset is randomly subdivided and those subdivisions are tested iteratively against the remaining data, on the entire dataset, making sure that the samples used in testing are not included in the training of the model.
Because the individuals are selected for those K subdivisions at random, it is not the case that we are using x species to predict the performance of all other species; instead, we are selecting x INDIVIDUALS and testing their predicted performance against that of all other individuals, and those individuals are randomly selected without regard for species identity. Not only are we not using a certain number of species to predict other species, but species identity is not used in the model at all; as we report elsewhere in the manuscript, neither species identity nor the phylogeny is informative for predicting performance data from morphology.

5. If this model is in fact highly accurate by some meaningful measurement, this would be a very important result. It has important implications for phenomics, suggesting the dimensionality of the phenotype as a whole is not so very high. Rather than BEING phenomics, such predictive ability would suggest that we do not really need phenomics. Trying to present the results AS phenomics is misguided, but the implications FOR phenomics are very interesting.
Response: Again, we do not believe (and explicitly do not claim) that our approach is a replacement for empirical phenotyping; rather, we view it as a method for imputing such data when they cannot be acquired through more conventional means. This general procedure is already used in phenomics to fill in gaps in data acquired from individuals; our expanded approach enables us to do this for individuals belonging to multiple different species. To be clear, our method does not simply generate a species mean that can be used for comparative or phylogenetic analyses; it allows us to generate missing datapoints at the level of the individual for 68 different lizard species. In this respect, it is entirely consistent with the way that machine learning has been applied to phenomics in the past, just expanded to 68 different species as opposed to only one.
Furthermore, we disagree with the reviewer that this method, powerful as it is, either is or should be a replacement for empirical data collection, although we do think it has enormous utility as a supplementary approach. We have revised the manuscript to make this point clearer.

References

Farley, C.T. (1997) Maximum speed and mechanical power output in lizards. The Journal of Experimental Biology, 200, 2189-2195.

Irschick, D.J., Vanhooydonck, B., Herrel, A. & Andronescu, A. (2003) The effects of loading and size on maximum power output and gait characteristics in geckos. Journal of Experimental Biology, 206, 3923-3934.

Losos, J.B., Creer, D.A. & Schulte, J.A. (2002) Cautionary comments on the measurement of maximum locomotor capabilities. Journal of Zoology, 258, 57-61.

Submitted filename: Response_to_Reviews_PONE-D-21-11098.docx

30 Sep 2021
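The imputation procedure described in this response (K-nearest-neighbour imputation with K tuned per feature by RMSE over a 3-200 search range) can be sketched as follows. The data are synthetic, the K range is truncated for brevity, and, because the ground truth is known here, RMSE is computed directly against it; in practice one would mask a subset of observed values instead:

```python
# Sketch of KNN imputation with the number of neighbours K chosen by
# RMSE, loosely following the procedure described in the response above.
# Synthetic data; truncated K range (the paper searches 3-200).
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(2)
full = rng.normal(size=(100, 5))        # complete "true" data (synthetic)
data = full.copy()
mask = rng.random(full.shape) < 0.2     # knock out ~20% of values
data[mask] = np.nan

best_k, best_rmse = None, np.inf
for k in range(3, 21):                  # candidate K values
    imputed = KNNImputer(n_neighbors=k).fit_transform(data)
    rmse = np.sqrt(np.mean((imputed[mask] - full[mask]) ** 2))
    if rmse < best_rmse:
        best_k, best_rmse = k, rmse
print(best_k, round(float(best_rmse), 3))
```

Repeating this search once per morphology or performance feature yields the per-feature K values reported in the paper's Table 1.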
PONE-D-21-11098R1
Machine Learning Accurately Predicts the Multivariate Performance Phenotype from Morphology in Lizards
PLOS ONE Dear Dr. Lailvaux, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.
 
The reviewers found this revision to be improved and clearer. However, Reviewer 1 in particular is interested in more details regarding model validation and performance evaluation. Given that clarity is at a premium with the introduction of new methods, I ask that you consider these comments carefully. Please submit your revised manuscript by Nov 14 2021 11:59PM.
We look forward to receiving your revised manuscript. Kind regards, Christopher Nice, Ph.D., Academic Editor, PLOS ONE.

Journal Requirements: Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.
If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

Reviewers' comments:

Reviewer's Responses to Questions — Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)
Reviewer #2: All comments have been addressed
Reviewer #3: (No Response)

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly
Reviewer #2: Yes
Reviewer #3: Yes

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No
Reviewer #2: I Don't Know
Reviewer #3: Yes

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No
Reviewer #2: Yes
Reviewer #3: Yes

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: Yes

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The alterations to the manuscript have improved its clarity on many basic points. However, now that I understand what the authors are doing better, it is clear that some of my original questions remain. The authors appear to have misread my comments on measures of model accuracy. Because the evaluation of the accuracy of the model predictions is dubious in many respects, I am still worried that the authors give quite a misleading picture of their model’s performance.

First, I want to be clear on a critical point, and that is whether the model is being asked to predict imputed data in the 10-fold cross-validation step. The authors describe imputation, and then fitting of the overall model to the imputed + observed data in some detail. However, the performance evaluation is simply described as splitting the data into subsets, followed by standard cross-validation. There are two potential issues here.
I would like to assume that the authors removed the imputed performance values from the 1/10 of the data that is used as the test set. Please make explicit that this is not what you are doing. Only actually observed maximum performance values are legitimate to use in the test data set to evaluate the accuracy of the model predictions. While it is fine to use imputed values if they improve model predictions, it is NOT a test of accuracy to use a test data set with imputations to see whether the imputations are predicted by the model. Statements like this one at line 297-298 make me afraid that the authors HAVE made this fundamental error: “treadmill endurance was measured for only ~7.8% of individuals but inferred with an accuracy of 0.95 throughout the dataset.” If they have, then all measures of accuracy in this manuscript are completely bogus.

The more general issue is that an overall measure of correlation or error is not very informative, particularly in a data set with massively unbalanced sample sizes. As the authors note, the data set is heavily weighted to Anolis, and indeed 45% of the specimens are in just three species. This means that whatever individual-wise measure of accuracy is computed is mostly reflecting the model’s performance in those species with the largest sample sizes. For example, the authors say (lines 323-325) “making it all the more remarkable that our model was able to make accurate predictions even for sparsely sampled taxa.” The problem is that the authors have not calculated the accuracy of predictions on sparsely sampled taxa, just on the overall data set. To make such statements the authors need to report accuracy for each species. The authors’ single PCC or MAE value is effectively weighted by within-species sample size. A more representative overall measure would be an unweighted mean (or median) PCC or MAE value over species.
This would help get at another unaddressed issue from my previous review, and that is the issue of proportional errors. While it is a great improvement that we now have the overall means of each performance measure, an MAE value might be very small for an organism with a high predicted value, and very large for an organism with a small predicted value. There is still no summary of the actual performance values to enable a reader to evaluate this issue, as the authors make no attempt to do so. The statement on line 295 about MAE is still made without units, and is completely meaningless without thinking in proportional terms. For example, the extreme values cited, 0.003 meters jump and 1.75 seconds, are each close to a 1% error, and not really very different. But is the error low throughout the range of performance values? Interpret all MAE values proportionally.

Another cross-validation that would also be informative is to do it species-wise – that is, leave out each species from the training data set and ask how well its performance is predicted by the remaining species’ performance data. This would be an interesting test of the authors’ contention that phylogeny does not matter much. If that is true then there should be little cost to cross-species predictions. If cross-species predictions do well within the data set, then this would suggest that the model may indeed be useful when applied to species whose performance has not been measured.

I still argue that prediction is not really that relevant to phenomics, except in the way outlined in my previous review. As the lead author on the article cited, perhaps you should give my point of view some credence.

Minor comments.

Line 105: Cormorants dive from the water surface, not while in flight.

Line 262-271: I believe the authors mean to refer readers to Tables S4 and S5 in this paragraph.
Reviewer #2: Dear Editor, I have considered the revised manuscript entitled 'Machine Learning Accurately Predicts the Multivariate Performance Phenotype from Morphology in Lizards’ by S. Lailvaux, although I did not review the original version. I also read carefully the authors’ extensive responses to the referees' comments and questions and paid special attention to how these comments were addressed (where necessary) in the manuscript (the added track-changes version was very helpful for this). In my opinion, the authors did this revision very thoroughly and respectfully.

With regard to the content, I must admit that I am not at all familiar with the statistical (and computer) techniques applied to this massive lizard dataset. On the other hand, I am sufficiently familiar with ecological/morphological research (in the evolutionary context) to see the enormous potential of this methodology. This is especially the merit, also for the 'mathematical layman', of the very comprehensible introduction and discussion. The only thing I cannot quite assess is what the direct applicability (and thus in a sense the valorisation value) of this method might be to other, new cases (e.g. a study of the link between morphology and performance traits in arthropods). Morphometric and performance data will always be needed to train the routine to make predictions. But how extensive should this training dataset be? How 'lizard' specific is the current procedure (in other words: is the protocol directly applicable to other systems)? Can the classical 'ecologist' apply this method autonomously ... or will the participation of colleagues from computer sciences be necessary? Etc. The authors may wish to provide a perspective in this respect in a short paragraph in the discussion.

Reviewer #3: A concise report on the application of a statistical method to morphology>performance data as a way to deal with incomplete datasets.
The paper addresses a real problem and provides a robust statistical solution. I've got no major criticisms of the methods or results. I have a few comments that might be considered to help improve, especially, the discussion:

line 120: the model only predicts performance for incomplete performance datasets, correct? Or can the model work with incomplete morphological datasets too? Please check the wording here.

line 144: ML. Every time I read ML I think maximum likelihood, but it is machine learning. I might suggest avoiding this abbreviation altogether; the paper already has a lot of abbreviations, so one fewer would make for a bit less mental work for the reader.

line 297: I am having a hard time understanding how this is even possible. So if I measure endurance on 7.8% of my samples and then use the model to predict the other 92.2% of samples...how do I 'know' if that prediction is at all accurate? You don't really 'know' what those endurance values are, you only have a prediction based on other traits that is dependent upon very poor samples of 'known' values. Do the authors really believe that if I measured the other 92.2% of species that my measurements would fall within the prediction 95% of the time (or rather, does the model tell us that)??? It seems like a huge leap of faith based on very weak underlying sampling.

Line 341: I think you've missed a key idea. While endurance (i.e. the ability of muscles to sustain contraction to propel the animal forward at a given speed) is no doubt most closely linked to the cardiovascular and pulmonary systems...it is also TOTALLY dependent upon the limb morphology and body dimensions of a given species. Shorter-limbed species must cycle their limbs more often to maintain speed, thus taxing their muscles more than a longer-limbed species that ran the same distance. Dachshunds will always tire before greyhounds, and some (or even a lot) of that is related to their limb shape!
Line 366: Ughh...you undercut one of the main benefits of your model...prediction. But it is true that applying this model to other species is iffy. You might consider some text here to explain what type of dataset might be needed to build a model that ultimately COULD be used across other species.

Line 334: Feels like a bit of a cop-out. I think there is more causality that you can infer here than you give yourself credit for. Or rather, there is more biology here than the paper currently digs into. I realize the point of the paper is to demonstrate and validate the statistical model...but it sure would have been nice to see a bit deeper dive into the biology of how these performance traits trade off or facilitate, etc.

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: David Houle
Reviewer #2: No
Reviewer #3: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool.
If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.
2 Dec 2021

Response to First Reviewer

The alterations to the manuscript have improved its clarity on many basic points. However, now that I understand what the authors are doing better, it is clear that some of my original questions remain. The authors appear to have misread my comments on measures of model accuracy. Because the evaluation of the accuracy of the model predictions is dubious in many respects, I am still worried that the authors give quite a misleading picture of their model’s performance. First, I want to be clear on a critical point, and that is whether the model is being asked to predict imputed data in the 10-fold cross-validation step. The authors describe imputation, and then fitting of the overall model to the imputed + observed data in some detail. However, the performance evaluation is simply described as splitting the data into subsets, followed by standard cross-validation. There are two potential issues here. I would like to assume that the authors removed the imputed performance values from the 1/10 of the data that is used as the test set. Please make explicit that this is not what you are doing. Only actually observed maximum performance values are legitimate to use in the test data set to evaluate the accuracy of the model predictions. While it is fine to use imputed values if they improve model predictions, it is NOT a test of accuracy to use a test data set with imputations to see whether the imputations are predicted by the model. Statements like this one at line 297-298 make me afraid that the authors HAVE made this fundamental error: “treadmill endurance was measured for only ~7.8% of individuals but inferred with an accuracy of 0.95 throughout the dataset.” If they have, then all measures of accuracy in this manuscript are completely bogus.

- We thank the reviewer for these comments; however, we strongly reject the characterization of our model validation methods as “dubious”.
In particular, the reviewer is incorrect to state that “only actually observed maximum performance values are legitimate to use in the test data set”.

- It is important here to be clear: none of the approaches that we use for model assessment and validation in this paper are in any way novel, unusual, or controversial within the field of machine learning, and we implement them here in standard ways. However, because we understand the potential for confusion, we lay out the rationale for our approach involving both the data imputation and model validation here in some detail.

- The machine learning occurs in two stages. First, the model learns how best to impute all the missing values via appropriate interpolation. However, this is not done by simply taking an “unintelligent” column-wise (i.e. trait- or feature-wise) mean (which we call k = N, where N equals the total instances in the entire dataset), nor by choosing a single nearby “best” value (called k = 1). Instead, we used the k-nearest neighbor (KNN) machine-learning approach, which outperforms both these and other simple statistical methods (see Jerez et al. 2010 for an example and validation). In KNN, the value of k indicates how many neighboring datapoints we use for interpolation and averaging. In our case, Figures S1 to S8 in the manuscript supplement describe the search for the optimum value of k for each performance trait, which we found to be 165 for jump power (Fig. S1), 57 for bite force (Fig. S3), and so on. We are therefore not simply choosing a nearby datapoint to insert, nor are we always averaging the same number of nearby datapoints; rather, we search for the optimal number of nearby datapoints to average and choose that value of k based on the lowest observed error rate.
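For readers unfamiliar with this procedure, the k-search described above can be sketched as follows. This is a minimal illustration on simulated data using scikit-learn's `KNNImputer`, not the authors' actual pipeline; the toy trait matrix, the candidate k values, and the mask-and-score scheme are all assumptions for demonstration.

```python
import numpy as np
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)

# Toy stand-in for the lizard matrix: 200 individuals x 5 traits,
# where the last column is a performance trait with ~30% missing values.
X_true = rng.normal(size=(200, 5))
X_true[:, 4] = (X_true[:, :4] @ np.array([0.5, 0.3, -0.2, 0.1])
                + rng.normal(scale=0.1, size=200))
X = X_true.copy()
missing = rng.random(200) < 0.3
X[missing, 4] = np.nan

# Mask a further slice of *observed* values so each candidate k can be
# scored against known ground truth, mimicking the error curves of Figs S1-S8.
holdout = (~missing) & (rng.random(200) < 0.2)
X_search = X.copy()
X_search[holdout, 4] = np.nan

def imputation_mae(k):
    """Mean absolute error of KNN imputation on the held-out observed values."""
    filled = KNNImputer(n_neighbors=k).fit_transform(X_search)
    return float(np.mean(np.abs(filled[holdout, 4] - X_true[holdout, 4])))

candidate_ks = (1, 5, 15, 45)
errors = {k: imputation_mae(k) for k in candidate_ks}
best_k = min(errors, key=errors.get)  # keep the k with the lowest observed error
```

The same loop, run per trait over a wider grid of k values, yields trait-specific optima analogous to the k = 165 (jump power) and k = 57 (bite force) values the authors report.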
So the imputation procedure itself is an important component of the model’s accuracy, and choosing appropriate averages to use for the imputation of each missing datapoint is the first stage at which the error variance of the imputed value prediction is reduced.

- The second stage of machine learning involves using the entire dataset, including those imputed values, to train the model, using the various algorithms and algorithm combinations that we describe in the manuscript. It is important that those imputed values are also used, because if we were to discard them the model would be left with very few datapoints from which to learn. So including the imputed values at this and later stages is a key part of the ability of machine learning to produce accurate predictions from sparse datasets – which is the entire purpose of machine learning.

- Although the reviewer did not question the imputation procedure, we have included this explanation here to give context to the part of the methodology that the reviewer did question – namely the validation procedure, and the inclusion of those imputed data in the test dataset. To answer that question clearly: we did include the imputed data in the test dataset, and again we did this because that is the standard methodology for validating models that make predictions from datasets comprising large amounts of missing data. Excluding the imputed data from the test dataset leaves one with a very sparse test dataset, such that many of the folds in the k-fold cross-validation are left with zero or only a very few datapoints on which to effectively test the model predictions (see new Excel sheet in supplementary data). So including the imputed data ensures that the performance estimate is not artificially low, as would be the case if it were excluded. Furthermore, we have an estimate of the error involved in our model from the very start (i.e. during imputation; see Figures S1-S8 again).
Those error rates in imputing the values are, with the exception of endurance, extremely low, and that error only decreases as the learning algorithms and stacking technique are applied. Consequently, inclusion of the imputed data in the test data does not introduce enormous amounts of error. Finally, it is of note that although we are of course able to evaluate the performance of the final model, the processes of model prediction and validation occur iteratively as part of the training/learning procedure (see Methods sections “Model Performance Evaluation” and “Regression Framework” in the ms), and as such removing imputed data from the model makes little sense.

- The above notwithstanding, one can, of course, perform the cross-validation without the imputed data as the reviewer suggests, and also with varying-size test folds. Unsurprisingly, test folds containing sparse data perform poorly, again because they contain insufficient datapoints to test the model predictions. However, this does leave us with a moderate number of folds with sufficient original data for testing, which yield consistent and very good results, further supporting our contentions here regarding the accuracy of the model.

The more general issue is that an overall measure of correlation or error is not very informative, particularly in a data set with massively unbalanced sample sizes. As the authors note, the data set is heavily weighted to Anolis, and indeed 45% of the specimens are in just three species. This means that whatever individual-wise measure of accuracy is computed is mostly reflecting the model’s performance in those species with the largest sample sizes. For example, the authors say (lines 323-325) “making it all the more remarkable that our model was able to make accurate predictions even for sparsely sampled taxa.” The problem is that the authors have not calculated the accuracy of predictions on sparsely sampled taxa, just on the overall data set.
To make such statements the authors need to report accuracy for each species. The authors’ single PCC or MAE value is effectively weighted by within-species sample size. A more representative overall measure would be an unweighted mean (or median) PCC or MAE value over species. This would help get at another unaddressed issue from my previous review, and that is the issue of proportional errors. While it is a great improvement that we now have the overall means of each performance measure, an MAE value might be very small for an organism with a high predicted value, and very large for an organism with a small predicted value. There is still no summary of the actual performance values to enable a reader to evaluate this issue, as the authors make no attempt to do so. The statement on line 295 about MAE is still made without units, and is completely meaningless without thinking in proportional terms. For example, the extreme values cited, 0.003 meters jump and 1.75 seconds, are each close to a 1% error, and not really very different. But is the error low throughout the range of performance values? Interpret all MAE values proportionally.

- As the reviewer suggests, we have created an Excel file to show the results of cross-species prediction. The number of test samples represents the number of samples we have for each species after removing the missing values. As the new Table S24 shows, most of the species have only a small number of samples. It is not possible to train and test the model species-wise with such a small number of samples without conducting imputation first.

- We have provided the 10-fold cross-validation results with the range (Min and Max) of each performance feature so that all the MAE values can be interpreted proportionally. These cross-validation results are summarized in Table S34, and the error is relatively low considering the range of values for the performance features.
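The proportional reading requested by the reviewer amounts to dividing each trait's MAE by that trait's observed range. A minimal sketch; the 0.05-0.35 m jump-distance bounds below are a hypothetical example range, not values from the paper:

```python
def relative_mae(mae, trait_min, trait_max):
    """Express a trait's mean absolute error as a fraction of its observed range."""
    return mae / (trait_max - trait_min)

# e.g. a 0.003 m jump error over a hypothetical 0.05-0.35 m range:
jump_rel_error = relative_mae(0.003, 0.05, 0.35)  # ~0.01, i.e. roughly a 1% error
```

Reporting this ratio alongside each raw MAE makes errors directly comparable across traits measured in different units (meters, seconds, newtons).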
Another cross-validation that would also be informative is to do it species-wise – that is, leave out each species from the training data set and ask how well its performance is predicted by the remaining species’ performance data. This would be an interesting test of the authors’ contention that phylogeny does not matter much. If that is true then there should be little cost to cross-species predictions. If cross-species predictions do well within the data set, then this would suggest that the model may indeed be useful when applied to species whose performance has not been measured. I still argue that prediction is not really that relevant to phenomics, except in the way outlined in my previous review. As the lead author on the article cited, perhaps you should give my point of view some credence.

- As the reviewer suggested, we have calculated species-wise cross-validation with the best stacking model. We test a species’ performance by training the model with the other species’ data. The table is too big to display here, so we attach an Excel file as a supplementary document to demonstrate the results. The prediction accuracy is far better for those species with more extensive training and testing data. For example, for the performance feature bite force (values ranging between 0 and 109.26, and 17.3% missing), the average MAE and PCC are 7.42 and 0.48, respectively. The result clearly suggests that it is possible to make cross-species predictions given a good number of training and test samples, which supports our contention that phylogeny is of little importance within this dataset.

- With regard to the issue of phenomics, we are not indifferent to the reviewer’s point of view; the problem from our perspective is that the reviewer simply asserts that our approach holds no relevance to phenomics, but does not explain why.
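The species-wise cross-validation discussed in this exchange is a leave-one-group-out scheme: each species in turn is held out entirely, the model is trained on the rest, and per-species errors are then averaged without sample-size weighting. A hedged sketch on simulated data, substituting a generic random-forest regressor for the authors' stacked model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)

# Toy stand-in: 6 "species" x 20 individuals, 4 morphology features,
# one performance trait driven mostly by morphology.
species = np.repeat(np.arange(6), 20)
X = rng.normal(size=(120, 4))
y = (X @ np.array([1.0, 0.5, -0.3, 0.2])
     + 0.1 * species + rng.normal(scale=0.2, size=120))

per_species_mae = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=species):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])          # train on all other species
    left_out = int(species[test_idx][0])
    per_species_mae[left_out] = mean_absolute_error(
        y[test_idx], model.predict(X[test_idx]))   # score the held-out species

# Unweighted mean over species, as Reviewer 1 recommends for unbalanced data:
unweighted_mae = float(np.mean(list(per_species_mae.values())))
```

Because each species contributes one MAE value regardless of its sample size, this summary is not dominated by the heavily sampled Anolis species.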
As we noted in our response to the first review and in the manuscript, other phenomic studies have used and do use machine learning prediction as a matter of course, albeit on a smaller scale than we propose here; as we see it, our approach is entirely compatible with these and with the definition proposed in Houle et al. (2010). Clearly the reviewer disagrees; but without being privy to the reasons why, we are at a loss as to how to address his concerns.

Line 105: Cormorants dive from the water surface, not while in flight.

- This should have referred to gannets, which do exhibit this behavior. We have corrected it in the text.

Line 262-271: I believe the authors mean to refer readers to Tables S4 and S5 in this paragraph.

- The table reference in the manuscript is correct as it stands.

Response to Second Reviewer

I have considered the revised manuscript entitled ‘Machine Learning Accurately Predicts the Multivariate Performance Phenotype from Morphology in Lizards’ by S. Lailvaux, although I did not review the original version. I also read carefully the authors’ extensive responses to the referees’ comments and questions and paid special attention to how these comments were addressed (where necessary) in the manuscript (the added track-changes version was very helpful for this). In my opinion, the authors did this revision very thoroughly and respectfully. With regard to the content, I must admit that I am not at all familiar with the statistical (and computer) techniques applied to this massive lizard dataset. On the other hand, I am sufficiently familiar with ecological/morphological research (in the evolutionary context) to see the enormous potential of this methodology. This is especially the merit, also for the ‘mathematical layman’, of the very comprehensible introduction and discussion. The only thing I cannot quite assess is what the direct applicability (and thus in a sense the valorisation value) of this method might be to other, new cases (e.g.
a study of the link between morphology and performance traits in arthropods). Morphometric and performance data will always be needed to train the routine to make predictions. But how extensive should this training dataset be? How ‘lizard’ specific is the current procedure (in other words: is the protocol directly applicable to other systems)? Can the classical ‘ecologist’ apply this method autonomously ... or will the participation of colleagues from computer sciences be necessary? Etc. The authors may wish to provide a perspective in this respect in a short paragraph in the discussion.

- Given that our model uses and makes predictions only on data from lizards, we believe that this model is only relevant to lizards. Indeed, we further adopt the conservative position that this model is best applied only to lizard species for which the model has been validated (i.e. those species included in the model), even though our model also suggests that phylogeny has no effect on model performance (see also our response to Reviewer 1 above). We do already address these points explicitly in the final two paragraphs of the discussion. Briefly, we view this model as a starting point beyond which we might expand the model to other taxa and modes of locomotion via inclusion of appropriate training data.

Response to Third Reviewer

line 120: the model only predicts performance for incomplete performance datasets, correct? Or can the model work with incomplete morphological datasets too? Please check the wording here.

- Our training dataset comprises both incomplete morphology and performance data, as we note in the Methods (lines 171-173). The model is able to deal with both types of missing data because machine learning methods do not distinguish between dependent and independent variables, and rather use the entire dataset for prediction.

line 144: ML. Every time I read ML I think maximum likelihood, but it is machine learning.
I might suggest avoiding this abbreviation altogether; the paper already has a lot of abbreviations, so one fewer would make for a bit less mental work for the reader.

- We have replaced the abbreviation “ML” with “machine learning” throughout, as suggested.

line 297: I am having a hard time understanding how this is even possible. So if I measure endurance on 7.8% of my samples and then use the model to predict the other 92.2% of samples...how do I ‘know’ if that prediction is at all accurate? You don’t really ‘know’ what those endurance values are, you only have a prediction based on other traits that is dependent upon very poor samples of ‘known’ values. Do the authors really believe that if I measured the other 92.2% of species that my measurements would fall within the prediction 95% of the time (or rather, does the model tell us that)??? It seems like a huge leap of faith based on very weak underlying sampling.

- We understand the reviewer’s skepticism. One of the strengths of machine learning is its ability to predict novel, unmeasured data from a sparse dataset, and that is the reason we use it here. We refer the reviewer to our response to Reviewer 1, where we address in detail this concern and the methods by which machine learning imputes missing data. Nonetheless, we agree that it may be best to be conservative in our claims, and consequently we have removed this statement from the manuscript.

Line 341: I think you’ve missed a key idea. While endurance (i.e. the ability of muscles to sustain contraction to propel the animal forward at a given speed) is no doubt most closely linked to the cardiovascular and pulmonary systems...it is also TOTALLY dependent upon the limb morphology and body dimensions of a given species. Shorter-limbed species must cycle their limbs more often to maintain speed, thus taxing their muscles more than a longer-limbed species that ran the same distance.
Dachshunds will always tire before greyhounds, and some (or even a lot) of that is related to their limb shape!

- We agree with the reviewer that even endurance performance is likely to depend on underlying morphology, consistent with the ecomorphological paradigm. However, we argue that simple limb dimensions, which are what this dataset comprises, do not reflect that variation as they do for a trait such as sprint speed (where speed is stride length x stride frequency, and therefore clearly a function of limb length).

- The reviewer’s point regarding interspecific variation is well taken; however, it is also important to note that intraspecific endurance capacity can and does vary both among and within individuals in a way that, again, is not reflected in simple limb measurements, and that our dataset comprises this intraspecific variation in addition to interspecific variation. Indeed, much of the deterministic value of morphology in terms of endurance is dependent on the distribution of mass both along and among the limbs and other parts of the body, something that again is not captured by simple limb measurements, or even measures of whole-body mass. We therefore stand by our original statement in the manuscript, although we have amended it to acknowledge the issue regarding mass distribution.

Line 366: Ughh...you undercut one of the main benefits of your model...prediction. But it is true that applying this model to other species is iffy. You might consider some text here to explain what type of dataset might be needed to build a model that ultimately COULD be used across other species.
- We address this point explicitly in the discussion, where we state that: “… Yet another possibility presented by our model performance, particularly in its accuracy in predicting performance for novel morphologies, is that an expanded and appropriately trained version of MVPpred could in principle allow for the accurate prediction of performance abilities from the bones of extinct organisms that have no living analogues. Similarly, our model could potentially represent a foundation for expanding this predictive approach to encompass other taxa and modes of locomotion beyond terrestrial lizards.” Our aim is, in fact, to do this as the next step once this proof-of-concept study is accepted. We have inserted an additional statement to this effect. Line 334: Feels like a bit of a cop out. I think there is more causality that you can infer here than you give yourself credit for. Or rather, there is more biology here than the paper currently digs into. I realize the point of the paper is to demonstrate and validate the statistical model....but it sure would have been nice to see a bit deeper dive into the biology of how these performance traits trade-off or facilitate, etc. - We agree with the reviewer that there is a lot more biology to be discussed. However, we think that to do so in this manuscript is to put the cart before the horse. Again, in our view, this manuscript represents a proof-of-concept of the model, and there is a great deal of scope to follow up this paper with the types of studies the reviewer suggests. References: Jerez, M.J., I. Molina, P.J. García-Laencina, E. Alba, N. Ribelles, M. Martin and L. Franco. 2010. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artificial Intelligence in Medicine 50: 105-115. Submitted filename: Response_to_Reviews_Lailvaux_Lizard_Machine_Learning_V2.docx Click here for additional data file. 
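The exchange above turns on how one can trust imputed values for a trait measured in only ~8% of individuals. A minimal sketch of the general validation idea (not the authors' MVPpred stacked model, and using only toy data and scikit-learn's generic `IterativeImputer`): mask a subset of the *known* performance values, impute them from the morphology columns, and score the imputations against the held-out measurements.

```python
# Illustrative sketch only -- NOT the authors' MVPpred model. Toy data stand in
# for morphology (3 columns) and one performance trait that depends on them.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

n = 200
morph = rng.normal(size=(n, 3))                       # e.g. limb measures
perf = morph @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.1, size=n)
X = np.column_stack([morph, perf])

# Mask 90% of the performance column, mimicking sparse phenotyping; keep
# 20 of the masked cells aside as a validation set with known true values.
missing = rng.choice(n, size=180, replace=False)
held_out = missing[:20]
X_sparse = X.copy()
X_sparse[missing, 3] = np.nan

imputer = IterativeImputer(random_state=0)
X_imputed = imputer.fit_transform(X_sparse)

# Correlation between imputed and true held-out values gauges how much the
# sparse "known" sample actually constrains the predictions.
r = np.corrcoef(X_imputed[held_out, 3], X[held_out, 3])[0, 1]
print(f"imputed-vs-true correlation on held-out cells: {r:.2f}")
```

The same hold-out logic extends to any imputer, including a stacked ensemble; the point is only that imputation accuracy is itself measurable rather than an article of faith.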
7 Dec 2021

Machine Learning Accurately Predicts the Multivariate Performance Phenotype from Morphology in Lizards
PONE-D-21-11098R2

Dear Dr. Lailvaux,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance.

To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double-check that your user information is up to date. If you have any billing-related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Christopher Nice, Ph.D.
Academic Editor
PLOS ONE

12 Jan 2022

PONE-D-21-11098R2
Machine Learning Accurately Predicts the Multivariate Performance Phenotype from Morphology in Lizards

Dear Dr. Lailvaux:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of Dr. Christopher Nice
Academic Editor
PLOS ONE
References (39 in total)

1.  Trade-offs between burst performance and maximal exertion capacity in a wild amphibian, Xenopus tropicalis.

Authors:  Anthony Herrel; Camille Bonneaud
Journal:  J Exp Biol       Date:  2012-06-01       Impact factor: 3.312

2.  Functional trade-offs in the limb muscles of dogs selected for running vs. fighting.

Authors:  B M Pasi; D R Carrier
Journal:  J Evol Biol       Date:  2003-03       Impact factor: 2.411

3.  Sex-based differences and similarities in locomotor performance, thermal preferences, and escape behaviour in the lizard Platysaurus intermedius wilhelmi.

Authors:  Simon P Lailvaux; Graham J Alexander; Martin J Whiting
Journal:  Physiol Biochem Zool       Date:  2003 Jul-Aug       Impact factor: 2.247

4.  Performance is no proxy for genetic quality: trade-offs between locomotion, attractiveness, and life history in crickets.

Authors:  Simon P Lailvaux; Matthew D Hall; Robert C Brooks
Journal:  Ecology       Date:  2010-05       Impact factor: 5.499

5.  Discordance between morphological and mechanical diversity in the feeding mechanism of centrarchid fishes.

Authors:  David C Collar; Peter C Wainwright
Journal:  Evolution       Date:  2006-12       Impact factor: 3.694

6.  An overview of statistical learning theory.

Authors:  V N Vapnik
Journal:  IEEE Trans Neural Netw       Date:  1999

7.  Multi-trait Selection, Adaptation, and Constraints on the Evolution of Burst Swimming Performance.

Authors:  Cameron K Ghalambor; Jeffrey A Walker; David N Reznick
Journal:  Integr Comp Biol       Date:  2003-07       Impact factor: 3.326

8.  Evolution of sprint speed in lacertid lizards: morphological, physiological, and behavioral covariation.

Authors:  Dirk Bauwens; Theodore Garland; Aurora M Castilla; Raoul Van Damme
Journal:  Evolution       Date:  1995-10       Impact factor: 3.694

9.  Explosive jumping: extreme morphological and physiological specializations of Australian rocket frogs (Litoria nasuta).

Authors:  Rob S James; Robbie S Wilson
Journal:  Physiol Biochem Zool       Date:  2008 Mar-Apr       Impact factor: 2.247

10.  Development and application of a machine learning algorithm for classification of elasmobranch behaviour from accelerometry data.

Authors:  L R Brewster; J J Dale; T L Guttridge; S H Gruber; A C Hansell; M Elliott; I G Cowx; N M Whitney; A C Gleiss
Journal:  Mar Biol       Date:  2018-03-08       Impact factor: 2.573

