Literature DB >> 32997716

Comparing a novel machine learning method to the Friedewald formula and Martin-Hopkins equation for low-density lipoprotein estimation.

Gurpreet Singh1, Yasin Hussain2, Zhuoran Xu1, Evan Sholle3, Kelly Michalak1, Kristina Dolan1, Benjamin C Lee1, Alexander R van Rosendael4, Zahra Fatima1, Jessica M Peña1, Peter W F Wilson5, Antonio M Gotto6, Leslee J Shaw1, Lohendran Baskaran1,7, Subhi J Al'Aref1.   

Abstract

BACKGROUND: Low-density lipoprotein cholesterol (LDL-C) is a target for cardiovascular prevention. Contemporary equations for LDL-C estimation have limited accuracy in certain scenarios (high triglycerides [TG], very low LDL-C).
OBJECTIVES: We derived a novel method for LDL-C estimation from the standard lipid profile using a machine learning (ML) approach based on random forests (the Weill Cornell model). We compared its correlation with directly measured LDL-C against that of the Friedewald formula and the Martin-Hopkins equation.
METHODS: The study cohort comprised a convenience sample of standard lipid profile measurements (with the directly measured components of total cholesterol [TC], high-density lipoprotein cholesterol [HDL-C], and TG) as well as chemical-based direct LDL-C performed on the same day at the New York-Presbyterian Hospital/Weill Cornell Medicine (NYP-WCM). Subsequently, an ML algorithm was used to construct a model for LDL-C estimation. Results are reported on the held-out test set, with correlation coefficients and absolute residuals used to assess model performance.
RESULTS: Between 2005 and 2019, there were 17,500 lipid profiles performed on 10,936 unique individuals (4,456 females; 40.8%) aged 1 to 103 years. Correlation coefficients between estimated and measured LDL-C values were 0.982 for the Weill Cornell model, compared to 0.950 for the Friedewald formula and 0.962 for the Martin-Hopkins equation. The Weill Cornell model was consistently better across subgroups stratified by LDL-C and TG values, including TG >500 and LDL-C <70.
CONCLUSIONS: An ML model was found to have a better correlation with direct LDL-C than either the Friedewald formula or Martin-Hopkins equation, including in the setting of elevated TG and very low LDL-C.


Year:  2020        PMID: 32997716      PMCID: PMC7526877          DOI: 10.1371/journal.pone.0239934

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Atherosclerotic cardiovascular disease (ASCVD) is the leading cause of worldwide morbidity and mortality [1]. In the United States, annual mortality from ASCVD exceeds 800,000 deaths, while greater than 700,000 new cerebrovascular events occur annually, with an estimated cost of $351 billion [2]. Elevated low-density lipoprotein cholesterol (LDL-C) has been extensively validated as a major risk factor for the development of ASCVD [1]. Reduction in LDL-C has been shown to improve outcomes in both primary and secondary prevention cohorts [3, 4]. Multiple national and international clinical practice guidelines, such as those of the American Heart Association/American College of Cardiology (AHA/ACC), the European Society of Cardiology (ESC) and the Canadian Cardiovascular Society (CCS), consider LDL-C lowering a primary target for both primary and secondary prevention [5-7]. In addition, contemporary data from novel lipid-lowering drug therapies show improved outcomes with aggressive LDL-C lowering beyond the traditional thresholds advocated by current guidelines [8-11]. More recently, there has been a growing emphasis on residual cardiovascular risk in the setting of adequately controlled LDL-C levels, especially with elevated triglycerides [12]. As such, the clinical implications of LDL-C necessitate the most accurate estimates possible.
Traditionally, LDL-C has been estimated using the Friedewald formula, developed in 1972 on a cohort of 448 patients [13]. The equation estimates LDL-C as (total cholesterol [TC]) − (high-density lipoprotein cholesterol [HDL-C]) − (triglycerides [TG] / 5) in mg/dL [13]. A fixed factor of 5 for the triglyceride to very low-density lipoprotein cholesterol ratio (TG:VLDL-C) was used for ease of computation in an era prior to the currently accepted LDL-C thresholds (Grundy, 2004). The Lipid Research Clinics Prevalence Study provided evidence of significant variance in the TG:VLDL-C ratio amongst individuals [14].
The Friedewald formula is particularly inaccurate for patients with low LDL-C levels or high triglycerides [15, 16]. To overcome these inaccuracies, in 2013, Martin et al. introduced the Martin-Hopkins method for LDL-C estimation, given by (TC) − (HDL-C) − (TG / adjustable factor), where the adjustable factor is the strata-specific median TG:VLDL-C ratio [17]. The Martin-Hopkins method has been validated in multiple national and international trials [18-20]. This method has helped re-categorize patients who were previously undertreated and is currently used for LDL-C estimation at multiple clinical laboratories [21]. However, the Martin-Hopkins equation was developed using traditional linear regression analysis, and although it outperforms the Friedewald formula, inaccuracies remain, especially at lower LDL-C estimates [22]. The accepted reference method for lipoprotein fraction measurement is beta-quantification (BQ), which is feasible only in limited settings and is not suitable for mass screening due to its cost and labor-intensive nature. Machine learning (ML) utilizes sophisticated mathematical representations for the construction of inferential and predictive models, and its use has been shown to improve modeling and outcomes prediction in multiple domains within cardiovascular medicine [23, 24]. In an effort to further improve LDL-C estimation in this era of precision medicine, we derived a novel method for LDL-C estimation from the standard lipid profile using an ML approach based on the random forests algorithm (the Weill Cornell model). We then compared the correlation of the Weill Cornell model with measured direct LDL-C against that of the Friedewald and Martin-Hopkins methods.
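The two closed-form estimators above can be sketched in a few lines of Python. This is a minimal illustration: the Martin-Hopkins adjustable factor is normally looked up from the published strata table based on TG and non-HDL-C levels, so the fixed value passed below is purely a placeholder, not part of either published method.

```python
def friedewald_ldl(tc, hdl, tg):
    """Friedewald estimate of LDL-C (mg/dL): TC - HDL-C - TG/5.
    Known to be unreliable when TG exceeds 400 mg/dL."""
    if tg > 400:
        raise ValueError("Friedewald formula is unreliable for TG > 400 mg/dL")
    return tc - hdl - tg / 5.0


def martin_hopkins_ldl(tc, hdl, tg, adjustable_factor):
    """Martin-Hopkins estimate of LDL-C (mg/dL): TC - HDL-C - TG/factor.
    `adjustable_factor` is the strata-specific median TG:VLDL-C ratio,
    normally looked up from the published table (fixed here for illustration)."""
    return tc - hdl - tg / adjustable_factor


# Example lipid panel: TC 200, HDL-C 50, TG 150 mg/dL
print(friedewald_ldl(200, 50, 150))           # 120.0
print(martin_hopkins_ldl(200, 50, 150, 6.0))  # 125.0, with an illustrative factor of 6
```

Note how the only difference between the two methods is the denominator applied to TG, which is exactly where the Friedewald formula's fixed factor of 5 breaks down at high triglyceride levels.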

Methods

The study sought to develop and subsequently validate a novel approach for the estimation of serum LDL-C using ML-based random forests applied to routine cholesterol measurements, and then to compare its performance to the Friedewald formula and the more contemporary Martin-Hopkins equation.

Study population

The study cohort comprised a convenience sample of consecutive standard lipid profile samples (with the directly measured components of TC, HDL-C, and TG) as well as corresponding directly measured LDL-C values, performed between August 31st, 2005 and January 31st, 2019 for clinical indications at the New York-Presbyterian Hospital/Weill Cornell Medicine (NYP-WCM) inpatient and outpatient units across New York City and its boroughs. Inclusion required determination of the directly measured components of a standard lipid profile (TC, TG, HDL-C) as well as directly measured LDL-C on the same day, in order to avoid day-to-day variation in cholesterol particles when comparing the calculated LDL-C value to the direct LDL-C. Further, we excluded lipid profiles with missing values for TC, HDL-C or TG. Data were extracted from the electronic health record (EHR) system using the Architecture for Research Computing in Healthcare (ARCH) program, a suite of tools and services offered by the Research Informatics team within NYP-WCM's department of Information Technologies & Services [25]. Since this study did not include personally identifiable information (PII), it did not constitute human subjects research and was deemed exempt from Institutional Review Board (IRB) review.

Laboratory testing

Direct serum LDL-C measurement was performed using the Siemens ADVIA Chemistry XPT system (Tarrytown, NY) at the NYP-WCM clinical laboratory. The clinical laboratory at NYP-WCM is regulated by the New York State Department of Health and is accredited by the College of American Pathologists (CAP). The assay was calibrated every 14 days, and quality control measures followed government regulations and accreditation requirements. The assay was linear across 8.0–1,670.0 mg/dL (0.21–43.25 mmol/L), with intra-assay coefficients of variation of 0.4–0.5%. The limit of blank (LoB) for the ADVIA Chemistry assay was 0.1 mg/dL, while the limit of detection (LoD) was 8.0 mg/dL. Serum total cholesterol, triglycerides and HDL-C were also measured using the Siemens ADVIA Chemistry XPT system. The reportable range for total cholesterol was 10–1,350 mg/dL (0.55–74.93 mmol/L), and the assay was calibrated every 60 days, with 3 levels of quality control material analyzed twice daily. For triglycerides, similar calibration and quality control standards were used, with a reportable range of 10–1,100 mg/dL (0.55–61.05 mmol/L). For HDL-C, the assay was calibrated every 30 days, quality control measures were run twice daily, and the reportable range was 5–230 mg/dL (0.28–12.77 mmol/L). The ADVIA Chemistry system assays for total cholesterol, direct LDL-C, triglyceride, and HDL-C are traceable to the National Cholesterol Education Program / Centers for Disease Control (NCEP/CDC) reference method via patient sample correlation. The Friedewald and Martin-Hopkins estimated LDL-C values were calculated using the established and published formulas [13, 17].

Machine Learning (ML)

ML analysis was performed using the application programming interface (API) of Scikit-learn [26]. A random forest model was constructed to predict the LDL-C value from the measured TC, TG and HDL-C values (available through a standard lipid profile sample), with the directly measured LDL-C serving as the ground truth label. Random forest is a commonly used form of supervised learning employed for both classification and regression tasks. Random forests utilize an ensemble of decision trees, wherein each leaf node corresponds to a class label (for classification) or a predicted value (for regression). This algorithm was employed due to its state-of-the-art accuracy, its interpretability, and its high degree of internal optimization at a relatively modest computational cost. Overall, the original dataset was randomly split into a training set (80%) and a held-out test set (20%). The training set was further divided into a derivation cohort (80%) and an internal validation cohort (20%), with results reported on the test set in order to confirm the validity of the findings. The model's hyper-parameters (number of nodes and depth of each tree) were fine-tuned using a randomized search with 10-fold cross-validation. During cross-validation, the training data were divided into equally sized subsets, with training occurring on all but one of the subsets and internal validation performed on the remaining subset; this process was repeated iteratively, i.e., 10 times for 10-fold cross-validation. Finally, the correlation between direct LDL-C and estimated LDL-C using the developed model (the Weill Cornell model) was evaluated and subsequently compared to that of the Friedewald formula and the Martin-Hopkins equation.
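The pipeline described above can be sketched with Scikit-learn as follows. This is a rough illustration under stated assumptions: synthetic data stands in for the NYP-WCM lipid profiles, and the hyper-parameter ranges are placeholders, not the values actually tuned in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for a lipid profile cohort (all values in mg/dL).
rng = np.random.default_rng(0)
n = 1000
tc = rng.normal(171, 62, n).clip(80, 400)
hdl = rng.normal(60, 15, n).clip(20, 120)
tg = rng.normal(120, 60, n).clip(30, 600)
ldl = tc - hdl - tg / 5 + rng.normal(0, 5, n)  # noisy "direct LDL-C" label

X = np.column_stack([tc, tg, hdl])

# 80% training / 20% held-out test split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, ldl, test_size=0.2, random_state=0)

# Randomized hyper-parameter search with 10-fold cross-validation.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [50, 100, 200],
                         "max_depth": [5, 10, None]},
    n_iter=5, cv=10, random_state=0)
search.fit(X_train, y_train)

# Correlation between estimated and "direct" LDL-C on the held-out test set.
r = np.corrcoef(search.predict(X_test), y_test)[0, 1]
```

On real data the derivation/internal-validation split inside the training set is handled implicitly by the cross-validation folds, which is how `RandomizedSearchCV` structures it here.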

Statistical analysis

Patient-level baseline clinical characteristics were collected for the study cohort using ICD-10 codes (E78.0–4 for hyperlipidemia, I10–I16 for hypertension, and I25.1 or I25.7 for coronary artery disease). Frequencies and proportions were calculated for categorical variables, and means with standard deviations were calculated for continuous variables. All clinical data were analyzed on an aggregate basis, and at no point were individual patient comorbidities extracted or viewed. Correlation coefficients were used to compare each method's performance in predicting the LDL-C value. Subgroup analysis was performed with LDL-C and TG levels stratified according to ranges specified by the 2018 ACC/AHA Cholesterol guideline document [27]. Absolute residuals between each of the 3 methods (the Weill Cornell model, the Friedewald formula and the Martin-Hopkins equation) and the directly measured LDL-C level were compared across subgroups using a paired Student's t-test, which provides both the magnitude and the direction of the difference between each method and the ground truth direct LDL-C. Finally, LDL-C subgroup reclassification by the Weill Cornell model, relative to the Friedewald and Martin-Hopkins methods, was assessed using two-by-two tables. A one-tailed p-value of less than 0.05 was considered significant. All statistical analysis was performed using R version 3.5.0.
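The residual comparison described above can be sketched as follows (the study used R; this Python/SciPy equivalent is only illustrative, and the residual arrays are synthetic stand-ins, not study data). `scipy.stats.ttest_rel` with `alternative="less"` gives the one-tailed paired test.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two methods' estimates on the same samples.
rng = np.random.default_rng(1)
direct = rng.normal(95, 30, 500)        # "direct LDL-C" ground truth
est_a = direct + rng.normal(0, 4, 500)  # tighter estimator
est_b = direct + rng.normal(0, 8, 500)  # looser estimator

# Absolute residuals against the ground truth, paired per sample.
res_a = np.abs(est_a - direct)
res_b = np.abs(est_b - direct)

# One-tailed paired t-test: does method A have smaller absolute residuals?
t_stat, p_one_tailed = stats.ttest_rel(res_a, res_b, alternative="less")
```

Pairing matters here: both residuals come from the same lipid profile, so the paired test removes sample-to-sample variation that an unpaired comparison would absorb as noise.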

Results

Between August 31st, 2005 and January 31st, 2019, there were 17,500 standard lipid profile samples paired with same-day direct LDL-C measurements, performed on 10,936 unique individuals (4,456 female subjects; 40.8%) ranging in age from 1 to 103 years (Table 1). Of the patients from whom samples were drawn, 34.1% had been diagnosed with hypertension, 38% with hyperlipidemia, and 12.8% with coronary artery disease. Within the extracted samples, the mean direct LDL-C was 95.5 mg/dL and the mean triglyceride level was 59 mg/dL (Table 1).
Table 1

Patient-level baseline characteristics of the study cohort.

Clinical Variable                                     Value
Age in years (mean ± standard deviation)              57.5 ± 16.9
Female (%)                                            40.8
Mean height (cm)                                      170.2
Mean weight (kg)                                      80.2
Hyperlipidemia (%)                                    38.0
Hypertension (%)                                      34.1
Coronary artery disease (%)                           12.8

Lipid particle                                        Value in mg/dL (mean ± standard deviation)
Total cholesterol                                     171.0 ± 62.2
High-density lipoprotein cholesterol (HDL-C)          60.5 ± 3.5
Triglyceride                                          59.0 ± 33.9
Direct low-density lipoprotein cholesterol (LDL-C)    95.5 ± 64.3

Across all LDL-C levels, the Weill Cornell model exhibited a better correlation with direct LDL-C than the Friedewald formula or the Martin-Hopkins equation. Specifically, the correlation coefficient between the estimated and measured LDL-C value was 0.982 for the Weill Cornell model, compared to 0.950 for the Friedewald formula and 0.962 for the Martin-Hopkins equation (Fig 1). In subgroup analysis, the Weill Cornell model was consistently superior across subgroups stratified by LDL-C and TG values, including TG >500 and LDL-C <70 (Table 2). Importantly, the magnitude of improvement was highest in the LDL-C >190 mg/dL stratum (mean difference of -9.18 mg/dL compared to the Friedewald formula and -8.81 mg/dL compared to the Martin-Hopkins equation), while the Weill Cornell model was also better at very low LDL-C levels (mean difference of -3.82 mg/dL compared to Friedewald and -1.84 mg/dL compared to Martin-Hopkins). Further, the Weill Cornell model showed improved performance compared to the Friedewald and Martin-Hopkins equations across triglyceride subgroups, with the largest magnitude of improvement in the triglyceride range of >500 mg/dL (mean difference of -27.17 mg/dL compared to Friedewald and -4.44 mg/dL compared to Martin-Hopkins). Fig 2 shows the scatter plot of estimated LDL-C vs. direct LDL-C stratified by LDL-C subgroup, while Fig 3 shows the scatter plot stratified by triglyceride values. In the LDL-C range of >190 mg/dL, the correlation coefficient was 0.933 for the Weill Cornell model, compared to 0.882 for Friedewald and 0.876 for Martin-Hopkins, while in the triglyceride range of >500 mg/dL, the correlation coefficient was 0.998 for the Weill Cornell model, compared to 0.942 for the Friedewald formula and 0.901 for the Martin-Hopkins equation.
Fig 1

Scatter plot showing the correlation between estimated and directly measured low-density lipoprotein cholesterol (LDL-C) for the (a) overall cohort, and (b) for each of the LDL estimation models.

Table 2

Comparison of the absolute residuals between estimated LDL-C using the Weill Cornell Model with the Friedewald formula and Martin-Hopkins equation.

Subgroup            Friedewald Formula                  Martin-Hopkins Equation
                    Mean Difference (mg/dL)  p Value    Mean Difference (mg/dL)  p Value
Overall             -4.39 ± 7.56             <0.001     -2.93 ± 5.78             <2.2e-16
LDL-C 0–70          -3.82 ± 8.15             <0.001     -1.84 ± 6.35             <0.001
LDL-C 70–100        -3.72 ± 6.51             <0.001     -2.22 ± 4.21             <0.001
LDL-C 100–130       -4.12 ± 7.43             <0.001     -2.67 ± 5.30             <0.001
LDL-C 130–160       -5.02 ± 7.86             <0.001     -3.90 ± 6.62             <0.001
LDL-C 160–190       -7.49 ± 9.24             <0.001     -6.19 ± 7.50             <0.001
LDL-C >190          -9.18 ± 9.77             <0.001     -8.81 ± 9.38             <0.001
TG 0–150            -2.88 ± 5.39             <0.001     -2.67 ± 5.32             <0.001
TG 150–500          -9.79 ± 10.93            <0.001     -3.87 ± 7.12             <0.001
TG >500             -27.17 ± 10.76           0.007      -4.44 ± 7.68             0.17
Fig 2

(A) Scatter plot showing the correlation between the ground truth LDL-C value (direct LDL-C) and estimated LDL-C value, across LDL-C subgroups, using the Weill Cornell model, Friedewald formula and Martin-Hopkins equation. (B) Correlation coefficients for each model for LDL-C subgroups.

Fig 3

(A) Scatter plot showing the correlation between the ground truth LDL-C value (direct LDL-C) and estimated LDL-C value, across triglyceride subgroups, using the Weill Cornell model, Friedewald formula and Martin-Hopkins equation. (B) Correlation coefficients for each model for TGL subgroups. Abbreviations: TGL: triglycerides.

In terms of reclassification, the Weill Cornell model resulted in the improved reclassification of LDL-C values across guideline-determined LDL-C thresholds compared to the Friedewald formula and Martin-Hopkins equation (S1 Table). For instance, there were 18 instances in the validation cohort where the Weill Cornell model correctly predicted an LDL-C in the 0–70 mg/dL range, while the Friedewald formula incorrectly predicted all 18 examples to be in the 70–100 range. Similarly, there were 15 cases where the Weill Cornell model correctly predicted an LDL-C in the 0–70 mg/dL range, while the Martin-Hopkins equation incorrectly predicted all 15 examples to be in the 70–100 range.
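The reclassification tabulation described here amounts to cross-tabulating guideline strata assigned by the ground truth against those assigned by an estimator. A minimal sketch, using synthetic values rather than study data (the cut-points are the guideline thresholds used throughout the paper):

```python
import numpy as np
import pandas as pd

# Guideline LDL-C strata (mg/dL) as used in the subgroup analysis.
bins = [0, 70, 100, 130, 160, 190, np.inf]
labels = ["0-70", "70-100", "100-130", "130-160", "160-190", ">190"]

def stratify(values):
    """Assign each LDL-C value (mg/dL) to its guideline stratum."""
    return pd.cut(values, bins=bins, labels=labels)

# Synthetic stand-ins for direct LDL-C and an estimator on the same samples.
rng = np.random.default_rng(2)
direct = rng.uniform(40, 220, 300)
estimate = (direct + rng.normal(0, 10, 300)).clip(1, None)

# Rows: stratum by direct LDL-C; columns: stratum assigned by the estimator.
# Off-diagonal cells are the misclassified (reclassifiable) samples.
table = pd.crosstab(stratify(direct), stratify(estimate))
```

Comparing such tables for two estimators against the same ground truth strata shows which method moves more samples onto the diagonal, i.e. into the correct treatment category.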

Discussion

LDL-C lowering has been a central target of primary and secondary prevention efforts. Furthermore, LDL-C lowering and subsequent monitoring of LDL-C levels have become a keystone of clinical practice, given the continuous and graded relationship between LDL-C levels and cardiovascular risk, as well as between LDL-C lowering and subsequent modulation of incident risk. As a result, LDL-C measurement is ubiquitous within clinical care, and accurate assessment is essential for the implementation of individualized treatment plans. In the present investigation, we used random forests to develop an ML-based approach for the estimation of serum LDL-C using standard lipid profile samples. We show that our model (the Weill Cornell model) had a better correlation with direct LDL-C than the traditional Friedewald formula and the more contemporary Martin-Hopkins equation. Furthermore, our approach was consistently superior across subgroups stratified by LDL-C and triglyceride levels and resulted in significant reclassification of LDL-C values across guideline-determined LDL-C thresholds compared to the Friedewald formula and Martin-Hopkins equation.
We developed and validated an approach that harnesses the power of ML-based algorithms for the estimation of serum LDL-C values. Estimated LDL-C using our ML-based model correlated better with direct LDL-C than both the Friedewald formula and the Martin-Hopkins equation at very low LDL-C values (less than 70 mg/dL) or elevated triglyceride levels (>500 mg/dL), which is significant in an era where lower LDL-C targets are sought in the highest-risk individuals using statins and novel non-statin drug therapies. The BQ method, which combines ultracentrifugation with precipitation, is widely accepted as the reference method for lipoprotein fraction measurement. Yet it is useful only in limited settings and is not suitable for mass screening due to its cost and labor-intensive nature [28].
The enzymatic analysis of total cholesterol, triglyceride, and HDL-C is considerably less costly, and these measurements have been used to estimate LDL-C via the Friedewald formula, thereby lowering overall costs and improving the integration of LDL-C into clinical practice [13]. However, the Friedewald formula was known from its inception to be inaccurate when triglyceride levels exceed 400 mg/dL. Another major shortcoming of this approach is the requirement for a fasting specimen: if non-fasting samples are used, VLDL-C is overestimated and LDL-C underestimated (as a result of the presence of chylomicrons). In addition, because the calculation relies on three combined measurements, estimated LDL-C compounds their variabilities, with the largest effect coming from the total cholesterol measurement [29]. This variability ranges from 4% in well-standardized lipid laboratories to 12% in routine laboratories according to the National Cholesterol Education Program (NCEP) expert panel [30]. For instance, LDL-C can be estimated to be less than 70 mg/dL despite directly measured levels greater than 70 mg/dL [31]. The Martin-Hopkins equation has largely replaced the Friedewald formula, using the same standard lipid measurements but applying a personalized rather than fixed conversion factor [17]. The newer formula is more reliable and can be used in non-fasting patients, as it adjusts for triglyceride levels. The Martin-Hopkins method has been validated in multiple national and international trials and has helped re-categorize patients who would be undertreated using previous methods of LDL-C estimation [18-21]. The Martin-Hopkins equation is certainly more accurate than the Friedewald formula; however, even this improved equation is subject to inaccuracy at lower LDL-C estimates [22].
There has been widespread debate regarding optimal LDL-C targets. Despite the widespread use of statins, ASCVD remains the leading cause of worldwide mortality, while there remains residual cardiovascular risk despite optimal medical therapy [32]. The Pravastatin or Atorvastatin Evaluation and Infection Therapy–Thrombolysis in Myocardial Infarction 22 (PROVE IT-TIMI 22) trial noted residual cardiovascular risk despite lowering LDL-C to 62 mg/dL [33]. The hypothesis that more intensive treatment and lower LDL-C targets provide greater benefit has been further supported by plaque regression data from multiple studies [34-36]. A meta-analysis by the Cholesterol Treatment Trialists (CTT) showed that a 1 mmol/L reduction in LDL-C was associated with an approximately 20% reduction in major cardiovascular events [8]. More recently, the IMPROVE-IT trial assessed the benefit of adding ezetimibe to statin therapy within a secondary prevention cohort, demonstrating that the addition of ezetimibe to statin therapy reduced mean LDL-C by 15.8 mg/dL (53.7 mg/dL in the combined therapy arm compared to 69.5 mg/dL in the statin monotherapy arm) with an associated absolute risk difference of 2% at 7 years, further supporting the “lower the better” argument for lower LDL-C targets [37]. On the other hand, there has been ongoing discussion regarding the long-term safety of lower LDL-C targets. In recent years, this debate has been further pushed into the spotlight after the approval for clinical use of monoclonal antibodies to proprotein convertase subtilisin/kexin type 9 (PCSK9), which achieve reductions of up to 60% in LDL-C levels and at times below 50 mg/dL [38, 39]. A recent meta-analysis of 3,340 patients on background maximally tolerated statin therapy and receiving a PCSK9 inhibitor showed that the incidence of treatment-related neurocognitive adverse event was low (≤ 1.2%) with no significant differences between PCSK9 vs. 
control groups up to 104 weeks, with a similar finding in the subgroup of patients with LDL-C levels <25 mg/dL [40]. Clinical practice will continue to evolve as evidence accumulates regarding the beneficial effect of lower LDL-C targets. This, in turn, necessitates precise estimates of LDL-C to enable more accurate and well-informed clinical decision making, adverse event monitoring, and clinical trial design. Machine learning can better generalize with the availability of larger datasets. In clinical cardiology, it has been shown to be more proficient at predicting 5-year all-cause mortality than clinical characteristics or coronary computed tomography angiography (coronary CTA) metrics used separately [41]. Machine learning has also been used for segmentation tasks, with the goal of establishing the presence of a specific cardiovascular condition as well as prognostication of outcomes on echocardiography, myocardial perfusion imaging, electrocardiography and coronary CTA [42-45]. It has likewise been used to answer complex and intricate clinical questions, such as the prediction of inpatient readmissions in heart failure patients [46]. In this analysis, we sought to further exploit the power of machine learning algorithms to answer a clinical question that has widespread implications for daily clinical practice. While typical machine learning models incorporate an increasing number of variables, our approach was simply to utilize the standard lipid profile and its three measured components (TC, TG, and HDL-C) to estimate serum LDL-C. Our approach and results provide further proof of the ability of machine learning algorithms to solve both common and complex clinical issues, especially when large volumes of data are available (central illustration).

Central Illustration. Machine learning for the creation of an accurate model for serum LDL-C estimation

Abbreviations: TC: total cholesterol; TG: triglyceride; LDL: low-density lipoprotein; HDL: high-density lipoprotein.

This study was subject to some noteworthy limitations. Firstly, the Weill Cornell model was developed using a convenience sample of lipid profile measurements performed at a single tertiary care center in New York. While our model was internally validated, external validation is required to confirm the generalizability of the Weill Cornell model as well as its accuracy in other patient cohorts. However, the model is highly portable, and it is reasonable to assume that adoption at other sites could yield similar performance, especially if a model were trained on a comparable data set. Secondly, the study cohort may not fully represent a general population, since there is likely a specific clinical indication for ordering a direct LDL-C and a standard lipid profile in the same setting; as such, selection bias cannot be excluded from this analysis. Thirdly, the present analysis focused on developing and validating the Weill Cornell model across various LDL-C and TG levels but did not include patient-level analysis to determine the influence of certain clinical characteristics (such as ethnicity, presence of kidney disease, or use of lipid-lowering drug therapies) on model performance. Nevertheless, ML models can be continuously updated as they are applied to larger and more diverse datasets, improving accuracy, generalizability, and validity across populations. Fourthly, direct LDL-C was determined using chemical-based methods rather than the gold standard BQ method, and the analysis was limited to correlation with direct LDL-C; true accuracy was not established. The next step will be to validate the Weill Cornell model on cohorts with LDL-C measured by BQ.
Finally, the model developed is not a simple score that can be calculated by a physician on the spot but requires computational processing. However, the existing paradigm already involves calculating the estimated LDL-C when a standard lipid profile result is obtained after laboratory analysis, and the widespread digitization of healthcare should obviate any obstacle to the implementation of our model. Integration of the Weill Cornell model into EHRs, many of which are already capable of implementing complex computational models for risk prediction and other tasks, could avert this limitation.

In summary, we developed the Weill Cornell model for LDL-C estimation using a random forest ML approach trained on measured components of a standard lipid profile (TC, HDL-C, and TG). We observed that the Weill Cornell model correlates better with direct LDL-C than both the Friedewald formula and the Martin-Hopkins equation, with consistently better results across all subgroups, especially LDL-C <70 mg/dL and TG >500 mg/dL. Future research is required to validate the Weill Cornell model against LDL-C measured using the reference standard BQ method, with subsequent determination of model accuracy beyond the measures of correlation shown in the present analysis. Such an approach is of critical importance in an era where accurate LDL-C estimation is required as a result of more aggressive LDL-C lowering using novel and potent lipid-lowering drug therapies.

S1 Table. Improved accuracy of LDL-C estimation using the Weill Cornell model results in significant reclassification of LDL-C values across guideline-determined LDL-C thresholds (in mg/dL) compared to the (A) Friedewald formula and (B) Martin-Hopkins equation. Results are shown for the validation set.
21 Jul 2020 PONE-D-20-07240 Comparing a novel Machine Learning method to the Friedewald formula and Martin-Hopkins equation for Low-density Lipoprotein Estimation PLOS ONE Dear Dr. Al'Aref, Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. ACADEMIC EDITOR: The authors should address the issues raised by reviewer 1 regarding additional computations (Kappa statistics and TeA) which would strengthen the value of the results. Please submit your revised manuscript by Sep 04 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript: A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'. A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'. An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'. If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter. 
If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols We look forward to receiving your revised manuscript. Kind regards, Simeon-Pierre Choukem Academic Editor PLOS ONE Journal Requirements: When submitting your revision, we need you to address these additional requirements. 1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf 2. Thank you for including your competing interests statement: "The authors have declared that no competing interests exist." We note the following: 1) Author Benjamin C. Lee reports receiving consulting fees from Cleerly Inc.; 2) Author Leslee J. Shaw reports having an equity interest in Cleerly Inc.; and 3) Author Gurpreet Singh is affiliated with GlaxoSmithKline. We note that one or more of the authors are employed by a commercial company: GlaxoSmithKline. Please provide an amended Funding Statement declaring this commercial affiliation, as well as a statement regarding the Role of Funders in your study. If the funding organization did not play a role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript and only provided financial support in the form of authors' salaries and/or research materials, please review your statements relating to the author contributions, and ensure you have specifically and accurately indicated the role(s) that these authors had in your study.
You can update author roles in the Author Contributions section of the online submission form. Please also include the following statement within your amended Funding Statement. “The funder provided support in the form of salaries for authors [insert relevant initials], but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of these authors are articulated in the ‘author contributions’ section.” If your commercial affiliation did play a role in your study, please state and explain this role within your updated Funding Statement. 2. Please also provide an updated Competing Interests Statement declaring this commercial affiliation along with any other relevant declarations relating to employment, consultancy, patents, products in development, or marketed products, etc. Within your Competing Interests Statement, please confirm that this commercial affiliation does not alter your adherence to all PLOS ONE policies on sharing data and materials by including the following statement: "This does not alter our adherence to  PLOS ONE policies on sharing data and materials.” (as detailed online in our guide for authors http://journals.plos.org/plosone/s/competing-interests) . If this adherence statement is not accurate and  there are restrictions on sharing of data and/or materials, please state these. Please note that we cannot proceed with consideration of your article until this information has been declared. Please include both an updated Funding Statement and Competing Interests Statement in your cover letter. We will change the online submission form on your behalf. Please know it is PLOS ONE policy for corresponding authors to declare, on behalf of all authors, all potential competing interests for the purposes of transparency. 
PLOS defines a competing interest as anything that interferes with, or could reasonably be perceived as interfering with, the full and objective presentation, peer review, editorial decision-making, or publication of research or non-research articles submitted to one of the journals. Competing interests can be financial or non-financial, professional, or personal. Competing interests can arise in relationship to an organization or another person. Please follow this link to our website for more details on competing interests: http://journals.plos.org/plosone/s/competing-interests 3. We note that you have indicated that data from this study are available upon request. PLOS only allows data to be available upon request if there are legal or ethical restrictions on sharing data publicly. For information on unacceptable data access restrictions, please see http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. In your revised cover letter, please address the following prompts: a) If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially identifying or sensitive patient information) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent. b) If there are no restrictions, please upload the minimal anonymized data set necessary to replicate your study findings as either Supporting Information files or to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. Please see http://www.bmj.com/content/340/bmj.c181.long for guidelines on how to de-identify and prepare clinical data for publication. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. 
We will update your Data Availability statement on your behalf to reflect the information you provide. 4. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Partly Reviewer #2: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: No Reviewer #2: Yes ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. 
For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: Yes Reviewer #2: Yes ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: Permit me to thank the authors for the huge effort put forth on a subject that is as important as it is relevant. In summary, this study seeks to derive a novel method that is more ACCURATE in estimating LDL-C using Machine Learning (ML). 1) The study portrays a sound scientific framework, but could be more so if a number of clarifications are made: As a limitation, the authors observed that "direct LDL-C was determined using chemical-based methods, and not with the gold standard BQ method". This raises a question: what adjustment was done to allow the authors to compare a method (FF) that was established based on samples analyzed using the standard BQ method against the present study's samples, which were analyzed using a chemical-based method (one that has been shown to have inherent inaccuracy when compared to the standard BQ)? https://doi.org/10.1177%2F107424840501000106. Is it possible that the novel formula could be reliable but not necessarily valid?
2) Statistics: If the interest is to derive a more ACCURATE method of LDL-C estimation, then it is a distraction to focus a lot of attention on describing the strong positive correlation between the novel and the reference method. Strong correlation does not necessarily mean accuracy. Secondly, applying Kappa statistics can help us answer the question of whether an observed agreement is by chance or not. Thirdly, if the observed agreement is not by chance, then assessing allowable total error (TEa) as a benchmark for the performance of the novel method would be a great idea. Reviewer #2: I have found the paper original, as it attempts to resolve a key problem in the estimation of LDL-C in people with cardiovascular diseases and other conditions with similar etiologies. The statistical analyses, as well as all the steps of the procedure, are appropriate and well described. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: Yes: Tasha Manases Reviewer #2: Yes: Pr Jules Clement Nguedia Assob [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements.
To use PACE, you must first register as a user. Registration is free. Then, log in and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step. 26 Aug 2020 Reviewer Comments (Reviewer #1) Permit me to thank the authors for the huge effort put forth on a subject that is as important as it is relevant. We would like to express our sincere appreciation to reviewer #1 for taking the time to review our manuscript and for providing comments that have helped us improve it. In summary, this study seeks to derive a novel method that is more ACCURATE in estimating LDL-C using Machine Learning (ML). 1) The study portrays a sound scientific framework but could be more so if a number of clarifications are made: As a limitation, the authors observed that "direct LDL-C was determined using chemical-based methods, and not with the gold standard BQ method". This raises a question: what adjustment was done to allow the authors to compare a method (FF) that was established based on samples analyzed using the standard BQ method against the present study's samples, which were analyzed using a chemical-based method (one that has been shown to have inherent inaccuracy when compared to the standard BQ)? https://doi.org/10.1177%2F107424840501000106. Is it possible that the novel formula could be reliable but not necessarily valid? We thank the reviewer for the excellent comment. We completely agree that an inherent limitation to our approach is that the “ground truth” LDL is not the accepted gold standard BQ method, which unfortunately has limited real-life utility given its labor-intensive nature and significant associated cost.
We sought to utilize real-world data for a proof-of-concept application in order to highlight the fact that machine learning can improve even upon relatively simplistic equations, and that it applies to a relevant, everyday aspect of patient care. We also wanted to highlight the fact that, even though we developed an initial model with machine learning, a machine learning model can be dynamic and can become more accurate as more data are provided for model development. As such, future work is aimed at testing this model in an external validation cohort to further establish the advantage of the methodology presented in this paper, as well as at further validating our model by incorporating larger datasets, including data where the gold-standard LDL is measured using the BQ method. Additionally, the machine learning model used in this paper can be understood as multiple learners working together. On a fundamental level, this should allow the model to learn the underlying equation for calculating LDL levels and to minimize the model's error by learning a globally optimal equation. To that end, we have highlighted this limitation in the discussion section: 1. Fourthly, direct LDL-C was determined using chemical-based methods, and not with the gold standard BQ method, while analysis was limited to correlation with direct LDL-C and true accuracy was not established. Nevertheless, the next step will be to validate the Weill Cornell model on cohorts with LDL-C measured by BQ. 2. Future research is required in order to validate the Weill Cornell model against LDL-C measured using the reference standard BQ method, with subsequent determination of model accuracy beyond the measures of correlation shown in the present analysis.
2) If the interest is to derive a more ACCURATE method of LDL-C estimation, then it is a distraction to focus a lot of attention on describing the strong positive correlation between the novel and the reference method. Strong correlation does not necessarily mean accuracy. We thank the reviewer for the comment. For continuous labels, mean absolute error (MAE) is a good way to measure model accuracy. In this study, we compared the absolute prediction errors of the Weill Cornell model with those of the Friedewald formula and the Martin-Hopkins equation using paired t tests. We found that the Weill Cornell model has smaller absolute errors in both comparisons, which implies higher accuracy. In addition, we utilized the correlation coefficient to evaluate model performance. On one hand, it represents the extent of correlation; on the other hand, it equals the square root of R-squared, which can be interpreted as how much of the variance of the outcome is explained by the predictor. So overall, we used two metrics to evaluate model performance (in terms of accuracy as well as correlation). 3) Secondly, applying Kappa statistics can help us answer the question of whether an observed agreement is by chance or not. We thank the reviewer for the comment. Cohen’s kappa coefficient is a good measure for evaluating inter-rater reliability, but it is typically used for categorical variables. In the present analysis, we treated LDL as a continuous value (rather than splitting it into categories, since in practice the clinician is interested in the actual value rather than the range). 4) Thirdly, if the observed agreement is not by chance, then assessing allowable total error (TEa) as a benchmark for the performance of the novel method would be a great idea. We thank the reviewer for the comment. Congruent with the previous comment, we would like to clarify that our aim was to evaluate whether our model can predict an actual value rather than a class label.
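The accuracy comparison described in this response (MAE plus a paired t test on per-sample absolute residuals) can be sketched as follows. The residuals below are invented for demonstration only, and the paired t statistic is computed directly from the per-sample differences rather than via a statistics library.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic absolute residuals (mg/dL) for two estimators on the same samples;
# illustrative values only, not the study's data.
err_model = np.abs(rng.normal(0, 5, n))       # e.g. ML-model residuals
err_friedewald = np.abs(rng.normal(0, 8, n))  # e.g. Friedewald residuals

mae_model = err_model.mean()
mae_fw = err_friedewald.mean()

# Paired t test on the per-sample difference in absolute error:
# t = mean(d) / (sd(d) / sqrt(n)), with d the paired differences.
d = err_model - err_friedewald
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))

print(f"MAE model: {mae_model:.2f}, MAE Friedewald: {mae_fw:.2f}, t = {t_stat:.2f}")
```

A large-magnitude negative t statistic here indicates that the first estimator's absolute errors are systematically smaller than the second's, which is the sense in which the response above uses "higher accuracy".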
Reviewer Comments (Reviewer #2) I have found the paper original as it attempts to resolve a key problem in the estimation of LDLc in people with cardiovascular diseases and other conditions with similar etiologies. The statistical analyses as well as all the steps of the procedure are appropriate and well described. We would like to express our sincere appreciation to reviewer #2 for taking the time to review our manuscript. Again, we greatly appreciate the reviewer’s comments and hope that we have answered each point to his/her satisfaction. Thank you very much for your consideration of this manuscript for publication. Yours Sincerely, Subhi J. Al’Aref, MD, FACC (Corresponding Author) Submitted filename: Response to Reviewers.docx Click here for additional data file. 16 Sep 2020 Comparing a novel Machine Learning method to the Friedewald formula and Martin-Hopkins equation for Low-density Lipoprotein Estimation PONE-D-20-07240R1 Dear Dr. Al’Aref, We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements. Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication. An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org. If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. 
If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org. Kind regards, Simeon-Pierre Choukem Academic Editor PLOS ONE Additional Editor Comments (optional): Reviewers' comments: 18 Sep 2020 PONE-D-20-07240R1 Comparing a novel Machine Learning method to the Friedewald formula and Martin-Hopkins equation for Low-density Lipoprotein Estimation Dear Dr. Al’Aref: I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department. If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org. If we can help with anything else, please email us at plosone@plos.org. Thank you for submitting your work to PLOS ONE and supporting open access. Kind regards, PLOS ONE Editorial Office Staff on behalf of Dr. Simeon-Pierre Choukem Academic Editor PLOS ONE