Prediction of SARS-CoV-2-positivity from million-scale complete blood counts using machine learning.

Gianlucca Zuin¹,², Daniella Araujo¹,³, Vinicius Ribeiro³, Maria Gabriella Seiler², Wesley Heleno Prieto⁴, Maria Carolina Pintão⁴, Carolina Dos Santos Lazari⁴, Celso Francisco Hernandes Granato⁴, Adriano Veloso¹.

Abstract

Background: The Complete Blood Count (CBC) is a commonly used low-cost test that measures white blood cells, red blood cells, and platelets in a person's blood. It is a useful tool to support medical decisions, as intrinsic variations of each analyte bring relevant insights regarding potential diseases. In this study, we aimed at developing machine learning models for COVID-19 diagnosis through CBCs, unlocking the predictive power of non-linear relationships between multiple blood analytes.
Methods: We collected 809,254 CBCs and 1,088,385 RT-PCR tests for SARS-Cov-2, of which 21% (234,466) were positive, from 900,220 unique individuals. To properly screen COVID-19, we also collected 120,807 CBCs of 16,940 individuals who tested positive for other respiratory viruses. We proposed an ensemble procedure that combines machine learning models for different respiratory infections and analyzed the results in both the first and second waves of COVID-19 cases in Brazil.
Results: We obtain an AUROC above 90% for validations in both scenarios. We show that models built solely on SARS-CoV-2 data are biased, performing poorly in the presence of infections due to other RNA respiratory viruses.
Conclusions: We demonstrate the potential of a novel machine learning approach for COVID-19 diagnosis based on a CBC and show that aggregating information about other respiratory diseases was essential to guarantee robustness in the results. Given its versatile nature, low cost, and speed, we believe that our tool can be particularly useful in a variety of scenarios, both during the pandemic and after.

Entities: Chemical

Keywords: Biomarkers; Diagnosis; Viral infection

Year: 2022 PMID: 35721829 PMCID: PMC9199341 DOI: 10.1038/s43856-022-00129-0

Source DB: PubMed Journal: Commun Med (Lond) ISSN: 2730-664X

Introduction

At the end of 2019, the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) appeared in the city of Wuhan, China[1], which led to a global outbreak weeks later[2]. The resulting highly transmissible disease was named Coronavirus disease 2019 (COVID-19)[3]. At the time of writing, over 400 million cases of COVID-19 and over 5.7 million deaths have been reported worldwide. One of the main challenges for its diagnosis is the list of initial symptoms: fever, dry cough and/or tiredness[4], which are all common in many other respiratory diseases. Currently, the gold-standard tests for direct detection of SARS-CoV-2 include the Reverse Transcription Polymerase Chain Reaction exam (or simply, RT-PCR) and the serology count analysis. The first step of the RT-PCR exam uses the enzyme reverse transcriptase to transform the RNA of the virus into complementary DNA. RNA is produced from a DNA molecule and carries the information that coordinates the production of proteins. With a probe complementary to a particular virus, it is possible to verify whether the molecular content corresponds to that of the suspected infectious agent. However, in the particular case of SARS-CoV-2, the RT-PCR is more efficient at the peak of the infectious cycle[5]. This leads to a high rate of false negatives, with a sensitivity between 50% and 62% according to refs. [6,7]. The authors of ref. [8] found that over 20% of infected individuals obtained a positive RT-PCR result only after two consecutive false-negative results. Serology exams have been found to reach sensitivity and specificity above 0.95, but only 15–28 days after symptom onset[9]. Furthermore, both exams are relatively expensive, and results take longer to process than those of other kinds of laboratory tests, such as the complete blood count (CBC). CBCs are extensively used for general individual diagnosis[10].
As a low-cost test that measures analyte levels of the white and red series in the blood, it is a useful tool to support medical decisions, as intrinsic variations of analytes can bring relevant insights regarding potential diseases. Patients with most kinds of infectious diseases have noticeable changes in their CBC tests. However, proving that these results can be interpreted as sufficient to support a particular diagnosis is a considerably more difficult task, as changes in analyte values could be easily confounded for different diseases’ patterns. In analyzing complete blood counts of individuals with COVID-19 infection in isolation, we find some changes to be quite characteristic of the disease[11-13]. This implies that machines, which can detect patterns not easily noticeable by humans, could be employed for automatic detection and preliminary screening of the disease. Indeed, many models have been proposed for automated COVID-19 diagnosis through CBCs and omics data. We argue that the detection performance of these models is possibly biased—or overestimated—as many patterns are not unique to SARS-Cov-2. The performance of these models will likely drop significantly as the prevalence of other respiratory viruses increases. This work employs a dataset collected between 2016 and 2021 containing exams of individuals who underwent blood tests in conjunction with RT-PCR exams throughout Brazil, both for COVID-19 and for other pathologies like Influenza-A or H1N1. More specifically, our dataset includes individuals who underwent a CBC at an interval of 60 days before or after a RT-PCR test. For 2020 and 2021 we collected laboratory data for 900,220 unique individuals, 809,254 CBCs, and 1,088,385 RT-PCR tests, of which 21% (234,466) were positive and less than 0.2% (1679) were inconclusive. This work does not investigate demographic, prognostic, or clinical data, such as ethnicity, hospitalization, or symptomatology, as these fall out of laboratory scope. 
We propose modeling the task as a binary classification problem and analyzing two distinct timeframes: one considering the early pandemic stage, namely the first wave of COVID-19 cases in Brazil; and a second stage after November 2020, when the second wave of COVID-19 started, and when we saw the emergence of a new variant of concern, P1, which eventually led to the health system collapse in the capital state of Amazonas in late December[14,15]. One of the key highlights of our proposed approach is the analysis of other RNA respiratory viruses. We also collected 120,807 CBCs from 2016 to 2019 of 16,940 individuals who tested positive for Influenza-A, Influenza-B or H1N1, as well as other respiratory viruses, and additionally 307,978 unlabeled CBCs. In particular, these additional CBCs included exams from the 2016 H1N1 surge in São Paulo[16], during which the population developed similar hygienic habits to the ones recommended in 2020, like social distancing and the use of masks, although at a minor scale. To the best of the authors’ knowledge, this is the most extensive and comprehensive COVID-19-related dataset to date. We follow the guidelines provided by the IJMEDI checklist[17] regarding applying machine learning to medical data, allowing for both higher quality work and an easier reproducibility and understanding of results. Our analysis focused on patients older than 18 years. We believe more experiments are necessary to assert performance for children and teens under 18 years old, but data regarding these age groups was also present in all training and test sets. Throughout our experiments, we train an ensemble of machine learning models on this million-scale dataset to predict Sars-CoV-2-positivity. To guarantee the correct labeling of training instances, we focus on the CBC results as close to the first positive result as possible. 
Our analysis shows that the additional data from other RNA respiratory viruses is fundamental for properly screening COVID-19. In the absence of such information, models are prone to confuse SARS-CoV-2 with other respiratory viruses or infections. This finding corroborates many studies that raised concerns regarding bias in COVID-19 research[18-20]. We also demonstrate the necessity of keeping a model as up-to-date as possible so that it can keep up with the different stages of a pandemic surge. Our model retains high performance across multiple evaluation scenarios and on simulations with varying prevalences of COVID-19, properly differentiating SARS-CoV-2 from other confounding viruses, thus demonstrating the robustness of our approach.

Methods

Data

The Fleury database structure was created in October 1997 using InterSystems Caché and Ensemble, version 1.4 (Caché, InterSystems, 2018; https://docs.intersystems.com/; November 2020), a high-performance architecture that is commonly used to develop software applications for healthcare management (Cambridge MA). The database was built using standard healthcare industry practices to ensure accuracy, completeness, and security of data collected. The results of the laboratory tests are automatically inserted into a Microsoft SQL database after verification of the RT-PCR output. Within a few seconds, data are replicated to the Caché database (InterSystems) for permanent storage. Once stored in the database, the result is made available for patients. All users have a username and password, maintained by Windows Active Directory (AD). All registry changes to the database are tracked through a log and are restricted to users with high-level administrative permissions. Information is kept secure through a separate network firewall, accessed only by authorized persons within the Fleury Group’s domains. Data stored in this database has been used previously in several clinical studies before the COVID-19 outbreak[21-26]. This project was submitted, evaluated, and approved by the Research Ethics Committee (CEP) of Grupo Fleury (CAAE: 33790820.3.0000.5474), duly qualified by the National Research Ethics Committee (CONEP) of the National Health Council of Brazil. The CEP is an interdisciplinary and independent collegiate body of public relevance, with a consultative, deliberative, and educational character, created to defend the interests of research participants in their integrity and dignity as well as to contribute to the development of research within the highest ethical standards. By decision of the CEP, since this project uses retrospective and anonymized data, there is no need to apply a Free and Informed Consent Term (TCLE) to participating patients.
The CBC measurements were obtained from EDTA-K3 collected peripheral blood samples analyzed by the Automated Hematology Analyzer XT or XN series from Sysmex (Sysmex Corporation, Kobe, Japan). In total, 72 pieces of equipment are distributed across 36 laboratories throughout the country. Red blood cells (RBC) and platelets were counted and sized by direct current (DC) impedance with hydrodynamic focusing (sheath flow DC detection). The hematocrit was determined from the RBC pulse height. The hemoglobin was measured using sodium lauryl sulfate spectrophotometry. CBCs also include the physical features of the RBC: Mean corpuscular volume (MCV) is a measurement of the average size of red blood cells; Mean corpuscular hemoglobin (MCH) is a calculated measurement of the average amount of hemoglobin; Mean corpuscular hemoglobin concentration (MCHC) is a calculated measurement of the average concentration of hemoglobin; and Red cell distribution width (RDW) is a measurement of the variation in RBC size. The white blood cells (WBC) and six-part differential were determined by fluorescence flow cytometry. Specifically, the WBC subpopulations were separated based on cell complexity (side-scattered fluorescent intensity), cell size (forward scattered light), and fluorescence signal (side fluorescent light). Quality control is performed daily using three control levels (high, normal, and low) for each parameter. Measurements are analyzed using the Insight™ Interlaboratory Quality Assessment Program for Sysmex hematology analyzers, where data from users worldwide are compared. To guarantee equivalence and reproducibility of our analysis and enable the use of common reference intervals for different measurement procedures[27], harmonization of equipment is performed in accordance with the Clinical and Laboratory Standards Institute’s (CLSI) guidelines[28].
Results are accepted if the percentage difference is less than 50% of the total error for each parameter, which allows us to devise reference values for each measurement[29,30].

Complete blood count and model features

A complete blood count (or simply, CBC) is a common blood test used for a variety of reasons, including the detection of disorders and infections. A CBC test measures several components and features in the blood, including RBC, which carry oxygen; Hemoglobin, the oxygen-carrying protein in red blood cells; Hematocrit, the proportion of red blood cells to the fluid component; WBC, which fight infection (i.e., Monocytes, Lymphocytes, Eosinophils, Basophils, Neutrophils); and Platelets, which help with blood clotting. Abnormal increases or decreases in cell counts may indicate an underlying biological process taking place, like inflammation or immune response. Also, values such as the Neutrophil-Lymphocyte ratio, Platelet-Monocyte ratio, or the Platelet-Lymphocyte ratio are recognized as inflammatory markers[31]. Table 1 shows analyte means and standard deviations, as well as the employed units of measure in each of our cohorts. We can easily identify some patterns that might help us in sorting COVID-19 infected patients from the remaining ones. We can also clearly perceive that the distributions for each gender are slightly different. This is to be expected, as it is known that CBC values vary with age and gender[32]. However, introducing an explicit gender variable into our model could entail bias. To avoid this, we instead normalize each analyte by the corresponding gender and age reference values devised by Grupo Fleury, thus building a unified model that considers CBC analyte values regardless of gender.

Table 1

Mean and standard deviation for all considered cell counts in each cohort. N = 1,138,728 CBCs.

Analyte	Covid-19 (+)	Covid-19 (-)	Influenza (+)	Other Viruses (+)	Entire Data
Male patients
RBC (10¹²/L)	5.06 ± 0.52	4.21 ± 0.98	4.73 ± 0.60	3.67 ± 0.87	4.28 ± 0.96
Hemoglobin (g/dL)	14.9 ± 1.4	12.4 ± 2.8	14.0 ± 1.7	10.8 ± 2.5	12.6 ± 2.7
Hematocrit (%)	43.8 ± 4.0	36.8 ± 7.9	41.0 ± 4.9	31.7 ± 7.3	37.4 ± 7.7
MCV (fL)	86.8 ± 4.7	88.1 ± 6.4	87.0 ± 6.7	86.9 ± 8.0	88.0 ± 6.2
MCH (pg/cell)	29.5 ± 1.9	29.6 ± 2.3	29.6 ± 2.3	29.6 ± 2.6	29.5 ± 2.2
MCHC (g/dL)	34.1 ± 1.1	33.6 ± 1.4	34.0 ± 1.1	34.1 ± 1.4	33.6 ± 1.4
RDW (%)	13.0 ± 1.0	14.3 ± 2.2	13.6 ± 1.2	15.1 ± 2.1	14.1 ± 2.2
WBC (10⁹/L)	6.07 ± 2.37	8.07 ± 3.81	6.96 ± 2.81	5.87 ± 4.69	8.02 ± 3.81
Monocytes (10⁹/L)	0.66 ± 0.29	0.68 ± 0.35	0.75 ± 0.37	0.66 ± 0.46	0.66 ± 0.34
Lymphocytes (10⁹/L)	1.40 ± 0.72	1.67 ± 1.05	1.23 ± 0.92	1.25 ± 1.40	1.54 ± 0.99
Eosinophils (10⁹/L)	0.07 ± 0.09	0.18 ± 0.20	0.07 ± 0.10	0.10 ± 0.16	0.15 ± 0.20
Basophils (10⁹/L)	0.02 ± 0.02	0.03 ± 0.02	0.02 ± 0.01	0.02 ± 0.02	0.03 ± 0.02
Neutrophils (10⁹/L)	3.92 ± 2.22	5.53 ± 3.50	4.90 ± 2.57	4.08 ± 3.93	5.64 ± 3.57
Platelets (10⁹/L)	195.7 ± 56.7	222.0 ± 102.3	182.9 ± 63.6	145.8 ± 115.6	222.7 ± 99.9
Female patients
RBC (10¹²/L)	4.57 ± 0.44	4.03 ± 0.75	4.62 ± 0.67	3.75 ± 0.78	4.05 ± 0.75
Hemoglobin (g/dL)	13.3 ± 1.2	11.8 ± 2.1	13.6 ± 1.9	11.0 ± 2.1	11.8 ± 2.1
Hematocrit (%)	39.8 ± 3.4	35.4 ± 6.1	40.3 ± 5.4	32.8 ± 6.4	35.6 ± 6.0
MCV (fL)	87.3 ± 5.0	88.3 ± 6.3	87.7 ± 6.7	87.9 ± 8.1	88.3 ± 6.2
MCH (pg/cell)	29.2 ± 2.0	29.3 ± 2.3	29.7 ± 2.3	29.4 ± 2.7	29.3 ± 2.2
MCHC (g/dL)	33.5 ± 1.0	33.2 ± 1.3	33.8 ± 1.2	33.5 ± 1.4	33.2 ± 1.3
RDW (%)	13.1 ± 1.1	14.2 ± 2.1	13.7 ± 1.3	14.9 ± 2.1	14.1 ± 2.1
WBC (10⁹/L)	5.87 ± 2.40	8.03 ± 3.71	7.11 ± 3.15	6.62 ± 4.63	7.84 ± 3.66
Monocytes (10⁹/L)	0.56 ± 0.24	0.62 ± 0.32	0.70 ± 0.35	0.61 ± 0.43	0.60 ± 0.31
Lymphocytes (10⁹/L)	1.54 ± 0.80	1.85 ± 1.05	1.36 ± 0.95	1.54 ± 1.40	1.78 ± 1.02
Eosinophils (10⁹/L)	0.06 ± 0.08	0.16 ± 0.18	0.075 ± 0.11	0.09 ± 0.18	0.15 ± 0.18
Basophils (10⁹/L)	0.02 ± 0.01	0.03 ± 0.02	0.01 ± 0.01	0.02 ± 0.02	0.03 ± 0.02
Neutrophils (10⁹/L)	3.68 ± 2151.51	5.39 ± 3.36	4.94 ± 2.97	4.56 ± 3.81	5.29 ± 3.34
Platelets (10⁹/L)	222.6 ± 63.0	249.2 ± 101.4	185.0 ± 69.1	188.9 ± 123.8	248.4 ± 100.4

Specifically, we perform normalization by employing the reference ranges as a pivot. Let R be the reference values of an analyte; the general formula for scaling features is given as

$$\hat{x} = \frac{x - \Omega\left(R(x \mid \text{sex} = s, \text{age} = a)\right)}{O\left(R(x \mid \text{sex} = s, \text{age} = a)\right) - \Omega\left(R(x \mid \text{sex} = s, \text{age} = a)\right)}$$

where x is an original value, $\hat{x}$ is the normalized value, R(x∣sex = s, age = a) describes the reference values for x given the sex s and age a of a patient, and Ω and O represent the lower and upper bounds respectively. For example, supposing a male adult presents an RBC count of 5.0 million/mm³ and knowing that the reference values lie in the range [4.30, 5.70], we first subtract 4.30 from 5.0 and divide the result by 1.4 (the difference between the maximum and minimum reference values), thus obtaining a normalized RBC count of 0.5. Consequently, normalized values above 1 represent abnormally high cell counts. Likewise, normalized values below 0 represent abnormally low counts. Our model analyzes normalized cell counts and their corresponding pairwise ratios as potential features for building our models. The performance of machine learning methods is heavily dependent on the choice of features on which they are applied[33]. For this reason, much of the current effort in deploying such algorithms goes into the design of preprocessing pipelines and data transformations that result in a representation of data that can support effective machine learning[33-35]. The process of using available features to create additional ones to improve model performance is often called 'feature engineering', a predominantly human-intensive and time-consuming step that is central to the data science workflow. It is a complex exercise, performed iteratively by trial and error, and mostly driven by domain knowledge[36].
Recently, many studies have shown the benefits of automating this process by creating candidate features in a domain-independent and data-driven manner, followed by an effective method of feature selection. This way it is possible not only to improve model correctness but also to discover powerful new features and processes that could be additional candidates for domain-specific studies[36-38]. We avoid potential spurious correlations by confirming that all selected features present a strictly non-zero impact on model output after n-fold cross-validation.
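
The reference-range scaling and the pairwise ratio features described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the reference bounds are passed in directly instead of being looked up by sex and age, and skipping zero denominators is an assumption, since the text does not say how they are handled.

```python
from itertools import combinations


def normalize(value, lower, upper):
    """Scale a raw analyte value so the reference range maps onto [0, 1]."""
    return (value - lower) / (upper - lower)


def pairwise_ratios(normalized):
    """Build one ratio feature per unordered pair of normalized analytes.

    Pairs whose denominator is exactly zero are skipped (an assumption).
    """
    features = {}
    for a, b in combinations(sorted(normalized), 2):
        if normalized[b] != 0:
            features[f"{a}/{b}"] = normalized[a] / normalized[b]
    return features


# Worked example from the text: male adult, RBC 5.0, reference range [4.30, 5.70].
rbc_norm = normalize(5.0, 4.30, 5.70)
```

Values above 1 or below 0 then flag counts outside the reference range, regardless of the patient's sex or age group.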

Inclusion–exclusion criteria

The scale of our dataset allows us to produce high-quality training sets and massive validation sets. Table 2 provides the gender and RT-PCR result distributions employed for training and evaluating our models. In addition to SARS-CoV-2, Influenza-A, Influenza-B, and Influenza-H1N1, our dataset also comprises a variety of other viruses, including Coronavirus OC43, Human Metapneumovirus A, Adenovirus, Parainfluenza 1, Coronavirus HKU1, Enterovirus B, Parainfluenza 2, Coronavirus NL63, Respiratory Syncytial Virus A, Mycoplasma pneumoniae, Respiratory Syncytial Virus B, Rhinovirus, Human Metapneumovirus B, Coronavirus 229E, Chlamydophila pneumoniae, Bordetella pertussis, Parainfluenza 3, Bocavirus, and Parainfluenza 4. We argue that taking this variety of confounding viruses into consideration is of utmost importance to learn models that are specific for COVID-19.

Table 2

Entire dataset, training sets, and validation sets for the two waves that occurred during the Brazilian COVID-19 outbreak.

	CBC (+)	CBC (−)
Gender	COVID-19 (+)	COVID-19 (−)	Influenza-A (+)	Influenza-B (+)	Influenza-H1N1 (+)	Other viruses (+)
Entire data
Male	11.3%	34.0%	46.7%	46.5%	48.4%	59.5%
	(122,793)	(369,787)	(3160)	(1384)	(4108)	(20,107)
Female	10.3%	44.4%	53.3%	53.5%	51.6%	40.5%
	(111,673)	(482,453)	(3604)	(1588)	(4380)	(13,691)
Training set: first wave data
Male	12.9%	9.8%	4.2%	2.1%	6.0%	12.8%
	(5859)	(4469)	(1895)	(975)	(2742)	(5825)
Female	12.1%	15.2%	4.9%	2.8%	6.9%	10.3%
	(5527)	(6918)	(2223)	(1214)	(3118)	(4656)
Validation set: first wave data
Male	4.9%	37.6%	1.0%	<0.1%	1.4%	2.3%
	(5808)	(44,637)	(1113)	(188)	(1660)	(2710)
Female	4.7%	43.4%	1.1%	<0.1%	1.6%	1.7%
	(5647)	(51,550)	(1343)	(134)	(1842)	(2028)
Training set: second wave data
Male	25.9%	10.5%	2.0%	1.0%	3.1%	6.3%
	(24,104)	(9770)	(1895)	(975)	(2742)	(5825)
Female	24.2%	15.1%	2.3%	1.3%	3.3%	5.0%
	(22,404)	(14,088)	(2223)	(1214)	(3118)	(4656)
Validation set: second wave data
Male	4.5%	38.9%	0.4%	<0.1%	0.6%	1.0%
	(11,860)	(101,655)	(1113)	(188)	(1660)	(2710)
Female	4.3%	48.1%	0.5%	<0.1%	0.7%	0.8%
	(11,021)	(125,776)	(1343)	(134)	(1842)	(2028)

Training sets were obtained after applying the inclusion–exclusion criteria to the entire data and downsampling the COVID-19(-) class in the training sets to account for class unbalance. We considered October 1st as the split point between the first and second wave data to eliminate possible incubation periods before the start of the second wave in early November. As such, validation for the first wave encompasses data from late June to late September, and validation for the second wave ranges from early October to late February. N = 1,138,728 CBCs.

Safe labeling

It is worth mentioning that CBCs and RT-PCRs are part of different exam batteries, and are therefore often collected on different dates for the same individual. Thus, an important decision is the ideal time frame between the collection of a CBC and that of the RT-PCR test used to validate its label. It is challenging to determine the precise moment the infection began, given the lack of information concerning the onset of symptoms. We also observed abnormalities in the CBCs associated with recovered individuals. These differences could be related to drug usage and/or other therapies, or be due to symptoms that persist even after the virus has been eliminated. In this context, we hypothesize that CBCs, even when associated with a positive RT-PCR, may be affected by treatment-related effects. Figure 1 shows the concentration distribution of some analytes along the disease progression time frame. The lower the ratio between white blood cells (WBC) and red blood cells (RBC), the higher the probability of the individual being positive for COVID-19. Additionally, we observed that the lowest value for this ratio lies on day 0. Since our working dataset consists of patients who went to one of Grupo Fleury’s laboratories to undertake an exam, we hypothesize that seeking an RT-PCR, in particular for patients who obtained a confirmatory diagnosis of COVID-19, might be associated with the onset of symptoms, explaining this particular pattern. We did not observe similar behavior for other evaluated viruses, perhaps due to the relative difference in public awareness/concern regarding SARS-CoV-2 and Influenza infections.

Fig. 1

Analytes average progression through COVID-19 disease course.

Average values of the most impactful analytes along with the disease time frame, from 30 days before the first positive RT-PCR result up to 30 days after. N = 120,726 patients.

Furthermore, we also observed that most analytes tend to present abnormal values for up to 30 days. This might be related to the natural evolution of COVID-19 onto the inflammatory stage, the effects of treatments, or even long-lasting effects on patients’ immunological systems. We concluded that the safest and most effective gap to use for labeling CBCs with RT-PCR outcomes is the 24-h window centered on the first positive RT-PCR result of an individual; the remaining time frames are highly uncertain regarding a positive diagnosis and were thus discarded.
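
The labeling rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the `(timestamp, is_positive)` tuple layout are hypothetical, and treating the window as 12 hours on each side, with inclusive boundaries, is one reading of "24-h window centered on the first positive RT-PCR result".

```python
from datetime import datetime, timedelta

# Half of the 24-h window centered on the first positive RT-PCR (assumed inclusive).
HALF_WINDOW = timedelta(hours=12)


def first_positive(pcr_results):
    """Return the timestamp of the earliest positive RT-PCR, or None."""
    positives = [when for when, is_positive in pcr_results if is_positive]
    return min(positives) if positives else None


def is_safe_positive(cbc_time, pcr_results):
    """True when the CBC falls inside the window around the first positive."""
    anchor = first_positive(pcr_results)
    return anchor is not None and abs(cbc_time - anchor) <= HALF_WINDOW


# Hypothetical patient history used only for illustration.
example_pcr = [
    (datetime(2020, 7, 1, 9), False),
    (datetime(2020, 7, 3, 9), True),
    (datetime(2020, 7, 10, 9), True),
]
```

CBCs of a positive individual that fall outside this window would be discarded rather than labeled.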

Removing gender and age biases

Supplementary Figure 1 presents the age distribution of each pathology subset. We observe a slight predominance of males among positive COVID-19 cases and of females among positive Influenza cases. To address this, we sub-sampled the training sets to remove possible biases that could jeopardize learning, and validated on unsampled data to properly verify model behavior in real-world scenarios.

Removing possible false-negative cases

Another point of attention is the possible existence of false-negative results for RT-PCR exams. In particular, we often see cases of the same individual having negative results interspersed with two or more positive results. Therefore, it is also necessary to carry out a preprocessing step to guarantee authenticity of negative labels and to ensure that the model is as faithful as possible to the real scenario of COVID-19, and not to the limitations of the RT-PCR exam. We filter out any negative RT-PCR results issued after the first positive RT-PCR result, thus focusing our analysis on pre-covid individuals and those on the preliminary stages of the infection. We also consider individuals that never had any contact with SARS-Cov-2 in our negative cohort, namely individuals with exams dating before 2020.
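
The negative-label filter described above can be sketched as follows. This is a minimal illustration: the function name and the `(date, is_positive)` tuple layout are hypothetical, and the history shown is made up for the example.

```python
from datetime import date


def drop_suspect_negatives(pcr_results):
    """Keep all positives and only the negatives issued before the first positive.

    `pcr_results` is a list of (date, is_positive) tuples for one individual.
    """
    positives = [when for when, is_positive in pcr_results if is_positive]
    if not positives:
        return list(pcr_results)  # never positive: every negative is kept
    first = min(positives)
    return [(when, pos) for when, pos in pcr_results if pos or when < first]


# Hypothetical history: a negative interspersed between two positives.
history = [
    (date(2020, 6, 1), False),
    (date(2020, 6, 10), True),
    (date(2020, 6, 15), False),
    (date(2020, 6, 20), True),
]
cleaned = drop_suspect_negatives(history)
```

Here the negative from 15 June is dropped because it follows the first positive result and is therefore a possible false negative.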

Outbreak waves

Table 2 also shows training and validation sets for the two waves that occurred during the COVID-19 outbreak in Brazil. The training set for the first wave comprises labeled CBCs acquired until 26 June 2020, whilst its validation set comprises labeled CBCs acquired between 27 June 2020 and 05 September 2020. The training set for the second wave comprises labeled CBCs acquired until 30 September 2020, whilst its validation set comprises labeled CBCs acquired between 01 October 2020 and 28 February 2021. Both training and validation sets contain data corresponding to viruses other than SARS-Cov-2: the training sets contain instances from 2016 to 2018, while the validation sets contain instances from 2019.
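
The temporal splits above can be written down as a small configuration. This is a sketch only: the dictionary layout and function names are hypothetical, while the dates and years are the ones stated in the paragraph.

```python
from datetime import date

# Cut-off dates for SARS-CoV-2-labeled CBCs, as stated in the text.
WAVES = {
    "first": {
        "train_until": date(2020, 6, 26),
        "valid_from": date(2020, 6, 27),
        "valid_until": date(2020, 9, 5),
    },
    "second": {
        "train_until": date(2020, 9, 30),
        "valid_from": date(2020, 10, 1),
        "valid_until": date(2021, 2, 28),
    },
}


def assign_covid_split(collected_on, wave):
    """Return 'train', 'validation', or None for a SARS-CoV-2-labeled CBC."""
    cfg = WAVES[wave]
    if collected_on <= cfg["train_until"]:
        return "train"
    if cfg["valid_from"] <= collected_on <= cfg["valid_until"]:
        return "validation"
    return None


def assign_other_virus_split(collected_on):
    """Other-virus CBCs: 2016 to 2018 go to training, 2019 to validation."""
    if 2016 <= collected_on.year <= 2018:
        return "train"
    if collected_on.year == 2019:
        return "validation"
    return None
```

Splitting strictly by date keeps every validation CBC later than the training CBCs of the same wave.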

Statistics and reproducibility

Our main objective is to demonstrate that training a model directly on COVID-19 data is not enough to guarantee robustness if multiple respiratory infections are present, as might be expected to occur in a possible COVID-19 endemic scenario. This is true even in the case of a massive dataset such as the one we employed for our study. Thus, we built resilient models for a pre-selected case of core confounding viruses and showed that we can retain similar COVID-19 detection performance in a scenario containing the prevalence of COVID-19 as well as achieving high discriminatory figures in low-prevalence scenarios with an abundance of other respiratory infections. Furthermore, we also demonstrate that the model indeed learns useful relationships between CBC patterns and other respiratory infections. To ensure the relevance of the results, we assess the statistical significance of our measurements through a pairwise t-test[39] with p-value ≤ 0.05 and through 5-fold cross-validation.
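
A paired t-test over 5 cross-validation folds can be sketched as follows. This is an illustration only: the fold scores are made-up placeholder numbers, not results from this study, and the function name is hypothetical. The statistic is compared against 2.776, the two-sided critical value of Student's t distribution for 4 degrees of freedom at the 0.05 level.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided critical value at alpha = 0.05 with 4 degrees of freedom (5 folds).
T_CRITICAL_DF4 = 2.776


def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold differences between two models.

    Raises ZeroDivisionError if all differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Placeholder per-fold AUROCs, invented for the example.
model_a = [0.91, 0.92, 0.90, 0.93, 0.91]
model_b = [0.85, 0.88, 0.83, 0.86, 0.87]

t_stat = paired_t_statistic(model_a, model_b)
significant = abs(t_stat) > T_CRITICAL_DF4
```

Pairing the folds matters because both models are scored on the same splits, so the test looks at per-fold differences instead of treating the two score lists as independent samples.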

Model training

Our models were trained with the objective of distinguishing CBCs (+) from CBCs (−) (refer to Table 2). We followed a stacking procedure, that is, the training stage consists of creating multiple specialized models for each of the viruses considered (i.e., COVID-19, Influenza-A, Influenza-B, Influenza-H1N1, and other viruses) and then combining their outputs to obtain a final prediction about the target disease. We divided the training samples into two equally sized batches. The first one was used to train the specialized models and the second one to train the final stacked model. Each specialized model only had access to label information regarding the corresponding virus, and the stacking model employs CBC (+) and CBC (−) labels. Both the specialized models and the final stacking model were trained with LightGBM[40], a fast implementation of a tree-based gradient boosting technique. We employed the SHAP algorithm[41-43] to obtain an interpretation of the model’s prediction, which gives us not only a probability that a specific CBC is associated with a positive RT-PCR for COVID-19 but also an explanation consisting of the feature importance leading to the model decision. We assessed performance by calculating AUROC, sensitivity and specificity in the validation sets as well as running 5-fold cross-validation in the training sets. Supplementary Figure 2 illustrates the proposed approach’s pipeline. We performed an extensive grid search for hyperparameter tuning for all the aforementioned models. Our final models employ 100 Gradient-Boosted Decision Tree estimators with a maximum tree depth of 50 and a maximum number of leaves of 50. The learning rate was set to 2e−1 (0.2), optimizing the binary cross-entropy function.
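
The stacking layout and the reported hyperparameters can be sketched as follows. This is not the authors' code: the dictionary keys follow the parameter names of LightGBM's scikit-learn interface (an assumption about how the values were passed), the target names and helper functions are hypothetical, and the specialized models are represented as plain callables so the sketch runs without LightGBM installed.

```python
# Hyperparameters reported in the text, keyed by LightGBM's scikit-learn-style
# parameter names (assumed); they could be passed as lightgbm.LGBMClassifier(**LGBM_PARAMS).
LGBM_PARAMS = {
    "n_estimators": 100,
    "max_depth": 50,
    "num_leaves": 50,
    "learning_rate": 0.2,
    "objective": "binary",  # binary cross-entropy
}

# One specialized model per target (names are illustrative).
TARGETS = ["covid19", "influenza_a", "influenza_b", "influenza_h1n1", "other_viruses"]


def split_batches(samples):
    """Divide the training samples into two equally sized batches."""
    half = len(samples) // 2
    return samples[:half], samples[half:]


def stacked_features(specialized, cbc):
    """Outputs of the per-virus models become the inputs of the final model.

    `specialized` maps each target name to a callable returning a probability.
    """
    return [specialized[target](cbc) for target in TARGETS]
```

The first batch would train the five specialized models; the second batch is passed through `stacked_features` and used to fit the final CBC (+) versus CBC (−) model, so the stacker never sees predictions made on data its inputs were trained on.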

Selecting specialized models

Not all CBC analytes are relevant features for differentiating the base targets (i.e., each virus), and some features may be detrimental to the task. To find a set of relevant features, we represent the model space as a directed acyclic graph (DAG) in which each vertex represents a distinct feature subset, and an edge A → B exists if B can be reached from A by adding a single feature, thus representing a transitive reduction of the more complex combinatorial complete model space. This modeling approach presents two desirable properties: first, any vertex is reachable from the root; second, for any feature set path, there exists a topological ordering, that is, an ordering of all vertices into a sequence such that for every edge, the start vertex occurs earlier in the sequence than the end vertex. These properties imply a partial ordering of the graph starting from the root node, which allows us to search it in an orderly manner. We apply the A* algorithm[44], employing as heuristic the AUROC of the model represented by the feature set of a given vertex. We hypothesize that there exists a set of optimal feature expansions that lead to the best-performing models for each specific base task. This allows us to search the N! combinatorial space of feature subsets to select the best-performing specialized models.
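
A search over this feature-subset DAG can be sketched as follows. This is a simplified best-first search in the spirit of A*, not the authors' implementation: it expands the highest-scoring subset first and stops after a fixed budget. The function names, the expansion budget, and the toy scoring function are all hypothetical; in practice `score` would be the cross-validated AUROC of a model trained on the subset.

```python
import heapq


def search_feature_subsets(features, score, max_expansions=100):
    """Best-first search from the empty set, adding one feature per edge."""
    start = frozenset()
    best_set, best_score = start, score(start)
    # Max-heap via negated scores; the counter breaks ties without comparing sets.
    frontier = [(-best_score, 0, start)]
    seen = {start}
    counter = 1
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_score, _tie, subset = heapq.heappop(frontier)
        if -neg_score > best_score:
            best_set, best_score = subset, -neg_score
        for feature in features:
            child = subset | {feature}
            if child not in seen:
                seen.add(child)
                heapq.heappush(frontier, (-score(child), counter, child))
                counter += 1
    return best_set, best_score


def toy_score(subset):
    """Stand-in for AUROC: two features help, any other feature hurts."""
    useful = {"lymphocytes", "eosinophils"}
    return 0.5 + 0.2 * len(subset & useful) - 0.05 * len(subset - useful)


best_set, best_score = search_feature_subsets(
    ["lymphocytes", "eosinophils", "mcv"], toy_score
)
```

The `seen` set ensures each subset is scored once even though it can be reached through several orderings of feature additions.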

Learning the final model

Our stacking definition extends all previously related COVID-19 learning approaches by building specialized models targeted at confounding viruses. When building the final model, we can expect to learn prediction relationships between COVID-19 and other respiratory infections. For example, in a scenario of a moderately high chance of Influenza, we would need an exceedingly high COVID-19 probability to confirm a positive infection hypothesis.

Table 3

COVID-19 endemic and pandemic simulations. AUROC, Specificity and Sensitivity, and the respective confidence intervals for different COVID-19 prevalence simulations under 95% confidence.

COVID-19 prevalence	AUROC	Specificity	Sensitivity
1%	0.928 ± 0.093	0.875 ± 0.018	0.913 ± 0.152
2%	0.881 ± 0.117	0.877 ± 0.024	0.812 ± 0.250
3%	0.917 ± 0.046	0.874 ± 0.016	0.873 ± 0.099
4%	0.922 ± 0.037	0.882 ± 0.033	0.896 ± 0.087
5%	0.918 ± 0.046	0.874 ± 0.012	0.879 ± 0.104
6%	0.909 ± 0.041	0.874 ± 0.032	0.857 ± 0.116
7%	0.910 ± 0.024	0.883 ± 0.018	0.840 ± 0.083
8%	0.904 ± 0.054	0.879 ± 0.036	0.849 ± 0.102
9%	0.907 ± 0.046	0.872 ± 0.025	0.871 ± 0.085
10%	0.896 ± 0.059	0.871 ± 0.025	0.848 ± 0.118
20%	0.916 ± 0.029	0.866 ± 0.025	0.878 ± 0.045
30%	0.906 ± 0.021	0.871 ± 0.018	0.862 ± 0.059
40%	0.911 ± 0.016	0.871 ± 0.024	0.873 ± 0.032
50%	0.913 ± 0.032	0.886 ± 0.030	0.863 ± 0.028
60%	0.901 ± 0.015	0.868 ± 0.031	0.852 ± 0.037
70%	0.906 ± 0.021	0.867 ± 0.033	0.858 ± 0.035
80%	0.902 ± 0.040	0.869 ± 0.074	0.854 ± 0.025
90%	0.911 ± 0.030	0.889 ± 0.081	0.864 ± 0.022

N = 30 simulations with 20,000 unique patients each.
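A prevalence simulation of this kind can be sketched as follows. The resampling scheme (sampling with replacement from pools of truly positive and truly negative patients) and the normal-approximation interval are assumptions made for illustration; the exact procedure behind Table 3 is not described in this section. The function name, pool contents, and sizes are likewise hypothetical.

```python
import random
import statistics


def simulate_prevalence(pos_preds, neg_preds, prevalence, n_patients, n_sims, seed=0):
    """Estimate sensitivity and specificity at a target prevalence.

    `pos_preds` / `neg_preds` hold binary model outputs (1 = flagged) for
    truly positive / truly negative patients. Returns (mean, half-width)
    pairs, where the half-width is a 95% normal-approximation interval.
    """
    if not 0 < prevalence < 1 or n_sims < 2:
        raise ValueError("need 0 < prevalence < 1 and at least 2 simulations")
    rng = random.Random(seed)
    n_pos = max(1, round(prevalence * n_patients))
    n_neg = n_patients - n_pos
    sens, spec = [], []
    for _ in range(n_sims):
        pos_sample = rng.choices(pos_preds, k=n_pos)  # with replacement
        neg_sample = rng.choices(neg_preds, k=n_neg)
        sens.append(sum(pos_sample) / n_pos)
        spec.append(1 - sum(neg_sample) / n_neg)

    def summary(values):
        half_width = 1.96 * statistics.stdev(values) / len(values) ** 0.5
        return statistics.mean(values), half_width

    return {"sensitivity": summary(sens), "specificity": summary(spec)}


# Hypothetical prediction pools: 85% of positives and 12% of negatives flagged.
pos_pool = [1] * 85 + [0] * 15
neg_pool = [1] * 12 + [0] * 88
low = simulate_prevalence(pos_pool, neg_pool, 0.01, 2000, 30)
high = simulate_prevalence(pos_pool, neg_pool, 0.50, 2000, 30)
```

At low prevalence each simulated cohort contains few positive patients, so the sensitivity estimate varies more between simulations, which is in line with the wider sensitivity intervals in the low-prevalence rows of the table.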

44 in total

1. From Local Explanations to Global Understanding with Explainable AI for Trees.

Authors: Scott M Lundberg; Gabriel Erion; Hugh Chen; Alex DeGrave; Jordan M Prutkin; Bala Nair; Ronit Katz; Jonathan Himmelfarb; Nisha Bansal; Su-In Lee
Journal: Nat Mach Intell Date: 2020-01-17

2. Assembly and evaluation of an inventory of guidelines that are available to support clinical hematology laboratory practice.

Authors: C P M Hayward; K A Moffat; T I George; M Proytcheva
Journal: Int J Lab Hematol Date: 2015-05 Impact factor: 2.877

3. The need to separate the wheat from the chaff in medical informatics.

Authors: Federico Cabitza; Andrea Campagner
Journal: Int J Med Inform Date: 2021-06-02 Impact factor: 4.046

4. Development, evaluation, and validation of machine learning models for COVID-19 detection based on routine blood tests.

Authors: Federico Cabitza; Andrea Campagner; Davide Ferrari; Chiara Di Resta; Daniele Ceriotti; Eleonora Sabetta; Alessandra Colombini; Elena De Vecchi; Giuseppe Banfi; Massimo Locatelli; Anna Carobene
Journal: Clin Chem Lab Med Date: 2020-10-21 Impact factor: 3.694

5. Latent bias and the implementation of artificial intelligence in medicine.

Authors: Matthew DeCamp; Charlotta Lindvall
Journal: J Am Med Inform Assoc Date: 2020-12-09 Impact factor: 4.497

6. No association between vitamin D status and COVID-19 infection in São Paulo, Brazil.

Authors: Cynthia M Álvares Brandão; Maria Izabel Chiamolera; Rosa Paula Mello Biscolla; José Viana Lima; Cláudia M De Francischi Ferrer; Wesley Heleno Prieto; Pedro de Sá Tavares Russo; José de Sá; Carolina Dos Santos Lazari; Celso Francisco H Granato; José Gilberto H Vieira
Journal: Arch Endocrinol Metab Date: 2021-03-19 Impact factor: 2.309

7. Machine learning in cardiovascular magnetic resonance: basic concepts and applications.

Authors: Tim Leiner; Daniel Rueckert; Avan Suinesiaputra; Bettina Baeßler; Reza Nezafat; Ivana Išgum; Alistair A Young
Journal: J Cardiovasc Magn Reson Date: 2019-10-07 Impact factor: 5.364

8. Lymphopenia predicts disease severity of COVID-19: a descriptive and predictive study.

Authors: Li Tan; Qi Wang; Duanyang Zhang; Jinya Ding; Qianchuan Huang; Yi-Quan Tang; Qiongshu Wang; Hongming Miao
Journal: Signal Transduct Target Ther Date: 2020-03-27

9. The continuing 2019-nCoV epidemic threat of novel coronaviruses to global health - The latest 2019 novel coronavirus outbreak in Wuhan, China.

Authors: David S Hui; Esam I Azhar; Tariq A Madani; Francine Ntoumi; Richard Kock; Osman Dar; Giuseppe Ippolito; Timothy D Mchugh; Ziad A Memish; Christian Drosten; Alimuddin Zumla; Eskild Petersen
Journal: Int J Infect Dis Date: 2020-01-14 Impact factor: 3.623

10. Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal

Authors: Laure Wynants; Ben Van Calster; Gary S Collins; Richard D Riley; Georg Heinze; Ewoud Schuit; Marc M J Bonten; Darren L Dahly; Johanna A A Damen; Thomas P A Debray; Valentijn M T de Jong; Maarten De Vos; Paul Dhiman; Maria C Haller; Michael O Harhay; Liesbet Henckaerts; Pauline Heus; Michael Kammer; Nina Kreuzberger; Anna Lohmann; Kim Luijken; Jie Ma; Glen P Martin; David J McLernon; Constanza L Andaur Navarro; Johannes B Reitsma; Jamie C Sergeant; Chunhu Shi; Nicole Skoetz; Luc J M Smits; Kym I E Snell; Matthew Sperrin; René Spijker; Ewout W Steyerberg; Toshihiko Takada; Ioanna Tzoulaki; Sander M J van Kuijk; Bas van Bussel; Iwan C C van der Horst; Florien S van Royen; Jan Y Verbakel; Christine Wallisch; Jack Wilkinson; Robert Wolff; Lotty Hooft; Karel G M Moons; Maarten van Smeden
Journal: BMJ Date: 2020-04-07
