Health outcome prediction using multiple perturbations.

Abstract

Public health workers and medical practitioners are frequently required to make predictions regarding various health outcomes. However, a prediction with nearly 100% certainty is seldom possible. If a person has a health outcome of concern or is in the process of developing the outcome, many attributes of that person may undergo subtle changes (the perturbations). We propose a method, namely "prediction using multiple perturbations," and investigate its asymptotic properties when the number of attributes tends to infinity. This is a proof-of-concept study. The proposed method can predict the health outcome of a person to near certainty if personal data with billions or trillions of attributes can be collected and 4 conditions (described subsequently in this paper) are met. Collecting personal data with billions or trillions of attributes may someday become possible in the current era of big data. If such information can be obtained, theoretically we can predict the health outcome of a person to near certainty.

Year: 2020 PMID: 31914054 PMCID: PMC6959947 DOI: 10.1097/MD.0000000000018664

Source DB: PubMed Journal: Medicine (Baltimore) ISSN: 0025-7974 Impact factor: 1.817

Introduction

Public health workers and medical practitioners are frequently required to make predictions regarding various health outcomes. For example, they may be required to predict whether a 65-year-old woman with a 50-year history of alcohol drinking, betel nut chewing, and cigarette smoking will develop oral cancer in 1 year. They may also be required to predict whether a young man without previous medical history will survive an emergency operation for a ruptured dissecting aortic aneurysm. Many risk or prognostic factors have been found for nearly every health outcome. On the basis of such factors, a risk prediction model (e.g., the Framingham score for cardiovascular risk) or a disease-staging system (e.g., the International Federation of Gynecology and Obstetrics staging for cervical cancer prognosis) can be constructed to make accurate predictions. However, a prediction with approximately 100% certainty is seldom possible.

We report a new avenue for health outcome prediction. The method hinges on collecting multiple “perturbations” of the health outcome of interest. Notably, if a person has the health outcome of concern (e.g., clinically diagnosed liver cancer) or is in the process of developing the outcome (e.g., a small malignant liver tumor not yet manifested clinically), many attributes of that person may undergo subtle changes (referred to as the “perturbed” attributes in this paper). Examples include the person's physical/emotional characteristics, physiological/biochemical profiles, various “omics” (e.g., epigenomics, transcriptomics, proteomics, metabolomics, and exposomics), behavior patterns, social activities, and other data pertaining to that person. Notably, changes induced by the health outcome may be nondeterministic (i.e., only vary in probability), and the magnitudes of the changes may be minuscule. We utilize such health big data and propose a method, namely “prediction using multiple perturbations” (PUMP). This is a proof-of-concept study. In this paper, we investigate the asymptotic properties of PUMP when the number of attributes tends to infinity.

Methods

Training samples

To train a PUMP to predict a health outcome (Y), we need a case sample (people with Y; sample size n1, indexed by j) and 2 independent control samples (people without Y; sample sizes n0,1 and n0,2, indexed by k1 and k2, respectively). Information on a total of m attributes (indexed by i) is accrued for all people in the case sample and the 2 control samples. The attributes are assumed to be binary (0 or 1); otherwise, we censor values that are excessively high to an (arbitrarily defined) high limit and those that are excessively low to a low limit, and then map all values to the unit interval. The point of such censoring is to bound every attribute value between zero and one, for all i, j, k1, and k2. Next, we calculate the mean of each attribute in each of the 3 training samples, for i = 1, 2,…, m.
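The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `to_unit_interval` and `attribute_means`, the limits 0 and 10, and the toy data are all hypothetical and introduced here only for illustration.

```python
def to_unit_interval(value, low, high):
    """Censor a raw value to [low, high], then map it linearly onto [0, 1]."""
    if high <= low:
        raise ValueError("high limit must exceed low limit")
    censored = min(max(value, low), high)
    return (censored - low) / (high - low)


def attribute_means(sample):
    """Per-attribute means of a sample.

    `sample` is a list of people; each person is a list of m attribute
    values that already lie in [0, 1].
    """
    if not sample:
        raise ValueError("sample must contain at least one person")
    n = len(sample)
    m = len(sample[0])
    return [sum(person[i] for person in sample) / n for i in range(m)]


# Toy case sample (hypothetical): 2 people, 2 attributes, limits 0 and 10.
raw_cases = [[12.0, 3.0], [8.0, -1.0]]
cases = [[to_unit_interval(v, 0.0, 10.0) for v in person] for person in raw_cases]
case_means = attribute_means(cases)
```

The same two helpers would be applied to each of the 2 control samples to obtain their per-attribute means.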

The prediction method

To make a prediction for a new person, we accrue the information of the corresponding attributes pertaining to the person, for i = 1, 2,…, m. We calculate a perturbation score at each and every attribute for the new person, for i = 1, 2,…, m. These scores are then averaged across the attributes to yield the person's average score. We predict the new person to eventually develop Y if his/her average score is larger than a certain threshold value (a small positive number near zero, to be discussed later); otherwise, we predict the person to not develop Y.
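The exact perturbation-score formula is not reproduced in the text above, so the sketch below assumes one form that is consistent with the description: the new person's deviation from the first control mean, multiplied by the difference between the case mean and the second control mean. That product form, the function names `perturbation_scores` and `predict`, and the toy numbers are assumptions made here, not the paper's definitions.

```python
def perturbation_scores(new_person, case_means, control1_means, control2_means):
    """Assumed score at attribute i:
    (new person - control-1 mean) * (case mean - control-2 mean)."""
    m = len(new_person)
    if not (len(case_means) == len(control1_means) == len(control2_means) == m):
        raise ValueError("all inputs must cover the same m attributes")
    return [
        (new_person[i] - control1_means[i]) * (case_means[i] - control2_means[i])
        for i in range(m)
    ]


def predict(new_person, case_means, control1_means, control2_means, threshold):
    """Return (prediction, average score); prediction is True when the
    average score exceeds the threshold."""
    scores = perturbation_scores(
        new_person, case_means, control1_means, control2_means
    )
    average = sum(scores) / len(scores)
    return average > threshold, average


# Toy means (hypothetical): attribute 0 is perturbed, attribute 1 is not.
case_means = [0.8, 0.5]
control1_means = [0.2, 0.5]
control2_means = [0.2, 0.5]

developing, avg_dev = predict([0.6, 0.5], case_means, control1_means,
                              control2_means, threshold=0.05)
healthy, avg_healthy = predict([0.2, 0.5], case_means, control1_means,
                               control2_means, threshold=0.05)
```

Under this assumed form, a person whose attributes match the control means scores zero on average, while a person shifted toward the case means on perturbed attributes scores above zero, which is why a small positive threshold can separate the two.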

Ethical review

This paper is a methodological study and does not involve the enrollment of study subjects. Ethical approval is not necessary.

Results

As the number of attributes tends to infinity (m → ∞), the probability of a correct prediction using the proposed method tends to 1 if the following 4 conditions are met:

(C1) An attribute may or may not be perturbed by the process of developing the health outcome. For a perturbed attribute, the mean for the people with the outcome and the mean for the people without the outcome flank the mean for those in the process of developing the outcome. For a non-perturbed attribute, the means are identical for all 3 types of people.

(C2) The number of perturbed attributes has an asymptotic growth rate no lower than that of the total number of attributes.

(C3) The mean of a perturbed attribute for the people in the process of developing the outcome is bounded away from the mean for those without the outcome.

(C4) An attribute correlates with a bounded number of other attributes, for the people with the outcome, the people in the process of developing the outcome, or the people without the outcome.

Note that (C1), (C2), and (C3) above concern the relation between the many attributes to be collected and the health outcome of concern, whereas (C4) above concerns the relation between the attributes themselves.

A proof that the probability of a correct prediction tends to one

Consider the mean and the variance of the ith attribute for the people with Y, for the people in the process of developing Y, and for the people without Y, respectively. Because all attributes are between zero and one inclusive, each of these 3 means also lies between zero and one, for all i. For attributes that are bounded between zero and one, the variances are the largest when they are Bernoulli distributed (either 0 or 1, but not between). Therefore, each of the 3 variances is no larger than the variance of a Bernoulli variable with the same mean, for all i.

From C1, let an indicator take the value 1 if the ith attribute is a perturbed attribute (the mean for the people developing Y lies between the mean for the people with Y and the mean for the people without Y), and 0 otherwise (the 3 means are identical). The number of perturbed attributes is the sum of this indicator over the m attributes, which is a function of m. From C2, there exist positive constants π and M such that the number of perturbed attributes is at least πm for all m ≥ M. From C3, there exists a positive constant ξ such that the mean for the people developing Y differs from the mean for the people without Y by at least ξ, for all perturbed attributes i.

Because the two quantities that enter the perturbation score are independent of each other (the former and the latter being based on different people), the mean and the variance of the perturbation score at the ith attribute for a new person can be expressed through the means of the attributes and the sample sizes of the training data. From C4, assume that any attribute can correlate with at most C other attributes, for either the people with Y, the people in the process of developing Y, or the people without Y. Then, the sum of the covariances between the perturbation score at the ith attribute and the scores at all m attributes is bounded by a constant that does not depend on m, for all i.

We now calculate the mean of the average score for the new person as a function of m. The mean is zero for all m if the new person is without Y, and it is no smaller than a positive constant determined by π and ξ for all m ≥ M if the new person is developing Y. As for the variance of the average score for the new person, irrespective of whether he/she is without Y or developing Y, it is bounded above by a quantity inversely proportional to m, for all m. For any threshold t between zero and that positive constant and any m ≥ M, the probability of a correct prediction for a person without Y and the probability of a correct prediction for a person developing Y each have a lower bound that approaches one as m increases (a simple numerical analysis completes the bound). Therefore, we see that as the number of attributes tends to infinity (m → ∞), the probability of a correct prediction tends to one.
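The final step of the argument can be written out with Chebyshev's inequality. This is one standard route to such a limit and is offered only as a sketch; it is not necessarily the derivation the author used. The symbols are assumptions introduced here: \bar{S}_m is the average perturbation score over m attributes, g > 0 is a lower bound on its mean for a person developing Y, V is a constant such that the variance of \bar{S}_m is at most V/m, and t is a threshold with 0 < t < g.

```latex
% Assumed: E[\bar{S}_m] = 0 for a person without Y,
%          E[\bar{S}_m] \ge g > 0 for a person developing Y (m \ge M),
%          Var(\bar{S}_m) \le V/m, and 0 < t < g.
\begin{align*}
\text{Person without } Y:\quad
\Pr(\bar{S}_m \le t)
  &= 1 - \Pr(\bar{S}_m > t) \\
  &\ge 1 - \Pr\bigl(\lvert \bar{S}_m - \mathrm{E}[\bar{S}_m] \rvert \ge t\bigr) \\
  &\ge 1 - \frac{\mathrm{Var}(\bar{S}_m)}{t^{2}}
   \;\ge\; 1 - \frac{V}{m\,t^{2}}, \\[6pt]
\text{Person developing } Y:\quad
\Pr(\bar{S}_m > t)
  &= 1 - \Pr(\bar{S}_m \le t) \\
  &\ge 1 - \Pr\bigl(\lvert \bar{S}_m - \mathrm{E}[\bar{S}_m] \rvert \ge g - t\bigr) \\
  &\ge 1 - \frac{V}{m\,(g - t)^{2}}.
\end{align*}
% Both lower bounds tend to 1 as m \to \infty.
```

Under these assumptions both error probabilities shrink at rate 1/m, which also explains why the required number of attributes grows quickly when g is small.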

Probability of a correct prediction and number of attributes needed

Assuming , we have that . Let , the probability of a correct prediction is then (assuming m ≥ M). Figure 1 shows that as the number of attributes increases, the lower bound for the probability of a correct prediction increases.

Figure 1

Lower bound for the probability of a correct prediction (red: corresponding to Scenario I in Table 1; yellow: Scenario II; blue: Scenario III; gray: Scenario IV; orange: Scenario V; green: Scenario VI; purple: Scenario VII; brown: Scenario VIII).

To control a false positive rate (the probability of a wrong prediction for a person without the outcome) no larger than α (0 < α < 1) and a false negative rate (the probability of a wrong prediction for a person developing the outcome) no larger than β (0 < β < 1), we can set the threshold value at . The total number of attributes needed is then (assuming ). Table 1 shows that an extremely large number of attributes (m[0.01,0.01]) is required to control the false positive and the false negative rates both no larger than 0.01. In Fig. 1, the lower bound for the probability of a correct prediction is larger than zero for (assuming m[1,1] ≥ M).
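How a threshold and a required number of attributes can follow from target error rates is illustrated below. The paper's own formulas are not reproduced in the text, so this sketch assumes Chebyshev-type bounds: a false positive rate of at most V/(m t²) and a false negative rate of at most V/(m (g − t)²), where g is a lower bound on the mean average score of a person developing the outcome and V/m bounds the variance of the average score. The function name `threshold_and_attributes`, the inputs `mean_gap` (g) and `var_const` (V), and the example values are hypothetical.

```python
import math


def threshold_and_attributes(alpha, beta, mean_gap, var_const):
    """Solve V/(m t^2) = alpha and V/(m (g - t)^2) = beta for t and m.

    Returns the threshold t and the (real-valued) number of attributes m
    at which both assumed error bounds are met with equality.
    """
    if not (0.0 < alpha <= 1.0 and 0.0 < beta <= 1.0):
        raise ValueError("alpha and beta must lie in (0, 1]")
    if mean_gap <= 0.0 or var_const <= 0.0:
        raise ValueError("mean_gap and var_const must be positive")
    root_a = math.sqrt(alpha)
    root_b = math.sqrt(beta)
    # Equating the two expressions for m gives sqrt(alpha) * t = sqrt(beta) * (g - t).
    t = mean_gap * root_b / (root_a + root_b)
    m = var_const * (root_a + root_b) ** 2 / (alpha * beta * mean_gap ** 2)
    return t, m


# Hypothetical inputs: both error rates at most 0.01, g = 0.01, V = 1.
t, m = threshold_and_attributes(alpha=0.01, beta=0.01, mean_gap=0.01, var_const=1.0)
```

Because m scales with 1/(αβg²) under these assumed bounds, tightening either error rate or weakening the signal inflates the attribute count rapidly; in practice one would round m up to the next integer.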

Table 1

Numbers of attributes needed to control a false positive rate (the probability of a wrong prediction for a person without the outcome) and a false negative rate (the probability of a wrong prediction for a person developing the outcome) both no larger than 0.01 under various scenarios.

Discussion

Conventional asymptotic analysis assumes the number of subjects to tend to infinity. Hall et al and Ahn et al proposed an alternative approach assuming the number of “dimensions” (corresponding to “attributes” in this paper), instead, to tend to infinity. Previously, we built on this alternative asymptotic to develop new methods for detecting weak associations, detecting and correcting the bias of unmeasured factors, and testing treatment effects in randomized controlled trials. Along this line of inquiry, in this paper we propose a new avenue for health outcome prediction.

The C1 condition is the fundamental assumption of our method; it stipulates that the natural course of the health outcome of concern should produce such perturbations. The C1 condition also implies that there is an intermediate state between the absence and the presence of the outcome: the outcome-developing state. For a health outcome that develops very quickly, the C1 condition may not apply. The C2 condition signifies that collecting numerous attributes does not always help; only the perturbed attributes count as the informative “signals” and the non-perturbed attributes are the uninformative “noises.” To meet this condition, as the number of attributes tends to infinity, the proportion of the perturbed attributes (signal prevalence) must be no smaller than a certain positive value. The C3 condition requires that the perturbations be non-negligible, even if small; the perturbation magnitude (signal strength), as measured by the deviation of the mean of those developing the outcome from that of those without the outcome, relative to the distance between the high and low limits set up for that attribute, must be no smaller than a certain positive value. Finally, the C4 condition stipulates that diverse types of attributes must be collected to minimize the correlations between them; as the number of attributes tends to infinity, the number of other attributes that an attribute correlates with must be bounded.

The current version of the PUMP only accepts binary attributes. Additional work is needed to expand the range of applicability to include categorical/continuous attributes, and those attributes that may change over time either because of the outcome-developing processes or by their own nature. The PUMP is also very naive; it does not care to differentiate signals (the perturbed attributes) from noises (the non-perturbed attributes) before taking them all in. Methods to select attributes for the PUMP need further development. From an artificial intelligence perspective, the PUMP is the simplest “machine learner” which takes in one layer of attributes, performs a simple linear combination of them, and outputs an average perturbation score. Recent “deep learners” allow multiple processing layers and complex nonlinear combinations between the input attributes and each and every node in the deep layers. Further studies along this line are also warranted.

From Table 1, we see that the current version of the PUMP requires an extremely large number of attributes. Future updates of the PUMP envisioned above may reduce the number of attributes needed for a near perfect prediction to a practically feasible level. At present, it is still not possible to collect billions or trillions of attributes of a person. But moving into this big data era, we are gradually closing that gap. This paper shows that if such personal big data can be obtained and the C1–C4 conditions are met, theoretically we can use a PUMP to predict the health outcome of a person to near certainty.

Author contributions

This is a single-authorship paper by WCL. Conceptualization: Wen-Chung Lee. Data curation: Wen-Chung Lee. Formal analysis: Wen-Chung Lee. Funding acquisition: Wen-Chung Lee. Investigation: Wen-Chung Lee. Methodology: Wen-Chung Lee. Project administration: Wen-Chung Lee. Resources: Wen-Chung Lee. Software: Wen-Chung Lee. Supervision: Wen-Chung Lee. Validation: Wen-Chung Lee. Visualization: Wen-Chung Lee. Writing – original draft: Wen-Chung Lee. Writing – review & editing: Wen-Chung Lee.

References (9 in total)

1. FIGO staging classifications and clinical practice guidelines in the management of gynecologic cancers. FIGO Committee on Gynecologic Oncology.

Authors: J L Benedet; H Bender; H Jones; H Y Ngan; S Pecorelli
Journal: Int J Gynaecol Obstet Date: 2000-08 Impact factor: 3.561

Review 2. Deep learning.

Authors: Yann LeCun; Yoshua Bengio; Geoffrey Hinton
Journal: Nature Date: 2015-05-28 Impact factor: 49.962

3. The inevitable application of big data to health care.

4. Prediction of coronary heart disease using risk factor categories.

5. Big data and new knowledge in medicine: the thinking, training, and tools needed for a learning health system.

Review 6. From big data analysis to personalized medicine for all: challenges and opportunities.

7. Detecting a weak association by testing its multiple perturbations: a data mining approach.

8. Detecting and correcting the bias of unmeasured factors using perturbation analysis: a data-mining approach.

9. A test for treatment effects in randomized controlled trials, harnessing the power of ultrahigh dimensional big data.