
A New Nonparametric Multivariate Control Scheme for Simultaneous Monitoring Changes in Location and Scale.

Jin Yue1,2, Liu Liu1,2.   

Abstract

Real-time monitoring of the breast cancer index is becoming increasingly important, as it can help advance the diagnosis and treatment of breast cancer. In modern medical processes, simultaneously monitoring changes in the location and scale of observations is convenient for implementing control schemes but can be challenging. In this paper, we consider a new nonparametric control scheme for monitoring location and scale parameters in multivariate processes. The proposed method is easy to implement, and the performance of the proposed control procedure is discussed and compared with that of some competing methods. Simulation results show that the proposed scheme can efficiently detect a range of shifts. The proposed chart can trigger an alert and detect changes in the breast cancer index in a timely manner.
Copyright © 2022 Jin Yue and Liu Liu.

Year:  2022        PMID: 35832137      PMCID: PMC9273427          DOI: 10.1155/2022/3385825

Source DB:  PubMed          Journal:  Comput Math Methods Med        ISSN: 1748-670X            Impact factor:   2.809


1. Introduction

Control schemes play an important role in biosurveillance studies [1-9] and have been frequently used for fault detection in product quality control and health-care monitoring [10-14]. A process should be monitored using statistical means to determine whether a shift occurs, and action should be taken once the process is considered out-of-control (OC) [15-18]. Researchers have proposed many useful charts, such as Shewhart charts [19, 20], cumulative sum (CUSUM) charts [21-30], and exponentially weighted moving average (EWMA) charts [31-38], to detect whether there is a change in the quality characteristics of a process. These control schemes can be used for data analysis, including control and forecasting, which is useful for fault diagnosis in practice. Most charts require the observations to be univariate and typically assume that they follow a normal distribution. Unfortunately, the assumption of multivariate normality is unrealistic in most cases and leads to poor performance when the underlying assumptions are invalid. Nonparametric control charts are important in the manufacturing and service sectors when the observations are nonnormal. Some control schemes are used to monitor high-dimensional processes when little is known about the underlying distribution [39-42]. Most control schemes are designed to monitor location parameters. For example, Liu and Singh [43] introduced several multivariate rank tests based on data depth. Liu [44] used the concept of data depth to propose several new control charts for monitoring multivariate processes. Data depth provides an efficient metric of process performance without parametric assumptions. In addition, Zou et al. [45] provided a multivariate spatial rank for monitoring high-dimensional processes with unknown parameters. For detecting location changes in nonparametric multivariate processes, we also recommend the discussions in [46, 47].
To detect changes in the location and scale of observations simultaneously, several monitoring methods have been proposed in the literature, including those of Mukherjee and Chakraborti [48] and Chowdhury et al. [49]. Mukherjee and Marozzi [50] considered the sum of the squares of the standardized Wilcoxon and Bradley statistics for monitoring high-dimensional processes with unknown parameters, which is advantageous for the simultaneous monitoring of multiple aspects. Recently, some schemes have been proposed to monitor changes in location and scale simultaneously using a single chart, and the performance advantages of these charts have been clearly established [51]. Lepage [52] discussed a nonparametric two-sample test for location and dispersion. Based on Lepage [52], Mukherjee and Marozzi [51] introduced new circular-grid charts for simultaneous monitoring of process location and process scale based on Lepage-type statistics. Meanwhile, Mukherjee and Marozzi [53] investigated a new single distribution-free Phase-II CUSUM procedure based on the Cucconi statistic for simultaneously monitoring changes in the location and scale parameters of a process. In addition, Mukherjee and Sen [54] discussed a distribution-free (nonparametric) Shewhart-Lepage scheme for simultaneous monitoring of location and scale parameters using an adaptive strategy. Li et al. [55] and Shi et al. [56] provided powerful control schemes aimed at simultaneously monitoring the location and scale parameters of any continuous process. Moreover, Zafar et al. [57] proposed a new parametric memory-type charting structure based on the progressive mean under the max statistic for the joint monitoring of location and dispersion parameters. Song et al. [58] introduced distribution-free adaptive Shewhart-Lepage-type schemes for simultaneous monitoring of location and scale parameters using information about the symmetry and tail weights of the process distribution. Huang et al. [59] proposed a new statistical process monitoring scheme with a double-sampling plan for simultaneously monitoring location and scale shifts. Bai and Li [60] considered the monitoring of ordinal categorical factors, accounting for shifts in the location or scale parameters of latent variables. For multivariate processes, Cheng and Shiau [61] proposed a distribution-free phase I monitoring scheme for both location and scale parameters based on the multisample Lepage statistic.
Although the literature contains many control schemes for monitoring location and scale parameters simultaneously, much less attention has been paid to control strategies that do so in multivariate processes. In this study, we propose a useful and easy-to-implement control scheme for simultaneously monitoring location and scale parameters, which is based on nonparametric location and scale hypothesis testing. Reference samples are denoted as phase I data streams, and test samples are denoted as phase II data streams. One problem is that the size of phase II increases with the number of data streams. To address this issue, we perform hypothesis testing repeatedly with each new data stream, so that the amount of phase II data remains constant at each acquisition time. The remainder of this paper is organized as follows: In Section 2, we review nonparametric hypothesis testing in detail. In Section 3, we propose a new scheme based on a hypothesis testing statistic for monitoring location and scale parameters and discuss the performance and validity of the proposed method. In Section 4, we perform a simulation-based comparison of the proposed chart with other existing charts. In Section 5, breast cancer data are investigated to illustrate the performance of the proposed chart. Lastly, we briefly draw conclusions in Section 6.

2. Review of Nonparametric Hypothesis Testing

Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. Consider a reference sample $\{X_{1,t}, X_{2,t}, \ldots, X_{m,t}\}$ of size $m$ and a test sample $\{Y_{1,t}, Y_{2,t}, \ldots, Y_{n,t}\}$ of size $n$. We test the null hypothesis $H_0: \mu_1 = \mu_2,\ \sigma_1^2 = \sigma_2^2$ versus the alternative hypothesis $H_1: \mu_1 \neq \mu_2$ or $\sigma_1^2 \neq \sigma_2^2$, where $\mu_1$ is the location parameter of the reference sample, $\mu_2$ is the location parameter of the test sample, and $\sigma_1^2$ and $\sigma_2^2$ are the scale parameters of the reference and test samples, respectively. A reasonable statistical decision procedure can be used to decide whether to reject the null hypothesis $H_0$. In real situations, it is difficult to identify the exact distribution of data streams. Therefore, nonparametric hypothesis testing, which does not rely on the distribution of the original data, is introduced. For hypothesis testing about the location parameter, Mood [62] proposed the median test, which is based on the rank of each datum. Considering the interaction between the reference and test samples, Wilcoxon [63] and Mann and Whitney [64] introduced the Mann-Whitney-Wilcoxon statistic. In addition, rank-based nonparametric hypothesis tests for the scale parameter are used in the literature [65-67].

2.1. Methods for Location Detection

In practice, one often needs to check whether the location parameter of a process has changed. The t-statistic is commonly used under the assumption that the distribution is normal; however, using it is risky when the population distribution is unknown. Thus, some distribution-free statistics have been developed. The Brown-Mood median test is a useful nonparametric method, but the two-sided test does not yield satisfactory results when $m \neq n$. To use more information about the relative sizes of the observations in the reference and test samples, the Wilcoxon rank-sum test was developed. We assume that a reference sample of size $m$ and a test sample of size $n$ are given, and we let $N = m + n$. Considering the pooled sample $\{X_{1,t}, X_{2,t}, \ldots, X_{m,t}, Y_{1,t}, Y_{2,t}, \ldots, Y_{n,t}\}$ at time $t$, Mann and Whitney [64] developed the Mann-Whitney statistic as follows:
$$W_{1,t} = \sum_{i=1}^{m} \sum_{j=1}^{n} I\left(X_{i,t} < Y_{j,t}\right).$$
Therefore, the Wilcoxon rank-sum statistic is
$$W_{2,t} = W_{1,t} + \frac{n(n+1)}{2},$$
where $W_{2,t} = \sum_{j=1}^{n} R_{j,t}$ and $R_{j,t}$ is the rank of $Y_{j,t}$ in the pooled sample. Under the null hypothesis, $E(W_{2,t} \mid H_0) = n(N+1)/2$. It can be seen that [68]
$$\mathrm{Var}(W_{2,t} \mid H_0) = \frac{mn(N+1)}{12}.$$
Under the null hypothesis, we also calculate the approximately normal statistic when the pooled sample size $N$ is sufficiently large:
$$Z_{W,t} = \frac{W_{2,t} - n(N+1)/2}{\sqrt{mn(N+1)/12}} \rightarrow \mathcal{N}(0, 1).$$
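As a minimal sketch of the quantities in this subsection, the following Python code computes the Mann-Whitney count, the Wilcoxon rank-sum statistic, and its normal approximation for two small samples. The function names (`pooled_ranks`, `wilcoxon_rank_sum`) and the toy data are illustrative assumptions and do not come from the original study; ties receive midranks, although ties have probability zero under a continuous distribution.

```python
import math

def pooled_ranks(x, y):
    """Return the ranks of the y-values in the pooled sample (midranks for ties)."""
    pooled = sorted(x + y)
    ranks = []
    for v in y:
        below = sum(1 for u in pooled if u < v)
        equal = sum(1 for u in pooled if u == v)
        ranks.append(below + (equal + 1) / 2)
    return ranks

def wilcoxon_rank_sum(x, y):
    """Return (Mann-Whitney count, rank sum W, standardized Z) for test sample y."""
    m, n = len(x), len(y)
    N = m + n
    u = sum(1 for xi in x for yj in y if xi < yj)  # Mann-Whitney count
    w = sum(pooled_ranks(x, y))                    # Wilcoxon rank sum of y
    mean_w = n * (N + 1) / 2                       # E(W | H0)
    var_w = m * n * (N + 1) / 12                   # Var(W | H0)
    z = (w - mean_w) / math.sqrt(var_w)
    return u, w, z

x = [1.2, 0.4, 2.5, 1.9, 0.8]  # hypothetical reference sample (m = 5)
y = [2.1, 3.0, 2.7]            # hypothetical test sample (n = 3)
u, w, z = wilcoxon_rank_sum(x, y)
print(u, w, round(z, 3))
```

In this toy example the rank sum equals the Mann-Whitney count plus n(n + 1)/2, which is the identity linking the two statistics.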

2.2. Methods for Scale Detection

A location parameter typically describes the position of a distribution, and a scale parameter is another important characteristic that describes a distribution. When the distribution of observations is unknown, distribution-free methods are typically used. Consider two independent samples $\{X_{1,t}, X_{2,t}, \ldots, X_{m,t}\} \sim F(\mu_1, \sigma_1^2)$ and $\{Y_{1,t}, Y_{2,t}, \ldots, Y_{n,t}\} \sim F(\mu_2, \sigma_2^2)$, and assume that their location parameters are equal ($\mu_1 = \mu_2$). Based on the Mann-Whitney statistic, Siegel and Tukey [65] proposed the Siegel-Tukey statistic, which is computed in the following steps: (1) arrange the pooled sample $\{X_{1,t}, \ldots, X_{m,t}, Y_{1,t}, \ldots, Y_{n,t}\}$ in ascending order, $Q_{(1),t}, Q_{(2),t}, \ldots, Q_{(m+n),t}$; (2) assign the ranks $R'$ to $Q_{(1),t}, Q_{(2),t}, \ldots, Q_{(m+n),t}$ as shown in Table 1; and (3) calculate $ST_t = \sum_{j=1}^{n} R'_{j,t} - n(n+1)/2$, where $R'_{j,t}$ represents the rank of $Y_{j,t}$.
Table 1

Rank of $Q_{(1),t}, Q_{(2),t}, \ldots, Q_{(m+n),t}$.

Data   Q(1),t   Q(2),t   Q(3),t   Q(4),t   ⋯   Q(m+n−3),t   Q(m+n−2),t   Q(m+n−1),t   Q(m+n),t
Rank   1        4        5        8        ⋯   7            6            3            2
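The alternating rank assignment of Table 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and the toy data are hypothetical, ties are assumed absent, and step (3) is read as the rank sum of the test sample minus n(n + 1)/2.

```python
def siegel_tukey_ranks(total):
    """Siegel-Tukey ranks for the order statistics Q(1), ..., Q(total).

    Rank 1 goes to the smallest value, ranks 2 and 3 to the two largest,
    ranks 4 and 5 to the next two smallest, and so on (see Table 1).
    """
    ranks = [0] * total
    lo, hi = 0, total - 1
    rank, take_low, count = 1, True, 1  # the first step takes one value only
    while lo <= hi:
        for _ in range(count):
            if lo > hi:
                break
            if take_low:
                ranks[lo] = rank
                lo += 1
            else:
                ranks[hi] = rank
                hi -= 1
            rank += 1
        take_low = not take_low
        count = 2
    return ranks

def siegel_tukey_statistic(x, y):
    """Rank sum of the test sample y under Siegel-Tukey ranks, minus n(n+1)/2."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = siegel_tukey_ranks(len(pooled))
    n = len(y)
    rank_sum = sum(r for r, (_, is_y) in zip(ranks, pooled) if is_y)
    return rank_sum - n * (n + 1) / 2

print(siegel_tukey_ranks(8))
x = [-0.1, 0.2, 0.0, 0.3, -0.2]  # hypothetical reference sample
y = [-3.0, 2.5, 4.0]             # hypothetical, more dispersed test sample
print(siegel_tukey_statistic(x, y))
```

Because extreme observations receive the smallest ranks, a test sample that is more dispersed than the reference sample yields a small statistic, as in the toy data above.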
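Mood [62] also provided a useful test statistic for scale parameters. As before, we consider two samples $\{X_{1,t}, X_{2,t}, \ldots, X_{m,t}\} \sim G(\mu_1, \sigma_1^2)$ and $\{Y_{1,t}, Y_{2,t}, \ldots, Y_{n,t}\} \sim G(\mu_2, \sigma_2^2)$, where $\mu_1 = \mu_2$. The Mood statistic can be described as follows:
$$MD_t = \sum_{i=1}^{n} \left( R_{i,t} - \frac{N+1}{2} \right)^2,$$
where $R_{i,t}$ is the rank of $Y_{i,t}$, $i = 1, 2, \ldots, n$, in the pooled sample $\{X_{1,t}, \ldots, X_{m,t}, Y_{1,t}, \ldots, Y_{n,t}\}$ of size $N (= m + n)$. For $m, n \rightarrow +\infty$ and $m/N \rightarrow$ constant $C$,
$$\frac{MD_t - E(MD_t \mid H_0)}{\sqrt{\mathrm{Var}(MD_t \mid H_0)}} \rightarrow \mathcal{N}(0, 1).$$
Additionally [68],
$$E(MD_t \mid H_0) = \frac{n(N^2 - 1)}{12}, \qquad \mathrm{Var}(MD_t \mid H_0) = \frac{mn(N+1)(N^2 - 4)}{180}.$$
Fligner and Killeen [69] also introduced a test statistic for scale parameters that is based on the ranks of absolute deviations. The statistic is defined as
$$FK_t = \sum_{j=1}^{n} R_{j,t},$$
where $R_{j,t}$ is the rank of $V^{Y}_{j,t}$ in the pooled sample $\{V^{X}_{1,t}, V^{X}_{2,t}, \ldots, V^{X}_{m,t}, V^{Y}_{1,t}, V^{Y}_{2,t}, \ldots, V^{Y}_{n,t}\}$, with $V^{X}_{i,t} = |X_{i,t} - M_t|$ and $V^{Y}_{j,t} = |Y_{j,t} - M_t|$, and $M_t$ represents the median of the pooled sample $\{X_{1,t}, \ldots, X_{m,t}, Y_{1,t}, \ldots, Y_{n,t}\}$. $FK_t$ has the distribution of Wilcoxon's rank-sum statistic under the null hypothesis. Therefore,
$$E(FK_t \mid H_0) = \frac{n(N+1)}{2}, \qquad \mathrm{Var}(FK_t \mid H_0) = \frac{mn(N+1)}{12}.$$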
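A short Python sketch of the Mood statistic and the Fligner-Killeen rank sum is given below, together with a brute-force check of the standard null moments of the Mood statistic, n(N² − 1)/12 and mn(N + 1)(N² − 4)/180, by enumerating all rank subsets. The function names and the toy data are illustrative assumptions, and the ranking helper assumes that there are no ties.

```python
import statistics
from itertools import combinations

def ranks_of_test(x, y):
    """Ranks of the y-values in the pooled sample (assumes no ties)."""
    pooled = sorted(x + y)
    return [pooled.index(v) + 1 for v in y]

def mood_statistic(x, y):
    """Sum of squared deviations of the test-sample ranks from (N + 1) / 2."""
    N = len(x) + len(y)
    return sum((r - (N + 1) / 2) ** 2 for r in ranks_of_test(x, y))

def fligner_killeen_statistic(x, y):
    """Rank sum of |Y - M| among all absolute deviations from the pooled median M."""
    med = statistics.median(x + y)
    vx = [abs(v - med) for v in x]
    vy = [abs(v - med) for v in y]
    return sum(ranks_of_test(vx, vy))

def mood_null_moments(m, n):
    """Closed-form null mean and variance of the Mood statistic."""
    N = m + n
    return n * (N * N - 1) / 12, m * n * (N + 1) * (N * N - 4) / 180

def mood_moments_by_enumeration(m, n):
    """Null mean and variance obtained by listing every possible rank subset."""
    N = m + n
    values = [sum((r - (N + 1) / 2) ** 2 for r in subset)
              for subset in combinations(range(1, N + 1), n)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

x = [1.0, 2.2, 3.1, 4.5, 5.3]  # hypothetical reference sample
y = [0.0, 9.0]                 # hypothetical, more dispersed test sample
print(mood_statistic(x, y), fligner_killeen_statistic(x, y))
print(mood_null_moments(3, 2), mood_moments_by_enumeration(3, 2))
```

Unlike the Siegel-Tukey statistic, both statistics here take large values when the test sample is more dispersed than the reference sample.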

3. Proposed Monitoring Strategy

We assume that there are $m$ independent observations from an unknown multivariate continuous distribution with dimensionality $p$. We assume that the independent observations $X_t$ follow the model below:
$$X_t \sim \begin{cases} G(\mu_0, \Sigma_0), & t = 1, 2, \ldots, \tau, \\ G(\mu_1, \Sigma_1), & t = \tau + 1, \tau + 2, \ldots, \end{cases}$$
where $\mu_0$ and $\mu_1$ are the in-control (IC) location vector and the OC location vector, respectively; $\Sigma_0$ and $\Sigma_1$ represent the IC covariance matrix and the OC covariance matrix, respectively, where $(\mu_0, \Sigma_0) \neq (\mu_1, \Sigma_1)$; $\tau$ represents an unknown change point; and $G(\cdot)$ is an unknown continuous distribution function. In phase I, we assume that the IC sample of size $m$ is given at time $t$, $R = \{X_{1,t}, X_{2,t}, \ldots, X_{i,t}, \ldots, X_{m,t}\}$, where $X_{i,t} = (X_{i1,t}, X_{i2,t}, \ldots, X_{ip,t})'$, $i = 1, 2, \ldots, m$. In phase II, $T = \{Y_{1,t}, Y_{2,t}, \ldots, Y_{n,t}\}$ of size $n$ is obtained. After the phase I sample $R$ is analyzed, the phase II sample $T$ is monitored. Inspired by Mukherjee and Marozzi [50] for multivariate processes, we reduce each $p$-dimensional observation to the Euclidean distance between that observation and the mean vector of the phase I data $X_{i,t}$, $i = 1, 2, \ldots, m$. That is,
$$D^{X}_{i,t} = \left\| X_{i,t} - \bar{X}_t \right\|, \quad i = 1, 2, \ldots, m, \qquad \text{and} \qquad D^{Y}_{j,t} = \left\| Y_{j,t} - \bar{X}_t \right\|, \quad j = 1, 2, \ldots, n,$$
where $\bar{X}_t = \frac{1}{m} \sum_{i=1}^{m} X_{i,t}$. Now, a univariate sequence is obtained, $\{D^{X}_{1,t}, D^{X}_{2,t}, \ldots, D^{X}_{m,t}, D^{Y}_{1,t}, D^{Y}_{2,t}, \ldots, D^{Y}_{n,t}\}$. Then, a Shewhart-type chart for monitoring location changes that is based on the Wilcoxon rank-sum statistic (i.e., S-W chart) can be constructed. The statistic of the S-W chart is
$$W_{2,t} = \sum_{j=1}^{n} R_{j,t},$$
where $R_{j,t}$ is the rank of $D^{Y}_{j,t}$ in the pooled sequence of distances, with upper control limit (UCL) and lower control limit (LCL)
$$\mathrm{UCL}_{W} = E(W_{2,t} \mid H_0) + L \sqrt{\mathrm{Var}(W_{2,t} \mid H_0)}, \qquad \mathrm{LCL}_{W} = E(W_{2,t} \mid H_0) - L \sqrt{\mathrm{Var}(W_{2,t} \mid H_0)},$$
where $L$ is a constant to be determined. Shewhart-type charts can also be constructed based on the three hypothesis-testing statistics for the scale parameter. The S-ST chart (i.e., the Shewhart-type chart based on the Siegel-Tukey statistic) is calculated using $ST_t$ with $\mathrm{UCL}_{ST} = E(ST_t \mid H_0) + L \sqrt{\mathrm{Var}(ST_t \mid H_0)}$ and $\mathrm{LCL}_{ST} = E(ST_t \mid H_0) - L \sqrt{\mathrm{Var}(ST_t \mid H_0)}$. The S-MD chart (i.e., the Shewhart-type chart based on the Mood statistic) is given by $MD_t$ with $\mathrm{UCL}_{MD} = E(MD_t \mid H_0) + L \sqrt{\mathrm{Var}(MD_t \mid H_0)}$ and $\mathrm{LCL}_{MD} = E(MD_t \mid H_0) - L \sqrt{\mathrm{Var}(MD_t \mid H_0)}$. The S-FK chart (i.e., the Shewhart-type chart based on the Fligner-Killeen statistic) is given by $FK_t$ with $\mathrm{UCL}_{FK} = E(FK_t \mid H_0) + L \sqrt{\mathrm{Var}(FK_t \mid H_0)}$ and $\mathrm{LCL}_{FK} = E(FK_t \mid H_0) - L \sqrt{\mathrm{Var}(FK_t \mid H_0)}$. We then use the average run length (ARL) to evaluate the performance of these methods.
The ARL is the average number of points plotted on a control chart before an OC signal occurs. If the process is IC, $\mathrm{ARL}_0 = 1/\alpha$; when the process is OC, $\mathrm{ARL}_1 = 1/(1 - \beta)$, where $\alpha$ is the probability of a type I error and $\beta$ is the probability of a type II error. Therefore, we typically fix the IC ARL, denoted as $\mathrm{ARL}_0$, and compare the OC ARL, denoted as $\mathrm{ARL}_1$; a smaller $\mathrm{ARL}_1$ is better. We let $m = 50$, $n \in \{5, 10, 20\}$, and $p = 4$ under the multivariate Gaussian distribution with mean vector $\mu_0$ and covariance matrix $\Sigma_0$. For a fair comparison, we set $\mathrm{ARL}_0 = 500$ for all control schemes. Figure 1 shows the OC ARL of the three Shewhart-type schemes (S-ST, S-MD, and S-FK) when detecting changes in scale; the S-MD chart performs better than the other charts over a range of scale shifts.
Figure 1

Comparison of the three Shewhart-type schemes when detecting changes in scale.
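The ARL relations used above can be illustrated with a small Python sketch. It assumes that successive plotted points signal independently with a constant probability, so that the run length is geometric; the function names, the seed, and the value β = 0.8 are hypothetical choices made only for illustration.

```python
import random

def arl_in_control(alpha):
    """ARL0 = 1 / alpha for a chart with per-point false-alarm probability alpha."""
    return 1.0 / alpha

def arl_out_of_control(beta):
    """ARL1 = 1 / (1 - beta) for a chart with per-point type II error beta."""
    return 1.0 / (1.0 - beta)

def simulate_run_length(signal_prob, rng):
    """Number of plotted points up to and including the first signal."""
    run = 1
    while rng.random() >= signal_prob:
        run += 1
    return run

alpha = 1 / 500  # per-point false-alarm probability implied by ARL0 = 500
rng = random.Random(2022)
runs = [simulate_run_length(alpha, rng) for _ in range(2000)]
simulated_arl0 = sum(runs) / len(runs)
print(arl_in_control(alpha), round(simulated_arl0, 1), arl_out_of_control(0.8))
```

Fixing ARL0 = 500 therefore corresponds to a per-point false-alarm probability of 0.002, and the simulated average run length should be close to 500 up to Monte Carlo error.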

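When calculating the Mahalanobis distance, the sample size must exceed the dimension; otherwise, the inverse of the sample covariance matrix does not exist. Thus, the Mahalanobis distance sometimes fails to meet practical requirements. It is also not appropriate to simply use the Euclidean distance to reduce the dimensionality of high-dimensional data, because doing so treats differences in different data attributes (i.e., the dimensions of each index or variable) as equivalent. The standardized Euclidean distance is an improvement that overcomes this shortcoming of the simple Euclidean distance. Since the distribution of each component of the data is different, each component is first "standardized" so that all components have equal means and variances. Mukherjee and Marozzi [50] consider the sum of the squares of standardized Wilcoxon and Bradley statistics for monitoring high-dimensional processes with unknown parameters. Inspired by Mukherjee and Marozzi [50], we combine the ideas of control schemes and hypothesis testing to propose an effective control scheme that simultaneously monitors expectation and variance. Based on this analysis, we propose an alternative control scheme, whose statistic is as follows:
$$P_t = Z_{W,t}^2 + Z_{MD,t}^2,$$
with
$$Z_{W,t} = \frac{W_{2,t} - E(W_{2,t} \mid H_0)}{\sqrt{\mathrm{Var}(W_{2,t} \mid H_0)}}, \qquad Z_{MD,t} = \frac{MD_t - E(MD_t \mid H_0)}{\sqrt{\mathrm{Var}(MD_t \mid H_0)}},$$
where $W_{2,t} = \sum_{j=1}^{n} R_{j,t}$ and $MD_t = \sum_{j=1}^{n} \left(R_{j,t} - (N+1)/2\right)^2$ are the Wilcoxon rank-sum and Mood statistics computed from the ranks $R_{j,t}$ of the phase II observations in the pooled sample of size $N = m + n$.
The term asymptotic distribution is used in the sense of convergence in law when $m \rightarrow \infty$ and $n \rightarrow \infty$ with the ratio $m/N$ constant [52]. Under $H_0$, the statistics $Z_{W,t}$ and $Z_{MD,t}$ are uncorrelated for all $m$ and $n$. Since, for all $m$ and $n$, the centered ranks $R_{j,t} - (N+1)/2$ are symmetrically distributed about zero,
$$E\left[ \left( W_{2,t} - \frac{n(N+1)}{2} \right) MD_t \,\middle|\, H_0 \right] = 0.$$
Thus, we have
$$E(W_{2,t}\, MD_t \mid H_0) = \frac{n(N+1)}{2} \cdot \frac{n(N^2 - 1)}{12}. \tag{14}$$
Equality (14) is the product of $E(W_{2,t} \mid H_0)$ and $E(MD_t \mid H_0)$. Therefore,
$$\mathrm{Cov}(W_{2,t}, MD_t \mid H_0) = E(W_{2,t}\, MD_t \mid H_0) - E(W_{2,t} \mid H_0)\, E(MD_t \mid H_0) = 0.$$
It is obvious that
$$E(Z_{W,t} \mid H_0) = E(Z_{MD,t} \mid H_0) = 0, \qquad \mathrm{Var}(Z_{W,t} \mid H_0) = \mathrm{Var}(Z_{MD,t} \mid H_0) = 1.$$
Under $H_0$, $E(P_t \mid H_0) = 2$, and $P_t$ converges in distribution to $\chi^2(2)$ with $m \rightarrow \infty$, $n \rightarrow \infty$, and the ratio $m/N$ constant.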
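A combined location-scale statistic of the kind described in this section can be sketched in Python as follows. The sketch rests on several assumptions that are not confirmed by the text: each observation is reduced to its standardized Euclidean distance from the Phase I mean, the chart statistic is the sum of the squared standardized Wilcoxon and Mood statistics, and the control limit is taken from the asymptotic chi-square distribution with 2 degrees of freedom, whose upper quantile with tail probability 1/ARL0 is 2 ln(ARL0). All function names and the toy data are hypothetical, and ties are assumed absent.

```python
import math
import random

def column_stats(rows):
    """Per-coordinate mean and sample standard deviation of the Phase I data."""
    m, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / m for k in range(p)]
    sds = [math.sqrt(sum((r[k] - means[k]) ** 2 for r in rows) / (m - 1))
           for k in range(p)]
    return means, sds

def distances(rows, means, sds):
    """Standardized Euclidean distance of each row from the Phase I mean."""
    return [math.sqrt(sum(((r[k] - means[k]) / sds[k]) ** 2
                          for k in range(len(means)))) for r in rows]

def combined_statistic(reference, test):
    """Squared standardized Wilcoxon plus squared standardized Mood statistic."""
    means, sds = column_stats(reference)
    dx = distances(reference, means, sds)
    dy = distances(test, means, sds)
    m, n = len(dx), len(dy)
    N = m + n
    pooled = sorted(dx + dy)
    ranks = [pooled.index(d) + 1 for d in dy]  # assumes no ties
    w = sum(ranks)
    md = sum((r - (N + 1) / 2) ** 2 for r in ranks)
    z_w = (w - n * (N + 1) / 2) / math.sqrt(m * n * (N + 1) / 12)
    z_md = ((md - n * (N * N - 1) / 12)
            / math.sqrt(m * n * (N + 1) * (N * N - 4) / 180))
    return z_w ** 2 + z_md ** 2

def asymptotic_limit(arl0):
    """Upper chi-square(2) quantile with tail probability 1 / arl0."""
    return 2 * math.log(arl0)

rng = random.Random(7)
reference = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(20)]  # toy Phase I data
test = [[100.0 + k, 100.0] for k in range(5)]                         # clearly shifted sample
stat = combined_statistic(reference, test)
print(round(stat, 3), round(asymptotic_limit(500), 3), stat > asymptotic_limit(500))
```

With small samples the rank statistics are discrete, so the asymptotic limit is only approximate; in practice the limit would more likely be calibrated by simulation to reach the target IC ARL.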

4. Performance Evaluation

In this section, we compare the performances of these charts with different reference sample sizes $m$ and test sample sizes $n$ when shifts occur. We assume that the $t$th future observation, $X_t$, is collected over time using the following multivariate model:
$$X_t \sim \begin{cases} N_p(\mu_0, \Sigma_0), & t = 1, 2, \ldots, \tau, \\ N_p(\mu_1, \Sigma_1), & t = \tau + 1, \tau + 2, \ldots, \end{cases}$$
where $\mu_0 = (0, 0, 0, 0)$, $\mu_1 = (0, 0, \delta, \delta)$, and $\Sigma_0$ represents the $4 \times 4$ identity matrix. We let $\tau = 50$ and dimensionality $p = 4$. Table 2 shows the OC ARL of these charts. Table 3 presents the OC ARL of these charts when the variables are correlated, with the IC covariance matrix $\Sigma_2$ in place of $\Sigma_0$.
Table 2

OC ARL values of these charts for various m and n when zero-state ARL0 = 500 with the IC distribution N(μ0, Σ0).

m = 50
Shifts            n = 5                      n = 10                     n = 20
                  S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
δ = 0.5, σ = 2    85.5   161.2  69.1         63.7   169.6  48           48.6   251    42.8
δ = 0.5, σ = 4    25.1   58.1   16.6         15.1   62.3   11.8         10.8   129.7  9.9
δ = 1, σ = 2      26.3   71.7   25.3         16     67.8   15.8         16.9   180.4  16.5
δ = 1, σ = 4      9.8    28.9   9.6          6.5    36.1   6.4          5.5    90     5.3

m = 100
Shifts            n = 5                      n = 20                     n = 50
                  S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
δ = 0.5, σ = 2    69.1   12.6   57.6         22.9   110.7  21.2         17.7   180.7  16.2
δ = 0.5, σ = 4    16.9   34.4   12.2         5.6    27.8   4.6          4.8    72.1   4.1
δ = 1, σ = 2      20.7   42.8   18.5         7.3    40.5   6.9          8      114.8  7.3
δ = 1, σ = 4      7.9    14.6   7.5          3.1    11.7   3            2.5    47.4   2.5

m = 200
Shifts            n = 5                      n = 20                     n = 50
                  S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
δ = 0.5, σ = 2    58.3   96.7   49.8         16.7   62.7   14.2         7.8    10.6   7.3
δ = 0.5, σ = 4    14.9   23.3   9.8          4.1    15     3.3          2.5    10.6   2.5
δ = 1, σ = 2      19     28.3   18.7         4.4    15     4.1          3.5    20.3   3.3
δ = 1, σ = 4      6.7    10.5   6.1          2.3    4.3    2.3          2      4.6    2
Table 3

OC ARL values of these charts for various m and n when zero-state ARL0 = 500 with the IC distribution N(μ0, Σ2).

m = 50
Shifts            n = 5                      n = 10                     n = 20
                  S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
δ = 0.5, σ = 2    96     182.6  70.7         63.5   189.9  53.3         53     274    49.9
δ = 0.5, σ = 4    24.7   68.4   19.3         15.9   72.7   12.6         11.2   147    10.9
δ = 1, σ = 2      31.5   85.7   30.2         19.3   101.5  18.8         19     193.3  18.8
δ = 1, σ = 4      11.2   30.8   11           7.1    40.6   6.8          6.1    103.1  6.1

m = 100
Shifts            n = 5                      n = 20                     n = 50
                  S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
δ = 0.5, σ = 2    73     140.9  67.7         23.9   33.2   22.8         22.5   92.2   22.3
δ = 0.5, σ = 4    18.4   39.5   14           5.6    33.2   5.1          3.9    92.2   3.7
δ = 1, σ = 2      27.8   47     24.1         5.9    13.8   5.8          8.2    143.2  8.2
δ = 1, σ = 4      17.7   16.5   22.1         4.8    14.4   4.8          4.5    56.6   4.2

m = 200
Shifts            n = 5                      n = 20                     n = 50
                  S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
δ = 0.5, σ = 2    65.9   120.3  57.9         16.2   79     16.2         8.8    77.8   8.4
δ = 0.5, σ = 4    15.5   27.8   11.3         4      12.2   3.4          2.5    13.4   2.5
δ = 1, σ = 2      18.9   34.8   17.6         4.7    16.6   4.3          2.9    23.8   2.7
δ = 1, σ = 4      7.5    12.6   7.5          2.4    4.7    2.4          2      5.5    2
Results for detecting general distributional changes of the Weibull type are shown in Table 4, where Weibull(θ1, θ2) represents the Weibull distribution with the shape parameter θ1 and the scale parameter θ2. The IC distribution is Weibull(1, 1), and the OC distribution is Weibull(1, 1 + δ). We also consider three other types of general changes (multivariate t with 3 df, multivariate exponential, and multivariate gamma distributions) in Table 5. Tables 2–5 show that the proposed method performs well in detecting a range of shifts.
Table 4

OC ARL values of these charts for various m and n when zero-state ARL0 = 500 with the IC distribution Weibull(1, 1).

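m = 50
δ      n = 5                      n = 10                     n = 20
       S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
2      5.5    10.3   4            3.7    6.3    2.8          2.6    5.2    2.3
4      2.3    2.5    2            2.1    2.1    2            2      2.1    2
6      2.1    2.1    2            2      2      2            2      2      2

m = 100
δ      n = 5                      n = 20                     n = 50
       S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
2      5      7.2    3.8          2.3    2.8    2.1          2      2.4    2
4      2.2    2.2    2            2      2      2            2      2      2
6      2      2      2            2      2      2            2      2      2

m = 200
δ      n = 5                      n = 20                     n = 50
       S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
2      4.6    5.8    3.4          2.2    2.3    2.2          2      2      2
4      2.2    2.1    2            2      2      2            2      2      2
6      2      2      2            2      2      2            2      2      2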
Table 5

OC ARL values of these charts for various n when m = 100 and zero-state ARL0 = 500 under other types of distribution.

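τ = 50
Type   n = 5                      n = 20                     n = 50
       S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
1      43.1   126.1  37.5         10.2   103.2  9.4          4.5    252.1  6.9
2      13.3   64.4   8.1          4.7    5.5    2.8          3.6    10.7   2.4
3      11.8   59     7.7          4.7    5.5    2.8          3.5    10.3   2.4

τ = 100
Type   n = 5                      n = 20                     n = 50
       S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
1      43     129.6  41.1         10.1   104.8  9.4          24.4   238.9  6.7
2      12.7   65.4   7.7          4.8    5.8    2.7          3.4    10.5   2.4
3      12.5   54.5   7.9          4.6    5.6    2.7          3.6    10.4   2.4

τ = 200
Type   n = 5                      n = 20                     n = 50
       S-W    S-MD   Proposed     S-W    S-MD   Proposed     S-W    S-MD   Proposed
1      42.9   132.9  39.1         10.7   109.7  9.2          4.7    243.3  6.9
2      13.4   61.5   8.1          4.8    5.7    2.8          3.4    10.2   2.3
3      12     58.8   8            4.7    5.7    2.8          3.5    10.6   2.4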

1: multivariate t with 3 df distribution; 2: multivariate gamma distribution; 3: multivariate exponential distribution.
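The way OC ARL values of this kind are typically estimated can be sketched with a small Monte Carlo simulation in Python. This is only an illustration and is not expected to reproduce the tabulated values: it assumes a fixed Phase I sample, a fresh test sample of size n at each time point, a shift that is present from the first test sample, an OC covariance of σ² times the identity matrix, plain Euclidean distances, and the asymptotic chi-square(2) control limit 2 ln(500). All names, seeds, and replication counts are hypothetical.

```python
import bisect
import math
import random

def combined_statistic(sorted_ref_d, test_d):
    """Squared standardized Wilcoxon plus squared standardized Mood statistic."""
    m, n = len(sorted_ref_d), len(test_d)
    N = m + n
    # Rank of the k-th smallest test distance: reference distances below it + k + 1.
    ranks = [bisect.bisect_left(sorted_ref_d, d) + k + 1
             for k, d in enumerate(sorted(test_d))]
    w = sum(ranks)
    md = sum((r - (N + 1) / 2) ** 2 for r in ranks)
    z_w = (w - n * (N + 1) / 2) / math.sqrt(m * n * (N + 1) / 12)
    z_md = ((md - n * (N * N - 1) / 12)
            / math.sqrt(m * n * (N + 1) * (N * N - 4) / 180))
    return z_w ** 2 + z_md ** 2

def run_length(m, n, p, delta, sigma, limit, rng, cap=300):
    """Number of test samples drawn until the chart signals (capped at `cap`)."""
    ref = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(m)]
    center = [sum(r[k] for r in ref) / m for k in range(p)]

    def dist(v):
        return math.sqrt(sum((v[k] - center[k]) ** 2 for k in range(p)))

    ref_d = sorted(dist(r) for r in ref)
    mu1 = [0.0] * (p - 2) + [delta, delta]  # shift in the last two coordinates
    for t in range(1, cap + 1):
        test = [[mu1[k] + sigma * rng.gauss(0, 1) for k in range(p)]
                for _ in range(n)]
        if combined_statistic(ref_d, [dist(v) for v in test]) > limit:
            return t
    return cap

def average_run_length(reps, seed, **kwargs):
    """Average of `reps` simulated run lengths with a seeded generator."""
    rng = random.Random(seed)
    return sum(run_length(rng=rng, **kwargs) for _ in range(reps)) / reps

limit = 2 * math.log(500)  # asymptotic limit for a nominal ARL0 of 500
settings = dict(m=50, n=5, p=4, limit=limit)
arl_ic = average_run_length(20, seed=1, delta=0.0, sigma=1.0, **settings)
arl_oc = average_run_length(20, seed=1, delta=1.0, sigma=4.0, **settings)
print(arl_ic, arl_oc)
```

A full study would use many more replications, remove the cap on the run length, and calibrate the control limit so that the simulated IC ARL matches the nominal value before comparing OC ARLs.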

5. Illustration

5.1. Data Source

To illustrate the proposed method, we analyze a real clinical case. Samples arrived periodically as Dr. Wolberg reported his clinical cases, and the database therefore reflects this chronological grouping of the data. For each of the 599 clinical cases, several clinical features were observed or measured. The quantitative attributes include clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. The datasets are publicly available in the “Breast Cancer Wisconsin (Original) Data Set” of the UCI Machine Learning Repository and can be downloaded from the website http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29. Breast cancer screening is an important strategy for early detection, which increases the probability of a good treatment outcome. More details about these datasets can be found in [70-73]. In this work, we aim to monitor the Breast Cancer Wisconsin Data Set and identify whether there is a shift in the process.

5.2. Data Analysis

A quantile-quantile (Q-Q) plot of each index, based on the 599 historical observations, is shown in Figure 2. The plots indicate that the normality assumption is invalid, which leads us to reject the null hypothesis that the data are normally distributed. Thus, we use the proposed distribution-free control scheme to monitor the breast cancer data.
Figure 2

Corresponding normal Q-Q plot of the breast cancer data.

We let m = 100 and n = 5. We use the first 350 observations (1–350) as IC data to find the control limits of the S-W chart, S-MD chart, and proposed chart. For a fair comparison, the IC ARL of all control charts is set equal to 400, and the remaining 249 breast cancer observations are monitored. The S-W and S-MD charts for the monitored breast cancer data are shown in Figure 3, which indicates that the S-W chart produces a false alarm when the process is IC; conversely, the S-MD chart produces no OC signal when the process is OC. Figure 4 shows the proposed chart for monitoring the breast cancer data; the statistic of the proposed chart falls outside the control limits after 353 observations. Compared with the S-W and S-MD charts, the proposed chart detects the shift earlier and more accurately.
Figure 3

(a) S-W chart for monitoring breast cancer data. (b) S-MD chart for monitoring breast cancer data.

Figure 4

The proposed chart for monitoring breast cancer data.

6. Conclusions and Discussion

This paper provides a new control scheme for detecting location and scale changes. Inspired by Mukherjee and Marozzi [50], we proposed an effective control chart that simultaneously monitors changes in both location and scale. The Breast Cancer Wisconsin Data Set was analyzed using the proposed method. Spectral analysis is also reviewed and conducted to investigate the periodicities of shorter time series, and nonlinear least squares fitting is then used for fitting analysis. The real-data example shows that the proposed scheme performs well in detecting process changes. In this study, we mainly considered the standard Euclidean distance to reduce the dimensionality of high-dimensional data; other methods of dimensionality reduction still need to be investigated in more detail.
  9 in total

1.  A risk-adjusted Sets method for monitoring adverse medical outcomes.

Authors:  O A Grigg; V T Farewell
Journal:  Stat Med       Date:  2004-05-30       Impact factor: 2.373

2.  Probability tables for individual comparisons by ranking methods.

Authors:  F Wilcoxon
Journal:  Biometrics       Date:  1947-09       Impact factor: 2.571

3.  Multisurface method of pattern separation for medical diagnosis applied to breast cytology.

Authors:  W H Wolberg; O L Mangasarian
Journal:  Proc Natl Acad Sci U S A       Date:  1990-12       Impact factor: 11.205

4.  Risk-adjusted survival time monitoring with an updating exponentially weighted moving average (EWMA) control chart.

Authors:  Stefan H Steiner; Mark Jones
Journal:  Stat Med       Date:  2010-02-20       Impact factor: 2.373

5.  Monitoring the evolutionary process of quality: risk-adjusted charting to track outcomes in intensive care.

Authors:  David A Cook; Stefan H Steiner; Richard J Cook; Vern T Farewell; Anthony P Morton
Journal:  Crit Care Med       Date:  2003-06       Impact factor: 7.598

6.  Identification of hot and cold spots in genome of Mycobacterium tuberculosis using Shewhart Control Charts.

Authors:  Sarbashis Das; Priyanka Duggal; Rahul Roy; Vithal P Myneedu; Digamber Behera; Hanumanthappa K Prasad; Alok Bhattacharya
Journal:  Sci Rep       Date:  2012-03-02       Impact factor: 4.379

7.  Differential Effects of Insulin and IGF1 Receptors on ERK and AKT Subcellular Distribution in Breast Cancer Cells.

Authors:  Rive Sarfstein; Karthik Nagaraj; Derek LeRoith; Haim Werner
Journal:  Cells       Date:  2019-11-23       Impact factor: 6.600

8.  Are Circulating Tumor Cells (CTCs) Ready for Clinical Use in Breast Cancer? An Overview of Completed and Ongoing Trials Using CTCs for Clinical Treatment Decisions.

Authors:  Fabienne Schochter; Thomas W P Friedl; Amelie deGregorio; Sabrina Krause; Jens Huober; Brigitte Rack; Wolfgang Janni
Journal:  Cells       Date:  2019-11-08       Impact factor: 6.600

9.  A Novel 3D Scaffold for Cell Growth to Asses Electroporation Efficacy.

Authors:  Monica Dettin; Elisabetta Sieni; Annj Zamuner; Ramona Marino; Paolo Sgarbossa; Maria Lucibello; Anna Lisa Tosi; Flavio Keller; Luca Giovanni Campana; Emanuela Signori
Journal:  Cells       Date:  2019-11-19       Impact factor: 6.600
