
Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools.

Abstract

Identifying differentially expressed genes is difficult because the number of available samples is small compared with the number of genes. Conventional gene selection methods that employ statistical tests have the critical problem that P-values depend heavily on sample size. Although the recently proposed principal component analysis (PCA)- and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods, the reason why it works so well has been unclear. In this study, we aim to understand this reason in the context of projection pursuit (PP), which was proposed long ago to solve the problem of dimensions; we can relate the space spanned by singular value vectors to that spanned by the optimal cluster centroids obtained from K-means. Thus, the success of PCA- and TD-based unsupervised FE can be understood through this equivalence. In addition, the empirical threshold of adjusted P-values of 0.01, obtained under the null hypothesis that singular value vectors attributed to genes obey a Gaussian distribution, empirically corresponds to a threshold of adjusted P-values of 0.1 when the null distribution is generated by gene order shuffling. To show this, we newly applied PP to the three data sets to which PCA- and TD-based unsupervised FE were previously applied; these data sets concern two topics, biomarker identification for kidney cancers (the first two) and drug discovery for COVID-19 (the third one). We found that PP and PCA- or TD-based unsupervised FE coincide well. The shuffling procedure described above was also successfully applied to these three data sets. These findings thus rationalize the success of PCA- and TD-based unsupervised FE for the first time.

Year: 2022 PMID: 36173994 PMCID: PMC9521941 DOI: 10.1371/journal.pone.0275472

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Introduction

In genomic sciences, selecting a limited number of differentially expressed genes (DEGs) among as many as several tens of thousands of genes is a critical problem. Unfortunately, this is a very difficult task because the number of genes, N, is usually much larger than the number of available samples, M. As this is not a mathematically solved problem, it has most frequently been tackled empirically using statistical test-based feature selection strategies [1, 2]. Despite huge efforts in this direction, these statistical test-based feature selection strategies cannot be said to work well. Selection of biologically informative genes including DEGs is essentially performed as follows (for simplicity, $\sum_j y_j = 0$ and the M samples are composed of multiple classes having an equal number of samples). Suppose that we have properties $y \in \mathbb{R}^{M}$ attributed to M samples. We would like to relate a matrix form of some omics data, e.g., gene expression profiles, $X \in \mathbb{R}^{N \times M}$, to $y$. The overall purpose is to derive $b \in \mathbb{R}^{N}$, whose components $b_i$ have absolute values that represent the importance of the ith gene. The first strategy, and the most popular one outside genomic sciences, is a regression strategy that requires minimization of $(y - X^{T} b)^2$ (Eq (1)), resulting in $b = (XX^{T})^{-1} X y$ (Eq (2)). The regression approach, Eq (2), is less popular in genomic sciences than in other scientific fields, possibly because N ≫ M, which always results in exactly $(y - X^{T} b)^2 = 0$ with an infinitely large number of solutions $b$. Thus, it is useless for selecting a limited number of important features among the total N features. Adding the regularization term of the L2 norm to Eq (1), as $(y - X^{T} b)^2 + \lambda |b|^2$ (Eq (3)) with the positive constant λ > 0, enables selection of a unique $b$ by minimizing Eq (3) as $b = (XX^{T} + \lambda I)^{-1} X y$, because the solution no longer satisfies $(y - X^{T} b)^2 = 0$; nevertheless, it is not an ideal solution. Although the solution using the Moore-Penrose pseudoinverse [3] might be better, as it satisfies $(y - X^{T} b)^2 = 0$ under the condition of minimum $|b|$, it is unclear whether minimum $|b|$ is a good constraint from the biological viewpoint.
Adding the regularization term of the L1 norm [4] to Eq (1) can yield at most M variables, which is not effective when N ≫ M, because more than M variables might be biologically informative and should not be neglected. Moreover, addition of the L1 norm is known to be a poor strategy when X is not composed of independent vectors, which is very common in genomic science. The second strategy is a projection strategy, $b = Xy$ (Eq (7)), which is equivalent to the maximization of the projection $b^{T} X y$ for a fixed norm of $b$ and is employed in PCA- and TD-based unsupervised FE (see below). Through the concept of projection pursuit [5] (PP), it is understood as seeking the projection vector $b$ that maximizes interestingness, $H(X)$, which is Eq (8) in this study. As H(X) is a function of $b$, it is also denoted as $I(b)$, which is called the projection index. $I(b)$ can be any other function, but it should be selected such that the biologically most meaningful results are obtained. Upon obtaining the $b$ that maximizes $I(b)$, we can select the i having larger absolute $b_i$, as mentioned above. In the framework of PP, in a high dimensional system, almost all $b$ have finite projections [6]. Thus, the only question is whether a projection is accidental or biologically meaningful. In genomic science, the projection strategy, Eq (7), is also unpopular. Although the reason for this unpopularity is unclear, it may be explained by the fact that the projection strategy ignores the contribution perpendicular to $y$, that is, what remains after removing the part along the unit vector $\hat{y} = y/|y|$ parallel to $y$. Nevertheless, in contrast to the regression strategy, which requires the computation of $(XX^{T})^{-1}$, Eq (7) is always computable even if N ≫ M, which is a great advantage of the projection strategy over the regression strategy. Instead of these two strategies, feature selection based on statistical tests [1, 2] is more popular in genomic sciences, as mentioned above. Such tests try to identify genes whose expression is significantly distinct between classes.
Despite its popularity, feature selection based on statistical tests has critical problems; in particular, significance depends heavily on the sample size, M. Even for a small distinction, more significant results are obtained when more samples are considered; this is not biologically reasonable because whether gene expression differs significantly between classes should not be a function of sample size. To compensate for this heavy dependence of significance on sample size, other criteria such as fold change between classes are often employed. Thus, feature selection based on statistical tests is, at best, the best among the worst approaches. If better strategies can be employed, there will be no reason to employ strategies based on statistical tests. Despite the unpopularity of the projection strategy, it was sometimes evaluated as more effective [7, 8] than the standard feature selection strategy based on statistical tests. Thus, it is a candidate strategy that could replace feature selection based on statistical tests. In this paper, we try to understand why PCA-based unsupervised FE and TD-based unsupervised FE [3] are effective as feature selection based on the projection strategy, since PCA-like as well as TD-like methods have been successfully applied in other fields, too [9-11]. We consider the cases of biomarker identification for kidney cancer [12] as well as the SARS-CoV-2 infection problem [13]; in these studies, whereas conventional feature selection based on statistical tests was unsuccessful, TD-based unsupervised FE identified biologically reasonable genes (for more details about how PCA- and TD-based unsupervised FE are superior to statistical test-based feature selection tools in these specific examples, see these previous studies [12, 13]).

Materials and methods

Sample R code is available at https://github.com/tagtag/peoj.

Expression profiles

mRNA, miRNA, and gene expression profiles in the first, second, and third data sets can be downloaded from TCGA as well as GEO. Their availability is described in detail in previous studies [12, 13].

Excluding low expressed miRNAs, mRNAs, and genes

To draw Figs 1(B), 2(B) and 3(B), low expressed miRNAs, mRNAs, and genes were screened out. For this, we ranked them using $\sum_j |x_{kj}|$, $\sum_j |x_{ij}|$, and $\sum_{jkm} |x_{ijkm}|$, respectively, and selected only the top-ranked ones.

Fig 1

Histogram of raw P-values computed using the null distribution generated by shuffling when miRNAs in the first data set were considered.

(A) All miRNAs (B) Top 500 most expressive miRNAs.

Fig 2

Histogram of raw P-values computed using the null distribution generated by shuffling when the mRNAs in the first data set were considered.

(A) All mRNAs (B) Top 3000 most expressive mRNAs.

Fig 3

Histogram of raw P-values computed using the null distribution generated by shuffling when genes in the third data set were considered.

(A) All genes (B) Top 2780 most expressive genes.

QQplot

QQplot [14] was used to visualize the coincidence between two distributions that do not always have the same number of elements. The qqplot function implemented in R [15] was employed to draw the QQplots (Figs 4 and 5) in this study.
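The quantile matching behind a QQplot of two samples with different sizes can be sketched as follows. This is a minimal Python illustration of the idea, not the R `qqplot` implementation used in the study; the function name `qq_pairs` and the choice of linear interpolation to thin the larger sample are assumptions made for this example.

```python
def qq_pairs(a, b):
    """Pair the quantiles of two samples that may differ in size.

    The larger sample is reduced to the size of the smaller one by linear
    interpolation of its sorted values at evenly spaced positions.
    """
    small, large = sorted(a), sorted(b)
    swapped = len(small) > len(large)
    if swapped:
        small, large = large, small
    n, m = len(small), len(large)
    thinned = []
    for r in range(n):
        # Position of the r-th of n evenly spaced points on the range 0..m-1.
        pos = r * (m - 1) / (n - 1) if n > 1 else (m - 1) / 2
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        thinned.append(large[lo] * (1 - frac) + large[hi] * frac)
    # Always return pairs in (quantile of a, quantile of b) order.
    return list(zip(thinned, small)) if swapped else list(zip(small, thinned))
```

Points that fall close to the diagonal when the returned pairs are plotted indicate that the two sets of P-values follow similar distributions.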

Fig 4

QQplot between P-values computed by TD-based unsupervised FE and projection (A) mRNA in the first data set (B) miRNA in the first data set (C) mRNA in the second data set (D) miRNA in the second data set.

Fig 5

QQplot of P-values between TD-based unsupervised FE and PP (the third data set).

Null distribution

The null distributions used for computing P-values in Figs 1–3 and 6 were generated by gene order shuffling as follows. First, the order of i was shuffled within each sample of $x_{ij}$ or $x_{ijkm}$, and that of k was shuffled within each sample of $x_{kj}$. Thus, the order of mRNAs, miRNAs, and genes was shuffled such that it differed between samples. Then SVD or TD was applied to the shuffled $x_{ik}$ or $x_{ijkm}$, and $u_{2i}$ and $u_{2k}$ from SVD and $u_{5i}$ from TD were generated; this was repeated one hundred times. The null distributions were composed of the generated singular value vectors, and P-values were computed against them.
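The shuffling scheme described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' R code: the function names are hypothetical, the per-gene `statistic` is supplied by the caller (in the study it is a singular value vector recomputed by SVD or TD), and the two-sided empirical P-value is one plausible way to compare observed values against the pooled null values.

```python
import bisect
import random

def shuffle_within_samples(X, rng):
    """Return a copy of X (rows = genes, columns = samples) in which the gene
    order is permuted independently inside every sample (column)."""
    n_genes, n_samples = len(X), len(X[0])
    shuffled = [[0.0] * n_samples for _ in range(n_genes)]
    for j in range(n_samples):
        column = [X[i][j] for i in range(n_genes)]
        rng.shuffle(column)
        for i in range(n_genes):
            shuffled[i][j] = column[i]
    return shuffled

def shuffled_null(X, statistic, n_shuffles=100, seed=0):
    """Pool the per-gene statistic over n_shuffles shuffled copies of X."""
    rng = random.Random(seed)
    null_values = []
    for _ in range(n_shuffles):
        null_values.extend(statistic(shuffle_within_samples(X, rng)))
    return null_values

def empirical_p_values(observed, null_values):
    """Two-sided empirical P-value: the fraction of null values whose absolute
    value is at least as large as that of the observed value."""
    null_abs = sorted(abs(v) for v in null_values)
    total = len(null_abs)
    return [(total - bisect.bisect_left(null_abs, abs(o))) / total
            for o in observed]
```

Because every column is permuted separately, each sample keeps its own distribution of expression values while any consistent association between a gene and the sample labels is destroyed.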

Fig 6

Histogram of raw P-values computed using the null distribution generated by shuffling when the second data set was considered.

(A) All miRNAs (B) All mRNAs.

Results

Fig 7 shows the work flow of this study. In PP, the projection direction is predefined by $y$ in a supervised manner, whereas if we do not want to set projection directions in advance, we can use those determined by PCA or TD, which we call unsupervised FE. PCA and TD have some advantages that are not shared with PP. For example, projection directions not related to the label may carry additional information; in that case, PCA and TD can capture what PP cannot. PCA and TD are also applicable even if a predefined $y$ is not provided. Thus, PCA and TD have the potential to be applied to a wider range of data sets than PP.

Fig 7

Discussion of work flow used in this study.

Tensor decomposition (HOSVD) was applied to tensors, and P-values were attributed to genes using the obtained singular value vectors, which were assumed to obey a Gaussian distribution. The genes associated with adjusted P-values less than 0.01 were selected. P-values were also computed by shuffling, and the genes associated with adjusted P-values less than 0.1 coincided well with the genes selected by HOSVD. The correspondence between singular value vectors and K-means applied to unfolded matrices is also discussed.

PCA-based unsupervised FE

Before starting to rationalize PCA- and TD-based unsupervised FE, we briefly summarize how they work. The purpose of PCA- and TD-based unsupervised FE is to select biologically sound features (typically genes) based on the given omics data, such as gene expression profiles, in an unsupervised manner. In this subsection, we introduce PCA-based unsupervised FE; TD-based unsupervised FE is an advanced version of PCA-based unsupervised FE and will be introduced in the next subsection. Suppose that we have gene expression data in a matrix form, $X \in \mathbb{R}^{N \times M}$ with elements $x_{ij}$, for N genes measured across M samples. First, we need to standardize X as $\sum_i x_{ij} = 0$ and $\sum_i x_{ij}^2 = N$, as we will attribute principal component (PC) scores to genes whereas PC loadings will be attributed to samples. The ℓth PC score attributed to the ith gene, $u_{\ell i}$, can be obtained as the ith component of the ℓth eigenvector, $u_\ell$, of the gram matrix $XX^{T} \in \mathbb{R}^{N \times N}$, where $X^{T}$ is the transpose of X, as $XX^{T} u_\ell = \lambda_\ell u_\ell$, where $\lambda_\ell$ is the ℓth eigenvalue. Further, the ℓth PC loading attributed to the jth sample, $v_{\ell j}$, can be obtained as the jth component of the vector $v_\ell$ defined as $v_\ell = X^{T} u_\ell$. Notably, $v_\ell$ is also an eigenvector of the covariance matrix, $X^{T}X \in \mathbb{R}^{M \times M}$, because $X^{T}X v_\ell = X^{T}XX^{T} u_\ell = \lambda_\ell X^{T} u_\ell = \lambda_\ell v_\ell$. PCA-based unsupervised FE works as follows. First, we need to identify the $v_\ell$ of interest. The $v_\ell$ of interest depends on the problem. It might be the one coincident with the sample clusters, or the one with monotonic dependence on some external parameter such as time. After identifying the $v_\ell$ of interest, we attribute P-values to genes assuming that the components of the corresponding $u_\ell$ follow a normal distribution, $P_i = P_{\chi^2}\left[ > \left( u_{\ell i} / \sigma_\ell \right)^2 \right]$, where $P_{\chi^2}[> x]$ is the cumulative χ2 distribution for the argument being larger than x and $\sigma_\ell$ is the standard deviation. The computed P-values are adjusted based on the BH criterion [3], and features associated with adjusted P-values less than a specified threshold value are selected. The reason why such a simple procedure works properly is explained later.
Finally, we would like to emphasize the equivalence between singular value decomposition (SVD) and PCA. Suppose we have the SVD of X as $X = U \Lambda V^{T}$, where Λ is the diagonal matrix of singular values. It is straightforward to show $XX^{T} U = U \Lambda^{2}$ and $X^{T}X V = V \Lambda^{2}$, where $U = (u_1, u_2, \cdots, u_M)$ and $V = (v_1, v_2, \cdots, v_M)$, so that the eigenvalues $\lambda_\ell$ are the squared singular values. Thus, SVD and PCA are mathematically equivalent problems.
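The P-value attribution and BH adjustment described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' R code: the function names are hypothetical, the χ2 distribution is taken to have one degree of freedom (a single squared, normalized component), and σ is computed as the population standard deviation of the components of the chosen singular value vector.

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability P[> x] of a chi-squared variable with 1 degree
    of freedom, using P(Z^2 > x) = erfc(sqrt(x / 2)) for standard normal Z."""
    return math.erfc(math.sqrt(x / 2.0))

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def select_features(u, threshold=0.01):
    """Indices i whose BH-adjusted P-value, computed from (u_i / sigma)^2,
    falls below the threshold. `u` holds the components of one vector."""
    n = len(u)
    mean = sum(u) / n
    sigma = math.sqrt(sum((v - mean) ** 2 for v in u) / n)
    pvals = [chi2_sf_1df((v / sigma) ** 2) for v in u]
    adjusted = bh_adjust(pvals)
    return [i for i, p in enumerate(adjusted) if p < threshold]
```

With this construction, only components lying far out in the tails of the vector, relative to its overall spread, receive small P-values and are selected.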

TD-based unsupervised FE

TD-based unsupervised FE works quite similarly to PCA-based unsupervised FE. Instead of PCA, we apply TD to a tensor, $x_{ikm} \in \mathbb{R}^{N \times K \times M}$, that is, for example, the expression of the ith gene measured in the kth tissue of the mth person (even though we consider a three-mode tensor here, extension to higher-mode tensors is straightforward). To obtain TD, we employ the higher-order singular value decomposition [3] (HOSVD) as $x_{ikm} = \sum_{\ell_1=1}^{K} \sum_{\ell_2=1}^{M} \sum_{\ell_3=1}^{N} G(\ell_1 \ell_2 \ell_3) u_{\ell_1 k} u_{\ell_2 m} u_{\ell_3 i}$, where $G \in \mathbb{R}^{K \times M \times N}$ is a core tensor, and $u_{\ell_1 k} \in \mathbb{R}^{K \times K}$, $u_{\ell_2 m} \in \mathbb{R}^{M \times M}$, and $u_{\ell_3 i} \in \mathbb{R}^{N \times N}$ are singular value matrices. After identifying the $u_{\ell_1 k}$ and $u_{\ell_2 m}$ of interest, for instance, those representing tissue specific expression and the distinction between healthy controls and patients, we seek the ℓ3 associated with the G(ℓ1ℓ2ℓ3) having the largest absolute value given ℓ1 and ℓ2. Then, using the identified ℓ3, we attribute P-values to the ith feature as in the case of PCA-based unsupervised FE, $P_i = P_{\chi^2}\left[ > \left( u_{\ell_3 i} / \sigma_{\ell_3} \right)^2 \right]$, where $\sigma_{\ell_3}$ is the standard deviation. The computed P-values are adjusted based on the BH criterion, and features associated with adjusted P-values less than a specified threshold value are selected. The reason why such a simple procedure works properly is explained later.
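The step of choosing ℓ3 from the core tensor can be illustrated with a short Python sketch. It assumes that the core tensor has already been computed by some HOSVD routine and is stored as a nested list `core[l1][l2][l3]`; the function name and the storage layout are hypothetical choices for this example.

```python
def select_gene_component(core, l1, l2):
    """Return the index l3 that maximises |G(l1, l2, l3)| for fixed l1, l2.

    `core` is a nested list indexed as core[l1][l2][l3]. Indices are 0-based
    here, whereas the text counts components from 1.
    """
    fibre = core[l1][l2]
    return max(range(len(fibre)), key=lambda l3: abs(fibre[l3]))
```

The singular value vector attributed to genes with the returned index is then the one used to compute P-values.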

Rationalization of PCA- and TD-based unsupervised FE

To explain why PCA- and TD-based unsupervised FE work rather well, we consider two recent works [12, 13], in which the superiority of PCA- and/or TD-based unsupervised FE over conventional statistical methods was shown; in these studies, conventional statistical test-based methods failed to select a reasonable number of genes whereas TD-based unsupervised FE successfully selected a biologically reasonable restricted number of genes. In the first study [12], two independent sets of data including the mRNA and miRNA expression of kidney cancer and normal kidney were analyzed in an integrated manner using PCA as well as TD-based unsupervised FE.

The first data set

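The first data set comprised M = 324 samples including 253 kidney tumors and 71 normal kidney tissues. The expression of N mRNAs and K miRNAs was formatted as matrices $x_{ij} \in \mathbb{R}^{N \times M}$ and $x_{kj} \in \mathbb{R}^{K \times M}$, respectively. The three-mode tensor was generated as $x_{ijk} = x_{ij} x_{kj} \in \mathbb{R}^{N \times M \times K}$. As the data were too large to be loaded into the memory available in a standard stand-alone server, it was impossible to obtain the TD $x_{ijk} = \sum_{\ell_1 \ell_2 \ell_3} G(\ell_1 \ell_2 \ell_3) u_{\ell_1 i} u_{\ell_2 j} u_{\ell_3 k}$ directly. Instead, we generated $x_{ik} = \sum_j x_{ijk} = \sum_j x_{ij} x_{kj}$, and SVD was applied to $x_{ik}$ as $x_{ik} = \sum_\ell \lambda_\ell u_{\ell i} u_{\ell k}$ to obtain $u_{\ell_1 i}$ and $u_{\ell_3 k}$ approximately. The missing singular value vectors attributed to mRNA and miRNA samples were approximately recovered using the equations $v^{\mathrm{mRNA}}_{\ell j} = \sum_i x_{ij} u_{\ell i}$ and $v^{\mathrm{miRNA}}_{\ell j} = \sum_k x_{kj} u_{\ell k}$, respectively. Although we do not intend to insist that these approximations are precise enough, we decided to employ them since they turned out to work well empirically. After investigating the obtained $v^{\mathrm{mRNA}}_{\ell j}$ and $v^{\mathrm{miRNA}}_{\ell j}$, we realized that ℓ1 = ℓ3 = 2 are coincident with the distinction between tumors and normal tissues; therefore, we attributed P-values to mRNAs and miRNAs using $u_{2i}$ and $u_{2k}$, respectively, with the equations $P_i = P_{\chi^2}\left[ > \left( u_{2i} / \sigma_2 \right)^2 \right]$ and $P_k = P_{\chi^2}\left[ > \left( u_{2k} / \sigma'_2 \right)^2 \right]$, where $\sigma_2$ and $\sigma'_2$ are the respective standard deviations. These P-values were corrected by the BH criterion, and we selected 72 mRNAs and 11 miRNAs associated with adjusted P-values less than 0.01.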

The second data set

The second data set comprised M = 34 samples including 17 kidney tumors and 17 normal kidney tissues. The same procedures applied to the first data set were also applied to the second data set, and we selected 209 mRNAs and 3 miRNAs associated with adjusted P-values less than 0.01. Although various biological evaluations were performed for the mRNAs and miRNAs selected using the first data set, the most remarkable achievement was that all three miRNAs selected using the second data set were included in the 11 miRNAs selected using the first data set, and there were as many as 11 mRNAs commonly selected between the first and second data sets. Considering that several hundred miRNAs and a few tens of thousands of mRNAs are available, these overlaps are a great achievement, as these two data sets are completely independent of each other.

Comparisons with PP

To understand why such simple procedures can work well in the framework of PP, we replaced the singular value vectors attributed to samples with projections. For this, we applied PP as mentioned above, using a label vector $y_j$, Eq (27), that takes one value for normal tissues and another for cancer samples, where $M_N$ and $M_C$ are the numbers of normal tissues and cancer samples, respectively, and $M_N + M_C = M$. Then we applied PP as $b_i = \sum_j x_{ij} y_j$ and $b_k = \sum_j x_{kj} y_j$. Since the $b_i$ and $b_k$ are expected to play the roles of $u_{2i}$ and $u_{2k}$ in Eqs (25) and (26), respectively, we used the absolute values of $b_i$ and $b_k$ to select mRNAs and miRNAs that are presumably coincident with the distinction between tumors and normal tissues. P-values were attributed to mRNAs and miRNAs as $P_i = P_{\chi^2}\left[ > \left( b_i / \sigma_b \right)^2 \right]$ (Eq (30)) and $P_k = P_{\chi^2}\left[ > \left( b_k / \sigma'_b \right)^2 \right]$ (Eq (31)), where $\sigma_b$ and $\sigma'_b$ are the standard deviations of $b_i$ and $b_k$. These P-values were corrected by the BH criterion, and we selected 78 mRNAs and 13 miRNAs for the first data set and 194 mRNAs and 3 miRNAs for the second data set, associated with adjusted P-values less than 0.01. To estimate the coincidence of the selected genes between TD and PP, Tables 1–4 list the comparisons between TD-based unsupervised FE and PP, Eqs (30) or (31); they demonstrate a high coincidence. Fig 4 shows the comparisons of $P_i$ and $P_k$ between TD-based unsupervised FE and PP, Eqs (30) or (31). Both the smaller P-values used for gene selection and the overall distributions of P-values coincide between TD-based unsupervised FE and PP, Eqs (30) or (31).
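The projection step itself is simple enough to sketch directly. The following Python illustration assumes a particular scaling of the label vector, +1/n for one class and −1/n' for the other, so that the components sum to zero and each projection equals the difference between the two class means; this scaling and the function names are assumptions for the example and are not taken from Eq (27).

```python
def project_onto_labels(X, labels):
    """b_i = sum_j x_ij * y_j for a two-class label vector y with sum_j y_j = 0.

    X: rows = features, columns = samples; labels: 0 or 1 for every sample.
    """
    n1 = sum(labels)
    n0 = len(labels) - n1
    # Assumed scaling: +1/n1 for class 1 and -1/n0 for class 0, so that b_i is
    # the difference between the two class means of feature i.
    y = [1.0 / n1 if c == 1 else -1.0 / n0 for c in labels]
    return [sum(x * w for x, w in zip(row, y)) for row in X]

def rank_by_projection(b):
    """Feature indices ordered by decreasing |b_i|."""
    return sorted(range(len(b)), key=lambda i: -abs(b[i]))
```

No matrix inversion is involved, so the computation remains feasible however much the number of features exceeds the number of samples.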

Table 1

Confusion matrix of selected mRNAs between TD-based unsupervised FE and PP in the first data set.

P-value computed by Fisher’s exact test is 1.90 × 10−149.

		PP
		adjusted P_i > 0.01	adjusted P_i < 0.01
TD based unsupervised FE	adjusted P_i > 0.01	19447	17
	adjusted P_i < 0.01	11	61

Table 4

Confusion matrix of selected miRNAs between TD based unsupervised FE and PP in the second data set.

P-value computed by Fisher’s exact test is 1.87 × 10−7.

		PP
		adjusted P_k > 0.01	adjusted P_k < 0.01
TD based unsupervised FE	adjusted P_k > 0.01	316	0
	adjusted P_k < 0.01	0	3
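The Fisher's exact test P-values quoted for these confusion matrices can be checked with a short, self-contained Python sketch. The two-sided convention used here (summing the probabilities of all tables that are no more likely than the observed one, with a small relative tolerance) and the function name are assumptions for the example; the text does not state which variant of the test was used.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is at most as probable as the observed one.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))
```

For Table 4, `fisher_exact_two_sided(316, 0, 0, 3)` evaluates to 1/C(319, 3) = 1/5359519, approximately 1.87 × 10−7, which agrees with the value given in the caption.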

Table 2

Confusion matrix of selected miRNAs between TD-based unsupervised FE and PP in the first data set.

P-value computed by Fisher’s exact test is 2.76 × 10−23.

Table 3

Confusion matrix of selected mRNAs between TD-based unsupervised FE and PP in the second data set.

P-value computed by Fisher’s exact test is 0.0 within numerical accuracy (i.e., smaller than the smallest number representable at the given numerical accuracy).

Equivalence between K-means and PCA

To understand these excellent and unexpected coincidences between TD-based unsupervised FE and PP, we first consider the relationship between PCA and PP and later relate it to TD. PCA is known to be equivalent to K-means [3]; the space spanned by the centroids of optimal sample clusters can be reproduced by the PC scores attributed to the features. Suppose that we have $x_{ij} \in \mathbb{R}^{N \times M}$, which is the value of the ith feature of the jth sample. The M samples are supposed to be clustered into S clusters. The centroid of the sth cluster, $m_s \in \mathbb{R}^{N}$, is defined as $m_s = \frac{1}{n_s} \sum_{j \in C_s} x_j$, where $x_j = (x_{1j}, x_{2j}, \cdots, x_{Nj})^{T}$, $C_s$ is the set of js that belong to the sth cluster, and $n_s$ is the size of the sth cluster. Here we define the projection of any vector $x$ onto the centroid subspace as $S_b x$, where $S_b = \sum_s n_s m_s \otimes m_s$ and ⊗ is the Kronecker product. $S_b$ is also known to be represented as $S_b = X \left( \sum_s h_s h_s^{T} \right) X^{T}$, where $h_s \in \mathbb{R}^{M}$ is the vector whose jth component is $n_s^{-1/2}$ if the jth sample belongs to the sth cluster and zero otherwise; that is, it takes non-zero values only when the jth sample belongs to the sth cluster. K-means is an algorithm to find clusters that minimize $J = \sum_s \sum_{j \in C_s} | x_j - m_s |^2$. Minimization of J is known to be equivalent to the maximization of Tr $S_b$, the trace of the matrix $S_b$. It is known that $S_b = \sum_\ell \lambda_\ell u_\ell u_\ell^{T}$, where $u_\ell$ is the vector whose components are the ℓth PC scores attributed to the features and is an eigenvector of the gram matrix, $XX^{T} u_\ell = \lambda_\ell u_\ell$. If we compare Eq (35) with Eq (38), we notice that $h_s$ corresponds to $v_\ell$, and PCA can give us an optimal centroid subspace, $S_b$, even without obtaining the clusters by K-means, i.e., in a fully unsupervised manner. First, when the clusters are the solution of K-means, the centroid subspace can be represented by the PC score $u_\ell$, which can also be expressed by $X v_\ell$. $h_s$ is clearly coincident with y defined in Eq (27). This means that PP employing $y$ as $b = Xy$ should result in projection onto the centroid subspace when y is coincident with the clusters. Here we define y, Eq (27), such that it represents the distinction between tumors and normal tissues, which should be detected by K-means. This explains why TD-based unsupervised FE works well and why PP can be replaced with TD-based unsupervised FE.
To our knowledge, this is the first rationalization of why TD- and PCA-based unsupervised FE work well. One might wonder whether the above explanation, which concerns PCA, is applicable when TD was what was applied to the first and second data sets. This gap can be explained as follows. The tensor $x_{ijk}$ was generated as the product of $x_{ij}$ and $x_{kj}$. Suppose these two are decomposed as $x_{ij} = \sum_\ell \lambda_\ell u_{\ell i} v_{\ell j}$ and $x_{kj} = \sum_\ell \lambda'_\ell u'_{\ell k} v'_{\ell j}$. If $v_{\ell j} = v'_{\ell j}$, then, because the $v_\ell$ are orthonormal, $x_{ik} = \sum_j x_{ij} x_{kj} = \sum_\ell \lambda_\ell \lambda'_\ell u_{\ell i} u'_{\ell k}$. This means that if $v_{\ell j} = v'_{\ell j}$, the SVD of $x_{ik}$ gives the $u_{\ell i}$ and $u'_{\ell k}$ that are obtained when SVD is applied to $x_{ij}$ and $x_{kj}$. Here $v^{\mathrm{mRNA}}_{\ell j}$ is highly correlated with $v^{\mathrm{miRNA}}_{\ell j}$ [12]. This is coincident with the requirement $v_{\ell j} = v'_{\ell j}$. As SVD is equivalent to PCA, this might explain why TD-based unsupervised FE works well even though the above rationalization is applied only to PCA.
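The two representations of the between-cluster scatter matrix, and the link between the K-means objective and its trace, can be checked numerically on a toy example. The sketch below is an illustration only: it assumes the standard form of the scatter matrix from the K-means/PCA literature, the function names are hypothetical, and clusters are given as lists of column indices.

```python
def between_cluster_scatter(X, clusters):
    """sum_s n_s m_s m_s^T built from the cluster centroids m_s.
    X: rows = features, columns = samples."""
    n = len(X)
    S = [[0.0] * n for _ in range(n)]
    for members in clusters:
        size = len(members)
        centroid = [sum(X[i][j] for j in members) / size for i in range(n)]
        for a in range(n):
            for b in range(n):
                S[a][b] += size * centroid[a] * centroid[b]
    return S

def scatter_from_indicators(X, clusters):
    """X (sum_s h_s h_s^T) X^T with h_s = 1/sqrt(n_s) on members of cluster s."""
    n = len(X)
    S = [[0.0] * n for _ in range(n)]
    for members in clusters:
        scale = len(members) ** -0.5
        Xh = [scale * sum(X[i][j] for j in members) for i in range(n)]  # X h_s
        for a in range(n):
            for b in range(n):
                S[a][b] += Xh[a] * Xh[b]
    return S

def kmeans_objective(X, clusters):
    """J = sum_s sum_{j in C_s} |x_j - m_s|^2."""
    n = len(X)
    total = 0.0
    for members in clusters:
        size = len(members)
        centroid = [sum(X[i][j] for j in members) / size for i in range(n)]
        total += sum((X[i][j] - centroid[i]) ** 2
                     for j in members for i in range(n))
    return total
```

On any input, the two scatter matrices agree and J equals the total sum of squares of X minus the trace of the scatter matrix, so minimizing J over cluster assignments is the same as maximizing that trace.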

The third data set

Next, we would like to extend the above discussion to TD. Therefore, we consider a third data set analyzed in another study [13], where we performed in silico drug discovery for SARS-CoV-2 by applying TD-based unsupervised FE to the gene expression profiles of human cell lines infected with SARS-CoV-2. The third data set comprises five cell lines infected with either mock (control) or SARS-CoV-2, with three biological replicates each. It is formatted as a tensor, $x_{ijkm} \in \mathbb{R}^{N \times 5 \times 2 \times 3}$, that represents the expression of the ith gene of the jth cell line from the infected (k = 1) or control (k = 2) group in the mth biological replicate. HOSVD was applied to $x_{ijkm}$ and we got $x_{ijkm} = \sum_{\ell_1=1}^{5} \sum_{\ell_2=1}^{2} \sum_{\ell_3=1}^{3} \sum_{\ell_4=1}^{N} G(\ell_1 \ell_2 \ell_3 \ell_4) u_{\ell_1 j} u_{\ell_2 k} u_{\ell_3 m} u_{\ell_4 i}$. In this study, we selected ℓ1 = 1, ℓ2 = 2, ℓ3 = 1 based on biological discussions. We then realized that the core tensor element with ℓ4 = 5 has the largest absolute value given ℓ1 = 1, ℓ2 = 2, ℓ3 = 1. Thus, $u_{5i}$ was used to attribute P-values to gene i using $P_i = P_{\chi^2}\left[ > \left( u_{5i} / \sigma_5 \right)^2 \right]$, and the obtained P-values were corrected using the BH criterion; further, 163 genes associated with adjusted P-values less than 0.01 were selected. We now relate TD to the above discussion about PCA. Because of the HOSVD algorithm, $u_{\ell_4 i}$ can also be obtained by applying SVD to the unfolded matrix, $x_{i(jkm)} \in \mathbb{R}^{N \times 30}$. Here each of the 30 columns corresponds to one of the 30 combinations of j, k, m. We selected ℓ1 = 1, ℓ2 = 2, ℓ3 = 1 so that the gene expression is independent of the cell lines and biological replicates and has opposite signs between the control and infected cells. Thus, two clusters are expected, each of which corresponds to either the control or the infected cell lines. The reason why ℓ4 = 5 is selected is simply that $u_{5i}$ spans the centroid subspace coincident with these two clusters. Thus, in this sense, the above discussion about PCA can be directly applied to this result. To confirm this, $y_{jkm}$ was defined such that it depends only on k and represents the distinction between k = 1 and k = 2 (i.e., that between infected and control cell lines).
Then PP was performed as $b_i = \sum_{jkm} x_{ijkm} y_{jkm}$. P-values were attributed to genes as $P_i = P_{\chi^2}\left[ > \left( b_i / \sigma_b \right)^2 \right]$, and 155 genes associated with corrected P-values less than 0.01 were selected, where $b_i$ is expected to play the role of $u_{5i}$ in Eq (46). Table 5 shows the high coincidence of selected genes between TD-based unsupervised FE and PP. Fig 5 shows the overall coincidence of the distributions of P-values between TD-based unsupervised FE and PP. Thus, the reason why TD-based unsupervised FE works well is explained by the ability of singular value vectors to generate a centroid subspace of clusters coincident with control and infected cell lines.
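The unfolding of the four-mode tensor into an N × 30 matrix mentioned above can be sketched as follows. The nested-list layout `x[i][j][k][m]`, the function name, and the column order (m varying fastest) are assumptions for this example; any fixed column order gives the same singular value vectors attributed to genes, because reordering columns only permutes the components of the sample-side vectors.

```python
def unfold_gene_mode(x):
    """Unfold x[i][j][k][m] into a matrix with one row per gene i and one
    column per (j, k, m) combination, with m varying fastest."""
    return [[value
             for cell_line in gene
             for group in cell_line
             for value in group]
            for gene in x]
```

For a tensor with 5 cell lines, 2 groups, and 3 replicates, every returned row has 5 × 2 × 3 = 30 entries.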

Table 5

Confusion matrix of selected genes between TD-based unsupervised FE and PP in the third data set.

P-value computed by Fisher’s exact test is 1.40 × 10−241.

		PP
		adjusted P_i > 0.01	adjusted P_i < 0.01
TD based unsupervised FE	adjusted P_i > 0.01	21582	52
	adjusted P_i < 0.01	60	103

One might wonder why TD is needed if $u_{5i}$ can be computed by applying SVD to the unfolded matrix. To understand this, we compared $v_{5(jkm)}$, which is obtained by applying SVD to the unfolded matrix and corresponds to $u_{5i}$, as well as $u_{1j}u_{2k}u_{1m}$, with $y_{jkm}$. While $u_{1j}u_{2k}u_{1m}$ coincides well with $y_{jkm}$, $v_{5(jkm)}$ does not (Fig 8). Thus, we need to apply TD to $x_{ijkm}$ to obtain singular value vectors attributed to samples that are coincident with the two clusters; these cannot be obtained when SVD is applied to the unfolded matrix.

Fig 8

Comparisons between $y_{jkm}$ and either $v_{5(jkm)}$ or $u_{1j}u_{2k}u_{1m}$.

Red straight lines indicate linear regressions.

Rationalization of threshold P-values

As we have shown that TD as well as PCA are equivalent to PP that aims to maximize projection onto the centroid subspace of clusters coincident with the desired distinction (cancer vs. normal tissue or control vs. infected cell lines), we would next like to rationalize the P-values computed by the χ2 distribution and the threshold value of P = 0.01, which have long been employed to select DEGs with PCA- and TD-based unsupervised FE. Because the distribution of projections in the limit of an infinite number of samples is proven to be always Gaussian [6], this null hypothesis might seem reasonable. Nonetheless, the distribution of individual gene expression is far from Gaussian and is rather close to a negative binomial distribution, and when the number of samples is not large enough, the distribution of projections does not converge to a Gaussian distribution at all. Thus, a more straightforward rationalization is needed. Therefore, we generated a null distribution by shuffling i in each sample and recomputed the singular value vectors $u_{2i}$ (for mRNA in the first and the second data sets), $u_{2k}$ (for miRNA in the first and the second data sets), and $u_{5i}$ (for genes in the third data set). Then P-values were recomputed using the generated null distribution and were corrected using the BH criterion to obtain genes associated with significant adjusted P-values. In the following, we apply the shuffling to the first, second, and third data sets and select genes using the P-values obtained by shuffling. The coincidence of the selected genes and of the distributions of P-values between PCA or TD and shuffling is then estimated. These evaluations enable us to discuss the suitability of the threshold P-values. Fig 1(A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when the miRNAs in the first data set were considered.
As there are clearly too many P-values near 1, we excluded some miRNAs with low expression to obtain a P-value distribution more coincident with the null distribution. Fig 1(B) shows the histogram of raw P-values restricted to the top 500 most expressive miRNAs; this seems more coincident with the null distribution. We then found that twelve miRNAs are associated with adjusted P-values less than 0.1. Table 6 lists the comparison of selected miRNAs between TD-based unsupervised FE and the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected miRNAs are quite coincident. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE as it often gave us biologically reasonable results. P = 0.01 when a Gaussian distribution is assumed as the null hypothesis corresponds to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy remains to be explained in the future, we conclude that their performances are quite similar.

Table 6

Confusion matrix of selected miRNAs between TD-based unsupervised FE and shuffling in the first data set.

P-value computed by Fisher’s exact test is 1.28 × 10−21.

		shuffling
		adjusted P_k > 0.1	adjusted P_k < 0.1
TD based unsupervised FE	adjusted P_k > 0.01	488	1
	adjusted P_k < 0.01	0	11

Fig 2(A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when mRNAs in the first data set were considered. As there are clearly too many P-values near 1, we excluded some mRNAs with low expression to obtain a P-value distribution more coincident with the null distribution. Fig 2(B) shows the histogram of raw P-values restricted to the top 3000 most expressive mRNAs; this seems more coincident with the null distribution. We then found that 69 mRNAs are associated with adjusted P-values less than 0.1. Table 7 lists the comparison of selected mRNAs between TD-based unsupervised FE and the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected mRNAs are quite coincident. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE as it often gave us biologically reasonable results. P = 0.01 when a Gaussian distribution is assumed as the null hypothesis corresponds to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy remains to be explained in the future, we conclude that their performances are quite similar.

Table 7

Confusion matrix of selected mRNAs between TD-based unsupervised FE and shuffling in the first data set.

P-value computed by Fisher’s exact test is 2.69 × 10^−137.

TD-based unsupervised FE (rows) vs. shuffling (columns)	adjusted P_i > 0.1	adjusted P_i < 0.1
adjusted P_i > 0.01	2928	0
adjusted P_i < 0.01	3	69

Fig 6(A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when miRNAs in the second data set were considered. Because significant P-values were unlikely to be obtained from this distribution, we did not select any miRNAs. Fig 6(B) shows the histogram of raw P-values computed for mRNAs in the second data set; there are no peaks around P = 1. We then found that 262 mRNAs are associated with adjusted P-values less than 0.1. Table 8 compares the mRNAs selected by TD-based unsupervised FE with those selected using the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected mRNAs coincide well. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE because it often gave biologically reasonable results. Thus, P = 0.01 under the null hypothesis of a Gaussian distribution corresponds empirically to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy must be explained in the future, we conclude that the performances of the two criteria are quite similar.

Table 8

Confusion matrix of selected mRNAs between TD-based unsupervised FE and shuffling in the second data set.

P-value computed by Fisher’s exact test is 0.0 within numerical accuracy (i.e., smaller than the possible smallest number given numerical accuracy).

TD-based unsupervised FE (rows) vs. shuffling (columns)	adjusted P_i > 0.1	adjusted P_i < 0.1
adjusted P_i > 0.01	33736	53
adjusted P_i < 0.01	0	209

Fig 3(A) shows the histogram of raw P-values computed using the null distribution generated by shuffling one hundred times when the genes in the third data set were considered. Because there were too many P-values less than 0.2, we excluded mRNAs with low expression to obtain a P-value distribution closer to the null distribution. Fig 3(B) shows the histogram of raw P-values when the analysis is restricted to the 2780 most highly expressed mRNAs; this histogram is closer to the null distribution. We then found that 48 mRNAs are associated with adjusted P-values less than 0.1. Table 9 compares the mRNAs selected by TD-based unsupervised FE with those selected using the null distribution generated by shuffling. Although the threshold P-values differ between the two, the selected mRNAs coincide well. A threshold P-value of 0.01 was empirically employed for PCA- and TD-based unsupervised FE because it often gave biologically reasonable results. Thus, P = 0.01 under the null hypothesis of a Gaussian distribution corresponds empirically to P = 0.1 when the null distribution is generated by shuffling. Although this discrepancy must be explained in the future, we conclude that the performances of the two criteria are quite similar.

Table 9

Confusion matrix of selected genes between TD-based unsupervised FE and shuffling in the third data set.

P-value computed by Fisher’s exact test is 5.00 × 10^−63.

TD-based unsupervised FE (rows) vs. shuffling (columns)	adjusted P_i > 0.1	adjusted P_i < 0.1
adjusted P_i > 0.01	2617	0
adjusted P_i < 0.01	115	48


Discussion

In the previous section, we explained why PCA- and TD-based unsupervised FE work well (because singular value vectors correspond to the projection onto the centroid subspace obtained by K-means). We also showed that the criterion of selecting genes with adjusted P-values less than 0.01, computed under the null hypothesis that singular value vectors obey a Gaussian distribution, coincides empirically with the criterion of selecting genes with adjusted P-values less than 0.1, computed using the null distribution generated by shuffling. Several points remain to be discussed.

In the above examples, we dealt only with the case in which two clusters are distinguished in a one-dimensional space (i.e., by a single singular value vector). Cases with more clusters are more challenging: projections onto centroid subspaces do not have a one-to-one correspondence with singular value vectors, because the coincidence holds only between the spaces they span and not between the individual vectors. Despite this, TD- and PCA-based unsupervised FE applied to more than two classes is known to work as well as in the two-cluster case [16]. Moreover, although we could discuss only cases with a finite number of clusters, PCA- and TD-based unsupervised FE are also known to detect parameter dependence, e.g., time development [17, 18]. Extending the present discussion to regression analysis without any clusters will be the next step.

One might also wonder whether TD is needed if the singular value vectors attributed to genes are common to TD and PCA. First, in the integrated analysis of mRNA and miRNA, TD-based unsupervised FE outperformed PCA-based unsupervised FE [12]. Similarly, TD-based unsupervised FE outperformed PCA-based unsupervised FE in the integrated analysis of gene expression and DNA methylation [19].
Thus, TD-based unsupervised FE is required when integrated analysis is the goal. Even when no integrated analysis is intended, TD-based unsupervised FE can give singular value vectors that coincide better with biological clusters (Fig 8). Thus, although the singular value vectors attributed to genes are apparently the same in TD and PCA, TD-based unsupervised FE is a more useful strategy than PCA-based unsupervised FE.

Although we did not state this explicitly, conventional gene selection strategies based on statistical tests are known to fail when applied to the first, second, and third data sets [12, 13]; they always selected too many or too few genes, mRNAs, and miRNAs, in contrast to TD-based unsupervised FE, which always selected a restricted number of genes, from tens to hundreds.

One might also wonder why we did not employ the null distribution generated by shuffling instead of the unjustified Gaussian distribution in PCA- and TD-based unsupervised FE. As can be seen above, employing the null distribution generated by shuffling is not straightforward; in some cases, e.g., the first and third data sets, we needed to exclude lowly expressed genes manually, whereas this was not required for the second data set. With the null distribution generated by shuffling, no miRNAs expressed significantly differently between controls and cancers were detected in the second data set. In addition, the number of lowly expressed genes to be removed cannot be decided uniquely. In contrast, the criterion of selecting genes with adjusted P-values less than 0.01, under the null hypothesis that singular value vectors obey a Gaussian distribution, is more robust; it often yields a restricted number of genes without excluding lowly expressed genes. Although why this works so well must be explored in the future, it is empirically a more useful strategy than the null distribution generated by shuffling.
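
The Gaussian-null criterion discussed above can be sketched as follows. This is only an illustration under stated assumptions: the two-sided Gaussian tail probability, the estimation of the mean and standard deviation from the vector itself, and the Benjamini-Hochberg adjustment are assumptions, since this section says only that the P-values are "adjusted". The function names and the synthetic vector `u` are hypothetical.

```python
import math
from statistics import NormalDist

def gaussian_two_sided_p(values):
    """Two-sided Gaussian tail probability of each component (assumed null)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    # erfc keeps precision for very small tail probabilities
    return [math.erfc(abs(v - mean) / (sd * math.sqrt(2.0))) for v in values]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Synthetic stand-in for a singular value vector attributed to genes:
# 990 components on a Gaussian quantile grid plus 10 large-magnitude components.
null_part = [NormalDist().inv_cdf((i + 0.5) / 990) for i in range(990)]
u = null_part + [8.0] * 5 + [-8.0] * 5

adjusted = benjamini_hochberg(gaussian_two_sided_p(u))
selected = [i for i, p in enumerate(adjusted) if p < 0.01]
print(len(selected), "components selected at adjusted P < 0.01")
```

In this toy vector only the large-magnitude components pass the threshold, which mirrors the observation that the criterion tends to return a restricted number of genes without any prior filtering.
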
One may also wonder why we did not employ the centroid subspace, S, instead of singular value vectors, if the two are equivalent for optimal clusters and the meaning of the centroid subspace is easier to understand. First, we would need to apply K-means, which often fails on unbalanced data sets composed of clusters with very different numbers of samples. Second, K-means always identifies the primary cluster structure. In the case of SARS-CoV-2 (the third data set), however, the distinction between infected and control cell lines was detected by the fifth singular value vector, which K-means would probably neglect because its contribution is too small. In addition, singular value vectors can be computed in a fully unsupervised manner that does not require any labeling. Considering these advantages, it is reasonable to use singular value vectors instead of the centroid subspace despite the apparent usefulness of the latter. Further, because the y used to compute the projection is decided manually, the projection can be computed even if the biological features that y assumes, such as clusters, do not exist. This might lead to wrong conclusions. With singular value vectors, however, if there are no clusters at all, no singular value vector attributed to samples and coincident with y is obtained, which gives us an opportunity to notice the misunderstanding. Thus, using singular value vectors rather than the projection might be advantageous.

One might also wonder why other, more frequently used TDs, such as CP decomposition [3], were not employed instead of HOSVD. This can be understood as follows. In the above description, we could relate the singular value vectors obtained by HOSVD to the centroid subspace because the singular value vectors attributed to genes are common to HOSVD and PCA. This equivalence breaks down if HOSVD is replaced with other TDs.
When we invented TD-based unsupervised FE, we also tested other TDs [3], but HOSVD always outperformed them when used for feature selection. The equivalence of HOSVD and PCA might explain why HOSVD outperforms other popular TDs as a feature selection tool.

Another possible concern is that shuffling was performed only one hundred times for the computations in Figs 1 to 3, whereas we considered P-values equal to 0.01; this is not problematic, for two reasons. First, the P-values we considered were not raw P-values but corrected P-values. The total number of computed P-values is therefore much larger than one hundred: it equals the number of mRNAs or miRNAs, i.e., of the order of 10^3 or 10^4. Thus, the number of shuffles, one hundred, is not directly related to the P-value threshold of 0.01. Second, individual P-values do not depend on the number of shuffles; what we did was generate as many P-values as there are miRNAs or mRNAs, i.e., 10^3 or 10^4. Thus, individual P-values can be much smaller than 0.01, say 10^−3 and 10^−4 for miRNAs and mRNAs, respectively. Increasing or decreasing the number of shuffles does not affect the absolute values of the P-values. The number of shuffles affects only reproducibility: P-values computed from a single shuffle might fluctuate heavily, whereas P-values averaged over one hundred shuffles are expected to be more stable. The sole purpose of averaging over one hundred shuffles is the stability of the outcome. The apparent relationship between P = 0.01 and one hundred shuffles is therefore coincidental. In conclusion, using P = 0.01 as a threshold together with one hundred shuffles poses no problem.
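
The argument above can be illustrated with a toy sketch. It assumes one possible reading of the procedure: in each shuffle, every observed score is compared with a null set containing one value per gene, and the resulting P-values are averaged over shuffles. The Gaussian draws are a hypothetical stand-in for scores recomputed from shuffled data, and the function names are illustrative only.

```python
import random
from bisect import bisect_left
from statistics import pstdev

def empirical_p(observed, null_scores):
    """Fraction of null scores whose magnitude is >= that of each observed score."""
    sorted_null = sorted(abs(s) for s in null_scores)
    n = len(sorted_null)
    return [(n - bisect_left(sorted_null, abs(x))) / n for x in observed]

def averaged_p(observed, n_genes, n_shuffles, rng):
    """Average the per-shuffle empirical P-values over n_shuffles shuffles."""
    sums = [0.0] * len(observed)
    for _ in range(n_shuffles):
        # Hypothetical stand-in for scores recomputed after shuffling gene order
        null_scores = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for j, p in enumerate(empirical_p(observed, null_scores)):
            sums[j] += p
    return [s / n_shuffles for s in sums]

rng = random.Random(0)
n_genes = 500

# One shuffle: every P-value is a multiple of 1 / n_genes, whatever the shuffle count
single = averaged_p([2.0, 5.0], n_genes, 1, rng)

# Repeat the whole computation to compare the spread for 1 versus 100 shuffles
spread_1 = pstdev([averaged_p([2.0], n_genes, 1, rng)[0] for _ in range(10)])
spread_100 = pstdev([averaged_p([2.0], n_genes, 100, rng)[0] for _ in range(10)])
print(single, spread_1, spread_100)
```

In this sketch the resolution of a single-shuffle P-value is set by the number of null values (one per gene), while increasing the number of shuffles only reduces the run-to-run spread of the averaged P-value.
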
Based on the results presented above, we recommend PCA- or TD-based unsupervised FE because, in general, we do not know in advance the direction onto which the data should be projected. PCA and TD turned out to be able to provide the projection directions in an unsupervised manner. When the projection direction is trivial, e.g., the distinction between two classes, PCA and TD correctly recover it. For more complicated data sets, higher-mode tensors can be employed. PCA- and TD-based unsupervised methods are therefore promising.

23 Aug 2022

PONE-D-22-20332

Projection in genomic analysis: A theoretical basis to rationalize tensor decomposition and principal component analysis as feature selection tools

PLOS ONE

Dear Dr. Taguchi,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Please submit your revised manuscript by Oct 07 2022 11:59PM.

Kind regards,
Chi-Hua Chen, Ph.D.
Academic Editor
PLOS ONE

Journal Requirements:

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.
2. Please update your submission to use the PLOS LaTeX template.
3. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.
4. Thank you for stating the following in the Acknowledgments Section of your manuscript: "This work was supported by KAKENHI [grant numbers 19H05270, 20H04848, and 20K12067] to YT and Institutional Fund Project (IFPIP) from the Ministry of Education and King Abdulaziz University (DSR), Jeddah, Saudi Arabia [grant number IFPIP: 924-611-1442] to TT." We note that you have provided funding information that is not currently declared in your Funding Statement. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: "The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript."
5. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter.
6. Please review your reference list to ensure that it is complete and correct.

Reviewers' comments:

Reviewer's Responses to Questions

1. Is the manuscript technically sound, and do the data support the conclusions? Reviewer #1: Yes. Reviewer #2: Yes.
2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes. Reviewer #2: Yes.
3. Have the authors made all data underlying the findings in their manuscript fully available? Reviewer #1: Yes. Reviewer #2: No.
4. Is the manuscript presented in an intelligible fashion and written in standard English? Reviewer #1: Yes. Reviewer #2: Yes.
5. Review Comments to the Author

Reviewer #1: The paper analyzes the reason why the recently proposed principal component analysis (PCA) and tensor decomposition (TD)-based unsupervised feature extraction (FE) has often outperformed these statistical test-based methods in the context of projection pursuit that was proposed a long time ago. Some findings in this paper rationalize the success of PCA- and TD-based unsupervised FE for the first time. I have the following suggestions for this manuscript. Other comments can see the attached file.

Reviewer #2: General Comments: Is the paper new, technically correct, and relevant? Yes, the paper is new and technically sounds. Results somehow does support the methodology, but needed to be more cleared by the author in case of properties of the data. Is the paper well organized? The paper is properly organized, good literature review, suitable motivation and clear explanation on results are positive points to that. Is the abstract concise? Yes, but I think it needs to be rephrased after revision to add some comments about any artifacts or negative points in the method, if exist. Is the introduction motivating? Yes, Introduction section is motivating. Are the methodology, results, and conclusions completely developed? No, they need to be modified and developed according to the technical comments. Are there language, mathematics, reference, or style errors? There is no mathematical, reference or style error.

Technical Comments: Are the codes available for this research? As I found, there is no code available for this study, e. g. in Github. If the authors could make the codes available, the manuscript could be much better evaluated, not only for reviewers, but also for possible readers. When it is not possible to upload the code for public access, such as in Github, could they be provided for reviewer for better assessment of the study? The study is comprehensive and requires large time to be read carefully and being reviewed. The theoretical background has been well explained in details, and the experiments and related models are presented and the algorithm in Fig. 1 is also well presented. I think more explanation about the steps and the parameters in Fig. 1 is required. The result comparison parts are well organized and presented. The display way is good. But quantitative evaluation is somehow too much that one can get lost in that. I think it would be better that you add more explanation to that. How did you evaluate the final result? How did you consider to finally selection a methodology for the most complicate problem? What about when the models are more complex?

The introduction section is a nice one. It is architected very beautifully, while written fully academic and comprehend. I assume that any change in the introduction section is not necessary, but one of the important tasks after publishing a study is to increase its chance to be seen by the most possible number of researchers, so I would like to give two recommendations. First, to get your published study in the list of searched for papers based on keywords, I propose to increase variety of your keywords. In my viewpoint, they do not cover the whole topic of the study and are not widely searched words. I propose to add at least the keyword “data analysis”. Second, one of the methods in the publisher’s website that brings a publication on to the researchers, is based on the similar publications that they have read before. So, the more you cite similar publication, the more the chance that the search engine in the publisher website propose your paper to the researcher. Besides of that, it will also complete your introduction section. As another advantage, it rises new ideas to the researchers by combining various methods, or resolving drawback of one seen paper by reading the similar one, or extending the methodology to a fully automatic one. So, based on these points, I would like to ask to cite to the following similar publication in the manuscript which used PCA and feature selection for deep learning, but in different field of study. The first proposed publication is: Shahbazi, A., Soleimani Monfared, M., Thiruchelvam, V., Ka Fei, T., Babasafari, A.A., (2020). Integration of knowledge-based seismic inversion and sedimentological investigations for heterogeneous reservoir. Journal of Asian Earth Sciences. The second publication for citation is: Khayer, K., Kahoo, A.R., Soleimani Monfared, M., Tokhmechi, B., and Kavousi, K., (2022). Target-Oriented Fusion of Attributes in Data Level for Salt Dome Geobody Delineation in Seismic Data. Natural resource research, and the other publication could be: Khayer, K., Kahoo, A.R., Soleimani Monfared, M., and Kavouosi, K., (2022). Combination of seismic attributes using graph-based methods to identify the salt dome boundary. Journal of Petroleum Science and Engineering. 215, Part A, 110625,

The abstract focusses mainly on the general problem and ignores the other items of the abstract such as the methodology, good introduction, results and conclusion. The authors should explain what limitations did they find out about the proposed method. Best Regard

6. Do you want your identity to be public for this peer review? Reviewer #1: No. Reviewer #2: No.

Submitted filename: PONE-D-22-20332_reviewer.docx

Submitted filename: Comments-PONE-D-22-20332.pdf

6 Sep 2022

See attached

Submitted filename: Replies_to_reviewers.docx

19 Sep 2022

PONE-D-22-20332R1

Dear Dr. Taguchi,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Kind regards,
Chi-Hua Chen, Ph.D.
Academic Editor
PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

1. Have the authors adequately addressed the comments raised in a previous round of review? Reviewer #1: All comments have been addressed. Reviewer #2: All comments have been addressed.
2. Is the manuscript technically sound, and do the data support the conclusions? Reviewer #1: Yes. Reviewer #2: Yes.
3. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: Yes. Reviewer #2: Yes.
4. Have the authors made all data underlying the findings in their manuscript fully available? Reviewer #1: Yes. Reviewer #2: Yes.
5. Is the manuscript presented in an intelligible fashion and written in standard English? Reviewer #1: Yes. Reviewer #2: Yes.
6. Review Comments to the Author

Reviewer #1: This manuscript has enriched the content of the article and enhanced the readability of the article through modification, but there are still some small problems. 1. It is suggested that the paragraphs of the full paper should be aligned at both ends, which may make the article look more beautiful. 2. In line 201 on page 8, u_{3k} does not exist in (26). 3. In line 284 on page 11, a sentence uses two verbs, “P-values were attributed to genes as... 155 genes associated with corrected P-values less than 0.01 were selected, b_i is expected to play a role of u_{5i} in eq. (46).” 4. Please check the references carefully. For example, reference [3], [10], [12], [14], [15], [16], and [19] etc.

Reviewer #2: Dear Authors; I have read your response and edited manuscript carefully and I was pleased with your answers and the way of developing the research and the manuscript. So, I have no further comment for you. Best Regards

7. Do you want your identity to be public for this peer review? Reviewer #1: No. Reviewer #2: No.

20 Sep 2022

PONE-D-22-20332R1

Dear Dr. Taguchi:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

Kind regards,
PLOS ONE Editorial Office Staff on behalf of Professor Chi-Hua Chen, Academic Editor, PLOS ONE

Table 2

Confusion matrix of selected miRNAs between TD-based unsupervised FE and PP in the first data set.

The P-value computed by Fisher’s exact test is 2.76 × 10^−23.

TD-based unsupervised FE (rows) vs. PP (columns)	PP: adjusted P_k > 0.01	PP: adjusted P_k < 0.01
TD-based unsupervised FE: adjusted P_k > 0.01	812	2
TD-based unsupervised FE: adjusted P_k < 0.01	0	11
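
The Fisher's exact test quoted for this confusion matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not say which software or which alternative hypothesis (one- or two-sided) was used, so the two-sided form is assumed here, and the function name `fisher_exact_two_sided` is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(table):
    """Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]].

    Hypothetical helper written for illustration; not the authors' code.
    """
    (a, b), (c, d) = table
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    # Number of ways to choose the first column from all items.
    denom = comb(total, col1)

    def weight(x):
        # Unnormalised hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    tail = sum(
        w for w in (weight(x) for x in range(lowest, highest + 1)) if w <= observed
    )
    return tail / denom


# Counts from Table 2: rows are TD-based unsupervised FE, columns are PP.
table2 = [[812, 2], [0, 11]]
p_table2 = fisher_exact_two_sided(table2)
# The source reports 2.76 x 10^-23 for this table.
print(f"Fisher's exact test P-value for Table 2: {p_table2:.3g}")
```

Exact integer arithmetic via `math.comb` is used so that very small probabilities are not lost to rounding before the final division; the same function could be given any other 2 × 2 table of counts.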

Table 3

Confusion matrix of selected mRNAs between TD-based unsupervised FE and PP in the second data set.

The P-value computed by Fisher’s exact test is 0.0 within numerical accuracy (i.e., smaller than the smallest number representable at the available numerical precision).

TD-based unsupervised FE (rows) vs. PP (columns)	PP: adjusted P_i > 0.01	PP: adjusted P_i < 0.01
TD-based unsupervised FE: adjusted P_i > 0.01	33781	8
TD-based unsupervised FE: adjusted P_i < 0.01	23	186

10 in total

7. Principal component analysis based unsupervised feature extraction applied to budding yeast temporally periodic gene expression.

Authors: Y-H Taguchi
Journal: BioData Min Date: 2016-06-29 Impact factor: 2.522

8. Tensor decomposition-based and principal-component-analysis-based unsupervised feature extraction applied to the gene expression and methylation profiles in the brains of social insects with multiple castes.

Authors: Y-H Taguchi
Journal: BMC Bioinformatics Date: 2018-05-08 Impact factor: 3.169

9. Identification of miRNA signatures for kidney renal clear cell carcinoma using the tensor-decomposition method.

Authors: Ka-Lok Ng; Y-H Taguchi
Journal: Sci Rep Date: 2020-09-16 Impact factor: 4.379

10. A new advanced in silico drug discovery method for novel coronavirus (SARS-CoV-2) with tensor decomposition-based unsupervised feature extraction.

Authors: Y-H Taguchi; Turki Turki
Journal: PLoS One Date: 2020-09-11 Impact factor: 3.240
