Literature DB >> 25002826

Machine learning methods in the computational biology of cancer.

Abstract

The objectives of this Perspective paper are to review some recent advances in sparse feature selection for regression and classification, as well as compressed sensing, and to discuss how these might be used to develop tools to advance personalized cancer therapy. As an illustration of the possibilities, a new algorithm for sparse regression is presented and is applied to predict the time to tumour recurrence in ovarian cancer. A new algorithm for sparse feature selection in classification problems is presented, and its validation in endometrial cancer is briefly discussed. Some open problems are also presented.

Entities: Chemical Disease Gene Species

Keywords: LASSO algorithm; algorithm; cancer biology; compressed sensing; elastic net; machine learning; support vector machines

Year: 2014 PMID： 25002826 PMCID： PMC4032557 DOI： 10.1098/rspa.2014.0081

Source DB: PubMed Journal: Proc Math Phys Eng Sci ISSN： 1364-5021 Impact factor: 2.704

Introduction

The objectives of this Perspective paper are to review some recent advances in sparse feature selection for regression and classification, and to discuss how these might be used in the computational biology of cancer. One of the motivations for writing this paper is to present a broad picture of some recent advances in machine learning to the more mathematically inclined within the cancer biologist community, and to apply some of these techniques to a couple of problems. Full expositions of these applications will be presented elsewhere. In the other direction, it is hoped that the paper will also facilitate the entry of interested researchers from the machine learning community into cancer biology. In order to understand the computational aspects of the problems described here, a basic grasp of molecular biology is sufficient, as can be obtained from standard references, for example Northrop & Connor [1] and Tözeren & Byers [2]. Cancer is the second leading cause of death in the USA [3]. It is estimated that in the USA in 2013 there will be 1 660 290 new cases of cancer in all sites, and 589 350 deaths [4]. In the UK, in 2011 there were 331 487 cases of cancer, and 159 178 deaths; both are the latest figures available [5]. Worldwide, cancer led to about 7.6 million deaths in 2008 [6]. It is interesting to note that, whether in developed countries such as the USA and the UK or worldwide, cancer accounts for roughly 13% of all deaths [6]. One of the major challenges faced by cancer researchers is that no two manifestations of cancer are alike, even when they occur in the same site. One can paraphrase the opening sentence of Leo Tolstoy's Anna Karenina and say that ‘Normal cells are all alike. Every malignant cell is malignant in its own way.’ Thus, cancer would be an ideal target for ‘personalized medicine’, in which therapy is custom-tailored to each patient. Unfortunately, our current level of understanding of the disease does not permit us to develop truly personalized therapies for every individual patient. Therefore, it is necessary to settle for an intermediate approach, which might be described as ‘patient stratification’. In this approach, diverse manifestations of a particular type of cancer are grouped into a small number of classes, wherein the manifestations are broadly similar within each class and substantially different between classes. Then attempts can be made to develop therapeutic regimens that are tailored for each class. Until recently, grouping of cancers has been attempted first through the site of the cancer, and then through histological considerations, that is, the microscopic anatomy of the cells comprising the tumour, and other parameters that can be measured by a physical examination of the tumour. For example, lung cancer is divided into two broad categories, namely small-cell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC), where the prognosis for the latter is decidedly better than for the former. Then, NSCLC is divided into three subtypes known as adenocarcinoma, squamous cell carcinoma and large-cell carcinoma. All of these subtypes are defined on the basis of histology. But this is not the only possible approach. It is also possible to define the subtypes on the basis of the molecular-level properties of the cancer tumour. For instance, there are four major types of breast cancer, known as luminal A, luminal B, non-luminal and basal type. These subtypes are defined based on the expression levels of the genes oestrogen receptor, progesterone receptor and HER2, also known as ERBB2, being either high or low. The basal-like subtype, also known as the triple negative subtype owing to the fact that all three genes are expressed at very low levels, constitutes about 20% of breast cancer cases and has the worst prognosis. For the other three subtypes, there are some proved therapies that work reasonably well; but this is not so for triple negative subtypes. The above subtyping illustrates the type of challenges faced by a mathematically trained person when studying computational biology. For instance, given that there are three genes being studied, and that the expression level of each can be either high or low, a mathematician/engineer might think that there are 23=8 possible subtypes. In reality however, as stated above, there are only four subtypes, and some of the possible combinations do not seem to occur sufficiently frequently.[1] The therapies for the various subtypes are quite different. Therefore, it is important to be able to ascertain to which subtype a patient belongs, before commencing therapy. This is one possible application of machine learning. During the past decade, attempts have been made to collect the experimental data generated by various research laboratories into central repositories such as the Gene Expression Omnibus [8] and the Catalogue of Somatic Mutations in Cancer (COSMIC) [9]. However, the data in these repositories are often collected under widely varying experimental conditions. Moreover, the standard of reporting and documentation is not always uniform. To mitigate this problem, there are now some massive public projects underway for generating vast amounts of data for all the tumours that are available in various tumour banks, using standardized sets of experimental protocols. Among the most ambitious are The Cancer Genome Atlas, usually referred to by the acronym TCGA [10], which is undertaken by the National Cancer Institute, and the International Cancer Genome Consortium, referred to also as ICGC [11], which is a multi-country effort. In the TCGA data, molecular measurements are available for almost all tumours, and clinical annotations are also available for many tumours. With such a wealth of data becoming freely available, researchers in the machine learning community can now aspire to make useful contributions to cancer biology without the need to undertake any experimentation themselves. However, in order to carry out meaningful research, it is essential to have a close collaboration with one or more biologists. The style of exposition in the biological literature is quite different from that in mathematical books and papers, and the author's experience has been that simply conversing with expert biologists is the fastest way to become familiar with the subject.[2] Unlike in mathematics, in biology it is not possible to derive everything from a few fundamental axioms and/or principles; instead, one is confronted with a bewildering variety of terms and names, all of which have to be mastered (memorized?) in parallel. One example, as mentioned above in connection with breast cancer, is that the names ERBB2, HER2 and HER2/Neu all refer to the same gene. Also, while it is not necessary to perform experiments oneself, it is absolutely crucial to understand the nature of the experiments, so that one is aware of the potential sources of error and the level of reliability of specific types of molecular measurements. Owing to space limitations, in this paper only two out of the many possible applications of machine learning to cancer are addressed, namely sparse regression and sparse classification. Other topics such as network inference and modelling tumour growth are mentioned very briefly in passing towards the end of the paper. Now, we briefly state the class of problems under discussion in this paper. This also serves to define the notation used throughout. Let m denote the number of tumour samples that are analysed, and let n denote the number of attributes, referred to as ‘features’, that are measured on each sample. Typically, m is of the order of a few dozen in small studies, ranging up to several hundreds for large studies such as the TCGA studies, while n is of the order of tens of thousands. There are 20 000 or so genes in the human body, and in whole genome studies, and the expression level of each gene is measured by at least one ‘probe’, and sometimes by more than one. The ‘raw’ expression level of a gene corresponds to the amount of messenger RNA that is produced and is therefore a non-negative number. However, the raw value is often transformed by taking the logarithm after dividing by a reference value, subtracting a median value, dividing by a scaling constant and the like. As a result, the numbers that are reported as gene expression levels can sometimes be negative numbers. Therefore, it is best to think of gene expression levels as real numbers. Other features that are measured include micro-RNA (miRNA) levels, methylation levels and copy number variations, all of which can be thought of as real-valued. There are also binary features such as the presence or absence of a mutation in a specific gene. In addition to these molecular attributes, there are also ‘labels’ associated with each tumour. Let y denote the label of tumour i, and note that the label depends only on the sample index i and not the feature index j. Typical real-valued labels include the time of overall survival after surgery, time to tumour recurrence or the lethality of a drug on a cancer cell line. Typical binary labels include whether a patient had metastasis (cancer spreading beyond the original site). In addition, it is also possible for labels to be ordinal variables, such as ‘poor responder’, ‘medium responder’ and ‘good responder’. Often these ordinal labels are merely quantized versions of some other real-valued attributes. For instance, the previous example corresponds to a three-level quantization of the time to tumour recurrence. In general, the labels refer to clinical outcomes, as in all of the above examples. Usually, each sample has multiple labels associated with it. However, in applications, the labels are treated one at a time, so it is assumed that there is only one label for each sample, with y denoting the label of the ith sample. Moreover, for simplicity, it is assumed that the labels are either real-valued or binary. Thus, the measurement set can be thought of as an m×n matrix X=[x], where x is the value of feature j in sample i. The row vector x, denoting the ith row of the matrix X, is called the feature vector associated with sample i. Similarly, the column vector x denotes the variation of the jth feature across all m samples. Throughout this paper, it is assumed that , that is, that each measurement is a real number. Binary measurements such as the presence or absence of mutations are usually handled by partitioning the sample set into two groups, corresponding to the two labels. For the purposes of incorporating binary labels into numerical computation, the labels are taken as ±1, the so-called bipolar case. It does not matter which abstract label is mapped into +1 and which abstract label is mapped into −1. If y is bipolar, the associated problem is called ‘classification’, whereas if y is real the associated problem is called ‘regression’. In either case, the objective is to find a function or such that y is well approximated by f(x).

Regression methods

The focus in this section is on the case where the label y is a real number. Therefore, the objective is to find a function such that f(x) is a good approximation of y for all i. A typical application in cancer biology would be the prediction of the time for a tumour to recur after surgery. The data would consist of expression levels of tens of thousands of genes on around a hundred or so tumours, together with the time for the tumour to recur for each patient. The objective is to identify a small number of genes whose expression values would lead to a reliable prediction of the recurrence time. Cancer is a complex, multi-genic disease, and identifying a small set of genes that appear to be highly predictive in a particular form of cancer would be very useful. Explaining why these genes are the key genes would require constructing gene regulatory networks (GRNs). While this problem is also amenable to treatment using statistical methods, it is beyond the scope of this paper. Towards the end of this section, the tumour recurrence problem is studied using a new regression method.

Some well-established algorithms

Throughout this section, attention is focused on linear regressors, with f(x)=xw−θ, where is a weight vector and is a threshold or bias. There are several reasons for restricting attention to linear regressors. From a mathematical standpoint, linear regressors are by far the most widely studied and the best understood class of regressors. From a biological standpoint, it makes sense to suppose that the measured outcome is a weighted linear combination of each feature, with perhaps some offset term. If one were to use higher order polynomials, for example, then biologists would rightly object that taking the product of two features (say two gene expression values) is unrealistic most of the time.[3] Other possibilities include pre-processing each feature x through a function such as x↦e/(1+e), but this is still linear regression in terms of the processed values. As explained earlier, often the measured feature values x are themselves processed values of the corresponding ‘raw’ measurements. In traditional least-squares regression, the objective is to choose a weight vector and a threshold θ so as to minimize the least-squared error This method goes back to Legendre and Gauss and is the staple of researchers everywhere. Let denote a column vector of all ones, with the subscript denoting the dimension. Then, where If the matrix has full column rank of n+1, then it is easy to see that the unique optimal choice is given by In the present context, the fact that m4] Different constraints lead to different problem formulations. An excellent and very detailed treatment of the various topics of this section can be found in Hastie et al. [12, ch. 3]. Suppose we minimize the least-squared error objective function subject to an ℓ2-norm constraint on w. This approach to finding a unique set of weights is known as ‘ridge regression’ and is usually credited to Hoerl & Kennard [13]. However, several of the key ideas are found in a much earlier paper by the Russian mathematician Tikhonov [14]. In ridge regression, the problem is reformulated as where t is some prespecified bound. In the associated Lagrangian formulation, the problem becomes one of the minimizing objective function where λ is the Lagrange multiplier. Because of the additional term, the (1,1)-block of the Hessian of J, which is the Hessian of J with respect to w, now equals λI+XX, which is positive definite even when m Another possibility is to choose a solution w that has the fewest number of non-zero components, that is, a regressor that uses the fewest number of features. Define If p≥1, this is the familiar ℓ-norm. If p<1, this quantity is no longer a norm, as the function w↦∥w∥ is no longer convex. However, as p↓0, the quantity ∥w∥ approaches the number of non-zero components of w. For this reason, it is common to refer to the number of non-zero components of a vector as its ‘ℓ0-norm’ even though ∥⋅∥0 is not a norm at all. Moreover, it is known [15] that the problem of minimizing ∥w∥0 is NP-hard. A very general formulation of the regression problem is to minimize where is a norm known as the ‘regularizer’. This problem is analysed at a very high level of generality in Negabhan et al. [16], where the least-squares error term is replaced by an arbitrary convex ‘loss’ function. In the interests of simplicity, we do not discuss the results of Negabhan et al. [16] in their full generality and restrict the discussion to the case where the loss function is quadratic as in (2.3). In Tibshirani [17], it is proposed to minimize the least-squared error objective function subject to an ℓ1-norm constraint on the weight vector w. In Lagrangian formulation, the problem is to minimize where λ is the Lagrange multiplier. The acronym ‘LASSO’ is coined in Tibshirani [17] and stands for ‘least absolute shrinkage and selection operator’. The LASSO penalty can be rationalized by observing that ∥⋅∥1 is the convex relaxation of the ‘ℓ0-norm’. The behaviour of the solution to the LASSO algorithm depends on the choice of the upper bound t. A detailed analysis of the Lagrangian formulation (2.4) and its dual problem is carried out in Osborne et al. [18]. It is shown there that, if the Lagrange multiplier λ in (2.4) is sufficiently large, say λ>λ, then the only solution to the LASSO minimization problem is w=0. Moreover, the threshold λ is not easy to estimate a priori. An optimal solution is defined to be ‘regular’ in Osborne et al. [18, definition 3.3] if it satisfies some technical conditions. In every problem, there is at least one regular solution. Moreover, every regular optimal weight vector has at most m non-zero entries (see Osborne et al. [18, theorem 3.5]). In many applications, some of the columns of the matrix X are highly correlated. For instance, if the indices j and k correspond to two genes that are in the same biological pathway, then their expression levels would vary in tandem across all samples. Therefore, the column vectors x and x would be highly correlated. In such a case, ridge regression tends to assign nearly equal weights to each. At the other extreme, LASSO tends to choose just one among the many correlated columns and to discard the rest; which one gets chosen is often a function of the ‘noise’ in the measurements. In biological datasets, it is reasonable to expect that expression levels of genes that are in a common pathway are highly correlated. In such a situation, it is undesirable to choose just one among these genes and to discard the rest; it is also undesirable to choose all of them, as that would lead to too many features being chosen. It would be desirable to choose more than one, but not all, of the correlated columns. This is achieved by the so-called ‘elastic net’ (EN) algorithm, introduced in Zou & Hastie [19], which is a variation of the LASSO algorithm. In this algorithm, the penalty aims to constrain, not the ℓ1-norm of the weight w, but a weighted sum of its ℓ1-norm and ℓ2-norm squared. The problem formulation in this case, in Lagrangian form, is to choose w so as to minimize where μ∈(0,1). Note that if μ=0, then the EN algorithm becomes the LASSO, whereas with μ=1, the EN algorithm becomes ridge regression. Thus, the EN algorithm provides a bridge between the two. Note that the penalty term in the EN algorithm is not a norm, owing to the presence of the squared term; hence, the EN algorithm is not covered by the very thorough analysis in Negabhan et al. [16]. A useful property of the EN algorithm is brought out in Zou & Hastie [19, theorem 1].

Theorem 2.1

Assume that y,X,λ are fixed, and let denote the corresponding minimizer of (2.5). Assume without loss of generality that y is centred, that is, y=0, and that the columns of X are normalized such that ∥x∥2=1 for all j. Let j,k be two indices between 1 and n, and suppose that . Then, As one can always ensure that by replacing x by −x if necessary, (2.6) states that if the columns x and x are highly correlated, then the corresponding coefficients in the regressor are nearly equal. Unlike in the LASSO algorithm, there do not seem to be many results on the number of non-zero weights that are chosen by the EN algorithm. It can and often does happen that the number of features chosen is larger than m, the number of samples. However, as explained above, this is often seen as a desirable feature when the columns of the matrix X are highly correlated, as they often are in biology datasets. By now both LASSO and EN can be viewed as well-established algorithms. A search of the Pubmed database of the National Library of Medicine with strings ‘LASSO cancer’ or ‘EN cancer’ results in about 200 entries for the former and several dozen entries for the latter. Note that these numbers are an order of magnitude less than the corresponding numbers for the support vector machine (SVM), discussed in §2b. Many of the papers citing the LASSO algorithm do not directly apply the algorithm to cancer data; instead, they propose some variant of the algorithm and claim to show that their variant outperforms the standard LASSO algorithm. A surprisingly large number of these variants propose non-convex objective functions (such as the ‘ℓ-norm’ with p<1). Given that, in convex optimization, every local optimum is also a global optimum, whereas this is not so in the case of non-convex optimization, it is difficult to imagine what benefits if any are conferred by replacing the convex objective function JLASSO with a non-convex objective function. But there are many such papers to be found in the literature. In the case of the EN algorithm, a typical application is found in Lee et al. [20] that addresses the problem of identifying some genes to delineate advanced versus early stage colorectal cancer. In this study, 1192 known or putative cancer genes found from Network [21] and COSMIC [9] constitute the feature set on 197 samples. As expected, the EN algorithm chooses a large number of features, which are then rank-ordered to determine the key genes. An interesting paper [22] compares all the three methods discussed here, namely ridge regression, LASSO and EN, on several datasets both synthetic and real, including a lung adenocarcinoma dataset. Not surprisingly, ridge regression assigns a non-zero weight to all 1310 features, whereas EN assigns zero weights to only 43 features, thus resulting in no significant reduction in the number of features chosen. The paper does not clearly mention how many features are retained by the LASSO algorithm.

Some recent algorithms and open problems

Next, we discuss several versions of the problem formulation in (2.3) corresponding to diverse choices of the penalty norm , culminating in some open problems that are relevant to biological applications. The ‘pure’ LASSO algorithm tries to choose as few distinct features as possible in the regressor. However, it may be worthwhile to partition the set of features into g groups G1,…,G, and then choose a regressor that selects elements from as few distinct groups as possible, without worrying about the number of features chosen. This is achieved by the so-called group LASSO (GL) algorithm introduced in Bakin [23] and Lin & Zhang [24]. Let n:=|G| for l=1,…,g. In the grouped LASSO algorithm, the objective function is where is determined from w by setting w=0 for all j∉G. It is clear that, depending on the relative sizes of the various groups, one weight vector can have more non-zero components than another, and yet the number of distinct groups to which these non-zero components belong can be smaller. In the limiting case, if the number of groups is taken as n and each group is taken to consist of a singleton set, then the grouped LASSO reduces to the standard LASSO algorithm. A further variation is the so-called sparse GL (SGL) algorithm introduced in Friedman et al. [25] and Simon et al. [26], where the objective is simultaneously to choose features from as few distinct groups as possible, and within the chosen groups, choose as few features as possible. The objective function in the SGL algorithm is where as always μ∈[0,1]. The above formulations of the GL and SGL norms are based on the assumption that the various groups do not overlap. However, in some biological applications it makes sense to permit overlapping group decompositions. Specifically, at a first level of approximation a GRN can be modelled as a directed acyclic graph, wherein the root nodes can be interpreted as master regulator genes, and directed paths can be interpreted as biological pathways. In such a case, one seeks to explain the available data, not by choosing the fewest number of genes but rather by the fewest number of pathways. To illustrate, consider the baby example shown in figure 1, where gene 1 is a master regulator, while genes 2–7 are regulated genes. Some are regulated directly by a master regulator gene, whereas others are indirectly regulated. In figure 1a, there are four pathways, namely whereas in figure 1b there are also four pathways, namely Ideally, we would like to choose a set of features that intersect with as few pathways as possible. We will return to this example after presenting available theories for sparse regression with overlapping groups.

Figure 1.

Two regulatory networks. (a) A network without overlapping groups. (b) A network with overlapping groups.

Two regulatory networks. (a) A network without overlapping groups. (b) A network with overlapping groups. To date, various versions of group or SGL with overlapping groups have been proposed. As before, let G1,…,G be subsets of , but now without the assumption that the groups are pairwise disjoint. The penalty-augmented optimization problems are the same as in (2.7) and (2.8), respectively; however, the objective functions are now referred to as JGLO and JSGLO to suggest (sparse) GL with overlap. For the case of overlapping groups, the theory developed in Negabhan et al. [16] continues to apply so long as the penalty terms in (2.7) and (2.8), respectively, are ‘decomposable’. The most general results available to date address the case where the groups are ‘tree structured’, that is, See, for example, Obozinski et al. [27] and Jenetton et al. [28]. Now, if we examine the groups associated with the network in figure 1a, it is obvious that (2.9) is not satisfied. However, there is a slight modification that would permit (2.9) to hold, namely to drop the root node and retain only the successors. Thus, the various groups are However, there is no way of modifying the groups so as to ensure that (2.9) holds for the network in figure 1b. The reason is easy to see. The ‘tree structure’ assumption (2.9) implies that there is only one path between every pair of nodes. But this is clearly not true in figure 1b, because there are two distinct paths from node 1 to node 5. Moreover, a little thought would reveal that that the assumption of tree-structured groups does not really permit truly overlapping groups. In particular, if (2.9) holds, then the collection of sets {G1,…,G} can be expressed as a union of chains in the form where the ‘maximal’ sets G are pairwise disjoint once duplicates are removed, and together span the total feature set . Now, in a biological network, it makes no sense to impose a condition that there must be only one path between every pair of nodes. Therefore, the problem of defining a decomposable norm penalty for inducing other types of sparsity besides tree structure, especially the types of sparsity that are consistent with biology, is still open. We conclude this section with a new algorithm and its application to sparse regression. This represents joint work with Mehmet Eren Ahsen and will be presented in more complete form elsewhere. A special case of SGL is obtained by choosing just one group, which perforce has to equal , so that Of course, as the entire index set is chosen as one group, there is nothing ‘sparse’ about it. Note that the only difference between (2.10) and (2.5) is that the ℓ2-norm is not squared in the former. For this reason, the above approach is called the ‘modified elastic net’ or MEN algorithm. Unlike in EN, the penalty (or constraint) term in MEN is a norm, being a convex combination of the ℓ1- and ℓ2-norms. In several examples, the MEN algorithm appears to combine the accuracy of EN with the sparsity of LASSO. It is relatively easy to prove an analogue of theorem 2.1 for the MEN algorithm. That is, unlike in LASSO but as in EN, MEN assigns nearly equal weights to highly correlated features. But further theoretical analysis remains to be carried out. The MEN algorithm was applied to the TCGA ovarian cancer data [29] to predict the time to tumour recurrence. Specifically, both times to tumour recurrence as well as expression levels for 12 042 genes are available for 283 patients. Out of these, 40 patients whose tumours recurred before 210 days or after 1095 days were excluded from the study as being ‘extreme’ cases. The remaining 243 samples were analysed using MEN with recursive feature elimination (RFE). The results are shown in figure 2. The number of features and the average percentage error in absolute value are shown in table 1.

Figure 2.

Table 1.

Comparison of three algorithms on TCGA ovarian cancer data on time to tumour recurrence, with extreme cases excluded.

algorithm	no. features	average perc. error (%)
LASSO	43	16.14
EN	60	14.35
MEN	42	14.91

Predicted versus actual times to tumour recurrence in 243 ovarian cancer patients. The results for the LASSO algorithm are in (a), those for the EN algorithm are in (b) and those for the MEN algorithm are in (c). (Online version in colour.) Comparison of three algorithms on TCGA ovarian cancer data on time to tumour recurrence, with extreme cases excluded.

Compressed sensing

In recent years, there have been several results that are grouped under the general heading of ‘compressed sensing’ or ‘compressive sensing’. Both expressions are in use, but ‘compressed sensing’ is used in this paper. The problem can be roughly stated as follows: suppose is an unknown vector but with known structure; is it possible to determine x either exactly or approximately by taking m≪n linear measurements of x? The area of research that goes under this broad heading grew spectacularly during the first decade of the new millennium.[5] As summarized in the introduction of the paper [31], the impetus for recent work in this area was the desire to find algorithms for data compression that are ‘universal’ in the sense of being non-adaptive (i.e. do not depend on the data). In the original papers in this area, the results and proofs were a mixture of sampling, signal transformation (time domain to frequency domain and vice versa), randomness, etc. However, as time went on, the essential ingredients of the approach were identified, thus leading to a very streamlined theory that clearly transcends its original application domains of image and signal processing. Image processing is one of the potential (and actually realized) applications of compressed sensing, and, as such, the theory has already been applied to the processing of biological images. Other than this, there do not appear to be any applications of the theory to cancer biology. Perhaps this can be attributed to the relative newness of the subject. The motivation for discussing compressed sensing theory in this paper is the following: whether it is in compressed sensing or in computational biology, one searches for a relatively simple explanation of the observations. Therefore, it may potentially be possible to borrow some of the basic ideas from compressed sensing theory and adapt them to problems in cancer biology. Compressed sensing theory as it currently stands cannot directly be applied to the analysis of biological datasets, because the fundamental assumption in compressed sensing theory is that one is able to choose the so-called measurement matrix, called A throughout this paper. Note that, in statistics, the matrix A is often referred to as the ‘design’ matrix. However, in biological (and other) applications, the measurement matrix is given, and one does not have the freedom to change it. Nevertheless, the developments in this area are too important to be ignored by computational biologists. The hope is that, by understanding the core arguments of compressed sensing theory, the computational biology community would be able to adapt the theory for its application domain. In parallel, those who are well versed in compressed sensing theory can start thinking about how the basic arguments can be modified to the case where the measurement matrix is specified, and cannot be chosen. The major developments in this area are generally associated with the names of Candès, Donoho, Romberg and Tao, though several other researchers have also made important contributions. See Donoho [31] for one of the earliest comprehensive papers, as well Donoho [32], Candés [33], Candés & Tao [34,35], Candés & Plan [36], Romberg [37] and Cohen et al. [38]. The survey paper [30] and a recent paper [16] contain a wealth of bibliographic references that can be followed up by interested readers. We begin by introducing some notation. Suppose m,n,k are given integers, with n≥2k. For convenience, we denote the set {1,…,n} by throughout. For a given vector , let supp(x) denote its support, that is, supp(x)={i:x≠0}. Let, . Thus, Σ denotes the set of ‘k-sparse’ vectors in , or, in other words, the set of n-dimensional vectors that have k or fewer non-zero components. For each vector , integer k Next, the so-called ‘restricted isometry property’ (RIP) is introduced. Note that, in some cases, the RIP can be replaced by a weaker property known as the ‘null space property’ [38]. However, the objective of this paper is not to present the most general results, but rather to present reasonably general results that are easy to explain. So the exposition below is confined to the RIP.

Definition 3.1

Suppose . We say that A satisfies the RIP of order k with constant δ if So the matrix A has the RIP of order k with constant 1−δ if the following property holds: for every choice of k or fewer columns of A (say the columns in the set where |J|≤k), the spectrum of the symmetric matrix lies in the interval [1−δ,1+δ], where denotes the submatrix of A consisting of all rows and the columns corresponding to the indices in J. If integers n,k are specified, the integer m has to be sufficiently large in order for the matrix A to satisfy the RIP.

Theorem 3.2 (Davenport et al. [30, theorem 1.4])

Suppose satisfies the RIP or order 2k with constant δ2∈(0,1/2]. Then, where Next, we state some of the main known results in compressed sensing. The theorem statement below corresponds to Candès [33, theorem 1.2] and Davenport et al. [30, theorem 1.9].

Theorem 3.3

Suppose satisfies the RIP of order δ2 with constant and that y=Ax+η for some and with ∥η∥2≤ϵ. Define Then, where The formula for C2 is written slightly differently from that in Davenport et al. [30, theorem 1.9] but is equivalent to it.

Corollary 3.4

Suppose satisfies the RIP of order δ2 with constant and that y=Ax+η for some x∈Σ and with ∥η∥2≤ϵ. Define Then, where C2 is defined in (3.5).

Corollary 3.5

Suppose satisfies the RIP of order δ2 with constant and that y=Ax for some x∈Σ. Let, and define Then, . Both corollaries follow readily from the bound (3.4). Note that if x∈Σ, then σ(x,∥⋅∥1)=0. Thus, (3.4) implies that if there is measurement error, and , i.e. that , if there no measurement error. Corollary 3.4 is referred to as the ‘near ideal’ property of the LASSO algorithm. Suppose that x∈Σ so that x is k-sparse. Let S denote the support of x, and let denote the submatrix of A consisting of the columns corresponding to indices in S. If an ‘oracle’ knew not only the size of S, but the set S itself, then the oracle could compute as Then, the error would be for some appropriate constant. On the other hand, if x∈Σ, then σ(x,∥⋅∥1)=0, and the right-hand side of (3.4) reduces to (3.7), that is, The point therefore is that, if the matrix A satisfies RIP, and the constant δ2 satisfies the ‘compressibility condition’ , then the mean-squared error of the solution to the optimization problem (3.6) is bounded by a fixed (or ‘universal’) constant times the error bound achieved by an ‘oracle’ that knows the support of x. It should be noted that there is a parallel, and closely related, set of papers that study the following problem: given a matrix , a feature vector x that is known to be k-sparse but otherwise unknown, and a random measurement error w assuming values in , suppose one is given the noise-corrupted measurement y=Ax+w. To recover x from y, one solves the minimization problem where l is a user-specified penalty weight. What, if any, is the relationship between x* and x? It is easy to see that the objective function in (3.9) is just the Lagrangian associated with the constrained objective function in (3.3). Specifically, if λ is sufficiently large, then large values of ∥z∥1 are penalized, and the problem in (3.9) begins to resemble that in (3.3). Of course, the bound ϵ on the magnitude of the noise is not present in the problem formulation (3.9). In Candès & Plan [36] and Negabhan et al. [16], the above problem is analysed, and probabilistic (with respect to the random noise w) bounds analogous to (3.7) are derived. Indeed, [16] contains a very general theory wherein the ℓ2-norm of y−Az is replaced by an arbitrary convex function, and the ℓ1-norm is replaced by any decomposable norm. The advantage of the above theorem statements, which are taken from Candès [33] and Davenport et al. [30], is that the role of various conditions is clearly delineated. For instance, the construction of a matrix that satisfies the RIP is usually achieved by some randomized algorithm. In Candès & Tao [34, theorem 1.5], such a matrix is constructed by taking the columns of A to be samples of i.i.d. Gaussian variables. In Achlioptas [39], Bernoulli processes are used to construct A, which has the advantage of ensuring that all elements a have just three possible values, namely 0,+1,−1. A simple proof that the resulting matrices satisfy the RIP with high probability is given in Baraniuk et al. [40]. Neither of these construction methods is guaranteed to generate a matrix A that satisfies RIP. Rather, the resulting matrix A satisfies RIP with some probability, say ≥1−γ1. The probability γ1 that the randomized method may fail to generate a suitable A matrix can be bounded using techniques that have nothing to do with the above theorem. Similarly, in case the measurement matrix A satisfies the RIP but the measurement noise η is random, then it is obvious that theorem 3.3 holds with probability ≥1−γ2, where γ2 is a bound on the tail probability . Again, the problem of bounding this tail probability has nothing to do with theorem 3.3. By combining both estimates, it follows that if the measurement matrix A is generated through randomization, and if the measurement noise is also random, then theorem 3.3 holds with probability ≥1−γ1−γ2. Observe that the optimization problem (3.6) is This raises the question as to whether the ℓ1-norm can be replaced by some other norm ∥⋅∥P that induces some other form of sparsity, for example group sparsity. If some other norm is used in place of the ℓ1-norm, does the resulting algorithm display near-ideal behaviour, as does LASSO? In other words, is there an analogue of theorem 3.3 if ∥⋅∥1 is replaced by another penalty ∥⋅∥P? In joint work with Ahsen [41], the author has proved a very general theorem to the following effect: whenever the penalty norm is ‘decomposable’ and the measurement matrix A satisfies a ‘group RIP’, the corresponding algorithm has near-ideal behaviour provided a ‘compressibility condition’ is satisfied. The result is described in brief. Let be a partition of . This implies that the sets G are pairwise disjoint. If S⊆{1,…,g}, define GS:=∪G. Let k be some integer such that . A subset is said to be if Λ=GS and |GS|≤k, and group if it is S-group k-sparse for some set S⊆{1,…,g}. The symbol denotes the collection of group k-sparse sets. Suppose is some norm. The next definition builds on an earlier definition from Negabhan et al. [16].

Definition 3.6

The norm ∥⋅∥P is decomposable with respect to the partition if the following is true: whenever are group k-sparse with support sets Λ⊆G, Λ⊆G and the sets S1,S2 are disjoint, it is true that By adapting the arguments in Negabhan et al. [16], it can be shown that the GL norm used in (2.7), namely and the SGL norm used in (2.8), namely are both decomposable. Next, the notion of RIP is extended to groups.

Definition 3.7

A matrix is said to satisfy the group RIP of order if We define and introduce some constants With these definitions, the following theorem can be proved.

Theorem 3.8

Suppose satisfies the group RIP property of order 2k with constants respectively, and let . Suppose and that y=Ax+η, where ∥η∥2≤ϵ. Suppose that the norm ∥⋅∥P is decomposable, and define Suppose that the compressibility condition is satisfied. Then, and where σ is shorthand for the sparsity index and and c,d are defined in (3.12). In the above theorem, (3.14) replaces the compressibility condition of theorem 3.3. The resemblance of (3.16) to (3.4) is obvious. Consequently, (3.16) can be readily interpreted as stating that minimizing the decomposable norm ∥⋅|∥P leads to near-ideal behaviour.

Classification methods

The basic problem of classification can be stated as follows: suppose we are given a collection of labelled vectors (x,y),i=1,…,m, where each is viewed as a row vector and each y∈{−1,1}. For future use, define The objective of (two-class) classification is to find a discriminant function such that f(x) has the same sign as y for all i, or equivalently y⋅sign ( f(x))=1 for all i. In the present context, the objective is not merely to find such a discriminant function, but, rather, to find one that uses relatively few features. In many ways, classification is an easier problem than regression, because the sole criterion is that the discriminant function f(x) should have the same sign as the label y for each i. Thus, if f is a discriminant function, so is αf for every positive constant α, and, more generally, so is any function ϕ( f) whenever ϕ is a so-called ‘first- and third-quadrant function’, i.e. where ϕ(u)>0 when u>0 and ϕ(u)<0 when u<0. This gives us great latitude in choosing a discriminant function.

The ℓ2-norm support vector machine

This section is devoted to the well-known SVM, first introduced in Cortes & Vapnik [42], which is among the most successful and most widely used tools in machine learning. To distinguish this algorithm from its variants, it is referred to here as the ℓ2-norm SVM, for reasons that will become apparent. A given set of labelled vectors is said to be linearly separable if there exist a ‘weight vector’ (viewed as a column vector) and a ‘threshold’ such that f(x)=xw−θ serves as a discriminant function. Equivalently, the dataset is linearly separable if there exist a weight vector and a threshold such that To put it yet another way, given a weight w and a threshold θ, define by The dataset is linearly separable if there exists a hyperplane such that and . The situation can be depicted as in figure 3a, in which the dots on either side of the dashed line represent the two classes. It is clear that linear separability is not affected by swapping the class labels.

Figure 3.

A linearly separable dataset. (Online version in colour.)

A linearly separable dataset. (Online version in colour.) It is easy to see that, if there exists one hyperplane that separates the two classes, there exist infinitely many such hyperplanes. The question therefore arises as to which of these choices is the best. The SVM introduced in Cortes & Vapnik [42] chooses the separating hyperplane such that the nearest point to the hyperplane within each class is as far as possible from it. In the original SVM formulation, the distance to the hyperplane is measured using the Euclidean or ℓ2-norm. To illustrate the concept, the same dataset as in figure 3 is shown again in figure 4, with the ‘optimal’ separating hyperplane, and the closest points to it within the two classes shown as hollow circles.

Figure 4.

Optimal separating hyperplane. (Online version in colour.)

Optimal separating hyperplane. (Online version in colour.) In symbols, the SVM is obtained by solving the following optimization problem: An equivalent formulation of the SVM is obtained by observing that the distance of the separating hyperplane to the nearest points is given by c/∥w∥, where where the equality of the two terms follows from the manner in which the separating hyperplane is chosen. Moreover, the optimal hyperplane is invariant under scale change, that is, multiplying w and θ by a positive constant. Therefore, there is no loss of generality in taking the constant c to equal one. With this rescaling, the problem at hand becomes the following: This is the manner in which the SVM is implemented nowadays in most software packages. The original SVM formulation presupposes that the dataset is linearly separable. It is easy to determine whether or not a given dataset is linearly separable, because that is equivalent to the feasibility of a linear programming problem. This naturally raises the question of what is to be done in case the dataset is not linearly separable. One way to approach the problem is to choose a hyperplane that misclassifies the fewest number of points. While appealing, this approach is impractical, because it is known that this problem is NP-hard; see Höffgen et al. [43] and Natarajan [15]. A tractable approach is to replace this problem by its convex relaxation. We will return to this issue when we discuss ℓ1-norm SVMs. An alternative approach to guarantee that the data are linearly separable can be obtained using Vapnik–Chervonenkis theory [44,45]. Suppose that the n vectors x1,…,x do not lie on an (p−1)-dimensional hyperplane in . In such a case, whenever p≥n−1, the dataset is linearly separable for every one of the 2 ways of assigning labels to the n vectors. This result suggests that, if a given dataset is not linearly separable, it can be made so by increasing the dimension of the data vectors x, for instance, by including not just the original components but also their higher powers. This is the rationale behind so-called ‘higher order’ SVMs, or, more generally, kernel-based classifiers (e.g. Cristianini & Shawe-Taylor [46] and Schölkopf & Smola [47]). If the norm in (4.1) is the ℓ2-norm, then the minimization problem (4.1) is a quadratic programming problem, which can be solved efficiently for extremely large datasets. Moreover, the introduction of new data points does not alter the optimal hyperplane, unless one of the new data points is closer to the hyperplane than the earlier closest points. This is illustrated in figure 5, which contains exactly the same vectors as in figure 4, plus two more. The optimal hyperplane remains the same. For all these reasons, the SVM offers a very attractive approach to finding a classifier in situations where the number of features is smaller than the number of samples. On the other hand, generically the optimal weight vector has all non-zero components, which is undesirable when the number of samples m is too large, even if m48]. This consists of solving the SVM problem (4.1), identifying the component of the weight vector with the smallest magnitude, discarding it, re-solving the problem and repeating. Though it is claimed in Guyon et al. [48] that the method works well on a leukaemia dataset, in general RFE applied to the traditional ℓ2-norm SVM displays rather erratic behaviour.

Figure 5.

Insensitivity of optimal separating hyperplane to additional samples. (Online version in colour.)

The ℓ1-norm support vector machine

As we have seen, in biological applications, the number of features (the dimension of the vectors x) is a few orders of magnitude larger than the number of samples (the number of vectors). In such a case, because of the results in Wenocur & Dudley [44], linear separability is not an issue. However, in general, every component of the optimal weight vector w is non-zero. This means that a classifier uses every single feature in order to discriminate between the classes. Clearly, this is undesirable. The original SVM formulation presupposes that the dataset is linearly separable. If the data are not linearly separable, then as shown in Höffgen et al. [43] and Natarajan [15], the problem of finding a hyperplane that misclassifies the fewest points is NP-hard. An alternative approach is to formulate a convex relaxation of this NP-hard problem by introducing slack variables into the constraints in (4.1), and then minimizing an appropriate norm of the vector of slack variables. Finally, in many problems, the consequences of misclassification might not be symmetric. A false positive (labelling a sample as positive when in fact it is negative) might have far more, or far less, severe consequences than a false negative. In this section, we present a problem formulation that addresses all of these issues. This problem formulation combines the ideas in two papers, namely [49,50]. If we choose a particular norm ∥⋅∥ to measure distances in ‘feature space’, then distances in ‘weight space’ should be measured using the so-called dual norm, defined by In particular, if we measure distances in feature space using the ℓ1-norm, then distances in weight space should be measured using its dual, which is the -norm. With this observation, the problem can be formulated as follows: This can be converted to This is clearly a linear programming problem. In this formulation, λ is a ‘small’ constant in (0,1), much closer to 0 than it is to 1. Suppose that the original dataset is linearly separable, and let w*,θ* denote a solution to the optimization problem in (4.1), where ∥w∥ replaces ∥w∥. Then the choice is certainly feasible for the optimization problem (4.2). Moreover, if λ is sufficiently small, any reduction in ∥w∥ achieved by violating the linear separation constraints (i.e. permitting some y or z to be positive rather than zero) is offset by the increase in the term (1−λ)∥(y,z)∥. It is therefore clear that, if the dataset is linearly separable, there exists a critical value λ0>0 such that, for all λ<λ0, the optimization problem (4.2) has (w*,θ,0,0) as a solution. On the other hand, the optimization problem (4.2) remains meaningful even when the data are not linearly separable. The final aspect of the problem, as suggested in Veropoulos et al. [50], is to introduce a trade-off between false positives and false negatives. In this connection, it is worthwhile to recall the definitions of the accuracy, etc. of a classifier. Given a discriminant function f(⋅), define Thus, consists of the samples that are assigned to class 1 by the classifier, while consists of the samples that are assigned to class 2. Then, this leads to the array shown below In the above array, the entries TP, FN, FP and TN stand for ‘true positive’, ‘false negative’, ‘false positive’ and ‘true negative’, respectively.

Definition 4.1

With the above definitions, we have where Se, Sp and Ac stand for the sensitivity, specificity and accuracy, respectively. All three quantities lie in the interval [0,1]. Moreover, accuracy is a convex combination of sensitivity and specificity. In particular, Therefore, Also, the accuracy of a classifier will be roughly equal to the sensitivity if is far larger than , and roughly equal to the specificity if is far larger than . In many classification problems, the consequences of misclassification are not symmetric. To capture these kinds of considerations, another parameter α∈(0,1) is introduced, and the objective function in the optimization problem (4.2) is modified by making the substitution where we adopt the computer science notation ← to mean ‘replaces’. If α=0.5, then both false positives and false negatives are weighted equally. If α>0.5, then there is greater emphasis on correctly classifying the vectors in , and the reverse if α<0.5. With this final problem formulation, the following desirable properties result: — the problem is a linear programming problem and is therefore tractable even for extremely large values of n, the number of features; — the formulation can be applied without knowing beforehand whether or not the dataset is linearly separable; — the formulation provides for a trade-off between false positives and false negatives; and — most important, the optimal weight vector w has at most m non-zero entries, where m is the number of samples. Hence, the classifier uses at most m out of the n features. For these reasons, the ℓ1-norm SVM forms the starting point for our further research into classification. Until now, the discussion has been on two-class classification. The SVM framework does not extend to multi-class classification very readily. For one thing, when there are only two classes, it does not matter which class is labelled ‘positive’ and which is labelled ‘negative’. On the other hand, when there are multiple classes, it is necessary to distinguish between two cases: in the first, there is a natural ordering among the class labels. For instance, if a patient's response is to be categorized as poor, fair, good, very good and excellent, the ordering is clear. In the second, there is no natural ordering, for example, if one wishes to assign a breast cancer tumour into one of the four subtypes mentioned above. The paper [51] is among the more popular methods for multi-class SVM.

Some applications of support vector machines to cancer

In contrast with sparse regression, there are many applications of sparse classification methods to cancer biology. A search of the Pubmed database of the National Library of Medicine, USA, with the string ‘SVM cancer’ returns several hundred results. The vast majority of these papers present applications where human experts pare the hundreds or even thousands of measured features to a small subset, to which a standard ℓ2-norm SVM is applied. In other words, though the raw number of measured features is very high, the actual number of features used by the SVM is smaller than the number of samples. In principle, the ℓ1-norm SVM can be applied to the original feature set to choose the most predictive features out of the overall feature set. But for the most part, existing applications appear to leave the task of feature selection to human experts and not to the algorithm. The fraction of SVM applications that exploit the feature selection property of the ℓ1-norm SVM is quite small. Two reasons, not mutually exclusive, can be proposed to explain this. First, the users might simply be unfamiliar with the ℓ1-norm SVM methodology. Second, the users might believe that a purely data-driven approach might result in a feature set whose biological significance is unclear. The paper [48] that introduces the technique of RFE to the SVM algorithm studies the application of the technique to leukaemia. It is shown that the SVM–RFE method eventually leads to just two features being retained. However, several authors report that the performance of the SVM–RFE approach is in general somewhat erratic. It would appear therefore that the dataset studied in Guyon et al. [48] is particularly amenable to the use of this technique. An excellent review of several applications of SVMs to ovarian cancer can be found in Sabatier et al. [52]. Other examples of ovarian cancer applications include Han et al. [53], in which 322 samples are analysed to generate a 349-gene biomarker panel which performs very well, but when the 349 genes are reduced to 18 genes the performance on the test data is poor; Denkert et al. [54], in which a 300-gene ovarian carcinoma index is constructed on the basis of 80 samples, which is then tested on 118 samples; and Hartmann et al. [55], in which a panel of 14 genes is identified to differentiate between early relapse and late-stage relapse. Yet, these papers ignore the result from Wenocur & Dudley [44], which states that if the number of features used is in excess of the number of samples in the training data, then generically an SVM can achieve 100% accuracy, sensitivity and specificity on the training data, irrespective of the assignment of labels to the samples. As neither Han et al. [53] nor Denkert et al. [54] reports such a phenomenon, it is unclear whether these papers have implemented the SVM algorithm accurately. As illustrations of applications to other forms of cancer, one can mention: Sabatier et al. [56], in which a 368-gene expression signature is trained on 2145 basal breast cancer samples, and then tested on another set of 2034 samples, with the aim of predicting the patient's prognosis; and Klement et al. [57], in which seven features were selected by human experts to study 399 NSCLC patients. As an example of using not just SVM but also RFE, one can cite Yang et al. [58], in which the efficacy of drug candidates against hepatocellular carcinoma (liver cancer) is studied. As the efficacy of drug candidates is actually a real number between 0 and 1, the authors discretize the efficacy into two bins: [0,0.4] and [0.6,1]. Apparently, their only motivation is to make the problem fit into the SVM framework. Therefore, it would be worthwhile to apply sparse regression techniques (as opposed to sparse classification) to such problems. Finally, an application of the multi-class SVM methodology of Crammer & Singer [51] is found in Huang et al. [59].

The lone star algorithm

As pointed out in §4, both the traditional ℓ2-norm SVM and the ℓ1-norm SVM can be used for two-class classification problems. When the number of samples m is far larger than the number of features n, the traditional SVM performs very satisfactorily, whereas the ℓ1-norm SVM of Bradley & Mangasarian [49] is to be preferred when m49], RFE of Guyon et al. [48] and stability selection of Meinshausen & Bühlmann [60]. A preliminary version of this algorithm was reported in Ahsen et al. [61]. Note that Li et al. [62] introduces an algorithm known as SVM–T–RFE that contains some similarities to this algorithm. The algorithm is as follows: (1) Choose at random a ‘training set’ of samples of size k1 from and size k2 from , such that k≤m/2 and k1,k2 are roughly equal. Repeat this choice s times, where s is a ‘large’ number. This generates s different ‘training sets’, each of which consists of k samples from , l=1,2. (2) For each randomly chosen training set, compute a corresponding optimal ℓ1-norm SVM using the formulation (4.2). This results in s different optimal weight vectors and thresholds. (3) Let k denote the average number of non-zero entries in the optimal weight vector across all randomized runs. Average all s optimal weight vectors and thresholds, retain the largest k components of the averaged weight vector and corresponding feature set, and set the remaining components to zero. This results in reducing the number of features from the original n to k. (4) Repeat the process with the reduced feature set, but the originally chosen randomly selected training samples, until no further reduction is possible in the number of features. This determines the final set of features to be used. (5) Once the final feature set is determined, carry out twofold cross validation by dividing the data s times into a training set of k1,k2 randomly selected samples and assessing the performance of the resulting ℓ1-norm classifier on the testing dataset, which is the remainder of the samples. Average the weights generated by the t≤s best-performing classifiers, where t is chosen by the user, and call that the final classifier. When the number of features n is extremely large, an optional pre-processing step is to compute the mean value of each of the n features for each class, and retain only those features wherein the difference between means is statistically significant using the ‘Student’ t-test. Our experience is that using this optional pre-processing step does not change the final answer very much, but does decrease the CPU time substantially. Note that, in Li et al. [62], a weighted combination of the weight of the ℓ2-norm SVM and the t-test statistic is used to eliminate features. Now some comments are in order regarding the above algorithm: — in some applications, and are of comparable size, so that the size of the training set can be chosen to equal roughly half of the total samples within each class. However, in other applications, the sizes of the two sets are dissimilar, in which case the larger set has far fewer of its samples used in training; — step 1 of randomly choosing s different training sets differs from Guyon et al. [48], where there is only one randomized division of the data into training and testing sets; — for each random choice of the training set, the number of non-zero entries in the optimal weight vector is more or less the same; however, the locations of non-zero entries in the optimal weight vector vary from one run to another; — in step 3 above, instead of averaging the optimal weights over all s runs and then retaining the k largest components, it is possible to adopt another strategy. Rank all n indices in order of the number of times that index has a non-zero weight in the s randomized runs, and retain the top k indices. In our experience, both approaches lead to virtually the same choice of the indices to be retained for the next iteration; — instead of choosing s randomized training sets right at the outset, it is possible to choose s randomized training sets each time the number of features is reduced; and — in the final step, there is no distinction between the training and testing datasets, so the final classifier is run on the entire dataset to arrive at the final accuracy, sensitivity and specificity figures. The advantage of the above approach vis-à-vis the ℓ2-norm SVM–RFE of Guyon et al. [48] or the SVM–T–RFE of Li et al. [62] is that the number of features reduces significantly at each step, and the algorithm converges in just a few steps. This is because, in the ℓ1-norm SVM, many components of the weight vector are ‘naturally’ zero and need not be truncated. By contrast, in general all the components of the weight vector resulting from the ℓ2-norm SVM will be non-zero; as a result, the features can only be eliminated one at a time, and in general the number of iterations is equal to (or comparable to) n, the initial number of features. The new algorithm can be appropriately referred to as the ‘ℓ1-SVM t-test and RFE’ algorithm, where SVM and RFE are themselves acronyms as defined above. Once again taking the first letters, we are led to the ‘second-level’ acronym ‘ℓ1-StaR’, which can be pronounced as ‘ell-one star’. Out of deference to our domicile, we have decided to call it the ‘lone star’ algorithm. The lone star algorithm was applied to the problem of predicting which patients of endometrial cancer are at risk of lymph node metastasis. These results are reported elsewhere. But in brief, the situation is the following: the endometrium is the lining of the uterus. When a patient contracts endometrial cancer, her uterus, ovaries and fallopian tubes are surgically removed. One of the major risks run by endometrial cancer patients is that the cancer will metastasize and spread through the body via pelvic and/or para-aortic lymph nodes. The Gynecologic Oncology Group recommends that the patient's pelvic and para-aortic lymph nodes should also be surgically removed when the size of the tumour exceeds 2 cm in diameter. However, post-surgery analysis reveals that, even in this case, lymphatic metastasis is present in only 22% of the cases [63]. To predict the possibility of lymphatic metastasis, 1428 miRNAs were extracted from 94 tumours, half with and half without metastasis. Using the lone star algorithm, 13 miRNAs were identified as being highly predictive. When tested on the entire training sample of 94 tumours, the lone star classifier correctly classified 41 out of 43 lymph-positive samples, and 40 out of 43 lymph-negative samples. In ongoing work, these miRNAs were measured on an independent cohort of 19 lymph-negative and nine lymph-positive tumours. The classifier classified eight out of nine lymph-positive tumours correctly, and 11 out of 19 lymph-negative tumours correctly. Thus, while the specificity is not very impressive, the sensitivity is extremely good, which is precisely what one wants in such a situation. Moreover, using a two-table contingency analysis and the Barnard exact test, the likelihood of arriving at this assignment by pure chance (the so-called p-value) is bounded by 0.011574. In biology, any p-value less than 0.05 is generally considered to be significant.

Some topics for further research

Both machine learning and computational biology are vast subjects, and their intersection contains many more topics than are touched upon in this brief article. Besides, there are other topics in computational cancer biology that do not naturally belong to machine learning, for example modelling tumour growth using branching processes. Therefore, the emphasis in this article has been on topics that are well established in the machine learning community and are also relevant to problems in computational cancer biology. Until now several ‘penalty’ norms have been proposed for inducing an optimization algorithm to select structured sparse feature sets, such as GL and SGL. As pointed out in §2, available extensions of these penalty norms to overlapping sets do not address biological networks where there are multiple paths from a master regulator to a final node. Any advance in this direction would have an immediate application to computational biology. Compressed sensing theory as discussed in §3 is based on the premise that it is possible to choose the measurement matrix A. The available theorems in this theory are based on assumptions on the measurement matrix, such as the RIP, or the null space property, and perhaps something even more general in future. In order to apply techniques from compressed sensing theory to cancer biology, it would be necessary to modify the theory to the case where the measurement matrix is given, and not chosen by the user. The RIP corresponds to the assumption that in an m×n matrix A, every choice of k columns results in a nearly orthogonal set. In actual biological data, such an assumption has no hope of being true, because the expression levels of some genes would be highly correlated with those of other genes. In Candès & Plan [36], the authors suggest that it is possible to handle this situation by first clustering the column vectors and then choosing just one exemplar from each cluster before applying the theory. Our preliminary attempts to apply such an approach to ovarian cancer data [29] are not very promising, leading to RIP orders of 5 or 10—far too small to be of practical use. Thus, there is a need for the development of other heuristics besides clustering to extract nearly orthogonal sets of columns for actual measurement matrices. In this connection, it is worth pointing out [64] that group RIP is easier to achieve using random projections, as compared with RIP. However, it is not clear whether a ‘given’ A matrix is likely to satisfy a group RIP with a sufficiently large order. In general, it would appear that sparse regression is more advanced than sparse classification, with both well-established theoretical foundations as well as widely used algorithms in the former. By contrast, sparse classification does not have such a wealth of results. The lone star algorithm introduced here has performed well in several applications involving cancer data, and, at least for the moment, it appears to be the only available method to select far fewer features than the size of the training set of samples. As of now, there is no theoretical justification for this observed behaviour. Recall that the ℓ1-norm SVM is guaranteed only to choose no more features than the size of the training set; but there is no reason to assume that it will use fewer. Therefore, it is certainly worthwhile to study when and why lone star and other such algorithms will prove to be effective.

16 in total

1. SVM-T-RFE: a novel gene selection algorithm for identifying metastasis-related genes in colorectal cancer using gene expression profiles.

Authors: Xiaobo Li; Sihua Peng; Jian Chen; Bingjian Lü; Honghe Zhang; Maode Lai
Journal: Biochem Biophys Res Commun Date: 2012-01-28 Impact factor: 3.575

2. Support vector machine-based prediction of local tumor control after stereotactic body radiation therapy for early-stage non-small cell lung cancer.

Authors: Rainer J Klement; Michael Allgäuer; Steffen Appold; Karin Dieckmann; Iris Ernst; Ute Ganswindt; Richard Holy; Ursula Nestle; Meinhard Nevinny-Stickel; Sabine Semrau; Florian Sterzing; Andrea Wittig; Nicolaus Andratschke; Matthias Guckenberger
Journal: Int J Radiat Oncol Biol Phys Date: 2014-01-07 Impact factor: 7.038

Review 3. Histological, molecular and functional subtypes of breast cancers.

Authors: Gautam K Malhotra; Xiangshan Zhao; Hamid Band; Vimla Band
Journal: Cancer Biol Ther Date: 2010-11-15 Impact factor: 4.742

4. Gene expression profiles predict early relapse in ovarian cancer after platinum-paclitaxel chemotherapy.

Authors: Lynn C Hartmann; Karen H Lu; Gerald P Linette; William A Cliby; Kimberly R Kalli; David Gershenson; Robert C Bast; James Stec; Natalia Iartchouk; David I Smith; Jeffrey S Ross; Sebastian Hoersch; Viji Shridhar; James Lillie; Scott H Kaufmann; Edwin A Clark; Andrew I Damokosh
Journal: Clin Cancer Res Date: 2005-03-15 Impact factor: 12.531

5. A prognostic gene expression index in ovarian cancer - validation across different independent data sets.

Authors: Carsten Denkert; Jan Budczies; Silvia Darb-Esfahani; Balazs Györffy; Jalid Sehouli; Dominique Könsgen; Robert Zeillinger; Wilko Weichert; Aurelia Noske; Ann-Christin Buckendahl; Berit M Müller; Manfred Dietel; Hermann Lage
Journal: J Pathol Date: 2009-06 Impact factor: 7.996

6. Prospective assessment of lymphatic dissemination in endometrial cancer: a paradigm shift in surgical staging.

Authors: Andrea Mariani; Sean C Dowdy; William A Cliby; Bobbie S Gostout; Monica B Jones; Timothy O Wilson; Karl C Podratz
Journal: Gynecol Oncol Date: 2008-03-04 Impact factor: 5.482

Review 7. Gene expression profiling and prediction of clinical outcome in ovarian cancer.

Authors: Renaud Sabatier; Pascal Finetti; Nathalie Cervera; Daniel Birnbaum; François Bertucci
Journal: Crit Rev Oncol Hematol Date: 2009-02-26 Impact factor: 6.312

8. Comprehensive molecular characterization of human colon and rectal cancer.

Authors:
Journal: Nature Date: 2012-07-18 Impact factor: 49.962

9. Integrated genomic analyses of ovarian carcinoma.

Authors:
Journal: Nature Date: 2011-06-29 Impact factor: 49.962

10. Optimized application of penalized regression methods to diverse genomic data.

Authors: Levi Waldron; Melania Pintilie; Ming-Sound Tsao; Frances A Shepherd; Curtis Huttenhower; Igor Jurisica
Journal: Bioinformatics Date: 2011-12-15 Impact factor: 6.937

6 in total

1. Machine learning based detection of age-related macular degeneration (AMD) and diabetic macular edema (DME) from optical coherence tomography (OCT) images.

Authors: Yu Wang; Yaonan Zhang; Zhaomin Yao; Ruixue Zhao; Fengfeng Zhou
Journal: Biomed Opt Express Date: 2016-11-03 Impact factor: 3.732