Rough sets for in silico identification of differentially expressed miRNAs.

Abstract

The microRNAs, also known as miRNAs, are the class of small noncoding RNAs. They repress the expression of a gene posttranscriptionally. In effect, they regulate expression of a gene or protein. It has been observed that they play an important role in various cellular processes and thus help in carrying out normal functioning of a cell. However, dysregulation of miRNAs is found to be a major cause of a disease. Various studies have also shown the role of miRNAs in cancer and the utility of miRNAs for the diagnosis of cancer and other diseases. Unlike with mRNAs, a modest number of miRNAs might be sufficient to classify human cancers. However, the absence of a robust method to identify differentially expressed miRNAs makes this an open problem. In this regard, this paper presents a novel approach for in silico identification of differentially expressed miRNAs from microarray expression data sets. It integrates judiciously the theory of rough sets and merit of the so-called B.632+ bootstrap error estimate. While rough sets select relevant and significant miRNAs from expression data, the B.632+ error rate minimizes the variability and bias of the derived results. The effectiveness of the proposed approach, along with a comparison with other related approaches, is demonstrated on several miRNA microarray expression data sets, using the support vector machine.

Keywords: bootstrap error; feature selection; microRNA; support vector machine

Substances：
MicroRNAs

Year: 2013 PMID: 24098080 PMCID: PMC3790281 DOI: 10.2147/IJN.S40739

Source DB: PubMed Journal: Int J Nanomedicine ISSN: 1176-9114

Introduction

The microRNAs or miRNAs, a class of short, approximately 22-nucleotide, noncoding RNAs found in many plants and animals, often act posttranscriptionally to inhibit mRNA expression. Hence, the miRNAs are related to diverse cellular processes and are regarded as important components of the gene regulatory network. Multiple reports have noted the utility of miRNAs for the diagnosis of cancer and other diseases. Unlike with mRNAs, a modest number of miRNAs, 200 in total, might be sufficient to classify human cancers.1,2 Moreover, the bead-based miRNA detection method has the attractive property of being not only accurate and specific, but also easy to implement in a routine clinical setting. In addition, unlike mRNAs, miRNAs remain largely intact in routinely collected, formalin-fixed, and paraffin-embedded specimens.2 Recent studies have also shown that miRNAs can be detected in serum.2 These studies offer the promise of utilizing miRNA screening via less invasive blood-based mechanisms. In addition, mature miRNAs are relatively stable. These phenomena make miRNAs superior molecular markers and targets for interrogation and as such, miRNA expression profiling can be utilized as a tool for cancer diagnosis and other diseases. The functions of miRNAs have regulatory effects in various cellular functions. Just as miRNA is involved in the normal functioning of eukaryotic cells, so has dysregulation of miRNA been associated with disease.3 This indicates that these miRNAs can prove to be potential biomarkers for developing a diagnostic tool. Hence, in silico identification of differentially expressed miRNAs that target genes involved in diseases is necessary. These differentially expressed miRNAs can be further used in developing effective diagnostic tools. Recently, a few studies were carried out to identify differentially expressed miRNAs.4–8 However, the absence of a robust method of identification makes this an open problem. 
Hence, miRNA expression data sets need to be explored to understand the complex biological activities of miRNAs. A miRNA expression data set can be represented by an expression table or matrix, where each row corresponds to one particular miRNA, each column to a sample or time point, and each entry is the measured expression level of a particular miRNA in a given sample or at a given time point. However, for microarray data, the number of training samples is typically very small, while the number of miRNAs is in the hundreds or thousands. In general, it is not possible to use all available miRNAs to form the prediction rule of a classifier. Using all the miRNAs also admits the noise associated with miRNAs of little or no discriminatory power, which degrades the performance of the prediction rule when it is applied to unclassified or test samples. In other words, although the apparent error rate, which is the proportion of the training samples misclassified by the prediction rule, decreases as the rule is formed from more and more miRNAs, the error rate in classifying samples outside the training set eventually increases. That is, the generalization error of the prediction rule increases if it is formed from a sufficiently large number of miRNAs. Hence, in practice, some feature selection procedure has to be applied to reduce the number of miRNAs used in constructing the prediction rule.9 The method called significance analysis of microarrays is used in several works10–15 to identify differentially expressed miRNAs. Different statistical tests are also employed to identify differentially expressed miRNAs.1,4–8,16–19 Xu et al20 used the particle swarm optimization technique for selecting important miRNAs that contribute to the discrimination of different cancer types. 
However, the mutual information21 or f-information22-based minimum redundancy-maximum relevance framework can also be used to select a set of nonredundant and relevant miRNAs for sample classification. One of the main problems in miRNA expression data analysis is uncertainty. Some of the sources of this uncertainty include imprecision in computations and vagueness in class definition. In this context, the rough set theory has gained popularity in modeling and propagating uncertainty. It deals with vagueness and incompleteness and is proposed for indiscernibility in classification, according to some similarity.23 It has been applied successfully to feature selection of discrete valued data.24 Given a data set with discretized attribute values, it is possible to find a subset of the original attributes, using rough set theory, that are the most informative; all other attributes can be removed from the data set with minimal information loss. From the dimensionality reduction perspective, informative features are those that are most useful in determining classifications from their values.23,24 Rough set theory has been successfully applied to microarray data analysis.25–34 In general, the performance of the prediction rule generated by a classifier for a subset of selected miRNAs is evaluated by leave-one-out cross-validation (LOOCV) error. Given that the entire set of available samples is relatively small, in practice, one would like to make full use of all available samples in the miRNA selection and training of the prediction rule. But, if the LOOCV is calculated within the miRNA selection process, it has a selection bias when it is used as an estimate of the prediction error. The LOOCV error of the prediction rule obtained during the selection of the miRNAs provides a too optimistic estimate of the prediction error rate. Hence, an external cross-validation should be undertaken subsequent to the miRNA selection process to correct for this selection bias. 
Alternatively, the bootstrap procedure can be used.35,36 Although the LOOCV error with external cross-validation is nearly unbiased, it can be highly variable, in the sense that there is no guarantee that the subset of miRNAs selected in each validation round will match the subset obtained during the original training of the rule on all the training samples. Indeed, with the huge number of miRNAs available, each round generally yields a subset that has, at most, only a few miRNAs in common with the subset selected during the original training of the rule. Suitably defined bootstrap procedures can reduce the variability of the LOOCV error, in addition to providing a direct assessment of variability for the estimated parameters in the prediction rule. However, the bootstrap approach overestimates the error. To address the weaknesses of both approaches, Efron and Tibshirani introduced the B.632+ error, which corrects the upward bias of the bootstrap error with the downwardly biased apparent error,35 and which is well suited to data sets with a small number of training samples and a large number of features or miRNAs. In this regard, this paper presents a novel approach for in silico identification of differentially expressed miRNAs from expression data sets. It integrates the merit of the rough set-based feature-selection algorithm using a maximum relevance maximum significance criterion (RSMRMS)29 and the concept of the so-called B.632+ error rate.35 The RSMRMS algorithm selects a subset of miRNAs from a data set by maximizing both relevance and significance of the selected miRNAs. It employs rough set theory to compute both the relevance and significance of the miRNAs. Hence, the only information required in the feature selection method is in the form of equivalence partitions for each miRNA, which can be automatically derived from the given microarray data set. 
This avoids the need for domain experts to provide information on the data involved and ties in with the advantage of rough sets in that it requires no information other than the data set itself. On the other hand, the B.632+ error rate minimizes the variability and bias of the derived results. The support vector machine is used to compute the B.632+ error rate as well as several other types of error rates, as it maximizes the margin between data samples in different classes. The effectiveness of the proposed approach, along with a comparison with other related approaches, is demonstrated on a set of miRNA expression data sets. The paper is organized as follows: The next section reports a brief description of several miRNA data sets used in the current study, along with the proposed methodology, which covers an overview of the rough sets, the rough set-based miRNA selection algorithm, fuzzy discretization method, the concept of the B.632+ error rate, and the support vector machine. Implementation details, experimental results, discussion, and a comparison among different algorithms are presented in the following section. Finally, concluding remarks are given.

Material and method

Data sets used

In the current research work, three publicly available miRNA expression data sets are used to establish the effectiveness of the proposed approach. The data sets, with accession numbers GSE17681, GSE17846, and GSE29352, were downloaded from the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/). The first data set was generated to detect specific patterns of miRNAs in peripheral blood samples of lung cancer patients. As controls, blood samples from donors without known disease were tested. The numbers of miRNAs, samples, and classes in this data set are 866, 36, and two, respectively. The second data set represents the analysis of miRNA profiling in the peripheral blood samples of multiple sclerosis patients and in the blood of normal donors. It contains 864 miRNAs, 41 samples, and two classes. In the third data set, miRNA expression profiles have been generated for pancreatic cystic tumors with low malignant potential (serous microcystic adenomas) and high malignant potential (mucinous cystadenoma and intraductal papillary mucinous neoplasm [IPMN]). These expression profiles are further compared with those of pancreatic ductal adenocarcinoma and carcinoma-ex-IPMN. The data set contains 43 samples, 885 miRNAs, and three classes.

Proposed method

The rough set-based proposed in silico approach is illustrated in Figure 1. It mainly consists of a rough set-based feature selection method (ie, RSMRMS), a support vector machine (SVM), and several types of error analysis parts, namely, apparent error (AE), bootstrap error (B1), no-information error (γ), and B.632+ error. This section presents each of these topics in detail, along with the basic notions of rough sets.

Figure 1

Schematic flow diagram of the proposed in silico approach for identification of differentially expressed miRNAs.

Abbreviations: miRNA, microRNA; RSMRMS, rough set–based maximum relevance maximum significance criterion; SVM, support vector machine; AE, apparent error; γ error, no-information error.

Rough sets

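The theory of rough sets begins with the notion of an approximation space, which is a pair $\langle \mathbb{U}, \mathbb{A} \rangle$, where $\mathbb{U}$ is a nonempty set, the universe of discourse, and $\mathbb{A}$ is a family of attributes, also called knowledge in the universe. $V$ is the value domain of $\mathbb{A}$ and $f$ is an information function $f: \mathbb{U} \times \mathbb{A} \rightarrow V$. An approximation space is also called an information system.23 Any subset $\mathbb{P}$ of knowledge $\mathbb{A}$ defines an equivalence, also called indiscernibility, relation $IND(\mathbb{P})$ on $\mathbb{U}$:

$$IND(\mathbb{P}) = \{(x_i, x_j) \in \mathbb{U} \times \mathbb{U} \mid \forall a \in \mathbb{P},\ f(x_i, a) = f(x_j, a)\}.$$

If $(x_i, x_j) \in IND(\mathbb{P})$, then $x_i$ and $x_j$ are indiscernible by attributes from $\mathbb{P}$. The partition of $\mathbb{U}$ generated by $IND(\mathbb{P})$ is denoted as

$$\mathbb{U}/IND(\mathbb{P}) = \{[x_i]_{\mathbb{P}} : x_i \in \mathbb{U}\}$$

where $[x_i]_{\mathbb{P}}$ is the equivalence class containing $x_i$. The elements in $[x_i]_{\mathbb{P}}$ are indiscernible or equivalent with respect to knowledge $\mathbb{P}$. Equivalence classes, also termed information granules, are used to characterize arbitrary subsets of $\mathbb{U}$. The equivalence classes of $IND(\mathbb{P})$ and the empty set $\emptyset$ are the elementary sets in the approximation space $\langle \mathbb{U}, \mathbb{A} \rangle$.

Given an arbitrary set $X \subseteq \mathbb{U}$, in general, it may not be possible to describe $X$ precisely in $\langle \mathbb{U}, \mathbb{A} \rangle$. One may characterize $X$ by a pair of lower and upper approximations, defined as follows:23

$$\underline{\mathbb{P}}(X) = \bigcup \{[x_i]_{\mathbb{P}} \mid [x_i]_{\mathbb{P}} \subseteq X\}, \qquad \overline{\mathbb{P}}(X) = \bigcup \{[x_i]_{\mathbb{P}} \mid [x_i]_{\mathbb{P}} \cap X \neq \emptyset\}.$$

Hence, the lower approximation $\underline{\mathbb{P}}(X)$ is the union of all the elementary sets which are subsets of $X$, and the upper approximation $\overline{\mathbb{P}}(X)$ is the union of all the elementary sets which have a nonempty intersection with $X$. The tuple $\langle \underline{\mathbb{P}}(X), \overline{\mathbb{P}}(X) \rangle$ is the representation of an ordinary set $X$ in the approximation space $\langle \mathbb{U}, \mathbb{A} \rangle$, or simply called the rough set of $X$. The lower (respectively upper) approximation $\underline{\mathbb{P}}(X)$ (respectively $\overline{\mathbb{P}}(X)$) is interpreted as the collection of those elements of $\mathbb{U}$ that definitely (respectively possibly) belong to $X$. The lower approximation is sometimes also called the positive region, denoted as $POS_{\mathbb{P}}(X)$. A set $X$ is said to be definable or exact in $\langle \mathbb{U}, \mathbb{A} \rangle$ if $\underline{\mathbb{P}}(X) = \overline{\mathbb{P}}(X)$. Otherwise $X$ is indefinable and termed a rough set.

Definition 1: An information system $\langle \mathbb{U}, \mathbb{A} \rangle$ is called a decision table if the attribute set $\mathbb{A} = \mathbb{C} \cup \mathbb{D}$, where $\mathbb{C}$ is the condition attribute set and $\mathbb{D}$ is the decision attribute set. The dependency between $\mathbb{C}$ and $\mathbb{D}$ can be defined as23

$$\gamma_{\mathbb{C}}(\mathbb{D}) = \frac{|POS_{\mathbb{C}}(\mathbb{D})|}{|\mathbb{U}|}, \qquad POS_{\mathbb{C}}(\mathbb{D}) = \bigcup_{i} \underline{\mathbb{C}}(X_i)$$

where $X_i$ is the ith equivalence class induced by $\mathbb{D}$ and $|\cdot|$ denotes the cardinality of a set.

Definition 2: Given $\mathbb{C}$, $\mathbb{D}$, and an attribute $A \in \mathbb{C}$, the significance of the attribute $A$ is defined as23

$$\sigma_{\mathbb{C}}(\mathbb{D}, A) = \gamma_{\mathbb{C}}(\mathbb{D}) - \gamma_{\mathbb{C} - \{A\}}(\mathbb{D}).$$

The change in dependency that arises when an attribute is removed from the set of condition attributes is a measure of the significance of the attribute. The higher the change in dependency, the more significant the attribute is. If the significance is 0, then the attribute is dispensable.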
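The definitions in this section can be illustrated with a minimal Python sketch on a toy decision table. The table, the attribute names `a`, `b`, and `d`, and all function names are hypothetical and chosen only for illustration; the code follows the standard rough set definitions of equivalence classes, approximations, dependency, and significance rather than any implementation used in the study.

```python
from collections import defaultdict


def partition(table, attrs):
    """Equivalence classes of IND(attrs): objects sharing values on every attribute."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())


def lower_approx(table, attrs, target):
    """Union of the equivalence classes that are subsets of target."""
    return set().union(*[b for b in partition(table, attrs) if b <= target])


def upper_approx(table, attrs, target):
    """Union of the equivalence classes that intersect target."""
    return set().union(*[b for b in partition(table, attrs) if b & target])


def dependency(table, cond, dec):
    """Dependency of dec on cond: |POS_cond(dec)| / |U|."""
    positive = set()
    for decision_class in partition(table, dec):
        positive |= lower_approx(table, cond, decision_class)
    return len(positive) / len(table)


def significance(table, cond, dec, attr):
    """Drop in dependency when attr is removed from cond."""
    rest = [a for a in cond if a != attr]
    return dependency(table, cond, dec) - dependency(table, rest, dec)


# Toy decision table: two discretized condition attributes and one decision attribute.
table = {
    "s1": {"a": "low", "b": "low", "d": "ctrl"},
    "s2": {"a": "low", "b": "high", "d": "ctrl"},
    "s3": {"a": "high", "b": "low", "d": "case"},
    "s4": {"a": "high", "b": "high", "d": "case"},
    "s5": {"a": "med", "b": "low", "d": "ctrl"},
    "s6": {"a": "med", "b": "high", "d": "case"},
}
ctrl = {"s1", "s2", "s5"}

lower_ctrl = lower_approx(table, ["a"], ctrl)   # samples that are certainly controls
upper_ctrl = upper_approx(table, ["a"], ctrl)   # samples that are possibly controls
dep_a = dependency(table, ["a"], ["d"])
sig_b = significance(table, ["a", "b"], ["d"], "b")
```

In this toy table the attribute `a` alone leaves the `med` samples ambiguous, so its dependency is below 1, and adding `b` resolves them, which gives `b` a nonzero significance.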

RSMRMS algorithm

In real data analysis such as microarray data, the data set may contain a number of insignificant features. The presence of such irrelevant and insignificant features may lead to a reduction in the useful information. Ideally, the selected features should have high relevance to the classes and high significance in the feature set. The features with high relevance are expected to be able to predict the classes of the samples. However, if insignificant features are present in the subset, they may reduce the prediction capability and may contain similar biological information. A feature set with high relevance and high significance enhances the predictive capability. Accordingly, a measure is required that can enhance the effectiveness of the feature set. In this work, the rough set theory is used to select the relevant and significant miRNAs from high-dimensional microarray data sets.

Let $\mathbb{C} = \{A_1, \ldots, A_i, \ldots, A_m\}$ be the set of m miRNAs of a given microarray data set and $\mathbb{S}$ be the set of selected miRNAs. Define $\gamma_{A_i}(\mathbb{D})$ as the relevance of the miRNA $A_i$ with respect to the class labels $\mathbb{D}$, while $\sigma_{\{A_i, A_j\}}(\mathbb{D}, A_j)$ is the significance of the miRNA $A_j$ with respect to the set $\{A_i, A_j\}$. The total relevance of all selected miRNAs is as follows:

$$\mathcal{J}_{\text{relev}} = \sum_{A_i \in \mathbb{S}} \gamma_{A_i}(\mathbb{D})$$

while the total significance among the selected miRNAs is

$$\mathcal{J}_{\text{signf}} = \sum_{A_i \neq A_j \in \mathbb{S}} \sigma_{\{A_i, A_j\}}(\mathbb{D}, A_j).$$

Therefore, the problem of selecting a set $\mathbb{S}$ of relevant and significant miRNAs from the whole set $\mathbb{C}$ of m miRNAs is equivalent to maximizing both $\mathcal{J}_{\text{relev}}$ and $\mathcal{J}_{\text{signf}}$, that is, to maximizing the objective function $\mathcal{J}$, where

$$\mathcal{J} = \mathcal{J}_{\text{relev}} + \beta \mathcal{J}_{\text{signf}}$$

where β is a weight parameter. To solve the above problem, a greedy algorithm is used.29 The relevance and significance of an individual miRNA are calculated based on the theory of rough sets, using Equations 3 and 4 (the dependency and significance measures of the rough sets section), respectively. The weight parameter β in the RSMRMS algorithm regulates the relative importance of the significance of the candidate miRNA with respect to the already-selected miRNAs and the relevance with the output class. If β is zero, only the relevance with the output class is considered for each miRNA selection. 
If β increases, this measure is incremented by a quantity proportional to the total significance, with respect to the already-selected miRNAs. The presence of a β value larger than zero is crucial in order to obtain good results. If the significance between miRNAs is not taken into account, selecting the miRNAs with the highest relevance with respect to the output class may tend to produce a set of redundant miRNAs that may leave out useful complementary information.
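The effect of β can be made concrete with a small greedy forward-selection sketch. The exact steps of the greedy algorithm are given in reference 29 and are not reproduced in this text, so the scheme below is an assumption: at each step it adds the candidate with the largest relevance plus β times its total significance with respect to the already-selected miRNAs. The toy table, the attribute names `a`, `a_dup`, and `b`, and the function names are hypothetical.

```python
from collections import defaultdict


def partition(table, attrs):
    """Equivalence classes induced by equal values on attrs."""
    blocks = defaultdict(set)
    for obj, row in table.items():
        blocks[tuple(row[a] for a in attrs)].add(obj)
    return list(blocks.values())


def dependency(table, cond, dec):
    """Rough set dependency of dec on cond: |POS_cond(dec)| / |U|."""
    cond_blocks = partition(table, cond)
    positive = set()
    for decision_class in partition(table, dec):
        for block in cond_blocks:
            if block <= decision_class:
                positive |= block
    return len(positive) / len(table)


def significance(table, pair, dec, attr):
    """Significance of attr within the attribute set pair."""
    rest = [a for a in pair if a != attr]
    return dependency(table, pair, dec) - dependency(table, rest, dec)


def greedy_select(table, candidates, dec, d, beta):
    """Assumed greedy scheme: relevance plus beta times total significance."""
    relevance = {a: dependency(table, [a], dec) for a in candidates}
    remaining = list(candidates)
    selected = [max(remaining, key=lambda a: relevance[a])]
    remaining.remove(selected[0])
    while remaining and len(selected) < d:
        def score(a_j):
            total_sig = sum(
                significance(table, [a_i, a_j], dec, a_j) for a_i in selected
            )
            return relevance[a_j] + beta * total_sig
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: a_dup duplicates a, while b is less relevant but complementary to a.
rows = [
    ("low", "med", "ctrl"), ("low", "med", "ctrl"),
    ("med", "low", "ctrl"), ("med", "med", "ctrl"),
    ("med", "high", "case"), ("med", "high", "case"),
    ("high", "med", "case"), ("high", "med", "case"),
]
table = {
    f"s{i + 1}": {"a": a, "a_dup": a, "b": b, "d": d}
    for i, (a, b, d) in enumerate(rows)
}

only_relevance = greedy_select(table, ["a", "a_dup", "b"], ["d"], d=2, beta=0.0)
with_significance = greedy_select(table, ["a", "a_dup", "b"], ["d"], d=2, beta=1.0)
```

With β = 0 the sketch picks the redundant duplicate because it has the higher relevance; with β = 1 it prefers the complementary attribute, which mirrors the argument made in the text for using a β value larger than zero.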

Fuzzy discretization

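In miRNA expression data, the class labels of samples are represented by discrete symbols, while the expression values of miRNAs are continuous. Hence, to measure both relevance and significance of miRNAs using rough set theory, the continuous expression values of a miRNA have to be divided into several discrete partitions to generate equivalence classes. In this regard, a fuzzy set-based discretization method is used to generate the equivalence classes required to compute both the relevance and significance of the miRNAs. The family of normal fuzzy sets produced by a fuzzy partitioning of the universe of discourse can play the role of fuzzy equivalence classes.

Given a finite set $\mathbb{U}$, $\mathbb{C}$ is a fuzzy condition attribute set in $\mathbb{U}$, which generates a fuzzy equivalence partition on $\mathbb{U}$. If $c$ denotes the number of fuzzy equivalence classes generated by the fuzzy equivalence relation and $n$ is the number of objects in $\mathbb{U}$, then $c$-partitions of $\mathbb{U}$ are sets of $(cn)$ values $\{\mu_{ij}^{\mathbb{C}}\}$ that can be conveniently arrayed as a $(c \times n)$ matrix $\mathbb{M}_{\mathbb{C}} = [\mu_{ij}^{\mathbb{C}}]$, which is denoted by

$$\mathbb{M}_{\mathbb{C}} = \begin{pmatrix} \mu_{11}^{\mathbb{C}} & \mu_{12}^{\mathbb{C}} & \cdots & \mu_{1n}^{\mathbb{C}} \\ \mu_{21}^{\mathbb{C}} & \mu_{22}^{\mathbb{C}} & \cdots & \mu_{2n}^{\mathbb{C}} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{c1}^{\mathbb{C}} & \mu_{c2}^{\mathbb{C}} & \cdots & \mu_{cn}^{\mathbb{C}} \end{pmatrix}$$

where $\mu_{ij}^{\mathbb{C}} \in [0, 1]$ represents the membership of object $x_j$ in the ith fuzzy equivalence partition or class $F_i$.27,37 Each row of the matrix $\mathbb{M}_{\mathbb{C}}$ is a fuzzy equivalence partition or class.

In the rough set-based feature selection method, the π function in one-dimensional form is used to assign membership values to different fuzzy equivalence classes for the input miRNAs. A fuzzy set with membership function $\pi(x; \bar{c}, \sigma)$ represents a set of points clustered around $\bar{c}$, where

$$\pi(x; \bar{c}, \sigma) = \begin{cases} 2\left(1 - \dfrac{\|x - \bar{c}\|}{\sigma}\right)^2 & \text{for } \dfrac{\sigma}{2} \le \|x - \bar{c}\| \le \sigma \\[2ex] 1 - 2\left(\dfrac{\|x - \bar{c}\|}{\sigma}\right)^2 & \text{for } 0 \le \|x - \bar{c}\| \le \dfrac{\sigma}{2} \\[2ex] 0 & \text{otherwise} \end{cases}$$

where $\sigma > 0$ is the radius of the π function with $\bar{c}$ as the central point and $\|\cdot\|$ denotes the Euclidean norm. When the pattern $x$ lies at the central point $\bar{c}$ of a class, then $\|x - \bar{c}\| = 0$ and its membership value is maximum, that is, $\pi(\bar{c}; \bar{c}, \sigma) = 1$. The membership value of a point decreases as its distance from the central point $\bar{c}$ (ie, $\|x - \bar{c}\|$) increases. 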
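When $\|x - \bar{c}\| = \sigma/2$, that is, at half the radius σ from the central point $\bar{c}$ of the π function, the membership value of $x$ is 0.5, and this is called a crossover point.38 The $(c \times n)$ matrix $\mathbb{M}_{A_i}$, corresponding to the ith miRNA $A_i$, can be calculated from the c-fuzzy equivalence classes of the objects $x = \{x_1, \ldots, x_j, \ldots, x_n\}$, where

$$\mu_{kj}^{A_i} = \frac{\pi(x_j; \bar{c}_k, \sigma_k)}{\sum_{l=1}^{c} \pi(x_j; \bar{c}_l, \sigma_l)}.$$

In effect, each position $\mu_{kj}^{A_i}$ of the matrix $\mathbb{M}_{A_i}$ must satisfy the following conditions:

$$\mu_{kj}^{A_i} \in [0, 1]; \qquad \sum_{k=1}^{c} \mu_{kj}^{A_i} = 1,\ \forall j$$

and for any value of k, if $s = \arg\max_{j} \{\mu_{kj}^{A_i}\}$, then $\max_{j} \{\mu_{kj}^{A_i}\} = \max_{l} \{\mu_{ls}^{A_i}\} > 0$. After the generation of the matrix $\mathbb{M}_{A_i}$ corresponding to the miRNA $A_i$, the object $x_j$ is assigned to one of the c equivalence classes, based on the maximum value of memberships of the object in different equivalence classes, as follows:

$$x_j \in F_p, \quad \text{where } p = \arg\max_{k} \{\mu_{kj}^{A_i}\}.$$

Each input real-valued miRNA in quantitative form can be assigned to different fuzzy equivalence classes in terms of membership values, using the π fuzzy set with appropriate $\bar{c}$ and σ. The centers and radii of the π functions along each miRNA axis are determined automatically from the distribution of the training patterns. In the proposed RSMRMS algorithm, three fuzzy equivalence classes (c = 3), namely, low, medium, and high, are considered. These three equivalence classes correspond to underexpression, baseline, and overexpression of continuous-valued miRNAs, respectively. Corresponding to the three fuzzy sets, low, medium, and high, the following relations hold:

$$\bar{c}_1 = \bar{c}_{\text{low}}(A_i); \quad \bar{c}_2 = \bar{c}_{\text{medium}}(A_i); \quad \bar{c}_3 = \bar{c}_{\text{high}}(A_i);$$
$$\sigma_1 = \sigma_{\text{low}}(A_i); \quad \sigma_2 = \sigma_{\text{medium}}(A_i); \quad \sigma_3 = \sigma_{\text{high}}(A_i).$$

The parameters $\bar{c}$ and σ of each π fuzzy set are computed according to the following procedure.39 Let $m_i$ be the mean of the objects $x = \{x_1, \ldots, x_j, \ldots, x_n\}$ along the ith miRNA $A_i$. Then $m_{il}$ and $m_{ih}$ are defined as the mean along the ith miRNA of the objects having coordinate values in the range $[A_i^{\min}, m_i)$ and $(m_i, A_i^{\max}]$, respectively, where $A_i^{\max}$ and $A_i^{\min}$ denote the upper and lower bounds of the dynamic range of miRNA $A_i$ for the training set. For the three fuzzy sets, low, medium, and high, the centers and corresponding radii are computed as follows:

$$\bar{c}_{\text{low}}(A_i) = m_{il}; \quad \bar{c}_{\text{medium}}(A_i) = m_i; \quad \bar{c}_{\text{high}}(A_i) = m_{ih};$$
$$\sigma_{\text{low}}(A_i) = 2\,(\bar{c}_{\text{medium}}(A_i) - \bar{c}_{\text{low}}(A_i)); \quad \sigma_{\text{high}}(A_i) = 2\,(\bar{c}_{\text{high}}(A_i) - \bar{c}_{\text{medium}}(A_i));$$
$$\sigma_{\text{medium}}(A_i) = \eta \times \frac{\sigma_{\text{low}}(A_i)\,(A_i^{\max} - \bar{c}_{\text{medium}}(A_i)) + \sigma_{\text{high}}(A_i)\,(\bar{c}_{\text{medium}}(A_i) - A_i^{\min})}{A_i^{\max} - A_i^{\min}}$$

where η is a multiplicative parameter controlling the extent of the overlapping. The distribution of the patterns or objects along each miRNA axis is taken into account while computing the corresponding centers and radii of the fuzzy sets. 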
Also, the amount of overlap between the three fuzzy sets can be different along the different axes, depending on the distribution of the objects or patterns.
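The discretization procedure can be sketched in Python for a single miRNA. This is an illustrative sketch under stated assumptions, not the implementation used in the study: the π function and the center and radius rules follow the commonly used low/medium/high formulation with a factor of 2 in the low and high radii, the fallback to the nearest center for a value outside every fuzzy set is an added convenience, and the function names and sample values are hypothetical.

```python
def pi_membership(x, center, radius):
    """One-dimensional pi function: 1 at the center, 0.5 at radius/2, 0 beyond radius."""
    dist = abs(x - center)
    if dist <= radius / 2:
        return 1 - 2 * (dist / radius) ** 2
    if dist <= radius:
        return 2 * (1 - dist / radius) ** 2
    return 0.0


def fuzzy_params(values, eta):
    """Centers and radii of the low, medium, and high sets for one miRNA."""
    lo, hi = min(values), max(values)
    mean = sum(values) / len(values)
    below = [v for v in values if v < mean]
    above = [v for v in values if v > mean]
    if not below or not above:
        raise ValueError("expression values must not be constant")
    c_low, c_med, c_high = sum(below) / len(below), mean, sum(above) / len(above)
    s_low = 2 * (c_med - c_low)
    s_high = 2 * (c_high - c_med)
    s_med = eta * (s_low * (hi - c_med) + s_high * (c_med - lo)) / (hi - lo)
    return [(c_low, s_low), (c_med, s_med), (c_high, s_high)]


def memberships(x, params):
    """Normalized memberships of x in the three fuzzy equivalence classes."""
    raw = [pi_membership(x, c, s) for c, s in params]
    total = sum(raw)
    if total == 0:
        # Assumption: a value outside every fuzzy set goes to the nearest center.
        nearest = min(range(len(params)), key=lambda k: abs(x - params[k][0]))
        return [1.0 if k == nearest else 0.0 for k in range(len(params))]
    return [r / total for r in raw]


def discretize(values, eta=1.0):
    """Assign each value to the class with the largest membership."""
    names = ["low", "medium", "high"]
    params = fuzzy_params(values, eta)
    labels = []
    for x in values:
        mu = memberships(x, params)
        labels.append(names[max(range(3), key=lambda k: mu[k])])
    return labels


expression = [1.0, 2.0, 3.0, 5.0, 7.0, 8.0, 9.0]
params = fuzzy_params(expression, eta=1.0)
labels = discretize(expression, eta=1.0)
```

The resulting labels are the discrete values from which crisp equivalence classes can be formed for the rough set computations; changing `eta` widens or narrows the medium set and therefore moves the boundaries between classes.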

B.632+ error rate

In order to minimize the variability and bias of the derived result, the so-called B.632+ bootstrap approach35 is used, which is defined as follows:

$$\text{B.632+} = (1 - \omega)\,\text{AE} + \omega\,\text{B1}$$

where AE denotes the proportion of the original training samples misclassified, termed the apparent error rate, and B1 is the bootstrap error, defined as follows:

$$\text{B1} = \frac{1}{n} \sum_{j=1}^{n} \left( \frac{\sum_{k=1}^{M} I_{jk} Q_{jk}}{\sum_{k=1}^{M} I_{jk}} \right)$$

where n is the number of original samples and M is the number of bootstrap samples. If the sample $x_j$ is not contained in the kth bootstrap sample, then $I_{jk} = 1$, otherwise 0. Similarly, if $x_j$ is misclassified, $Q_{jk} = 1$, otherwise 0. The weight parameter ω is given by

$$\omega = \frac{0.632}{1 - 0.368\,r}, \qquad r = \frac{\text{B1} - \text{AE}}{\gamma - \text{AE}}, \qquad \gamma = \sum_{i=1}^{K} p_i (1 - q_i)$$

where K is the number of classes, $p_i$ is the proportion of the samples from the ith class, and $q_i$ is the proportion of them assigned to the ith class. Also, γ is termed the no-information error rate that would apply if the distribution of the class-membership label of the sample $x_j$ did not depend on its feature vector.
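The estimator can be written as a few small Python functions. This is a sketch that assumes the Efron and Tibshirani form of the B.632+ estimator with relative overfitting rate r = (B1 − AE)/(γ − AE); clamping r to the interval [0, 1] and skipping samples that appear in every bootstrap sample are added safeguards that this text does not state. Function names are hypothetical, and the numeric inputs in the usage lines are the minimum error values listed for GSE17681 in Table 2, which were attained at different numbers of miRNAs, so the result is only illustrative.

```python
def bootstrap_error(left_out, wrong):
    """B1 from indicator matrices of shape (n samples) x (M bootstrap samples).

    left_out[j][k] is 1 if sample j is absent from bootstrap sample k;
    wrong[j][k] is 1 if sample j is misclassified by the rule trained on sample k.
    """
    rates = []
    for i_row, q_row in zip(left_out, wrong):
        times_left_out = sum(i_row)
        if times_left_out == 0:
            continue  # assumption: skip samples never left out
        errors = sum(i * q for i, q in zip(i_row, q_row))
        rates.append(errors / times_left_out)
    return sum(rates) / len(rates)


def no_information_error(p, q):
    """gamma = sum_i p_i * (1 - q_i) over the K classes."""
    return sum(p_i * (1 - q_i) for p_i, q_i in zip(p, q))


def b632_plus(ae, b1, gamma):
    """B.632+ = (1 - w) * AE + w * B1 with w = 0.632 / (1 - 0.368 * r)."""
    if gamma > ae:
        r = (b1 - ae) / (gamma - ae)
    else:
        r = 0.0
    r = min(max(r, 0.0), 1.0)  # assumption: keep the overfitting rate in [0, 1]
    w = 0.632 / (1 - 0.368 * r)
    return (1 - w) * ae + w * b1


# Small indicator matrices for two samples and three bootstrap samples.
b1_toy = bootstrap_error([[1, 0, 1], [0, 1, 1]], [[1, 0, 0], [0, 1, 1]])

# Illustrative call with the GSE17681 minimum values from Table 2.
estimate = b632_plus(ae=0.000, b1=0.142, gamma=0.423)
```

Because the weight grows from 0.632 toward 1 as the gap between B1 and AE approaches the gap between γ and AE, the estimate moves toward B1 when the rule is strongly overfitted.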

SVM

In the current study, the SVM40 is used to compute the B.632+ error rate. The SVM is a margin classifier that draws an optimal hyperplane in the feature vector space; this defines a boundary that maximizes the margin between data samples in different classes, therefore leading to good generalization properties. A key factor in the SVM is to use kernels to construct a nonlinear decision boundary. In the present work, linear kernels are used. The source code of the SVM is downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Experimental results and discussions

In this section, the performance of the RSMRMS algorithm is compared with that of the mutual information–based minimum redundancy-maximum relevance (mRMR) algorithm21 on three miRNA microarray data sets. The fuzzy set–based discretization method is also compared with several other discretization methods.22,24 The margin classifier SVM40 is used to evaluate the performance of the different algorithms. To compute the different types of error rates obtained using the SVM, the bootstrap approach is performed on each miRNA expression data set. For each training set, a set of differentially expressed miRNAs is first selected, and the SVM is then trained on the selected miRNAs. After training, the test set is restricted to the miRNAs selected for that training set, and the class label of each test sample is predicted using the SVM. For each data set, the 50 top-ranked miRNAs (d = 50) are selected for the analysis.
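The evaluation loop described above can be sketched as follows, with the miRNA selection repeated inside every bootstrap sample so that the left-out samples play no part in selecting miRNAs or training the rule. Two stand-ins are assumptions made to keep the sketch self-contained: a nearest-centroid classifier replaces the linear SVM, and a centroid-spread ranking replaces the RSMRMS selector. All function names and the toy data are hypothetical.

```python
import random


def nearest_centroid_fit(X, y):
    """Stand-in classifier: per-class mean vector."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, v in enumerate(row):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def nearest_centroid_predict(model, row):
    """Label of the closest class centroid (squared Euclidean distance)."""
    return min(model, key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, model[lab])))


def select_top(X, y, d):
    """Stand-in selector: rank features by the spread of their class means."""
    model = nearest_centroid_fit(X, y)
    spread = [
        max(c[i] for c in model.values()) - min(c[i] for c in model.values())
        for i in range(len(X[0]))
    ]
    return sorted(range(len(spread)), key=lambda i: -spread[i])[:d]


def apparent_error(X, y, d):
    """AE: select, train, and test on the full training set."""
    feats = select_top(X, y, d)
    model = nearest_centroid_fit([[r[f] for f in feats] for r in X], y)
    wrong = sum(
        nearest_centroid_predict(model, [r[f] for f in feats]) != lab
        for r, lab in zip(X, y)
    )
    return wrong / len(X)


def bootstrap_b1(X, y, d, n_boot, seed=0):
    """B1: error on samples left out of each bootstrap sample."""
    rng = random.Random(seed)
    n = len(X)
    left_out, errors = [0] * n, [0] * n
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feats = select_top(Xb, yb, d)  # selection uses the bootstrap sample only
        model = nearest_centroid_fit([[r[f] for f in feats] for r in Xb], yb)
        drawn = set(idx)
        for j in range(n):
            if j not in drawn:
                left_out[j] += 1
                pred = nearest_centroid_predict(model, [X[j][f] for f in feats])
                errors[j] += pred != y[j]
    rates = [e / o for e, o in zip(errors, left_out) if o > 0]
    if not rates:
        raise ValueError("no sample was ever left out; increase n_boot")
    return sum(rates) / len(rates)


X = [[0.0, 5.0, 1.0], [0.1, 4.0, 1.1], [0.2, 6.0, 0.9],
     [1.0, 5.5, 1.0], [1.1, 4.5, 1.05], [0.9, 5.0, 0.95]]
y = [0, 0, 0, 1, 1, 1]
ae = apparent_error(X, y, d=1)
b1 = bootstrap_b1(X, y, d=1, n_boot=50, seed=1)
```

Keeping the selector inside the loop is the design choice that matters here: selecting miRNAs once on all samples and then bootstrapping only the classifier would reintroduce the selection bias discussed in the Introduction.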

Optimum values of parameters

The rough set–based miRNA selection algorithm uses the weight parameter β to control the relative importance of significance of a miRNA with respect to its relevance. On the other hand, the multiplicative parameter η controls the degree of overlapping between the three fuzzy sets that are used to generate fuzzy equivalence classes. Hence, the performance of the proposed approach very much depends on both the parameters β and η. The value of β is varied from 0.0 to 1.0, while the parameter η varies from 0.5 to 2.0. Extensive experimental results were obtained for all values of β and η on the three miRNA expression data sets. Figure 2 presents the variation of the B.632+ error rate obtained using the RSMRMS algorithm for different values of β and η on the three miRNA data sets. From the results reported in Figure 2, it is seen that as the value of β increases, the B.632+ error of the SVM decreases. On the other hand the error rate increases for very high or very low values of η. Table 1 presents the optimum values of β and η for which the minimal B.632+ error rate of the SVM is achieved. From the results reported in Table 1, it is seen that the proposed algorithm with β ≠ 0.0 provides a better result than that of β= 0.0, in all three cases, which justifies the importance of both the relevance and significance criteria. The corresponding values of η indicate that very large or very small amounts of overlapping among the three equivalence classes of input miRNAs are found to be undesirable for β> 0.0.

Figure 2

Variation of B.632+ error rate of the SVM with respect to multiplicative parameter η and weight parameter β.

Abbreviation: SVM, support vector machine.

Table 1

Optimum values of the two parameters for the three miRNA data sets

Parameter/data set	GSE17681	GSE17846	GSE29352
Weight parameter β	1.0	0.5	1.0
Multiplicative parameter η	1.7	1.0	1.7

Abbreviation: miRNA, microRNA.

Importance of B.632+ error rate

This section establishes the importance of using the B.632+ error rate over other types of errors, such as AE, γ, and B1. The different types of errors on each miRNA expression data set are calculated using the SVM for the proposed method. Figure 3 shows the various types of errors obtained by the proposed algorithm on the three miRNA expression data sets. From Figure 3, it is seen that the different types of errors decrease as the number of selected miRNAs increases. For all three data sets, the AE consistently attains the lowest value, while γ has the highest value. The B1 error lies between the two: it is lower than γ but higher than the AE. Moreover, the B.632+ estimate is lower than the B1 but higher than the AE.

Figure 3

Error rate of the SVM obtained using the RSMRMS algorithm averaged over 50 random splits.

Abbreviations: SVM, support vector machine; RSMRMS, rough set–based maximum relevance maximum significance; miRNA, microRNA; AE, apparent error; B1, bootstrap error; γ, no-information error; B.632+, B.632+ error.

Table 2 reports the minimum values of different errors along with the number of required miRNAs to attain these values. From all the results reported in this table, it can be seen that the B.632+ estimator corrects the upward bias of B1 and downward bias of AE. Also, it puts more weight on B1 in the situation where the amount of overfitting, as measured by (B1 – AE), is relatively large. It thus is applicable in the present context where the prediction rule generated by the SVM is overfitted.
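The way the weight shifts toward B1 can be checked from the two limiting cases. The following short derivation assumes the Efron and Tibshirani form of the weight, with the relative overfitting rate written as r (a symbol introduced here for illustration):

$$\omega = \frac{0.632}{1 - 0.368\,r}, \qquad r = \frac{\text{B1} - \text{AE}}{\gamma - \text{AE}}.$$

$$r = 0 \;\Rightarrow\; \omega = 0.632 \;\Rightarrow\; \text{B.632+} = 0.368\,\text{AE} + 0.632\,\text{B1},$$

$$r = 1 \;\Rightarrow\; \omega = \frac{0.632}{0.632} = 1 \;\Rightarrow\; \text{B.632+} = \text{B1}.$$

Under this assumption, the estimate equals the ordinary .632 combination when there is no overfitting (B1 = AE) and coincides with B1 when the bootstrap error reaches the no-information rate γ.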

Table 2

Comparative analysis of different errors

Error type	Quantity	GSE17681	GSE17846	GSE29352
AE	Minimum error	0.000	0.000	0.000
AE	Number of miRNAs	8	2	17
B1 error	Minimum error	0.142	0.093	0.429
B1 error	Number of miRNAs	24	39	20
γ error	Minimum error	0.423	0.441	0.455
γ error	Number of miRNAs	4	1	23
B.632+ error	Minimum error	0.103	0.064	0.413
B.632+ error	Number of miRNAs	24	39	20

Abbreviations: AE error, apparent error; B1 error, bootstrap error; γ error, no-information error; B.632+, B.632+ error.

Role of fuzzy discretization method

In the current study, the fuzzy set–based discretization method was used to generate equivalence classes or information granules, for computing the relevance and significance of miRNAs using the theory of rough sets. To establish the effectiveness of the fuzzy set–based discretization method over other discretization methods, extensive experimentation was done on three miRNA data sets. The methods compared were the mean and standard deviation–based method,22 the supervised discretization method,24 and the unsupervised discretization method.24 Figure 4 reports the variation of several errors with respect to the number of selected miRNAs, while Table 3 presents the minimum error values obtained using the different discretization methods. From the results reported in Figure 4 and Table 3, it can be seen that the fuzzy set–based discretization method performed better than the other discretization methods, irrespective of the types of errors and miRNA data sets used.

Figure 4

Error rates of the SVM obtained using different discretization methods averaged over 50 random splits.

Abbreviations: SVM, support vector machine; miRNA, microRNA; AE, apparent error; B1, bootstrap error; γ, no-information error; B.632+, B.632+ error.

Table 3

Comparative performance analysis of different discretization methods

Microarray data set	Discretization method	AE error	AE miRNAs	B1 error	B1 miRNAs	γ error	γ miRNAs	B.632+ error	B.632+ miRNAs
GSE17681	Mean-stddev	0.000	9	0.153	21	0.424	11	0.110	21
GSE17681	Supervised	0.139	20	0.351	17	0.423	3	0.319	17
GSE17681	Unsupervised	0.000	26	0.190	40	0.420	8	0.143	40
GSE17681	Fuzzy set based	0.000	8	0.142	24	0.423	4	0.103	24
GSE17846	Mean-stddev	0.000	2	0.098	41	0.462	17	0.067	41
GSE17846	Supervised	0.338	1	0.394	1	0.445	1	0.429	1
GSE17846	Unsupervised	0.000	15	0.129	43	0.450	4	0.091	43
GSE17846	Fuzzy set based	0.000	2	0.093	39	0.441	1	0.064	39
GSE29352	Mean-stddev	0.023	25	0.454	25	0.455	25	0.454	25
GSE29352	Supervised	0.000	23	0.454	22	0.454	22	0.454	22
GSE29352	Unsupervised	0.000	19	0.450	27	0.450	27	0.450	27
GSE29352	Fuzzy set based	0.000	17	0.429	20	0.455	23	0.413	20

Abbreviations: AE error, apparent error; B1 error, bootstrap error; γ error, no-information error; B.632+, B.632+ error; miRNA, microRNA.

Comparative performance analysis

This section compares the performance of the mRMR and RSMRMS algorithms with respect to the various types of errors. Figure 5 presents the different error rates obtained by the mRMR and RSMRMS algorithms on the three miRNA expression data sets. From the figure, it is seen that in most cases, the different types of error rates were consistently lower for the RSMRMS algorithm compared with the mRMR method.

Figure 5

Error rates of the SVM obtained using the mRMR and RSMRMS algorithms averaged over 50 random splits.

Abbreviations: SVM, support vector machine; mRMR, mutual information–based minimum redundancy-maximum relevance; RSMRMS, rough set–based maximum relevance maximum significance; miRNA, microRNA; AE, apparent error; B1, bootstrap error; γ, no-information error; B.632+, B.632+ error.

Finally, Table 4 compares the performance of the proposed rough set–based method with the best performance of the mRMR method. The results are presented based on the error rate of the SVM classifier obtained on the three miRNA microarray data sets. From the results reported in Table 4, it is seen that although the best AE for each miRNA data set was the same for both algorithms, the RSMRMS achieved this value with fewer selected miRNAs than the mRMR method. Also, the RSMRMS attained the lowest B.632+ bootstrap error rate, as well as the lowest B1 error rate, of the SVM classifier for all three miRNA data sets, again with fewer selected miRNAs.

Table 4

Comparative performance analysis of mRMR and RSMRMS algorithms

Microarray data set	Method/algorithm	AE error	AE miRNAs	B1 error	B1 miRNAs	γ error	γ miRNAs	B.632+ error	B.632+ miRNAs
GSE17681	mRMR	0.000	10	0.175	28	0.414	13	0.130	28
GSE17681	RSMRMS	0.000	8	0.142	24	0.423	4	0.103	24
GSE17846	mRMR	0.000	3	0.101	48	0.441	1	0.069	49
GSE17846	RSMRMS	0.000	2	0.093	39	0.458	5	0.064	39
GSE29352	mRMR	0.000	21	0.430	43	0.447	32	0.420	43
GSE29352	RSMRMS	0.000	17	0.429	20	0.455	23	0.413	20

Abbreviations: mRMR, mutual information–based minimum redundancy-maximum relevance; RSMRMS, rough set–based maximum relevance maximum significance; AE, apparent error; B1, bootstrap error; γ, no-information error; B.632+, B.632+ error; miRNA, microRNA.
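To make the size of the differences in Table 4 concrete, the short sketch below computes, for each data set, how much lower the best B.632+ error of RSMRMS is than that of mRMR and how many fewer miRNAs it uses. The numbers are copied from the B.632+ columns of Table 4; the variable and function names are hypothetical.

```python
# Best B.632+ error and the number of miRNAs at which it was reached (Table 4).
table4 = {
    "GSE17681": {"mRMR": (0.130, 28), "RSMRMS": (0.103, 24)},
    "GSE17846": {"mRMR": (0.069, 49), "RSMRMS": (0.064, 39)},
    "GSE29352": {"mRMR": (0.420, 43), "RSMRMS": (0.413, 20)},
}


def compare(table):
    """Return {data set: (error gap, miRNA gap)} for mRMR minus RSMRMS."""
    rows = {}
    for name, methods in table.items():
        err_m, n_m = methods["mRMR"]
        err_r, n_r = methods["RSMRMS"]
        rows[name] = (round(err_m - err_r, 3), n_m - n_r)
    return rows


differences = compare(table4)
for name, (err_gap, mirna_gap) in differences.items():
    print(f"{name}: B.632+ lower by {err_gap:.3f} with {mirna_gap} fewer miRNAs")
```

Read this way, the error gap is largest on GSE17681, while on GSE29352 the errors are nearly equal and the main difference is the smaller number of selected miRNAs.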

The better performance of the RSMRMS algorithm can be attributed to its use of rough sets for computing both miRNA-class relevance and miRNA-miRNA significance when selecting differentially expressed miRNAs. The lower and upper approximations of rough sets can effectively deal with incompleteness, vagueness, and uncertainty in the data set.
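The lower and upper approximations mentioned above can be illustrated on a toy example. The sketch below uses the standard rough-set definitions: the lower approximation of a class collects the equivalence classes lying entirely inside it, the upper approximation collects those that overlap it, and the dependency of the class labels on a set of attributes is the fraction of samples covered by the lower approximations. It assumes expression values have already been discretized; the miRNA names, the data, and the function names are hypothetical, and this is not the authors' implementation.

```python
from collections import defaultdict


def equivalence_classes(samples, attributes):
    """Group sample indices that are indiscernible on the given attributes."""
    blocks = defaultdict(set)
    for idx, sample in enumerate(samples):
        blocks[tuple(sample[a] for a in attributes)].add(idx)
    return list(blocks.values())


def approximations(blocks, target):
    """Return (lower, upper) approximations of the index set `target`."""
    lower, upper = set(), set()
    for block in blocks:
        if block <= target:      # block lies entirely inside the target
            lower |= block
        if block & target:       # block overlaps the target
            upper |= block
    return lower, upper


def dependency(samples, labels, attributes):
    """Fraction of samples in the positive region of the class partition."""
    blocks = equivalence_classes(samples, attributes)
    positive = set()
    for cls in set(labels):
        target = {i for i, y in enumerate(labels) if y == cls}
        positive |= approximations(blocks, target)[0]
    return len(positive) / len(samples)


# Hypothetical discretized expression values for two miRNAs.
samples = [
    {"mir_a": "low", "mir_b": "low"},
    {"mir_a": "low", "mir_b": "high"},
    {"mir_a": "high", "mir_b": "low"},
    {"mir_a": "high", "mir_b": "high"},
    {"mir_a": "mid", "mir_b": "low"},
    {"mir_a": "mid", "mir_b": "high"},
]
labels = [0, 0, 1, 1, 0, 1]

class_one = {i for i, y in enumerate(labels) if y == 1}
lower, upper = approximations(equivalence_classes(samples, ["mir_a"]), class_one)
dep_a = dependency(samples, labels, ["mir_a"])
dep_b = dependency(samples, labels, ["mir_b"])
dep_ab = dependency(samples, labels, ["mir_a", "mir_b"])
print(sorted(lower), sorted(upper), dep_a, dep_b, dep_ab)
```

In this toy data, the samples with a "mid" value of mir_a fall in the boundary between the lower and upper approximations, which is how rough sets represent uncertainty. One common way to express the significance of mir_b with respect to mir_a is the gain in dependency when it is added, here dep_ab - dep_a.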

Conclusion

This paper presents a novel approach for in silico identification of differentially expressed miRNAs. It judiciously integrates the merits of rough sets, SVM, and the B.632+ error rate to select relevant and significant miRNAs that can classify samples into different classes with a minimum error rate. The results obtained on three miRNA data sets demonstrate that the proposed method can bring a remarkable improvement to the miRNA selection problem and can therefore be a promising alternative to existing models for predicting the class labels of samples. All the results reported in this paper demonstrate the feasibility and effectiveness of the proposed method. The new method is capable of identifying effective miRNAs that may help reveal the underlying etiology of a disease, providing a useful tool for exploratory analysis of miRNA data.
