A new pipeline for structural characterization and classification of RNA-Seq microbiome data.

Sebastian Racedo¹, Ivan Portnoy²,³, Jorge I Vélez¹, Homero San-Juan-Vergara¹, Marco Sanjuan¹, Eduardo Zurek¹.

Abstract

BACKGROUND: High-throughput sequencing enables the analysis of the composition of numerous biological systems, such as microbial communities. The identification of dependencies within these systems requires the analysis and assimilation of the underlying interaction patterns between all the variables that make up that system. However, this task poses a challenge when considering the compositional nature of the data coming from DNA-sequencing experiments because traditional interaction metrics (e.g., correlation) produce unreliable results when analyzing relative fractions instead of absolute abundances. The compositionality-associated challenges extend to the classification task, as it usually involves the characterization of the interactions between the principal descriptive variables of the datasets. The classification of new samples/patients into binary categories corresponding to dissimilar biological settings or phenotypes (e.g., control and cases) could help researchers in the development of treatments/drugs.
RESULTS: Here, we develop and exemplify a new approach, applicable to compositional data, for the classification of new samples into two groups with different biological settings. We propose a new metric to characterize and quantify the overall correlation structure deviation between these groups and a technique for dimensionality reduction to facilitate graphical representation. We conduct simulation experiments with synthetic data to assess the proposed method's classification accuracy. Moreover, we illustrate the performance of the proposed approach using Operational Taxonomic Unit (OTU) count tables obtained through 16S rRNA gene sequencing data from two microbiota experiments. We also compare our method's performance with that of two state-of-the-art methods.
CONCLUSIONS: Simulation experiments show that our method achieves a classification accuracy equal to or greater than 98% when using synthetic data. Finally, our method outperforms the other classification methods with real datasets from gene sequencing experiments.

Keywords: 16S rRNA sequencing; Classification method; Compositional nature; Microbial communities

Year: 2021 PMID: 34243809 PMCID: PMC8268467 DOI: 10.1186/s13040-021-00266-7

Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522

Background

Microorganisms living inside and on humans are known as the microbiota; together with their genetic information, they are known as the microbiome. The Human Microbiome Project (HMP) was an endeavor to characterize the human microbiota and to further the understanding of its impact on human health and disease [1]. In recent years, biological sciences have experienced substantial technological advances that have led to the rediscovery of systems biology [2-4]. These advances were possible thanks to the technological ability to completely sequence the genome of any organism at a low cost [5, 6]. Such advances triggered the development of various analytic approaches and technologies to simultaneously monitor all the components within cells (e.g., genes and proteins). With the genome information and analytic technologies, the mining and exploration of the resulting data opened up the possibility to better understand biological systems, such as microbial populations, and their complexity. The network structure of such biological systems can give insight into the underlying interactions taking place within those systems [7-10]. Furthermore, the understanding of these interactions can lead to the discovery of new methods that can help physicians, biologists, scientists, and healthcare workers with disease diagnosis, gene identification, classification of new data, and many other tasks [11].

We initially conducted a literature search in different medical, biological, and engineering databases, as well as academic sites and prestigious journals such as BMC Bioinformatics, PLOS ONE, ScienceDirect, and IEEE Xplore, using the queries “correlation structure for gene expression classifications,” “classifiers for compositional data,” and “classifiers based on correlation structures” in order to identify papers in English that use procedures for sample classification based on correlation structures in the 2009–2019 time window. Figure 1 shows the evolution of the number of publications retrieved when the keywords “correlation structure for gene expression classifications” are used. Publications were retrieved from several academic sites, namely BMC Bioinformatics, PLOS ONE, ScienceDirect, and Scopus. Figure 2 summarizes the current principal stages of gene expression analysis for sample classification.

Fig. 1

Evolution of the number of publications per year from 2009 to 2019

Fig. 2

Scheme of gene analysis used for sample classification

Operational Taxonomic Unit (OTU) count tables are the usual output when processing the 16S rRNA sequences of microbiota samples [12]. These tables show the relative abundances of the bacteria that make up a microbiota population (e.g., the human gut microbiota). OTU-based data have a compositional nature, which makes them difficult to work with [13, 14]. Thus, data transformation is required prior to any further analysis. Aitchison [15] proposed two transformations to compensate for the data’s compositionality, thus allowing the use of standard metrics in further analysis. The first transformation is the additive log-ratio (alr), which is defined as:

$$\mathrm{alr}(\mathbf{x}) = \left[\ln\frac{x_1}{x_D},\ \ln\frac{x_2}{x_D},\ \ldots,\ \ln\frac{x_{D-1}}{x_D}\right]$$

where $x_D$ is an element of $\{x_1, x_2, x_3, \ldots, x_D\}$. Because one value $x_D$ is selected as the denominator to build the log-ratios, the alr has been criticized as being subjective, since the outcome depends mostly on the value of $x_D$ selected [15-18]. The second transformation proposed by Aitchison is the centered log-ratio (clr), which is defined as:

$$\mathrm{clr}(\mathbf{x}) = \left[\ln\frac{x_1}{g(\mathbf{x})},\ \ln\frac{x_2}{g(\mathbf{x})},\ \ldots,\ \ln\frac{x_D}{g(\mathbf{x})}\right]$$

where $g(\mathbf{x}) = \left(\prod_{i=1}^{D} x_i\right)^{1/D}$ is the geometric mean. The use of $g(\mathbf{x})$ avoids the subjectivity of the alr transformation, since the method takes all the information of $\mathbf{x}$ into account [15-19]. The clr transformation has proven to be reliable and has been extensively used in the scientific literature over the years to analyze microbiome data. In [20], the authors proposed a transformation called the isometric log-ratio (ilr) transformation. This approach takes any compositional data $\mathbf{x} \in S^D$ and computes $\mathrm{ilr}(\mathbf{x}) = \mathbf{z} = [z_1, z_2, \ldots, z_{D-1}]$, where $z_i$ is calculated as:

$$z_i = \sqrt{\frac{i}{i+1}}\ \ln\frac{\left(\prod_{j=1}^{i} x_j\right)^{1/i}}{x_{i+1}}, \qquad i = 1, \ldots, D-1$$

However, implementing the ilr transformation poses serious practical difficulties for high-dimension data, as the computational complexity increases rapidly with dimensionality [21].
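The three log-ratio transformations can be illustrated with a minimal NumPy sketch. The function names and the three-part example composition are hypothetical and chosen only for illustration; the ilr uses the sequential basis written above, which is one of several valid orthonormal bases.

```python
import numpy as np

def alr(x):
    """Additive log-ratio: log of every part over the last part (the reference)."""
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[-1])

def clr(x):
    """Centered log-ratio: log of every part over the geometric mean of all parts."""
    log_x = np.log(np.asarray(x, dtype=float))
    return log_x - log_x.mean()

def ilr(x):
    """Isometric log-ratio with the sequential basis (one of several valid bases)."""
    log_x = np.log(np.asarray(x, dtype=float))
    d = log_x.size
    z = np.empty(d - 1)
    for i in range(1, d):
        # log geometric mean of the first i parts minus the log of part i + 1
        z[i - 1] = np.sqrt(i / (i + 1)) * (log_x[:i].mean() - log_x[i])
    return z

composition = np.array([0.2, 0.3, 0.5])  # illustrative strictly positive composition
print("alr:", alr(composition))
print("clr:", clr(composition))
print("ilr:", ilr(composition))
```

Note that the clr coordinates always sum to zero, and that the ilr vector has the same Euclidean norm as the clr vector, which is the isometry property that gives the transformation its name.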

Feature selection

After transforming the data, the next step is to separate the data into train, test, and validation sets, although in some cases only the train and test sets are considered. One of the most common problems prior to that step is the limited number of data samples. Indeed, for a normal classifier to be employed using multivariate metrical techniques, the sample size required for optimum training is on the order of thousands. This is known as the “curse of dimensionality” problem, and the usual way to overcome this limitation is to use a dimensionality reduction technique that collapses all the attributes (variables) into a lower-dimension space where the most dominant information of the dataset can be retrieved [13, 22]. Feature selection methods are usually separated into three categories: filter, wrapper, and embedded. Table 1 summarizes different approaches for feature selection in gene expression data, the most relevant categories for feature selection, and the current weaknesses when analyzing gene expression data. Filter methods can work with univariate and multivariate data, where univariate methods focus on each feature separately and multivariate methods focus on finding relationships between features [23, 24]. Here we only consider multivariate methods.

Table 1

Summary of feature selection approaches in gene expression analysis

Category	Description	Weaknesses	References
Filter	- Extract features from the data without any type of learning involved.	- Ignore interaction with the classifier.	[13, 23, 25–30]
Wrapper	- Use learning approaches to evaluate which features are useful.	- Risk of overfitting. - Classifier dependent selection.	[23, 26, 29, 30]
Embedded	- Combine the traditional feature selection step with the classifier construction.	- Classifier dependent selection.	[23, 26, 29–31]

Classification

The final step after finding the most relevant features of the transformed data is to select a classifier. In clinical and bioinformatic research, prediction models are extensively used to derive classification rules useful to accurately predict whether a patient has or would develop a disease, whether the treatment is going to work, or even whether a disease would recur [33-35]. Table 2 summarizes the relevant aspects of some widely used classifiers.

Table 2

Summary of classifiers used in gene expression analysis

Category	Classifier	References
Metrical and classical	- Probabilistic: Bayesian classifier, probabilistic linear discriminant analysis. - Non probabilistic: Support Vector Machine (SVM), SVM-RFE, Nearest-neighbor (NN), linear discriminant analysis.	[13, 37–41]
Artificial Intelligence	- Fuzzy Logic, Genetic Algorithms, Classification and Regression trees.	[13, 38, 39, 42, 43]
Boosting	- LogitBoost, AdaBoost.M1, GradientBoosting (GrBoost)	[13, 14, 38, 39, 44]

Proposed classification method

Here, we explain the proposed classification method in detail. First, section “Data pretreatment” introduces the data pretreatment stage, and section “Assessing correlation structure distortion” describes a novel metric to assess correlation structure distortion. Finally, sections “Dimensionality reduction technique” and “Proposed classification rule” present a dimensionality-reduction approach to assess the disruption of a dataset’s correlation structure after a new sample is included, together with the classification rule built on that approach and on the previously defined metric.

Data pretreatment

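Let $\mathbf{X}_c^{\rho} \in \mathbb{R}^{n_c \times m}$ and $\mathbf{X}_v^{\rho} \in \mathbb{R}^{n_v \times m}$ be the OTU count tables where m features are assessed in $n_c$ and $n_v$ samples from control and case individuals, respectively. In the expressions above, the superindex ρ indicates the datasets are ‘raw’ or without pretreatment. From now on, $\mathbf{X}_g^{\rho}$ will represent any of the two groups (g = c for control, or g = v for case); the subscript g is dropped whenever the group is clear from the context. When analyzing OTU count tables, a log-ratio transformation, such as the clr, is to be applied [15, 18, 19] before estimating correlations. However, in order to apply the log-ratio transformation, it is necessary to consider that compositional count datasets may contain null values resulting from insufficiently large or non-existing samples. As log-ratio transformations require data with exclusively positive values, the use of a zero-replacement method is a must. Here we use the Bayesian-multiplicative (BM) algorithm proposed by Martín-Fernández [49]. Let $\mathbf{x}_i^{\rho} \in \mathbb{R}^{1 \times m}$ be the i-th row of the matrix $\mathbf{X}^{\rho}$ (i = 1, 2, …, n). The BM algorithm replaces the null counts of $\mathbf{x}_i^{\rho}$ by small positive estimates governed by the prior parameters t and s, and multiplicatively adjusts the non-null values of the same row [49]. When using the Bayes-Laplace prior, we set $t = m^{-1}$ and $s = m$. Let $\mathbf{X}^{BM}$ be the resulting matrix after the BM algorithm is applied row-wise to $\mathbf{X}^{\rho}$. To ensure the data’s compositionality on $\mathbf{X}^{BM}$, a closure operation [15, 18, 19] is applied to every row $\mathbf{x}_i^{BM}$ of $\mathbf{X}^{BM}$, as follows:

$$\mathcal{C}\left(\mathbf{x}_i^{BM}\right) = \left[\frac{k\,x_{i1}^{BM}}{\sum_{j=1}^{m} x_{ij}^{BM}},\ \frac{k\,x_{i2}^{BM}}{\sum_{j=1}^{m} x_{ij}^{BM}},\ \ldots,\ \frac{k\,x_{im}^{BM}}{\sum_{j=1}^{m} x_{ij}^{BM}}\right]$$

where k is an arbitrary constant (usually k = 100). Let $\mathbf{X}^{C}$ be the resulting matrix after the BM algorithm and the closure operation have been applied. Now, the clr transformation is applied to each vector $\mathbf{x}_i^{C} \in \mathbb{R}^{1 \times m}$, as

$$\mathrm{clr}\left(\mathbf{x}_i^{C}\right) = \left[\ln\frac{x_{i1}^{C}}{g(\mathbf{x}_i^{C})},\ \ln\frac{x_{i2}^{C}}{g(\mathbf{x}_i^{C})},\ \ldots,\ \ln\frac{x_{im}^{C}}{g(\mathbf{x}_i^{C})}\right]$$

where $g(\mathbf{x}_i^{C}) = \left(\prod_{j=1}^{m} x_{ij}^{C}\right)^{1/m}$ is the geometric mean. Hence,

$$\mathbf{X}^{clr} = \begin{bmatrix} \mathrm{clr}(\mathbf{x}_1^{C}) \\ \vdots \\ \mathrm{clr}(\mathbf{x}_n^{C}) \end{bmatrix}$$

Finally, a normalization is applied, resulting in:

$$\mathbf{X} = \left(\mathbf{X}^{clr} - \mathbf{1}_n \mathbf{b}^{T}\right)\boldsymbol{\Sigma}^{-1} \tag{8}$$

where $\mathbf{1}_n$ is a column vector of ones, $\mathbf{b}$ is a column vector that contains the means of all the variables in $\mathbf{X}^{clr}$, and $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times m}$ is a diagonal matrix that contains the standard deviation ($\sigma_i$, for i = 1, …, m) of all variables.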
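The pretreatment chain (zero replacement, closure, clr, normalization) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the zero replacement follows the standard Bayesian-multiplicative formulation with the Bayes-Laplace prior (t = 1/m, s = m) applied to row proportions, the standard deviations use the n − 1 denominator, and all function names and the small count table are hypothetical.

```python
import numpy as np

def bm_zero_replacement(counts):
    """Bayesian-multiplicative zero replacement with the Bayes-Laplace prior.

    Assumption: standard formulation on row proportions, with t = 1/m and s = m.
    """
    counts = np.asarray(counts, dtype=float)
    m = counts.shape[1]
    totals = counts.sum(axis=1, keepdims=True)
    t, s = 1.0 / m, float(m)
    zero_estimate = t * s / (totals + s)             # value assigned to each zero
    is_zero = counts == 0
    n_zeros = is_zero.sum(axis=1, keepdims=True)
    # non-zero proportions are shrunk so that every row still sums to one
    adjusted = (counts / totals) * (1.0 - n_zeros * zero_estimate)
    return np.where(is_zero, zero_estimate, adjusted)

def closure(X, k=100.0):
    """Rescale every row so that it sums to the constant k."""
    return k * X / X.sum(axis=1, keepdims=True)

def clr_rows(X):
    """Row-wise centered log-ratio transformation."""
    log_X = np.log(X)
    return log_X - log_X.mean(axis=1, keepdims=True)

def pretreat(counts, k=100.0):
    """Return the normalized matrix X, the mean vector b and the std vector sigma."""
    X_clr = clr_rows(closure(bm_zero_replacement(counts), k))
    b = X_clr.mean(axis=0)
    sigma = X_clr.std(axis=0, ddof=1)                # sample std (n - 1), an assumption
    return (X_clr - b) / sigma, b, sigma

# Hypothetical 5 x 4 OTU count table containing zeros
raw_counts = np.array([[10, 0, 5, 3],
                       [2, 7, 0, 1],
                       [4, 4, 4, 0],
                       [0, 1, 9, 6],
                       [3, 2, 8, 5]])
X, b, sigma = pretreat(raw_counts)
print(np.round(X, 3))
```

The vectors b and sigma are returned because the later steps reuse them to pretreat a new sample and to update the correlation structure.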

Assessing correlation structure distortion

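Here, we introduce φ, a new metric to quantitatively assess the distortion in the correlation structure of a dataset after the incorporation of a new sample. The Pearson correlation matrix for $\mathbf{X}$ is calculated as follows [50]:

$$\mathbf{S} = \frac{1}{n-1}\,\mathbf{X}^{T}\mathbf{X}$$

Now, consider a new sample, $\mathbf{x}_{new}^{\rho} \in \mathbb{R}^{1 \times m}$. The pretreatment step for this sample yields:

$$\mathbf{x}_{new} = \left(\mathbf{x}_{new}^{clr} - \mathbf{b}^{T}\right)\boldsymbol{\Sigma}^{-1}$$

where $\mathbf{x}_{new}^{clr}$ is the new sample after zero replacement, closure, and the clr transformation. Let $\tilde{\mathbf{X}}$ be the (augmented) dataset $\mathbf{X}$ after incorporating the new sample, and let $\mathbf{S}$ and $\tilde{\mathbf{S}}$ be the correlation matrices for $\mathbf{X}$ and $\tilde{\mathbf{X}}$, respectively. The spectral decomposition for these matrices is

$$\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^{T}, \qquad \tilde{\mathbf{S}} = \tilde{\mathbf{V}}\tilde{\boldsymbol{\Lambda}}\tilde{\mathbf{V}}^{T}$$

where $\boldsymbol{\Lambda}$ and $\tilde{\boldsymbol{\Lambda}}$ are diagonal matrices containing the eigenvalues for $\mathbf{S}$ and $\tilde{\mathbf{S}}$. Let $\mathbf{V}$ and $\tilde{\mathbf{V}}$ be the eigenvector matrices of $\mathbf{S}$ and $\tilde{\mathbf{S}}$. Figure 3a illustrates, in a 2-dimensional example, the datasets $\mathbf{X}$ and $\tilde{\mathbf{X}}$. Figure 3b illustrates the datasets after carrying out the pretreatment, along with their eigenvectors (which are unitary) scaled by their corresponding eigenvalues obtained from the spectral decompositions. Note that the scaled eigenvectors mark out the directions of largest variability, capturing high-order interactions between the OTUs that rule the overall association structure. Therefore, looking at deviations in both the magnitude and direction of those scaled eigenvectors should give insightful information on overall changes in the association structure of a microbiota population.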
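As a small illustration of the two building blocks above, the following NumPy sketch computes the correlation matrix of a normalized dataset and its spectral decomposition. The random data and function names are hypothetical; `numpy.linalg.eigh` is used because the correlation matrix is symmetric, and the eigenpairs are reordered so that the largest eigenvalue comes first.

```python
import numpy as np

def correlation_matrix(X):
    """Pearson correlation of a column-standardized matrix (zero mean, unit sample std)."""
    n = X.shape[0]
    return X.T @ X / (n - 1)

def spectral_decomposition(S):
    """Eigenvalues in descending order and the matching unit eigenvectors (columns)."""
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
raw = rng.normal(size=(50, 4))                 # stand-in for clr-transformed data
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

S = correlation_matrix(X)
lam, V = spectral_decomposition(S)
print("eigenvalues:", np.round(lam, 3))
```

Because S is a correlation matrix, its eigenvalues sum to the number of variables, which is a convenient sanity check.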

Fig. 3

Bidimensional representation of datasets $\tilde{\mathbf{X}}$ and $\mathbf{X}$ (a) without pretreatment, and (b) after the pretreatment, along with the eigenvectors scaled by the corresponding eigenvalues

Based on the abovementioned remarks, we introduce φ to characterize the distortion produced in the underlying correlation structure when two OTU count datasets are compared. This metric first requires a dimensional reduction, which will be performed by selecting the principal components for each sample group. This procedure, integrated within the Principal Component Analysis (PCA) algorithm [25], consists of finding the minimum number of eigenvalues $a$ or $\tilde{a}$ (for $\mathbf{X}$ and $\tilde{\mathbf{X}}$, respectively) that explain 100(1 − α)% of the total variance, i.e.:

$$\frac{\sum_{j=1}^{a} \lambda_j}{\sum_{j=1}^{m} \lambda_j} \ge 1 - \alpha, \qquad \frac{\sum_{j=1}^{\tilde{a}} \tilde{\lambda}_j}{\sum_{j=1}^{m} \tilde{\lambda}_j} \ge 1 - \alpha$$

Thus, φ (Eq. (14)) is defined from three quantities computed for each retained principal component j: the algebraic difference (magnitude deviation) of the j-th eigenvalues in $\boldsymbol{\Lambda}$ and $\tilde{\boldsymbol{\Lambda}}$; the angular deviation between the j-th eigenvectors in $\mathbf{V}$ and $\tilde{\mathbf{V}}$; and a weighting factor, so that the contribution of the j-th deviation to the index φ is proportional to the relative importance among principal components.
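The component-selection rule and the three ingredients of φ can be sketched as below. This is only an illustration under explicit assumptions: the way the ingredients are combined into φ is not reproduced here, and the choices of using max(a, ã) components, taking the sign of the eigenvalue difference as λ minus λ̃, and weighting by eigenvalue share are hypothetical. The example eigenpairs are made up.

```python
import numpy as np

def n_components(eigvals, alpha=0.05):
    """Smallest number of leading eigenvalues explaining at least 100(1 - alpha)% of variance."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    explained = np.cumsum(eigvals) / eigvals.sum()
    a = int(np.searchsorted(explained, 1.0 - alpha)) + 1
    return min(a, eigvals.size)

def distortion_ingredients(lam, V, lam_t, V_t, alpha=0.05):
    """Per-component eigenvalue differences, angular deviations and weights.

    Assumptions (not taken from the source): max(a, a_tilde) components are kept,
    the difference is lam - lam_t, and the weights are eigenvalue shares.
    """
    a = max(n_components(lam, alpha), n_components(lam_t, alpha))
    delta_lambda = lam[:a] - lam_t[:a]
    # eigenvectors are defined up to sign, so the absolute cosine is used
    cosines = np.abs(np.sum(V[:, :a] * V_t[:, :a], axis=0))
    angles = np.arccos(np.clip(cosines, 0.0, 1.0))
    weights = lam[:a] / lam[:a].sum()
    return delta_lambda, angles, weights

# Hypothetical eigenpairs: the second basis is the first one rotated by 10 degrees
lam = np.array([3.0, 0.9, 0.1])
V = np.eye(3)
c, s = np.cos(np.radians(10.0)), np.sin(np.radians(10.0))
lam_t = np.array([2.9, 1.0, 0.1])
V_t = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

delta_lambda, angles, weights = distortion_ingredients(lam, V, lam_t, V_t)
print(delta_lambda, np.degrees(angles), weights)
```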

Dimensionality reduction technique

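Now that we have a metric to measure the distortion caused in the correlation structure of the g group after the incorporation of a new sample, we can infer to which group the new sample belongs, providing a classification criterion based on how distorted the correlation structure is when incorporating the new sample $\mathbf{x}_{new}$. The intuitive way of evaluating the distortion would be to integrate $\mathbf{x}_{new}$ into $\mathbf{X}$ and (re)calculate the correlation matrix. However, considering that the g group may contain many samples, a single new sample may not be enough to generate a significant distortion in the correlation structure. Furthermore, if the number of samples in the groups is unbalanced, the distortion caused by the inclusion of a new sample may not be comparable. An approach to overcome this dimensional problem is to randomly subsample a small number of rows in $\mathbf{X}$, combine them with $\mathbf{x}_{new}$, and then calculate the distortion caused. This approach, however, would leave out a considerable amount of information, which is contained in the rows that were not selected. To address this issue, we propose a new dimensionality reduction approach that allows a weighted assessment of the distortion in $\mathbf{S}$ caused by the integration of a new sample $\mathbf{x}_{new}$. This approach will use all the information contained in the original data, with the objective of providing a classification algorithm for any upcoming sample.

The first step of the proposed approach is to find an expression for the distorted correlation matrix $\tilde{\mathbf{S}}$ that reveals the natural weights of the contributions of $\mathbf{X}$ and $\mathbf{x}_{new}$ to the make-up of the new correlation structure. Suppose that the data is concatenated as:

$$\tilde{\mathbf{X}}^{clr} = \begin{bmatrix} \mathbf{X}^{clr} \\ \mathbf{x}_{new}^{clr} \end{bmatrix} \tag{15}$$

where n + 1 is the number of rows of $\tilde{\mathbf{X}}^{clr}$. Combining Eqs. (15) and (8) yields

$$\tilde{\mathbf{X}}^{clr} = \begin{bmatrix} \mathbf{X}\boldsymbol{\Sigma} + \mathbf{1}_n\mathbf{b}^{T} \\ \mathbf{x}_{new}\boldsymbol{\Sigma} + \mathbf{b}^{T} \end{bmatrix} = \mathbf{W}\boldsymbol{\Sigma} + \mathbf{1}_{n+1}\mathbf{b}^{T}, \qquad \mathbf{W} = \begin{bmatrix} \mathbf{X} \\ \mathbf{x}_{new} \end{bmatrix} \tag{16}$$

Normalizing $\tilde{\mathbf{X}}^{clr}$ produces

$$\tilde{\mathbf{X}} = \left(\tilde{\mathbf{X}}^{clr} - \mathbf{1}_{n+1}\tilde{\mathbf{b}}^{T}\right)\tilde{\boldsymbol{\Sigma}}^{-1} = \left(\mathbf{W}\boldsymbol{\Sigma} - \mathbf{1}_{n+1}\Delta\mathbf{b}^{T}\right)\tilde{\boldsymbol{\Sigma}}^{-1} \tag{17}$$

where $\tilde{\mathbf{b}}$ is the vector that contains the means of $\tilde{\mathbf{X}}^{clr}$, $\tilde{\boldsymbol{\Sigma}}$ is a diagonal matrix that contains the distorted standard deviations, $\Delta\mathbf{b} = \tilde{\mathbf{b}} - \mathbf{b}$ is the distortion in the mean vector, and $\mathbf{1}_{n+1}$ is a column vector of n + 1 ones. Both $\tilde{\mathbf{b}}$ and $\tilde{\boldsymbol{\Sigma}}$ are unknown. Thus, we need to derive expressions for them. The distorted means vector is calculated as $\tilde{\mathbf{b}} = \frac{1}{n+1}\left(\tilde{\mathbf{X}}^{clr}\right)^{T}\mathbf{1}_{n+1}$, which can be converted into:

$$\tilde{\mathbf{b}} = \frac{n}{n+1}\,\mathbf{b} + \frac{1}{n+1}\left(\mathbf{x}_{new}^{clr}\right)^{T} \tag{18}$$

Equation (18) shows that the natural weights are $\frac{n}{n+1}$ and $\frac{1}{n+1}$ for $\mathbf{b}$ and $\mathbf{x}_{new}^{clr}$, respectively. To find an expression for the diagonal matrix of distorted standard deviations, $\tilde{\boldsymbol{\Sigma}}$, a column-wise subtraction of the mean vector for $\tilde{\mathbf{X}}^{clr}$ is performed:

$$\tilde{\mathbf{X}}^{clr} - \mathbf{1}_{n+1}\tilde{\mathbf{b}}^{T} = \begin{bmatrix} \mathbf{X}^{clr} - \mathbf{1}_n\tilde{\mathbf{b}}^{T} \\ \mathbf{x}_{new}^{clr} - \tilde{\mathbf{b}}^{T} \end{bmatrix} \tag{19}$$

Adding and subtracting $\mathbf{1}_{n+1}\mathbf{b}^{T}$ to $\tilde{\mathbf{X}}^{clr}$ in Eq. (19) yields:

$$\begin{bmatrix} \mathbf{X}^{clr} - \mathbf{1}_n\mathbf{b}^{T} - \mathbf{1}_n\Delta\mathbf{b}^{T} \\ \mathbf{x}_{new}^{clr} - \mathbf{b}^{T} - \Delta\mathbf{b}^{T} \end{bmatrix} = \left[\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_m\right] \tag{20}$$

where $\mathbf{y}_i$ is the i-th column of $\tilde{\mathbf{X}}^{clr} - \mathbf{1}_{n+1}\tilde{\mathbf{b}}^{T}$, the corresponding i-th variable. Then, the variance of this i-th variable will be

$$\tilde{\sigma}_i^{2} = \frac{1}{n}\,\mathbf{y}_i^{T}\mathbf{y}_i \tag{21}$$

which can be written as:

$$\tilde{\sigma}_i^{2} = \frac{1}{n}\left[\sum_{k=1}^{n}\left(x_{ki}^{clr} - b_i - \Delta b_i\right)^{2} + \left(x_{new,i}^{clr} - b_i - \Delta b_i\right)^{2}\right] \tag{22}$$

Equation (22) can be further expanded as:

$$\tilde{\sigma}_i^{2} = \frac{1}{n}\left[\sum_{k=1}^{n}\left(x_{ki}^{clr} - b_i\right)^{2} - 2\Delta b_i\sum_{k=1}^{n}\left(x_{ki}^{clr} - b_i\right) + \sum_{k=1}^{n}\Delta b_i^{2} + \left(x_{new,i}^{clr} - b_i\right)^{2} - 2\Delta b_i\left(x_{new,i}^{clr} - b_i\right) + \Delta b_i^{2}\right] \tag{23}$$

Notice that, in this expression, the terms $\sum_{k=1}^{n}\left(x_{ki}^{clr} - b_i\right)^{2} = (n-1)\sigma_i^{2}$, $\sum_{k=1}^{n}\left(x_{ki}^{clr} - b_i\right) = 0$, and $\sum_{k=1}^{n}\Delta b_i^{2} = n\,\Delta b_i^{2}$. Then, Eq. (23) can be reduced to:

$$\tilde{\sigma}_i^{2} = \frac{1}{n}\left[(n-1)\sigma_i^{2} + \left(x_{new,i}^{clr} - b_i\right)^{2} - 2\Delta b_i\left(x_{new,i}^{clr} - b_i\right) + (n+1)\Delta b_i^{2}\right] \tag{24}$$

Considering that $\Delta b_i = \tilde{b}_i - b_i$ and, from Eq. (18), $x_{new,i}^{clr} - b_i = (n+1)\Delta b_i$, it follows that

$$\tilde{\sigma}_i^{2} = \frac{n-1}{n}\,\sigma_i^{2} + \frac{1}{n}\left(x_{new,i}^{clr} - b_i\right)^{2} - \frac{n+1}{n}\,\Delta b_i^{2} \tag{25}$$

From Eq. (25), notice that the (distorted) variances of the variables of the group depend on: (1) the original variances in $\mathbf{X}$, $\sigma_i^{2}$, with natural weight $\frac{n-1}{n}$; (2) the quadratic (mean centered) values of the new sample, $\left(x_{new,i}^{clr} - b_i\right)^{2}$, with natural weight $\frac{1}{n}$; and (3) the quadratic values of the distortion in the mean vector, $\Delta b_i^{2}$. Based on Eq. (25), the standard deviation matrix for all m variables is

$$\tilde{\boldsymbol{\Sigma}} = \mathrm{diag}\left(\tilde{\sigma}_1, \tilde{\sigma}_2, \ldots, \tilde{\sigma}_m\right) \tag{26}$$

Having expressions for $\tilde{\mathbf{b}}$ and $\tilde{\boldsymbol{\Sigma}}$, it follows that the distorted correlation matrix is calculated as $\tilde{\mathbf{S}} = \frac{1}{n}\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}$. Combining with Eq. (17) yields

$$\tilde{\mathbf{S}} = \frac{1}{n}\,\tilde{\boldsymbol{\Sigma}}^{-1}\left(\mathbf{W}\boldsymbol{\Sigma} - \mathbf{1}_{n+1}\Delta\mathbf{b}^{T}\right)^{T}\left(\mathbf{W}\boldsymbol{\Sigma} - \mathbf{1}_{n+1}\Delta\mathbf{b}^{T}\right)\tilde{\boldsymbol{\Sigma}}^{-1} \tag{27}$$

It follows that,

$$\tilde{\mathbf{S}} = \frac{1}{n}\,\tilde{\boldsymbol{\Sigma}}^{-1}\left[\boldsymbol{\Sigma}\mathbf{W}^{T}\mathbf{W}\boldsymbol{\Sigma} - \boldsymbol{\Sigma}\mathbf{W}^{T}\mathbf{1}_{n+1}\Delta\mathbf{b}^{T} - \Delta\mathbf{b}\,\mathbf{1}_{n+1}^{T}\mathbf{W}\boldsymbol{\Sigma} + \Delta\mathbf{b}\,\mathbf{1}_{n+1}^{T}\mathbf{1}_{n+1}\Delta\mathbf{b}^{T}\right]\tilde{\boldsymbol{\Sigma}}^{-1} \tag{28}$$

As $\mathbf{W}^{T}\mathbf{W} = \mathbf{X}^{T}\mathbf{X} + \mathbf{x}_{new}^{T}\mathbf{x}_{new}$, $\mathbf{W}^{T}\mathbf{1}_{n+1} = \mathbf{X}^{T}\mathbf{1}_n + \mathbf{x}_{new}^{T}$, and $\mathbf{1}_{n+1}^{T}\mathbf{1}_{n+1} = n + 1$, this expression can be expressed as:

$$\tilde{\mathbf{S}} = \frac{1}{n}\,\tilde{\boldsymbol{\Sigma}}^{-1}\left[\boldsymbol{\Sigma}\mathbf{X}^{T}\mathbf{X}\boldsymbol{\Sigma} - \boldsymbol{\Sigma}\mathbf{X}^{T}\mathbf{1}_n\Delta\mathbf{b}^{T} - \Delta\mathbf{b}\,\mathbf{1}_n^{T}\mathbf{X}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\mathbf{x}_{new}^{T}\mathbf{x}_{new}\boldsymbol{\Sigma} - \boldsymbol{\Sigma}\mathbf{x}_{new}^{T}\Delta\mathbf{b}^{T} - \Delta\mathbf{b}\,\mathbf{x}_{new}\boldsymbol{\Sigma} + (n+1)\Delta\mathbf{b}\Delta\mathbf{b}^{T}\right]\tilde{\boldsymbol{\Sigma}}^{-1} \tag{29}$$

Now, as $\mathbf{X}^{T}\mathbf{1}_n = \mathbf{0}$ (the columns of $\mathbf{X}$ are mean centered), the second and third terms of Eq. (29) disappear. Then, since $\mathbf{X}^{T}\mathbf{X} = (n-1)\mathbf{S}$, the distorted correlation matrix is given by

$$\tilde{\mathbf{S}} = \frac{n-1}{n}\,\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{\Sigma}\mathbf{S}\boldsymbol{\Sigma}\tilde{\boldsymbol{\Sigma}}^{-1} + \frac{1}{n}\,\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{\Sigma}\mathbf{x}_{new}^{T}\mathbf{x}_{new}\boldsymbol{\Sigma}\tilde{\boldsymbol{\Sigma}}^{-1} + \frac{1}{n}\,\tilde{\boldsymbol{\Sigma}}^{-1}\left[(n+1)\Delta\mathbf{b}\Delta\mathbf{b}^{T} - \boldsymbol{\Sigma}\mathbf{x}_{new}^{T}\Delta\mathbf{b}^{T} - \Delta\mathbf{b}\,\mathbf{x}_{new}\boldsymbol{\Sigma}\right]\tilde{\boldsymbol{\Sigma}}^{-1} \tag{30}$$

Note that, in this expression, $\tilde{\mathbf{S}}$ depends on three terms: (i) $\frac{n-1}{n}\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{\Sigma}\mathbf{S}\boldsymbol{\Sigma}\tilde{\boldsymbol{\Sigma}}^{-1}$, which considers the contributions made from the non-distorted correlation matrix $\mathbf{S}$ after an update of the standard deviation, with a natural weight of $\frac{n-1}{n}$; (ii) $\frac{1}{n}\tilde{\boldsymbol{\Sigma}}^{-1}\boldsymbol{\Sigma}\mathbf{x}_{new}^{T}\mathbf{x}_{new}\boldsymbol{\Sigma}\tilde{\boldsymbol{\Sigma}}^{-1}$, which considers the contribution of the new sample to the constitution of the distorted correlation matrix, with a natural weight of $\frac{1}{n}$; and (iii) the remaining term, which considers the effects of the distortion of $\boldsymbol{\Sigma}$ and $\mathbf{b}$ in $\tilde{\mathbf{S}}$. Finally, the distortion of the correlation matrix will be measured with the estimation of the deviation between $\mathbf{S}$ and $\tilde{\mathbf{S}}$, using the metric defined in Eq. (14).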
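The closed-form updates for the distorted mean, standard deviation, and correlation matrix can be checked numerically against a direct recomputation on the augmented data. The sketch below assumes the sample (n − 1) convention for variances and correlations. It writes the cross terms in their simplified form, using the fact that the mean-centred new sample equals (n + 1) times the mean distortion. The function and variable names are hypothetical.

```python
import numpy as np

def distorted_statistics(b, sigma, S, x_new_clr, n):
    """Closed-form mean, std and correlation after adding one sample to n samples.

    b, sigma, S describe the original group (sigma uses the n - 1 denominator);
    x_new_clr is the new sample on the same (clr) scale as the group data.
    """
    d = x_new_clr - b                          # mean-centred new sample
    delta_b = d / (n + 1)                      # distortion of the mean vector
    b_t = b + delta_b                          # weights n/(n+1) and 1/(n+1)
    var_t = ((n - 1) * sigma**2 + d**2 - (n + 1) * delta_b**2) / n
    sigma_t = np.sqrt(var_t)
    # covariance update; np.outer(sigma, sigma) * S equals Sigma @ S @ Sigma
    cov_t = ((n - 1) * np.outer(sigma, sigma) * S
             + np.outer(d, d)
             - (n + 1) * np.outer(delta_b, delta_b)) / n
    S_t = cov_t / np.outer(sigma_t, sigma_t)
    return b_t, sigma_t, S_t

rng = np.random.default_rng(1)
Y = rng.normal(size=(30, 4))                   # stand-in for clr-transformed group data
y_new = rng.normal(size=4)                     # stand-in for a clr-transformed new sample

b = Y.mean(axis=0)
sigma = Y.std(axis=0, ddof=1)
S = np.corrcoef(Y, rowvar=False)
b_t, sigma_t, S_t = distorted_statistics(b, sigma, S, y_new, n=Y.shape[0])

Y_aug = np.vstack([Y, y_new])                  # direct recomputation for comparison
print(np.allclose(S_t, np.corrcoef(Y_aug, rowvar=False)))
```

The argument `n` is kept separate from the data so that it can later be replaced by a smaller artificial dimension without touching `b`, `sigma`, or `S`.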
As previously mentioned, if the number of samples for the group g is large, the integration of the new sample $\mathbf{x}_{new}$ will barely cause a distortion in the correlation structure, even if it has different features compared to the samples in $\mathbf{X}$. For example, if $\mathbf{X}$ were composed of 200 samples, the natural relative weight of the mean vector ($\mathbf{b}$) for the construction of the distorted mean vector would be ~ 0.995, while the natural weight of the sample would (only) be ~ 0.005. On the other hand, if the weights were calculated assuming that $\mathbf{X}$ is composed of few samples, that is, replacing n with $\tilde{n}$ (so that $\tilde{n} < n$) in the quotients to calculate the relative weights, these weights would be more even and provide a weighting factor for the calculation of the distorted correlation matrix using all the information contained in the original samples of $\mathbf{X}$ (in $\mathbf{b}$, $\boldsymbol{\Sigma}$, and $\mathbf{S}$). This is equivalent to finding a generatrix base of a few samples/patients ($\tilde{n}$) that can represent all the characteristics of $\mathbf{X}$, incorporating $\mathbf{x}_{new}$, and then evaluating the distortion caused to the correlation structure, providing an artificial dimensional reduction. For example, if the relative weights were calculated assuming that $\mathbf{X}$ is composed only of three samples that exhibit all the attributes of the original dataset (i.e., $\tilde{n} = 3$), these weights would have the values of 0.75 and 0.25, respectively, for the calculation of the distorted mean vector. The lower threshold for this artificial dimensional reduction could be found by making $\tilde{n} = 1$ in the calculation of the relative weights. If $\tilde{n} = 1$, this would lead to leaving out all the information contained in $\mathbf{S}$ in the estimation of $\tilde{\mathbf{S}}$ (see Eq. (30)). A similar result is obtained for the standard deviation (see Eq. (25)).
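The two numerical examples above follow directly from the weights of the distorted mean vector, as this short sketch shows (the function name is hypothetical):

```python
def natural_weights(n_tilde):
    """Weights of the group mean and of the new sample in the distorted mean vector."""
    return n_tilde / (n_tilde + 1), 1 / (n_tilde + 1)

for n_tilde in (200, 3):
    w_group, w_sample = natural_weights(n_tilde)
    print(n_tilde, round(w_group, 3), round(w_sample, 3))
# prints:
# 200 0.995 0.005
# 3 0.75 0.25
```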

Proposed classification rule

Now that the artificial dimensional reduction approach has been proposed, it will be used alongside the metric φ for the creation of a tool to classify new samples/patients into either the control or case group. The classifier works under the assumption that a sample’s likelihood of belonging to either group is inversely proportional to the distortion caused by its incorporation into that group. This classification approach includes the following steps:

1. Store the new sample in $\mathbf{x}_{new}^{\rho}$.
2. Define the “maximum artificial dimension” to be evaluated, $\tilde{n}_{max}$.
3. Choose a dimension “step of change”, $\Delta n$, such that $\tilde{n}_{max} - 2$ is divisible by $\Delta n$. Thus, $(\tilde{n}_{max} - 2)/\Delta n + 1$ defines the number of artificial dimensions to be evaluated. Therefore, we set $\tilde{n} \in \{2,\ 2 + \Delta n,\ \ldots,\ \tilde{n}_{max}\}$ for both g = c and g = v.
4. Evaluate Eqs. (18), (25), (26) and (30) using $\tilde{n}$ instead of n. Perform this evaluation for both g = c and g = v, and for all values of $\tilde{n}$. Store the resulting distorted correlation matrices as $\tilde{\mathbf{S}}_g(\tilde{n})$.
5. For each $\tilde{\mathbf{S}}_g(\tilde{n})$, calculate the score $\psi_g(\tilde{n})$, which is based on the distortion φ between $\mathbf{S}_g$ and $\tilde{\mathbf{S}}_g(\tilde{n})$ (here, |l| denotes the absolute value of l). In consequence, large values of ψ indicate a small distortion in the correlation structure and, therefore, a high degree of affinity between $\mathbf{X}_g$ and $\mathbf{x}_{new}$. On the other hand, small values of ψ indicate a big distortion and a low degree of affinity between $\mathbf{X}_g$ and $\mathbf{x}_{new}$.
6. Calculate the average value $\bar{\psi}_g$ of $\psi_g(\tilde{n})$ over all the values of $\tilde{n}$ evaluated.
7. Finally, the outcomes of the proposed classification rule, for a single sample, are $\bar{\psi}_c$ and $\bar{\psi}_v$. The method classifies the sample into the group with the greater value of $\bar{\psi}_g$.

Figure 4 shows a graphical representation to visualize the outcome of the proposed classification method after classifying a set of new samples one-by-one.
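The control flow of the classification rule can be sketched as follows. This is not the authors' classifier. The distortion metric φ and the score ψ are replaced here by plainly labeled stand-ins: the Frobenius norm of the difference between the original and distorted correlation matrices, and its negative logarithm. The grid of artificial dimensions is assumed to start at 2, and all names are hypothetical. Inputs are assumed to be clr-transformed already.

```python
import numpy as np

def artificial_dimensions(n_max, step):
    """Grid 2, 2 + step, ..., n_max (assumed form); n_max - 2 must be divisible by step."""
    if (n_max - 2) % step != 0:
        raise ValueError("n_max - 2 must be divisible by step")
    return np.arange(2, n_max + 1, step)

def distorted_correlation(b, sigma, S, x_new_clr, n_tilde):
    """Distorted correlation matrix evaluated with n_tilde in place of n."""
    d = x_new_clr - b
    delta_b = d / (n_tilde + 1)
    var_t = ((n_tilde - 1) * sigma**2 + d**2 - (n_tilde + 1) * delta_b**2) / n_tilde
    cov_t = ((n_tilde - 1) * np.outer(sigma, sigma) * S + np.outer(d, d)
             - (n_tilde + 1) * np.outer(delta_b, delta_b)) / n_tilde
    sigma_t = np.sqrt(var_t)
    return cov_t / np.outer(sigma_t, sigma_t)

def mean_affinity(group_clr, x_new_clr, dims):
    """Average stand-in score over all artificial dimensions for one group."""
    b = group_clr.mean(axis=0)
    sigma = group_clr.std(axis=0, ddof=1)
    S = np.corrcoef(group_clr, rowvar=False)
    scores = []
    for n_tilde in dims:
        S_t = distorted_correlation(b, sigma, S, x_new_clr, n_tilde)
        distortion = np.linalg.norm(S_t - S)   # stand-in for phi, NOT the paper's metric
        scores.append(-np.log(distortion))     # stand-in for psi: large means small distortion
    return float(np.mean(scores))

def classify(control_clr, case_clr, x_new_clr, n_max, step):
    """Assign the sample to the group with the larger average score."""
    dims = artificial_dimensions(n_max, step)
    psi_c = mean_affinity(control_clr, x_new_clr, dims)
    psi_v = mean_affinity(case_clr, x_new_clr, dims)
    return ("c" if psi_c >= psi_v else "v"), psi_c, psi_v

print(artificial_dimensions(20, 6))            # grid of artificial dimensions for n_max = 20
```

Swapping the two stand-in lines for the actual φ and ψ would leave the rest of the loop unchanged, which is the main point of the sketch.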

Fig. 4

Illustration of new samples and the line that separates both groups with the proposed method. Samples lying in the upper semi-plane will be classified in the case (v) group and in the control (c) group otherwise

Performance assessment with synthetic data

In this section, we assess the performance of the proposed method to correctly classify synthetically generated data.

Synthetic data generation

We conducted in silico experiments to assess the performance of the proposed method under different parameter settings. The following procedure was used to generate synthetic datasets:

1. Define the quadruplet $(n, m, \rho_c, \rho_v)$. Set n = {20, 40, 60, 80, 100, 120, 140, 160}, m = {20, 40, 60, 80, 100, 120, 140}, $\rho_c = 0.1$, $\rho_v = 0.2$.
2. For every quadruplet in step 1, construct a pair of generatrix correlation matrices, $\mathbf{R}_c$ and $\mathbf{R}_v$, as $\mathbf{R}_c = (1 - \rho_c)\mathbf{I}_m + \rho_c\mathbf{1}_m\mathbf{1}_m^{T}$ and $\mathbf{R}_v = (1 - \rho_v)\mathbf{I}_m + \rho_v\mathbf{1}_m\mathbf{1}_m^{T}$, where $\mathbf{I}_m$ is the identity matrix and $\mathbf{1}_m$ is a column vector of ones.
3. For every pair $(\mathbf{R}_c, \mathbf{R}_v)$, B pairs of Normal-distributed matrices $\mathbf{X}_c^{(r)}$ and $\mathbf{X}_v^{(r)}$ (with r = {1, 2, …, B}) of dimension n × m are generated. For this purpose, the NumPy [54] Python package was used. The number of experimental replicates was B = 100.
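A minimal NumPy sketch of this generation procedure is shown below. It assumes zero-mean, unit-variance variables and an equicorrelation structure (every off-diagonal entry equal to ρ); the function names and the seed handling are hypothetical.

```python
import numpy as np

def generatrix_correlation(m, rho):
    """Equicorrelation matrix: ones on the diagonal, rho everywhere else."""
    return (1.0 - rho) * np.eye(m) + rho * np.ones((m, m))

def synthetic_pair(n, m, rho_c=0.1, rho_v=0.2, seed=None):
    """One pair of n x m Normal-distributed datasets (control, case).

    Assumption: zero mean vector and unit variances, so the covariance passed to
    the sampler is the generatrix correlation matrix itself.
    """
    rng = np.random.default_rng(seed)
    zero_mean = np.zeros(m)
    X_c = rng.multivariate_normal(zero_mean, generatrix_correlation(m, rho_c), size=n)
    X_v = rng.multivariate_normal(zero_mean, generatrix_correlation(m, rho_v), size=n)
    return X_c, X_v

# B = 100 replicates for one (n, m) configuration
replicates = [synthetic_pair(n=80, m=40, seed=r) for r in range(100)]
print(len(replicates), replicates[0][0].shape)
```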

Performance assessment procedure

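We used the correct classification rate (accuracy) as the assessment criterion to measure the performance of our method, as follows:

1. Merge each pair $\left(\mathbf{X}_c^{(r)}, \mathbf{X}_v^{(r)}\right)$ into a single matrix $\mathbf{X}^{(r)}$ with 2n rows.
2. For every pair $\left(\mathbf{X}_c^{(r)}, \mathbf{X}_v^{(r)}\right)$, execute the proposed algorithm with each row sample $\mathbf{x}_i$ of $\mathbf{X}^{(r)}$, i = {1, 2, …, 2n}, and classify $\mathbf{x}_i$.
3. Compute the average classification accuracy as:

$$\text{Accuracy} = \frac{100\%}{B}\sum_{r=1}^{B}\frac{N_r}{2n}$$

where $N_r$ is the number of correctly classified samples in the r-th replicate.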
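The accuracy computation can be written compactly as below. The sketch assumes that the true and predicted labels are stored as arrays of shape (B, 2n), one row per replicate; the function name and the toy labels are hypothetical.

```python
import numpy as np

def average_accuracy(true_labels, predicted_labels):
    """Average percentage of correctly classified samples over B replicates.

    Both inputs are arrays of shape (B, 2n): one row of labels per replicate.
    """
    true_labels = np.asarray(true_labels)
    predicted_labels = np.asarray(predicted_labels)
    correct_per_replicate = (true_labels == predicted_labels).sum(axis=1)
    samples_per_replicate = true_labels.shape[1]
    return 100.0 * float(np.mean(correct_per_replicate / samples_per_replicate))

# Toy example with B = 2 replicates and 2n = 4 samples each
truth = np.array([["c", "c", "v", "v"], ["c", "c", "v", "v"]])
predicted = np.array([["c", "c", "v", "v"], ["c", "v", "v", "v"]])
print(average_accuracy(truth, predicted))
```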

Performance assessment results with synthetic data

Table 3 summarizes the main results. Our method achieves an accuracy of at least 98.0% for all the configurations tested. Interestingly, accuracy decreases as the number of features m decreases and the sample size n increases.

Table 3

Performance of the proposed method for synthetic datasets. Configurations (n, m) not reported showed 100% Classification Accuracy

Sample size (n)	Number of features (m)	Classification Accuracy (%)
80	40	99.8
100	20	98.1
120	20	99.7
160	20	98.0

Validation with real datasets

In this section, we study the performance of the proposed method using two real-world datasets, which contain OTU count tables obtained through 16S rRNA gene sequencing data from microbiota experiments. We also compare the classification accuracy of our method with those of two state-of-the-art methods: SVM [39] and SVM-RFE [41].

Datasets

The first dataset is from the American Gut Project (AGP) [51], which is one of the largest crowd-funded microbiome research projects. The second dataset is the Greengenes (GG) database [52], created with the PhyloChip 16S rRNA microarray. For the comparison experiment, only fractions of the datasets were used. In particular, a total of 578 samples and 127 features comprised the AGP data set, while 500 samples and 26 features comprised the GG data set. In both data sets, 50% of the samples correspond to cases.

Validation scenarios results

Datasets were preprocessed as described in section “Data pretreatment”. Further, the proposed method, as well as the SVM and SVM-RFE methods, were applied after separating the whole data set into training, testing, and validation sets using 70, 20, and 10% of the data, respectively. For the SVM-RFE method, several values of the number of features to select were used, and the average of the results was calculated. The tuning parameters used for the SVM and SVM-RFE methods were C = 1 and γ = 0.05, where C trades off the correct classification of training examples against the maximization of the decision function’s margin, and γ defines how far the influence of a single training example reaches. Table 4 shows the main results. For the AGP data set, SVM is the least accurate, and SVM-RFE has the highest accuracy. This latter result is mostly due to the strengths of SVM combined with the ability of the SVM-RFE method to eliminate variables that are not highly relevant in the data. Interestingly, our method outperforms SVM and is a close competitor of SVM-RFE.

Table 4

Classification accuracy for each method for the AGP and GG data sets

Dataset	SVM	SVM-RFE	Proposed Method
AGP	92.03%	96.33%	95.06%
GG	89.34%	92%	94%

For the GG dataset, although the number of variables is small, the SVM-RFE and our method showed accuracy values above 90%, while the accuracy for the SVM method is below this threshold. It is worth highlighting that, for this data set, our method outperforms both the SVM and SVM-RFE methods. The latter result is thanks to the artificial dimensional reduction conducted to balance the natural weights when the number of samples is greater than the number of variables. Figure 5 provides a graphical illustration of the proposed method’s classification outcome for both real datasets used for validation, i.e., the AGP and the GG.

Fig. 5

Illustration of new samples and the line that separates both groups with the proposed method for the AGP (left) and GG (right) data sets

Discussion and conclusions

The ability to characterize populations of patients, species, or biological features, which usually comprise a large number of variables, and to use the extracted characteristics to classify new samples into one of such populations’ categories is a relevant tool for biological and medical studies. When the data describing these populations are compositional, further limitations and challenges arise. Here, we proposed a new method to classify samples into one of two previously known categories. The method uses a new metric developed to quantify the overall correlation structure deviation between two datasets, and a new dimensionality reduction technique. Although we illustrated the usefulness of our proposal with compositional data, its application is not limited, under any circumstances, to data of this nature. In fact, when data is not compositional, the centered log-ratio transformation and the zero-replacement algorithm must not be applied. Validation with synthetic data showed that the proposed method achieves accuracy values of 98% or higher. Moreover, comparison of the performance of our method with that of SVM and SVM-RFE (i.e., two state-of-the-art classification techniques), using two real-world datasets from 16S rRNA sequencing experiments, showed that our method outperforms the SVM method in both data sets, outperforms the SVM-RFE method in the GG data set, and is a close competitor of the SVM-RFE method in the AGP data set. Future studies may address the ability of our proposed method to perform accurately for a broader range of dimensions (number of variables and samples) and assess its performance for more scenarios of dissimilar correlation structures other than that for $\rho_c = 0.1$ and $\rho_v = 0.2$. Moreover, our method may be extended to multi-category classification, and a performance assessment may be conducted to test its classification accuracy in non-binary scenarios.

31 in total

Review 1. Plant functional genomics.

Authors: C Somerville; S Somerville
Journal: Science Date: 1999-07-16 Impact factor: 47.728

Review 2. Looking beyond the details: a rise in system-oriented approaches in genetics and molecular biology.

Authors: Hiroaki Kitano
Journal: Curr Genet Date: 2002-04-04 Impact factor: 3.886

3. Systems biology. Life's complexity pyramid.

Authors: Zoltán N Oltvai; Albert-László Barabási
Journal: Science Date: 2002-10-25 Impact factor: 47.728

4. Boosting for tumor classification with gene expression data.

Authors: Marcel Dettling; Peter Bühlmann
Journal: Bioinformatics Date: 2003-06-12 Impact factor: 6.937

5. The human microbiome project.

Authors: Peter J Turnbaugh; Ruth E Ley; Micah Hamady; Claire M Fraser-Liggett; Rob Knight; Jeffrey I Gordon
Journal: Nature Date: 2007-10-18 Impact factor: 49.962

6. Classification of breast cancer versus normal samples from mass spectrometry profiles using linear discriminant analysis of important features selected by random forest.

Authors: Somnath Datta
Journal: Stat Appl Genet Mol Biol Date: 2008-02-19

7. Correlation-based linear discriminant classification for gene expression data.

Authors: M Pan; J Zhang
Journal: Genet Mol Res Date: 2017-01-23

8. Sparse and compositionally robust inference of microbial ecological networks.

Authors: Zachary D Kurtz; Christian L Müller; Emily R Miraldi; Dan R Littman; Martin J Blaser; Richard A Bonneau
Journal: PLoS Comput Biol Date: 2015-05-07 Impact factor: 4.475

9. Differential expression analysis for sequence count data.

Authors: Simon Anders; Wolfgang Huber
Journal: Genome Biol Date: 2010-10-27 Impact factor: 13.583

10. Study of the impact of long-duration space missions at the International Space Station on the astronaut microbiome.

Authors: Alexander A Voorhies; C Mark Ott; Satish Mehta; Duane L Pierson; Brian E Crucian; Alan Feiveson; Cherie M Oubre; Manolito Torralba; Kelvin Moncera; Yun Zhang; Eduardo Zurek; Hernan A Lorenzi
Journal: Sci Rep Date: 2019-07-09 Impact factor: 4.379