
An Information Theoretic Interpretation to Deep Neural Networks.

Xiangxiang Xu¹, Shao-Lun Huang¹, Lizhong Zheng², Gregory W Wornell².

Abstract

With the unprecedented performance achieved by deep learning, it is commonly believed that deep neural networks (DNNs) attempt to extract informative features for learning tasks. To formalize this intuition, we apply the local information geometric analysis and establish an information-theoretic framework for feature selection, which demonstrates the information-theoretic optimality of DNN features. Moreover, we conduct a quantitative analysis to characterize the impact of network structure on the feature extraction process of DNNs. Our investigation naturally leads to a performance metric for evaluating the effectiveness of extracted features, called the H-score, which illustrates the connection between the practical training process of DNNs and the information-theoretic framework. Finally, we validate our theoretical results by experimental designs on synthesized data and the ImageNet dataset.


Keywords: deep neural network; feature extraction; information theory; local information geometry

Year: 2022 PMID: 35052161 PMCID: PMC8774347 DOI: 10.3390/e24010135

Source DB: PubMed Journal: Entropy (Basel) ISSN: 1099-4300 Impact factor: 2.524

1. Introduction

Due to the striking performance of deep learning in various application fields, deep neural networks (DNNs) have gained great attention in modern computer science. While it is commonly understood that the features extracted from the hidden layers of DNNs are “informative” for learning tasks, the mathematical meaning of informative features in DNNs is generally not clear. From the practical perspective, DNN models have obtained unprecedented performance in a variety of tasks, such as image recognition [1], language processing [2,3], and games [4,5]. However, the understanding of the feature extraction behind these models is relatively lacking, which poses challenges for their application in security-sensitive tasks, such as autonomous driving. To address this problem, there have been numerous research efforts, including both experimental and theoretical studies [6]. The experimental studies usually focus on empirical properties of the features extracted by DNNs, by visualizing the features [7] or testing their performance under specific training settings [8] or learning tasks [9]. Though such empirical methods have provided some intuitive interpretations, their performance can depend heavily on the data and network architecture used. For example, while feature visualization works well on convolutional neural networks, its application to other networks is typically less effective [10]. In contrast, theoretical studies focus on the analytical properties of the extracted features or the learning process in DNNs. Due to the complicated structure of DNNs, existing studies have often been restricted to networks of specific structures, e.g., networks with infinite width [11] or two-layer networks [12,13], to characterize the theoretical behaviors. However, the interpretation of the optimal feature remains unclear, which limits their further applications.
To obtain better interpretability, tools and measures from information theory [14] have recently been applied to connect DNNs with general information processing problems [15]. For instance, the information bottleneck [16,17] employs the mutual information as the metric to quantify the informativeness of features in DNNs, and other information metrics, such as the Kullback–Leibler (KL) divergence [18] and the Wasserstein distance [19], are also used in different problems. However, there is still a disconnect between these information metrics and the performance objectives of the inference tasks that DNNs are meant to solve [20]. Therefore, it is, in general, difficult to match DNN learning with the optimization of a particular information metric. This paper aims to provide an information-theoretic interpretation of the feature extraction process in DNNs, to bridge the gap between practical deep learning implementations and information-theoretic characterizations. To this end, we first propose an information-theoretic feature selection framework, which establishes an information metric to measure the performance of each given feature in inference tasks. In addition, we demonstrate that the optimal features extracted by DNNs coincide with the solutions of the information-theoretic feature selection problem, which share the same performance metric. Therefore, our results give an explicit interpretation of the learning goal of the back-propagation (BackProp) and stochastic gradient descent (SGD) operations in deep learning [21], which also leads to a performance metric for evaluating the effectiveness of the extracted features. Finally, we validate our theoretical characterizations using numerical experiments on both synthesized data and the ImageNet [22] dataset for image classification.

2. Preliminaries and Methods

2.1. Methodological Background

The main method used in our development is local information geometry [23,24], which characterizes the local geometric properties of the probability distribution space. The local information geometric method is closely related to the conventional Hirschfeld–Gebelein–Rényi (HGR) maximal correlation [25,26,27] problem, which has attracted increasing interest in the information theory community [28,29,30,31,32,33], and has also been applied in data analysis [34] and privacy studies [35]. Specifically, we use the local information geometric method to construct and investigate an information-theoretic feature selection problem in Section 3.1, which leads to an information metric of features and also demonstrates an SVD (singular value decomposition) structure of the feature selection process. Following the same analysis framework, we characterize the optimal feature extracted by DNNs in Section 3.2, and demonstrate that the same SVD structure is shared by DNNs. Based on the established connection, we then propose an effectiveness measure for DNNs, with details presented in Section 3.3.

2.2. Notations

Throughout this paper, we use $X$, $\mathcal{X}$, $P_X$, and $x$ to represent a discrete random variable, its range, its probability distribution, and a value of $X$, respectively. In addition, for any function $s(X)$ of $X$, we use $\mu_s$ to denote the mean of $s(X)$, and $\tilde{s}$ to denote the centered variable with the mean subtracted, e.g., $\tilde{s}(X) = s(X) - \mu_s$. Moreover, we use $\|\cdot\|$ and $\|\cdot\|_{\mathrm{F}}$ to denote the $\ell_2$-norm and the Frobenius norm, respectively. All logarithms in our analyses are base $e$, i.e., natural.

2.3. Local Information Geometry

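The following concepts from local information geometry will be useful in our development.
Definition ($\epsilon$-Neighborhood). Let $\mathcal{P}^{\mathcal{X}}$ denote the space of distributions on some finite alphabet $\mathcal{X}$, and let $\mathrm{relint}(\mathcal{P}^{\mathcal{X}})$ denote the subset of strictly positive distributions. For a given $\epsilon > 0$, the $\epsilon$-neighborhood of a distribution $P_X \in \mathrm{relint}(\mathcal{P}^{\mathcal{X}})$ is defined by the $\chi^2$-divergence as
$$\mathcal{N}_\epsilon^{\mathcal{X}}(P_X) \triangleq \left\{ P \in \mathcal{P}^{\mathcal{X}} \colon \sum_{x \in \mathcal{X}} \frac{\bigl(P(x) - P_X(x)\bigr)^2}{P_X(x)} \le \epsilon^2 \right\}.$$
Definition ($\epsilon$-Dependence). The random variables $X, Y$ are called $\epsilon$-dependent if $P_{XY} \in \mathcal{N}_\epsilon^{\mathcal{X} \times \mathcal{Y}}(P_X P_Y)$.
Definition ($\epsilon$-Attribute). A random variable $U$ is called an $\epsilon$-attribute of $X$ if $P_{X|U}(\cdot|u) \in \mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$ for all $u \in \mathcal{U}$.
We will focus on the small $\epsilon$ regime, which we refer to as the local analysis regime. In addition, for any $P \in \mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$, we define the information vector $\phi$ and feature function $L$ corresponding to $P$, with respect to a reference distribution $P_X$, as
$$\phi(x) \triangleq \frac{P(x) - P_X(x)}{\epsilon \sqrt{P_X(x)}}, \qquad L(x) \triangleq \frac{\phi(x)}{\sqrt{P_X(x)}}.$$
This gives a three-way correspondence $P \leftrightarrow \phi \leftrightarrow L$ for all distributions in $\mathcal{N}_\epsilon^{\mathcal{X}}(P_X)$, which will be useful in our derivations.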
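As an illustration of the local analysis regime, the following minimal sketch computes an information vector and a feature function for a toy distribution. It assumes the $\chi^2$-based neighborhood and the scaling $\phi(x) = (P(x) - P_X(x)) / (\epsilon \sqrt{P_X(x)})$ commonly used in the local information geometry literature [23,24]; the function names and the numbers are hypothetical and chosen only for this example.

```python
import numpy as np

def chi2_divergence(p, p_ref):
    """Chi-square divergence sum_x (P(x) - P_ref(x))^2 / P_ref(x)."""
    p, p_ref = np.asarray(p, float), np.asarray(p_ref, float)
    return float(np.sum((p - p_ref) ** 2 / p_ref))

def information_vector(p, p_ref, eps):
    """Assumed form: phi(x) = (P(x) - P_ref(x)) / (eps * sqrt(P_ref(x)))."""
    p, p_ref = np.asarray(p, float), np.asarray(p_ref, float)
    return (p - p_ref) / (eps * np.sqrt(p_ref))

def feature_function(p, p_ref, eps):
    """Assumed form: L(x) = phi(x) / sqrt(P_ref(x))."""
    return information_vector(p, p_ref, eps) / np.sqrt(np.asarray(p_ref, float))

# Toy reference distribution and a small perturbation of it (made-up numbers).
p_ref = np.array([0.5, 0.3, 0.2])
p = np.array([0.52, 0.29, 0.19])
eps = 0.1

in_neighborhood = chi2_divergence(p, p_ref) <= eps ** 2
phi = information_vector(p, p_ref, eps)
L = feature_function(p, p_ref, eps)

# Three-way correspondence: the distribution is recovered from phi or from L.
p_from_phi = p_ref + eps * np.sqrt(p_ref) * phi
p_from_L = p_ref * (1.0 + eps * L)
```

Under these assumptions, membership in the $\epsilon$-neighborhood is the same as the information vector having norm at most one, and the feature function has zero mean under the reference distribution.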

2.4. Modal Decomposition

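Given a pair of discrete random variables $X, Y$ with the joint distribution $P_{XY}$, the $|\mathcal{Y}| \times |\mathcal{X}|$ matrix $\tilde{B}$ is defined as
$$\tilde{B}(y, x) \triangleq \frac{P_{XY}(x, y) - P_X(x) P_Y(y)}{\sqrt{P_X(x)} \sqrt{P_Y(y)}},$$
where $\tilde{B}(y, x)$ is the $(y, x)$th entry of $\tilde{B}$. The matrix $\tilde{B}$ is referred to as the canonical dependence matrix (CDM) [24]. The SVD of $\tilde{B}$ is referred to as the modal decomposition [24] of the joint distribution $P_{XY}$, which has the following property [18].
Lemma 1. The SVD of $\tilde{B}$ can be written as $\tilde{B} = \sum_{i=1}^{K} \sigma_i \psi_i^Y (\psi_i^X)^{\mathrm{T}}$, where $K \triangleq \min\{|\mathcal{X}|, |\mathcal{Y}|\}$, $\sigma_i$ denotes the $i$th singular value with the ordering $1 \ge \sigma_1 \ge \cdots \ge \sigma_K = 0$, and $\psi_i^Y$ and $\psi_i^X$ are the corresponding left and right singular vectors, with $\psi_K^X(x) = \sqrt{P_X(x)}$ and $\psi_K^Y(y) = \sqrt{P_Y(y)}$.
This SVD decomposes the feature spaces of $X, Y$ into maximally correlated features. To see that, consider the generalized canonical correlation analysis (CCA) problem:
$$\max_{f_i \colon \mathcal{X} \to \mathbb{R},\ g_i \colon \mathcal{Y} \to \mathbb{R}} \ \sum_{i=1}^{k} \mathbb{E}[f_i(X) g_i(Y)] \quad \text{subject to} \quad \mathbb{E}[f_i(X)] = \mathbb{E}[g_i(Y)] = 0, \ \ \mathbb{E}[f_i(X) f_j(X)] = \mathbb{E}[g_i(Y) g_j(Y)] = \delta_{ij}, \qquad (3)$$
where $\delta_{ij}$ denotes the Kronecker delta function. It can be shown that for any $1 \le k \le K - 1$, the optimal features are $f_i^*(x) = \psi_i^X(x) / \sqrt{P_X(x)}$ and $g_i^*(y) = \psi_i^Y(y) / \sqrt{P_Y(y)}$, for $i = 1, \dots, k$, where $\psi_i^X(x)$ and $\psi_i^Y(y)$ are the $x$th and $y$th entries of $\psi_i^X$ and $\psi_i^Y$, respectively [18]. The special case $k = 1$ corresponds to the HGR maximal correlation [25,26,27], and the optimal features can be computed from the ACE (Alternating Conditional Expectation) algorithm [36].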
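The modal decomposition can be checked numerically on a small example. The sketch below assumes the CDM has entries $(P_{XY}(x, y) - P_X(x) P_Y(y)) / \sqrt{P_X(x) P_Y(y)}$ and uses a randomly generated joint distribution; the helper name `cdm` and the array layout `p_xy[x, y]` are choices made for this illustration only.

```python
import numpy as np

def cdm(p_xy):
    """Canonical dependence matrix B[y, x] for a joint pmf stored as p_xy[x, y]."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    prod = np.outer(p_x, p_y)                 # P_X(x) P_Y(y)
    return ((p_xy - prod) / np.sqrt(prod)).T  # transpose to index as (y, x)

# Random strictly positive joint distribution with |X| = 4 and |Y| = 3.
rng = np.random.default_rng(0)
p_xy = rng.random((4, 3))
p_xy /= p_xy.sum()
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

B = cdm(p_xy)
U, sigma, Vt = np.linalg.svd(B, full_matrices=False)

# Leading pair of features built from the top singular vectors.
f1 = Vt[0] / np.sqrt(p_x)      # feature of X
g1 = U[:, 0] / np.sqrt(p_y)    # feature of Y

mean_f1 = np.sum(p_x * f1)                  # E[f1(X)]
var_f1 = np.sum(p_x * f1 ** 2)              # E[f1(X)^2]
corr = np.sum(p_xy * np.outer(f1, g1))      # E[f1(X) g1(Y)]
```

In this sketch the singular values lie in $[0, 1]$, the smallest one vanishes, and the correlation between the leading feature pair equals the largest singular value, which is the $k = 1$ (maximal correlation) case.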

2.5. Deep Neural Networks

The architecture of deep neural networks (under log-loss) can be depicted as in Figure 1, where $X$ is the input data, e.g., images, audio, or natural language. Moreover, $Y$ is the objective to predict, which can represent a discrete label in classification tasks, or the target natural language in machine translation [37]. Specifically, for given data $X$, the network applies a (trainable) feature mapping to generate a $k$-dimensional feature $s(X)$. In practice, the feature mapping block (depicted as the gray block in Figure 1) is typically composed of hundreds or thousands of functional components (e.g., residual blocks [1]) with different types of layers, and may contain recurrent structures, e.g., LSTM (Long Short-Term Memory) [38]. In general, the internal structure of the feature mapping can take many different designs, depending on the learning task.

Figure 1

A deep neural network that uses data $X$ to predict $Y$. All hidden layers together map the input data $X$ to a $k$-dimensional feature $s(X)$. Then, the probabilistic prediction of $Y$ is computed from $s(x)$, $v(y)$, and $b(y)$, where $v$ and $b$ are the weights and bias in the last layer.

After obtaining the feature $s(x)$, $Y$ is then predicted by the probability distribution of the form
$$\tilde{P}^{(v,b)}_{Y|X}(y|x) \triangleq \frac{\exp\bigl(v^{\mathrm{T}}(y) s(x) + b(y)\bigr)}{\sum_{y'} \exp\bigl(v^{\mathrm{T}}(y') s(x) + b(y')\bigr)},$$
which is obtained by applying the softmax function [39] on $v^{\mathrm{T}}(y) s(x) + b(y)$, where $v(y)$ and $b(y)$ are the weights and biases in the last layer, respectively (this is equivalent to the common practice of denoting the weights and biases by a matrix and a vector, respectively; however, as we will show later, expressing the weights $v$ and biases $b$ as mappings of $y$ better illustrates their roles in feature selection). We will use $\tilde{P}_{Y|X}$ to refer to $\tilde{P}^{(v,b)}_{Y|X}$ when there is no ambiguity. Then, for a given training set of labeled samples $(x_i, y_i)$, for $i = 1, \dots, N$, all the parameters in the network, including $v$, $b$, as well as those in the feature mapping block, are chosen to maximize the log-likelihood function (or, equivalently, minimize the log-loss)
$$\frac{1}{N} \sum_{i=1}^{N} \log \tilde{P}^{(v,b)}_{Y|X}(y_i | x_i). \qquad (5)$$
The procedure of choosing such parameters is called the training of the network, which can be performed by stochastic gradient descent (SGD) or its variants [21]. With a trained network, the label for a new data sample $x$ can be predicted by the maximum a posteriori (MAP) estimation, i.e., $\hat{y} = \arg\max_{y} \tilde{P}_{Y|X}(y|x)$. Specifically, when we make predictions for samples in a test dataset, the proportion of samples with correct prediction (i.e., $\hat{y} = y$) over all samples is called the test accuracy.
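The prediction rule, the log-loss, and the MAP decision described above can be sketched in a few lines of NumPy. This is only an illustration for a fixed feature array: the weights are stored as a matrix whose rows play the role of $v(y)$, the function names are hypothetical, and the toy numbers are made up.

```python
import numpy as np

def predict_proba(s, v, b):
    """Softmax of v(y)^T s(x) + b(y).

    s: (N, k) features, v: (num_labels, k) with row y equal to v(y),
    b: (num_labels,) with entry y equal to b(y). Returns an (N, num_labels) array.
    """
    logits = s @ v.T + b
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def log_loss(s, y, v, b):
    """Negative average log-likelihood of the labels y."""
    p = predict_proba(s, v, b)
    return float(-np.mean(np.log(p[np.arange(len(y)), y])))

def map_predict(s, v, b):
    """MAP estimate: the label with the largest predicted probability."""
    return predict_proba(s, v, b).argmax(axis=1)

# Toy example with k = 2 features and 3 labels (made-up numbers).
s = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]])
y = np.array([0, 1, 2])
v = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)

probs = predict_proba(s, v, b)
accuracy = float(np.mean(map_predict(s, v, b) == y))
baseline = log_loss(s, y, np.zeros_like(v), b)  # uniform prediction
```

With all weights and biases set to zero the prediction is uniform, so the log-loss equals the logarithm of the number of labels; this is a convenient sanity check when implementing the loss.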

3. Results

3.1. Information-Theoretic Feature Selection

Suppose that, given random variables $X, Y$ with joint distribution $P_{XY}$, we want to infer an attribute $V$ of $Y$ from observed i.i.d. samples of $X$. When the statistical model is known, the optimal decision rule is the log-likelihood ratio test, where the log-likelihood function can be viewed as the optimal feature for inference. However, in many practical situations [18], it is hard to identify the model of the targeted attribute, and it is necessary to select low-dimensional informative features of $X$ for inference tasks before knowing the model. An information-theoretic formulation of such a feature selection problem is the universal feature selection problem [24], which we formalize as follows.
To begin, for an attribute $V$, we refer to as the configuration of $V$, where is the information vector specifying the corresponding conditional distribution . The configuration of $V$ models the statistical correlation between $V$ and $Y$. In the sequel, we focus on the local analysis regime, for which we assume that all the attributes $V$ of interest are $\epsilon$-attributes of $Y$. As a result, the corresponding configuration satisfies , for all . We refer to such configurations as $\epsilon$-configurations. The configuration of $V$ is unknown in advance but assumed to be generated from a rotationally invariant ensemble (RIE).
Definition (RIE). Two configurations are called rotationally equivalent, if there exists a unitary matrix
The RIE can be interpreted as assigning a uniform measure to the attributes with the same level of distinguishability. To infer the attribute $V$, we construct a $k$-dimensional feature vector , for some , of the form for some choices of feature functions . Our goal is to determine the feature functions such that the optimal decision rule based on this feature vector achieves the smallest possible error probability, where the performance is averaged over the possible configurations generated from an RIE. In turn, we denote as the corresponding information vector, and define the matrix .
Theorem 1 (Universal Feature Selection). For where
Proof. See Appendix A. □
As a result of (7), designing the information vectors as the singular vectors of the CDM $\tilde{B}$, for $i = 1, \dots, k$, optimizes (7) for all RIEs, pairs of , and $\epsilon$-configurations. Thus, the feature functions corresponding to these singular vectors are universally optimal for inferring the unknown attribute $V$. Moreover, (7) naturally leads to an information metric for any feature of $X$, measured by projecting the normalized information matrix of the feature through the linear projection $\tilde{B}$. This information metric quantifies how informative a feature of $X$ is when solving inference problems with respect to $Y$, and it is optimized when the features are designed from the singular vectors of $\tilde{B}$. Thus, we can interpret universal feature selection as finding the most informative features for data inference via the SVD of $\tilde{B}$, which also coincides with the maximally correlated features in (3). Later, we will show that the feature selection in DNNs shares the same information metric as universal feature selection in the local analysis regime.

3.2. Feature Extraction in Deep Neural Networks

3.2.1. Network with Ideal Expressive Power

For convenience of analysis, we first consider the ideal case where the neural network can express any desired feature mapping. While this assumption can be rather strong, the existence of such ideal networks is guaranteed by the universal approximation theorem [40]. In addition, one goal of practical network designs is to approximate the ideal networks and obtain sufficient expressive power. For such networks, we will show that when $X, Y$ are $\epsilon$-dependent, the extracted feature $s$ and weights $v$ coincide with the solutions of the universal feature selection.
To begin, we use $P_{XY}$ to denote the joint empirical distribution of the labeled samples, and $P_X$, $P_Y$ to denote the corresponding marginal distributions. Then, the objective function of (5) is the empirical average of the log-likelihood function. Therefore, maximizing this empirical average is equivalent to minimizing the KL divergence:
$$\min_{s, v, b} \ D\bigl(P_{XY} \,\big\|\, P_X \tilde{P}^{(v,b)}_{Y|X}\bigr), \qquad (8)$$
where $\tilde{P}^{(v,b)}_{Y|X}$ denotes the softmax prediction of the network with weights $v$ and bias $b$. This can be interpreted as finding the best fit to the empirical joint distribution $P_{XY}$ by distributions of the form $P_X \tilde{P}^{(v,b)}_{Y|X}$. In our development, it is more convenient to denote the bias by , for . Then, the following lemma illustrates the explicit constraint on the problem (8) in the local analysis regime.
Lemma 2. If
Proof. See Appendix B. □
In turn, we take (9) as the constraint for solving the problem (8) in the local analysis regime. Moreover, we define the information vectors for zero-mean vectors , as , , and define matrices
Lemma 3. The KL divergence (8) where
Proof. See Appendix C. □
Lemma 3 reveals key insights for feature selection in neural networks. To see this, we consider the following two learning problems: learning the optimal weights $v$ for a given $s$, and learning the optimal feature $s$ for a given $v$. For the case that $s$ is fixed, we can optimize (10) with fixed and obtain the following optimal weights:
Theorem 2. For fixed and the optimal weights where
Proof. See Appendix D. □
Specifically, when $s(x) = x$, Theorem 2 gives the optimal weights for softmax regression. Note that Equation (11) can be viewed as a projection of the input feature to a feature computable from the value of $y$, which is the feature most correlated with the input feature. The solution is given by left multiplication by the $\tilde{B}$ matrix, which we refer to as forward feature projection.
While we assume the continuous input
We then consider the “backward feature projection” problem, which attempts to find an informative feature that minimizes the loss (10) for given weights and bias. In particular, we can show that the solution of this backward feature projection is precisely symmetric to the forward one.
Theorem 3. For fixed and the optimal feature function where
Proof. See Appendix D. □
Finally, when both $s$ and $(v, b)$ (and hence ) can be designed, the optimal corresponds to the low-rank factorization of $\tilde{B}$, and the solutions coincide with the universal feature selection.
Theorem 4. The optimal solutions for weights and bias to minimize (
Proof. See Appendix E. □
Therefore, we conclude that the learning of neural networks, when both $s$ and $(v, b)$ are designable, is to extract the most correlated aspects of the input data $X$ and the label $Y$, which are the informative features for data inference given by universal feature selection. In the practical learning process of DNNs, BackProp updates the weights of the softmax layer and those of the previous layer(s) in an iterative manner. As we have illustrated in Lemma 3, such iterative updates will converge to the same solution as alternating between the forward feature projection (11) and the backward feature projection (13), which is indeed the power method to solve the SVD of $\tilde{B}$ [41], also known as the Alternating Conditional Expectation (ACE) algorithm [36]. From Theorem 4, for a neural network with sufficient expressive power, the trained feature depends only on the distribution of the input data rather than the training process. It is worth mentioning that this result does not contradict the observation that the trained weights in hidden layers can differ between training runs. In fact, due to the over-parameterized nature of practical network designs, there exist multiple choices of weights in hidden layers that express the same optimal feature.
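The alternation between forward and backward feature projections can be illustrated with a small ACE-style iteration on a known joint distribution. The sketch below covers only the one-dimensional case, replaces the covariance factors by an explicit standardization step, and uses a hand-picked $3 \times 3$ joint distribution; the names `standardize` and `ace` are hypothetical and not taken from the authors' code.

```python
import numpy as np

def standardize(h, p):
    """Center and scale h to zero mean and unit variance under the pmf p."""
    h = h - np.sum(p * h)
    return h / np.sqrt(np.sum(p * h ** 2))

def ace(p_xy, n_iter=200, seed=0):
    """Alternating conditional expectations on a joint pmf stored as p_xy[x, y]."""
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
    rng = np.random.default_rng(seed)
    f = standardize(rng.standard_normal(len(p_x)), p_x)
    for _ in range(n_iter):
        g = standardize(p_xy.T @ f / p_y, p_y)  # forward step: g(y) from E[f(X) | Y = y]
        f = standardize(p_xy @ g / p_x, p_x)    # backward step: f(x) from E[g(Y) | X = x]
    rho = float(np.sum(p_xy * np.outer(f, g)))  # E[f(X) g(Y)]
    return f, g, rho

# Hand-picked joint distribution (made-up numbers that sum to one).
p_xy = np.array([[0.20, 0.05, 0.05],
                 [0.05, 0.20, 0.05],
                 [0.05, 0.05, 0.30]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
f, g, rho = ace(p_xy)

# Reference value: largest singular value of the canonical dependence matrix.
prod = np.outer(p_x, p_y)
B = ((p_xy - prod) / np.sqrt(prod)).T
sigma_1 = np.linalg.svd(B, compute_uv=False)[0]
```

In this example the correlation reached by the iteration matches the largest singular value of the CDM, which is what one expects from a power method. The convergence speed depends on the gap between the first two singular values.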

3.2.2. Network with Restricted Expressive Power

The analysis of the previous section has considered neural networks with ideal expressive power, where the feature can be selected as any desired function. In general, however, the form of feature functions that can be generated is often limited by the network structure. In the following, we consider networks with restricted expressive power to characterize the impact of network structure on the extracted feature. For illustration, we consider a neural network with a hidden layer of $k$ nodes, and a zero-mean continuous input $t$ to this hidden layer, where $t$ is assumed to be a function of some discrete variable $X$. Our goal is to analyze the weights and bias in this layer from the labeled samples. Assuming that the activation function of the hidden layer is a generally smooth function $\sigma(\cdot)$, the output of the $z$-th hidden node is
$$s_z(x) = \sigma\bigl(w^{\mathrm{T}}(z) t(x) + c(z)\bigr), \qquad (15)$$
where $w(z)$ and $c(z)$ are the weights and bias from the input layer to the hidden layer as shown in Figure 2. We denote $s = [s_1, \dots, s_k]^{\mathrm{T}}$ as the input vector to the output classification layer.

Figure 2

A multi-layer neural network, where the expressive power of the feature mapping is restricted by the hidden representation t. All hidden layers previous to t are fixed, represented by the “pre-processing” module.

To interpret the feature selection in hidden layers, we fix $(v, b)$ at the output layer and consider the problem of designing $(w, c)$ to minimize the loss function (8) at the output layer. Ideally, we should have picked $w$ and $c$ to generate $s$ to match $s^*$ from (14), which minimizes the loss. However, here we have the constraint that $s$ must take the form of (15) and, intuitively, the network should select $(w, c)$ so that $s$ is close to $s^*$. Our goal is to quantify the notion of such closeness.
To develop insights on feature selection in hidden layers, we again focus on the local analysis regime, where the weights and bias are assumed to satisfy the local constraint Then, since $t$ is zero-mean, we can express (15) as Moreover, we define a matrix with the th entry , which can be interpreted as a generalized CDM for the hidden layer. Furthermore, we denote as the information vector of with the matrix defined as , and we also define The following theorem characterizes the loss (8).
Theorem 5. Given the weights and bias where
Proof. See Appendix F. □
Equation (20) quantifies the closeness between $s$ and $s^*$ in terms of the loss (8). Then, our goal is to minimize (20), which can be separated into two optimization problems, (21) and (22). Note that the optimization problem (21) is similar to the one that appeared in Lemma 3, and the optimal solution is given by . Therefore, solving for the optimal weights in the hidden layer can be interpreted as projecting $s^*$ onto the subspace of feature functions spanned by $t$ to find the closest expressible function. In addition, the problem (22) is to choose (and hence the bias $c$) to minimize the quadratic term similar to in (10). Similar to the analyses of parameters in the last layer, we can obtain analytical solutions for hidden layer parameters, e.g., and , with detailed discussions provided in Appendix G. Overall, we observe the correspondence between (11), (14), and (21), (22), and interpret both operations as feature projections.
Our argument can be generalized to any intermediate layer in a multi-layer network, with all the previous layers viewed as the fixed pre-processing that specifies $t$, and all the layers after determining . Then, the iterative procedure in back-propagation can be viewed as an alternating projection that finds the fixed-point solution over the entire network. This final fixed-point solution, even under the local assumption, might not be the SVD solution as in Theorem 4. This is because the limited expressive power of the network often makes it impossible to generate the desired feature function. In such cases, the concept of feature projection can be used to quantify this gap, and thus to measure the quality of the selected features.

3.3. Scoring Neural Networks

Given a learning problem, it is useful to tell whether or not some extracted features are informative [42]. Our previous development naturally gives rise to a performance metric.
Definition (H-score). Given a feature In addition, for given
The H-score can be used to measure the quality of features generated at any intermediate layer of the network. It is related to (20) when choosing the optimal bias and as the identity matrix. This can be understood as taking the output $s$ of this layer and directly feeding it to a softmax output layer with $v$ used as the weights, and measuring the resulting performance. Note that here $v$ can be an arbitrary function of $Y$, not necessarily the weights on the next layer computed by the network. When the optimal $v^*$ as defined in (12) is used, the resulting performance becomes the one-sided H-score $H(s)$, which measures the quality of $s$. In addition, by comparing (26) with (7), the performance measure also coincides with the information metric (7), up to a scale factor. Specifically, for a given dataset and a feature extractor that generates $s$, the H-score $H(s)$ can be efficiently computed from the second equation of (26). In addition, when we use the H-score to compare the performance of different feature extractors (models), the model complexity has to be taken into account to reduce overfitting. To this end, we adopt the Akaike information criterion (AIC) and define the AIC-corrected H-score
$$H_{\mathrm{AIC}}(s) \triangleq H(s) - \frac{n_p}{n_s} \qquad (27)$$
for comparing different models, where $n_p$ and $n_s$ represent the number of parameters in the model and the training sample size, respectively.
In current practice, the cross-entropy is often used as the performance metric. One can, in principle, also use the log-loss to measure the effectiveness of the selected feature at the output of an intermediate layer [42]. However, one problem with this metric is that, for a given problem, it is not clear what value of log-loss one should expect, as the log-loss is generally unbounded. In contrast, the H-score can be directly computed from the data samples and has a clear upper bound. Indeed, it follows from Lemma 1 that, for a $k$-dimensional feature $s$ and weights $v$, we have the sequence of inequalities
$$H(s, v) \le H(s) \le \frac{1}{2} \sum_{i=1}^{k} \sigma_i^2 \le \frac{k}{2},$$
where $\sigma_i$ indicates the $i$th singular value of the CDM $\tilde{B}$. In particular, the first “≤” follows from the definition (24), and the gap between $H(s, v)$ and $H(s)$ measures the optimality of the weights $v$; the second “≤” follows from the first equality of (26), and the gap between the two sides characterizes the difference between the chosen feature and the optimal solution, which is a useful measure of how restrictive the network structure is (i.e., its lack of expressive power); the last “≤” follows from the fact that $\sigma_i \le 1$ (cf. Lemma 1), and the middle term measures the dependency between the data variable and the label for the given dataset. In Section 3.4.3, we validate this metric on real data.
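To make the computation from data samples concrete, the following sketch estimates a one-sided H-score from a feature array and integer labels. It assumes the commonly used form $H(s) = \tfrac{1}{2} \operatorname{tr}\bigl(\mathrm{cov}(s)^{-1} \mathrm{cov}(\mathbb{E}[s \mid Y])\bigr)$; the factor $\tfrac{1}{2}$, the use of a pseudo-inverse, and the function name `h_score` are assumptions of this illustration rather than details confirmed by the text above.

```python
import numpy as np

def h_score(features, labels):
    """Estimate 0.5 * tr(cov(s)^-1 cov(E[s | Y])) from samples (assumed form)."""
    s = np.asarray(features, float)
    y = np.asarray(labels)
    s = s - s.mean(axis=0)                  # center the features
    cov_s = s.T @ s / len(s)                # feature covariance
    cov_cond = np.zeros_like(cov_s)
    for c in np.unique(y):
        mask = y == c
        m = s[mask].mean(axis=0)            # class-conditional mean of the centered feature
        cov_cond += mask.mean() * np.outer(m, m)
    # The pseudo-inverse guards against a singular feature covariance.
    return 0.5 * float(np.trace(np.linalg.pinv(cov_s) @ cov_cond))

labels = np.array([0, 0, 1, 1])
informative = np.array([[1.0], [1.0], [-1.0], [-1.0]])    # a function of the label
uninformative = np.array([[1.0], [-1.0], [1.0], [-1.0]])  # same mean in both classes
h_hi = h_score(informative, labels)
h_lo = h_score(uninformative, labels)
```

For a one-dimensional feature that is a deterministic function of the label, this estimate reaches $k/2 = 0.5$, while a feature whose class-conditional means coincide scores zero.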

3.4. Experiments

This section presents experiments for validating our theoretical characterizations, with corresponding code available at https://github.com/XiangxiangXu/dnn (accessed on 7 December 2021). Specifically, all DNN models used in Section 3.4.3 are available at https://keras.io/applications/ (accessed on 7 December 2021).

3.4.1. Experimental Validation of Theorem 4

We first validate Theorem 4, which characterizes the optimal feature extracted by a network with ideal expressive power. Here, we consider discrete data with alphabet sizes and , and construct the network as shown in Figure 3. Specifically, the network input is the one-hot encoding of $X$, i.e., $[\mathbb{1}\{X = 1\}, \dots, \mathbb{1}\{X = |\mathcal{X}|\}]^{\mathrm{T}}$, where $\mathbb{1}\{X = x\}$ takes one if and only if $X = x$, and takes zero otherwise. Then, the feature $s$ is generated by a linear layer, with the sigmoid function used as the activation function. For ease of comparison and presentation, we set the feature dimension to $k = 1$, since otherwise the optimal feature (cf. Theorem 4) lies in a subspace and is non-unique. It can be verified that this network has ideal expressive power, i.e., with proper weights in the first layer, $s$ can express any desired function up to scaling and shifting.
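The claim about expressive power can be checked directly: with a one-hot input, each weight of the linear layer controls the output for exactly one symbol. The sketch below picks an arbitrary target function on a four-symbol alphabet, maps it affinely into the range of the sigmoid, and solves for the weights. The alphabet size, the target values, and the interval $[0.1, 0.9]$ are made-up choices for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_hot(x, n):
    """One-hot encode integer symbols x in {0, ..., n - 1}."""
    return np.eye(n)[x]

# Arbitrary target feature on an alphabet of size 4 (made-up values).
target = np.array([-1.5, 0.2, 0.7, 3.0])

# Scale and shift the target into [0.1, 0.9], inside the sigmoid's range.
scale = 0.8 / (target.max() - target.min())
shift = 0.1 - scale * target.min()
squashed = scale * target + shift

# With one-hot inputs, weight x alone determines the output for symbol x,
# so the weights are the logits of the squashed target values.
w = np.log(squashed / (1.0 - squashed))

x = np.arange(4)
s = sigmoid(one_hot(x, 4) @ w)       # network output for every symbol
recovered = (s - shift) / scale      # undo the scaling and shifting
```

The recovered values agree with the target, so the network output equals the desired function up to the affine map, as stated above.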

Figure 3

A simple neural network with ideal expressive power, which can generate any $k$-dimensional feature $s$ of $X$ by tuning the weights in the first layer.

To compare the result trained by the neural network with that in Theorem 4, we first randomly generate a distribution $P_{XY}$, and then independently draw 100,000 pairs of $(X, Y)$ samples. We then train the network using batch gradient descent, where we have applied Nesterov momentum [43] with the momentum hyperparameter being 0.9. In addition, we set the learning rate to 4 with a decay factor of 0.01 and clip gradients with norm exceeding 0.5. After training, the learned values of $s$, $v$, and $b$ are shown in Figure 4 and compared with the theoretical results. From the figure, we can observe that the training results match our theoretical analyses.

Figure 4

The trained feature s, weights v, and bias b of the network in Figure 3, which are compared with the corresponding theoretical results to show their coincidences.

3.4.2. Experimental Validation of Theorem 5

In addition, we validate Theorem 5 using the neural network depicted in Figure 5, with the same settings of . Specifically, the numbers of neurons in the hidden layers are set to and , where $t$ is randomly generated from $X$, and we have chosen the sigmoid function as the activation function to generate $s$. We then fix the weights and bias $(v, b)$ at the output layer and train the weights $w$ and bias $c$ in the hidden layer to optimize the log-loss. Specifically, we use batch gradient descent with the Nesterov momentum hyperparameter being 0.9. In addition, we set the learning rate to 4 with a decay factor of and clip gradients with norm exceeding 0.1. After training, Figure 6 shows the matching between the learned results and the corresponding theoretical values.

Figure 5

The designed network for validating the impact of network structure on feature extraction, with and neurons in two hidden layers. Our goal is to compare the learned weights $w$ and bias $c$ in the hidden layer with our theoretical characterizations in Section 3.2.2.

Figure 6

The trained weights w and bias c of the network in Figure 5, which are compared with the corresponding theoretical results to show their coincidences.

3.4.3. Experimental Validation of H-Score

To validate the H-score as a performance measure for extracted features, we compare the H-score and classification accuracy of DNNs on image classification tasks. Specifically, we use the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [22] dataset and extract features using several deep neural networks with representative architecture designs [44,45,46,47,48,49]. After training the feature extractors on the ILSVRC2012 training set, we compute the H-score of the feature in the last hidden layer, as well as the classification accuracy on the ILSVRC2012 validation set (here, we use the ILSVRC2012 validation set for testing, as the labels of the ILSVRC2012 test set have not been publicly released). The results are summarized in Table 1, where $H_{\mathrm{AIC}}(s)$ is the AIC-corrected H-score as defined in (27), with $n_p$ being the number of model parameters, and $n_s$ = 1,300,000 corresponding to the number of training samples in ImageNet. The AIC-corrected H-score is largely consistent with the classification accuracy: among the nine models in Table 1, the only pair ranked differently by the two measures is Xception and InceptionV3. This supports the effectiveness of the H-score as a measure for neural networks.

Table 1

Classification accuracy and H-score for different DNN models on the ImageNet dataset, where “Paras” indicates the number of parameters (in millions) in the model and $H_{\mathrm{AIC}}(s)$ represents the AIC-corrected H-score.

DNN Model	Paras [×10^6]	H(s)	H_AIC(s)	Accuracy [%]
VGG16 [44]	138.4	148.3	41.9	64.2
VGG19 [44]	143.7	152.7	42.2	64.7
MobileNet [45]	4.3	45.9	42.6	68.4
DenseNet121 [46]	8.1	59.5	53.3	71.4
DenseNet169 [46]	14.3	81.2	70.2	73.6
DenseNet201 [46]	20.2	89.1	73.5	74.4
Xception [47]	22.9	179.8	162.2	77.5
InceptionV3 [48]	23.9	181.2	162.9	76.3
InceptionResNetV2 [49]	55.9	241.1	198.1	79.1
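The relation between the columns of Table 1 can be checked with a few lines of Python. The sketch below assumes that the AIC correction subtracts the ratio of the parameter count to the training sample size (about 1.3 million) from $H(s)$; this assumed form reproduces the tabulated values to within rounding. The rows are copied from Table 1, and the variable names are hypothetical.

```python
# (model, parameters in millions, H(s), reported H_AIC(s), accuracy in %), from Table 1
rows = [
    ("VGG16", 138.4, 148.3, 41.9, 64.2),
    ("VGG19", 143.7, 152.7, 42.2, 64.7),
    ("MobileNet", 4.3, 45.9, 42.6, 68.4),
    ("DenseNet121", 8.1, 59.5, 53.3, 71.4),
    ("DenseNet169", 14.3, 81.2, 70.2, 73.6),
    ("DenseNet201", 20.2, 89.1, 73.5, 74.4),
    ("Xception", 22.9, 179.8, 162.2, 77.5),
    ("InceptionV3", 23.9, 181.2, 162.9, 76.3),
    ("InceptionResNetV2", 55.9, 241.1, 198.1, 79.1),
]
n_s = 1.3  # number of training samples, in millions

# Assumed correction: H_AIC(s) = H(s) - n_p / n_s.
recomputed = [h - n_p / n_s for _, n_p, h, _, _ in rows]
max_gap = max(abs(r - row[3]) for r, row in zip(recomputed, rows))

# Pairs of models that H_AIC(s) and accuracy rank in opposite order.
discordant = [
    (a[0], b[0])
    for i, a in enumerate(rows)
    for b in rows[i + 1:]
    if (a[3] - b[3]) * (a[4] - b[4]) < 0
]
```

The recomputed values stay within 0.1 of the reported ones, and the pairwise comparison shows where the ranking by the AIC-corrected H-score and the ranking by accuracy disagree.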

4. Discussion

Our characterization gives an information-theoretic interpretation of the feature extraction process in DNNs, which also provides a practical performance measure for scoring neural networks. Different from empirical studies focusing on specific datasets [7], our development is based on the probability distribution space, which is more general and can also provide theoretic insights. Moreover, the information-theoretic framework allows us to obtain direct operational meaning and better interpretations for the solutions, compared with optimization-based theoretical characterizations, e.g., [11,13]. As a first step in establishing a rigorous framework for DNN analysis, the present work can be extended in both theoretical and practical aspects. From the theoretical perspective, one extension is to investigate the analytical properties for general DNNs, using the theoretic insights obtained from local analysis regime. For example, it was shown in [50] that the symmetry between feature and weights in DNNs established in the local analysis regime (cf. Section 3.2.1) also holds for general probability distributions. Another extension is to apply the framework to investigate the optimal feature for structured data or network, e.g., data with sparsity structure [51]. From the practical perspective, in addition to the demonstrated example of evaluating existing DNN models (cf. Section 3.4.3), the H-score can also be used as an objective function in designing learning algorithms. In particular, such usages have been illustrated in multi-modal learning [52] and transfer learning [53] tasks.

5. Conclusions

In this paper, we apply the local information geometric analysis and provide an information-theoretic interpretation to the feature extraction scheme in DNNs. We first establish an information metric for features in inference tasks by formalizing the information-theoretic feature selection problem. In addition, we demonstrate that the features extracted by DNNs coincide with the information-theoretically optimal feature, with the same metric measuring the performance of features, called H-score. Furthermore, we discuss the usage of the H-score for measuring the effectiveness of DNNs. Our framework demonstrates a connection between the practical deep learning implementations and information-theoretic characterizations, which can provide theoretical insights for DNN analysis and learning algorithm designs.

4 in total

1. Mastering the game of Go with deep neural networks and tree search.

Authors: David Silver; Aja Huang; Chris J Maddison; Arthur Guez; Laurent Sifre; George van den Driessche; Julian Schrittwieser; Ioannis Antonoglou; Veda Panneershelvam; Marc Lanctot; Sander Dieleman; Dominik Grewe; John Nham; Nal Kalchbrenner; Ilya Sutskever; Timothy Lillicrap; Madeleine Leach; Koray Kavukcuoglu; Thore Graepel; Demis Hassabis
Journal: Nature Date: 2016-01-28 Impact factor: 49.962

2. Long short-term memory.

Authors: S Hochreiter; J Schmidhuber
Journal: Neural Comput Date: 1997-11-15 Impact factor: 2.026

3. A mean field view of the landscape of two-layer neural networks.

Authors: Song Mei; Andrea Montanari; Phan-Minh Nguyen
Journal: Proc Natl Acad Sci U S A Date: 2018-07-27 Impact factor: 11.205
