Nicolas Tchitchek, José Felipe Golib Dzib, Brice Targat, Sebastian Noth, Arndt Benecke, Annick Lesne.
Abstract
The problem of identifying differential activity, such as in gene expression, is a major challenge in biostatistics and bioinformatics. Equally important, though much less frequently studied, is the question of similar activity from one biological condition to another. The fold-change, or ratio, is usually considered a relevant criterion for stating difference and similarity between measurements. Importantly, no statistical method for concomitant evaluation of similarity and distinctness currently exists for biological applications. Modern microarray, digital PCR (dPCR), and Next-Generation Sequencing (NGS) technologies frequently provide a means of estimating the coefficient of variation for individual measurements. Using the fold-change, and assuming that measurements are normally distributed with known variances, we designed a novel statistical test that detects differentially and similarly expressed genes concomitantly, that is, within the same formalism (http://cds.ihes.fr). Given two sets of gene measurements in different biological conditions, the probabilities of making type I and type II errors in stating that a gene is differentially or similarly expressed from one condition to the other can be calculated. Furthermore, a confidence interval for the fold-change can be delineated. Finally, we demonstrate that the assumption of normality can be relaxed to consider arbitrary distributions numerically. The Concomitant evaluation of Distinctness and Similarity (CDS) statistical test correctly estimates similarities and differences between measurements of gene expression. The implementation, being time and memory efficient, allows the use of the CDS test in high-throughput data analysis such as microarray, dPCR, and NGS experiments.
Importantly, the CDS test can be applied to the comparison of single measurements (N=1) provided the variance (or coefficient of variation) of the signals is known, making CDS a valuable tool also in biomedical analysis, where typically a single measurement per subject is available.
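The idea of judging difference and similarity with one fold-change criterion can be illustrated with a small numerical sketch. The code below is not the CDS test and does not compute the authors' Q0 or Q values; it is a Monte Carlo illustration under stated assumptions: each signal is treated as normally distributed around its single measured value with a known standard deviation, two signals count as "similar" when their ratio lies in [1/theta, theta], and the interval reported is a simple percentile interval. The function name `fold_change_probabilities`, its parameters, and the default theta = 2 are hypothetical choices made for this example.

```python
import numpy as np


def fold_change_probabilities(x1_hat, x2_hat, sd1, sd2, theta=2.0,
                              n_draws=200_000, seed=0):
    """Monte Carlo illustration only (NOT the CDS implementation).

    Assumes each signal is normal around its measured value with a known
    standard deviation, and calls two signals "similar" when their ratio
    lies in [1/theta, theta].
    """
    if theta <= 1.0:
        raise ValueError("theta must be greater than 1")
    rng = np.random.default_rng(seed)
    x1 = rng.normal(x1_hat, sd1, n_draws)
    x2 = rng.normal(x2_hat, sd2, n_draws)

    # A ratio is only meaningful for positive signals; drop other draws.
    keep = (x1 > 0) & (x2 > 0)
    if not keep.any():
        raise ValueError("no positive draws; inputs are too close to zero")
    ratio = x1[keep] / x2[keep]

    # Fraction of simulated fold-changes inside the similarity band.
    p_similar = float(np.mean((ratio >= 1.0 / theta) & (ratio <= theta)))

    # Percentile interval for the simulated fold-change (assumed 95% level).
    low, high = np.percentile(ratio, [2.5, 97.5])
    return {
        "p_similar": p_similar,
        "p_different": 1.0 - p_similar,
        "fc_interval": (float(low), float(high)),
    }


# Single measurements (N=1) with a known coefficient of variation (CV):
# the standard deviation is taken as CV times the measured value.
cv = 0.05  # hypothetical 5% CV, chosen for illustration
low_variability = fold_change_probabilities(400.0, 100.0, cv * 400.0, cv * 100.0)
high_variability = fold_change_probabilities(150.0, 100.0, 60.0, 60.0)
print(low_variability)
print(high_variability)
```

With low variability the simulated fold-changes fall almost entirely on one side of the criterion, whereas with high variability neither difference nor similarity is clearly supported; these correspond to the kinds of scenarios described in the Figure 1 caption.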
Year: 2012 PMID: 22917185 PMCID: PMC5054499 DOI: 10.1016/j.gpb.2012.06.002
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Figure 1. Graphical representation of the problem and the encountered scenarios. A. Expression signals of a single gene in two different biological conditions, with normal distributions having the parameters x1 and x2 (mean values) and σ1 and σ2 (variances). The fold-change criterion defining difference or similarity is represented by a conic section defined by the parameter θ. The problem is to determine the value of (x1,x2) given the values of their estimators. B. Potential scenario for the statistical test: differential expression and low variability. C. Potential scenario: low variability and similarly expressed genes. D. Potential scenario for the statistical test: no statistical significance and high variabilities.
Figure 2. Representation of the different regions and A. Region R shown in blue with . B. Region R shown in blue with . C. Region shown in red with , and . D. Region shown in red with , and .
Figure 3. Test behavior validation. In silico simulations using standard normal distributed data with parameters x1, x2, σ1, σ2 capture three different situations, shown in A–C. A. Case of a differentially expressed gene having a significant Q0 value but a high Q value. B. The opposite case, being statistically similar. C. Case where neither Q0 nor Q displays statistical significance. Our method is tested on a real biological dataset (panels D and E), showing the correct behavior. D. Values of Q0 change directly as a function of the parameter: as we increase the parameter, the values of Q0 are higher. This means that the more we increase the parameter, the fewer Q0 values we have for a given bin of the Q0 histogram. E. Values of Q change inversely as a function of the parameter: as we increase the parameter, the values of Q are lower. This means that the more we increase the parameter, the more Q values we have for a given bin of the Q histogram.
Figure 4. Experimental validation of our CDS statistical test. Venn diagrams of the differentially expressed genes when comparing three different biological conditions are shown in panels A–C. A. Comparing differential expression across the three comparisons is the usual case. With the CDS method we can capture more cases, for instance those shown in panels B and C. B. Comparing differential expression in two subtractions and similarity in one subtraction. C. Comparing differential expression in one subtraction and similarity in two subtractions. Panels D–F show examples of genes detected using both the Q0 and Q values issued from our method. D. Difference in the three biological conditions. E. Similarity between two biological conditions. F. Similarity in two comparisons and difference among the three biological conditions.