Kewei Wang, Wenji Wang, Mang Li.
Abstract
Genomics, proteomics, drug screening, medicinal chemistry, and related fields generate large volumes of biological and experimental data. Such data must be analyzed with specialized methods from statistics, bioinformatics, and computer science. Big data analysis is an effective way to build scientific hypotheses and explore underlying mechanisms. Here, gene expression is taken as an example to illustrate the basic procedure of big data analysis.
Keywords: PCA analysis; big data analysis; cluster analysis; microarray; regression model
Year: 2018 PMID: 30891564 PMCID: PMC6388068 DOI: 10.1002/ame2.12028
Source DB: PubMed Journal: Animal Model Exp Med ISSN: 2576-2095
Figure 1. Data sources of gene expression derived from different detection methods
Different chip categories of GEO microarrays in the NCBI database
| Dataset | Year | Gene number | Name |
|---|---|---|---|
| 1 | 2007 | 33297 | Affymetrix Human Gene 1.0 ST array |
| 2 | 2003 | 54675 | Affymetrix Human Genome U133 Plus 2.0 Array |
| 3 | 2002 | 22283 | Affymetrix Human Genome U133A Array |
| 4 | 2002 | 8793 | Affymetrix Human HG‐Focus Target Array |
| 5 | 2007 | 20228 | Agilent‐012097 Human 1A Microarray (V2) G4110B |
| 6 | 2006 | 45220 | Agilent‐014850 Whole Human Genome Microarray 4×44K G4112F |
| 7 | 2008 | 20589 | Sentrix Human Ref‐8 v2 Expression BeadChip |
| 8 | 2008 | 24526 | Illumina HumanRef‐8 v3.0 expression beadchip |
Logarithmically transformed data (natural-log ratio; reference series: GSE9539)
| Gene | 4/100 Fold | 8/100 Fold | 12/100 Fold | 24/100 Fold | 4/200 Fold | 8/200 Fold | 12/200 Fold |
|---|---|---|---|---|---|---|---|
| BAD | 0.9330 | 0.9560 | 0.9390 | 0.9930 | 0.9600 | 0.9770 | 0.9800 |
| BAX | 1.0710 | 0.9640 | 1.0070 | 0.9390 | 1.0970 | 0.9430 | 0.9870 |
| BCL2 | 1.0730 | 0.9490 | 0.9060 | 1.0620 | 1.2360 | 0.7870 | 1.2920 |
| BCL2A1 | 0.9320 | 1.3620 | 0.9410 | 0.9280 | 0.9090 | 0.9110 | 0.8130 |
| BCL2L11 | 1.0740 | 0.9790 | 1.1810 | 0.9670 | 0.9880 | 1.0160 | 1.1500 |
| BCL2L13 | 1.1290 | 0.9720 | 1.0990 | 1.0170 | 1.1080 | 0.9930 | 1.0610 |
| BCL2L2 | 1.1300 | 1.0480 | 1.0410 | 1.0650 | 1.1120 | 1.0530 | 1.0290 |
| BIK | 1.0440 | 0.9560 | 0.9770 | 1.0120 | 0.9920 | 0.9560 | 1.0480 |
| BOK | 0.9000 | 0.9350 | 0.9300 | 0.9570 | 0.9620 | 0.9320 | 0.8810 |
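
As an illustration of the kind of transformation the table title refers to, here is a minimal sketch of a natural-log ratio between a sample signal and a reference signal. The source does not state how its ratios were computed, so the function name `log_ratio`, the choice of reference, and the intensity values are all hypothetical.

```python
import math


def log_ratio(sample_signal, reference_signal):
    """Natural-log ratio of a sample signal to a reference signal.

    Assumption: both signals are positive intensities; the source does
    not specify its exact formula, so this is only one plausible form.
    """
    if sample_signal <= 0 or reference_signal <= 0:
        raise ValueError("signals must be positive to take a logarithm")
    return math.log(sample_signal / reference_signal)


# Hypothetical intensities for one gene (not taken from GSE9539).
reference = 100.0
doubled = log_ratio(200.0, reference)    # expression doubled
unchanged = log_ratio(100.0, reference)  # no change gives 0.0
halved = log_ratio(50.0, reference)      # expression halved
```

On this scale a doubling and a halving are symmetric around zero, which is one common reason for taking logarithms of expression ratios before further analysis.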
Transformed data by zero‐mean normalization
| Gene symbol | 153‐T0‐1 | 153‐T0‐2 | 153‐T0‐3 | 153‐T0.5‐1 | 153‐T0.5‐2 | 153‐T0.5‐3 | 153‐T1.5‐1 | 153‐T1.5‐1 |
|---|---|---|---|---|---|---|---|---|
| TNFSF10 | −0.5237 | −0.6558 | −0.4203 | −0.5577 | −0.4949 | −0.6140 | 2.5225 | 3.2670 |
| TNF | 0.2021 | −0.3106 | 0.0823 | −0.2435 | 0.2932 | −0.4927 | −0.5215 | 1.5486 |
| TNFSF12 | 0.3596 | 1.0538 | 0.1945 | 0.3850 | −0.0340 | 1.8283 | −0.9779 | 1.9087 |
| FAS | −1.0077 | −0.0496 | −0.5411 | 0.0521 | −0.8960 | −0.1018 | −1.3602 | −1.1021 |
| TNFRSF10B | 1.8797 | 1.6844 | 0.9053 | 1.9502 | 1.0238 | 1.2344 | 0.0413 | −0.6137 |
| TNFRSF10A | −0.9825 | 0.0200 | −0.6054 | −0.5993 | −1.0315 | −0.6330 | 0.1672 | −0.5839 |
| FADD | 0.1915 | 0.2694 | 0.8454 | −0.5752 | −0.4390 | −0.5752 | −0.8126 | −0.9527 |
| TNFRSF1A | 0.0443 | 0.7225 | 0.0773 | 0.0733 | 0.3874 | −0.4379 | −0.4959 | 0.0333 |
| TNFRSF10D | 0.4991 | 0.6449 | 1.1089 | 0.3600 | 0.6891 | 0.5102 | −0.2631 | −0.2520 |
| TNFRSF10C | −0.5613 | −0.0915 | 1.8041 | 0.2629 | −0.5284 | −0.6603 | 0.6667 | 1.7877 |
| TRAF2 | −0.3673 | −0.5960 | −0.4435 | 0.6127 | −0.3728 | 0.8686 | −0.4163 | −1.6087 |
| TRADD | −0.1146 | −0.9489 | −0.9365 | −1.2715 | −1.5972 | −0.8590 | 0.2048 | 1.0081 |
| TNFRSF11B | −0.8241 | −0.4364 | −0.5804 | −0.5804 | −1.1564 | −0.5804 | 0.1729 | 0.1729 |
| TRAF1 | 0.0494 | 0.7638 | −0.0038 | −1.4858 | −1.2730 | −1.7290 | 0.0190 | −0.6498 |
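
Zero-mean normalization is commonly implemented as a z-score: subtract the mean and divide by the standard deviation. The sketch below assumes that form, assumes the normalization is applied to one gene across its samples, and uses the population standard deviation; none of these choices is confirmed by the source. The function name and the raw values are hypothetical.

```python
from statistics import mean, pstdev


def zero_mean_normalize(values):
    """Return z-scores, (x - mean) / standard deviation, for a list.

    Assumption: population standard deviation (pstdev). A constant
    input has no spread, so it is mapped to all zeros.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(x - mu) / sigma for x in values]


# Hypothetical raw intensities for one gene across six samples.
raw = [120.0, 135.0, 128.0, 410.0, 395.0, 402.0]
z = zero_mean_normalize(raw)
```

After this step the values have mean 0 and standard deviation 1, so genes measured on very different intensity scales become comparable in later clustering or PCA.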
Figure 2. Cluster analysis of gene expression. Different colors represent clusters with similar characteristics based on geometric distance
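
To make the idea of distance-based clustering concrete, here is a minimal sketch of agglomerative clustering with Euclidean distance and average linkage. The source does not say which clustering algorithm or linkage produced Figure 2, so this is a swapped-in illustrative technique; the function names and expression profiles are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean (geometric) distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(profiles[i], profiles[j]) for i in c1 for j in c2]
    return sum(dists) / len(dists)


def agglomerative(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of row indices.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return clusters


# Hypothetical normalized profiles: four genes measured in four samples.
profiles = [
    [2.0, 2.1, -1.0, -1.1],
    [1.9, 2.2, -0.9, -1.0],
    [-1.0, -1.2, 2.0, 2.1],
    [-1.1, -0.9, 1.8, 2.2],
]
clusters = agglomerative(profiles, n_clusters=2)
```

With these toy profiles the first two genes end up in one cluster and the last two in another, because each pair has nearly identical expression patterns.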
Figure 3. PCA of gene expression. The horizontal and vertical axes show the two‐dimensional distribution of the first two principal components. The numbers in the figure denote different samples
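
The projection onto the first two components can be sketched as follows. This is a minimal, dependency-free illustration that finds the two leading eigenvectors of the covariance matrix by power iteration with deflation; it is not the procedure used to produce Figure 3, and the function names and sample values are hypothetical. In practice a numerical library would normally be used instead.

```python
import math


def center(rows):
    """Subtract each column mean so every variable has zero mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of centered rows (samples x variables)."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
            for i in range(p)]


def power_iteration(mat, iters=500):
    """Leading eigenvalue and unit eigenvector of a symmetric matrix."""
    p = len(mat)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-uniform start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    mv = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
    eigval = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return eigval, v


def pca_2d(rows):
    """Return [(variance, direction), ...] for PC1 and PC2, plus scores."""
    x = center(rows)
    cov = covariance(x)
    p = len(cov)
    comps = []
    for _ in range(2):
        lam, v = power_iteration(cov)
        comps.append((lam, v))
        # Deflation: remove the component just found from the matrix.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    scores = [[sum(a * b for a, b in zip(row, v)) for _, v in comps]
              for row in x]
    return comps, scores


# Hypothetical data: four samples (rows) by three genes (columns).
samples = [
    [2.0, 1.9, 0.1],
    [2.2, 2.1, -0.1],
    [-1.9, -2.0, 0.2],
    [-2.1, -2.2, -0.2],
]
components, scores = pca_2d(samples)
```

Each row of `scores` gives the coordinates of one sample on the first two components, which is what a two-dimensional PCA plot displays.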
Figure 4. The sigmoid curve is a typical pattern of biological response, with values in the probability range [0, 1]. A logistic (logit) regression model estimates the probability of a binary response from one or more independent variables.
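
The relationship between the sigmoid curve and logistic regression can be sketched as below: a linear combination of the independent variables is passed through the sigmoid to yield a probability. The coefficients and inputs are hypothetical and are not fitted to any data in the source; the function names are illustrative only.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real z to the range [0, 1].

    Written in two branches so that large |z| does not overflow exp().
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(x, intercept, coefficients):
    """Probability of the positive class under a logistic model."""
    z = intercept + sum(c * xi for c, xi in zip(coefficients, x))
    return sigmoid(z)


# Hypothetical model: two predictors (e.g. two normalized expression values).
intercept = -0.5
coefficients = [1.2, -0.8]

p_low = predict_probability([-1.0, 1.0], intercept, coefficients)
p_high = predict_probability([2.0, -1.0], intercept, coefficients)
```

Here `sigmoid(0)` equals 0.5, the midpoint of the curve, and the predicted probability rises monotonically with the linear score `z`.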