Sriram Chockalingam, Maneesha Aluru, Srinivas Aluru.
Abstract
Pre-processing of microarray data is a well-studied problem. Furthermore, all popular platforms come with their own recommended best practices for differential analysis of genes. However, for genome-scale network inference using microarray data collected from large public repositories, these methods filter out a considerable number of genes. This is primarily due to the effects of aggregating a diverse array of experiments with different technical and biological scenarios. Here we introduce a pre-processing pipeline suitable for inferring genome-scale gene networks from large microarray datasets. We show that partitioning of the available microarray datasets according to biological relevance into tissue- and process-specific categories significantly extends the limits of downstream network construction. We demonstrate the effectiveness of our pre-processing pipeline by inferring genome-scale networks for the model plant Arabidopsis thaliana using two different construction methods and a collection of 11,760 Affymetrix ATH1 microarray chips. Our pre-processing pipeline and the datasets used in this paper are made available at http://alurulab.cc.gatech.edu/microarray-pp.
Keywords: Arabidopsis thaliana; gene networks; microarray
Year: 2016 PMID: 27657141 PMCID: PMC5040970 DOI: 10.3390/microarrays5030023
Source DB: PubMed Journal: Microarrays (Basel) ISSN: 2076-3905
Microarray data collected from public databases. Columns show the database, the number of experiments obtained from it, the number of original CEL files, and the number of CEL files removed by quality control.
| Database | Experiments | CEL Files | QC Filtered |
|---|---|---|---|
| ArrayExpress | 543 | 7923 | 848 |
| GEO | 83 | 1270 | 102 |
| NASC | 211 | 2859 | 387 |
| AtGenExpress | 44 | 1334 | 289 |
| Total | 881 | 13,386 | 1626 |
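Taking the "QC Filtered" column as the count of CEL files removed by quality control, the table totals reconcile with the 11,760 chips quoted in the abstract. A quick arithmetic check, with the numbers copied from the table (the variable names are illustrative only):

```python
# (experiments, original CEL files, CEL files removed by QC) per database,
# copied from the table above.
qc_table = {
    "ArrayExpress": (543, 7923, 848),
    "GEO": (83, 1270, 102),
    "NASC": (211, 2859, 387),
    "AtGenExpress": (44, 1334, 289),
}

total_experiments = sum(row[0] for row in qc_table.values())
total_cel = sum(row[1] for row in qc_table.values())
total_removed = sum(row[2] for row in qc_table.values())

# Files that remain after quality control.
retained = total_cel - total_removed

print(total_experiments, total_cel, total_removed, retained)
# prints: 881 13386 1626 11760
```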
Figure 1. Plot of the inter-quartile range (IQR) profile of the complete dataset.
Figure 2. Plot of the first derivative of the inter-quartile range (IQR) profile for the complete dataset.
Figure 3. Histogram of the IQR values of gene expression profiles for the entire dataset. For this dataset, the bin at IQR 0.45 ± 0.0125 contains the largest number of genes, hence the threshold q = 0.45.
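The histogram-based choice of q can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes bins of width 0.025 centred on multiples of 0.025 (so that one bin covers 0.45 ± 0.0125), and it assumes genes with IQR below q are the ones discarded, which is consistent with the lower threshold retaining more genes. The function names and the `expr` data layout are hypothetical.

```python
import math
from collections import Counter
from statistics import quantiles


def gene_iqr(values):
    """Inter-quartile range (Q3 - Q1) of one gene's expression profile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def modal_iqr_threshold(expr, bin_width=0.025):
    """Return the centre of the most populated IQR histogram bin.

    expr: dict mapping a gene id to a list of expression values.
    bin_width: assumed bin width; bins are centred on k * bin_width.
    """
    counts = Counter()
    for values in expr.values():
        # Index k of the bin [k*w - w/2, k*w + w/2) holding this gene's IQR.
        counts[math.floor(gene_iqr(values) / bin_width + 0.5)] += 1
    # Most populated bin; ties are broken in favour of the lower IQR.
    best_bin, _ = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return best_bin * bin_width


def filter_genes(expr, q):
    """Keep genes whose IQR is at least q (assumed filtering direction)."""
    return {g: v for g, v in expr.items() if gene_iqr(v) >= q}
```

Called on a gene-by-sample expression dictionary, `modal_iqr_threshold(expr)` yields the threshold and `filter_genes(expr, q)` the retained genes. Any resemblance to the paper's exact binning is an assumption.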
Classification of microarray experiments. Columns show the classification basis, class, number of experiments assigned to the particular class, corresponding number of CEL Files.
| Basis | Class | Experiments | CEL Files |
|---|---|---|---|
| Process | Chemical | 75 | 808 |
| | Development | 190 | 2252 |
| | Hormone | 116 | 1806 |
| | Light | 64 | 1210 |
| | Metabolism | 214 | 1535 |
| | Pathogen | 69 | 1156 |
| | Stress | 153 | 2476 |
| Tissue | Flower | 69 | 764 |
| | Leaf | 279 | 4268 |
| | Root | 121 | 1939 |
| | Seedling | 379 | 5234 |
| | Whole Plant | 144 | 2359 |
Figure 4. Expression value histogram comparing the expression value distribution of the gene AT5G08520 between the dataset of [7] and our dataset.
Data filtering results after classification, showing the number of genes that survive the filter.
| Basis | Class | Genes |
|---|---|---|
| Process | Chemical | 18,026 |
| | Development | 17,827 |
| | Hormone | 17,646 |
| | Light | 17,895 |
| | Metabolism | 17,989 |
| | Pathogen | 17,486 |
| | Stress | 19,041 |
| Tissue | Flower | 17,209 |
| | Leaf | 17,215 |
| | Root | 17,775 |
| | Seedling | 17,960 |
| | Whole Plant | 18,805 |
Number of genes covered by reverse-engineered networks. The table shows the sizes of networks constructed from the complete dataset when (A) the IQR threshold is q = 0.65 and (B) q is selected from the histogram plot. Also shown are the sizes of the networks constructed by [7,8,9].
| Network | Size of Input Dataset | Genes in Input Dataset | Genes in Network |
|---|---|---|---|
| PCC with dataset (A) | 11,760 | 13,384 | 2670 |
| MI with dataset (A) | 11,760 | 13,384 | 13,181 |
| PCC with dataset (B) | 11,760 | 18,806 | 3940 |
| MI with dataset (B) | 11,760 | 18,806 | 18,606 |
| MI network by [7] | 3137 | 15,578 | 15,495 |
| PCC network by [8] | 1094 | 16,293 | 6206 |
| GGM network by [9] | 2045 | NA 1 | 6760 |
1 [9] constructs the GGM network by randomly sampling 2000 genes at a time. NA 1: Not Applicable; PCC: Pearson correlation coefficient; MI: Mutual Information; GGM: Gaussian Graphical Model.
Network statistics for all the classified datasets.
| Basis | Network | PCC Vertices | PCC Edges | MI Vertices | MI Edges |
|---|---|---|---|---|---|
| Process | Chemical | 2553 | 51,934 | 17,355 | 109,837 |
| | Development | 2284 | 53,890 | 17,813 | 90,238 |
| | Hormone | 1696 | 17,714 | 17,598 | 98,954 |
| | Light | 3575 | 175,171 | 17,877 | 99,671 |
| | Metabolism | 4190 | 302,844 | 18,026 | 90,564 |
| | Pathogen | 2494 | 85,468 | 17,406 | 115,085 |
| | Stress | 4078 | 919,149 | 19,036 | 182,545 |
| Tissue | Flower | 3712 | 82,947 | 16,594 | 122,866 |
| | Leaf | 3073 | 119,432 | 17,210 | 73,168 |
| | Root | 2549 | 141,982 | 17,768 | 103,778 |
| | Seedling | 4156 | 314,204 | 17,947 | 82,494 |
| | Whole Plant | 5152 | 1,054,976 | 18,797 | 136,115 |