Shaoliang Peng, Shunyun Yang, Xiaochen Bo, Fei Li.
Abstract
An increasing number of studies use gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, because of the enormous computational overhead of its significance-estimation and multiple-hypothesis-testing steps, its scalability and efficiency are poor on large-scale datasets. We propose paraGSEA for efficient large-scale transcriptome data analysis. Through optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, yielding a more than 100-fold performance increase over other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. Through further parallelization, a near-linear speed-up is achieved on both workstations and clusters, with high scalability and performance on large-scale datasets. The analysis time of the whole LINCS phase I dataset (GSE92742) was reduced to about half an hour on 1000 nodes of the Tianhe-2 supercomputer, or to within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA.
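As a point of reference for the complexity discussion, the per-gene-set, per-profile statistic that GSEA computes is the weighted Kolmogorov-Smirnov running-sum enrichment score. The sketch below is a minimal illustration of the classic O(n)-per-gene-set baseline that paraGSEA optimizes, not paraGSEA's own code; the function name and the p=1 weighting are assumptions, and every gene in the gene set is assumed to appear in the ranked list.

```python
def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Classic GSEA running-sum ES over one ranked expression profile.

    ranked_genes : genes sorted by correlation with the phenotype
    correlations : matching correlation values (same order)
    gene_set     : set of genes to test for enrichment
    """
    in_set = [g in gene_set for g in ranked_genes]
    # normalizer for "hit" increments: sum of |correlation|^p over set members
    nr = sum(abs(c) ** p for c, hit in zip(correlations, in_set) if hit)
    n_miss = len(ranked_genes) - len(gene_set)  # genes outside the set
    running, best = 0.0, 0.0
    for c, hit in zip(correlations, in_set):
        if hit:
            running += abs(c) ** p / nr     # step up on a set member
        else:
            running -= 1.0 / n_miss         # step down on a non-member
        if abs(running) > abs(best):        # ES = maximum deviation from 0
            best = running
    return best
```

A gene set concentrated at the top of the ranking drives the running sum toward +1; a set concentrated at the bottom drives it negative.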
Year: 2017 PMID: 28973463 PMCID: PMC5737394 DOI: 10.1093/nar/gkx679
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. paraGSEA technical diagrams. (A) The paraGSEA pipeline. The schematic outline of paraGSEA mainly illustrates the input/output data files, the data flow and the algorithm steps. The original inputs are expression profile files from LINCS in the HDF5 file format with a gctx or gct extension. The core modules are Quick Search, Compare Profiles and Parallel Cluster. The first two can be run independently, but the third depends on the results of the Compare Profiles module. Only the first module prints its results to the command-line window; the other two write their results to external text files, because the results are too large to display on the command line. (B) Data partitioning and output strategies of the Compare Profiles module. The profiles are divided into several parts, and each MPI process is in charge of a different part. All processes read the input file in parallel with efficient MPI I/O routines. After each reading stage completes, the processes communicate and exchange data to ensure that each process holds the entire second file. Tasks are then divided equally among several threads in all processes, and the similarity-matrix calculation is carried out in parallel under a strict load-balancing strategy. Finally, each process writes its part of the ES matrix to an output text file. (C) One iteration of the parallel k-medoids clustering process, comprising five steps. Step 1: generating initial centers. The algorithm first generates a specified number of clustering centers; this is a simple random-number-generation step, but it must be carried out in a single process and then broadcast so that every process holds the same clustering centers. Step 2: classifying all profiles. This step uses a second level of parallelization.
Each process holds a roughly equal share of the rows of the ES matrix, read from the results of the Compare Profiles step, and classifies its part of the profiles to obtain a local class-flag vector using OpenMP multi-threading. All local results are then gathered into a global class-flag vector held by every process. Step 3: calculating the average-similarity vector. This step also uses a second level of parallelization and is very similar to Step 2, but it calculates a global average-similarity vector that is gathered to the root process. Step 4: calculating new clustering centers. This step is carried out serially on the root process only. By comparing the class-flag vector with the average-similarity vector, we find, within each cluster, the profile with the greatest average similarity to the other profiles of that cluster. Step 5: write out or move on. If the new centers differ from the old ones, they replace them and the next iteration begins; otherwise, the algorithm stops and writes out the clustering result.
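The classification and center-update steps of the k-medoids iteration (Steps 2-4) can be sketched in serial form over a precomputed similarity matrix. This is an illustrative reimplementation, not paraGSEA's MPI/OpenMP code; the function name and tie-breaking (first maximum wins) are assumptions.

```python
def kmedoids_iteration(sim, medoids):
    """One k-medoids iteration on a precomputed similarity matrix.

    sim     : n x n symmetric similarity matrix (e.g. pairwise ES values)
    medoids : current medoid indices, one per cluster
    Returns (labels, new_medoids); convergence is reached when
    new_medoids equals medoids (Step 5).
    """
    n = len(sim)
    # Step 2: assign each profile to its most similar medoid
    labels = [max(range(len(medoids)), key=lambda k: sim[i][medoids[k]])
              for i in range(n)]
    # Steps 3-4: the new medoid of each cluster is the member with the
    # greatest average similarity to the members of that cluster
    new_medoids = []
    for k in range(len(medoids)):
        members = [i for i in range(n) if labels[i] == k]
        new_medoids.append(max(
            members,
            key=lambda i: sum(sim[i][j] for j in members) / len(members)))
    return labels, new_medoids
```

In the parallel version described above, the label assignment and the average-similarity sums are the parts distributed across MPI processes and OpenMP threads, with gathers producing the global vectors.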
Figure 2. Runtime of four GSEA implementations. (A) Runtime of four GSEA implementations with 1000 permutations and 50 gene sets versus different numbers of expression profiles. (B) Runtime of four GSEA implementations with no permutations and 50 gene sets versus different numbers of expression profiles.
Figure 3. Performance evaluation diagrams of Compare Profiles. (A) Runtime of Compare Profiles in each stage. (B) Runtime of Compare Profiles in each stage except the ES-matrix calculation. 'Cores' on the axis denotes the degree of the second level of parallelization, where 'P' is the number of processes and 'T' the number of threads per process. (C) Runtime and speed-up of the ES-matrix computation stage of Compare Profiles on the whole LINCS phase I dataset. (D) Runtime and speed-up of the ES-matrix computation stage of Compare Profiles on different proportions of the LINCS phase I dataset on the Tianhe-2 supercomputer. 'Nodes' on the axis denotes the number of Tianhe-2 nodes used in the calculation; each node ran all of its 24 cores to exploit its full computing potential.
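The speed-up and parallel-efficiency metrics plotted here (and in Figure 4A) follow the standard definitions; as a reminder, two illustrative helper functions (not from paraGSEA):

```python
def speedup(t_serial, t_parallel):
    """Speed-up S = T_serial / T_parallel."""
    return t_serial / t_parallel

def parallel_efficiency(t_serial, t_parallel, cores):
    """Parallel efficiency E = S / cores; E = 1.0 is ideal linear scaling."""
    return speedup(t_serial, t_parallel) / cores
```

A near-linear speed-up, as reported for paraGSEA, means the efficiency stays close to 1 as the core count grows.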
Figure 4. Evaluation diagrams of Parallel Cluster. (A) Parallel efficiency of each iteration. (B) Convergence steps for different numbers of clusters. (C) Runtime for different numbers of clusters. Note: some data points for 320 clusters are not drawn because the values there grow disproportionately large.
Figure 5. Evaluation results of paraGSEA. (A) Comparison between paraGSEA and GSEA2 in calculating the ES. (B) Comparison between paraGSEA and GSEA2 in calculating the normalized ES. (C) Comparison between paraGSEA and GSEA2 in calculating the P-value. (D) The clustering results of 254 profiles based on k-medoids++. Because the profiles in LINCS carry no category labels, we expect experimental groups acting on the same cell lines to cluster together once some experimental conditions are controlled for. The profiles are classified into five groups, marked by the color of the dendrogram lines; the expected profile classifications are labeled by the font color. The numbers in parentheses give the number of profiles for each item.
Integrated clustering results from the case study, used to compute the Kappa statistic
| Clustered category | Expected category 1 | Expected category 2 | Expected category 3 | Expected category 4 | Expected category 5 |
|---|---|---|---|---|---|
| Category 1 | 51 | 5 | 0 | 2 | 0 |
| Category 2 | 18 | 45 | 5 | 2 | 2 |
| Category 3 | 0 | 0 | 43 | 0 | 0 |
| Category 4 | 1 | 6 | 0 | 57 | 0 |
| Category 5 | 0 | 0 | 0 | 1 | 16 |
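The Kappa statistic the table caption refers to can be computed directly from this confusion matrix (rows: clustered categories; columns: expected categories). A minimal sketch of standard Cohen's kappa, not code from paraGSEA:

```python
def cohens_kappa(matrix):
    """Cohen's kappa for a square confusion matrix.

    kappa = (p_observed - p_expected) / (1 - p_expected), where
    p_observed is the diagonal agreement rate and p_expected is the
    chance agreement implied by the row and column marginals.
    """
    n = sum(sum(row) for row in matrix)
    p_observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)
```

Applied to the 254 profiles tabulated above, this yields a kappa of roughly 0.79, in the "substantial agreement" band of the usual Landis-Koch scale.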