Jingjing Zhai, Yunjia Tang, Hao Yuan, Longteng Wang, Haoli Shang, Chuang Ma.
Abstract
The identification of genes associated with a given biological function in plants remains a challenge, although network-based gene prioritization algorithms have been developed for Arabidopsis thaliana and many non-model plant species. Nevertheless, these network-based gene prioritization algorithms have encountered several problems; one in particular is unsatisfactory prediction accuracy due to limited network coverage, varying link quality, and/or uncertain network connectivity. Thus, a model that integrates complementary biological data may be expected to increase the prediction accuracy of gene prioritization. Toward this goal, we developed a novel gene prioritization method named RafSee to rank candidate genes using a random forest algorithm that integrates sequence, evolutionary, and epigenetic features of plants. Subsequently, we proposed an integrative approach named RAP (Rank Aggregation-based data fusion for gene Prioritization), in which an order statistics-based meta-analysis was used to aggregate the ranks of the network-based gene prioritization method and RafSee, for accurately prioritizing candidate genes involved in a pre-specified biological function. Finally, we showcased the utility of RAP by prioritizing 380 flowering-time genes in Arabidopsis. The "leave-one-out" cross-validation experiment showed that RafSee could work as a complement to a current state-of-the-art network-based gene prioritization system (AraNet v2). Moreover, RAP ranked 53.68% (204/380) of flowering-time genes higher than AraNet v2, resulting in a 39.46% improvement in terms of the first quartile rank. Further evaluations also showed that RAP was effective in prioritizing genes related to different abiotic stresses. To enhance the usability of RAP for Arabidopsis and non-model plant species, an R package implementing the method is freely available at http://bioinfo.nwafu.edu.cn/software.
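The rank aggregation step described in the abstract can be illustrated with a minimal Python sketch. It assumes the joint cumulative distribution of order statistics (often called the Q statistic in genomic data-fusion work) as the aggregation formula; whether RAP uses exactly this formulation is an assumption, and the function names and toy genes below are hypothetical.

```python
from math import factorial


def q_statistic(rank_ratios):
    """P(U_(1) <= r_(1), ..., U_(N) <= r_(N)) for N independent uniform draws."""
    r = sorted(rank_ratios)
    n = len(r)
    v = [1.0]  # V_0 = 1
    for k in range(1, n + 1):
        x = r[n - k]  # r_{N-k+1} in 1-based notation
        v.append(sum((-1) ** (i - 1) * v[k - i] / factorial(i) * x ** i
                     for i in range(1, k + 1)))
    return factorial(n) * v[n]


def aggregate_ranks(ranks_by_method, n_candidates):
    """ranks_by_method: {method: {gene: rank}}; returns shared genes, best first."""
    genes = set.intersection(*(set(r) for r in ranks_by_method.values()))
    q = {g: q_statistic([ranks[g] / n_candidates
                         for ranks in ranks_by_method.values()])
         for g in genes}
    # Smaller Q means the gene is ranked consistently near the top.
    return sorted(q, key=lambda g: (q[g], g))


# Hypothetical ranks for three genes from two methods (not data from the paper).
toy = {
    "network": {"geneA": 1, "geneB": 2, "geneC": 3},
    "rafsee_like": {"geneA": 2, "geneB": 3, "geneC": 1},
}
order = aggregate_ranks(toy, n_candidates=3)
```

For two methods the statistic reduces to Q = 2·r1·r2 − r1², where r1 ≤ r2 are the sorted rank ratios, so a gene placed near the top by both methods receives the smallest Q and therefore the best aggregated rank.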
Keywords: biological network; data fusion; flowering time; gene prioritization; machine learning; meta-analysis; rank aggregation; systems biology
Year: 2016 PMID: 28018423 PMCID: PMC5156684 DOI: 10.3389/fpls.2016.01914
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753
Figure 1. Schematic of the RAP-based gene prioritization.
Figure 2. Distribution of sequence, evolutionary, and epigenetic features in the positive and negative sample sets. (A) Boxplot distributions for the occurrence frequency of 20 amino acids in the positive and negative sample sets. Asterisks (*) indicate that the differences between positive and negative samples are statistically significant at the level of 0.05. (B) Differences in the occurrence frequency of 400 amino acid pairs between positive and negative samples. “Sig” represents a significant difference and “NS” represents a non-significant difference at the level of 0.05. (C) Density distributions of the median percentage of identity of positive and negative samples to the top BLASTP matches in 34 plant species. (D) Percentage of positive and negative samples that have a paralog derived from α and βγ whole genome duplicates. (E) Percentage of genes with methylation in the positive and negative sample sets.
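The occurrence frequencies of the 20 amino acids and 400 amino acid pairs mentioned in panels (A) and (B) can be computed along the following lines. This is a minimal sketch, not the RAP package's code: it assumes that pairs are adjacent residues and that non-standard residues are ignored, and the function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard amino acid in a protein sequence."""
    valid = [aa for aa in seq.upper() if aa in AMINO_ACIDS]
    total = len(valid)
    return {aa: (valid.count(aa) / total if total else 0.0)
            for aa in AMINO_ACIDS}


def dipeptide_composition(seq):
    """Fraction of each of the 400 adjacent amino acid pairs."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skips pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return {p: (c / total if total else 0.0) for p, c in counts.items()}


# Hypothetical toy sequence, used only to show the output shape.
aac = amino_acid_composition("AAC")
dpc = dipeptide_composition("AAC")
```

Each protein then yields a 20-value and a 400-value feature vector, which can be compared between the positive and negative sample sets feature by feature.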
List of the top 10 PCP-related features.
| Rank | AAindex ID | Description | P-value | Reference |
| --- | --- | --- | --- | --- |
| 1 | NOZY710101 | Transfer energy, organic solvent/water | 1.16E-86 | Nozaki and Tanford, |
| 2 | ZHOH040101 | The stability scale from the knowledge-based atom-atom potential | 7.16E-86 | Zhou and Zhou, |
| 3 | SWER830101 | Optimal matching hydrophobicity | 5.04E-83 | Sweet and Eisenberg, |
| 4 | CORJ870102 | SWEIG index | 6.67E-83 | Cornette et al., |
| 5 | MEEJ810102 | Retention coefficient in NaH2PO4 | 9.77E-82 | Meek and Rossetti, |
| 6 | CIDH920104 | Normalized hydrophobicity scales for alpha/beta-proteins | 5.62E-81 | Cid et al., |
| 7 | CIDH920103 | Normalized hydrophobicity scales for alpha + beta-proteins | 5.74E-80 | Cid et al., |
| 8 | CIDH920105 | Normalized average hydrophobicity scales | 8.34E-80 | Cid et al., |
| 9 | GUYH850102 | Apparent partition energies calculated from Wertz-Scheraga index | 1.82E-79 | Guy, |
| 10 | MEEJ810101 | Retention coefficient in NaClO4 | 2.20E-79 | Meek and Rossetti, |
Figure 3. Performance of RafSee in distinguishing positives and negatives using 10-fold cross-validation. (A) The ROC curves of 10-fold cross-validation for RafSee trained with 766 statistically significant features. The dashed curves denote the ROC curves from the testing dataset in each round of 10-fold cross-validation. The solid curve represents the average of the 10 ROC curves. (B) Boxplot distribution of the 10 AUC values from 10-fold cross-validation for RafSee trained with different sets of features. APAAC, PAAC, AAC, and PCP indicate the 26 APAAC-, 20 PAAC-, 255 AAC-, and 461 PCP-related statistically significant features extracted from protein sequences, respectively.
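The evaluation in this figure, one AUC value per round of 10-fold cross-validation, can be sketched in plain Python as follows. The stratified fold assignment and the `train_and_score` hook are assumptions made for illustration; a random forest (for example scikit-learn's `RandomForestClassifier`) could be plugged into that hook, but the toy "model" below simply returns a feature value as the score.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with a similar class balance."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cross_validated_auc(features, labels, train_and_score, k=10, seed=0):
    """train_and_score(train_X, train_y, test_X) -> scores (hypothetical hook)."""
    aucs = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        scores = train_and_score([features[i] for i in train_idx],
                                 [labels[i] for i in train_idx],
                                 [features[i] for i in test_idx])
        aucs.append(auc([labels[i] for i in test_idx], scores))
    return aucs


# Toy data (not from the paper): one feature that separates the classes.
toy_X = [[0.1 * i] for i in range(40)]
toy_y = [0] * 20 + [1] * 20
fold_aucs = cross_validated_auc(
    toy_X, toy_y, lambda tr_X, tr_y, te_X: [x[0] for x in te_X])
mean_auc = sum(fold_aucs) / len(fold_aucs)
```

The list `fold_aucs` corresponds to the 10 values that a boxplot like panel (B) would summarize for one feature set.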
Figure 4. Performance of three different gene prioritization methods for identifying flowering-time genes. (A) Relationships between gene rank and their connectivity with known flowering-time genes for AraNet v2. (B) Relationships between gene rank and their connectivity with known flowering-time genes for RafSee. (C) Pairwise comparison between gene ranks predicted by AraNet v2 and RafSee. Each symbol denotes a flowering-time gene, and its coordinates represent the ranks assigned by the corresponding two gene prioritization methods. The dashed diagonal line denotes a 1:1 correspondence. (D) Pairwise comparison between gene ranks predicted by AraNet v2 and RAP.
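The pairwise comparisons in panels (C) and (D) amount to counting how many genes fall on each side of the 1:1 diagonal. A minimal sketch, assuming that a numerically lower rank is better; the function name and the toy ranks are hypothetical, and only the 204/380 figure comes from the abstract.

```python
def compare_ranks(ranks_a, ranks_b):
    """Count genes ranked better (numerically lower) by method b than by a."""
    shared = ranks_a.keys() & ranks_b.keys()
    b_better = sum(ranks_b[g] < ranks_a[g] for g in shared)
    a_better = sum(ranks_a[g] < ranks_b[g] for g in shared)
    return {
        "n": len(shared),
        "b_better": b_better,
        "a_better": a_better,
        "ties": len(shared) - a_better - b_better,
        "b_better_pct": 100.0 * b_better / len(shared) if shared else 0.0,
    }


# Hypothetical ranks for four genes under two methods.
method_a = {"g1": 10, "g2": 500, "g3": 42, "g4": 7}
method_b = {"g1": 3, "g2": 120, "g3": 42, "g4": 9}
summary = compare_ranks(method_a, method_b)

# The abstract's 204 of 380 flowering-time genes corresponds to this share.
reported_pct = round(100 * 204 / 380, 2)
```

Genes for which method b wins lie on one side of the dashed diagonal, genes for which method a wins lie on the other, and ties sit on the line itself.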
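Performance statistics for ranking flowering-time genes using different gene prioritization methods.
| Method | Minimum | First quartile | Median | Third quartile | Maximum |
| --- | --- | --- | --- | --- | --- |
| RafSee | 7 | 415.5 | 1908.5 | 5419.5 | 18678 |
| AraNet v2 | | 149.5 | 830 | 3019.25 | |
| RAP | | | | | 12099 |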
Bold denotes the best method for the corresponding ranking criteria.
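Rank distributions like those compared in this table can be summarized, and two methods compared on one criterion, with the sketch below. It assumes a five-number summary (minimum, quartiles, maximum) and defines relative improvement as the percentage reduction of a rank statistic against a baseline; both the quartile convention and that definition are assumptions rather than the paper's stated procedure, and the numbers are toy values.

```python
from statistics import quantiles


def rank_summary(ranks):
    """Five-number summary of a list of gene ranks (lower rank is better)."""
    q1, median, q3 = quantiles(ranks, n=4, method="inclusive")
    return {"min": min(ranks), "q1": q1, "median": median,
            "q3": q3, "max": max(ranks)}


def relative_improvement(baseline, improved):
    """Percentage by which `improved` lowers a rank statistic vs. `baseline`."""
    return 100.0 * (baseline - improved) / baseline


# Toy ranks (not from the paper) to show the output shape.
toy_summary = rank_summary([5, 1, 3, 2, 4])
toy_gain = relative_improvement(baseline=200, improved=150)
```

Under this definition, a first-quartile rank that drops from 200 to 150 would be reported as a 25% improvement.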
Figure 5. A hierarchical network of functional associations between the top 20 ranked genes and 449 known flowering-time genes.
Performance statistics for ranking stress-responsive genes using different gene prioritization methods.
| Response to salt (388 genes) | RafSee | 6 | 1035.5 | 8928.75 | 8928.75 | 19202 |
| AraNet v2 | 11 | 1088 | 6131.5 | 6131.5 | ||
| RAP | 13130 | |||||
| Response to temperature (373 genes) | RafSee | 1270 | 3520 | 8623 | 19550 | |
| AraNet v2 | 1025 | 2668 | 5810 | |||
| RAP | 13029 | |||||
| Response to cold (289 genes) | RafSee | 10 | 3423 | 9165 | 21571 | |
| AraNet v2 | 9 | 1282 | 2832 | |||
| RAP | 916 | 5542 | 12355 | |||
| Response to water (238 genes) | RafSee | 1026.25 | 3712.5 | 8260.25 | 22045 | |
| AraNet v2 | 18 | 754 | 2155 | |||
| RAP | 8 | 5070.25 | 12007 |
Bold denotes the best method for the corresponding ranking criteria.