Jonathan Wei Xiong Ng, Swee Kwang Chua, Marek Mutwil.
Abstract
Understanding how different cellular components work together to form a living cell requires multidisciplinary approaches that combine molecular and computational biology. Machine learning shows great potential in the life sciences, as it can find novel relationships between biological features. Here, we constructed a dataset of 11,801 gene features for 31,522 Arabidopsis thaliana genes and developed a machine learning workflow to identify linked features. The detected linked features are visualised as a Feature Important Network (FIN), which can be mined to reveal a variety of novel biological insights pertaining to gene function. We demonstrate how FIN can be used to generate such insights. To make this network easily accessible to the scientific community, we present the FINder database, available at finder.plant.tools.
Keywords: Arabidopsis; database; feature importance; machine learning; network; random forest
Year: 2022 PMID: 36212273 PMCID: PMC9539877 DOI: 10.3389/fpls.2022.944992
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 6.627
Hyperparameters (HPs) tested for the time trial.
| Model | Hyperparameter | Range of values |
|---|---|---|
| Adaboost | n_estimators (maximum number of estimators in model) | 100, 120, 130, 150, and 200 |
| | learning_rate (weight applied to each classifier at each training iteration; a higher learning rate increases the contribution of each estimator) | 0.6, 0.625, 0.65, 0.675, 0.7, 0.725, 0.75, 0.775, and 0.8 |
| Balanced random forest | max_features (number of features to consider when looking for the best split in the tree) | sqrt, 0.1, 0.2, 0.3, 0.4, 0.5, and 0.75 |
| | n_estimators (maximum number of estimators in model) | 50, 100, 200, 500, and 1,000 |
| | max_depth (maximum depth of the tree) | 10, 20, 50, 70, 100, 125, 150, 200, 500, and None |
| Logistic regression | C (inverse of regularization strength; smaller values specify stronger regularization) | 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1,000, and 10,000 |
| Linear SVM | C (inverse of regularization strength) | 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1,000, and 10,000 |
| Random forest | ccp_alpha (complexity parameter, used to determine extent of tree pruning) | 0, 0.1, 0.001, and 0.001 |
| | max_features (number of features to consider when looking for the best split in the tree) | sqrt, 0.1, 0.2, 0.3, 0.4, 0.5, and 0.75 |
| | n_estimators (maximum number of estimators in model) | 50, 100, 200, and 500 |
| | max_depth (maximum depth of the tree) | 20, 50, 100, 200, and None |
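To give a sense of the size of the search space, the sketch below encodes the balanced random forest grid from the table and draws a random subset of hyperparameter combinations, as a random grid search would. It uses only the Python standard library; the function names, the number of sampled combinations (`n_iter=20`), and the seed are illustrative assumptions, not settings reported by the authors.

```python
import itertools
import random

# Balanced random forest grid, copied from the hyperparameter table.
GRID = {
    "max_features": ["sqrt", 0.1, 0.2, 0.3, 0.4, 0.5, 0.75],
    "n_estimators": [50, 100, 200, 500, 1000],
    "max_depth": [10, 20, 50, 70, 100, 125, 150, 200, 500, None],
}


def all_combinations(grid):
    """Enumerate every hyperparameter combination in the grid."""
    names = list(grid)
    value_lists = (grid[name] for name in names)
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]


def sample_combinations(grid, n_iter, seed=0):
    """Draw n_iter distinct combinations at random (hypothetical budget and seed)."""
    rng = random.Random(seed)
    return rng.sample(all_combinations(grid), n_iter)


combos = all_combinations(GRID)
candidates = sample_combinations(GRID, n_iter=20)
print(len(combos), "possible combinations;", len(candidates), "sampled")
```

Each sampled dictionary could then be passed to a classifier constructor and scored, keeping the combination with the best score.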
Summary of features used.
| Feature type | Feature name (number of features) | Description |
|---|---|---|
| Gene expression | SPM (9) | Expression specificity |
| | TPM (6) | Gene expression levels |
| | DGE (436) | Differential gene expression |
| | Diurnal (13) | Diurnal gene expression, amplitude and time point |
| Gene family | Orthogroups (2) | Gene family size |
| Phylostrata | Phylostrata (1) | Phylostrata which genes belong to |
| Genomic information | Single copy genes (1) | Single copy genes in the same gene family |
| | Tandemly duplicated genes (1) | Tandemly duplicated genes in the same gene family |
| Protein domain | MobiDBLite (1) | Prediction of disordered domain regions |
| | Pfam (2761) | Collection of protein families |
| | TMHMM (1) | Prediction of transmembrane helices |
| | Number of domains (2) | Number of protein domains |
| Biochemical | Length of peptide (1) | Shows how long each peptide is |
| | Molecular weight (1) | Molecular weight of peptide |
| | Isoelectric point (pI) (1) | pI of peptide |
| PPI | Network centrality (2) | Degree and betweenness centrality |
| | Network clusters (1295) | Cluster size and ID |
| Gene coexpression | Network centrality (2) | Degree and betweenness centrality |
| | Network clusters (279) | Cluster size and ID |
| GO terms | GO terms (3645) | Experimentally determined gene annotations |
| cis-regulatory elements | cis-regulatory | Gene regulation |
| | cis-regulatory | Gene regulation |
| Multi-omics | GWAS (33) | Genomic loci within genes, correlated with phenotype traits |
| | TWAS (28) | Gene expression level, correlated with phenotype traits |
| Gene regulatory network | Network centrality (2) | Degree and betweenness centrality |
| | Network clusters (55) | Cluster size and ID |
| | Properties (76) | Biological characteristics of transcription factors and their target genes |
| Aranet gene-interactions | Network centrality (2) | Degree and betweenness centrality |
| | Network clusters (2957) | Cluster size and ID |
| Evolution | Homologs (22) | Presence of |
| | Nucleotide Diversity (1) | Nucleotide diversity calculated from |
| Epigenetics | Gene body methylation (1) | Whether gene body is methylated |
| Conservation | Sequence conservation (3) | Protein sequence % identity to fungi, plants, and metazoans |
| | Percent identity to paralogs (1) | Maximum percent identity from BLAST to closest paralog |
| | dN/dS values (4) | dN/dS substitution rates between |
| | Paralog dS (1) | dS with putative paralog |
| PTMs | Protein PTM (58) | Protein PTM frequency |
The first column describes the feature type. The second describes the feature name and parentheses indicate the number of features per name. The third column contains the feature description.
Figure 1. Evaluation of machine learning algorithms. (A) F1 scores (y-axis) for 16 GO cellular location terms (x-axis). The algorithms are logistic regression (average F1 score 0.32), balanced random forest (0.26), adaboost (0.41), random forest (0.43), and linear SVM (0.35). The predictions were performed five times, and the error bars represent the 95% CI. (B) Time (y-axis) taken to train the different machine learning models. The predictions were performed five times on GO terms GO:0005829 (cytosol) and GO:0016020 (membrane). (C) OOB F1 score (y-axis) for the 71 GO terms using different sets of hyperparameters. “Individually selected HP” refers to a random grid search that optimises hyperparameters for each GO term individually. “Default HP” means that default hyperparameters are used. “Groups of HP (group 1)” and “(group 2)” refer to the most frequent hyperparameter groups observed after optimising HPs for the 71 GO terms. “Most frequent individual hyperparameter” refers to the most frequent individual hyperparameter chosen after optimising for the 71 GO terms.
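For readers unfamiliar with the metric reported in Figure 1, the following minimal sketch computes the F1 score (the harmonic mean of precision and recall) for a single binary label such as one GO term. The function name and the toy labels are illustrative assumptions; how the authors computed their scores in practice is not described in this section.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 means the gene is annotated with the term."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive call: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: four genes, two truly annotated, two predicted positive.
toy_score = f1_score([1, 1, 0, 0], [1, 0, 1, 0])
print(toy_score)
```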
Figure 2. OOB F1 or R² scores (x-axis) of the biological features. (A) F1 scores (x-axis) for categorical features (y-axis). (B) R² scores for continuous features. (C) F1 scores (y-axis) of GO terms vs. the number of genes in each GO term (x-axis).
Figure 3. Characteristics of the feature importance network. (A) Power-law distribution of node degrees. (B) Distribution of mutual ranks (MR) of feature importance values. (C) Distribution of highest reciprocal rank (HRR) of gene coexpression values in the coexpression network used in our study.
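The two rank-based measures named in Figure 3 can be illustrated with a small sketch. It uses the common definitions: MR is the geometric mean of the two directional ranks, and HRR is the larger of the two. The toy scores, feature names, and the choice to rank by descending score are assumptions made for illustration and may differ from the authors' pipeline.

```python
import math


def ranks_from_scores(scores):
    """For each feature, rank its partners by descending score (1 = strongest)."""
    ranks = {}
    for source, partners in scores.items():
        ordered = sorted(partners, key=lambda p: partners[p], reverse=True)
        ranks[source] = {partner: i + 1 for i, partner in enumerate(ordered)}
    return ranks


def mutual_rank(ranks, a, b):
    """Geometric mean of rank(a -> b) and rank(b -> a)."""
    return math.sqrt(ranks[a][b] * ranks[b][a])


def highest_reciprocal_rank(ranks, a, b):
    """The worse (numerically larger) of the two directional ranks."""
    return max(ranks[a][b], ranks[b][a])


# Hypothetical importance scores: scores[x][y] is how important y is for predicting x.
toy_scores = {
    "A": {"B": 0.9, "C": 0.1},
    "B": {"A": 0.2, "C": 0.8},
    "C": {"A": 0.6, "B": 0.4},
}
toy_ranks = ranks_from_scores(toy_scores)
mr_ab = mutual_rank(toy_ranks, "A", "B")
hrr_ab = highest_reciprocal_rank(toy_ranks, "A", "B")
print(mr_ab, hrr_ab)
```

Both measures are symmetric, so a pair of features receives a low (strong) value only when each ranks the other highly.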
Figure 4. Feature importance network. Nodes represent features, while edges connect features that have putative biological links. The features are divided into eight major groups (Table 2), each drawn as a box; the first seven boxes represent clusters of feature categories. Red edges show relationships between groups belonging to the same box. The boxes are 1 (Aranet features: cluster IDs and size), 2 (gene coexpression features: cluster IDs and size), 3 (DGE features: subdivided into five categories), 4 (GO terms: subdivided into three categories), 5 (PPI features: cluster IDs and size), 6 (cis-regulatory elements), 7 (gene regulatory features: cluster IDs and size), and 8 (a miscellaneous group which contains all other feature categories). The yellow circle highlights DGE features, showing how many of them are linked to each other while also having links to other features.
Figure 5. Degree distribution of feature categories. The top blue circle shows GO terms, while the bottom blue circle shows DGE features. The top yellow circle shows orthogroups and phylostrata; the middle yellow circle shows transmembrane helices, biochemical features (length and molecular weight of peptide), and number of protein domains; and the bottom yellow circle shows network features (cluster size) from the PPI, coexpression, regulatory, and Aranet networks.
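A node's degree is simply the number of edges attached to it, and a per-category summary follows by grouping nodes. The sketch below assumes that a feature's category can be read from the prefix of its name (for example `pep_` or `num_`); this grouping rule and the helper names are illustrative assumptions. The toy edge list contains only the four protein-length associations reported for Figure 7B.

```python
from collections import Counter, defaultdict

# Toy edge list: the protein-length associations reported in Figure 7B.
EDGES = [
    ("pep_aal", "num_counts"),
    ("pep_aal", "num_u_counts"),
    ("pep_aal", "mob_counts"),
    ("pep_aal", "pep_mw"),
]


def node_degrees(edges):
    """Count how many edges touch each node."""
    degrees = Counter()
    for a, b in edges:
        degrees[a] += 1
        degrees[b] += 1
    return degrees


def category_of(node):
    """Assumed convention: the category is the text before the first underscore."""
    return node.split("_", 1)[0]


def mean_degree_by_category(degrees):
    """Average node degree within each feature category."""
    grouped = defaultdict(list)
    for node, degree in degrees.items():
        grouped[category_of(node)].append(degree)
    return {cat: sum(vals) / len(vals) for cat, vals in grouped.items()}


degrees = node_degrees(EDGES)
mean_by_cat = mean_degree_by_category(degrees)
print(dict(degrees), mean_by_cat)
```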
Figure 6. Identification of significantly associated feature types. The clustermap shows whether there are significantly (BH-adjusted p-value < 0.05) more (red squares) or significantly fewer (blue squares) associations between feature categories than expected by chance. Associations that are not statistically significant are indicated by black boxes. Circles indicate clusters of feature categories that are associated with each other. The yellow (top left) and pink (bottom left) squares are red squares that have been coloured differently so that they are easy to identify when our study refers to them.
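The BH (Benjamini-Hochberg) adjustment mentioned in the Figure 6 caption controls the false discovery rate across many tests. The sketch below implements the standard step-up procedure in plain Python; the raw p-values are made up for illustration, and the statistical test that produced the authors' p-values is not specified in this section.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity and a cap of 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four category pairs.
raw_p = [0.01, 0.04, 0.03, 0.005]
adj_p = benjamini_hochberg(raw_p)
significant = [q < 0.05 for q in adj_p]
print(adj_p, significant)
```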
Figure 7. Examples capturing protein sizes and conservation. Selected nodes and edges from the local neighbourhood of specific features in the database are shown. (A) Mean gene expression (tpm_mean) is positively associated with maximum (tpm_max) and median (tpm_median) gene expression. (B) Protein length (pep_aal) is positively associated with the number of protein domains (num_counts), number of unique protein domains (num_u_counts), number of disordered protein domains (mob_counts) and protein molecular weight (pep_mw). (C) Sequence conservation in plants [con_Sequence_conservation_in_plants_(%_ID)] is positively associated with sequence conservation in paralogs, fungi and metazoans (nodes starting with “con_”), fundamental GO terms (nodes starting with “GO_”), and homology with multiple plant species (nodes starting with “hom”). Sequence conservation in plants is negatively associated with the evolutionary age of target genes of transcription factors (ttf_Evolutionary_age), phylostrata (phy_phylostrata), and protein PTMs (nodes starting with “ptm_”).
Figure 8. Examples capturing posttranslational modifications and the number of transmembrane and disordered domains. Selected nodes and edges from the local neighbourhood of specific features in the database are shown. (A) The protein PTM lysine acetylation (ptm_ac_K) is positively associated with GO cellular component terms (nodes starting with “GO_CC_”), especially those related to the chloroplast. (B) Number of transmembrane helices in a protein sequence (tmh_counts) is positively associated with GO transmembrane transporter and channel terms (nodes starting with “GO_”), and sphingolipid metabolic process (GO_BP_sphingolipid metabolic process). (C) Number of disordered domains in a protein sequence (mob_counts) is positively associated with sequence conservation in paralogs, plants, fungi and metazoans (nodes starting with “con_”), GO microtubule terms (nodes starting with “GO_”), gene length (nodes starting with “ttf_”), protein PTMs (nodes starting with “ptm_”), protein length (pep_aal), number of unique protein domains (num_u_counts), and protein molecular weight (pep_mw).