
Interpretable machine learning identifies paediatric Systemic Lupus Erythematosus subtypes based on gene expression data.

Jennifer R S Meadows¹, Jan Komorowski²,³,⁴,⁵, Sara A Yones⁶, Alva Annett⁷, Patricia Stoll⁸, Klev Diamanti⁹, Linda Holmfeldt⁹, Carl Fredrik Barrenäs⁷.

Abstract

Transcriptomic analyses are commonly used to identify differentially expressed genes between patients and controls, or within individuals across disease courses. These methods, whilst effective, cannot encompass the combinatorial effects of genes driving disease. We applied rule-based machine learning (RBML) models and rule networks (RN) to an existing paediatric Systemic Lupus Erythematosus (SLE) blood expression dataset, with the goal of developing gene networks to separate low and high disease activity (DA1 and DA3). The resultant model had an 81% accuracy to distinguish between DA1 and DA3, with unsupervised hierarchical clustering revealing additional subgroups indicative of the immune axis involved or state of disease flare. These subgroups correlated with clinical variables, suggesting that the gene sets identified may further the understanding of gene networks that act in concert to drive disease progression. This included roles for genes (i) induced by interferons (IFI35 and OTOF), (ii) key to SLE cell types (KLRB1 encoding CD161), or (iii) with roles in autophagy and NF-κB pathway responses (CKAP4). As demonstrated here, RBML approaches have the potential to reveal novel gene patterns from within a heterogeneous disease, facilitating patient clinical and therapeutic stratification.

Year: 2022 PMID: 35523803 PMCID: PMC9076598 DOI: 10.1038/s41598-022-10853-1

Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.996

Introduction

Paediatric systemic lupus erythematosus (pSLE) is a rare, clinically and genetically heterogeneous systemic autoimmune disease with a prevalence of between 3.3 and 8.8 per 100,000 children[1]. The disease course is unpredictable, with periods of remission and flares that lead to cumulative damage over time[2]. SLE is classified by the presence of at least 4 of 11 clinical criteria[3], with disease activity (DA) severity calculated based on composite scores, including the Systemic Lupus Erythematosus Disease Activity Index (SLEDAI)[4]. Genome-wide association studies have identified more than 130 SLE-associated loci[5], including those driven by interferons[6], or those controlling inflammation and tissue response to injury[7]. Together these have been used to highlight the link between SLE and viral responses[8]. However, the trigger that initiates the expression of these genes and the progression of SLE disease remains poorly understood[9]. Efforts to unravel the SLE gene expression pathway have been initiated. A 2016 study of paediatric disease examined the personal transcriptomic profiles of 158 patients using linear mixed models built on blood expression data from 15,386 transcripts[10]. The transcript panel utilised for this process considered each gene locus individually, and correlated the binary up- or down-regulation patterns with patient phenotypes. The result was the stratification of patients into distinct subclasses, with an enrichment of neutrophil expressed transcripts noted as a patient passed from the low DA1 state to the high DA3 form of disease. While the molecular pathways proposed by the study have led to a better understanding of personal disease progression[10], the analysis lacked the co-predictive power of rule-based machine learning (RBML) models.
Machine learning (ML) approaches are well suited to address this process, as they can model and characterise data with very high dimensionality, such as that generated through personal transcriptomics. However, the majority of methods work as black boxes. These offer little to no explanation in terms of how, and why, a specific classification decision is made. For clinical -omics, understanding how a classification decision is made may offer insight into the underlying biological mechanisms, for example contrasting a disease state to healthy controls[10]. Interpretable ML methods, such as RBML models, offer classification transparency[11,12]. We applied RBML that is based on rough set theory[13]. It uses Boolean reasoning to identify the minimal set of features that can discern decision classes (reducts). Reducts are subsequently overlaid onto transcriptomics data samples to create IF–THEN rules. One of the main advantages of this method is co-prediction, i.e., the identification of descriptors that collaboratively correctly classify samples from the data into the outcomes. Co-prediction can provide insight into the candidate biological processes beyond what can be learnt through co-expression networks. In the current study, we apply an RBML approach using rough sets to existing pSLE blood transcriptome data[10]. Here, the goal was to identify the genes and interactions that demarcate a low pSLE DA1 state from a high DA3 state. The disease sub-groups discovered were intersected with available clinical data, revealing gene sets key to the progression of disease and the involvement of the innate and acquired immune arms. These genes, and their protein products, have the potential to be translated to biomarkers, or could be suggested points for therapeutic intervention.

Results

Minimum gene set model discerns DA1 from DA3

The two extremes of disease activity were defined with the SLEDAI score (DA1: SLEDAI = 0–2; DA3: SLEDAI > 7). The initial rule-based model was built with R.ROSETTA[14] using data from 629 unique patient clinical visits (observations) and the discretised gene expression value for each DA1 and DA3 patient visit (features: 33,006 probes for 629 observations; Fig. 1). This initial model had an overall prediction accuracy of 71% using tenfold cross validation (Supplementary Fig. S1 online). The observations (visits) incorrectly classified by the model (Supplementary Fig. S2 online) were pruned to achieve a better separation between DA1 and DA3, then intersected with the patient metadata in order to understand the potential reasons behind their misclassification. Observations were more likely to be pruned or removed based on patient treatment, low SLEDAI score or the number of days since diagnosis (Logistic regression p-value for all < 0.05; Supplementary Fig. S3 online; Supplementary Table S1). No significant association was observed between removed observations and clinical symptoms, and the significant associations were a reflection of observations removed from class DA3 (38%, 125/330 removed) rather than reductions from DA1 (53%, 157/299 removed).

Figure 1

Overview of the modelling process implemented to classify and interrogate gene expression relationships between DA1 and DA3.

Following Monte Carlo Feature selection (MCFS)[15] on the pruned dataset, 4980 genes were available and subsequently used to build an enhanced rule-based model. Gene set enrichment analysis revealed terms connected to neutrophils (e.g., activation, mediation, degranulation) and the production and degradation of gene products (e.g., transcription initiation and nonsense-mediated decay; Supplementary Fig. S4 online). This suggests a difference in neutrophil mediated immune response between patients with DA1 and DA3, a known functional shift in SLE manifestation between disease states[4]. Feature boosting was performed to identify the optimal number of genes for the model (Fig. 1). Empirical studies revealed that model accuracy was lost if more than 200 of the top 4980 MCFS ranked genes were used for this process (Supplementary Fig. S5 online). Iterative R.ROSETTA computational rounds added genes from the starting set of 200, with maximum model accuracy of 81% achieved with a minimum set of 34 genes (Fig. 2; Supplementary Table S2 online). These genes were used in 22 and 44 classifying rules for DA1 and DA3 respectively. The model mirrored the structure of the initial model (Supplementary Fig. S1 online). Figure 2 shows DA1 and DA3 were again split, however with a reduction of complexity, in terms of rules (edges) connecting the genes (nodes) and a refinement of the central hub genes. The 10 percentage point gain in model accuracy gave a clearer, more visible separation between the disease activities in the rule networks (RN); this gain in accuracy was too small to imply overfitting of the model.
The similarity between the network of the initial model and the enhanced model implied that removed objects were unnecessary for classification of DA1 and DA3 since their removal did not significantly impact the main network structure or the rule model.

Figure 2

The rule networks discern the disease states. DA1 is largely defined by medium gene expression, whereas DA3 includes more genes, and those that were highly expressed. For each decision class, internal node colour indicates discretised gene expression value (high, medium, low; orange, grey, blue), node size is proportional to the number of objects supporting rules associated to a node, node border thickness is proportional to the number of rules associated to a node (low, high; circle border thin, thick) and edges connecting nodes represent normalised connection values (< 55%, ≥ 85%; grey, red with increasing line thickness per support interval). The latter is the strength of the co-appearance of connected nodes in rules supporting a decision class. The network was filtered to visualise rules with minimum support of 10% and rule p-value ≤ 0.05. In DA3, hub gene CKAP4 was surrounded by a thick blue border, indicating the importance of this gene to predicting this disease state. In fact, CKAP4 was a member of 14/44 co-prediction rules (Supplementary Table S2 online). The protein product of this gene, CKAP4, formerly CLIMP63, can act to regulate endoplasmic reticulum (ER) nanodomain homeostasis via shaping the luminal space or through interaction with other ER-resident proteins[16]. CKAP4 was highly expressed (orange), whereas connected gene SEC11C showed a medium level of discretised expression (grey), and RPS14 was lowly expressed (blue). In DA1, IFI35 and KLRB1 were both hub genes with medium expression levels. However, the latter had larger number of observations supporting its membership to rules (larger node size) but contributed to slightly fewer rules than IFI35 (thinner circle border size: IFI35, 6/22 rules; KLRB1, 4/22 rules). 
CD161/NKR-P1A, encoded by KLRB1, is a surface receptor of natural killer (NK) cells and subtypes of T lymphocytes, whereas IFI35 encodes the Interferon-induced 35 kDa protein, a proinflammatory damage-associated molecular pattern (DAMP) molecule in the innate immune pathway[17]. The membership of genes to the rule networks was not exclusive. For example, both IFITM1, which encodes interferon-induced transmembrane protein 1, and KLRB1 appeared in DA1 and DA3 although with different expression values (Supplementary Table S2 online). The sharing of genes across rules was more common in DA1 (4/12 plotted genes are unique to DA1) than DA3, where 18/26 were unique to that class. The model showed that the type 1 interferon response term was limited to DA1 (IFI35 and IFITM1) whereas B-cell activation was restricted to DA3 (CD38 and IGLL1) (Supplementary Fig. S6 online). However, while each term was enriched based on very few genes, it should be noted that these genes were present in multiple rules.
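The way a rule network is assembled from rules can be illustrated with a minimal Python sketch. It is not the VisuNet implementation: the gene names, levels and support counts are hypothetical placeholders, node support is approximated by summing rule supports, and edge weights are plain co-occurrence counts rather than the normalised connection values plotted in Fig. 2.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rules: each IF-part is a list of (gene, discretised level) descriptors.
# Names, levels and support counts are placeholders, not values from the study.
rules = [
    {"if": [("GENE_A", "high"), ("GENE_B", "medium")], "then": "DA3", "support": 40},
    {"if": [("GENE_A", "high"), ("GENE_C", "high")], "then": "DA3", "support": 35},
    {"if": [("GENE_B", "low"), ("GENE_D", "medium")], "then": "DA1", "support": 50},
]


def build_rule_network(rules, decision):
    """Collect nodes (descriptors) and edges (descriptor pairs) for one decision class."""
    node_rules = Counter()    # number of rules each descriptor appears in (border thickness)
    node_support = Counter()  # summed rule support per descriptor (simplified node size)
    edges = Counter()         # number of rules in which two descriptors co-occur
    for rule in rules:
        if rule["then"] != decision:
            continue
        descriptors = sorted(rule["if"])
        for descriptor in descriptors:
            node_rules[descriptor] += 1
            node_support[descriptor] += rule["support"]
        for a, b in combinations(descriptors, 2):
            edges[(a, b)] += 1
    return node_rules, node_support, edges


node_rules, node_support, edges = build_rule_network(rules, "DA3")
```

Because each decision class is processed separately, the same gene can appear with different discretised levels in the DA1 and DA3 networks, as described for IFITM1 and KLRB1.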

Patient subgroups reflect clinical manifestations

Hierarchical clustering of the enhanced model results (i.e., the membership of observations for each rule) revealed five subgroups largely contained within DA1 (C1 and C2) or DA3 (C3, C4 and C5) (Fig. 3a). The model and sub-groups were tested for significance to confirm that they cannot be attained by using random data. The significance was tested using permutation of DA state (p-value ≤ 0.05). These sub-clusters were subsequently projected onto the RNs (Fig. 3b). Of note, the C1 and C2 sub-clusters were not restricted to the DA1 rule set; however, C4 and C5 reflected partially intersecting networks that were all included in C3 and limited to DA3 (Fig. 3b). In comparison to C3, the DA3 hub gene CKAP4 was absent from C4, whereas the two small unconnected DA3 networks were absent from C5. Due to the small number of genes available for consideration, a sub-cluster-based gene enrichment analysis was not informative for all sets. The C4 and C5 enrichments were largely based on the combination of two genes (MT1F, MT1A; 11 genes available) and suggested response to ion levels (see Supplementary Fig. S7 online), whereas C3 was again led by a small number of gene combinations (e.g., cell cycling and division: CDC20, PTTG1, PTTG3P, UBE2C; B-cell pathways: CD38, GBA, TYM) but this cluster also included the MT1F and MT1A signals.

Figure 3

Hierarchical clustering of the model rules showed the major subdivision between the DA clusters. (a) Supported rules (black) and unsupported rules (grey) distinguish five disease subgroups that were projected into the (b) RN where group (cluster) membership is indicated by pie colour. The relationship between clinical phenotype (Supplementary Table S3) and sub-cluster was explored in two ways. First, to assess clinical association to a sub-cluster, the phenotype values supporting that sub-cluster were compared to those that did not. To interrogate which rule(s) were driving that pattern, a similar assessment was performed, this time for visits supporting a rule within the sub-cluster. The examination of continuous phenotypes showed that these measures were only significantly different between the three DA3 clusters and not between the two DA1 clusters (Tukey HSD adjusted p-value < 0.05; Supplementary Table S4 online). However, for DA1, the C1 and C2 clusters did contain the majority of low SLEDAI score visits (~ 1.7 in each, Supplementary Table S4, Supplementary Fig. S8 online), with C1 tending towards lower alanine aminotransferase (ALT) and serum creatinine (CR) values compared with the C2 cluster. As expected, the DA3 cluster contained the higher SLEDAI scores (C4 ~ 8.8, C5 ~ 12.1, C3 ~ 14.6). C4 was largely reflective of low measures for anti-dsDNA antibody, erythrocyte sedimentation (ESR) and white blood cell count (WBC). C5 presented lower ALT and aspartate aminotransferase levels (AST), while C3 was most representative of active disease, with low complement factor C3 and C4 values (Supplementary Table S4, Supplementary Fig. S8 online). Only two phenotypes, lymphocyte percent (LP) and neutrophil percent (NP), were significantly different in all pairwise DA3 cluster comparisons. LP was highest in C4 and NP, highest in C5. C3 was intermediate for both (Supplementary Table S4, Supplementary Fig. S8 online). 
In terms of categorical phenotypes, no significant association was detected between sex or race for each of the five clusters. In C1, the alopecia category was enriched when compared with all others (Fisher exact test p-value = 0.04; Supplementary Fig. S8 online). In C2, the musculoskeletal term and both oral steroid and nephritis treatment groups were enriched (all Fisher exact test p-value < 0.05). Treatment could not be ruled out as the factor driving differences between this and other clusters (Supplementary Fig. S9 online).

Rules reveal which gene co-predictions drive phenotype correlation

To interrogate which genes and rules drove the phenotypic associations, a closer examination of the rules within the clusters was performed. To associate rules to the discovered clusters, a frequency distribution was built for all rules with a support set matching at least 10% of the visits assigned to each of the discovered clusters. Based on the distribution, a 20% match was an empirical threshold for assigning rules to each (Supplementary Fig. S10 online). Figure 4 illustrates the fraction of rules from each cluster that were significantly associated with a phenotype, either continuous or categorical. Overall, rules from C1 or C3 were significantly associated with all phenotypes displayed (Fig. 4), an enrichment not seen with the other clusters. Interestingly, whilst no individual continuous phenotype was significantly different between the two DA1 clusters, or categorical phenotype different between the DA3 clusters, the graphs clearly showed that the same was not true for the proportion of rules significantly associated with a phenotype in either class.

Figure 4

Fraction of rules per cluster significantly associated with (a) continuous and (b) categorical phenotypes. See Supplementary Table S3 online for a list of clinical variables and phenotype abbreviations. For example, in the continuous class, rules from both DA1 clusters were significantly associated with lymphocyte count (LC; Fig. 4a). There, three unique rules were contributed by C1 (rules 5, 44, 56), whereas the fourth rule was shared by both clusters (rule 41: KLRB1, SEC11C; Supplementary Table S5 online). Interestingly, of the seven genes contained across the four rules, only the gene encoding the signal peptidase complex catalytic subunit, SEC11C, showed decreased expression; all others had medium values. This maintenance of gene expression likely explained the overall lack of significant difference between clusters for this trait. For the DA3 clusters, a significant difference was recorded for the complement factor C3 phenotype between the C3 cluster (mean 62.1 mg/dL) and the C5 cluster (mean 85.9 mg/dL) (Wilcoxon test p-value = 2.4 × 10⁻³; Supplementary Table S4 online). An examination of the rules associated with phenotype C3 revealed that 17 rules were significantly linked to this phenotype in cluster C3, whilst only eight were found in the C5 cluster (Supplementary Table S5 online). All C5 rules were shared with C3, and no rules were contributed from C4 (Fig. 4a). As expected from the associated RN, none of the nine rules unique to cluster C3 showed discrete gene membership; rather, they served to illustrate how, in comparison to C5, rules represented by network edges could introduce additional unique features that may serve to explain the phenotypic difference. For example, shared rule 15 (CKAP4, MT1F) can form an extended connection with rules 4 (MT1F, KLRB1), 23 (CKAP4, SEC11C) and 51 (MT1F, PTTG1), widening this network to include genes KLRB1, SEC11C and PTTG1 (Supplementary Table S6 online).
Each of these genes had previously been associated with SLE, but the link was not always clear. As noted before, KLRB1, expressed by NK cells and shown to be in the medium discretised expression level here, has been implicated in the regulation of the interferon gamma immune response[18]. SEC11C, encodes a subunit of microsomal signal peptidase complex and was the only DA3 gene maintained within medium levels for this phenotype. This gene was previously shown to be significantly down regulated in the T cells of adult SLE patients with low complement levels[17]. PTTG1 was previously linked to SLE via SNP association[19], although it was later shown that the risk allele was tagging the nearby microRNA, miR-146a, and this was down-regulated in European disease[20].

Discussion

The use of machine learning in the current study has served to identify the key regulatory networks that underlie two disease states, DA1 and DA3, of the highly heterogeneous condition, paediatric systemic lupus erythematosus (pSLE). In doing so, the high dimensionality of data drawn from 33,006 gene expression measures across 629 paediatric patient visits has been reduced to co-predictive networks linked via genes. These genes were under-represented or down-weighted in published studies of SLE differential gene expression (DGE) profiling (Supplementary Fig. S11 online). The result here was five sub-networks; two distinguishing DA1, perhaps as a result of treatment response, and three subgroups not related to treatment, within the more severe DA3 disease state. Two major factors underpinned the difference in the results observed here, versus those generated by others in the field. The first was the study of patient visits, rather than individuals over time via longitudinal gene expression. The second was methodological, as RNs are co-predictive and as such, are conceptually different from co-expression networks. The goal here was to delve into the co-predictive RNs based on gene expression at different stages of disease, potentially creating a set of biomarkers, which could be used to stratify patient subgroups for clinical trials or personalised medicine based on their disease state at a particular time. This contrasts to the prognostic goals of others using the same dataset[10,21]. Let us set the scene. For the transcriptomic data analysed here, the nodes of an RN are genes and their discretised expression values. The edges between two nodes of an RN are formed from pairs of genes and their discretized expression values as they co-occurred in the IF-part of rules (Fig. 2). Significantly, in one outcome a gene may have one discretised value, but in the other outcome it will have a different value. It follows that each outcome has its own network. 
As such, co-prediction can provide insight into the candidate biological processes characteristic of the given outcome. For example, one combination of descriptors, i.e., pairs of gene and their discretised value may be associated to DA1 state, and another pair to DA3. This is in contrast to co-expression networks that identify genes that are co-expressed, not necessarily co-predictive of the outcomes. SLE is a condition that spans the axes of both autoinflammatory and autoimmune disease. In this study, three DA3 subgroups were identified. The C3 sub-group sits on the autoimmune side, and had the clinical hallmarks of hypocomplementemia (low C3 and C4 clinical measures) in combination with high anti-dsDNA values, whilst the C4 sub-group likely represented the autoinflammatory side, with normal complement levels and low anti-dsDNA values (Supplementary Table S4 online). This was reinforced by the higher SLEDAI scores observed in C3 versus C4. Cluster C5 likely represented the intermediate stage between C3 and C4, where a significant shift between neutrophil and lymphocyte involvement is observed. This could translate to an immune complex driven disease state in C5, where the type I interferon process was active (low lymphocyte percent and increased neutrophil involvement). In studies using independent patient groups, both changes in complement ratio (C3/C4)[22] and the categorisation of neutrophil to lymphocyte ratio (NLR)[23], have been suggested as ways to distinguish SLE patient groups. Here, network analysis and unsupervised clustering combined both C3/C4 and NLR biomarker sets and resulted in three separate groups spanning these factors. The novelty in the current study lies in linking the clusters to co-predictive RNs, and this was the second major factor differentiating this work from others. 
While the application of machine learning approaches to the big data sets generated by biology -omics is not new[24], the approach used here removes the ‘black box’ interpretation of both the modelling and the results. This is required in the trade-off between predictability and interpretability[25]. Here we accepted the potentially reduced, but still high prediction accuracy of 81%, in favour of transparent classical models that perform well when the number of features available in the dataset (i.e. genes) outnumbers the observations[26]. It is important to note that the rough sets approach to constructing rules is based on finding all minimal subsets of features that preserve discernibility of the decision classes from the original set. The rules will contain conjunctions of genes that may reflect different levels of gene regulation but that do not need to be co-expressed. In RNs, the genes and their regulation levels are associated to the outcome and discern the decision classes (here DA1 or DA3) based on the training data, while in co-expression networks the genes are co-expressed with other genes and may not discern the outcomes. The R.ROSETTA method used for constructing the model has been shown to outperform other existing rule-based methods[14], and has the key distinction of being the only method that can compute a significance level for the rules in the model. This is useful for calculating model prediction reliability, but it is the use of a minimum set of significant rules that served to highlight the genes contributing most strongly to the separate networks. In practice, this was illustrated by the hub genes for DA1 (e.g., IFI35, KLRB1) and DA3 (e.g., CKAP4, OTOF; Fig. 2). IFI35 expression is stimulated in response to IFN-α/γ[27] and it can act intracellularly as a negative switch in the innate immune pathway via retinoic acid-inducible gene I regulation[28].
Extracellularly, the opposite effect has been observed, and the IFI35 molecule can act as a DAMP, and serve to activate the NF-κB pathway in macrophages via TLR4 signalling[17]. The end result is the release of proinflammatory cytokines, including interleukin 6 and tumour necrosis factor[17]. In DA1, IFI35 expression is observed within the medium range, but a change in this value could be key in driving DA1 patients back to a remissive or inactive SLE state. Likewise, the maintained medium expression of KLRB1 (encoding the surface receptor CD161) suggests a role for other cell sets, including natural killer (NK) cells and T lymphocytes in this lower disease state. The cell population expressing CD161 has been shown to be lower in SLE patients versus controls[29]. This is intriguing as this receptor can mark the NK cells that respond to innate cytokines and so promote innate inflammation[30]. Here again we see a contradiction between the promotion and reduction of the innate immune response. While CKAP4 was shown as a highly expressed hub gene in DA3, the protein product is most often reported to have a role in cancer, for example acting with RBP1 to induce autophagy in murine models of oral squamous cell carcinoma[31]. Autophagy can also play into the pathogenesis of SLE in a number of ways. Dysregulated autophagy can affect the regulation of T and B cell populations[32], and increased autophagy can promote the NF-κB pathway response[33]. Through its interaction with ER-resident proteins, CKAP4 also has the potential to regulate or reflect the current state of cellular immune signalling[15]. For the individuals studied here, increased levels of CKAP4 may not be driving disease, but the finding opens a potential line of anti-CKAP4 antibody drug development for SLE patients; an avenue previously only promoted for cancer treatment[34]. Another DA3 hub gene, OTOF, is an interferon inducible gene, and has been recognised by others as a marker for SLE disease flares[35]. 
This is in keeping with the finding of OTOF in the C3 and C5 clusters, but not in C4. Recently it was suggested that through interaction with melatonin, OTOF may have a role in proteasome inhibition[36], and so could function in the downstream signal transduction pathway of NF-κB[37]. While that study was focused on neuronal survival driven by melatonin ubiquitin proteasome system inhibition, a protective anti-inflammatory role of melatonin in SLE pathogenesis has been reported previously[38,39]. Gene networks acting through the fulcrum of OTOF may help to explain this action, and suggest that further investigation of melatonin treatment in SLE flare could be warranted. The current analysis aimed to explore the different networks that underlie pSLE disease states with the goal of developing a minimum set of rules that could discern disease states DA1 from DA3. It is worth mentioning that we did not aim to model the entire spectrum of pSLE disease activity, so we chose the objects that could optimally and clearly separate between DA1 and DA3 states and highlight their subgroups. This required pruning the misclassified objects from the initial model. A logistic regression was used to reveal the probability of object removal (Supplementary Fig. S3 online, Supplementary Table S1). More specifically, this analysis showed that the enhanced model applies to a subset of SLE patients who have had no prior treatments, who do not have very low SLEDAI scores, or who do not have a markedly long period of time between diagnosis and clinic visit. If the pathways are to be generalized, these factors must be accounted for prior to using the model in a clinical setting. The enhanced model showed clearer sub-networks even though the gain in accuracy was only 10 percentage points.
While the networks generated here are based on a single gene expression set, multiple lines of evidence from previous SLE studies support their value; whether that be in classifying sub-cluster patient states or indicating possible treatments based on hub genes. It will be important to test the predictive, or replicative, ability of the gene networks to classify additional SLE patient sets. This includes generalisation to known patient subcategories, such as those with nephritis, and the wider adult SLE population. However, the permutation analysis conducted here suggests that this should be possible. We believe that machine learning approaches, such as the one demonstrated here, could aid disease understanding. This applies not only to SLE, but to any complex heterogeneous syndrome.

Methods

Figure 1, an overview of the analysis pipeline, was generated with http://www.lucidchart.com resources.

Data and pre-processing

Publicly available whole blood transcriptome records (Illumina HT-12 V4 bead chip) and clinical metadata from 158 pSLE patients and 48 healthy controls were downloaded (NCBI GEO: GSE65391)[9]. All the procedures were performed in accordance with the relevant guidelines and regulations. From that existing data set, the values corresponding to DA1 (SLEDAI 0–2), DA3 (SLEDAI > 7) and control visits were extracted. In this analysis, the transcriptome generated per visit to the clinic, and not per patient lifetime, was considered. As such, an individual may be represented in the analysis multiple times (between 1 and 15 times) if their disease status at the time was classified as DA1 or DA3 (Supplementary Fig. S12 online). For expression data, gene loci represented by more than one probe were combined and averaged, before each gene locus was log transformed. Following a linear mixed model used to identify biological and technical confounders (Variance Partition R package[40]), no biological confounders were identified. Batch effects were identified (Variance Partition R package[41]) and corrected (SVA R package[40]). The batch effects identified here were limited to the reported batch replicates from the original metadata (batch 1 and 2) and not found for other phenotypes (Supplementary Fig. S13 online).

Machine learning rule-based modelling to obtain explainable classifiers for DA state

For methodological context, we applied an interpretable learning method based on rough sets that offers classification transparency[11,12]. Given data in the form of a decision table, where rows represent observations and columns are features with the last column being the outcome or decision, rough set algorithms find all minimal subsets of features that preserve discernibility between the outcomes for the observations. These subsets of features are called reducts, and are used to generate IF–THEN rules by overlaying them on the observations. An IF–THEN rule consists of the condition part, the IF-part, often called the left-hand side, and the THEN-part, which is the decision or outcome and is often called the right-hand side of the rule. The elements of the IF-part are called descriptors, and are in the form of pairs: a feature and its value. To aid interpretation, the rules generated by the model were visualized as RNs, where the nodes are descriptors. For every pair of descriptors in a rule of the RBML model, an edge connecting the corresponding nodes is added to the network. First, expression values were subject to data discretisation, since R.ROSETTA[14] generates rules for that data form. For each gene, the control data expression mean (μ) and standard deviation (σ) were calculated, and then all DA data for that gene were projected onto this threshold frame and discretised (Low ≤ μ − 2σ; μ − 2σ ≤ Medium < μ + 2σ; High ≥ μ + 2σ; and coded by numeric values 1, 2, 3, respectively). To generate the initial model, data was first collected into a decision table where unique visit identifiers were the objects and put in rows (n = 629), while genes (n = 33,006) were variables and constituted columns. The objects were labelled with disease activity, DA1 or DA3, accordingly. Next, the Monte Carlo Feature selection (MCFS) algorithm[15] was applied to obtain a ranked list of informative features with respect to classifying the objects.
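As a rough illustration of the discretisation step, the sketch below codes one gene's expression values against the control mean and standard deviation. It is a minimal Python example, not the authors' R code: the expression values are invented, the sample standard deviation is assumed, and a value falling exactly on the lower threshold is assigned to the Low class, a boundary choice the text leaves open.

```python
from statistics import mean, stdev


def discretise(value, mu, sigma, k=2.0):
    """Code an expression value as 1 (low), 2 (medium) or 3 (high).

    Thresholds are mu - k*sigma and mu + k*sigma, with k = 2 as in the text.
    Assumption: a value exactly on the lower threshold is coded as low.
    """
    if value <= mu - k * sigma:
        return 1
    if value >= mu + k * sigma:
        return 3
    return 2


# Hypothetical log-expression values for a single gene.
control_values = [8.0, 9.0, 10.0, 11.0, 12.0]   # healthy control visits
patient_values = [6.0, 10.5, 14.0]              # DA1/DA3 visits

mu = mean(control_values)
sigma = stdev(control_values)  # sample SD; the choice of estimator is an assumption
codes = [discretise(v, mu, sigma) for v in patient_values]
```

Repeating this per gene yields the columns of the decision table, with the DA1 or DA3 label added as the final decision column.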
A significance cut-off for selecting features from the ranked list was obtained by a permutation test (p-value ≤ 0.05). Feature boosting was applied to select the optimal number of features to build the model, and the rule model was then visualized with the VisuNet R package[42].

The initial rule-based model defined above was used as a base to further improve classification. Observations (DA1 or DA3 visits) that did not match the left-hand side of any significant rule (p-value < 0.05) in the initial model were removed, and the MCFS[15] process was repeated on the remaining objects. Prior to building the enhanced rule-based model, iterative computational rounds were performed (Feature boosting in Fig. 1) in order to select the optimal number of features for the final predictive model: the significant features from the MCFS output were added incrementally to build several rule-based models. The features that produced the model with the best overall accuracy were chosen for building the final enhanced model using R.ROSETTA[14], which was then visualized using VisuNet[42].

In order to identify patient subgroups, a matrix was constructed with the retained observations (visits) as rows and the rules as columns. A cell was assigned 1 if the observation supported the rule, and 0 otherwise. Hierarchical clustering with binary distance as the distance function was applied to this matrix.
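The rule support matrix and binary-distance clustering can be sketched as follows. The rules and visits are invented, "support" is taken here to mean that a visit matches every descriptor on the left-hand side of a rule, and average linkage is an assumption, since the linkage method is not stated in the text. The binary distance is the fraction of positions where exactly one vector is 1 among positions where at least one is 1.

```python
from itertools import combinations

# Hypothetical rules: gene -> discretised level (1 Low, 2 Medium, 3 High).
rules = [{"IFI35": 3, "OTOF": 3}, {"KLRB1": 1}, {"CKAP4": 3}]

# Hypothetical visits with discretised expression levels.
visits = [
    {"IFI35": 3, "OTOF": 3, "KLRB1": 2, "CKAP4": 2},
    {"IFI35": 3, "OTOF": 3, "KLRB1": 2, "CKAP4": 3},
    {"IFI35": 2, "OTOF": 2, "KLRB1": 1, "CKAP4": 2},
    {"IFI35": 2, "OTOF": 3, "KLRB1": 1, "CKAP4": 2},
]

def support_matrix(observations, rule_list):
    """Rows are observations, columns are rules; 1 if all descriptors match."""
    return [[int(all(obs.get(gene) == level for gene, level in rule.items()))
             for rule in rule_list] for obs in observations]

def binary_distance(a, b):
    """Mismatching positions divided by positions where at least one vector is 1."""
    either = sum(1 for x, y in zip(a, b) if x or y)
    differ = sum(1 for x, y in zip(a, b) if x != y)
    return differ / either if either else 0.0

def cluster_distance(c1, c2, rows):
    """Average linkage: mean pairwise distance between two clusters."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(binary_distance(rows[i], rows[j]) for i, j in pairs) / len(pairs)

def agglomerate(rows, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], rows))
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(c) for c in clusters]

matrix = support_matrix(visits, rules)
clusters = agglomerate(matrix, n_clusters=2)
print(matrix, clusters)
```

In this toy example the first two visits share the interferon-related rule and the last two share the KLRB1 rule, so they fall into separate clusters.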

Correlating clusters to clinical and phenotypic data

Available metadata, including continuous and categorical clinical values (Supplementary Table S3), were accessed[10]. For continuous variables, a one-way ANOVA followed by a post-hoc Tukey HSD test was used to compute significance. For categorical variables, Fisher's exact test was used to assess their association with sub-clusters.
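As an illustration of the categorical comparison only, the sketch below computes a two-sided Fisher's exact test for a 2 × 2 table using the Python standard library. The counts are invented, and restricting the test to a 2 × 2 table (one sub-cluster versus another, one binary clinical variable) is a simplifying assumption. The ANOVA and Tukey HSD steps are not shown.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p = sum(prob(x) for x in range(lowest, highest + 1)
            if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)

# Invented counts: rows are two sub-clusters, columns are presence or
# absence of a binary clinical feature.
p_value = fisher_exact_two_sided(8, 2, 1, 9)
print(round(p_value, 5))
```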

Correlating rules associated with clusters to clinical and phenotypic data

Empirical values were used to determine the minimal threshold for rule membership to clusters. A rule was considered associated with a cluster if its support set covered at least 10% of the cluster’s support set (i.e., the observations associated with that cluster; Supplementary Fig. S14 online). The association between a cluster’s supported rules and clinical phenotypes was assessed by contrasting, for each rule, the phenotype values of the supported samples with those of the non-supported samples (continuous variables, non-parametric Wilcoxon test; binary variables, Fisher’s exact test). Supplementary Fig. S15 online illustrates this process.
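The 10% membership threshold and the supported versus non-supported split can be sketched as below. The identifiers (`rule_A`, `v0`, and so on), the phenotype values and the helper names are hypothetical. The fraction is read here as the share of a cluster's observations that fall in a rule's support set, which is one interpretation of the text.

```python
def rules_for_cluster(rule_support, cluster_members, min_fraction=0.10):
    """Return rules whose support set covers at least min_fraction of the cluster."""
    cluster = set(cluster_members)
    associated = []
    for rule_id, support in rule_support.items():
        overlap = len(cluster & set(support))
        if cluster and overlap / len(cluster) >= min_fraction:
            associated.append(rule_id)
    return associated

def split_by_support(phenotype, support):
    """Split phenotype values into those from supported and non-supported samples."""
    supported = [value for obs, value in phenotype.items() if obs in support]
    unsupported = [value for obs, value in phenotype.items() if obs not in support]
    return supported, unsupported

# Hypothetical cluster of 20 visits and three rules with their support sets.
cluster_members = [f"v{i}" for i in range(20)]
rule_support = {
    "rule_A": {"v0", "v1", "v2"},   # 3 of 20 cluster visits (15%)
    "rule_B": {"v0", "v30"},        # 1 of 20 cluster visits (5%)
    "rule_C": {"v5", "v6"},         # 2 of 20 cluster visits (10%)
}
associated = rules_for_cluster(rule_support, cluster_members)

# Hypothetical continuous phenotype for four visits.
phenotype = {"v0": 12.0, "v1": 9.0, "v7": 3.0, "v8": 4.0}
supported, unsupported = split_by_support(phenotype, rule_support["rule_A"])
print(associated, supported, unsupported)
```

The two lists returned by `split_by_support` are the groups that would then be compared with a Wilcoxon or Fisher's exact test.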

Model validation

The decision label (DA1 or DA3) was permuted 1000 times and a rule-based model was created for each of these random sets. A normal distribution was fitted to the resulting model accuracies, and its mean, standard deviation and standard error were computed. Significance was assessed at an alpha of 0.05 (95% confidence interval). The accuracy of the original model was compared to the mean μ and standard deviation σ of this distribution: if it was smaller than μ − 2σ or greater than μ + 2σ, the p-value was taken to be < 0.05.
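A label permutation test of this kind can be sketched as follows. The rule-based learner is replaced by a deliberately simple stand-in (`toy_rule_accuracy`, which predicts the majority label for each discretised feature value and scores on the same data), and the data are invented, so only the permutation logic is representative. Both a normal-approximation p-value and an empirical p-value are reported; the latter is an addition of this sketch, not something described in the text.

```python
import random
from collections import Counter, defaultdict
from statistics import NormalDist, mean, stdev

def toy_rule_accuracy(features, labels):
    """Stand-in for a rule model: majority label per feature value, scored in-sample."""
    by_value = defaultdict(Counter)
    for feature, label in zip(features, labels):
        by_value[feature][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(labels)

def permutation_test(features, labels, n_permutations=1000, seed=0):
    """Compare the observed accuracy with accuracies obtained on shuffled labels."""
    rng = random.Random(seed)
    observed = toy_rule_accuracy(features, labels)
    null = []
    for _ in range(n_permutations):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null.append(toy_rule_accuracy(features, shuffled))
    mu, sigma = mean(null), stdev(null)
    z = (observed - mu) / sigma
    p_normal = 2 * (1 - NormalDist().cdf(abs(z)))          # normal approximation
    p_empirical = (1 + sum(a >= observed for a in null)) / (n_permutations + 1)
    return observed, mu, sigma, p_normal, p_empirical

# Invented data: one discretised gene and a DA label for 40 visits.
features = [1] * 20 + [3] * 20
labels = ["DA1"] * 18 + ["DA3"] * 2 + ["DA3"] * 18 + ["DA1"] * 2
observed, mu, sigma, p_normal, p_empirical = permutation_test(features, labels)
print(observed, round(mu, 3), round(sigma, 3), p_normal, p_empirical)
```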

Gene enrichment analysis

Over-representation of the gene sets belonging to each cluster, and of the gene sets belonging to rules in DA1 and DA3, was determined using the R package clusterProfiler[43]. The background list was set to the initial set of 33,006 available loci.
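Over-representation analysis is commonly based on a hypergeometric test against the background list. The sketch below shows that calculation in plain Python. It does not reproduce clusterProfiler itself; the function name and all counts other than the 33,006 background loci are invented for illustration.

```python
from math import comb

def enrichment_p_value(n_background, n_in_set, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_background: loci in the background list
    n_in_set:     background loci annotated to the gene set
    n_selected:   genes in the cluster or rule list being tested
    n_overlap:    selected genes that are annotated to the gene set
    """
    upper = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Background of 33,006 loci as in the text; the other counts are invented.
p_enriched = enrichment_p_value(n_background=33006, n_in_set=200,
                                n_selected=25, n_overlap=3)
print(p_enriched)
```

In practice the resulting p-values would also be adjusted for the number of gene sets tested.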

40 in total

Review 1. Systemic lupus erythematosus.

Authors: George C Tsokos
Journal: N Engl J Med Date: 2011-12-01 Impact factor: 91.245

2. The sva package for removing batch effects and other unwanted variation in high-throughput experiments.

Authors: Jeffrey T Leek; W Evan Johnson; Hilary S Parker; Andrew E Jaffe; John D Storey
Journal: Bioinformatics Date: 2012-01-17 Impact factor: 6.937

3. clusterProfiler: an R package for comparing biological themes among gene clusters.

Authors: Guangchuang Yu; Li-Gen Wang; Yanyan Han; Qing-Yu He
Journal: OMICS Date: 2012-03-28

4. Genetic association of miRNA-146a with systemic lupus erythematosus in Europeans through decreased expression of the gene.

Authors: S E Löfgren; J Frostegård; L Truedsson; B A Pons-Estel; S D'Alfonso; T Witte; B R Lauwerys; E Endreffy; L Kovács; C Vasconcelos; B Martins da Silva; S V Kozyrev; M E Alarcón-Riquelme
Journal: Genes Immun Date: 2012-01-05 Impact factor: 2.676

Review 5. Prevalence and burden of pediatric-onset systemic lupus erythematosus.

Authors: Sylvia Kamphuis; Earl D Silverman
Journal: Nat Rev Rheumatol Date: 2010-08-03 Impact factor: 20.543

6. IFP 35 is an interferon-induced leucine zipper protein that undergoes interferon-regulated cellular redistribution.

Authors: F C Bange; U Vogel; T Flohr; M Kiekenbeck; B Denecke; E C Böttger
Journal: J Biol Chem Date: 1994-01-14 Impact factor: 5.157

7. Genome-wide association scan in women with systemic lupus erythematosus identifies susceptibility variants in ITGAM, PXK, KIAA1542 and other loci.

Authors: John B Harley; Marta E Alarcón-Riquelme; Lindsey A Criswell; Chaim O Jacob; Robert P Kimberly; Kathy L Moser; Betty P Tsao; Timothy J Vyse; Carl D Langefeld; Swapan K Nath; Joel M Guthridge; Beth L Cobb; Daniel B Mirel; Miranda C Marion; Adrienne H Williams; Jasmin Divers; Wei Wang; Summer G Frank; Bahram Namjou; Stacey B Gabriel; Annette T Lee; Peter K Gregersen; Timothy W Behrens; Kimberly E Taylor; Michelle Fernando; Raphael Zidovetzki; Patrick M Gaffney; Jeffrey C Edberg; John D Rioux; Joshua O Ojwang; Judith A James; Joan T Merrill; Gary S Gilkeson; Michael F Seldin; Hong Yin; Emily C Baechler; Quan-Zhen Li; Edward K Wakeland; Gail R Bruner; Kenneth M Kaufman; Jennifer A Kelly
Journal: Nat Genet Date: 2008-01-20 Impact factor: 38.330

8. Interferon-inducible protein IFI35 negatively regulates RIG-I antiviral signaling and supports vesicular stomatitis virus replication.

Authors: Anshuman Das; Phat X Dinh; Debasis Panda; Asit K Pattnaik
Journal: J Virol Date: 2013-12-26 Impact factor: 5.103

9. Reticulon and CLIMP-63 regulate nanodomain organization of peripheral ER tubules.

Authors: Guang Gao; Chengjia Zhu; Emma Liu; Ivan R Nabi
Journal: PLoS Biol Date: 2019-08-30 Impact factor: 8.029

Review 10. The Dickkopf1-cytoskeleton-associated protein 4 axis creates a novel signalling pathway and may represent a molecular target for cancer therapy.

Authors: Akira Kikuchi; Katsumi Fumoto; Hirokazu Kimura
Journal: Br J Pharmacol Date: 2017-07-07 Impact factor: 8.739