Literature DB >> 31784565

The role and robustness of the Gini coefficient as an unbiased tool for the selection of Gini genes for normalising expression profiling data.

Marina Wright Muelas¹, Farah Mughal², Steve O'Hagan^3,4, Philip J Day^5,6, Douglas B Kell^7,8.

Abstract

We recently introduced the Gini coefficient (GC) for assessing the expression variation of a particular gene in a dataset, as a means of selecting improved reference genes over the cohort ('housekeeping genes') typically used for normalisation in expression profiling studies. Those genes (transcripts) that we determined to be useable as reference genes differed greatly from previous suggestions based on hypothesis-driven approaches. A limitation of this initial study is that a single (albeit large) dataset was employed for both tissues and cell lines. We here extend this analysis to encompass seven other large datasets. Although their absolute values differ a little, the Gini values and median expression levels of the various genes are well correlated with each other between the various cell line datasets, implying that our original choice of the more ubiquitously expressed low-Gini-coefficient genes was indeed sound. In tissues, the Gini values and median expression levels of genes showed a greater variation, with the GC of genes changing with the number and types of tissues in the data sets. In all data sets, regardless of whether this was derived from tissues or cell lines, we also show that the GC is a robust measure of gene expression stability. Using the GC as a measure of expression stability we illustrate its utility to find tissue- and cell line-optimised housekeeping genes without any prior bias, that again include only a small number of previously reported housekeeping genes. We also independently confirmed this experimentally using RT-qPCR with 40 candidate GC genes in a panel of 10 cell lines. These were termed the Gini Genes. In many cases, the variation in the expression levels of classical reference genes is really quite huge (e.g. 44 fold for GAPDH in one data set), suggesting that the cure (of using them as normalising genes) may in some cases be worse than the disease (of not doing so). We recommend the present data-driven approach for the selection of reference genes by using the easy-to-calculate and robust GC.

Entities: CellLine Chemical Disease Gene Species

Mesh：

Year: 2019 PMID： 31784565 PMCID： PMC6884504 DOI： 10.1038/s41598-019-54288-7

Source DB: PubMed Journal: Sci Rep ISSN： 2045-2322 Impact factor: 4.379

Introduction

In a recent paper[1], we introduced the Gini index (or Gini coefficient, GC)[2-5] as a very useful, nonparametric statistical measure for identifying those genes whose expression varied least across a large set of samples (when normalised appropriately[6] to the total expression level of transcripts). The GC is a measure that is widely used in economics (e.g.[4,7-12]) to describe the (in)equality of the distribution of wealth or income between individuals in a population. However, although it could clearly be used to describe the variation in any other property between individual examples[13-16]), it has only occasionally been used in epidemiology[17-19] and in biochemistry[1,5,20-25]. Its visualisation and calculation are comparatively straightforward (Fig. 1): individual examples are ranked on the abscissa in increasing order of the size of their contribution, and the cumulative contribution is plotted against this on the ordinate. The GC is given by the fractional area mapped out by the resulting ‘Lorenz’ curve (Fig. 1). For a purely ‘socialist’ system in which all contributions are equal (GC = 0), the curve joins the normalised 0,0 and 1,1 axes, while for a complete ‘autocracy’, in which the resource or expression is held or manifest by only a single individual (GC = 1), the ‘curve’ follows the two axes (0,0 → 1,0 → 1,1).

Figure 1

Graphical indication of the means by which we calculate the Gini coefficient.

Graphical indication of the means by which we calculate the Gini coefficient. Since the early origins of large-scale nucleic acid expression profiling, especially those using microarrays[26-28], it has been clear that expression profiling methods are susceptible to a variety of more or less systematic artefacts within an experiment, whose resolution would require or benefit from some kind of normalisation (e.g.[29-39]). By this (‘normalisation of the first kind’), and what is typically done, we mean the smoothing out of genuine artefacts within an array or a run, that occur simply due to differences in temperature or melting temperature or dye binding or hybridisation and cross-hybridisation efficiency (and so on) across the surface of the array. This process can in principle use reference genes, but usually exploits smoothing methods that normalise geographically local subsets of the genes to a presumed distribution. Even after this is done, there is a second level of normalisation, that between chips or experiments, that is usually done separately, not least because it is typically much larger and more systematic, especially because of variations in the total amount of material in the sample analysed or of the overall sensitivity of the detector (much as is true of the within-run versus between-run variations observed in mass spectrometry experiments[40,41]). This kind of normalising always requires ‘reference’ genes whose expression varies as little as possible in response to any changes in experimental conditions. The same is true for expression profiling as performed by qPCR[42-47], where the situation is more acute regarding the choice of reference genes since primers must be selected for these a priori. Commonly, the geometric mean of the expression levels of that or those that vary the least is selected as the ‘reference’. The question then arises as to which are the premium ‘reference’ genes to choose. Data-driven and hypothesis-dependent science are complementary, though when a field is data-rich but hypothesis-poor, as is genomics, data-driven strategies are to be preferred[48]. Perhaps surprisingly[48], rather than simply letting the data speak for themselves, choices of candidate reference genes were often made on the basis that reference genes should be ‘housekeeping’ genes that would simply be assumed (‘hypothesised’) to vary comparatively little between cells, be involved in nominal routine metabolism and also that they should have a reasonably high expression level (e.g.[49-66]). This is not necessarily the best strategy, and there is in fact (and see below) quite a wide degree of variation of the expression of most standard housekeeping genes between cells or tissues (e.g.[53,62,65,67-79]). Indeed, Lee et al.[69] stated explicitly that housekeeping genes may be uniformly expressed in certain cell types but may vary in others, especially in clinical samples associated with disease. It became obvious that an analysis of the GC of the various genes was actually precisely what was required to assess those ‘housekeeping’ (or any other) genes that varied least across a set of expression profiles, and we found 35 transcripts for which the GC was 0.15 or below when assessing 56 mammalian cell lines taken from a wide variety of tissues[1]. These we refer to as the ‘Gini genes’. Most of these were ‘novel’ as they had never previously been considered as reference genes, and we noted that their Gini indices were significantly smaller (they were more stably expressed) than were those of the more commonly used reference genes[66]. However, this analysis was done on only one (albeit large) dataset of gene expression profiles. While some of the compilations (e.g.[65,80]) contain massive amounts of expression profiling data, many of these, especially the older ones, may well be of uncertain quality. Thus, especially since the GC is very prone to being raised by small numbers of large outliers, we decided for present purposes that we should compare our analyses of candidate Gini genes using a smaller but carefully chosen set of expression profiling experiments. The more modern RNA-seq (e.g.[81-85]), in which individual transcripts are simply counted digitally via direct sequencing, is seen as considerably more robust[81,86,87] and sensitive[88,89], and so we selected additional large and recent datasets that used RNA-seq in cell lines and tissues (Table 1). We note too that the precision of these digital methods (as with other, digital, single-molecule strategies[90-92]), means that the requirement for reasonably high-level expression levels is much less acute.

Table 1

Studies used for assessing proposed stable reference genes.

Study short name	Comments	Reference
GiniGene	Study presenting novel potential housekeeping genes in cells and tissues from the HPA project cell and tissue RNA-seq data.	[1]
geNorm or Vandesompele	Classic set of reference genes in tissues and a means of analysing them	[66]
Eisenberg	Very detailed analysis of housekeeping/ reference genes in tissues using the Illumina Body Map study of RNA-seq of 16 Human Tissues. E-MTAB-513.	[49]
Lee	Two novel reference genes from a detailed analysis of 281 normal tissue samples from 17 different organs then compares between disease states m and cell lines.	[131]
Caracausi	646 expression profile data sets from 54 different human tissues.	[65]

Studies used for assessing proposed stable reference genes. In a similar vein (Table 2), we selected a small number of reasonably detailed studies in which particular housekeeping genes had been proposed as reference genes.

Table 2

Studies used for expression profiling data.

Dataset short name	Comments	Reference
HPA	RNA-seq-based dataset from the Human Protein Atlas group. Two data sets available: one of 19,628 protein coding genes in 56 cell lines (HPA_C) and another of 19,613 protein coding genes in 59 tissues (HPA_T).	[1,93,140]
CCLE	RNA-seq-based dataset (Cancer Cell Line Encyclopedia) of 58,035 genes in 934 human cancer cell lines (downloaded from EBI Expression Atlas E-MTAB2770).	[141]
Klijn / Genentech	RNA-seq-based analysis of 57,711 genes in 622 human cancer cell lines (downloaded from EBI Expression Atlas E-MTAB-2706).	[142]
GTEx	RNA-Seq data of 46,711 genes in 53 human tissue samples from the Genotype-Tissue Expression (GTEx) project (downloaded from EBI Expression Atlas E-MTAB-5214).	[143]
PCAWG	RNA-Seq of 46,816 genes in 76 tissues, cancer and normal, from The International Cancer Genome Project: Pan Cancer Analysis of Whole Genomes (downloaded from EBI Expression Atlas E-MTAB-5200).	https://dcc.icgc.org/pcawg
HBM	Illumina Body Map: RNA-seq of 16 Human tissues. (downloaded from EBI Expression Atlas E-MTAB-513). Used by Eisenberg and colleagues in their analysis of housekeeping/ reference genes in tissues.	[49]

Studies used for expression profiling data. To our knowledge, there are no large-scale studies to determine housekeeping genes in large, cell-line cohorts; the present paper serves to provide one. In addition, we include an experimental RT-qPCR analysis of a subset of the Gini genes.

Results

The Gini Coefficient as a robust measure of gene expression stability in multiple cell-line data sets

We previously identified a number of genes in the Human Protein Atlas (HPA) cell line data set[93] with very low expression variability and thus potential for use as reference genes[1]. However, we did not compare these Gini genes to other genes that have previously been proposed as housekeeping genes. We therefore performed a similar analysis using the potential housekeeping genes we proposed in[1] as well as other reference genes proposed in other studies (Table 2) with additional large RNA-Seq cell line data sets (Table 1). Figure 2A shows a plot of the GC of a variety of candidate Gini genes versus their median expression level in the HPA cell lines dataset set[93]. It is clear that genes we identified previously have much lower GC values in the HPA dataset than do any of the others (just two, VPS29 and CHMP2A, were also identified by Eisenberg and Levanon and another, RPL41, by Caracausi). This is not at the expense of an unusually low expression (Fig. 2A), a finding broadly confirmed when we look at the median expression levels for the CCLE dataset (Fig. 2B) and of the Klijn dataset (Fig. 2C).

Figure 2

Gini coefficient and median expression levels of proposed reference genes in the HPA cell-line dataset. (A) GC versus median expression level of HPA dataset. (B) Median expression levels of CCLE vs HPA datasets. Line of best linear fit (in log space) shown is y = 0.991 + 0.827 × (r2 = 0.606). (C) Median expression levels of CCLE vs Klijn datasets. Line of best linear fit (in log space) shown is y = 0.998 + 0.804 × (r2 = 0.593). Colour coding: red, GeneGini reference genes; blue Eisenberg & Levanon; yellow Vandesompele; green Lee; lilac both GeneGini and Eisenberg and Levanon. Figure 3 shows the GC values for the various genes in two other datasets, viz CCLE and Klijn. Our previous Gini genes have a lower GC than that of any of the other housekeeping genes in 25 out of 38 cases in Klijn (all under 0.2) and in 26 out of 40 cases for CCLE (all under 0.22). In confirmation of this, and of the correlation found above between the median expression levels in CCLE and Klijn, the GC values are also well correlated with each other for the two datasets (Fig. 3). Thus, although the absolute numbers are slightly larger than are those for the HPA dataset (unsurprisingly, given the much larger number of examples), the trend is still very clear: the GiniGenes remain the best among those variously proposed as reference genes in a variety of large and quite independent datasets. It also suggests that variations in the total amount of mRNA are not an issue either.

Figure 3

Gini coefficient of candidate reference genes in CCLE and Klijn/Genentech cell-line datasets. Left panel shows all proposed housekeeping genes considered in this study, with the right panel showing labels of those genes with a GC < 0.25. The line of best fit is y = −0.171 + 0.829 × (r2 = 0.909). Colour code as in Fig. 2. Another common statistical measure, more resistant to individual outliers, is the interquartile ratio (the ratio between the 25th and 75th percentile when expression levels are ranked); by this measure too, the Gini genes that we uncovered previously stand out as being the least varying (Fig. 4 A,B). This suggests that, as a measure of gene expression stability, the GC is robust: the GiniGenes have the lowest ratio between their maximum and minimum expression values in the HPA dataset (Fig. 4C) and also the lowest interquartile ratio in their levels of expression in all three cell line data sets explored here (Fig. 4B,C) with good correlation between these two datasets.

Figure 4

Robustness of the Gini coefficient. (A) IQR of different genes in Klijn/Genentech vs HPA cell-line dataset. Left panel shows all genes considered in this study, with right panel showing genes with IQR < 2 in both datasets. Line of best linear fit (in log space) shown is y = 0.01 + 1.11 × (r2 = 0.937). (B) IQR of different genes in CCLE vs HPA cell-line dataset. Left panel shows all genes considered in this study, with right panel showing genes with IQR < 2 in both datasets. Line of best linear fit (in log space) shown is y = 0.04 + 0.99 × (r2 = 0.930). (C) Min vs Max: Median expression levels in HPA data set. Colour code as in Fig. 2.

Use of the Gini Coefficient to find GiniGenes in an unbiased manner in cell-line data sets

Up to now, our analyses of these data sets have used a set of predefined genes to look at expression stability. We next sought to investigate whether the GC would highlight genes with high expression stability that have been reported by others or by ourselves when performing this analysis in a data-driven manner. To that end, we found 115 genes shared between the three data sets with a GC ≤ 0.2 (Figs. 5, 6). This value for the GC was chosen since reducing this to ≤0.15 meant no or very few genes were found in some data sets (e.g. no genes in the CCLE data set had a GC ≤ 0.15) and going above this meant the number of genes were unmanageable (e.g. 1051 genes with a GC ≤ 0.21 in the Klijn data set). Of the 115 genes shared between the datasets with GC < 0.2, 13 were GiniGenes and two were housekeeping genes defined by Caracausi and colleagues (Fig. 5B). When we selected the top 20 expressing genes in each data set, only 13 of these were common across these data sets; Table 3 shows some descriptive statistics of 13 of these, with descriptive statistics of all 115 genes found in Supplementary Table S1. Of these genes, two (HNRNPK and PCBP1) are GiniGenes and one (SLC25A3) is a gene previously reported by Caracausi et al. Seven out of the 13 genes (HNRNPK, HNRNPC, PCBPB, SF3B1, SRSF3, EDF1 and EIF4H) here share important roles in RNA transcription, translation and stability[94-102], are implicated in a number of diseases, including cancer[94,97,103-113], and some, such as SRSF3 are essential for embryo development[114]. Given their pivotal functions, it may be unsurprising that the expression of these genes are tightly regulated across cell lines of different tissue origins, even where these are cancer cell lines. Overall, the distribution, expression stability and important functional roles of these genes suggest that these are excellent housekeeping genes across different cell types.

Figure 5

Shared and unique genes in HPA, CCLE and Klijn/Genentech cell-line data sets. (A) Genes with a GC < 0.2 .(B) Housekeeping genes in Table 2 with GC < 0.2.

Figure 6

GC vs Median for 115 genes in. (A) HPA, (B). CCLE and C. Klijn/Genentech cell-line data sets. Colour coding: Blue, Caracausi; Green, GeneGini reference genes; Grey, neither. Shape coding: Circle, other; Triangle, SLC coding gene.

Table 3

Descriptive statistics of 13 genes common across cell-line data sets with GC < 0.2.

Gene	Gini (HPA Cells)	Gini (CCLE)	Gini (Klijn)	Median (HPA Cells)	Median (CCLE)	Median (Klijn)	RSD (HPA Cells)	RSD (CCLE)	RSD (Klijn)	GeneGini (GG)/GeNorm (V), Eisenberg (EL), Lee (L), Caracausi © /N (Cell Line Data sets)	Reference	S/A/O	Protein name	Uniprot ID	Role
ARF1	0.18	0.19	0.18	316.70	423.00	517.00	32.54	35.87	35.17	N	N	O	ADP-ribosylation factor 1	P84077	Essential and ubiquitous GTP-binding protein regulators of vesicular trafficking and actin remodeling.
CNBP	0.15	0.20	0.16	324.24	602.00	637.50	28.47	37.37	29.49	N	N	O	Cellular nucleic acid-binding protein	P62633	Zinc finger protein, function unclear (Pellizzoni et al. 1997), regulates protein translation and transcription (Wei 2018)
DYNLL1	0.17	0.19	0.16	485.97	215.50	224.00	30.73	34.50	28.50	N	N	O	Dynein light chain 1, cytoplasmic	P63167	Component of dynein involved in intracellular transport and motility
EDF1	0.16	0.19	0.18	449.42	379.00	502.50	29.69	33.83	34.30	N	N	O	Endothelial differentiation-related factor 1	O60869	Modulates transcription of genes involved in endothelial differentiation, also acts as a transcriptional coactivator (Cazzaniga 2018)
EIF4H	0.15	0.18	0.17	294.21	553.50	673.00	27.91	33.27	30.64	N	N	O	Eukaryotic translation initiation factor 4H	Q15056	Translation initiation factor
HNRNPC	0.18	0.19	0.17	800.62	314.50	409.50	32.96	34.41	29.97	N	N	O	Heterogeneous nuclear ribonucleoproteins C1/C2	P07910	RNA binding protein involved in regulation of RNA splicing, export, expression, stability, and translation.
HNRNPK	0.14	0.17	0.12	603.32	548.00	625.50	25.19	29.60	21.35	GG	1	O	Heterogeneous nuclear ribonucleoprotein K	P61978	Regulation of RNA transcription and translation, splicing, nuclear export, and decay
PCBP1	0.14	0.20	0.16	291.40	336.00	452.00	24.52	36.23	29.01	GG	1	O	Poly(rC)-binding protein 1	Q15365	Regulation of mRNA transcription, translation and stability
PFDN5	0.17	0.20	0.19	451.20	158.00	152.50	31.60	41.30	35.69	N	N	O	Prefoldin subunit 5	Q99471	Molecular protein folding cytosolic chaperone. Prevents misfolding of newly synthesised nascent polypeptides
SF3B1	0.16	0.19	0.15	179.26	143.00	164.00	29.26	33.89	27.01	N	N	O	Splicing factor 3B subunit 1	O75533	Essential RNA-protein complex involved in pre-mRNA splicing
SLC25A3	0.17	0.18	0.16	471.21	154.00	193.00	30.19	32.86	28.39	C	65	S	Phosphate carrier protein, mitochondrial	Q00325	Phosphate transport from cytoplasm to mitochondria, with protons.
SRP14	0.17	0.19	0.17	224.37	296.00	347.50	30.36	34.77	30.62	N	N	O	Signal recognition particle 14 kDa protein	P37108	Signal-recognition-particle assembly has a crucial role in targeting secretory proteins to the rough endoplasmic reticulum membrane. Required for elongation arrest by binding with SRP9 to the Alu domain.
SRSF3	0.19	0.19	0.15	260.33	164.00	207.00	35.17	33.97	28.60	N	N	O	Serine/arginine-rich splicing factor 3	P84103	splicing factor that promotes exon inclusion during alternative splicing. Regulatory roles in RNA metabolism and functions such as mRNA splicing and 3’end processing. Essential for embryo development

In addition, the protein name, as well as UniProt ID and function are shown. S/A/O refers to SLC, ABC or Other respectively.

Shared and unique genes in HPA, CCLE and Klijn/Genentech cell-line data sets. (A) Genes with a GC < 0.2 .(B) Housekeeping genes in Table 2 with GC < 0.2. GC vs Median for 115 genes in. (A) HPA, (B). CCLE and C. Klijn/Genentech cell-line data sets. Colour coding: Blue, Caracausi; Green, GeneGini reference genes; Grey, neither. Shape coding: Circle, other; Triangle, SLC coding gene. Descriptive statistics of 13 genes common across cell-line data sets with GC < 0.2. In addition, the protein name, as well as UniProt ID and function are shown. S/A/O refers to SLC, ABC or Other respectively. Of particular interest to us was finding one gene encoding a mitochondrial phosphate transporter protein (SLC25A3[115]) to be within this list of the top expressing stably expressed genes. This might seem logical since mitochondrial ATP synthesis is required by all cell types and tissues. Figure 7 shows the robustness of the GC for the subset of 115 genes common between the three data sets studied here with a low GC (<0.2). Lower Gini coefficients correlate with lower IQR and Max:median ratios (Fig. 7: only results for the Klijn data set are shown). The range of IQR values of these genes was smaller in the larger two data sets (CCLE, 1.42–1.67; Klijn, 1.30–1.64) than in the HPA data set (1.26–1.84) suggesting the measured expression values were more stable in the larger data sets (Supplementary Table S1). This may, however, be due to a larger number of cell lines in these two large datasets (934 and 622 in CCLE and Klijn) compared with the HPA data set (56 cell lines).

Figure 7

Robustness of GC for finding stably expressed genes using shared genes between HPA, CCLE and Klijn/Genentech cell-line data sets with GC < 0.2. Shown are the results for the Klijn/Genentech dataset. (A) IQR vs GC, (B). Max:Mean vs Min. Colour coding: Blue, Caracausi; Green, GeneGini reference genes; Grey, neither. Shape coding: Circle, other; Triangle, SLC coding gene.

Application of the Gini coefficient to human tissue RNA-Seq data sets

The results presented thus far are representative of human cell lines. Most reports in the literature regarding housekeeping genes refer to tissue expression data. This may be due to the cell lines being “dedifferentiated” with respect to the tissues from which they are derived[116]. In our previous report[1] we also analysed RNA-Seq data from tissues[93] and found 22 genes with a GC < 0.15, of which 3 (CHMP2A, VPS29 and PCBP1) were also found in cell line data with a GC < 0.15. The median expression level and GC of these and other candidate GiniGenes in this tissue data set are shown in Fig. 8. As with cell line data, the genes we previously identified (GGs, green dots in Fig. 8) have much lower GCs in this tissue data set than do any of the other candidate GiniGenes, with only two of these genes (VPS29 and CHMP2A) identified previously by Eisenberg & Levanon[49]. The low GC value of these GiniGenes is not at the expense of low expression: of the 22 GiniGenes, 13 are expressed at a median level of between 40 and 200 TPM (see Supplementary Table S2). Moreover, the GC was also representative of the variation in expression of these genes (albeit influenced to a lesser extent by outliers), as shown in Fig. 9A,B, with all GiniGenes having a GC < 0.15 and the lowest RSD (relative standard deviation), ranging from 24.096% to 28.66% and IQR (1.26 to 1.44) of this list of housekeeping genes. The expression of other housekeeping genes such as GAPDH, ACTB, RPL13A, SDHA, B2M was quite varied according to these measures. For example, the GC of GAPDH (a commonly used HKG) was 0.33, with a RSD of 72.4% and IQR of 2.24, and for ACTB (another commonly used HKG) these values were 0.29, 55.24%, and 2.11.

Figure 8

Figure 9

Robustness of the Gini coefficient in the HPA tissue data set. (A) RSD versus Gini coefficient of candidate reference genes. Line of best linear fit (in log space) shown is y = 2.45 + 1.24 × (r2 = 0.938) (B). IQR versus Gini coefficient of candidate reference genes. Line of best linear fit (in log space) shown is y = 0.87 + 0.96 × (r2 = 0.566). Colour code as in Fig. 8.

Gini coefficient and median expression levels of proposed reference genes in the HPA tissue dataset. Colour coding: blue, Caracausi; purple, Eisenberg and Levanon; green, GeneGini reference genes; yellow, both GeneGini and Eisenberg and Levanon; orange, Lee; black, Vandesompele. Robustness of the Gini coefficient in the HPA tissue data set. (A) RSD versus Gini coefficient of candidate reference genes. Line of best linear fit (in log space) shown is y = 2.45 + 1.24 × (r2 = 0.938) (B). IQR versus Gini coefficient of candidate reference genes. Line of best linear fit (in log space) shown is y = 0.87 + 0.96 × (r2 = 0.566). Colour code as in Fig. 8. The median expression levels of the proposed reference genes show a similar level of correlation between the data sets as was found with the cell line data (Fig. S1A–C), and GiniGenes displayed a mid-range level of expression. The GC of the tissue GiniGenes we proposed however, tended to be higher and more variable in their GC values than in the HPA dataset (Fig. S2,A–C) suggesting that those genes may be representative of the HPA data set only. As an example, in the GTEx dataset only 28 genes had a GC < 0.2, of which the majority (17) were those reported by Caracausi and colleagues, and 7 were GiniGenes. The results here are likely influenced by the number and status (disease or normal) of the tissues analysed in the various data sets compared; for example, the GTEx data come from 53 different, normal human tissues, whereas the HPA tissue data include a mixture of disease and normal tissue samples. In addition, compared to the cell line data where hundreds (in the case of the Cancer Cell Line Encyclopedia) of cell lines were analysed, the number of tissues in these data sets was fewer than 100. In the case of the data set used by Eisenberg and Levanon[49], viz. the Illumina Human Body Map (E-MTAB-513), 10 of the 11 housekeeping genes proposed here (which included 2 Gini Genes, CHMP2A and VPS29) had a GC ≤ 0.2 and were reasonably well expressed (with median expression levels between 50–270 TPM, see Supplementary Table S2 and Supplementary Fig. S4). This may be compared to the 5 other GGs with GC < 0.2 in this data set whose expression value was lower, with median expression between 19–35 TPM. This suggests that finding suitable HKGs may be dependent on the data set itself, and the type of tissue under investigation. We next sought to perform a more comprehensive and integrative analysis by filtering the tissue data sets to only include genes with a GC ≤ 0.2 to find common genes across these data sets with reasonable expression stability (Supplementary Table S3). As shown in Fig. 10 only 15 genes were shared between the four data sets with a GC ≤ 0.2, none of which has been reported previously as a housekeeping genes. Table 4 shows some descriptive statistics of these genes. In any case, the names of the proteins encoded by these 15 genes suggest these play important and essential roles. The median expression values of these genes varied from around 10–450 TPM, with SNX3 (Sorting nexin-3 (Protein SDP3)) and COX4I1 (Cytochrome c oxidase subunit 4 isoform 1) being consistently the two highest-expressing genes.

Figure 10

Table 4

Descriptive statistics of 15 common genes across tissue data sets with a GC < 0.2.

Gene	Gini	Gini	Gini	Gini	Median	Median	Median	Median	% RSD	% RSD	% RSD	% RSD	Protein name	UniProt ID	Function (UniProt)
Gene	(HPA Tissue)	(GTEx)	(PCAWG)	(HBM)	(HPA Tissue)	(GTEx)	(PCAWG)	(HBM)	(HPA Tissue)	(GTEx)	(PCAWG)	(HBM)	Protein name	UniProt ID	Function (UniProt)
CHCHD4	0.19	0.14	0.19	0.19	13.08	17	20	25.69	36.11	27.26	40.35	35.44	Mitochondrial intermembrane space import and assembly protein 40	Q8N4Q1	Functions as a chaperone and catalyses formation of disulfide bonds in substrate proteins such as COX17, COX19 and MICU1. Required for import of small cysteine-containing proteins in the mitochondrial intermembrane space.
COPS5	0.17	0.17	0.2	0.17	45.27	19	20	20	32.46	30.76	43.24	33.27	COP9 signalosome complex subunit 5 (SGN5)	Q92905	Probable protease subunit of the COP9 signalosome complex (CSN), a complex involved in various cellular and developmental processes. The CSN complex is an essential regulator of the ubiquitin (Ubl) conjugation pathway by mediating the deneddylation of the cullin subunits of the SCF-type E3 ligase complexes, leading to decrease the Ubl ligase activity of SCF-type complexes such as SCF, CSA or DDB2. The complex is also involved in phosphorylation of p53/TP53, c-jun/JUN, IkappaBalpha/NFKBIA, ITPK1 and IRF8, possibly via its association with CK2 and PKD kinases. CSN-dependent phosphorylation of TP53 and JUN promotes and protects degradation by the Ubl system, respectively. In the complex, it probably acts as the catalytic center that mediates the cleavage of Nedd8 from cullins. It however has no metalloprotease activity by itself and requires the other subunits of the CSN complex. Interacts directly with a large number of proteins that are regulated by the CSN complex, confirming a key role in the complex. Promotes the proteasomal degradation of BRSK2.
COX4I1	0.17	0.12	0.18	0.16	447.69	123	144	94.13	33.09	22.96	37.11	28.92	Cytochrome c oxidase subunit 4 isoform 1, mitochondrial	P13073	This protein is one of the nuclear-coded polypeptide chains of cytochrome c oxidase, the terminal oxidase in mitochondrial electron transport.
IDH3G	0.16	0.18	0.17	0.18	44.67	56	60	34.75	28.6	31.58	33.02	32.45	Isocitrate dehydrogenase [NAD] subunit gamma, mitochondrial	P51553	Regulatory subunit which plays a role in the allosteric regulation of the enzyme catalyzing the decarboxylation of isocitrate (ICT) into alpha-ketoglutarate. The heterodimer composed of the alpha (IDH3A) and beta (IDH3B) subunits and the heterodimer composed of the alpha (IDH3A) and gamma (IDH3G) subunits, have considerable basal activity but the full activity of the heterotetramer (containing two subunits of IDH3A, one of IDH3B and one of IDH3G) requires the assembly and cooperative function of both heterodimers.
MAP2K2	0.2	0.17	0.18	0.17	60.91	55	58.5	30.75	36.87	31.05	34.07	31.65	Dual specificity mitogen-activated protein kinase kinase 2 (MAP kinase kinase 2) (MAPKK 2) (EC 2.7.12.2)	P36507	Catalyzes the concomitant phosphorylation of a threonine and a tyrosine residue in a Thr-Glu-Tyr sequence located in MAP kinases. Activates the ERK1 and ERK2 MAP kinases (By similarity).
MTIF3	0.18	0.17	0.19	0.19	45.15	51	55	72.63	33.81	30.61	37.88	37.82	Translation initiation factor IF-3, mitochondrial (IF-3(Mt))	Q9H2K0	IF-3 binds to the 28 S ribosomal subunit and shifts the equilibrium between 55 S ribosomes and their 39 S and 28 S subunits in favor of the free subunits, thus enhancing the availability of 28 S subunits on which protein synthesis initiation begins.
MTRF1L	0.17	0.19	0.19	0.14	7.86	11	17	17	31.84	33.29	34.42	28.57	Peptide chain release factor 1-like, mitochondrial	Q9UGC7	Mitochondrial peptide chain release factor that directs the termination of translation in response to the peptide chain termination codons UAA and UAG.
NDUFB8	0.16	0.16	0.18	0.19	143.5	39	37.5	56.63	30.35	28.76	33.91	35.07	NADH dehydrogenase [ubiquinone] 1 beta subcomplex subunit 8, mitochondrial	O95169	Accessory subunit of the mitochondrial membrane respiratory chain NADH dehydrogenase (Complex I), that is believed not to be involved in catalysis. Complex I functions in the transfer of electrons from NADH to the respiratory chain. The immediate electron acceptor for the enzyme is believed to be ubiquinone.
NMT1	0.2	0.2	0.18	0.16	29.71	46	51.5	39.94	36.83	35.09	34.69	29.15	Glycylpeptide N-tetradecanoyl-transferase 1 (EC 2.3.1.97)	P30419	Enzyme catalysing transfer of myristate from CoA to proteins. Required for full expression of the biological activiteies of several N-myristoylated proteins, including the alpha subunit of the signal-transducing guanine nucleotide-binding protein (G protein) GO (GNAO1; MIM 139311)
PPID	0.16	0.17	0.17	0.19	31.11	29	32.5	44.75	29.02	32.72	34.11	33.73	Peptidyl-prolyl cis-trans isomerase D (PPIase D) (EC 5.2.1.8)	Q08752	Catalyze the cis-trans isomerization of proline imidic peptide bonds in oligopeptides and accelerate the folding of proteins. This protein has been shown to possess PPIase activity and, similar to other family members, can bind to the immunosuppressant cyclosporin A.
RTCA	0.17	0.18	0.2	0.18	26.5	24	27	33.69	30.67	35.82	42.89	33.68	RNA 3’-terminal phosphate cyclase (RNA cyclase)	O00442	Catalyzes the conversion of 3’-phosphate to a 2’,3’-cyclic phosphodiester at the end of RNA. The mechanism of action of the enzyme occurs in 3 steps: (A) adenylation of the enzyme by ATP; (B) transfer of adenylate to an RNA-N3’P to produce RNA-N3’PP5’A; (C) and attack of the adjacent 2’-hydroxyl on the 3’-phosphorus in the diester linkage to produce the cyclic end product. The biological role of this enzyme is unknown but it is likely to function in some aspects of cellular RNA processing.
SELENOK	0.19	0.16	0.18	0.18	31.07	49	49	80.94	36.89	30.39	38.19	33.31	Selenoprotein K (SelK)	Q9Y6D0	Required for Ca2 + flux in immune cells and plays a role in T-cell proliferation and in T-cell and neutrophil migration (By similarity). Involved in endoplasmic reticulum-associated degradation (ERAD) of soluble glycosylated proteins (PubMed:22016385). Required for palmitoylation and cell surface expression of CD36 and involved in macrophage uptake of low-density lipoprotein and in foam cell formation (By similarity). Together with ZDHHC6, required for palmitoylation of ITPR1 in immune cells, leading to regulate ITPR1 stability and function (PubMed:25368151). Plays a role in protection of cells from ER stress-induced apoptosis (PubMed:20692228). Protects cells from oxidative stress when overexpressed in cardiomyocytes (PubMed:16962588).
SMG5	0.19	0.16	0.19	0.18	34.89	63	64	34.13	35.95	27.52	44.99	34.09	Protein SMG5 (EST1-like protein B)	Q9UPR3	Plays a role in nonsense-mediated mRNA decay. Does not have RNase activity by itself. Promotes dephosphorylation of UPF1. Together with SMG7 is thought to provide a link to the mRNA degradation machinery involving exonucleolytic pathways, and to serve as an adapter for UPF1 to protein phosphatase 2 A (PP2A), thereby triggering UPF1 dephosphorylation. Necessary for TERT activity.
SNX3	0.17	0.18	0.19	0.18	169.22	190	208.5	327.06	30.77	31.22	39.21	33.13	Sorting nexin-3 (Protein SDP3)	O60493	Phosphoinositide-binding protein required for multivesicular body formation. Specifically binds phosphatidylinositol 3-phosphate (PtdIns(P3)). Also can bind phosphatidylinositol 4-phosphate (PtdIns(P4)), phosphatidylinositol 5-phosphate (PtdIns(P5)) and phosphatidylinositol 3,5-biphosphate (PtdIns(3,5)P2) (By similarity). Plays a role in protein transport between cellular compartments. Together with RAB7A facilitates endosome membrane association of the retromer cargo-selective subcomplex (CSC/VPS). May in part act as component of the SNX3-retromer complex which mediates the retrograde endosome-to-TGN transport of WLS distinct from the SNX-BAR retromer pathway (PubMed:21725319, PubMed:24344282). Promotes stability and cell surface expression of epithelial sodium channel (ENAC) subunits SCNN1A and SCNN1G (By similarity). Not involved in EGFR degradation. Involved in the regulation of phagocytosis in dendritic cells possibly by regulating EEA1 recruitment to the nascent phagosomes (PubMed:23237080). Involved in iron homeostasis through regulation of endocytic recycling of the transferrin receptor TFRC presumably by delivering the transferrin:transferrin receptor complex to recycling endosomes; the function may involve the CSC retromer subcomplex (By similarity). In the case of Salmonella enterica infection plays arole in maturation of the Salmonella-containing vacuole (SCV) and promotes recruitment of LAMP1 to SCVs (PubMed:20482551).
SURF1	0.18	0.15	0.2	0.17	18.3	47	57.5	45.69	34.94	26.2	38.25	32.15	Surfeit locus protein 1	Q15526	Component of the MITRAC (mitochondrial translation regulation assembly intermediate of cytochrome c oxidase complex) complex, that regulates cytochrome c oxidase assembly.

In addition, the protein name, as well as UniProt ID and function are shown.

UpSetR[139] plot showing genes with a GC < 0.2 that are variously shared and unique across the PCAWG, HBM, GTEX and HPA tissue data sets. The data underpinning this plot can be found in Supplementary Table S4. Descriptive statistics of 15 common genes across tissue data sets with a GC < 0.2. In addition, the protein name, as well as UniProt ID and function are shown. Sorting nexins are a group of cytoplasmic and membrane-associated proteins involved in the regulation of intracellular trafficking[117]. SNX3 has been reported to play a role in receptor recycling and formation of multivesicular bodies[118], and its dysregulation has been implicated in disorders of iron metabolism and the pathogenesis of some neurodegenerative diseases[119,120]. The COX4I gene encodes the nuclear-encoded cytochrome c oxidase subunit 4 isoform 1, the terminal enzyme in the mitochondrial respiratory chain. Given the key role of the mitochondrial respiratory chain in all human cells (except red blood cells), stable expression of such a gene in all tissues may not be a surprising result. Increased RNA COX4I1 levels have been reported in sperm of an obese male rat model[121] and thus may play a role in obesity-related fertility problems, and reduced expression of this subunit leads to a reduction in mitochondrial respiration as well as sensitising cells to apoptosis[122]. The small number of genes shared between these data sets with a GC < 0.2 indicates that the data in these studies are more variable compared to cell lines alone. The cause of this variation may be due to the tissue data having been obtained from different subjects[123]. Moreover, tissues are themselves a mixture of cell types with varying levels of gene expression in each cell type[124], while cell lines are nominally clonal. Our results suggest that in the case of RNA-seq tissue data sets, where gene expression tends to be more variable, an unbiased approach, using the Gini coefficient, may be more fruitful in the search for stably expressed genes with which to perform normalisation, than the other commonly used methods used until now[123,125].

RT-qPCR analysis of gene expression stability of some housekeeping genes in 10 cell lines

In order to illustrate the utility of the GC to find suitable housekeeping genes, we next chose to assess this experimentally by RT-qPCR using a small subset of candidate reference genes (40; top 32 genes from genes ordered by GC and expression value from[94], plus 8 of the most commonly used from the literature, including seven from[66] and one (RPL32) from[126,127], and 10 cell lines from a range of tissues (see Tables 5 and 6). We first set a Cq value (which is inversely proportional to expression level) cut-off of 32, above which no expression is observed, and subsequently used the Cq values of genes in cell lines as a relative expression level (Cq cut off/Cq value of gene). Descriptive statistics of the expression of each gene in individual cell lines were then calculated. As a final step, the median expression value of each gene in individual cell lines was used to calculate descriptive statistics, including the GC, of gene expression across these cell lines. Figure 11 illustrates a KNIME workflow[128-130] that we wrote for this purpose. The raw data and descriptive statistics extracted are provided in Supplementary Tables S5 and S6 respectively, and the KNIME analysis workflow in Supplementary File 1.

Table 5

Details of human cell lines used for the assessment of expression of candidate reference genes by RT-qPCR.

Cell line	Tissue	Disease	Morphology	Growth mode	Media
K562	Blood	Chronic Myeloid Leukemia	Lymphoblast	Suspension	RPMI-1640
HEK293	Kidney	Immortalized cell line obtained by transfecting sheared adenovirus 5 DNA	Epithelial	Adherent	DMEM
Panc1	Pancreas	Pancreatic carcinoma of ductal origin	Epithelial	Adherent	DMEM
SH-SY5Y	Neuroblastoma	metastasis	Neuroblast	Adherent	DMEM
T24	Bladder	bladder carcinoma	Epithelial	Adherent	McCoy’s 5A
J82	Bladder	Transitional cell carcinoma	Epithelial	Adherent	EMEM
RT-112	Bladder	Carcinoma	Epithelial	Adherent	RPMI-1640
5637	Bladder	Grade II carcinoma	Epithelial	Adherent	RPMI-1640
PC3	Prostate	Grade IV adenocarcinoma	Epithelial	Adherent	Ham’s F12
PNT2	Prostate	Immortalized with SV40	Epithelial	Adherent	RPMI-1640

Table 6

Candidate reference genes used to assess expression stability experimentally by RT-qPCR.

Gene Name	Uniprot	Gini (HPA Cell Lines)	GeneGini (GG)/GeNorm (V), Eisenberg & Levanon (EL), Lee (L), Caracausi (C), Zhang & Kriegova (ZK)	S/A/O	Reference
ACTB	P60709	0.26	V	O	[66]
B2M	P61769	0.44	V	O	[66]
BTF3	P20290	0.15	GG	O	[1]
C2orf49	Q9I8G4	0.14	GG	O	[1]
CHTOP	Q9Y3Y2	0.14	GG	O	[1]
CLINT1	Q14677	0.14	GG	O	[1]
CNOT2	Q9NZN8	0.14	GG	O	[1]
CNOT4	O95628	0.13	GG	O	[1]
GAPDH	P04406	0.27	V	O	[66]
GGNBP2	Q5SV77	0.15	GG	O	[1]
GORASP2	Q9H8Y8	0.15	GG	O	[1]
HMBS	P08397	0.26	V	O	[66]
HNRNPK	P04637	0.14	GG	O	[1]
HPRT1	P00492	0.31	V	O	[66]
IK	Q13123	0.13	GG	O	[1]
INTS14	Q96SY0	0.14	GG	O	[1]
KAT5	Q92993	0.13	GG	O	[1]
MDH1	P40925	0.15	GG	O	[1]
NACA	Q13765	0.15	GG	O	[1]
NXF1	Q9UBU9	0.12	GG	O	[1]
PARK7	Q99497	0.14	GG	O	[1]
PCBP1	Q15365	0.14	GG	O	[1]
PCBP2	Q15366	0.14	GG	O	[1]
RBM45	Q8IUH3	0.12	GG	O	[1]
RNF123	Q5XPI4	0.15	GG	O	[1]
RPL13A	P40429	0.21	V	O	[66]
RPL32	P62910	0.22	ZK	O	[126,127]
RPL41	P62945	0.15	GGC	O	[1,65]
RPRD2	Q5VT52	0.14	GG	O	[1]
SDHA	P31040	0.29	V	O	[66]
SF3B2	Q13435	0.11	GG	O	[1]
SNW1	Q13573	0.13	GG	O	[1]
SRP19	P09132	0.14	GG	O	[1]
SUPT7L	O94864	0.13	GG	O	[1]
TRNT1	Q96Q11	0.15	GG	O	[1]
TXNL1	O43396	0.14	GG	O	[1]
UBE2Q1	Q7Z7E8	0.14	GG	O	[1]
UBR2	Q8IWV8	0.14	GG	O	[1]
UXT	Q9UBK9	0.13	GG	O	[1]
VPS29	Q9UBQ0	0.15	GGEL	O	[1,49]

Included are gene name and UniProt ID, Gini coefficient as calculated using the HPA cell-line data set. S/A/O refers to SLC, ABC or Other respectively.

Figure 11

The KNIME workflow described here to calculate descriptive statistics and the Gini coefficient from RT-qPCR data. This workflow can be adapted for use with large RNA-Seq Data sets.

Details of human cell lines used for the assessment of expression of candidate reference genes by RT-qPCR. Candidate reference genes used to assess expression stability experimentally by RT-qPCR. Included are gene name and UniProt ID, Gini coefficient as calculated using the HPA cell-line data set. S/A/O refers to SLC, ABC or Other respectively. The KNIME workflow described here to calculate descriptive statistics and the Gini coefficient from RT-qPCR data. This workflow can be adapted for use with large RNA-Seq Data sets. Figure 12 uses RT-qPCR data to plot the GC of the candidate reference genes analysed here versus their relative median expression level. Three GiniGenes[94] (RBM45, TRNT1 and CNOT2) had very low and variable expression. Most of the other genes analysed showed low GC values with a range of (relative) expression values; the inset in Fig. 12 shows genes with a GC < 0.2 including a mix of 35 genes: 26 GiniGenes and 6 housekeeping genes referenced by Vandesompele and colleagues[66], one referenced by Caracausi[65] and one by Lee et al.[131]. Two of these GiniGenes, HNRNPK and PCBP1, which we also found to be stably expressed in the cell line data suggesting these may be potential stable housekeeping genes. As shown in Fig. 13 and inset, the GC is well correlated with the % RSD.

Figure 12

Figure 13

Robustness of the Gini coefficient in assessed experimentally by RT-qPCR using a small subset of proposed reference genes. Left panel shows Gini coefficient vs % RSD for all genes considered in this study, with right panel showing the same with genes with a GC < 0.2 and % RSD <10. Line of best linear fit shown is y = 0.002 + 0.004 × (r2 = 0.988). Shape coding as in Fig. 12.

Gini coefficient and median expression levels of candidate reference genes assessed by RT-qPCR. Left panel shows all genes considered in this study, with right panel showing genes with GC < 0.2. Colour coding: green, GeneGini reference genes; red, both GeneGini and Caracausi reference genes; yellow, GeneGini and Eisenberg and Levanon; orange, Lee, yellow; black, Vandesompele; purple, Zhang and Kriegova. Robustness of the Gini coefficient in assessed experimentally by RT-qPCR using a small subset of proposed reference genes. Left panel shows Gini coefficient vs % RSD for all genes considered in this study, with right panel showing the same with genes with a GC < 0.2 and % RSD <10. Line of best linear fit shown is y = 0.002 + 0.004 × (r2 = 0.988). Shape coding as in Fig. 12. More importantly, the GC of our GiniGenes was particularly low (Fig. 12). The low absolute magnitude reflected the fact that Cq value is based on a logarithmic scale. Various commonly used housekeeping genes (HPRT1, GAPDH, ACTB, SDHA, HMBS and B2M) displayed higher % RSDs and GC than other genes studied here in spite of their higher relative expression levels. This was also the case when inspecting the interquartile ratio against the GC of these (Fig. S3). The above results suggest that the GC is also applicable to RT-qPCR data, with GiniGenes having good potential (as novel “housekeeping” genes) for the normalisation of such data.

Discussion

Reference genes are commonly used to normalise gene expression data, so as to account for bias resulting from both biological and technical variability, and to enable quantification of gene expression changes or differences in the system under study. It is generally considered that such reference genes should come from pathways that are required for general metabolism, using only one gene per ‘pathway’ to avoid co-regulation which might make the gene expressions look very stable. Such reference genes are commonly referred to as ‘housekeeping’ genes (HKGs) because they are considered to participate in essential cellular functions, are ubiquitously expressed in all cells and tissue types, and their expression is considered to be stable[49-66]). A number of such genes have been proposed over the years, and genes such as GAPDH, ACTB, RPL13A, SDHA, B2M are frequently used in such studies[66]. However, the expression levels of these and other proposed HKGs have in fact been shown to vary widely between cells and tissues (e.g.[53,62,65,67-79]) and their expression has also been reported to be affected by a number of factors relating to the experiment such as cell confluence[132], pathological, experimental and tissue specific conditions[133]. As highlighted by Huggett et al.[134], despite the reports of the potential variability of expression of ‘classic’ references genes such as GAPDH and ACTB, these are still used without mention of any validation processes. Our GiniGenes are selected as reference genes through different, data-driven, criteria. Various tools have been developed to evaluate and screen reference genes from experimental datasets; these include geNorm[66], NormFinder[135], Best Keeper[136] and the comparative ΔCT finder[52]. RefFinder (http://leonxie.esy.es/RefFinder/#) and RefGenes (https://refgenes.org/rg//) can integrate these to enable a comparison and ranking of any tested candidate reference genes[137]. These tools assess expression stability of genes in different ways: geNorm determines gene stability through a stepwise exclusion or ranking process followed by averaging the geometric mean of the most stable genes from a chosen set. Python implementation: https://eleven.readthedocs.io/en/latest/. BestKeeper also uses the geometric mean but using raw data rather than copy numbers. BestKeeper[136] can be used as an Excel-based tool. It can accommodate up to 10 housekeeping genes in up to 100 biological samples. Optimal HKGs are determined by pairwise correlation analysis of all pairs of candidate genes, and the geometric mean of the top-ranking ones. http://www.gene-quantification.info. NormFinder measures variation, and ranks potential reference genes between study groups. NormFinder[135] has an add-in for Microsoft Excel and is available as an R programme. It recommends analysis of 5–10 candidate genes and at least 8 samples per group. https://moma.dk/normfinder-software. The comparative ΔCT finder requires no specialist programmes since this involves comparison of comparisons of ΔCTs between pairs of genes to find a set of genes that show least variability. RefGenes allows one to find genes that are stably expressed across tissue types and experimental conditions based on microarray data, and a comparison of results from geNorm, NormFinder and Best Keeper to find a set of reference genes. However, this is not a free service unless one searches for one gene at a time. Furthermore, the site for this tool is no longer available. Moreover, all these tools require the user to make a prior selection of such HKGs (introducing bias and potential errors) and most are cumbersome to understand and calculate. We have here shown how via a simple calculation, the GC, we can find potential reference genes, and illustrated its utility in large-scale cell-line, tissue RNA-Seq data sets and RT-qPCR data. The expression of a number of classical HKGs from a number of carefully selected publications do in fact vary much more substantially between large RNA-Seq data sets, both for tissues and cell lines. Whilst not all studies will involve large data sets such as those we have analysed here; the GC should also be of use for smaller-scale studies to select a subset of genes in a panel of cell lines or tissues relevant to the study in question. Overall we find that (i) two of these genes, HNRNPK and PCBP1, seemed to be particularly robustly and stably expressed at reasonable levels in all cell lines studied, and (ii) a data-driven strategy based on the GC represents a useful and convenient method for normalisation in gene expression profiling and related studies.

Methods

The datasets used are described and referenced below. The data, in transcripts per million (TPM) units were downloaded from the EBI expression atlas as a .tsv file. As previously[1], the Gini Index was calculated using the ineq package (Achim Zeileis (2014). ineq: Measuring Inequality, Concentration, and Poverty. R package version 0.2–13. https://CRAN.R-project.org/package=ineq) in R (https://www.R-project.org/). These calculations were incorporated into KNIME via KNIME’s R integration R Snippet node. A spreadsheet giving the extracted analyses is provided as Supplementary Tables (Tables S7 and S8).

Cell lines and culture conditions

A panel of 10 cell lines were grown in appropriate growth media: K562, PNT2 and T24 in RPMI-1640 (Sigma, Cat No. R7509), Panc1 and HEK293 in DMEM (Sigma, Cat No. D1145), SH-SY5Y in 1:1 mixture of DMEM/F12 (Gibco, Cat No. 21041025), J82 and RT-112 in EMEM (Gibco, Cat No. 51200–038), 5637 in Hyclone McCoy’s (GE Healthcare, Cat No. SH30270.01) and PC3 in Ham’s F12 (Biowest, Cat No. L0135-500). All growth media were supplemented with 10% fetal bovine serum (Sigma, Cat No. f4135) and 2 mM glutamine (Sigma, Cat No. G7513) without antibiotics. Cell cultures were maintained in T225 culture flasks (Star lab, CytoOne Cat No. CC7682-4225) kept in a 5% CO2 incubator at 37 °C until 70–80% confluent.

Harvesting Cells for RNA Extraction

Cells from adherent cell lines were harvested by removing growth media and washing twice with 5 mL of pre-warmed phosphate buffered saline (PBS) (Sigma, Cat No. D8537), then incubated in 3 mL of 0.025% trypsin-EDTA solution (Sigma Cat No. T4049) for 2–5 min at 37 °C. At the end of incubation cells were resuspended in 5–7 mL of respective media when cells appeared detached to dilute trypsin treatment. The cell suspension was transferred to 15 mL centrifuge tubes and immediately centrifuged at 300 × g for 5 min. Suspended cell lines were centrifuged directly from cultures in 50 mL centrifuge tubes and washed with PBS as above. The cell pellets were resuspended in 10–15 mL media and cell count and viability was determined using a Nexcellom Cellometer Auto 1000 Cell Viability Counter (Nexcellom Bioscience) set for Trypan Blue membrane exclusion method. Cells with >95% viability were used for downstream total RNA extraction.

RNA Extraction

Total RNA was extracted from 2–5 × 106 cells using the Qiagen RNeasy Mini Kit (Cat No. 74104) and DNAse treated using Turbo DNA-free kit (Invitrogen, Cat No. AM1907) according to the manufacturer’s instructions. Briefly, 1 X DNA buffer was added to the extracted RNA prior to adding 2U (1 µL) of DNAse enzyme. The reaction mixture was incubated at 37 °C for 30 min and inactivated for 2 min at room temperature using DNAse inactivating reagent. The mixture was centrifuged at 10,000 × g for 1.5 min and the RNA from the supernatant was transferred to a clean tube. The RNA concentration was determined using a NanoDrop® ND-1000 spectrophotometer and further validated using an Agilent 2100 bio-analyser coupled with 2100 Expert software system. Only RNA samples with an RIN (RNA Integrity Number) between 9–10 were selected for cDNA synthesis.

Reverse Transcription and cDNA Synthesis

1 µg of RNA was reverse transcribed into cDNA. Briefly, a 20 µL reaction was setup by adding 1 µL each of oligodT (50 µM, Invitrogen, cat No. 18418020) and dNTP mix (10 mM, Invitrogen, Cat No. 18427-013) followed by adding an appropriate volume for 1 µg of RNA. Nuclease free water (Ambion, Cat No. AM9937) was then added to make the volume up to 13 µL and incubated at 65 °C for 5 min then cooled on ice for 1 min. To initiate transcription 4 µL of 5 X first strand buffer (Invitrogen, Cat No. 1889832) and 1 µL each of 0.1 M DTT (Invitrogen, Cat No. 1907572), RNaseOUT™ (Invitrogen, Recombinant RNase Inhibitor, Cat No. 1905432) and SuperScript™ III RT (200 units/µL, Invitrogen, Cat No. 1685475) reverse transcriptase enzyme were added, mixed gently then incubated at 50 °C for 60 min followed by inactivation at 70 °C for 15 min. The cDNA was diluted 1:100 to be used in RT-qPCR experiment.

Validation of gene expression by geNorm

A set of candidate reference genes (40; top 32 genes from genes ordered by GC and expression value from[94], plus 8 of the most commonly used from the literature including seven from[66]). RNAseq data were selected for validation of stable gene expression using geNorm[66]. First, a typical qPCR protocol was prepared from a master mix for each gene to be tested per cell line in triplicate. This consisted of 10 µL/well made by adding 0.8 µL of nuclease free water (Ambion), 5 µL of LC480 SYBR Green I Master (2 X conc. Roche, Product No. 04887352001), 0.1 µL each of forward and reverse primers (20 µM) (for primer and amplicon sequences see Supplementary Table S9) and 4 µL of 1:100 diluted cDNA in a 384 well qPCR plate (Starlab Cat. No. E1042-9909-C). The no template controls (NTC) for each gene were produced by replacing cDNA with 4 µL of nuclease free water. Thermal cycling conditions used were: one cycle of 95 °C for 10 min followed by 40 cycles of 95 °C for 10 sec and 60 °C for 30 sec. qPCR was performed using Roche LightCycler LC480 qPCR platform. The fluorescence signals were measured in real time during amplification cycle (Cq) and also during temperature transition for melt curve analysis. The mean Cq values were converted into relative values for a gene across all cell lines using ΔCq method[138]. Briefly, the lowest Cq value in a panel of cell lines for a gene was subtracted from all the values in that panel using the equation: , where is the mean Cq value obtained for a gene in each of the cell lines and is the lowest Cq value in that panel. The relative values for each gene in a panel were then obtained by applying . These relative values were applied in geNorm Visual Basic applet for Microsoft Excel®[66] that determines the most stable reference genes from a set of genes in a given panel of cell lines.

Validation of gene expression using the Gini coefficient

To the raw RT-qPCR data a Cq value (which is inversely proportional to expression level) cut-off of 32 was set, above which no expression is observed. The Cq values of genes in cell lines were subsequently converted to a relative expression level (Cq cut off/Cq value of gene). Descriptive statistics of the expression of each gene in individual cell lines were then calculated. As a final step, the median expression value of each gene in individual cell lines was used to calculate descriptive statistics, including the GC, of gene expression across these cell lines. Figure 11 illustrates a KNIME workflow[128-130] for this purpose. The raw data and descriptive statistics extracted are provided in Supplementary Tables S5 and S6 respectively, and the KMNIME analysis workflow in Supplementary File 1.

11 in total

1. Gini Coefficients as a Single Value Metric to Define Chemical Probe Selectivity.

Authors: Andrei Ursu; Jessica L Childs-Disney; Alicia J Angelbello; Matthew G Costales; Samantha M Meyer; Matthew D Disney
Journal: ACS Chem Biol Date: 2020-07-09 Impact factor: 5.100

2. The spatially resolved transcriptional profile of acute T cell-mediated rejection in a kidney allograft.

Authors: Fadi Salem; Laura Perin; Sargis Sedrakyan; Andrea Angeletti; Gian Marco Ghiggeri; Maria Cristina Coccia; Marty Ross; Miguel Fribourg; Paolo Cravedi
Journal: Kidney Int Date: 2021-09-20 Impact factor: 18.998

3. Evaluation of low-pass genome sequencing in polygenic risk score calculation for Parkinson's disease.

Authors: Sungjae Kim; Jong-Yeon Shin; Nak-Jung Kwon; Chang-Uk Kim; Changhoon Kim; Chong Sik Lee; Jeong-Sun Seo
Journal: Hum Genomics Date: 2021-08-28 Impact factor: 4.639

Review 4. Intelligent host engineering for metabolic flux optimisation in biotechnology.

Authors: Lachlan J Munro; Douglas B Kell
Journal: Biochem J Date: 2021-10-29 Impact factor: 3.857

5. Dimensionality Reduction and Louvain Agglomerative Hierarchical Clustering for Cluster-Specified Frequent Biomarker Discovery in Single-Cell Sequencing Data.

Authors: Soumita Seth; Saurav Mallik; Tapas Bhadra; Zhongming Zhao
Journal: Front Genet Date: 2022-02-07 Impact factor: 4.599

6. What are housekeeping genes?

Authors: Chintan J Joshi; Wenfan Ke; Anna Drangowska-Way; Eyleen J O'Rourke; Nathan E Lewis
Journal: PLoS Comput Biol Date: 2022-07-13 Impact factor: 4.779

7. StanDep: Capturing transcriptomic variability improves context-specific metabolic models.

Authors: Chintan J Joshi; Song-Min Schinn; Anne Richelle; Isaac Shamie; Eyleen J O'Rourke; Nathan E Lewis
Journal: PLoS Comput Biol Date: 2020-05-12 Impact factor: 4.475

8. An untargeted metabolomics strategy to measure differences in metabolite uptake and excretion by mammalian cell lines.

Authors: Marina Wright Muelas; Ivayla Roberts; Farah Mughal; Steve O'Hagan; Philip J Day; Douglas B Kell
Journal: Metabolomics Date: 2020-10-07 Impact factor: 4.290

Review 9. The Transporter-Mediated Cellular Uptake and Efflux of Pharmaceutical Drugs and Biotechnology Products: How and Why Phospholipid Bilayer Transport Is Negligible in Real Biomembranes.

Authors: Douglas B Kell
Journal: Molecules Date: 2021-09-16 Impact factor: 4.411

10. Theil Entropy as a Non-Lineal Analysis for Spectral Inequality of Physiological Oscillations.

Authors: Ramón Carrazana-Escalona; Miguel Enrique Sánchez-Hechavarría; Ariel Ávila
Journal: Entropy (Basel) Date: 2022-03-04 Impact factor: 2.524