Computational approaches for predicting key transcription factors in targeted cell reprogramming (Review).

Guillermo-Issac Guerrero-Ramirez¹, Cesar-Miguel Valdez-Cordoba¹, Jose-Francisco Islas-Cisneros¹, Victor Trevino¹.

Abstract

There is a need for specific cell types in regenerative medicine and biological research. Frequently, specific cell types may not be easily obtained or the quantity obtained is insufficient for study. Therefore, reprogramming by the direct conversion (transdifferentiation) or re‑induction of induced pluripotent stem cells has been used to obtain cells expressing similar profiles to those of the desired types. For this, a specific cocktail of transcription factors (TFs) is required for induction. Nevertheless, identifying the correct combination of TFs is difficult. Although certain computational approaches have been proposed for this task, their methods are complex, and corresponding implementations are difficult to use and generalize for specific source or target cell types. In the present review, four computational approaches that have been proposed to obtain likely TFs were compared and discussed. A simplified view of the computational complexity of these methods is provided that consists of three basic ideas: i) The definition of target and non‑target cell types; ii) the estimation of candidate TFs; and iii) filtering candidates. This simplified view was validated by analyzing a well‑documented cardiomyocyte differentiation. Subsequently, these reviewed methods were compared when applied to an unknown differentiation of corneal endothelial cells. The generated results may provide important insights for laboratory assays. Data and computer scripts that may assist with direct conversions in other cell types are also provided.

Substances:
Transcription Factors

Year: 2018 PMID: 29845286 PMCID: PMC6072137 DOI: 10.3892/mmr.2018.9092

Source DB: PubMed Journal: Mol Med Rep ISSN: 1791-2997 Impact factor: 2.952

Introduction

In tissue engineering and regenerative medicine, there is a need for large quantities of specific cell types (1,2). For example, in corneal disease the use of transplants is essential, although access to corneal tissues is difficult given the shortage of tissue donors. Therefore, an alternative for generating specific corneal cells is needed (3). Furthermore, specific cell types are required in research for characterization, including studies on responses to treatment or genetic regulatory networks (4–7). For these needs, stem cell technologies hold the promise of providing a sufficient number of cells of specialized lineages (2). Such promise is based on certain factors, including the fact that cell differentiation may be reversed, that somatic cells may be induced to be pluripotent, or that cells may be forced to alter their identity or to transdifferentiate (8). In this context, cell identity or cell state is thought to be a highly regulated process that depends on epigenetic and transcriptional programming (9). The cell state is defined as the transcriptional output of a gene regulatory network (10). Thus, the cell state is principally controlled by the expression of transcription factors (TFs) forming specific network modules to ensure stable gene expression (7). However, genome analyses have identified approximately 2,000 TFs, and it is known that approximately one-half are expressed in a given cell (11). Thus, there is a requirement to elucidate which and how many TFs define specific cell states. The majority of the current literature in stem cells suggests that only a few TFs are required to maintain cell identity (7,12–14). For example, only four TFs (MYC proto-oncogene bHLH transcription factor, Kruppel like factor 4, SRY-box 2 and POU class 5 homeobox 1) are required to maintain the pluripotency state (8,15).
These factors were identified from serial rounds of gene inclusion and withdrawal from a pool of 24 potential genes selected from studies performed on isolated genes. From this seminal work, other research groups identified several TFs for direct conversion (16–18). For example, glutamic-oxaloacetic acid transaminase 1 was used to convert fibroblasts into functional neurons (16) while GATA-binding protein 4 (GATA4), myocyte enhancer factor 2C (MEF2C) and T-box 5 (TBX5) were used to convert fibroblasts into cardiomyocytes (17). Moreover, alternative combinations of TFs may lead to very similar cell types (18), suggesting that redundancy exists in which the genetic regulatory networks characteristic of the cell identity may be established by similar or equivalent combinations of TFs. Thus, if a cell state can be defined by a combination of TFs, in theory, any source cell type may be converted into any target cell type by establishing the expression of those TFs. If the differences in expression between the source and target cells are very small, one may consider subtle methods based on stimulating or blocking connected pathways. If the differences are large, as is commonly the case in converting fibroblasts to a lineage-distant cell type, one may opt to force expression by transdifferentiation or direct conversion (19–21) or via the generation of induced pluripotent stem cells (iPSCs) followed by the induction of the target cell type (13). For other specific cell types, it is necessary to identify how candidate TFs may be obtained to begin with or how alternative TFs may be obtained. In the present study, the focus will be on providing simplified views of the computational approaches that have been proposed to identify a set or sets of putative TFs likely to control the cell state of the desired cell type. This proposed view may be highly illustrative for non-bioinformatics specialists for a number of reasons.
Firstly, previously proposed computational methods are complex. Secondly, the literature accompanying the computational methods is highly technical. Thirdly, the descriptions of certain methods may appear vague for non-specialists. Fourthly, certain data (specifically, the networks) or computer scripts and tools described in the algorithms are currently unavailable, complicating re-implementations. Finally, the majority of approaches were proposed using ad-hoc parameters and specific datasets. In addition, for bioinformatics specialists, a succinct starting point for novel implementations was provided by the present review. To overcome the aforementioned difficulties, a simplified and unified view of current methods was provided, which may be summarized thus: i) The establishment of the population of cell types; ii) the estimation of candidate TFs from cell populations; and iii) the filtering of TF pre-candidates (the most challenging element). Derived from these summarized concepts, clues as to how the methods work are provided, in addition to knowledge as to how to overcome or approach difficulties. Possible ways in which these computational methods may be re-implemented and adapted to provide a preliminary list of TFs are additionally provided.

Identifying key cell-state transcription factors

The idea that cell states are associated with the binary decision of cell fates has long been proposed (22). However, computational approaches to identify key TFs governing cell states are more recent. In practice, an aim may be to directly convert a specific source cell type into a target cell type; therefore, the most important component is the estimation of the target cell state, since the state of the source cell type may be forced to change. The source cell type remains important for estimating those TFs that may be redundant and perhaps do not require manipulation; this may be easily performed by comparing expression levels. Therefore, the majority of methods primarily focus on the estimation of TFs controlling the target cell state. The following sections consider the approaches of recent studies (23–26), which are accordingly referred to as Cahan et al (23), D'Alessio et al (24), Rackham et al (25) and Okawa et al (26).

Identification of TFs via differential expression

Under the assumption that the cell identity is controlled by the gene expression level of a specific set of TFs, it follows that the identity of cell types is controlled by either different levels of the same set of TFs or a different set of TFs (7). In any case, the same operation is needed: The identification of the characteristic and distinct gene expression levels. This is best known as differential expression. Since this operation involves the comparison between at least two populations assumed to be distinct, the target cell type population and the ‘background’ population require careful selection. In theory, if these populations are well defined and the available data are highly representative and precise, it ought to be possible to create a small list of TFs. However, even today, the available data are scarce, highly noisy and contaminated with different populations of cells; the data from in vitro assays may not reflect genuine in vivo properties; and the computational and statistical tools may be imperfect. Therefore, the differential expression between the defined cell populations usually generates large lists of pre-candidate TFs.
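
As a minimal sketch of this comparison, the following Python snippet flags TFs whose mean expression differs between a target and a non-target population using a permutation test. It is not the procedure of any of the reviewed methods; the function names, the significance threshold, the number of permutations and the toy expression values are all assumptions introduced for illustration.

```python
import random
from statistics import mean

def permutation_pvalue(target, non_target, n_perm=2000, seed=0):
    """Two-sided permutation P-value for a difference in mean expression."""
    rng = random.Random(seed)
    observed = abs(mean(target) - mean(non_target))
    pooled = list(target) + list(non_target)
    n_target = len(target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of samples
        diff = abs(mean(pooled[:n_target]) - mean(pooled[n_target:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible P = 0.
    return (hits + 1) / (n_perm + 1)

def pre_candidates(expr_target, expr_non_target, alpha=0.05):
    """Return (TF, P-value) pairs that pass the threshold, most significant first."""
    selected = []
    for tf in expr_target:
        p = permutation_pvalue(expr_target[tf], expr_non_target[tf])
        if p < alpha:
            selected.append((tf, p))
    return sorted(selected, key=lambda item: item[1])

# Toy data (invented): TF_A is high only in the target, TF_B does not differ.
expr_target = {"TF_A": [0.90, 0.85, 0.95, 0.88],
               "TF_B": [0.40, 0.55, 0.50, 0.45]}
expr_non_target = {"TF_A": [0.10, 0.20, 0.15, 0.12, 0.18, 0.11],
                   "TF_B": [0.50, 0.42, 0.47, 0.52, 0.44, 0.49]}
result = pre_candidates(expr_target, expr_non_target)
print(result)
```

With real data, thousands of TFs are tested at once, so a multiple-testing correction would normally be applied before the list is interpreted; that step is omitted here for brevity.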

Filtering problem

Assuming that the number of TFs controlling the cell identity is small, this large list of pre-candidate TFs ought to be highly contaminated with false-positive calls representing cell-state-irrelevant TFs that require filtering out. Although certain irrelevant TFs may be easily identified by expert researchers and available biological knowledge in the literature, this process is time-consuming and may be prone to misinterpretations, errors and omissions. In addition, certain TFs may not be well studied or studied at all. Furthermore, manual filtering of the list causes difficulties in the scoring or ranking of TFs according to the scientific literature. Therefore, the systematic filtering and ranking of pre-candidate TFs is a challenging issue. This filtering process is obscured in original research articles due to the complexity of their implementations. The majority of the considered methods perform this filtering procedure analyzing the TFs within the context of biological networks. Although this may be considered to be a drawback by non-bioinformatics specialists, this step need not be very complicated to help to reduce large lists. In particular, within the examples provided, even when no filtering is used, sensible results may be obtained if target and non-target cell populations are well defined. In summary, the proposed view of the process to identify TFs likely controlling a cell state is demonstrated in Fig. 1 and is discussed in the following sections. In practice, it may be advisable to start with a specific source cell type for induction to a target cell type, whereas the majority of methods focus on the target cell type to identify TFs associated with the cell state (23–25). Thus, once the cell types have been identified, as depicted in Fig. 1, the TF expression profile of the source cell type is compared with the target to identify those TFs required to induce from that particular source.

Figure 1.

Simplified view of TF identification for cell conversion. (A) Process of defining at least two cell populations. (B) Differential expression analysis of TFs between defined populations to identify pre-candidate TFs. (C) Filtering process of pre-candidates in order to generate a short list of TFs whose overexpression will likely control the desired cell state. TF, transcription factor.

Defining the populations of cell types

The first step consists of defining at least two populations of cell types (Fig. 1A), which are referred to as target and non-target cell types. A comparison of conceptual definitions by the authors is demonstrated in Fig. 2 and discussed in the following paragraphs.

Figure 2.

Comparison of the definition of cell populations.

Datasets used

Gene expression data are required to be uniformly annotated for target and non-target cell types. Therefore, the majority of methods utilize information from the vast collections of microarray gene expression data available from the Gene Expression Omnibus (GEO) (27,28) and ArrayExpress (29,30), or from more recent next-generation sequence repositories in ENCODE (31) or FANTOM (6). The repositories used are detailed in Table I. The majority of the studies discussed in the present review used GEO microarray data, except Rackham et al (25), who used FANTOM5. They studied human data, although Cahan et al (23) additionally included murine data. The majority of the studies included numerous cell types; however, Okawa et al (26) used progenitor and daughter cell types from specific third-party authors.

Table I.

Definition of populations of cell types by all methods.

Author, year	Data	Target	Non-targets	(Refs.)
Cahan et al, 2014	GEO, queried datasets, 16–20 cell types	Several samples of the same cell or tissue type	Remaining cell types	(23)
D'Alessio et al, 2015	GEO, 504 datasets, 233 cell types	Several samples of the same cell or tissue type	Remaining cell types (balanced)	(24)
Rackham et al, 2016	FANTOM5, >700 datasets (CAGE-Seq)	Samples of the same cell type	Remaining cell types but avoiding close and distant related ones	(25)
Okawa et al, 2016	GEO, Specific data	A daughter cell type	The progenitor and sister cell types	(26)

GEO, Gene Expression Omnibus.

Target cell type

For the target population, which is generally the easiest to delimit, a number of considerations are noteworthy. First, the target cell type is required to be well represented. Among the studies reviewed herein, some datasets came from experiments performed in vitro, while others were obtained from tissue samples. The experiments performed in vitro have the advantage of a well-defined cell type, whilst the tissue samples may represent a mixture of distinct cell types generating an average cell state that may not properly represent the desired target. Second, the gene expression data may reflect the cell state of an individual donor instead of a population-generalizable cell state. Thus, it is desirable to include as many individuals as possible. Third, repetition is desirable as gene expression data are noisy, which is worsened by the technology used to acquire the data (particularly microarrays). In summary, the targets used for each method are mentioned in Table I. Rackham et al (25) used a hierarchical ontology definition of cell types from FANTOM5 to define a particular target cell type; they ignored closely associated cell types (Fig. 2). In this way, they favored the purity of the cell state. However, they lost generality as close cell types may help to eliminate non-specific TFs, leading to larger lists of pre-candidate TFs if their implementation is not followed thoroughly. On the contrary, Okawa et al (26) used a specific cell type contrasted with the closest associated cell types (daughter cell types; Fig. 2). This has a number of advantages since the comparison of close, although distinct, cell types may lead to the clear identification of controlling TFs. However, this method cannot be generalized, as the type of experimental setting (well-defined progenitor and daughter cell types) required to run this approach is not common in the data repositories and the experiments must be performed in advance to generate the data.
D'Alessio et al (24) and Cahan et al (23) first defined a number of classes of tissues or cell types and compared each class against the remaining classes (Fig. 2). In each class, they used numerous samples, avoiding individual and noise effects.

Non-target cell types

Following removal of the target data, the non-target data are commonly obtained from the remaining tissue or cell types of the defined datasets (Table I). Nevertheless, Rackham et al (25) removed distantly related samples, probably due to a highly-curated cell lineage ontology. This has the advantage of removing false differentially expressed TFs that may control specialized functions in distant and target cell types, presumably via an upregulated TF. Nevertheless, this concept is only useful if the TF differential scoring depends on downregulated TFs, as in Rackham et al (25). Therefore, the removal of distant cell types may be redundant if only upregulated TFs are considered and there are no large combinatorial effects in TFs. In addition, the threshold required to determine distance is hard to define, complicating further tests in diverse scenarios. In Okawa et al (26), the target was one of the daughter cell types, and therefore the non-target was formed by the progenitor and the sister cell type. An issue with using large collections of samples in the non-target is that certain cell types may be represented by a disproportionate number of samples. To avoid this overrepresentation, in D'Alessio et al (24), the non-target dataset was balanced by selecting a representative sample from the collection of samples of each cell type.
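
The balancing idea may be sketched as follows. How the representative sample is chosen is not specified above, so this example assumes, purely for illustration, that it is the sample closest to the mean expression profile of its cell type; the function names and toy profiles are likewise hypothetical.

```python
from statistics import mean

def representative_sample(samples):
    """Pick the sample closest (squared Euclidean distance) to the mean profile.

    ASSUMPTION: 'representative' is taken to mean 'closest to the centroid';
    the original study may have used a different criterion.
    """
    centroid = [mean(values) for values in zip(*samples)]

    def distance(sample):
        return sum((x - c) ** 2 for x, c in zip(sample, centroid))

    return min(samples, key=distance)

def balanced_non_target(samples_by_type, target_type):
    """One representative profile per non-target cell type."""
    return {cell_type: representative_sample(samples)
            for cell_type, samples in samples_by_type.items()
            if cell_type != target_type}

# Toy profiles (invented): each sample is a list of expression values per gene.
samples_by_type = {
    "heart": [[0.9, 0.1], [0.8, 0.2]],
    "liver": [[0.1, 0.9], [0.2, 0.8], [0.15, 0.85], [0.6, 0.4]],
    "skin": [[0.5, 0.5]],
}
background = balanced_non_target(samples_by_type, "heart")
print(background)
```

Each non-target cell type then contributes exactly one profile, so a cell type with hundreds of samples no longer outweighs one with a handful.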

Identification of pre-candidate TFs

The four computational methods proposed used different approaches, which are conceptually summarized in Fig. 3. Theoretically, however, the identification of putative TFs may be obtained by identifying TFs whose expression is statistically different. Therefore, parametric, non-parametric or permutation tests may provide similar results (32–34). Statistical tests provide a P-value that is useful, although it does not represent the magnitude of the difference between two average expression levels and is sensitive to the variance and number of samples (35). Alternatively, these issues may be solved by using combinations of the P-value and fold-change, for example in Rackham et al (25), where the score per TF is based on the absolute magnitude of the fold-change multiplied by the (negative) logarithm of the P-value. Nevertheless, certain of the methods reviewed demonstrate a preference for other strategies (Table II). For example, D'Alessio et al (24) used Jensen-Shannon Divergence (JSD), which is a measure of the discrepancy between distributions. JSD was used to score differences between the observed TF expression profiles and idealized ones. These idealized profiles are formed by combining high expression in the corresponding target cell type and no expression in the remaining cell types.
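
The two per-TF scores described above may be sketched in a few lines of Python. This is an illustrative reading rather than the authors' code: the use of a log2 fold-change and a base-10 logarithm in `combined_score`, the conversion of the divergence into a specificity value (1 minus JSD), and all function names are assumptions.

```python
import math

def combined_score(log2_fold_change, p_value):
    """|fold-change| multiplied by -log10(P); log bases are assumptions."""
    return abs(log2_fold_change) * -math.log10(p_value)

def _normalize(values):
    total = sum(values)
    return [v / total for v in values]

def _kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two non-negative profiles (0 to 1)."""
    p, q = _normalize(p), _normalize(q)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)

def specificity(profile, target_index):
    """Closeness of a TF profile to the idealized one: all expression in the target."""
    ideal = [0.0] * len(profile)
    ideal[target_index] = 1.0
    return 1.0 - jsd(profile, ideal)

# Toy profiles (invented) over four cell types; the target is index 0.
specific_tf = [0.90, 0.05, 0.03, 0.02]
ubiquitous_tf = [0.25, 0.25, 0.25, 0.25]
print(specificity(specific_tf, 0), specificity(ubiquitous_tf, 0))
print(combined_score(2.0, 0.001))
```

A TF expressed almost exclusively in the target scores close to 1, whereas a uniformly expressed TF scores lower; ranking TFs by this value gives a pre-candidate list per cell type.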

Figure 3.

Comparison of conceptual definitions to identify TF differences. TF, transcription factor; mag, magnitude.

Table II.

Identification of differentially expressed TFs.

Author, year	Method	Comparison	(Refs.)
Cahan et al, 2014	Tissue-Specific Context Likelihood of Relatedness	Pairs of co-expressed TF	(23)
D'Alessio et al, 2015	Jensen-Shannon Divergence	Per TF	(24)
Rackham et al, 2016	Combines P-values and fold-change	Per TF	(25)
Okawa et al, 2016	Normalized Ratio Difference	Pairs of swap-expressed TF	(26)

TF, transcription factor.

Instead of comparing one TF across cell types, Okawa et al (26) and Cahan et al (23) compared pairs of TFs (Table II and Fig. 3). The comparison of pairs is based on the concept that balanced expression between two TFs is associated with cell identity (36–38). Okawa et al (26) proposed the normalized ratio difference (NRD) to score all pairs of TFs that are similarly expressed in a progenitor cell type, and highly different in and between daughter cell types. Cahan et al (23) additionally compared pairs of TFs, using the metric of the context likelihood of relatedness (CLR). The CLR is a measure that favors TFs that are highly correlated (by mutual information) and whose correlations are within the top ranked to increase the probability of genuine associations (39). Notably, while Okawa et al (26) favored pairs of TFs whose expression was different in daughter cells, Cahan et al (23) favored TFs that were co-expressed and whose expression levels were cell-type specific (as explained in more detail in the following section). These opposing views are associated with the input data: Okawa et al (26) used cell types that were extremely close in the lineage, whereas Cahan et al (23) used tissue types that are more distant (Fig. 2). By definition, these methods will generate much larger lists of pre-candidates compared with those comparing one TF at a time. For example, assuming that there are ~2,000 TFs, there would be 1,999,000 pairwise comparisons vs. 2,000 when only one TF is assessed at a time. Thus, these methods require extensive filtering.
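
The pair count quoted above is the number of unordered pairs, n(n - 1)/2, which may be checked directly (the variable names are illustrative):

```python
import math

n_tfs = 2000                                # approximate number of TFs
unordered_pairs = math.comb(n_tfs, 2)       # n * (n - 1) / 2
pairs_per_single_test = unordered_pairs / n_tfs

print(unordered_pairs)         # 1999000
print(pairs_per_single_test)   # 999.5
```

Pairwise methods therefore evaluate roughly a thousand times more hypotheses than per-TF methods for the same set of TFs.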

Filtering the pre-candidate TF list

The objective of this step is to further filter the pre-candidate list to end up with a short list of candidate TFs whose overexpression will likely control the desired cell state. This step is frequently the most complex and time-consuming; it depends on the length of the pre-candidate TF list and the rules defined in the filters. In methods that assess one TF at a time, such as those of D'Alessio et al (24) and Rackham et al (25), if 5% of the 1,000 expressed TFs are differential, a list of ~50 TFs is expected. This estimate is not far from reality, supposing that few TFs control the cell state by means of regulating further TFs, which thus regulate the downstream effector genes. Furthermore, for methods that compare pairs of TFs, such as those of Okawa et al (26) and Cahan et al (23), and even optimistically estimating that only 0.1% of pairs are of interest, ~1,000 pairs of TFs would have to be analyzed from the ~1,000,000 TF pairs generated. Unless a shorter pre-candidate list is obtained, analyzing the TF list manually by reading scientific literature or browsing databases by hand may be arduous, prone to errors and time-consuming. Therefore, the filtering procedures proposed are focused on setting sensible rules that are approachable with current databases. Thus far, the focus has been on differential TFs; however, non-TF genes that are involved in signal propagation or provide cell type-specific functions also require consideration. Therefore, to completely explain the observations, the rules must be based on maximizing the control over all observed differentially expressed genes (DEGs), irrespective of the gene function (TF or not). Thus, the rules may be easily stated as ‘show all TFs directly or indirectly controlling all DEGs.’ If all regulatory associations between TFs and other types of genes were known, this statement could be implemented more easily than the current methods.
Nevertheless, the current databases are far from being complete, are context-specific (by culture or tissue) and are likely to include errors. Therefore, in the following paragraphs, how these rules were implemented in each method is explained and an overview is illustrated in Fig. 4. The majority of the methods make use of networks, databases and other tools to integrate information and connect the TFs with themselves and with other DEGs.
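
If a regulatory network were available, the rule ‘show all TFs directly or indirectly controlling all DEGs’ could be approximated by a graph search. The sketch below ranks candidate TFs by the fraction of DEGs reachable downstream of each one. It does not reproduce any reviewed method; the network, the gene names and the function names are hypothetical, and the network is assumed to be a plain mapping from each regulator to its direct targets.

```python
from collections import deque

def reachable(network, start):
    """Genes directly or indirectly downstream of `start` (breadth-first search)."""
    seen, queue = set(), deque([start])
    while queue:
        gene = queue.popleft()
        for target in network.get(gene, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def rank_by_coverage(network, candidate_tfs, degs):
    """Sort TFs by the fraction of DEGs they control, best first."""
    degs = set(degs)
    coverage = {tf: len(reachable(network, tf) & degs) / len(degs)
                for tf in candidate_tfs}
    return sorted(coverage.items(), key=lambda item: (-item[1], item[0]))

# Toy network (invented): regulator -> list of direct targets.
network = {
    "TF1": ["TF2", "G1"],
    "TF2": ["G2", "G3"],
    "TF3": ["G3"],
}
degs = ["G1", "G2", "G3", "G4"]
ranking = rank_by_coverage(network, ["TF1", "TF2", "TF3"], degs)
print(ranking)
```

In this toy case TF1 reaches three of the four DEGs through TF2, so it ranks first; G4 is controlled by no candidate, which mirrors the incompleteness of real databases.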

Figure 4.

Comparison of the generation of candidate TFs. TF, transcription factor; CLR, context likelihood of relatedness; JSD, Jensen-Shannon divergence; NRD, normalized ratio difference; GSEA, gene set enrichment analysis.

In Cahan et al (23), a genetic regulatory network was built upon the significantly correlated pairs of genes using the CLR. As these networks are frequently large, the InfoMap tool was used to split this large network into smaller, highly connected sub-networks (40). Furthermore, each sub-network was evaluated using gene set enrichment analysis (GSEA) (41). GSEA generates a score depending on the position of the genes in the sub-network relative to all genes. If the genes are randomly distributed, the GSEA score is low, whilst if the genes are concentrated in nearby positions, the GSEA score increases. If the GSEA score of a sub-network obtained from tissue A is higher compared with other tissues, this sub-network is defined as specific for tissue A. Subsequent to executing this procedure in all sub-networks present in all tissues, Cahan et al (23) ended up with ~76 tissue-specific sub-networks. Thus, on average, approximately four to five sub-networks were expected in each of the 16–20 tissues or cell types. From this, the study aimed to identify which sub-networks and which genes within the sub-networks were more likely to be manipulated, starting from a source cell type. To evaluate the former, the expression of each gene within the target cell type sub-network was compared against that of the source cell type; if the expression levels were similar, no larger alterations were required, whilst if the expression levels were very different, the sub-network had to be re-established and was therefore a target for manipulation. To assess the genes within the sub-networks, a network influence score (NIS) was estimated. This NIS depends on the difference in TF expression between the source and the target, the differences in the expression of the predicted genes regulated by that TF, and the number of regulated genes.
In brief, a large network was split into sub-networks, filtered for tissue specificity, further filtered to detect those expressed at different levels and, finally, TFs were ranked within the resultant sub-networks. Cahan et al (23) demonstrated acceptable predictions in a number of conversion systems and suggested that direct conversions are less similar to the in vivo tissues compared with those conversions obtained from iPSCs. Elsewhere, D'Alessio et al (24) used the JSD metric against an idealized profile to evaluate each TF between the target and non-target cell types in 233 cell types. This procedure yielded 503 TFs across these cell types. A total of ~60% of the TFs were considered to be pre-candidates in fewer than four cell types, demonstrating that most were cell-type specific. For the experiments, the study of D'Alessio et al (24) focused on the top 10 TFs for induction. This approach was validated by comparing their predictions to well-known conversion systems, including iPSCs, neural precursor cells, cardiomyocytes, hepatocytes, motor neurons, pancreatic islet cells and melanocytes. Furthermore, D'Alessio et al (24) predicted and experimentally validated their approach in the conversion of fibroblasts to retinal pigment epithelial-like cells. In Rackham et al (25), the pre-candidate list of TFs was generated using a combination of a tissue-specific P-value and the magnitude of the difference. For the filtering, two additional network influence scores were used for each TF, which were estimated from MARA (42) and STRING (43) networks. These network scores depend on how many genes are connected to each TF, how far the connection is (number of nodes), and the score of the regulated gene (P-value and magnitude). Subsequently, the ranks of these three scores were added and ranked to provide a final rank. The first filter consisted of using only the TFs within the top 100 final ranks.
The second filter removed the TFs that were expressed in the source and target cell types. The third filter removed those redundant TFs that shared the majority of their targets with other TFs regulating more genes. A fourth filter was applied to include the top eight TFs. The approach was validated in at least five systems, involving conversions from fibroblasts to iPSCs, myoblasts, hepatocytes and cardiac cells, and from B cells to macrophages. Finally, two novel conversions were predicted and tested experimentally, converting fibroblasts to keratinocytes and keratinocytes to microvascular endothelial cells. Okawa et al (26) used the NRD metric to evaluate and select pairs of TFs. Subsequently, the MetaCore network database (44) was used to first retain only TFs with over seven connections. This was based on the observation that important TFs are highly connected in MetaCore. The next filter removed unnecessary nodes of the network, based on the assumption that a cell type may be stabilized by a gene regulatory network that was additionally stable in the two daughter cell types. For this, the study re-implemented a sub-network-finding optimization algorithm combined with Boolean networks (45). A Boolean network is a modeling framework in which attractor states may be identified (46). These attractors were interpreted in biological cells as stable states that may be compared with the states of daughter cell types, filtering those matching sub-networks. Subsequent to running the algorithm numerous times, the following filter looked for all sub-network solutions that contained at least one upregulated TF. Subsequently, the sub-networks were ranked based on the number of NRD pairs present, NRD pairs directly connected and fewer regulatory connections. This approach was validated in five stem cell systems, including mouse embryonic stem cells, mouse and human hematopoietic stem cells, mouse neural stem cells and mouse mesenchymal stem cells.
Furthermore, the induction of neuronal and astrocyte differentiation was predicted and experimentally confirmed in a mouse neuronal stem cell system.
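
The rank aggregation and the four filters described for Rackham et al (25) may be sketched as below. This is a simplified reading of the description, not the Mogrify implementation: the three input scores are taken as given, ‘majority of targets’ is assumed to mean more than 50% overlap, and every name and toy value is hypothetical.

```python
def ranks(scores):
    """Map each TF to its rank (1 = highest score); ties broken by name."""
    ordered = sorted(scores, key=lambda tf: (-scores[tf], tf))
    return {tf: position for position, tf in enumerate(ordered, start=1)}

def shortlist(diff, net_a, net_b, expressed_in_source, targets,
              top_n=100, overlap=0.5, keep=8):
    """Sum the ranks of three scores, then apply four successive filters."""
    rank_sets = [ranks(diff), ranks(net_a), ranks(net_b)]
    total = {tf: sum(r[tf] for r in rank_sets) for tf in diff}
    # Filter 1: keep the top_n TFs by summed rank (lower is better).
    ordered = sorted(total, key=lambda tf: (total[tf], tf))[:top_n]
    # Filter 2: drop TFs already expressed in the source cell type.
    ordered = [tf for tf in ordered if tf not in expressed_in_source]
    # Filter 3: drop TFs sharing most targets with a TF regulating more genes.
    kept = []
    for tf in ordered:
        own = targets.get(tf, set())
        redundant = False
        for other in ordered:
            theirs = targets.get(other, set())
            if other != tf and own and len(theirs) > len(own):
                if len(own & theirs) / len(own) > overlap:
                    redundant = True
                    break
        if not redundant:
            kept.append(tf)
    # Filter 4: report only the first `keep` TFs.
    return kept[:keep]

# Toy scores (invented); higher means more relevant.
diff = {"A": 9, "B": 8, "C": 7, "D": 1}
net_a = {"A": 5, "B": 6, "C": 4, "D": 1}
net_b = {"A": 7, "B": 5, "C": 6, "D": 2}
targets = {"A": {"g1", "g2", "g3", "g4"}, "B": {"g5"},
           "C": {"g1", "g2"}, "D": {"g9"}}
final = shortlist(diff, net_a, net_b, expressed_in_source={"B"}, targets=targets)
print(final)
```

Here B is removed because it is already expressed in the source, and C is removed because all of its targets are also regulated by A, which controls more genes.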

Finding key TFs in practice

In this section, the focus is on how to estimate the key TFs for the target cell type of interest in an easy and practical way, while commenting on each approach. Ideally, the prediction would be made to manipulate a source cell type to achieve a target cell type. However, the majority of methods are restricted to specific sources, targets, or both. A summary is provided in Table III, and details are provided in the following paragraphs. It is assumed that gene expression estimates and their annotation for the target cell type are available, from either microarrays or RNA-Seq. An overview of the available tools is provided followed by a practical example.

Table III.

Resources available for finding key TFs.

Author, year	Resources and limitations	(Refs.)
Cahan et al, 2014	CellNet: Web interface and R package. Any source cell type as input but only from certain Affymetrix arrays, and Illumina arrays (in R). Only specific target cell types are available	(23)
D'Alessio et al, 2015	File for 233 cell type predictions. Manual estimations are possible for a target. Source is not used.	(24)
Rackham et al, 2016	Mogrify: Web interface. Specific for several already cataloged source and target cell types.	(25)
Okawa et al, 2016	None available.	(26)

TF, transcription factor.

Overview of available tools

Cahan et al (23) provided a web interface (cellnet.hms.harvard.edu) and an R package (pcahan1.github.io/cellnetr) termed CellNet, which accept gene expression data from the source cell type or from already manipulated cell types. The output is composed of three main sections. The first is a classification of the input samples into the cell types used in CellNet. The second shows how well each cell type-specific genetic regulatory network is established across the input samples, which helps to identify the networks that need to be manipulated to achieve a cell type. The third shows the TFs with the largest differences within networks, indicating which TFs require manipulation. The web version accepts only Affymetrix (Thermo Fisher Scientific, Inc., Waltham, MA, USA) microarrays; the R package version additionally accepts Illumina (Illumina, Inc., San Diego, CA, USA) microarray data. For D'Alessio et al (24), if the target cell type is already in the list of the 233 cell types processed [available in the supplementary information of the study (Table SI)], the top-ranked TFs listed there (~10) may be used directly. If the cell type is not listed, and to avoid reconstructing the entire study, a JSD or JSD-like value for the target cell type may be estimated with spreadsheet software from the predictions available for the 233 cell types. First, the TF expression of the target cell type is required to provide a rank of expression. Second, for each TF, the number of times it appears in the top 10 of the other cell types is counted. Third, for each TF, its minimum rank across all other cell types is obtained. Fourth, scatter plots of the target rank of the TFs against the values from the second and third steps are displayed.
Fifth, TFs that were top ranked in the target and had low counts in the first scatter plot and/or were top ranked in the target and had higher ranks in the second scatter plot were estimated. These steps attempted to provide an easy approximation of the process followed by D'Alessio et al (24) instead of an accurate calculation, although these steps may be used as an easy starting point. From Rackham et al (25), a web interface is available (www.mogrify.net) in which the source and target cell types were specified from those already considered. The top eight ranked TFs were elucidated in a few seconds. Unfortunately, to estimate the possible TFs for a non-listed cell type, it is necessary to reconstruct the study of Rackham et al (25) since no datasets are provided. Notably, in Okawa et al (26), neither implementation nor supplementary information was available. Thus, it is necessary to reconstruct the study to make predictions using this method. The MetaCore regulatory network used is not currently available, thus a different network database or another method for estimation may be used. Therefore, putative results may be different.
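The spreadsheet steps described above for approximating D'Alessio et al (24) may also be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure of the original study: the function names, the input structures (`target_expr`, `top_lists`, `ranks_by_cell_type`) and the thresholds (`top_n`, `max_count`, `min_other_rank`) are all assumptions, and "higher ranks" in the fifth step is read here as a numerically large best rank in the other cell types. The scatter plots of the fourth step are replaced by explicit thresholds.

```python
def rank_tfs(expression):
    """Step 1: rank TFs by expression in the target (1 = highest)."""
    ordered = sorted(expression, key=expression.get, reverse=True)
    return {tf: i + 1 for i, tf in enumerate(ordered)}


def count_in_top_lists(tf, top_lists):
    """Step 2: number of other cell types listing the TF in their top 10."""
    return sum(tf in tfs for tfs in top_lists.values())


def min_rank_elsewhere(tf, ranks_by_cell_type, missing_rank):
    """Step 3: best (smallest) rank of the TF in any other cell type."""
    ranks = [r.get(tf, missing_rank) for r in ranks_by_cell_type.values()]
    return min(ranks) if ranks else missing_rank


def candidate_tfs(target_expr, top_lists, ranks_by_cell_type,
                  top_n=3, max_count=1, min_other_rank=10):
    """Step 5: keep TFs top ranked in the target that are rarely in other
    top-10 lists and/or never rank near the top elsewhere.
    All thresholds are hypothetical and should be tuned by inspection."""
    target_rank = rank_tfs(target_expr)
    missing = len(target_expr) + 1  # rank assigned when a TF is absent
    selected = []
    for tf, rank in sorted(target_rank.items(), key=lambda kv: kv[1]):
        if rank > top_n:
            continue
        count = count_in_top_lists(tf, top_lists)
        best_elsewhere = min_rank_elsewhere(tf, ranks_by_cell_type, missing)
        if count <= max_count or best_elsewhere >= min_other_rank:
            selected.append((tf, rank, count, best_elsewhere))
    return selected


# Toy input with placeholder TF and cell type names (not real data).
target_expr = {"TF_A": 0.9, "TF_B": 0.8, "TF_C": 0.7, "TF_D": 0.1}
top_lists = {"cell_1": ["TF_B", "TF_D"], "cell_2": ["TF_B"]}
ranks_by_cell_type = {
    "cell_1": {"TF_A": 50, "TF_B": 1, "TF_C": 4, "TF_D": 2},
    "cell_2": {"TF_A": 30, "TF_B": 2, "TF_C": 12, "TF_D": 40},
}
candidates = candidate_tfs(target_expr, top_lists, ranks_by_cell_type)
```

In this toy input, TF_B is highly expressed in the target but is also in the top 10 of both other cell types, so it is filtered out, whereas TF_A and TF_C are retained.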

Validation of target cell state TFs via different approaches in a well-known system

To demonstrate that this simplified view was able to generate sensible TFs, the first two concepts were applied in a well-known system, transdifferentiation towards cardiomyocytes (CM).

Target and non-target datasets

The target CM data were the 62 samples annotated as ‘Heart’ in the GEO/NCBI dataset with accession no. GSE45878, which consists of 837 samples from diverse tissues. The remaining 775 samples formed the non-target dataset. The number of probes was 22,704.

Data pre-processing

The two datasets were quantile normalized and scaled to a uniform distribution between 0 and 1, representing no expression and maximum expression, respectively. TFs were recognized by searching for the words ‘transcription’ and ‘factor’ in the annotated gene description. Additionally, AnimalTFDB was used for the TF annotation (47). Thus, 1,392 TFs were considered. The target and the non-target datasets, in addition to the tissue annotation, are available at bioinformatica.mty.itesm.mx/CEC-TF-Example.
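As an illustration of the pre-processing described above, the following Python sketch maps each sample to a uniform distribution between 0 and 1 by rank, and flags TFs by keyword. It is one possible reading of the text rather than the procedure in the authors' R scripts: the function names are hypothetical, replacing values by rank fractions is an assumption about how quantile normalization and uniform scaling were combined, and requiring both keywords to be present is likewise an assumption.

```python
def uniform_quantile_scale(sample):
    """Map one sample's values to [0, 1] by rank; tied values share
    the mean of their ranks. 0 = lowest value, 1 = highest value."""
    n = len(sample)
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: sample[i])
    scaled = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < n and sample[order[end + 1]] == sample[order[start]]:
            end += 1
        mean_rank = (start + end) / 2  # 0-based mean rank of the tie group
        for pos in range(start, end + 1):
            scaled[order[pos]] = mean_rank / (n - 1)
        start = end + 1
    return scaled


def is_tf_annotation(description):
    """Keyword rule: both 'transcription' and 'factor' in the description."""
    text = description.lower()
    return "transcription" in text and "factor" in text


scaled = uniform_quantile_scale([5.0, 1.0, 3.0])
keyword_hit = is_tf_annotation("transcription factor 21")
keyword_miss = is_tf_annotation("GATA binding protein 4")
```

The second description is not matched by the keywords alone, which illustrates why a curated resource such as AnimalTFDB is a useful complement to the keyword search.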

Score implementations

A total of five scores were used: two taken from basic concepts of differential expression, and three inspired by the scores used by the methods reviewed here. ‘Delta’ is the difference in mean expression values between the target and non-target cell types. ‘t-test’ is the P-value of the unequal variance t-test applied to target and non-target cell types. ‘Rackham’ is $-\log_{10}(P_{t\text{-test}}) \times |\mathrm{Delta}|$, as in Rackham et al (25). ‘D'Alessio’ is the sum of 100 JSD scores between the observed and the ideal profile. The observed profile was estimated using the average target expression together with k=3 random samples from non-targets (increasing values of k did not increase similarity to other scores). This process was similar, although not identical, to that implemented in D'Alessio et al (24) (details of the algorithm in the supplementary information were not clear). ‘Okawa’ is an adaptation of the Okawa et al (26) metric to cell types that are not in a progenitor-daughter relationship. It was estimated as $(TF_{T,i} - TF_{T,k}) - (TF_{N,i} - TF_{N,k})$, where the subscripts T and N refer to the mean expression values of the target and non-target cell types, respectively, i is a particular TF, and k runs over all TFs. This metric generated very similar results to the NRD (which involves ratios that are more unstable, although the script provided includes the NRD estimation). To generate a single score per TF, the number of times the TF was included among the top 1% of pairwise differences was counted. The score of Cahan et al (23) was not implemented because its tissue specificity is obtained only after extensive network operations (the scripts and data are available at bioinformatica.mty.itesm.mx/CEC-TF-Example).
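The scores described above may be sketched in Python as follows. This is a hedged re-implementation for illustration, not the authors' R script. All function names are hypothetical; the t-test P-value is passed in rather than computed (a statistics package would normally supply the unequal variance test); the normalization of the observed profile to sum to 1, the ideal profile (all expression in the target), the fixed random seed, and the use of absolute pairwise differences with both members of each pair counted are assumptions. Note that the pairwise Okawa term is algebraically equal to Delta_i − Delta_k.

```python
import math
import random
from collections import Counter
from itertools import combinations
from statistics import mean


def delta_score(target, non_target):
    """'Delta': mean expression in target minus mean in non-target."""
    return mean(target) - mean(non_target)


def rackham_like_score(p_value, delta):
    """'Rackham': -log10(P) * |Delta|; P comes from an external t-test."""
    return -math.log10(p_value) * abs(delta)


def jsd(p, q):
    """Jensen-Shannon divergence (log base 2) of two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def dalessio_like_score(target, non_target, k=3, draws=100, seed=0):
    """Sum of `draws` JSD values between observed and ideal profiles;
    lower totals indicate a more target-specific TF."""
    rng = random.Random(seed)  # fixed seed: assumption, for reproducibility
    ideal = [1.0] + [0.0] * k
    total = 0.0
    for _ in range(draws):
        observed = [mean(target)] + rng.sample(non_target, k)
        s = sum(observed)
        # Assumption: an all-zero profile is treated as flat (non-specific).
        profile = [v / s for v in observed] if s > 0 else [1 / (k + 1)] * (k + 1)
        total += jsd(profile, ideal)
    return total


def okawa_like_counts(target_mean, non_target_mean, top_fraction=0.01):
    """Count how often each TF occurs in the top fraction of TF pairs,
    ranked by |(T_i - T_k) - (N_i - N_k)|."""
    pairs = []
    for i, k in combinations(sorted(target_mean), 2):
        d = (target_mean[i] - target_mean[k]) - (non_target_mean[i] - non_target_mean[k])
        pairs.append((abs(d), i, k))
    pairs.sort(reverse=True)
    n_top = max(1, math.ceil(len(pairs) * top_fraction))
    counts = Counter()
    for _, i, k in pairs[:n_top]:
        counts[i] += 1
        counts[k] += 1
    return counts
```

Because absolute differences are used, this Okawa-like count does not distinguish over- from underexpression in the target; keeping only positive differences would be a one-line change.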

Summary of the results

Table IV lists the top 20 genes generated by each of the five scoring methods. The table also summarizes the most frequently mentioned TFs and previously reported experimental findings. All of the top seven TFs listed have already been used experimentally for the conversion of different cell types to cardiomyocytes, including T-box 20 (48,49), GATA4, TBX5 (50–52), NK2 homeobox 5, and heart and neural crest derivatives expressed 2 (50,52). However, a widely used TF in this conversion, MEF2C (17,50,51,53), was not present in the list. On inspection, this gene was not annotated as a TF in the databases used here. Even when MEF2C was added as a TF, it was not included in the top 20 of any scoring method. This TF appears to be important, as removing it from the overexpressed factors did not generate cells expressing important cardiac markers (17). A recent meta-analysis specific for CM differentiation did not identify MEF2C, although it did identify a gene of the same family, MEF2A (54). Although this result may give some clues regarding MEF2C, it is difficult to conclude the extent of its importance from these data alone. On the other hand, this example demonstrates that the majority of TFs may be obtained via straightforward application of simple concepts, as depicted in detail for the top 20 TFs identified in Table IV (55–61), but also highlights that it is possible that not all of the required factors are obtained with the current methods.

Table IV.

Top 20 genes per method for cardiomyocyte differentiation.

	Method

Author, year	Delta	t-test	Rackham	D'Alessio	Okawa	Mentions, n	TF comments (Refs.)
Kamaraj et al, 2016	TBX20	TBX20	TBX20	ZNF705A	GATA4	HAND1, 5	Computational prediction (55)
Ieda et al, 2010; Ieda et al, 2009; Addis et al, 2013; Chen et al, 2015	GATA4	GATA4	GATA4	ZNF283	TBX20	HAND2, 5	First described in (17,53), confirmed in mouse models and increased efficiency of CM expression markers (50,52)
Ieda et al, 2010; Ieda et al, 2009; Addis et al, 2013; Chen et al, 2015; Ebrahimi et al, 2016	HAND1	HAND1	TBX5	ZSCAN4	HAND1	GATA4, 4	Key TF first described in (17,53), confirmed experimentally (50–52) and computationally (56)
Ieda et al, 2010; Ieda et al, 2009; Addis et al, 2013; Chen et al, 2015; Ebrahimi et al, 2016	TBX5	TBX5	GATA6	LIN28B	TBX5	TBX5, 4	Key TF first described in (17,53), confirmed experimentally (50–52) and computationally (56)
Addis et al, 2013; Chen et al, 2015	HAND2	HAND2	HAND1	HAND2	HAND2	NKX2.5, 4	Increased efficiency of CM expression markers (50,52)
Xiang et al, 2016; Chakraborty et al, 2012	ESRRG	ESRRG	CSDC2	HAND1	ESRRG	TBX20, 4	Implicated in CM proliferation and cardiac function in mice (48,49)
Fu et al, 2013	NKX2.5	NKX2.5	NKX2.5	TFDP3	CSDC2	ESRRG, 4	Improved CM phenotype (51)
Kamaraj et al, 2016	CSDC2	CSDC2	HAND2	POU1F1	NKX2.5	HEY2, 4	[a]Computational prediction (55)
Rastegar-Pouyani et al, 2017	PROX1	PROX1	ESRRG	E2F8	PROX1	TCF21, 4	[a]Computational prediction in humans (54)
Kamaraj et al, 2016	TCF21	TCF21	PROX1	HNF4G	TCF21	GATA6, 4	[a]Computational prediction (55)
	HEY2	HEY2	HEY2	ZNF20	HEY2	CSDC2, 4	[b]Highly expressed in the heart
Risebro et al, 2009	GATA6	GATA6	NPAS2	NR1H4	GATA6	PROX1, 4	Muscle structure maintenance (57)
Kamaraj et al, 2016	NR0B2	NR0B2	TEAD2	RFX6	NR0B2	EBF2, 4	[a]Computational prediction (55)
Liu et al, 2017	EBF2	EBF2	PPARA	CDX4	EBF2	MEIS2, 4	May be important in CM (58)
Rastegar-Pouyani et al, 2017	IRX3	IRX3	MEIS2	ESX1	ID4	TEAD2, 4	[a]Similar computational prediction (54)
	ETV1	ETV1	EBF2	ZFP42	IRX3	EBF3, 3	[a]
Shekhar et al, 2016	MEIS2	MEIS2	TCF21	X.2878	ETV1	ETV1, 3	Involved in rapid impulse conduction (59)
	TEAD2	TEAD2	TEAD1	SRY	MEIS2	IRF6, 3	[a]
Koizumi et al, 2016	IRF6	IRF6	IRX4	FOXR2	TEAD2	IRX3, 3	Involved in cardiac rhythm (60)
Nam et al, 2014	EBF3	EBF3	EBF3	RFX8	IRF6	NR0B2, 3	Involved in cardiac hypertrophy (61)

[a] TF not tested for differentiation.

[b] GeneCards Human Gene Database, www.genecards.org/cgi-bin/carddisp.pl?gene=CSDC2. Top 20 genes by each criterion including those most frequently appearing (Mentions column). TF, transcription factor; CM, cardiomyocyte; HAND1, heart and neural crest derivatives expressed 1; HAND2, heart and neural crest derivatives expressed 2; GATA4, GATA binding protein 4; TBX5, T-box 5; NKX2.5, NK2 homeobox 5; TBX20, T-box 20; ESRRG, estrogen related receptor γ; HEY2, hes related family bHLH transcription factor with YRPW motif 2; TCF21, transcription factor 21; GATA6, GATA binding protein 6; CSDC2, cold shock domain containing C2; PROX1, prospero homeobox 1; EBF2, early B cell factor 2; MEIS2, meis homeobox 2; TEAD2, TEA domain transcription factor 2; EBF3, early B cell factor 3; ETV1, ETS variant 1; IRF6, interferon regulatory factor 6; IRX3, iroquois homeobox 3; NR0B2, nuclear receptor subfamily 0 group B member 2.

Estimation of target cell state TFs via different approaches in a novel system

To provide a practical, easily reproducible example and a comparison of the different approaches as a starting point, corneal endothelial cells (CEC) were used as the target cell type. The tools and data available (Mogrify, CellNet and the D'Alessio et al supplementary information) did not include CEC and therefore were not used. As the datasets represented in these tools are limited, this example represents a likely scenario for specific cell types. The re-implemented scores inspired by the reviewed methods and the resulting pre-candidate lists were compared. This demonstrates that the first two steps are highly useful and relatively easy to implement. The data were processed in R (cran.r-project.org). The scripts and the data required to reproduce the results are available at bioinformatica.mty.itesm.mx/CEC-TF-Example.

Target and non-target datasets

The target CEC data were obtained from GEO/NCBI with accession no. GSE58315 (62). The dataset consisted of 11 corneal endothelial cell samples from adults, adolescents and preschoolers. The non-target dataset was obtained from a preliminary study on gene co-expression networks (63). This dataset consisted of 445 samples representing >136 tissues from the two most popular Affymetrix platforms (HG-U133) extracted from the GEO/NCBI. The number of probes was >50,000; however, due to the different versions of Affymetrix microarrays, certain samples provided data for only 22,000 probes.

Data pre-processing

The two datasets were quantile normalized and scaled to a uniform distribution between 0 and 1, representing no expression and maximum expression, respectively. For the non-target dataset, the JetSet package was used to identify a representative probe for each gene (64). For the target dataset, when a gene was represented by several probes, the probe with the highest standard deviation was selected. To recognize TFs, the Affymetrix annotation of the platform GPL570 was searched for ‘transcription’ and ‘factor’ in the annotated description. Additionally, the TFs annotated in AnimalTFDB were used (47). Thus, 1,478 TFs were considered. Only the 16,098 genes matching in the two datasets (by gene symbol) were used, of which 1,408 were annotated as TFs. The target and the non-target datasets along with the tissue annotation are available at bioinformatica.mty.itesm.mx/CEC-TF-Example.
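The probe selection and gene matching described above may be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R code: the function names, the tuple layout `(gene_symbol, probe_id, values)` and the use of the population standard deviation are hypothetical choices, and the JetSet selection used for the non-target dataset is not reproduced.

```python
from statistics import pstdev


def select_most_variable_probe(probes):
    """For each gene symbol keep the probe with the highest standard
    deviation across samples. `probes` is a list of
    (gene_symbol, probe_id, values) tuples (hypothetical layout)."""
    best = {}
    for gene, probe_id, values in probes:
        sd = pstdev(values)
        if gene not in best or sd > best[gene][0]:
            best[gene] = (sd, probe_id, values)
    return {gene: (probe_id, values) for gene, (_, probe_id, values) in best.items()}


def shared_genes(target_by_gene, non_target_by_gene):
    """Gene symbols present in both datasets, in sorted order."""
    return sorted(set(target_by_gene) & set(non_target_by_gene))


# Toy input with placeholder gene and probe names (not real data).
probes = [
    ("GENE_A", "probe_1", [1.0, 1.0, 1.0]),
    ("GENE_A", "probe_2", [0.0, 1.0, 2.0]),
    ("GENE_B", "probe_3", [0.5, 0.6, 0.7]),
]
target_by_gene = select_most_variable_probe(probes)
non_target_by_gene = {"GENE_B": [0.1, 0.2], "GENE_C": [0.3, 0.4]}
common = shared_genes(target_by_gene, non_target_by_gene)
```

Matching by gene symbol after probe selection leaves one expression profile per gene in each dataset, which is what the scores require.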

Score implementations

A total of five scores were used, as described for the aforementioned cardiomyocyte analysis.

Comparison of resultant scores

Whether the re-implemented scores were similar to each other was investigated. Fig. 5A illustrates the results for the 1,408 annotated TFs. It is clear that Delta, a measure of differences in the averages between target and non-target expression, correlated with all other scores. Rackham, as expected, was associated with t-test (Delta and t-test are part of the calculation). D'Alessio was negatively correlated with Delta, although highly variable (lower D'Alessio scores tended to be similar to high Delta scores). The Okawa score seemed to be a proxy of Delta irrespective of the sign. Overall, these results suggested that the scores are associated with differential expression, supporting the summarized view.
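One simple way to quantify such associations between scores is a rank correlation computed over the annotated TFs. The sketch below implements Spearman's correlation with the Python standard library. The choice of Spearman's coefficient is an assumption made for illustration (the comparison in Fig. 5A is presented graphically), and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation between two equally long score vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy score vectors for three TFs (not real data).
same_order = spearman([0.1, 0.2, 0.3], [10.0, 20.0, 40.0])
opposite_order = spearman([0.1, 0.2, 0.3], [3.0, 2.0, 1.0])
```

A rank-based measure is convenient here because the scores are on different scales (for instance, a P-value and an expression difference), so only their ordering of the TFs is compared.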

Figure 5.

Results for the CEC example. (A) Comparison of the five scores. The t-test P-value is indicated as -Log10. (B) Table of the top 20 genes by each criterion including those most frequently arising (Mentions column). Genes were assigned specific colors. Genes in italics were repeated, although not in the top 20. Black genes were specific to each score. (C) Comparison of gene expression of genes in column Mentions in panel (B) across CEC and non-target cell types. CEC, corneal endothelial cells.

Comparison of the generated TF list of pre-candidates

To provide an overview of the top selected genes per score, the identity of the top 20 TFs was investigated (Fig. 5B). It is clear that, apart from D'Alessio, the majority of the genes were frequently among the top TFs, irrespective of the score. In Delta and t-test, there was no selection for overexpressed genes and therefore some underexpressed TFs appeared, including meis homeobox 2 (MEIS2) and zinc finger protein 208. Similarly, in Okawa, the metric implemented did not favor overexpression in the target and certain genes appeared to be underexpressed, including interferon regulatory factor 8 and MEIS2 (the script provided is commented so that this may be altered easily). The low overlap of the D'Alessio TFs with the other lists (2 out of 20) probably reflects an inadequate re-implementation or insufficient detail in the original description for reproduction.

Specificity of TF expression

Fig. 5C shows the expression of the 20 most frequent TFs, as listed in the Mentions column in Fig. 5B. It is clear that the expression of all TFs was high in CEC. Subsets of these genes, however, exhibited high expression in other cell types. This result suggested a highly specific profile for CEC. Lim homeobox transcription factor 1β, for instance, is essential for the correct development of the cornea and other eye structures in mice (65), POU class 6 homeobox 2 is required for retinal regeneration in zebrafish (66), transcription factor AP-2β has been demonstrated to control differentiated CEC markers (67), TSC22 domain family member 1 is downregulated in dry eye syndrome (68), and GLIS family zinc finger 3 has been associated with glaucoma (69). This small literature analysis suggests that the observed list of TFs is important in CEC. To select more specific TFs, however, it is necessary to perform a network analysis (summarized in Fig. 4), a literature review, a comparison of this profile with that of the source cell type, and an analysis of the gene expression levels of these TFs and of other differentially expressed genes (non-TF genes).

Conclusions

In conclusion, there is a requirement for specific cell types in regenerative medicine and biological research. An interesting proposal is the direct conversion of easy-to-obtain cells, which requires a specific cocktail of TFs to induce alterations in the cell state. Despite the complexity of the computational methods proposed for this task, it was demonstrated that the strategies to identify the TFs involved in the molecular state maintenance of a cell type are relatively simple: i) Define cell populations representing diverse cell types; ii) identify differences in TF expression; and iii) apply rules to remove unlikely TFs. The present review reported that the principal complexity in the computational methods is the third of these points. It was demonstrated in a well-known cardiomyocyte example and a novel corneal endothelial cell example that applying the first two easy-to-implement ideas is likely to provide useful results, which may provide important insights and a starting point for laboratory assays. The present review may additionally inspire novel computational methods to identify TFs associated with cell identity and direct cell conversions.

66 in total

1. Genetic regulatory mechanisms in the synthesis of proteins.

Authors: F JACOB; J MONOD
Journal: J Mol Biol Date: 1961-06 Impact factor: 5.469

Review 2. Maps of random walks on complex networks reveal community structure.

Authors: Martin Rosvall; Carl T Bergstrom
Journal: Proc Natl Acad Sci U S A Date: 2008-01-23 Impact factor: 11.205

3. Using Effect Size-or Why the P Value Is Not Enough.

Authors: Gail M Sullivan; Richard Feinn
Journal: J Grad Med Educ Date: 2012-09

Review 4. Cell fate control by pioneer transcription factors.

Authors: Makiko Iwafuchi-Doi; Kenneth S Zaret
Journal: Development Date: 2016-06-01 Impact factor: 6.868

5. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles.

Authors: Aravind Subramanian; Pablo Tamayo; Vamsi K Mootha; Sayan Mukherjee; Benjamin L Ebert; Michael A Gillette; Amanda Paulovich; Scott L Pomeroy; Todd R Golub; Eric S Lander; Jill P Mesirov
Journal: Proc Natl Acad Sci U S A Date: 2005-09-30 Impact factor: 11.205

6. Transcriptome analysis of the human corneal endothelium.

Authors: Ricardo F Frausto; Cynthia Wang; Anthony J Aldave
Journal: Invest Ophthalmol Vis Sci Date: 2014-11-06 Impact factor: 4.799

7. Meta-Analysis of Transcriptome Regulation During Induction to Cardiac Myocyte Fate From Mouse and Human Fibroblasts.

Authors: Shima Rastegar-Pouyani; Niusha Khazaei; Ping Wee; Moein Yaqubi; Abdulshakour Mohammadnia
Journal: J Cell Physiol Date: 2017-03-24 Impact factor: 6.384

8. A predictive computational framework for direct reprogramming between human cell types.

Authors: Owen J L Rackham; Jaber Firas; Hai Fang; Matt E Oates; Melissa L Holmes; Anja S Knaupp; Harukazu Suzuki; Christian M Nefzger; Carsten O Daub; Jay W Shin; Enrico Petretto; Alistair R R Forrest; Yoshihide Hayashizaki; Jose M Polo; Julian Gough
Journal: Nat Genet Date: 2016-01-18 Impact factor: 38.330

9. ArrayExpress update--simplifying data submissions.

Authors: Nikolay Kolesnikov; Emma Hastings; Maria Keays; Olga Melnichuk; Y Amy Tang; Eleanor Williams; Miroslaw Dylag; Natalja Kurbatova; Marco Brandizi; Tony Burdett; Karyn Megy; Ekaterina Pilicheva; Gabriella Rustici; Andrew Tikhonov; Helen Parkinson; Robert Petryszak; Ugis Sarkans; Alvis Brazma
Journal: Nucleic Acids Res Date: 2014-10-31 Impact factor: 16.971

10. Comparison of software packages for detecting differential expression in RNA-seq studies.

Authors: Fatemeh Seyednasrollah; Asta Laiho; Laura L Elo
Journal: Brief Bioinform Date: 2013-12-02 Impact factor: 11.622
