
Recurrent transcriptional clusters in the genome of mouse pluripotent stem cells.

Abstract

A number of studies have shown that transcriptome analysis in terms of chromosomal location can reveal regions of non-random transcriptional activity within the genome. Genomic clusters of differentially expressed genes can identify genomic patterns of structural organization, underlying copy number variations or long-range epigenetic regulation such as X-chromosome inactivation. Here we apply an integrative bioinformatics analysis to a collection of 315 freely available mouse pluripotent stem cell samples to discover transcriptional clusters in the genome. We show that over half of the analysed samples (56.83%) carry whole or partial-chromosome spanning clusters which recur in genomic regions previously implicated in chromosomal imbalances. Strikingly, we found that the presence of such large clusters is linked to the differential expression of a limited number of genes, common to all samples carrying clusters irrespective of the chromosome where the cluster is found. We have used these genes to train and test classification models that can predict samples that carry large-scale clusters on any chromosome with over 90% accuracy. Our findings suggest that there is a common downstream activation in these cells that affects a limited number of nodes. We propose that this effect is linked to selective advantage and identify potential driver genes.


Year: 2012 PMID: 22798478 PMCID: PMC3479167 DOI: 10.1093/nar/gks663

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Approaches that take into account the chromosomal mapping of transcriptional data have been used in the past for the identification of general structural genomic features such as the regional clustering of ‘housekeeping’ genes (1) as well as transgenic insertions within cell lines (2), gross aneuploidy (3,4) and subtle chromosomal patterns around translocation breakpoints (5). Non-random changes in the expression levels of specific genomic regions can also be linked to the perturbation of normal epigenetic regulation, such as X-chromosome inactivation, or long-range epigenetic silencing in cancer (6,7). Especially in the field of cancer biology, where karyotypic abnormalities are prevalent, a number of studies have described the quantitative relationship between copy number (CN) and gene expression which affects a great percentage of the genes in the aberrant regions (3,4,8,9). The widespread genomic instability in various cancer types can be a challenge for the researcher as it is often not possible to decipher which aberrations contribute to cancer growth and which are the downstream effect of a compromised genomic stability. As a result, the combined analysis of large collections of transcriptional and genomic data from microarray platforms has been thus far a common approach for discovering new oncogenes or tumour suppressors and distinguishing them from the functionally unrelated bystanders (10). However, for the majority of published pluripotent stem cell experiments, large-scale integrated analysis of combined genomic and transcriptional data from a single sample is unattainable due to lack of available datasets. This is especially the case for model organisms besides human. For mouse pluripotent stem cells, for example, there is not a single large-scale study to-date that performs comparative analysis between genomic and transcriptional data. 
Two recent studies in human pluripotent stem cells have used gene expression data to identify patterns of chromosomal aberrations in embryonic stem cells (ESCs), induced pluripotent stem cells (iPSCs) and other multipotent cell types (11,12). These studies used a limited percentage of available array comparative genomic hybridization (aCGH) and single-nucleotide polymorphism (SNP) arrays to validate the observed patterns and extended the analysis to samples with no corresponding genomic data. This approach shows that by departing from the paradigm of the combined analysis, the interrogation of the large collection of readily available transcriptional data becomes possible. In addition, positional transcriptome analysis simultaneously informs on three different layers of information: genomic content, epigenome and transcriptional regulation. In mouse ESCs, small clusters of differentially expressed (DE) genes have been identified around the pluripotency marker Nanog locus as a result of complex epigenetic regulation during development (13) and at the imprinted Dlk1-Dio3 gene cluster during reprogramming due to epigenetic silencing (14). Moreover, recurring chromosomal aberrations have been primarily mapped to chromosomes 8 and 11 in mouse ESCs (15–17) and chromosome 8 and 14 in mouse iPSCs (18). Interestingly, frequent genomic alterations have been also reported in human ESCs, mapping primarily to chromosomes 12, 17 and X (19–22). Recently, it has been shown that human iPSCs also demonstrate compromised genomic integrity which is especially evident during the process of reprogramming (11,23–25). It has been suggested that specific aneuploidies tend to recur because of their ability to confer growth advantage and/or resistance to apoptosis and differentiation (26). When such aneuploidies are present in a rapidly dividing self-renewing cell in a selective environment, the affected cells can potentially outgrow normal cells and eventually dominate the cell populations. 
Consistent with this hypothesis, mouse ESCs with a trisomy 8 have been found to outgrow normal cells with a diploid karyotype in competitive cultures (15). Given the above mentioned evidence for positional transcriptional patterns in mouse pluripotent stem cells, we sought to investigate the chromosomal mapping of recurrent clusters of DE genes by analysing a large collection of samples. We hypothesise that, regardless of their molecular origins, recurrent clusters in multiple pluripotent stem cell populations are likely to be the result of positive selection. We used an integrative bioinformatics approach to identify candidate genes that may be driving the selection that has been previously associated with the presence of such patterns. Our findings provide evidence for a recurring set of DE genes in samples that contain large-scale clusters, independently of the genomic location of the clusters, and suggest a common downstream mechanism which may be associated with selective growth advantage.

METHODS

Data collection and processing

For the initial analysis phase, we have collected 481 public domain gene expression samples (373 ESC and 108 iPSC samples from 64 experimental designs) for the Affymetrix GeneChip Mouse Genome 430 2.0 Array from the Gene Expression Omnibus (GEO; http://www.ncbi.nlm.nih.gov/geo) and ArrayExpress (http://www.ebi.ac.uk/arrayexpress) public databases (see Figure 1A, Supplementary Table S1). The raw CEL files obtained were normalized using the Robust Multiple-Array Average (RMA) (27) and Present/Absent flags were extracted by the MAS5.0 algorithm (28), both methods from the ‘affy’ package of the Bioconductor suite (http://www.bioconductor.org/) in the R statistical environment (29).

Figure 1.

The integrative analysis workflow. (A) Collection and global normalization of 481 publicly available samples. (B) Pearson’s correlation derived distance matrix and agglomerative hierarchical clustering with average linkage of the normalized data. (C) PGE analysis with MultiLevel Otsu thresholding for identification of recurrent aberrant localized expression across the dendrogram. (D) Catalogue of recurrent DE clusters. Filtering of samples according to the expression of pluripotency and lineage-specific markers, resulting in the Nanog-high subgroup of 315 pluripotent samples. Identification of DE genes between the Nanog-high Normal versus Variant groups, the Normal-Chr8 versus Variant-Chr8 groups and the Normal-Chr11 versus Variant-Chr11 groups. (E) Training and testing of classification models using PAM and SVMs for the prediction of Variant samples.

Hierarchical clustering of samples

In order to obtain a measure of similarity between samples and subsequently groups of samples, we have performed agglomerative hierarchical clustering with average linkage, using a distance matrix based on the Pearson’s correlation of the samples (Figure 1B). Large data collections, such as the one analysed in this manuscript, may present variations due to differences in RNA quality or hybridization processing, culture conditions or experimental treatments between different labs (30). In order to account for this complexity, we designed an iterative strategy where each sample (or group of samples after the leaf nodes of the dendrogram) is compared with the sample (or group of samples) with the most correlated transcriptome available in the matrix in a branch-wise manner, according to the dendrogram obtained from the hierarchical clustering (Figure 1C). This approach can reveal the unique subtle changes of each sample that differentiate it from its most similar neighbour. It therefore deviates from previous methodologies in that it avoids the use of a globally averaged profile as a definition of a ‘normal’ stem cell state to represent complex stem cell expression patterns (11,12).
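The clustering step described above can be illustrated with a minimal sketch. It assumes a distance of 1 minus Pearson's r and uses a naive average-linkage implementation in plain Python. The function names and the toy profiles are hypothetical and are not taken from the authors' pipeline, which was run in R.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    `profiles` maps sample name -> expression vector. The distance between
    two samples is assumed to be 1 - Pearson r. Returns the merges in order,
    each as (members_a, members_b, distance).
    """
    names = list(profiles)
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

    def cluster_distance(c1, c2):
        # Average linkage: mean of all pairwise sample distances.
        pairs = [dist[frozenset((a, b))] for a in c1 for b in c2]
        return sum(pairs) / len(pairs)

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy profiles (hypothetical values): s1/s2 and s3/s4 are near-duplicates.
toy = {
    "s1": [1.0, 2.0, 3.0, 4.0],
    "s2": [1.1, 2.1, 2.9, 4.2],
    "s3": [4.0, 3.0, 2.0, 1.0],
    "s4": [4.2, 2.9, 2.1, 1.0],
}
merges = average_linkage(toy)
```

Each entry in `merges` pairs a sample (or group) with its most similar neighbour, which is the pairing that the branch-wise comparisons follow along the dendrogram.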

Identification of DE clusters

In order to identify clusters of DE genes, we have considered all the probesets for which genome mapping annotation was available (43 109 probesets in total). Multiple probesets for a single gene were replaced with their median value resulting in 26 524 probesets. For each comparison in the dendrogram, we estimated suitable fold change (FC) cut-off values for differential expression by applying a novel approach, the MultiLevel Otsu method used in image processing (31). The average cut-offs used across the dendrogram were >1.56 FC for over-expressed genes and <0.66 FC for under-expressed genes. In addition, for each comparison, we filtered out probesets which were absent in more than 50% of the samples in the comparison. Next, we used the Positional Gene Enrichment (PGE) algorithm (32) to identify clusters of DE genes (Figure 1C). Briefly, PGE uses an adaptive genomic window approach to identify chromosomal regions that are over-represented in user-provided gene lists. We have implemented the PGE algorithm in Java and run the method with the lists of all up-regulated and all down-regulated genes from the previous step. We used the rank position of each probeset on the chromosome instead of its physical coordinates in order to minimize regional biases due to gene-dense regions or gene deserts. For each comparison of samples, the PGE algorithm corrects the P-value of the discovered clusters for multiple testing using the False Discovery Rate (FDR) (33). We discarded clusters with an adjusted P-value > 0.01. To additionally assess the statistical significance of the predicted clusters across the whole dendrogram, we calculated an empirical FDR based on randomization by generating 1000 permutations of randomized genomic mappings of the FC values, keeping the dendrogram topology constant. Finally, once the specific chromosomal clusters were discovered, the global trimmed mean of each gene was used to predict the type of cluster, i.e. up- or down-regulation (the 0.05% of outlier expression values per gene was discarded). The final list of clusters was filtered for an adjusted P-value < 1.0E-4 and cluster size of at least 10 DE genes (Figure 2).
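To make the thresholding idea concrete, the following is a minimal sketch of Otsu's criterion extended to two thresholds (three classes: under-expressed, unchanged, over-expressed). It assumes the input is a list of log2 fold changes and searches all cut positions exhaustively rather than using a histogram. The function name and the toy values are hypothetical, and this is not the authors' implementation.

```python
from itertools import accumulate


def two_threshold_otsu(values):
    """Return (low, high) cut-offs that split `values` into three classes
    maximising the between-class variance (Otsu's criterion).

    Maximising the between-class variance is equivalent to maximising
    sum_k (S_k ** 2 / n_k), where S_k and n_k are the sum and size of
    class k, because the overall mean is fixed.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three values for three classes")
    prefix = [0.0] + list(accumulate(xs))

    def score(i, j):
        # Contribution of the class xs[i:j] to the criterion.
        s = prefix[j] - prefix[i]
        return s * s / (j - i)

    best = None
    for i in range(1, n - 1):
        for j in range(i + 1, n):
            total = score(0, i) + score(i, j) + score(j, n)
            if best is None or total > best[0]:
                best = (total, i, j)
    _, i, j = best
    # Place each threshold midway between the neighbouring values.
    return (xs[i - 1] + xs[i]) / 2, (xs[j - 1] + xs[j]) / 2


# Toy log2 fold changes (hypothetical): three well-separated groups.
log2_fc = [-1.2, -1.0, -0.9, -0.1, 0.0, 0.05, 0.1, -0.05, 0.9, 1.0, 1.1]
low, high = two_threshold_otsu(log2_fc)
```

The two returned values play the role of the under- and over-expression cut-offs for one comparison. On real data they would be recomputed for every comparison in the dendrogram, which is why the text reports only average cut-offs.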

Figure 2.

The circular karyotype of all predicted significantly over-expressed (red) and under-expressed (blue) DE clusters in the matrix and the genes that are DE between predicted Normal and Variant samples (red for up-regulated genes and blue for down-regulated). Larger effects observed in chromosomes 8, 11, 14 and X. For an example of the enhanced detection power of the approach, see also Supplementary Figure S1. For detailed description of the samples included in the analysis, see Supplementary Table S1.

Visual inspection and validation

We visually inspected the chromosomal clusters by plotting the rank position of each gene across the chromosomes using Di.S.C.O. (Discovery of Subtle Clustered Organization), a custom-developed software tool (Skylaki et al., in preparation). Expression levels were presented by a colour gradient defined by the MultiLevel Otsu-derived thresholds, whereas each gene was represented by the median value of all its corresponding probesets (see Supplementary Figure S1).

Selection of pluripotent ESC and iPSC samples based on marker expression

To distinguish between mouse ESCs, iPSCs and their differentiating or partially reprogrammed counterparts, we examined the available sample annotation and the expression of hallmark pluripotency genes such as Nanog (34,35), as well as a range of differentiation markers (Figure 1D). It can be hypothesized that the high expression of pluripotency markers in combination with low expression of lineage-specific genes reflects cell populations rich in pluripotent stem cells. This filtering step was essential in order to focus on transcriptional changes that are specific to pluripotent stem cells and not the obvious result of cell mixtures in different stages of differentiation or reprogramming. The resulting subset of 315 homogeneous pluripotent populations (272 ESC and 43 iPSC samples), from here on referred to as Nanog-high samples, was used at the final stage of the analysis for the identification of recurring DE genes in samples that carry DE clusters as well as the training and testing of classification models as presented hereafter.

Differential gene expression analysis

As mentioned previously, we were specifically interested in analysing the positional transcriptional patterns of the Nanog-high subgroup which more closely represents the pluripotent state. In addition, we focused on whole- or partial-chromosome spanning clusters which are likely to reflect underlying aneuploidies since co-regulation of large genomic regions is not commonly observed as a result of transcriptional regulation. The 315 Nanog-high samples were divided into two groups: the group termed Normal consists of samples where no large-scale DE clusters could be identified in the genome, whereas the group termed Variant comprises samples that bear large-scale chromosomal clusters of DE genes in at least one chromosome. In order to determine whether there is a distinct transcriptional signature that can be associated with the presence of such large DE clusters in the Nanog-high Variant group, we performed differential expression analysis using a two-class Significance Analysis of Microarrays (SAM) (36) (Figure 1D). The analysis was performed using the ‘samr’ package in R (500 permutations, FDR = 0.05). From this stage onwards, all probesets were considered and no replacement was performed, in order to account for the unique behaviour of each probeset which may represent alternative splicing or polyadenylation events. We compared (i) Normal versus Variant samples; (ii) samples with chromosome-8 specific patterns (Variant-Chr8) versus all other samples (Normal-Chr8); and (iii) samples with chromosome-11 specific patterns (Variant-Chr11) versus all other samples (Normal-Chr11). The Normal-Chr8 and Normal-Chr11 groups also contained the samples that had DE clusters on any chromosome other than 8 and 11, respectively. The lists of DE genes per comparison are presented in Supplementary Tables S2–S4 (with FC ≥ 1.5 and adjusted P-value < 0.05). 
Chromosomes 8 and 11 were specifically chosen for this analysis because they are the chromosomes most frequently affected by aneuploidy and, in fact, 70% of the predicted Variant samples carry whole or partial-chromosome spanning clusters on at least one of these two chromosomes (see ‘Results’ section).

Classification

To investigate whether the set of DE genes common in samples that carry large DE clusters on any chromosome and in samples that carry chromosome-8 and -11 specific clusters can be predictive of the presence of such clusters, we employed two well-established classification techniques: Prediction Analysis of Microarrays (PAM) (37) and Support Vector Machines (SVMs) (38,39) (Figure 1E). PAM uses a nearest shrunken centroid approach to identify the genes that best separate between classes. We used the ‘pamr’ package in R (40). For the linear SVMs we used the ‘e1071’ package in R (41). Briefly, SVMs map the input data onto a high-dimensional space, where classification can be achieved by defining a hyperplane that separates the data points of the two classes. For the construction of the SVM classifiers, we used a subset of 187 samples and 37 samples for training and validation, respectively. After selecting the best scoring classifier, we merged the training and validation subsets to train the classifiers again and obtain the final accuracy score on a test dataset of 91 entirely independent samples (the remainder of the complete data collection). Our decision was based on accuracy and F1 score, defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

where TP = True Positive, TN = True Negative, FP = False Positive and FN = False Negative. For the chromosome-specific classifiers, we additionally accounted for differences in the number of input samples per class by adjusting the weight parameters of the SVM to be proportional to the number of samples in each class. The Recursive Feature Elimination (RFE) method (42) was applied to linear SVMs to obtain small subsets of predictive genes.
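The two scores used to choose between classifiers can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions of accuracy and F1. The function names and the example counts are hypothetical and are not results from this study.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Hypothetical confusion counts for a 91-sample test set (not from the paper).
tp, tn, fp, fn = 50, 33, 5, 3
acc = accuracy(tp, tn, fp, fn)
f1 = f1_score(tp, fp, fn)
```

Because F1 ignores true negatives, it can sit well below accuracy when the positive class is small, which is worth bearing in mind when comparing chromosome-specific classifiers trained on unbalanced classes.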

GO analysis

Gene ontology (GO) enrichment (GOTERM_BP_5) was calculated using the DAVID functional annotation bioinformatics tool (43,44). For the GO analysis only probesets with FC > 1.5 and Q-value < 0.05 (from SAM) were considered. Enrichment significance was limited to a very stringent Benjamini–Hochberg adjusted P-value < 0.01.

RESULTS

A catalogue of DE clusters in mouse ESCs and iPSCs

The PGE analysis performed across the dendrogram generated a large set of DE clusters (Figure 2). The most prevalent recurring intervals that we have observed map to chromosomes 6, 8, 11, 14 and most commonly in chromosome X (Figure 2). It is plausible that a percentage of the observed clusters on chromosome X correspond to varying states of X chromosome inactivation (XCI), whereas others to DNA CN alterations. However, it should be noted that in mouse ESCs, all lines for which sample annotation was available (∼70%) were annotated as male. Interestingly, 75.43% of the identified clusters are up-regulated, which implies that amplifications or activation events are much more frequent than deletions or coordinated down-regulation. A strikingly similar percentage of copy number variations (CNVs) in human ESCs have been reported to correspond to amplifications (72%) (45). Focusing on the subgroup of the 315 Nanog-high samples (Figure 3A and B), we could identify whole or partial-chromosome spanning clusters in 179 samples, 56.83% of the group. We further validated these clusters by plotting the gene expression levels on the chromosomes (see ‘Methods’ section and Supplementary Figure S1). Large expression domains are good predictors of underlying aneuploidy. The percentage of samples that carry such large-scale clusters of DE genes in the Nanog-low subset is much lower (30%, Figure 3A) than the one in the Nanog-high subgroup (56.83%). This difference may reflect differences in the frequency of pluripotent cells in cultures or the inability to detect these subtle signatures in mixtures of differentiating cells such as the ones in the Nanog-low group. These findings are consistent with previous cytogenetic studies in mouse pluripotent stem cells which also highlight recurrent changes of chromosome 8, 11 and 14 (Figure 3C) (15,17,18). 
Our method additionally identified a high number of clusters in chromosomes 6 and X and frequently recurring pairs of large chromosomal clusters which tend to appear across many different experiments. The latter include clusters on chromosomes 8 and 11 (hypergeometric, P-value = 0.001), chromosomes 8 and 14 (hypergeometric, P-value = 3.20E-06), chromosomes 11 and 6 (hypergeometric, P-value = 0.019) and chromosomes 14 and 17 (hypergeometric, P-value = 4.00E-11). A detailed breakdown of the specific percentages of predicted clusters per chromosome for the Nanog-high subgroup is presented in Figure 3E.
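A hypergeometric test of this kind can be sketched as follows. The text does not spell out the exact parameterisation, so the sketch assumes the population is the set of samples, the "successes" are samples with a cluster on one chromosome, the "draws" are samples with a cluster on the other, and the statistic is the overlap. The function name and the example counts are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    Under the assumed setup: `population` = number of samples,
    `successes` = samples with a cluster on chromosome A,
    `draws` = samples with a cluster on chromosome B,
    `k` = samples carrying clusters on both.
    """
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)


# Hypothetical counts: 20 samples, 5 with cluster A, 4 with cluster B,
# and all 4 of the latter also carry cluster A.
p_value = hypergeom_upper_tail(4, 20, 5, 4)
```

A small value indicates that the two clusters co-occur more often than would be expected if they were distributed independently across samples.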

Figure 3.

Description of the large-scale chromosome spanning DE clusters in the Nanog-high subgroup. (A) Percentages of Variant and Normal samples in the differentiating or partially reprogrammed Nanog-low group of samples (n = 166). (B) Percentages of Variant and Normal samples in the Nanog-high pluripotent group of samples (n = 315). The downstream analysis was focused on this subgroup of 315 samples. (C) Comparison of the frequencies of predicted abnormalities per chromosome in the present study and two independent cytogenetic studies of mouse ESCs (15,17). (D) Venn-diagram representing the co-occurrence of large DE clusters between chromosomes 6, 8, 11 and 14. Figure constructed in Venny (46). (E) Breakdown of percentages for the aberrant chromosomes and the associated aberrant chromosome pairs. For a detailed comparison between mouse ESCs and iPSCs, see also Supplementary Figure S2.

Finally, a comparison between ESC and iPSC-specific clusters on the autosomes revealed that in both cases more than half of the samples carry at least one large-scale chromosomal cluster (58% of samples for ESCs and 51% for iPSCs) (Supplementary Figure S2). Interestingly, chromosome 11 patterns are mostly present in ESCs. In iPSCs, the chromosome X changes, which are predicted gains or up-regulations, could reflect differences between male and female lines such as different states of XCI. Unfortunately, we were unable to obtain the annotation for the sex of the line for the majority of iPSC samples studied and thus, sex chromosomes have been excluded from further analysis.

Recurring DE genes in samples carrying large DE clusters

The SAM analysis performed between Nanog-high Normal and Variant groups, Normal-Chr8 and Variant-Chr8 groups and Normal-Chr11 and Variant-Chr11 groups revealed sets of DE genes for each comparison (see Supplementary Tables S2–S4). A heatmap representation of the top 50 DE genes from each comparison is presented in Figure 4A–C.

Figure 4.

Heatmap representation of the top 50 genes generated from SAM analysis. The panel of the three core pluripotency genes (Nanog, Pou5f1 (Oct4) and Sox2) at the bottom of each heatmap demonstrates that the large DE clusters are independent of the core pluripotency program in the stem cell populations. Figure constructed in GenePattern (47). (A) Heatmap of the global set where the Variant group consists of samples with any type of large-scale DE cluster. (B) Heatmap of the chromosome 8-specific set where the Variant-Chr8 group consists of any sample with a chromosome 8-specific DE cluster. (C) Heatmap of the chromosome 11-specific set where the Variant-Chr11 group consists of any sample with a chromosome 11-specific DE cluster. For the SAM-derived lists of DE genes for each comparison, refer to Supplementary Tables S2–S4.

The presence of a recurring set of DE genes across all Variant samples suggests that there is a common downstream effect in these samples independent of the genomic location of the DE cluster they carry. Importantly, the identified DE genes are not necessarily members of the identified clusters. We hypothesize that these cells operate under a positive selection mechanism whose downstream consequences manifest at the transcriptional level despite their different types of DE clusters. The top up-regulated list (Table 1) is typified by genes linked to pluripotency, genomic integrity and cell cycle. An example of this type of gene is Pramel7, which has been recently reported to promote self-renewal in the absence of exogenous LIF in mouse ESCs (48). Other interesting examples of differentially over-expressed genes are Crxos1, a homeoprotein that has been shown to play a dual role in self-renewal and differentiation (49), the non-homologous end-joining repair gene Lig4 (50), the genome maintenance regulator Zscan4 (51) as well as the cell-growth modulator Lin28 (52). The function of these genes is consistent with the properties of genes expected to drive positive selection in competitive cultures.

Table 1.

Functional categories of the top 50 over- and under-expressed genes in the Variant feature set

Functional category	Up-regulated genes (Variant)	Down-regulated genes (Variant)
Cell cycle/growth	Lin28, Ccnb1ip1, Dnajc2, Anapc10, Syce1	Grb10
Survival	Pou4f2, Mras	–
Protein metabolic process	St8sia1, Anapc10, Dub1, Eif1a, Hck, Map2k6, Rpl39l, Eif2s2	Rps9
Genomic integrity	Lig4, Zscan4	–
Cell death	Plagl1, Map2k6, Xaf1	Serpinh1 (Hsp47), Cdh11, Cyr61 (Ccn1)
Stem cells	Lin28, Mras, Pramel7, Crxos1, Zfp42 (Rex1)	–
Cancer	Ceacam1, St8sia1	Malat1, Fus
ECM	–	Bgn, Col1a1, Col1a2, Col3a1, Col5a2, Lox, Tnc, App
Other/unknown function	Calcoco2, Xlr3, Xlr4, 100043292, Pramel6, AU015836, LOC639910, LOC100038935, Spesp1, Hck, H19, Gsta3, Glod5, Snrpn /// Snurf, 2200001I15Rik, Snhg3, 2410004A20Rik, Glrx, Cox7a1, St8sia1, Sec23ip, Zfp560, Sdc4, 666185, Glrx, Gprc5b	Acta2, Thbs1, Mid1, Tagln, Fstl1, Atrx, Prss23, Ptprf, Cd44, Cdk7, Hs6st2, Prtg, Pkdcc, LOC72520, F630007L15Rik, Axl, Fstl1, Lpp, Meg3, Prtg, Sox11, Ptgs2, A130040M12Rik

The top 50 up- and down-regulated genes (ranked by FC) in the Global feature set (which in total includes 128 over-expressed and 543 under-expressed genes). In bold: candidates with literature evidence that supports functional significance in ESC self-renewal.


Classification of samples carrying large DE clusters

Given the high percentage of samples in our analysis that carry large DE clusters and the presence of a distinct set of DE genes in these samples, we investigated the predictive power of these sets by training classification models using PAM and SVMs. The results are presented in Table 2. In all three case studies, that is Variant (any type of cluster), Variant-Chr8 and Variant-Chr11, we achieved predictive accuracy higher than 80% using linear SVM classifiers with just a limited number of DE genes.

Table 2.

Performance of classifiers

Classifier	Set	Feature selection	Accuracy	F1 score
PAM	Global	None	0.82	0.88
PAM	Global	SAM All	0.87	0.90
SVM	Global	None	0.86	0.89
SVM	Global	SAM All	0.92	0.94
SVM	Global	RFE_SVM Top 100	0.89	0.92
SVM	Global	RFE_SVM Top 50	0.91	0.94
SVM	Global	RFE_SVM Top 10	0.55	0.59
SVM	Chr8	None	0.73	0.68
SVM	Chr8	SAM All	0.80	0.78
SVM	Chr8	RFE SVM Top 50	0.81	0.78
SVM	Chr8	RFE SVM Top 10	0.80	0.79
SVM	Chr8	RFE SVM - No Chr8	0.71	0.63
SVM	Chr11	None	0.73	0.29
SVM	Chr11	SAM All	0.93	0.79
SVM	Chr11	RFE_SVM Top 50	0.95	0.81
SVM	Chr11	RFE_SVM Top 10	0.90	0.61

Best performing classifiers (with bold we highlight the classifier trained with the top 50 features in each set). Feature selection was performed from the SAM output list by RFE. In the RFE SVM—No Chr8 feature set, genes mapped to chromosome 8 were excluded from the up-regulated list. Global: Normal and Variant, Chr8: Normal-Chr8 and Variant-Chr8, Chr11: Normal-Chr11 and Variant-Chr11.

Remarkably, by applying the RFE method, we could identify small subsets of candidate genes that demonstrate a high class prediction power. For the Variant set, the top 50 genes are sufficient to predict the presence of DE clusters with an accuracy of 91%. In the case of chromosome-specific SVMs it was possible to narrow our selection down to the top 10 ranked genes while still maintaining a high accuracy (80% or higher). The top 10 up-regulated genes in the Variant-Chr8 group include the anti-apoptotic Bag4 as well as Lsm1, both described as breast cancer oncogenes in the 8p11-p12 recurrent amplicon in human. BAG4 and LSM1, in combination with C8ORF4, influence growth factor independence and anchorage-independent growth of MCF10A breast cancer cells (53). Interestingly, a recent study has implicated another anti-apoptotic gene, BCL2L1, in conferring growth advantage to human pluripotent cells carrying the 20q11.21 amplicon (54). The Bag4 and Bcl-2 anti-apoptotic protein families interact to regulate cell survival (55). The up-regulation of different members of the anti-apoptotic pathways in both mouse (present study) and human (54) may indicate the existence of a common conserved path towards selective growth in both organisms. Finally, a selection of solely non-chromosome 8 mapped genes could still be used to train the classifier for chromosome 8 clusters with up to 71% accuracy (Table 2). This result suggests that there is a non-chromosome 8-specific program that is affected by the presence of the DE cluster on chromosome 8, further supporting the evidence for a secondary mechanism independent of the chromosomal location of the clusters.

DISCUSSION

In summary, we have used a sensitive integrative method to analyse the transcriptome of the largest collection to-date of mouse ESCs and iPSCs samples. We were able to quantify the number of samples that carry a large-scale cluster of genes with concordant changes in expression levels and assign the greatest percentage of these intervals to chromosomes 8 and 11. These findings are consistent with cytogenetic studies reporting recurrent aberrations on chromosomes 8 and 11 in murine ESC populations (15,17). A subset of the smaller recurrent intervals may be due to co-regulated functional gene clusters as has been previously observed for the Nanog locus in mouse ESCs (13), whose up-regulation was also detected in our analysis. The prediction power of the method and the large scale of the data analysis revealed a complex pattern of genomic regions which are prone to be concordantly DE, such as the chromosome pairs 8 and 11, 6 and 11, 8 and 14, and 14 and 17. Importantly, many of the events identified here are likely to be of a functional significance, since they have been repeatedly selected for in culture. Our analysis shows that in a set of 315 pluripotent samples selected for high Nanog expression, 56.83% carry large-scale clusters of DE genes. As large-scale clusters of DE genes can be indicative of underlying aneuploidies, we hypothesise that the majority of these clusters, which overlap with previously reported hotspots of aneuploidy, can in fact be the effect of acquired chromosomal aberrations. The presence of such clusters is not a universal characteristic of normal pluripotent stem cells as the remaining 43.17% of the pluripotent samples carry no such large-scale changes and still demonstrate high expression of pluripotency markers. 
Therefore, these clusters are not essential for the survival of pluripotent stem cells under normal conditions; rather, they may contribute towards the dominance of the affected cells in a selective culture environment, possibly through the deregulation of a small set of driver genes. It should also be noted that the majority of these clusters do not span the Nanog locus. A recent study has indicated that the occurrence of trisomy 12 in human iPSCs is a result of the up-regulation of the NANOG-GDF3 cluster on chromosome 12 (11). The authors proposed that this is a likely mechanism for driving the aneuploidy, since over-expression of NANOG leads to enhanced self-renewal. Such an effect may be possible in the presence of NANOG-spanning clusters; however, in our data, there is a great number of Nanog-high Variant samples, irrespective of the genomic position of the DE cluster they carry. It is likely that a change that promotes cell growth and/or blocks differentiation and apoptosis would be selected in a self-renewing, Nanog-positive cell in culture, allowing it to eventually dominate the entire cell population. As a result, the generated mixture of cells will show a bias towards the self-renewing pluripotent state and therefore carry markers of such cells, including Nanog. The comparison between Normal and Variant profiles has revealed a set of DE genes highly connected to pluripotency, cell cycle and apoptosis. It has been proposed before that positive selection in culture can occur through multiple mechanisms, in particular via cell cycle progression and deregulation of the p53 pathway or activation of anti-apoptotic pathways (26). Prominent representatives of these processes are present in the selected features (Lin28, Mras, Pramel7, Crxos1, Rex1, Lig4 and Zscan4 among others). 
In addition, it is interesting to note that some aneuploid cells compensate for the adverse effects of higher DNA CNs by modulating pathways involved in balancing protein stoichiometry, such as ribosome biogenesis and protein degradation (56). A similar effect is observed in the case of chromosome 8 clusters, which demonstrate enrichment in GO categories related to RNA processing (Benjamini adjusted P-value = 2.23E-03). Importantly, there is a recurring set of DE genes present in Variant samples, irrespective of the genomic mapping of the cluster they carry. It is in fact possible to use this limited number of genes to train highly accurate classifiers in order to assess the transcriptional integrity of pluripotent cultures. We speculate that the presence of a recurring transcriptional signature indicates a downstream response mechanism that confers selective advantage to the affected cells and can be detected through the expression of a limited number of nodes. This signature could additionally be used to identify core pathways that can subsequently be targeted to develop culture conditions that are anti-selective for aneuploidy. Such an approach has been effectively applied in trisomic mouse embryonic fibroblasts (MEFs) and human cancer cell lines with compounds that are anti-selective for karyotypically abnormal cells (57).
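
The GO enrichment for chromosome 8 clusters above is reported as a Benjamini adjusted P-value. As a minimal sketch of how such an adjustment is commonly computed, the Python snippet below implements the standard Benjamini-Hochberg step-up procedure. The function name and the raw P-values are hypothetical and serve only as illustration; they are not taken from this study, and the exact settings of the enrichment tool used here are not specified in this section.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down; the running minimum keeps
    # the adjusted values monotone and caps them at 1.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for five GO categories (illustrative only).
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw={p:.3g}  adjusted={q:.3g}")
```

Each raw P-value is scaled by the number of tests divided by its rank, so a category is penalised more heavily the more categories are tested alongside it. An adjusted value such as 2.23E-03 therefore remains small even after accounting for multiple GO categories.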

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–4 and Supplementary Figures 1 and 2.

FUNDING

Biotechnology and Biological Sciences Research Council (BBSRC) and the EU seventh framework program EuroSyStem. Funding for open access charge: EU FP7 program EuroSyStem.

Conflict of interest statement. None declared.

48 in total

1. High-resolution DNA analysis of human embryonic stem cell lines reveals culture-induced copy number changes and loss of heterozygosity.

Authors: Elisa Närvä; Reija Autio; Nelly Rahkonen; Lingjia Kong; Neil Harrison; Danny Kitsberg; Lodovica Borghese; Joseph Itskovitz-Eldor; Omid Rasool; Petr Dvorak; Outi Hovatta; Timo Otonkoski; Timo Tuuri; Wei Cui; Oliver Brüstle; Duncan Baker; Edna Maltby; Harry D Moore; Nissim Benvenisty; Peter W Andrews; Olli Yli-Harja; Riitta Lahesmaa
Journal: Nat Biotechnol Date: 2010-03-28 Impact factor: 54.908

2. Identification and classification of chromosomal aberrations in human induced pluripotent stem cells.

Authors: Yoav Mayshar; Uri Ben-David; Neta Lavon; Juan-Carlos Biancotti; Benjamin Yakir; Amander T Clark; Kathrin Plath; William E Lowry; Nissim Benvenisty
Journal: Cell Stem Cell Date: 2010-10-08 Impact factor: 24.633

3. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources.

Authors: Da Wei Huang; Brad T Sherman; Richard A Lempicki
Journal: Nat Protoc Date: 2009 Impact factor: 13.491

4. Lin28 modulates cell growth and associates with a subset of cell cycle regulator mRNAs in mouse embryonic stem cells.

Authors: Bingsen Xu; Kexiong Zhang; Yingqun Huang
Journal: RNA Date: 2009-01-15 Impact factor: 4.942

Review 5. Aneuploidy: cells losing their balance.

Authors: Eduardo M Torres; Bret R Williams; Angelika Amon
Journal: Genetics Date: 2008-06 Impact factor: 4.562

6. Aberrant silencing of imprinted genes on chromosome 12qF1 in mouse induced pluripotent stem cells.

Authors: Matthias Stadtfeld; Effie Apostolou; Hidenori Akutsu; Atsushi Fukuda; Patricia Follett; Sridaran Natesan; Tomohiro Kono; Toshi Shioda; Konrad Hochedlinger
Journal: Nature Date: 2010-04-25 Impact factor: 49.962

7. Cloning of complementary DNAs encoding structurally related homeoproteins from preimplantation mouse embryos: their involvement in the differentiation of embryonic stem cells.

Authors: Koichi Saito; Hajime Abe; Masato Nakazawa; Emiko Irokawa; Masafumi Watanabe; Yusuke Hosoi; Miki Soma; Kano Kasuga; Ikuo Kojima; Masayuki Kobayashi
Journal: Biol Reprod Date: 2009-12-16 Impact factor: 4.285

8. An improved method for detecting and delineating genomic regions with altered gene expression in cancer.

Authors: Björn Nilsson; Mikael Johansson; Anders Heyden; Sven Nelander; Thoas Fioretos
Journal: Genome Biol Date: 2008-01-21 Impact factor: 13.583

9. Zscan4 regulates telomere elongation and genomic stability in ES cells.

Authors: Michal Zalzman; Geppino Falco; Lioudmila V Sharova; Akira Nishiyama; Marshall Thomas; Sung-Lim Lee; Carole A Stagg; Hien G Hoang; Hsih-Te Yang; Fred E Indig; Robert P Wersto; Minoru S H Ko
Journal: Nature Date: 2010-03-24 Impact factor: 49.962

10. CD30 expression reveals that culture adaptation of human embryonic stem cells can occur through differing routes.

Authors: Neil J Harrison; James Barnes; Mark Jones; Duncan Baker; Paul J Gokhale; Peter W Andrews
Journal: Stem Cells Date: 2009-05 Impact factor: 6.277
