Christine A Wells, Jarny Choi.
Abstract
Transcriptional profiling is a powerful tool commonly used to benchmark stem cells and their differentiated progeny. As the wealth of stem cell data builds in public repositories, we highlight common data traps, and review approaches to combine and mine this data for new cell classification and cell prediction tools. We touch on future trends for stem cell profiling, such as single-cell profiling, long-read sequencing, and improved methods for measuring molecular modifications on chromatin and RNA that bring new challenges and opportunities for stem cell analysis.
Keywords: bioinformatics; pluripotent stem cell; reprogramming; single-cell sequencing; transcriptome
Year: 2019 PMID: 31412285 PMCID: PMC6700522 DOI: 10.1016/j.stemcr.2019.07.008
Source DB: PubMed Journal: Stem Cell Reports ISSN: 2213-6711 Impact factor: 7.765
Figure 1. Future Platforms for Molecular Profiling of Stem Cells
(A and B) Current platforms for stem cell profiling include (A) assays of chromatin modifications using chromatin immunoprecipitation (ChIP) and chromatin accessibility using the assay for transposase-accessible chromatin sequencing (ATAC). Future modifications (B) will involve real-time measurements of the dynamics of protein phosphorylation during transcriptional programs.
(C and D) (C) Transcription start sites (TSS) are currently measured by cap analysis of gene expression (CAGE), which relies on capture of the methyl-G mRNA cap. Future platforms (D) in single cells will allow discrimination of allelic differences in transcription initiation.
(E and F) (E) Alternate splicing is currently predicted by computational alignment of short sequencing reads across exon boundaries, but these are poor at resolving unique transcripts and commonly result in consensus transcripts. Long-read sequencing technologies, with reads stretching over 1 kb or more, are now evolving to explore transcript isoforms. The next iteration of alternate splicing (F) will be computational, moving from gene-centric to isoform-centric interaction networks and enabling the annotation of higher-resolution stem cell pathways.
(G and H) (G) Short-read RNA-seq is the most widely adopted method of measuring transcriptional activity from a locus. Future applications of RNA-seq (H) will be the compilation of gold standard transcriptional atlases that allow users to upload and benchmark their own data.
(I) Current methods for measuring nucleotide modifications involve bisulfite DNA sequencing to convert unmethylated cytosine to uracil, or antibody-based immunoprecipitation methods that bind methylated adenosine or variants of methylated cytosine on RNA (RNA immunoprecipitation [RIP]) or DNA (ChIP).
(J) Future methods will expand the repertoire of metabolites capable of modifying chromatin proteins or RNA, building more immediate linkages between the cell transcriptome and metabolome.
Figure 2. Seven Deadly Sins of Data Analysis
1. Replication. Technical replication measures the reliability of the platform but is not informative in a statistical analysis of biological group differences. These statistical tests broadly assess whether the variance in expression between two groups is greater than the variance expected within each group. Well-designed studies therefore provide enough replication to properly assess biological variability; in a stem cell context, this means profiling multiple stem cell lines rather than the same line multiple times. As the novelty of scRNA-seq studies dissipates, and the cost of running the experiments decreases, a group of cells from a single individual will no longer be considered sufficient replication of a model.

2. Experimental design: confounding "batch" and "biology." There is no bioinformatic way to separate experimental variables from biological variables if biological groups have been "batched" separately. When groups are batched in this way, biological signal can be confounded by experimental variables such as RNA kit, amplification method, platform differences, or even sequencing date. This accounts for >10% of datasets reviewed and failed by the Stemformatics pipeline.

3. Normalization strategies that predetermine group membership before testing group differences. Data integration needs careful consideration. Normalization strategies that preassign groups as "similar" or "different" will adjust expression values to harmonize members of a group, and this is particularly problematic if the study design is unbalanced (for example, if a study compares in-house data with a subset of exemplar samples from an external dataset). This can become a self-fulfilling prophecy: samples expected to be similar share patterns of expression (for example, group close to one another in a principal-component analysis) because those similarities are enforced by the normalization strategy. Likewise, differences between the groups should be expected to be exaggerated (Nygaard et al., 2016).

4. Misuse of signature genes to prove cell identity. Stem cell researchers may be most comfortable using antibodies to visualize expression of individual genes in a cell, particularly as methods such as flow cytometry allow us to classify cells based on positive and negative molecular gates. An entire literature spuriously claims that stromal cells are pluripotent because of an anti-OCT4 antibody signal in the cultured cells (Warthemann et al., 2012, Xu et al., 2015). However, computational predictors of cell identity do not work in the same way. A machine-learning classifier operates on a vector of gene expression values, in which the presence or absence of any single molecule cannot substitute for the whole. These types of classifiers cannot be validated using single-gene PCR measurements or antibody staining; they require application of the whole signature for accurate classification. Validation of these signatures relies on their application to new datasets, with continuous assessment of the stability of the signature and of the false-positive/false-negative rates as it is applied to new data.

5. Gene set enrichment: few genes drive many pathways. Results of a gene set enrichment analysis should be interpreted with care, because it is easy to find many gene sets enriched at low p values due to a small number of the same genes occurring in multiple sets (Mar et al., 2011). This can lead to the false interpretation that many gene sets drive a process, whereas those genes may be passengers in another process, which confounds the analysis (Venet et al., 2011).

6. Metadata mismanagement. A crucial yet often overlooked aspect of transcriptome data generation is metadata management. Mislabeled samples or errors in data entry commonly go undetected, leading to potentially erroneous conclusions about the data. "Sample swaps," in which samples have clearly been assigned to an incorrect group, are common in the public databases. It is thus very important to give due consideration to this issue from the beginning of the experimental design phase.

7. Missing data. Unfortunately, it is not uncommon to find published studies in which some of the crucial raw data are missing from the public repositories where they should reside. A frequent scenario is deposition of partial information (e.g., control samples only) to obtain an accession number. This accounts for more than one-third of the publications reviewed and rejected by the Stemformatics platform. Regardless of the underlying intention behind such missing data, this highlights a serious flaw in the current system of reviews carried out by journals.
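The batch-confounding trap described in sin 2 can be caught at the design stage, before any sequencing is run. The sketch below is a minimal illustration (the sample sheet, group names, and batch labels are all invented for this example): any biological group whose samples were all processed in a single batch is flagged, because its biological signal can no longer be separated from the batch effect by any downstream correction.

```python
# Hypothetical sample sheet: (biological group, processing batch) per sample.
samples = [
    ("iPSC", "batch1"), ("iPSC", "batch1"), ("iPSC", "batch1"),
    ("fibroblast", "batch2"), ("fibroblast", "batch2"), ("fibroblast", "batch2"),
]

# Collect the set of batches observed for each biological group.
batches_per_group = {}
for group, batch in samples:
    batches_per_group.setdefault(group, set()).add(batch)

# A group confined to a single batch is fully confounded: no bioinformatic
# method can separate its biology from kit, platform, or run-date effects.
confounded = sorted(g for g, b in batches_per_group.items() if len(b) == 1)
print("fully confounded groups:", confounded)
```

In this (deliberately bad) design both groups are flagged; distributing each group's samples across batches at random would empty the list.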
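Sin 5 is easy to demonstrate numerically. The sketch below (all gene names, set sizes, and pathway labels are invented) applies a standard hypergeometric over-representation test to five gene sets that each share the same eight "driver" genes with a 100-gene hit list: every set comes back highly significant, even though one small group of shared genes is responsible for all five results.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(overlap >= k) when drawing a set of N genes from a universe of M
    genes that contains n hits (the usual over-representation test)."""
    return sum(comb(n, i) * comb(M - n, N - i)
               for i in range(k, min(n, N) + 1)) / comb(M, N)

M = 20000  # genes in the assayed universe
# 100 differentially expressed "hit" genes, 8 of which are shared drivers.
hits = {f"DRIVER{i}" for i in range(8)} | {f"DE{i}" for i in range(92)}

# Five invented 50-gene pathways that all contain the same 8 driver genes.
pathways = {f"pathway_{j}": {f"DRIVER{i}" for i in range(8)}
                          | {f"PW{j}_{g}" for g in range(42)}
            for j in range(5)}

for name, genes in pathways.items():
    overlap = len(hits & genes)
    p = hypergeom_sf(overlap, M, len(hits), len(genes))
    # Every pathway reports the same tiny p value, driven by the same 8 genes.
    print(f"{name}: overlap={overlap}, p={p:.1e}")
```

Reporting "five enriched pathways" here would be misleading; inspecting the overlap between significant sets (or the leading-edge genes) reveals that a single shared module explains them all.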