Abstract
Genome biology has made substantial progress in its analytical and computational aspects over the last decades. Differential gene expression analysis is one of many computationally intensive areas, and it has largely been developed in the R programming language. Here we explain possible reasons for this dominance of R in gene expression analysis. Next, we discuss the prospects for Python to become competitive in this area of research in the coming years. We show that Python can already be used for single-cell differential gene expression analysis. We pinpoint the parts still missing in Python and the possibilities for improvement.
Keywords: R; differential gene expression; limma; python; single cell expression
Year: 2021 PMID: 35646329 PMCID: PMC9130758 DOI: 10.12688/f1000research.53842.2
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Programming languages supporting biological packages, their names and major focus.
| Programming language | Package | Major applications |
|---|---|---|
| C++ | Bio++ | Sequencing and phylogenetics |
| Java | BioJava | DNA/RNA/Protein sequence analysis |
| JavaScript | BioJS | Mostly Sequence analysis, some elements of GO and visualizations |
| Perl | BioPerl | Mostly sequencing related |
| PHP | BioPHP | Mostly sequencing related |
| Python | BioPython | Mostly sequencing related |
| Ruby | BioRuby | Mostly sequencing related |
| R | Bioconductor | Huge collection of different kinds, no specific subject. Not really for sequencing |
Snakemake is a Python-based workflow management system (in other words, software for organizing pipelines), which makes it more than a regular package compared with the others mentioned in this table.
It is also worth mentioning the Bioconda installation package, which assists in finding and installing various tools for biological data analysis. It is a sort of spin-off of the Anaconda installation package for Python, but with an extended spectrum of options and possibilities.
Steps and functions for microarray differential expression analysis in R and their analogues in Python.
| Step | R package/function | Python analogue |
|---|---|---|
| Fetch data from GEO | GEOquery (Bioconductor) | GEOparse |
| Visualize data | hist(), boxplot() | plt.hist(), plt.boxplot() (Matplotlib) |
| Log2 transform | log2() | log2() (math) |
| Quantile normalization | normalizeBetweenArrays() (Limma), normalize.quantiles() (preprocessCore) | Not directly available; the procedure is described in detail and can be written as custom code |
| Model fit | lmFit(), contrasts.fit() (Limma) | Not directly available; may be built from statsmodels package functions |
| Calculate significance | eBayes() (Limma) | Missing |
| Generate differential expression table | topTable() (Limma) | Missing, but can be written as custom script |
| Extra visualizations | volcanoplot() (Limma), PCA (multiple packages) | Basic plots in Matplotlib, e.g. plt.scatter(); PCA and MDS in scikit-learn |
Note: the package for each function is given in parentheses after the function.
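The table lists quantile normalization as something that can be written as custom code in Python. A minimal sketch of that idea follows, assuming a genes x samples NumPy array of already log2-transformed intensities. The function name `quantile_normalize` is illustrative, and ties are broken by order of appearance, which is a simplification and may not reproduce the output of the R functions exactly.

```python
import numpy as np


def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a genes x samples array.

    Illustrative helper, not a port of any R function. Ties are broken by
    order of appearance instead of being averaged.
    """
    x = np.asarray(matrix, dtype=float)
    # Sort each column independently, then average rank by rank across
    # columns to obtain the shared reference distribution.
    reference = np.sort(x, axis=0).mean(axis=1)
    # 0-based rank of every value within its own column.
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)
    # Replace each value by the reference value at its rank.
    return reference[ranks]


# Toy example: 4 genes x 3 samples.
intensities = np.array([
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 3.0, 6.0],
    [4.0, 2.0, 8.0],
])
normalized = quantile_normalize(intensities)
```

After normalization every column contains the same set of values, only in a different order, which is the defining property of the method.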
Steps and functions for RNAseq DE analysis in edgeR and analogues in Python.
| Step | R package/function | Python analogue |
|---|---|---|
| Read the data from file | read.csv(), read.table() | read_csv() (pandas) |
| Visualize data | hist(), boxplot() | plt.hist(), plt.boxplot() (Matplotlib) |
| Convert to special data format | DGEList() | Not used |
| Calculate normalizing factors (normalize and log-transform) | calcNormFactors() | Not directly available; the procedure described for DESeq2 can be written as custom code |
| Estimate dispersion | Many kinds of estimateDispersion() | Not available |
| Calculate significance | exactTest() | Not available in this context |
| Generate differential expression table | topTable() (Limma) | Missing, but can be written as custom script |
| Extra visualisations | volcanoplot() (Limma), PCA (multiple packages) | Basic plots in Matplotlib, e.g. plt.scatter(); PCA and MDS in scikit-learn |
Note: the DESeq2 protocol performs the steps from normalization to the differential expression table in one function.
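The normalization step that the table describes as custom code can be sketched as below. This assumes the DESeq2-style median-of-ratios size factors and a genes x samples matrix of raw counts. It is not the TMM method that calcNormFactors() uses by default in edgeR, and the function names are illustrative only.

```python
import numpy as np


def median_of_ratios_size_factors(counts):
    """Per-sample size factors for a genes x samples matrix of raw counts.

    Illustrative sketch of the median-of-ratios idea, not a port of any
    R function.
    """
    counts = np.asarray(counts, dtype=float)
    # Keep only genes with a nonzero count in every sample, so that the
    # log geometric mean is finite.
    expressed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[expressed])
    # The row-wise log geometric mean acts as a pseudo-reference sample.
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    # Size factor of a sample: median ratio of its counts to the reference.
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))


def normalize_counts(counts):
    """Divide every sample (column) by its size factor."""
    counts = np.asarray(counts, dtype=float)
    return counts / median_of_ratios_size_factors(counts)


# Toy example: the second sample was sequenced twice as deeply as the first.
base = np.array([10.0, 20.0, 30.0, 40.0, 0.0])
raw = np.column_stack([base, 2.0 * base])
size_factors = median_of_ratios_size_factors(raw)
normalized = normalize_counts(raw)
```

In this toy case the two normalized columns become identical, because the only difference between the samples was sequencing depth. Dispersion estimation and the significance test are not covered by this sketch.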
Steps and functions for SC-RNAseq DE analysis in Seurat, Scanpy and regular Python.
| Step | Seurat | Scanpy | Python |
|---|---|---|---|
| Read the data from file | read.csv() | scanpy.read_csv() | pandas.read_csv() |
| Convert to special data format | CreateSeuratObject() | Already converted as AnnData | Keep as pd.DataFrame |
| Filter off outliers | Regular R functions | filter_cells(), filter_genes() | Use general pandas functions for subsetting by threshold values |
| Normalize and log-transform | NormalizeData() | normalize_total() | normalize from Sklearn or self-made script |
| Remove invariant genes | FindVariableFeatures() | highly_variable_genes() | Use pandas DataFrame filter by |
| Scale gene expressions to 0-1 interval | ScaleData() | scale() | Normalize() in Sklearn |
| Run PCA, estimate significant components | RunPCA(), JackStraw() | pca() | Sklearn PCA() |
| Find or use predefined clusters | FindNeighbors(), FindClusters() | Import leiden, other options possible | Different options in Sklearn.cluster |
| Run tSNE, visualize clusters | RunTSNE(), TSNEplot() | Prefers UMAP (as imported package) | tSNE and other options in sklearn.manifold |
| Perform differential expression check | FindMarkers(), FindAllMarkers() | Built-in options for Wilcoxon, t-test, logistic regression | t-test, one-way ANOVA, Wilcoxon, Kruskal-Wallis etc. in scipy.stats; RandomForest, AdaBoost in sklearn |
read.csv() is used in Seurat for reading regular tables; Read10X() is for reading the matrix data format.
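The last row of the "Python" column can be illustrated with a short sketch that compares two clusters gene by gene. It assumes a cells x genes pandas DataFrame of already normalized, log-transformed values and one cluster label per cell. The Wilcoxon rank-sum test is taken from scipy.stats as `mannwhitneyu`. The helper names `benjamini_hochberg` and `rank_genes` are illustrative, and the multiple-testing correction is an addition that the table itself does not mention.

```python
import numpy as np
import pandas as pd
from scipy import stats


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (illustrative helper)."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.minimum(adjusted, 1.0)
    return out


def rank_genes(expr, labels, group, reference):
    """Wilcoxon rank-sum test per gene between two clusters of cells.

    expr: cells x genes DataFrame of log-normalized expression (assumed).
    labels: one cluster label per cell, in the same order as the rows.
    """
    labels = np.asarray(labels)
    a = expr.loc[labels == group]
    b = expr.loc[labels == reference]
    rows = []
    for gene in expr.columns:
        _, p = stats.mannwhitneyu(a[gene], b[gene], alternative="two-sided")
        # Difference of means on the log scale approximates a log fold change.
        rows.append((gene, a[gene].mean() - b[gene].mean(), p))
    table = pd.DataFrame(rows, columns=["gene", "mean_diff", "pvalue"])
    table["padj"] = benjamini_hochberg(table["pvalue"])
    return table.sort_values("pvalue").reset_index(drop=True)


# Simulated example: 40 cells, 5 genes, gene "g0" is higher in cluster "A".
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.normal(size=(40, 5)),
                    columns=[f"g{i}" for i in range(5)])
labels = np.array(["A"] * 20 + ["B"] * 20)
expr.loc[labels == "A", "g0"] += 5.0
markers = rank_genes(expr, labels, group="A", reference="B")
```

The resulting table plays a role similar to the marker tables produced by FindMarkers() or by the Scanpy ranking functions, but it is a simplified stand-in and will not reproduce their output.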