Anna Leśniewska1, Joanna Zyprych-Walczak2, Alicja Szabelska-Beręsewicz2, Michal J Okoniewski3.
Abstract
BACKGROUND: The experience with running various types of classification on the CAMDA neuroblastoma dataset has led us to the conclusion that the results are not always obvious and may differ depending on the type of analysis and the selection of genes used for classification. This paper aims to point out several factors that may influence the downstream machine learning analysis. In particular, those factors are: the type of primary analysis, the type of classifier, and the increased correlation between genes sharing a protein domain. They influence the analysis directly, but the interplay between them may also be important. We have compiled a gene-domain database and used it to examine the differences between the genes that share a domain and the rest of the genes in the datasets.
Keywords: Biomarkers; Data Analysis; Genomic signatures; Machine Learning; Protein domains; RNA sequencing; Statistics
Year: 2018 PMID: 29467011 PMCID: PMC5822623 DOI: 10.1186/s13062-018-0205-x
Source DB: PubMed Journal: Biol Direct ISSN: 1745-6150 Impact factor: 4.540
Fig. 1 Graphs visualized in Gephi, depicting genes interconnected with domains. Left: the global picture; right: a single disconnected sub-graph. The graphs show that the interconnection of domains in the genes is neither regular nor trivial
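The kind of structure shown in Fig. 1 can be sketched as a bipartite graph in which gene nodes are linked to the domains they contain, with disconnected sub-graphs found by a breadth-first search. This is a minimal illustration only: the gene names, the domain identifiers and the function names are hypothetical, and it is not the authors' Gephi workflow or their gene-domain database.

```python
from collections import defaultdict, deque

# Hypothetical gene -> PFAM domain annotations (illustrative values only).
GENE_DOMAINS = {
    "geneA": {"PF00001", "PF00002"},
    "geneB": {"PF00002"},
    "geneC": {"PF00003"},
    "geneD": {"PF00003", "PF00004"},
    "geneE": set(),  # gene without any annotated domain
}


def build_graph(gene_domains):
    """Build an undirected bipartite graph linking genes to their domains."""
    adjacency = defaultdict(set)
    for gene, domains in gene_domains.items():
        adjacency[("gene", gene)]  # register the gene even if it has no domain
        for domain in domains:
            adjacency[("gene", gene)].add(("domain", domain))
            adjacency[("domain", domain)].add(("gene", gene))
    return adjacency


def connected_components(adjacency):
    """Return the disconnected sub-graphs as a list of node sets."""
    seen = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def genes_in(component):
    """Sorted gene names contained in one component."""
    return sorted(name for kind, name in component if kind == "gene")


COMPONENTS = connected_components(build_graph(GENE_DOMAINS))
for component in COMPONENTS:
    print(genes_in(component))
```

Genes that end up in the same component are connected through a chain of shared domains, which is one way to read the sub-graph on the right of the figure.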
Fig. 2 Spearman’s correlation distribution and violin plots of the percentage of misclassified samples for genes with and without domains in the CAMDA neuroblastoma dataset. On the left, red marks the histogram-based distribution of Spearman’s correlation coefficient for a random selection of gene pairs without domains; green marks Spearman’s correlation coefficient for genes that share a PFAM domain (database built with AceView genes). The shaded bands around the lines are ranges from 100 simulations of the distribution. On the right is a violin plot of the percentage of misclassified samples for 4 classifiers based on DEG with and without domains. The total number of samples in the dataset was 302
Fig. 3 Division of genes based on the number of reads aligned to them. Barplots show the number of genes in each read-count range for three datasets from the NCBI GEO public database, each aligned with three different mappers (Hisat2, Star, Subread). Colors in the barplots indicate the ranges of the number of reads aligned to the genes
Number of differentially expressed genes (DEG) with and without domains for considered datasets and mappers
| Mapper | No. of DEG | GSE22260 | GSE50760 | GSE87340 |
|---|---|---|---|---|
| Hisat | Total | 359 | 7182 | 11048 |
| | With / no domains | 245 / 114 | 5141 / 2041 | 7839 / 3209 |
| Star | Total | 430 | 7264 | 11619 |
| | With / no domains | 271 / 159 | 5165 / 2055 | 7985 / 3634 |
| Subread | Total | 579 | 7918 | 11402 |
| | With / no domains | 369 / 210 | 5350 / 2568 | 8029 / 3373 |
For each dataset and mapper, the total number of DEG and the numbers of DEG with and without domains were calculated. In every case there were more DEG with domains than without
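The comparison stated in the table note can be checked with a few lines of arithmetic. The sketch below copies the "with / no domains" counts from the table and computes the share of DEG that have a domain; the variable and function names are illustrative.

```python
# (with domains, without domains) counts copied from the table above.
DEG_COUNTS = {
    ("Hisat", "GSE22260"): (245, 114),
    ("Hisat", "GSE50760"): (5141, 2041),
    ("Hisat", "GSE87340"): (7839, 3209),
    ("Star", "GSE22260"): (271, 159),
    ("Star", "GSE50760"): (5165, 2055),
    ("Star", "GSE87340"): (7985, 3634),
    ("Subread", "GSE22260"): (369, 210),
    ("Subread", "GSE50760"): (5350, 2568),
    ("Subread", "GSE87340"): (8029, 3373),
}


def fraction_with_domains(with_domains, without_domains):
    """Share of DEG that carry at least one domain."""
    return with_domains / (with_domains + without_domains)


FRACTIONS = {
    key: fraction_with_domains(*counts) for key, counts in DEG_COUNTS.items()
}

for (mapper, dataset), fraction in sorted(FRACTIONS.items()):
    print(f"{mapper:8s} {dataset}: {fraction:.1%} of DEG have a domain")
```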
Fig. 4 Spearman’s correlation distribution for pairs of genes with and without domains. Red marks the histogram-based distribution of the correlation for a random selection of 25000 gene pairs without domains. Green marks Spearman’s correlation coefficient for 25000 genes that share a PFAM domain. The lines in the middle are the mean distributions of correlation based on 100 simulations of the choice of genes. The shaded bands around the lines show the minimum and maximum values over the 100 simulations. For genes with domains, the correlation distribution is shifted to the right
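The simulation behind a figure of this kind can be sketched as follows: draw a random sample of gene pairs, compute Spearman's correlation for each pair, and repeat the draw many times to get a range of distributions. This is a minimal standard-library sketch, not the authors' code; the function names, the toy expression values and the seed are assumptions made for illustration.

```python
import random


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


def simulate_distribution(expression, pairs, n_pairs, n_simulations, seed=0):
    """Repeat the random choice of gene pairs and collect rho for each draw."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_simulations):
        chosen = rng.sample(pairs, min(n_pairs, len(pairs)))
        runs.append([spearman(expression[a], expression[b]) for a, b in chosen])
    return runs


# Toy expression profiles (hypothetical values, one list per gene).
EXPRESSION = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 5, 8, 9],
    "g3": [5, 3, 4, 1, 2],
}
PAIRS = [("g1", "g2"), ("g1", "g3"), ("g2", "g3")]
RUNS = simulate_distribution(EXPRESSION, PAIRS, n_pairs=2, n_simulations=100)
```

Running the same function once on pairs that share a domain and once on pairs that do not, then histogramming each run, would give the per-bin mean, minimum and maximum that the caption describes.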
Fig. 5 Violin plot of misclassified samples for 4 classifiers based on DEGs with and without domains. From the genes differentially expressed at the significance level α=0.05 we chose two subsets: the first consisted of the genes that share one particular domain (the domain with the largest number of genes connected to it), and the second of the genes that share no domain. Validation was performed with 5-fold cross-validation. Percentages of misclassified samples are mostly lower when genes with no domains are taken into account
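The validation scheme in the caption, 5-fold cross-validation scored by the percentage of misclassified samples, can be sketched as below. The four classifiers are not named in this excerpt, so a simple nearest-centroid classifier is used as a stand-in; that choice, the function names and the toy data are all assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k, rng):
    """Shuffle sample indices and split them into k disjoint folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_centroids(rows, labels):
    """Mean expression vector per class (stand-in classifier, an assumption)."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], row)]
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def misclassified_percentage(rows, labels, k=5, seed=0):
    """Percentage of samples misclassified across k held-out folds."""
    rng = random.Random(seed)
    errors = 0
    for fold in k_fold_indices(len(rows), k, rng):
        held_out = set(fold)
        train = [i for i in range(len(rows)) if i not in held_out]
        centroids = fit_centroids(
            [rows[i] for i in train], [labels[i] for i in train]
        )
        errors += sum(predict(centroids, rows[i]) != labels[i] for i in fold)
    return 100.0 * errors / len(rows)


# Toy data: two well-separated classes, two hypothetical genes per sample.
ROWS = [[i * 0.1, i * 0.1] for i in range(10)]
ROWS += [[10 + i * 0.1, 10 + i * 0.1] for i in range(10)]
LABELS = [0] * 10 + [1] * 10
ERROR_RATE = misclassified_percentage(ROWS, LABELS, k=5, seed=0)
```

Calling `misclassified_percentage` once with the columns for genes sharing one domain and once with the columns for genes sharing no domain would give the two values compared per classifier in the figure.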