Chengfeng Xu1, Ruochi Zhang2, Meiyu Duan2, Yongming Zhou1, Jizhang Bao1, Hao Lu1, Jie Wang1, Minghui Hu1, Zhaoyang Hu3, Fengfeng Zhou2, Wenwei Zhu1. 1. Department of Hematology, Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine, 110 Ganhe Road, Hongkou District, Shanghai 200437, China. 2. College of Computer Science and Technology, Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin 130012, China. 3. Fun-Med Pharmaceutical Technology (Shanghai) Co., Ltd., RM. A310, 115 Xinjunhuan Road, Minhang District, Shanghai 201100, China.
Abstract
Immune thrombocytopenia (ITP) is an autoimmune disease with the typical symptom of a low platelet count in blood. ITP demonstrated age and sex biases in both occurrences and prognosis, and adult ITP was mainly induced by the living environments. The current diagnosis guideline lacks the integration of molecular heterogenicity. This study recruited the largest cohort of platelet transcriptome samples. A comprehensive procedure of feature selection, feature engineering, and stacking classification was carried out to detect the ITP biomarkers using RNA sequencing (RNA-seq) transcriptomes. The 40 detected biomarkers were loaded to train the final ITP detection model, with an overall accuracy 0.974. The biomarkers suggested that ITP onset may be associated with various transcribed components, including protein-coding genes, long intergenic non-coding RNA (lincRNA) genes, and pseudogenes with apparent transcriptions. The delivered ITP detection model may also be utilized as a complementary ITP diagnosis tool. The code and the example dataset is freely available on http://www.healthinformaticslab.org/supp/resources.php.
Immune thrombocytopenia (ITP) is an autoimmune disease with the typical symptom of a low platelet count in blood. ITP demonstrated age and sex biases in both occurrences and prognosis, and adult ITP was mainly induced by the living environments. The current diagnosis guideline lacks the integration of molecular heterogenicity. This study recruited the largest cohort of platelet transcriptome samples. A comprehensive procedure of feature selection, feature engineering, and stacking classification was carried out to detect the ITP biomarkers using RNA sequencing (RNA-seq) transcriptomes. The 40 detected biomarkers were loaded to train the final ITP detection model, with an overall accuracy 0.974. The biomarkers suggested that ITP onset may be associated with various transcribed components, including protein-coding genes, long intergenic non-coding RNA (lincRNA) genes, and pseudogenes with apparent transcriptions. The delivered ITP detection model may also be utilized as a complementary ITP diagnosis tool. The code and the example dataset is freely available on http://www.healthinformaticslab.org/supp/resources.php.
Immune thrombocytopenia (ITP), previously known as immune thrombocytopenic purpura, is an acquired immune-mediated disease characterized by a blood platelet count less than 100 × 109 per liter. ITP may develop in both children and adults, and female young adults are more prevalent among ITP patients., Pediatric ITP may be fundamentally different from adult ITP since the rate of chronic ITP in adults is much higher than that in children. Symptoms like platelet aggregations in ITP patients may be partly treated by anti-platelet glycoprotein VI phage antibodies,, while the phages originated as bacterial virulence factors.ITP has three clinical phases. The first 3 months after the diagnosis is the newly diagnosed phase. The second phase refers to persistent ITP lasting between 3 and 12 months after diagnosis. The last phase is the chronic ITP phase, in which the patient carries the symptoms beyond 12 months. The first phase is sometimes called acute ITP, and patients may develop severe bleeding symptoms that require immediate interventions. Most adult ITP patients will progress into the chronic phase. The heterogeneous causes of thrombocytopenia make ITP diagnosis a major challenge in haematology.Different molecular biomarkers are observed to be associated with ITP diagnosis and prognosis. Most of the transcriptomic biomarkers are investigated in T cells. The interleukin (IL) genes IL-10 and IL-17 were differentially expressed in CD4+ T cells in the corticosteroid refractory ITP. The C-X-C motif chemokine ligand 13 (CXCL13) has elevated expression levels in the plasma of ITP children, and its transcription regulation may be repressed by the CD4+ T cells with miR-125-5p inhibitors. Serum proteins may also serve as good indicators of ITP treatment responses. The protein level of haptoglobin (Hp) in serums is measured by matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometer (MS) technology and is observed to be positively correlated with the platelet count after the invasive treatment splenectomy.This study presents the largest cohort of platelet ITP samples and generates RNA sequencing (RNA-seq) transcriptomes for the detection of ITP biomarkers. A procedure of feature selection, feature engineering, and classification is comprehensively evaluated. The best ITP detection model using only 40 transcriptome features is delivered, and it achieves an overall accuracy 0.974. Some interesting biological inferences are also discussed.
Results
Sample summarization
This study recruited a cohort of 59 ITP patients and 55 controls, as shown in Table 1. There were 46 female and 13 male ITP patients. The control group consists of 31 female and 24 male samples. The control samples were recruited to match the age distribution of the ITP patients, with the t test p value 0.4752 (>0.05).
Table 1
Summary of the recruited cohort
ITP
Control
Female
Male
Female
Male
Samples
46
13
31
24
Averaged age
55.69565
62.84615
48.19355
47.375
There are two groups of participants: the ITP patients and the controls. Each group consists of female and male samples. The numbers of samples are given in the row “Samples,” and the averaged ages of these sample groups are in the row “Averaged age.”
Summary of the recruited cohortThere are two groups of participants: the ITP patients and the controls. Each group consists of female and male samples. The numbers of samples are given in the row “Samples,” and the averaged ages of these sample groups are in the row “Averaged age.”The experimental protocol of this study was approved by the ethics committee of the Yueyang Hospital. All participants in this study signed the informed-consent forms. This study was approved by the Ethical Committee of the Yueyang Hospital in accordance with the 1964 Helsinki Declaration. Written informed consent was obtained from all participants.
Comparison of different classification algorithms
Firstly, we fixed the feature-selection module with the following parameter-value choices for the RNA-seq transcriptome data. The transcriptome features were screened by the L1-regularization algorithm least absolute shrinkage and selection operator (Lasso), and only the features with non-zero weights were kept for further analysis. Then, the pairwise evaluation of the inter-feature Pearson correlation coefficient (PCC) was carried out. For a pair of features, F(i) and F(j), F(j) was removed if PCC(F(i), F(j)) > threshold and weight(i) > weight(j), where weight(i) and weight(j) were the Lasso weights of these two features.We used a stratified method to split the samples into a training set (90%) and an independent test set (10%). Stratified k-fold cross validation is a variation of the k-fold cross validation that returns the stratified folds by preserving the same percentage of samples for each class in each fold. On the training set, we utilize the stratified 5-fold cross validation (S5FCV) strategy to train and tune the parameters and, finally, tested our model on the test set. As shown in Figure 1, all five classifiers achieved at least 0.884 in comparison of accuracy (ACC), suggesting that the biomarker-detection module in this study works effectively. The classifiers Logistic Regression (LR) and Support-Vector Machine (SVM) achieved the best receiver operating characteristic (ROC) (0.978) for the ITP diagnosis model, while another classifier, Random Forest (RF), achieved the best ACC (0.935). The classifier RF also performed the best in Sn (0.915) and Sp (0.962). LR only achieved an Sn of 0.881, which may not be a good choice with more mis-classified ITP patients (the positive samples). So, the following sections used RF as the default classifier.
Figure 1
Evaluation of different classifiers on the ITP diagnosis problem
The five classifiers are evaluated, including LR, RF, SVM, KNN, and AdaBoost. The horizontal axis lists the classification performance metrics, i.e., Sn, Sp, MCC, ROC, and Acc. The vertical axis is the value of the performance metrics. Feature engineering and step 5 optimizes the diagnosis model. All boxed text is functional module names utilized in the pipeline.
Evaluation of different classifiers on the ITP diagnosis problemThe five classifiers are evaluated, including LR, RF, SVM, KNN, and AdaBoost. The horizontal axis lists the classification performance metrics, i.e., Sn, Sp, MCC, ROC, and Acc. The vertical axis is the value of the performance metrics. Feature engineering and step 5 optimizes the diagnosis model. All boxed text is functional module names utilized in the pipeline.
Comparison of different feature-selection algorithms
Different feature-selection algorithms were evaluated for the S5FCV performances of the default classifier RF, as shown in Figure 2. The framework SelectFromModel of the Python package sklearn with default parameters was carried out to evaluate the six feature-selection algorithms in Figure 2, i.e., Ttest, Ftest, Ridge, Lasso, Variance Threshold (VThresh), and Laplacian score (LS). Lasso and Ridge achieved the best ACC (0.935) for the ITP diagnosis model, but Ridge had the worst performance on Sp (0.925). Ttest and Ridge performed similarly well in ACC and ROC. Ridge outperformed Ttest in Sn, and Ttest outperformed Ridge in Sp. Ttest performed the worst in Sn among the four supervised feature-selection algorithms, and the data suggested that this popular statistical-evaluation algorithm didn’t work well on selecting a subset of features with the best classification performance. The performance of the two unsupervised feature-selection algorithms was lower than that of all the supervised learning feature-selection algorithms. Figure 3A gives the Ttest ranks of the 50 Lasso-selected features. Only 11 features are ranked as top 50 by Ttest, the last feature is ranked 23,696 out of the total 33,493 features, and 34 out of 50 (68.0%) Lasso-selected features are ranked more than 100. Figure 3B demonstrates that the performance of the prediction model keeps being improved by adding the 50 Lasso-selected features one by one in their Ttest ranks. This observation and Figure 2 suggest that Lasso may select a feature subset with better prediction performances than Ttest, Ftest, and Ridge.
Figure 2
Performances of different feature-selection algorithms on detecting the ITP biomarkers
The four evaluated feature-selection algorithms are Ttest, Ftest, Ridge, Lasso, VThresh, and LS. The horizontal axis lists the performance metrics Sn, Sp, MCC, ROC, and Acc. The vertical axis is the value of these performance metrics.
Figure 3
Ttest evaluation of the 50 Lasso-selected features
(A) The Ttest ranks of the 50 features selected by Lasso. The vertical axis is the Ttest rank of each of the 50 Lasso-selected features.
(B) The S5FCV prediction performances of the Incremental Feature Selection (IFS) strategy of the 50 Lasso-selected features. The vertical axis gives the value of the prediction-performance metrics. The horizontal axis lists the 50 Lasso-selected features by their Ttest ranks in ascending order for both subfigures.
Performances of different feature-selection algorithms on detecting the ITP biomarkersThe four evaluated feature-selection algorithms are Ttest, Ftest, Ridge, Lasso, VThresh, and LS. The horizontal axis lists the performance metrics Sn, Sp, MCC, ROC, and Acc. The vertical axis is the value of these performance metrics.Ttest evaluation of the 50 Lasso-selected features(A) The Ttest ranks of the 50 features selected by Lasso. The vertical axis is the Ttest rank of each of the 50 Lasso-selected features.(B) The S5FCV prediction performances of the Incremental Feature Selection (IFS) strategy of the 50 Lasso-selected features. The vertical axis gives the value of the prediction-performance metrics. The horizontal axis lists the 50 Lasso-selected features by their Ttest ranks in ascending order for both subfigures.
Evaluation of the contributions of LDA and SVD
The 50 Lasso-selected transcriptome features (denoted as Lasso50) were further refined by two feature-engineering algorithms, linear discriminant analysis (LDA) and singular value decomposition (SVD), as shown in Figure 4. SVD calculated the first two components as the engineered features, while for the LDA, the number of components needs to be less than the number of the class minus one; therefore, we get one component for LDA. LDA + SVD denotes the union of the one LDA component and the two SVD components. The Lasso50-based prediction model achieved an ACC of 0.929. LDA achieved the same score to the Lasso50 in ACC, and LDA + SVD improved the LDA model by 0.017 in Sn. The engineered component features of LDA + SVD performed better (0.008 in Sn) than the Lasso50-based model, but both LDA (0.002) and LDA + SVD (0.019) models outperformed the Lasso50-based model in Matthews correlation coefficient (MCC). We looked into the detailed percentage of variance explained by each SVD component, as shown in Figure 4B. The data suggested that the first SVD component alone explained 38% of the total sample variances, and the first two components explained almost half of the sample variances, so it is reasonable to observe that only the first two components of LDA and SVD may simultaneously improve model performance and reduce feature dimensions.
Figure 4
Prediction performances of the feature-engineering algorithms LDA and SVD
(A) The horizontal axis lists the prediction performance metrics Sn, Sp, MCC, ROC, and ACC. The vertical axis is the value of these performance metrics.
(B) The percentage of variance explained by each of the first 20 SVD components.
Prediction performances of the feature-engineering algorithms LDA and SVD(A) The horizontal axis lists the prediction performance metrics Sn, Sp, MCC, ROC, and ACC. The vertical axis is the value of these performance metrics.(B) The percentage of variance explained by each of the first 20 SVD components.Dot plots were generated to demonstrate the discriminative powers of the SVD components and the raw transcriptome features, as shown in Figure 5. Figure 5A gives an intuitive illustration that the first two SVD components separate well the two groups of samples, while the top-two Ttest-ranked features generate some mis-classifications, as shown in Figure 5B. It is interesting to see that both of the top-two Ttest-ranked features are from mitochondrion. The first ranked feature ENSG00000210082 encodes a ribosomal RNA gene MT-RNR2, and the second ranked feature ENSG00000198888 encodes the protein-coding gene MT-ND1. This observation supports the previous observation that ITP patients demonstrated various platelet mitochondrial abnormalities,, but the Ttest-ranked features may be further improved by more sophisticated feature selections and engineering algorithms for their ITP prediction performances.
Figure 5
Dot-plot visualization of SVD and Ttest
(A) Dot plot of the first two components calculated by SVD. The horizontal and vertical axes are the first and second components calculated by SVD.
(B) Dot plot of the top two features ranked by Ttest. The horizontal and vertical axes are the top-ranked first and second features by Ttest. Dots represent the ITP samples and controls in red and blue, respectively.
Dot-plot visualization of SVD and Ttest(A) Dot plot of the first two components calculated by SVD. The horizontal and vertical axes are the first and second components calculated by SVD.(B) Dot plot of the top two features ranked by Ttest. The horizontal and vertical axes are the top-ranked first and second features by Ttest. Dots represent the ITP samples and controls in red and blue, respectively.
Optimization the parameters of the prediction model
The pipeline in this study was further refined by evaluating different value choices for the following three parameters, as shown in Figures 6 and S1. All prediction-performance metrics were calculated using the S5FCV strategy. The parameter nFeatures is the number of features with the largest weights assigned and selected by Lasso. There are five values, {30, 40, 50, 60, 70}, for this parameter. The second parameter, nComponents, is the number of the first few components calculated by both LDA and SVD, and its value choices are {1, 2, 3, 4, 5}. The third parameter, nEstimators, is the number of decision trees used in the classifier RF. The five values {100, 200, 300, 400, 500} are evaluated for nEstimators. A grid-search strategy was carried out for all value combinations of these three parameters. Due to the limited space in this work, the detailed data are illustrated in Figures 6 and S1.
Figure 6
Optimizing the parameters of nComponents and nEstimators when nFeatures = 40
The matrices of the different parameter choices are colored in the grayscale background. A darker background suggests a better value of that metric.
Optimizing the parameters of nComponents and nEstimators when nFeatures = 40The matrices of the different parameter choices are colored in the grayscale background. A darker background suggests a better value of that metric.Figure 6 illustrates the value choice nFeatures = 40 when the best accuracy, ACC = 0.956, was achieved. This best model chose nFeatures = 40, nComponents = 2, and nEstimators = 100 and achieved Sn = 0.932, Sp = 0.982, and MCC = 0.914.
The five classifiers were stacked as a better classifier
A stacking prediction strategy was utilized to build a better ITP prediction model, and its prediction performance was compared with the previous best model in Figure 7. The prediction results of all the five classifiers LR/SVM/K-nearest neighbor (KNN)/adaptive boost (AdaBoost) were loaded as the input to an additional classifier to generate the final prediction result. The additional classifier was also selected in LR/SVM/KNN/AdaBoost/RF. The samples were randomly split into a training dataset (80%) and a test dataset (20%). Each classifier was exerted on the training dataset for 20 random runs, and its prediction result was averaged over the 20 runs. Overall, the final stacked RF model achieved ACC = 0.974. Figure 7 demonstrated that the stacking model (StackingModel) outperformed the best model in the above sections (PrevBestModel) on four out of five metrics (0.051 in Sn, 0.054 in MCC, 0.06 in ROC, and 0.039 in ACC).
Figure 7
Performance of the stacking model
The horizontal axis lists the prediction-performance metrics. The vertical axis is the value of these performance metrics. The two series of histograms are for the stacking model (StackingModel) and the best prediction model in the above sections (PrevBestModel).
Performance of the stacking modelThe horizontal axis lists the prediction-performance metrics. The vertical axis is the value of these performance metrics. The two series of histograms are for the stacking model (StackingModel) and the best prediction model in the above sections (PrevBestModel).In this study, in order to better verify the ability of the stacking model, except for the above-mentioned five machine-learning models, we used five additional machine-learning models, including Linear Regression (LiR), Extremely Randomized Trees (ET), Gaussian Process Classifier (GP), Gradient Boosting Classifier (GB), and Naive Bayes (NB). We found that the prediction model only achieved 0.962 in ACC, which is lower than the previous best ACC = 0.974. The experimental results suggested that more meta-learners may not deliver better results.
Biological inferences from the detected biomarkers
The final model used 40 RNA-seq transcriptome features of which 19 features are annotated as protein-coding genes and 3 are as long intergenic non-coding RNAs (lincRNAs), as shown in Table 2. It is interesting to observe that 15 biomarkers are annotated as pseudogenes, but their expressions contribute to the ITP classification.
Table 2
Gene annotations of the 40 detected biomarkers
ENSG_code
Chr
Gene
Gene type
ENSG00000259834
1
AL365361.1
lincRNA
ENSG00000213026
1
CFL1P4
processed_pseudogene
ENSG00000162873
1
KLHDC8A
protein_coding
ENSG00000229344
1
MTCO2P12
unprocessed_pseudogene
ENSG00000213110
2
AC019178.1
processed_pseudogene
ENSG00000226247
2
SUPT4H1P1
processed_pseudogene
ENSG00000229758
2
DYNLT3P2
processed_pseudogene
ENSG00000118997
2
DNAH7
protein_coding
ENSG00000269028
3
MTRNR2L12
protein_coding
ENSG00000248360
4
LINC00504
lincRNA
ENSG00000223908
5
AC068657.1
processed_pseudogene
ENSG00000271043
5
MTRNR2L2
processed_pseudogene
ENSG00000164576
5
SAP30L
protein_coding
ENSG00000196821
6
C6orf106
protein_coding
ENSG00000146587
7
RBAK
protein_coding
ENSG00000253276
7
CCDC71L
protein_coding
ENSG00000226824
7
AC006001.2
sense_intronic
ENSG00000229897
9
SEPT7P7
processed_pseudogene
ENSG00000175787
9
ZNF169
protein_coding
ENSG00000213260
10
YWHAZP5
processed_pseudogene
ENSG00000254616
11
AP001775.1
processed_pseudogene
ENSG00000188997
11
KCTD21
protein_coding
ENSG00000214753
11
HNRNPUL2
protein_coding
ENSG00000258359
12
PCNPP1
processed_pseudogene
ENSG00000250091
12
DNAH10OS
protein_coding
ENSG00000205240
13
OR7E36P
unprocessed_pseudogene
ENSG00000140254
15
DUOXA1
protein_coding
ENSG00000263177
16
MTND1P8
processed_pseudogene
ENSG00000235554
17
AC005822.1
processed_pseudogene
ENSG00000129673
17
AANAT
protein_coding
ENSG00000185262
17
UBALD2
protein_coding
ENSG00000267541
18
MTCO2P2
processed_pseudogene
ENSG00000125735
19
TNFSF14
protein_coding
ENSG00000260032
20
NORAD
lincRNA
ENSG00000210082
MT
MT-RNR2
Mt_rRNA
ENSG00000210049
MT
MT-TF
Mt_tRNA
ENSG00000198712
MT
MT-CO2
protein_coding
ENSG00000198786
MT
MT-ND5
protein_coding
ENSG00000228253
MT
MT-ATP8
protein_coding
ENSG00000069535
X
MAOB
protein_coding
The column “ENSG_code” lists the ensemble IDs of features generated in the RNA-seq transcriptome processing. The columns “Chr” and “Gene” give the chromosome this gene locates and the gene symbol. The column “Gene type” is which category this gene belongs to.
Gene annotations of the 40 detected biomarkersThe column “ENSG_code” lists the ensemble IDs of features generated in the RNA-seq transcriptome processing. The columns “Chr” and “Gene” give the chromosome this gene locates and the gene symbol. The column “Gene type” is which category this gene belongs to.Firstly, we evaluated the enriched Gene Ontology (GO) categories of the 40 detected biomarkers using the online system DAVID. The statistical significance p value was corrected by the Bonferroni method. The biological processes GO: 0010729 (positive regulation of hydrogen peroxide biosynthetic process), 0042773 (ATP synthesis coupled electron transport), and 1900118 (negative regulation of execution phase of apoptosis) were enriched in the 40 biomarkers with the Bonferroni-corrected p values 8.63 × 10-4, 1.21 × 10-3, and 1.08 × 10-2, respectively. ITP was known to be associated with the platelet apoptosis but remained to be confirmed for its connections with hydrogen-peroxide synthesis and ATP synthesis. It is also interesting to observe that the molecular function GO: 0048019 (receptor antagonist activity) was also enriched in the 40 biomarkers, as confirmed by the elevated plasma levels of IL-1 receptor antagonist (Ra) in acute pediatric and adult ITP patients.T cell dysfunction may stimulate auto-antibody productions in ITP., The biomarker gene AANAT (aralkylamine N-acetyltransferase; transcriptome feature Ensembl: ENSG00000129673) is located on chromosome 17 and showed rhythmic expressions on the expression levels and phosphorylation levels in T cells of the bone marrow and spleen. Another biomarker gene, TNFSF14 (tumor necrosis factor superfamily member 14; transcriptome feature Ensembl: ENSG00000125735), facilitates the increased production of CD8 central memory t cells in vivo. Manresa et al. observed that the increased production of TNFSF14 by T cells induced the transcription of inflammatory genes in the esophageal fibroblasts in eosinophilic esophagitis.One of the symptoms of ITP is low platelet production in the blood, and ITP is mainly diagnosed by a low platelet count in a blood test., The detected biomarker gene MAOB (monoamine oxidase; transcriptome feature Ensembl: ENSG00000069535) was not observed to be associated with ITP, but its low activities in platelets are associated with various impulsive behaviors and alcoholism. A case study demonstrated the connection between the gene monoamine oxidase inhibitor (MAOI) and the resolving of non-ITP. So, the aberrant expression of MAOB in platelets of ITP patients may be worth further experimental validation.The altered methylation of the biomarker mitochondrial gene MT-CO2 (mitochondrially encoded cytochrome-C-oxidase II; transcriptome feature Ensembl: ENSG00000198712) in platelets is associated with the prognosis of various complex diseases, e.g., cardiovascular disease , and Parkinson’s disease. This study demonstrated that the expression level of MT-CO2 is significantly associated with ITP by p = 6.78 × 10-26. MT-CO2 may serve as a candidate diagnosis biomarker of ITP on the transcriptome level.
Enriched functions in the close neighbors of the biomarkers
We further investigated the direct neighbors to the 40 detected biomarkers based on the STRING database. There were 5,296 genes interacting with the 40 detected biomarkers. We assumed that these direct neighbors were closely associated with these ITP biomarkers and the functions enriched in these neighbors together with the biomarkers. DAVID does not allow for the analysis of more than 3,000 genes, so this section used the tool ShinyGO to detect the enriched functions within these 5,296 + 40 = 5,336 genes. There were 79 molecular functions enriched (Bonferroni-corrected p value < 0.05) in the gene list, as shown in the Table S1. It is interesting to obsere that the molecular function GO: 0009055 (electron transfer activity) was enriched with the Bonferroni-corrected p value = 1.18 × 10-26. The top-ranked GO terms included binding capabilities to small molecule (GO: 0036094, corrected p = 2.91 × 10-33), RNA (GO: 0003723, corrected p = 1.47 × 10-25) and nucleoside phosphage (GO: 1901265, corrected p = 7.88 × 10-24), etc.
Discussion
This study screened ITP biomarkers using RNA-seq platelet transcriptomes. A series of comprehensive machine-learning algorithms were utilized to find the best subset of biomarkers. The final ITP prediction model was based on 40 transcriptome features and two clinical variables (age and sex) and achieved an overall accuracy of 0.974. It is of interest to observe that both protein-coding genes and lincRNA genes contribute to the ITP prediction model. Some of the biomarker genes are closely associated with the two major symptoms of ITPs, i.e., T cell dysfunctions and aberrant platelet activities.The gene-expression patterns of all 40 biomarker genes across different human tissues were obtained from the GTEx database, as shown in Figure S2. Many biomarkers showed highly tissue-specific expression patterns, e.g., the genes DNAH7 and AANAT were only highly expressed in the testis, while the gene KLHDC8A was highly expressed only in the ovary. Multiple genes showed high expressions in different brain regions, including DNAH10OS, NORAD, MT-ATP8, HNRNPUL2, MT-RNR2, and MT-CO2, but most of the 40 biomarkers were expressed at relatively low levels in the whole blood. Combined with their ITP-specific expression patterns, the data suggested that the abnormal expressions of these tissue-specific expressed genes might have contributed to ITP’s onset and progression, and it is important to investigate ITP’s molecular mechanisms using the platelet cells.As far as we know, this study presents the largest cohort of ITP RNA-seq platelet transcriptomes. This dataset and these experimental data may facilitate a better understanding of ITP onset and developmental mechanisms.
Materials and methods
Ethics approval and consent to participate
This study was approved by the Ethical Committee of the Yueyang Hospital in accordance with the 1964 Helsinki Declaration. Written informed consent was obtained from all participants.
Clinical sample collection
This cohort of ITP patients and healthy controls was recruited between August 10, 2017, and February 21, 2019, at the Shanghai Yueyang Integrated Traditional Chinese Medicine and Western Medicine Hospital (abbreviated as Yueyang Hospital), affiliated with the Shanghai University of Traditional Chinese Medicine. This study was approved by the Ethical Committee of the Yueyang Hospital in accordance with the 1964 Helsinki Declaration. Written informed consent was obtained from all participants.The disease ITP was diagnosed based on the criteria of the American Society of Hematology. Thrombocytopenia was defined as a platelet count <100 × 109 platelets per liter. All ITP patients enrolled in this study either had no treatment history or had not received glucocorticoids for at least 3 months. Patients were excluded from this cohort if they carried these complications, i.e., diabetes, hypertension, cardiovascular diseases, pregnancy, active infection, or autoimmune diseases other than ITP.
Platelet isolation and RNA extraction
Peripheral whole-blood samples of all recruited participants were drawn and kept in tubes with ethylenediamine tetraacetic acid (EDTA) at the Yueyang Hospital. To maintain platelet RNA quantity and quality, the samples were collected and processed for platelets within 24 h. The platelet-rich plasma was processed by a 20-min, 1-kilo RPM (kRPM) centrifugation at 4°C for platelet isolation. To avoid hemocyte contamination in the platelets, only 9/10th platelet-rich plasma was transferred to the 1.5-mL tubes. Then, the platelet samples were pelleted by a 20-min, 3-kRPM centrifugation. The platelets were then white precipitated by removing the supernatant. The platelets were washed with PBS and extracted with 15-min 3-kRPM centrifugation. The PBS was removed by a 10-μL pipette. The Thermo Scientific’s stabilizer RNAlater was used to resuspend the precipitated platelets. The extracts were frozen at -20°C. A hematology analyzer was used to quantitatively measure the hemocyte contamination in the samples.This study used Eppendorf tubes.
Sample purity
The sampling quality-control method was utilized to process sequencing samples. Quality control was performed on 10 separated platelet samples, and the average purity obtained by the test was 93.7%.
Library construction and mRNA sequencing
The cDNA library construction and mRNA sequencing were carried out following the procedures described in the previous study.
Sequencing data processing and differential analysis
The software FASTP was used to preprocess raw sequencing data in the format fastq through quality control and adapter/primer removals. Clean data were aligned to human Ensembl v.91.FASTP was used to purify raw data of fastq format through removal of adapters, PCR primers, and low-quality reads. Clean data were aligned to human Ensembl 91 using the software STAR v.2.4.2a. The generated alignment files in the format BAM were loaded to the software HTSeq v.0.6.1 to estimate the gene-expression levels. Differential expression analysis was performed by the R Bioconductor package DESeq2. The parameters of DESeq2 were set to base mean >100, |log2(fold change)| (|log2(FC)|) >1, and false discovery rate (FDR) <0.05.
Feature-selection algorithms
Feature-selection algorithms were utilized to find a set of biomarkers with the best discrimination power of ITP samples from the controls. The so-called Occam’s Razor principle suggests that a simpler model is preferred over a complicated one if these two models perform similarly., A feature-selection algorithm may significantly reduce the dimensionality of the transcriptome datasets for a binary disease diagnosis model.51, 52, 53, 54 It is also anticipated that the training and prediction times of a disease diagnosis model will be shortened by selecting a subset of features from the transcriptome dataset. This study evaluated the following feature-selection algorithms.The analysis of variance (ANOVA) algorithm measures the difference between the means of two groups of samples. ANOVA uses F-test to calculate the statistical significance of rejecting the null hypothesis of the mean equality. The F-test assumes that the data follow the F-distribution.T test (Ttest) is a popular statistical test when a feature has different values in two groups of samples with the null hypothesis that the two variables have the same normal distribution., The statistical p value measures how probable it is that the two variables will follow the same normal distribution. Ttest has been widely used in many different data types, including electrocardiograms (ECGs), RNA-seq, and metaproteomics.Ridge (Ridge) is a regression model that assigns balanced weights to the features, and the features are ranked in descending order by their weights, while the Lasso algorithm exerts both regularization and feature selection., Lasso assigns very sparse weights to the features, and only the features with non-zero weights are selected for the regression model. The features with non-zero weights are ranked in descending order by their weights.The above feature-selection algorithms are all based on supervised learning. To avoid the bias of supervised signals, we also evaluated two unsupervised feature-selection algorithms. VThresh is an unsupervised feature-selection algorithm that removes all low-variance features. LS utilizes the observation that samples from the same class are often similar to each other. It evaluates the power of locality preserved within the candidate feature subsets.
Binary-classification algorithms
The binary-classification problem was investigated between ITP patients and the control samples. The following binary classifiers are utilized for this task.The statistical classifier LR uses the logistic function to describe the binary-classification task., The logistic function evaluates the existence probability of ITP disease.SVM is another popular binary classifier. SVM attempts to maximize the margins of a hyperplane between the control samples and the patients. SVM may be used on various data types, e.g., one-hot encoded data, imaging data, etc.KNN is another simple and effective supervised learner., KNN evaluates the distances of the query samples against all training samples and collects the top k-closest neighbors to make the prediction. The query sample is assigned to the most frequently appearing class label of these top k-closest neighbors.RF is a tree-based ensemble-learning algorithm. This classifier RF assembles the predictions of multiple decision tree classifiers, and the final result of RF is determined by the majority voting strategy.Another ensemble-classification framework, AdaBoost, is also evaluated for its prediction performance on the investigated ITP diagnosis problem. AdaBoost may be used with many supervised-learning algorithms, and this study chose the decision tree to work with AdaBoost. AdaBoost iteratively tunes the weight of multiple weakly boosted classifiers and may perform less susceptibility to the overfitting challenge.
Feature-engineering algorithms
The final step of the experimental procedures is to calculate new features with enriched discrimination information from the original features.74, 75, 76, 77 Some datasets may be very sparse, and the individual features may not contribute significant discrimination powers to the final prediction models. This study hypothesizes that the feature-decomposition technique may generate new features with enriched discrimination powers.Firstly, this study uses truncated SVD., SVD calculates the singular value decomposition without centering the data matrix and performs very well with sparse data. Also, SVD may efficiently calculate the user-defined number of singular values.The second feature-engineering algorithm is the LDA. LDA doesn’t calculate the overall covariance matrix and may calculate the singular values for large datasets.Both SVD and LDA may be used to calculate new features, and the feature dimension may be reduced by choosing the first few engineered features.Deep learning has become a research hotspot in recent years for its powerful representation-learning capability. Representation-learning-based disease diagnosis models are also gaining increased attention in the literature. Su et al. proposed the Siamese response deep factorization machine (SRDFM) algorithm, based on a Siamese network, to learn a feature vector of the drug property and gene expression for personalized anti-cancer drug recommendations. Peng et al. proposed a novel subnetwork representation-learning method to uncover disease-disease relationships. Lv et al. used deep-representation-learning features for the identification of sub-Golgi protein localization. It is also worth mentioning that deep-learning-based representation-learning models are often poor in the interpretability of learned features.
Performance evaluation metrics
A binary-classification model is usually evaluated by the following variables and performance metrics.,, ITP patients are defined as positive samples, and their number is defined as P. The controls are negative samples, and their number is N. ITP patients are true positives (TPs) if they are predicted as the ITP class; otherwise, they are the false negatives (FNs). The control samples are defined as true negatives (TNs) if they are correctly predicted as controls; otherwise, they are defined as false positives (FP). The numbers of samples in these groups of samples are also denoted as TPs, FNs, TNs, and FPs.A binary classifier is evaluated by its specificity, sensitivity, and accuracy. Specificity is the rate of correctly predicted controls, i.e., Sp = TN/(TN + FP). Sensitivity is defined as the rate of correctly predicted ITP patients, i.e., Sn = TP/(TP + FN). The overall prediction accuracy is defined as ACC=(TN + TP)/(FP + TN + FN + TP). All three metrics are between 0 and 1. A better prediction model has larger values for Sp, Sn, and ACC.The ROC curve is a 2-dimensional plot between Sn and (1-Sp). The area under the ROC curve is defined as the area under the curve (AUC) value for a binary-classification model. The metric AUC is widely used to measure a binary-classification model independent of model thresholds.
Stacking weak classifiers
Stacking is an effective classification and regression model used in machine learning. It builds multiple base models (meta-learner) in multiple layers so that the output of the previous model can be used as the input of the model of the later layer. In recent years, machine-learning models based on stacking technology have also been widely used in disease modeling. StackTADB is a stacking model for predicting the boundaries of topologically associating domains (TADs) accurately. Wu et al. and Khoei et al. propose a stacking-based ensemble-learning model with a genetic algorithm for detecting early stages of Alzheimer’s disease. Rahman et al. propose a coronavirus disease 2019 (COVID-19) detection system based on a stacking model. The main difference between our model and theirs is that feature selection is integrated into the model building process and forms an end-to-end learning process.
The overall pipeline of diagnosis model of ITP
This study processed two clinical features separately from the RNA-seq transcriptomes. The RNA-seq transcriptomes and the feature age were normalized by standardization. The feature sex was a category variable and was formatted by the one-hot encoding strategy.Due to the high dimensionality of the RNA-seq transcriptomes, an additional processing of feature selection was carried out on the transcriptomes. The RNA-seq transcriptome has 33,493 features, and there are only 114 samples in total. Most of these transcribed features are involved in various biological processes that are not associated with ITP, so the L1-regularization algorithm Lasso was used to remove those features with zero weights in the regression model with the class label. The redundant features were removed according to the inter-feature PCCs. Then, the normalized feature age and the encoded feature sex were integrated with the selected RNA-seq transcriptome features. The feature-engineering algorithms SVD and LDA were used to further enrich the information into fewer singular values.An ensembled classifier, RF, was used to build the final ITP diagnosis model by integrating the results of multiple first-line classification models. Our experimental data suggest that the ensembled classifier demonstrated a better performance than the first-line classifiers. The workflow of the ITP diagnosis modeling in this study can been seen in Figure 8.
Figure 8
Workflow of ITP diagnosis modeling in this study
Step 1 normalizes the data, step 2 selects the best features, step 3 fuses the transcriptome feature with age and sex, step 4 features engineering, and step 5 optimizes the diagnosis model. All boxed text is the functional module names utilized in the pipeline.
Workflow of ITP diagnosis modeling in this studyStep 1 normalizes the data, step 2 selects the best features, step 3 fuses the transcriptome feature with age and sex, step 4 features engineering, and step 5 optimizes the diagnosis model. All boxed text is the functional module names utilized in the pipeline.
Availability of data and material
Detailed information on optimizing the parameters of nFeatures, nComponents, and nEstimators can been found in Figure S1. The dataset used in this study was archived in the NCBI SRA database. The full dataset of the platelet RNA-seq next-generation sequencing (NGS) data for the 59 ITP patients was archived in the NCBI SRA database with the project accession NCBI: PRJNA664615. The RNA-seq NGS data of the 55 health controls is available as the project accession NCBI: PRJNA668820. Researchers of interest may explore their scientific hypothesis in this dataset after the embargo period. The source code is released at the web site http://www.healthinformaticslab.org/supp/. Any future collaborations are welcome.
Authors: Francesco Rodeghiero; Roberto Stasi; Terry Gernsheimer; Marc Michel; Drew Provan; Donald M Arnold; James B Bussel; Douglas B Cines; Beng H Chong; Nichola Cooper; Bertrand Godeau; Klaus Lechner; Maria Gabriella Mazzucconi; Robert McMillan; Miguel A Sanz; Paul Imbach; Victor Blanchette; Thomas Kühne; Marco Ruggeri; James N George Journal: Blood Date: 2008-11-12 Impact factor: 22.113
Authors: Andrew L Frelinger; Rachael F Grace; Anja J Gerrits; Michelle A Berny-Lang; Travis Brown; Sabrina L Carmichael; Ellis J Neufeld; Alan D Michelson Journal: Blood Date: 2015-07-02 Impact factor: 22.113
Authors: Xiaoli Jiao; Brad T Sherman; Da Wei Huang; Robert Stephens; Michael W Baseler; H Clifford Lane; Richard A Lempicki Journal: Bioinformatics Date: 2012-04-27 Impact factor: 6.937
Authors: Myron G Best; Nik Sol; Sjors G J G In 't Veld; Adrienne Vancura; Mirte Muller; Anna-Larissa N Niemeijer; Aniko V Fejes; Lee-Ann Tjon Kon Fat; Anna E Huis In 't Veld; Cyra Leurs; Tessa Y Le Large; Laura L Meijer; Irsan E Kooi; François Rustenburg; Pepijn Schellen; Heleen Verschueren; Edward Post; Laurine E Wedekind; Jillian Bracht; Michelle Esenkbrink; Leon Wils; Francesca Favaro; Jilian D Schoonhoven; Jihane Tannous; Hanne Meijers-Heijboer; Geert Kazemier; Elisa Giovannetti; Jaap C Reijneveld; Sander Idema; Joep Killestein; Michal Heger; Saskia C de Jager; Rolf T Urbanus; Imo E Hoefer; Gerard Pasterkamp; Christine Mannhalter; Jose Gomez-Arroyo; Harm-Jan Bogaard; David P Noske; W Peter Vandertop; Daan van den Broek; Bauke Ylstra; R Jonas A Nilsson; Pieter Wesseling; Niki Karachaliou; Rafael Rosell; Elizabeth Lee-Lewandrowski; Kent B Lewandrowski; Bakhos A Tannous; Adrianus J de Langen; Egbert F Smit; Michel M van den Heuvel; Thomas Wurdinger Journal: Cancer Cell Date: 2017-08-14 Impact factor: 31.743
Authors: Tawsifur Rahman; Amith Khandakar; Farhan Fuad Abir; Md Ahasan Atick Faisal; Md Shafayet Hossain; Kanchon Kanti Podder; Tariq O Abbas; Mohammed Fasihul Alam; Saad Bin Kashem; Mohammad Tariqul Islam; Susu M Zughaier; Muhammad E H Chowdhury Journal: Comput Biol Med Date: 2022-02-12 Impact factor: 4.589