Lei Chen, Xiaoyong Pan, Yu-Hang Zhang, Xiaohua Hu, KaiYan Feng, Tao Huang, Yu-Dong Cai.
Abstract
Patient-derived tumor xenograft (PDX) mouse models are widely used for drug screening. The underlying assumption is that PDX tissue closely resembles the original patient tissue and responds to drug treatment in the same way. To investigate whether primary tumor site information is well preserved in PDX, we analyzed the gene expression profiles of PDX mouse models derived from different tissues, including breast, kidney, large intestine, lung, ovary, pancreas, skin, and soft tissue. The Monte Carlo feature selection method was employed to analyze the expression profiles, yielding a feature list. From this list, incremental feature selection and a support vector machine (SVM) were used to extract genes that are distinctively expressed in PDXs from different primary tumor sites and to build an optimal SVM classifier. In addition, we set up a group of quantitative rules to identify primary tumor sites. A total of 755 genes were extracted by the feature selection procedures; on these genes, the SVM classifier achieved a Matthews correlation coefficient (MCC) of 0.986 in classifying primary tumor sites. Furthermore, we obtained 16 classification rules, which gave a lower accuracy but a clear classification procedure. These results validated that primary tumor site specificity is well preserved in PDX: PDXs from different primary tumor sites were still very different, and these differences were similar to those observed in patients with tumors. For example, VIM and ABHD17C were highly expressed in PDXs from breast tissue and are also highly expressed in breast cancer patients.
Keywords: Monte Carlo feature selection; Patient-derived tumor xenograft; gene expression profile; rule learning algorithm; support vector machine
Year: 2019 PMID: 31456818 PMCID: PMC6701289 DOI: 10.3389/fgene.2019.00738
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Number of samples for each of the eight tissues.
| Tissue | Number of samples |
|---|---|
| Breast | 79 |
| Kidney | 41 |
| Large intestine | 121 |
| Lung | 99 |
| Ovary | 52 |
| Pancreas | 94 |
| Skin | 46 |
| Soft tissue | 62 |
| Total | 594 |
Figure 1. The entire procedure for investigating the gene expression data of samples from eight PDX tumor tissues. The data were first analyzed by the Monte Carlo feature selection method, producing a feature list and a set of informative features. The feature list was used in the incremental feature selection method to extract optimal features for the support vector machine (SVM) and to construct the optimal SVM classifier. The Johnson reducer and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithms were applied to the informative features to generate classification rules.
Figure 2. Confusion map for classifying samples into eight tissues via the classification rules yielded by the Johnson reducer and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithms, evaluated by three rounds of 10-fold cross-validation.
Figure 3. The individual and overall accuracies of the classification rules yielded by the Johnson reducer and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) algorithms, evaluated by self-consistency and 10-fold cross-validation.
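The per-tissue accuracies, overall accuracy, and MCC reported here can all be derived from a confusion matrix. The following is a minimal sketch, assuming rows are true classes and columns are predicted classes, and assuming the multi-class generalization of MCC (Gorodkin's R_K statistic); the record does not state which multi-class formula the authors used, and the matrix `toy_cm` is hypothetical illustration data, not results from the study.

```python
from math import sqrt

def per_class_accuracy(cm):
    """Fraction of each true class (row) that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

def overall_accuracy(cm):
    """Correct predictions (diagonal) divided by all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def multiclass_mcc(cm):
    """Multi-class MCC (assumed R_K form) from a square confusion matrix."""
    n = len(cm)
    s = sum(sum(row) for row in cm)                            # total samples
    c = sum(cm[i][i] for i in range(n))                        # correct predictions
    t = [sum(cm[i]) for i in range(n)]                         # true count per class
    p = [sum(cm[i][j] for i in range(n)) for j in range(n)]    # predicted count per class
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = sqrt((s * s - sum(pk * pk for pk in p)) *
                       (s * s - sum(tk * tk for tk in t)))
    return numerator / denominator if denominator else 0.0

# Hypothetical 3-class confusion matrix, for illustration only.
toy_cm = [[8, 1, 1],
          [0, 9, 1],
          [1, 0, 9]]
print(per_class_accuracy(toy_cm), overall_accuracy(toy_cm), multiclass_mcc(toy_cm))
```

For two classes this expression reduces to the familiar binary MCC, which makes it easy to sanity-check against a hand calculation.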
Sixteen produced classification rules for distinguishing samples from different tissues.
| Rules | Criteria | Tissues |
|---|---|---|
| Rule-1 | ANGPTL4 ≥ 6.409 | Kidney |
| Rule-2 | UPK1A ≥ 6.474 | Kidney |
| Rule-3 | PAX3 ≥ 3.401 | Skin |
| Rule-4 | BHMT2 ≥ 5.125 | Skin |
| Rule-5 | PAX8 ≥ 3.217 | Ovary |
| Rule-6 | TRADD ≤ 3.210 | Ovary |
| Rule-7 | CPVL ≥ 7.240 | Ovary |
| Rule-8 | F11R ≤ 4.935 | Soft tissue |
| Rule-9 | HSD17B11 ≤ 5.122 | Breast |
| Rule-10 | VIM ≥ 8.697 | Breast |
| Rule-11 | ADAM28 ≥ 3.637 | Pancreas |
| Rule-12 | CXCL5 ≥ 3.927 | Pancreas |
| Rule-13 | LOC102724689 ≥ 7.396 | Pancreas |
| Rule-14 | MSN ≥ 5.037 | Lung |
| Rule-15 | TP73-AS1 ≥ 3.462 | Lung |
| Rule-16 | Other conditions | Large intestine |
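The rule table above can be read as a simple decision list. The sketch below transcribes the 15 threshold rules and the default class, under two assumptions that the record does not confirm: rules are checked in the listed order with the first match winning, and the input expression values are on the same scale as the thresholds. The function name `classify` and the handling of missing genes (a rule is skipped if its gene is absent) are illustrative choices.

```python
import operator

# (gene, comparison, threshold, tissue), transcribed from the rule table.
RULES = [
    ("ANGPTL4", ">=", 6.409, "Kidney"),
    ("UPK1A", ">=", 6.474, "Kidney"),
    ("PAX3", ">=", 3.401, "Skin"),
    ("BHMT2", ">=", 5.125, "Skin"),
    ("PAX8", ">=", 3.217, "Ovary"),
    ("TRADD", "<=", 3.210, "Ovary"),
    ("CPVL", ">=", 7.240, "Ovary"),
    ("F11R", "<=", 4.935, "Soft tissue"),
    ("HSD17B11", "<=", 5.122, "Breast"),
    ("VIM", ">=", 8.697, "Breast"),
    ("ADAM28", ">=", 3.637, "Pancreas"),
    ("CXCL5", ">=", 3.927, "Pancreas"),
    ("LOC102724689", ">=", 7.396, "Pancreas"),
    ("MSN", ">=", 5.037, "Lung"),
    ("TP73-AS1", ">=", 3.462, "Lung"),
]
DEFAULT_TISSUE = "Large intestine"  # Rule-16: "Other conditions"
OPS = {">=": operator.ge, "<=": operator.le}

def classify(expression):
    """Return the tissue of the first matching rule (assumed ordering)."""
    for gene, op, threshold, tissue in RULES:
        value = expression.get(gene)
        # Assumption: a rule whose gene is missing from the sample is skipped.
        if value is not None and OPS[op](value, threshold):
            return tissue
    return DEFAULT_TISSUE

# Hypothetical expression values, for illustration only.
print(classify({"ANGPTL4": 2.0, "PAX8": 3.5}))  # matches Rule-5
```

Because evaluation order matters in a first-match decision list, reordering `RULES` can change predictions for samples that satisfy more than one criterion.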
Figure 4. Curves illustrating the performance of SVM on different feature sets. The X-axis represents the number of features used in the classification; the Y-axis represents the MCC. (A) The whole curve, showing the performance of SVM on feature sets whose sizes are multiples of 10 top features. (B) The part of the curve for X between 10 and 2000. When the top 780 features are used, the MCC reaches its highest value (0.986). (C) The curve showing the performance of SVM on feature sets containing the top 700–900 features. When the top 755 features are used, the MCC reaches its highest value (0.986).
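The coarse-then-fine search described in this caption can be sketched as a two-stage incremental feature selection loop. This is an illustrative outline, not the authors' implementation: `evaluate` stands in for a cross-validated SVM MCC, `toy_score` is a hypothetical scorer that simply peaks at 755 features, and the `coarse_step` and `window` parameters are assumptions chosen to mirror the step of 10 and the 700–900 interval mentioned above.

```python
def incremental_feature_selection(ranked_features, evaluate, sizes):
    """Score the top-k prefix of a ranked feature list for each k in sizes.

    Returns (best_k, best_score); ties keep the smallest k.
    """
    best_k, best_score = None, float("-inf")
    for k in sizes:
        score = evaluate(ranked_features[:k])
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

def two_stage_search(ranked_features, evaluate, coarse_step=10, window=100):
    """Coarse scan in steps of coarse_step, then a step-1 scan around the peak."""
    n = len(ranked_features)
    coarse_k, _ = incremental_feature_selection(
        ranked_features, evaluate, range(coarse_step, n + 1, coarse_step))
    low = max(1, coarse_k - window)
    high = min(n, coarse_k + window)
    return incremental_feature_selection(
        ranked_features, evaluate, range(low, high + 1))

# Hypothetical stand-ins: a real run would train and cross-validate an SVM here.
ranked = [f"gene_{i}" for i in range(2000)]

def toy_score(subset):
    return 1.0 - abs(len(subset) - 755) / 10000

best_k, best_score = two_stage_search(ranked, toy_score)
print(best_k, best_score)
```

The coarse pass keeps the number of classifier evaluations manageable on a long feature list, while the fine pass recovers an optimum that does not fall on a multiple of the coarse step.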
Figure 5. Bar chart illustrating the individual accuracy on each tissue and the overall accuracy yielded by the optimal SVM classifier and the classifier with informative features.
Figure 6. MCCs obtained by the optimal SVM classifier and by 1000 SVM classifiers built on 1000 randomly generated feature subsets. The red circle represents the MCC yielded by the optimal SVM classifier, and the black circles represent the MCCs produced by the SVM classifiers on the randomly generated feature subsets. The blue line represents the threshold for the significance level (p-value < 0.05).
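One common way to turn such a comparison into a significance statement is an empirical p-value against the random-subset scores. The sketch below assumes that approach; the record does not say exactly how the threshold line was computed, so both `empirical_p_value` and `upper_quantile_threshold` are illustrative, and `null_mccs` is simulated placeholder data rather than values from the study.

```python
import random

def empirical_p_value(observed, null_scores):
    """Share of null scores at least as large as the observed one.

    The +1 terms count the observed score itself, so p is never exactly 0.
    """
    exceed = sum(1 for score in null_scores if score >= observed)
    return (exceed + 1) / (len(null_scores) + 1)

def upper_quantile_threshold(null_scores, alpha=0.05):
    """Approximate score that a fraction alpha of the null scores exceeds."""
    ordered = sorted(null_scores)
    index = min(len(ordered) - 1, int((1 - alpha) * len(ordered)))
    return ordered[index]

# Simulated MCCs for 1000 random feature subsets (hypothetical values).
rng = random.Random(0)
null_mccs = [rng.uniform(0.80, 0.97) for _ in range(1000)]

p_value = empirical_p_value(0.986, null_mccs)
threshold = upper_quantile_threshold(null_mccs)
print(p_value, threshold)
```

With 1000 random subsets, the smallest p-value this estimator can report is 1/1001, which is why more random draws are needed to support stricter significance claims.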
Top 10 features (genes) yielded by the MCFS method.
| Rank | Gene symbol | Description | RI |
|---|---|---|---|
| 1 | IFFO1 | Intermediate Filament Family Orphan 1 | 0.4515 |
| 2 | CDX1 | Caudal Type Homeobox 1 | 0.4263 |
| 3 | HSD17B11 | Hydroxysteroid 17-Beta Dehydrogenase 11 | 0.4047 |
| 4 | CHMP4C | Charged Multivesicular Body Protein 4C | 0.4042 |
| 5 | CLIP4 | CAP-Gly Domain Containing Linker Protein Family Member 4 | 0.4025 |
| 6 | PAX8 | Paired Box 8 | 0.4024 |
| 7 | GUCY2C | Guanylate Cyclase 2C | 0.4023 |
| 8 | MLANA | Melan-A | 0.3857 |
| 9 | F11R | F11 Receptor | 0.3689 |
| 10 | NR3C1 | Nuclear Receptor Subfamily 3 Group C Member 1 | 0.3646 |
Figure 7. Rule networks for the 16 classification rules, generated by Ciruvis.