JiaRui Li, Lei Chen, Yu-Hang Zhang, XiangYin Kong, Tao Huang, Yu-Dong Cai.
Abstract
Tissue-specific gene expression has long been recognized as key to understanding tissue development and function. Efforts have been made in the past decade to identify tissue-specific expression profiles, such as the Human Protein Atlas and FANTOM5. However, these studies mainly focused on "qualitatively tissue-specific expressed genes", which are highly enriched in one tissue or a group of tissues, and paid less attention to "quantitatively tissue-specific expressed genes", which are expressed in all or most tissues but at different levels. In this study, we applied machine learning algorithms to build a computational method for identifying "quantitatively tissue-specific expressed genes" capable of distinguishing 25 human tissues by their expression patterns. Our results identified the expression of 432 genes as the optimal features for tissue classification; with these features, a support vector machine (SVM) achieved a Matthews correlation coefficient (MCC) above 0.99. The resulting model outperformed an SVM model built on tissue-enriched genes and yielded an MCC of 0.985 on an independent test dataset, indicating good generalization ability. These 432 genes were confirmed to be widely expressed in multiple tissues, and a literature review of the top 23 genes found support for the discriminating power of most of them. As a complement to previous studies, our discovery of these quantitatively tissue-specific genes provides insight into tissue development and function.
Keywords: feature selection; support vector machine; tissue classification; tissue-specific expressed genes; transcriptome
Year: 2018 PMID: 30205473 PMCID: PMC6162521 DOI: 10.3390/genes9090449
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
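The abstract reports performance as a Matthews correlation coefficient over 25 tissue classes. The sketch below shows one common way to compute a multiclass MCC, assuming the generalization usually attributed to Gorodkin (the R_K statistic); this record does not state which formulation the authors used, and the function name `multiclass_mcc` is hypothetical.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC (assumed R_K formulation), computed from label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    s = len(y_true)                                       # total samples
    c = sum(1 for t, p in zip(y_true, y_pred) if t == p)  # correct predictions
    t_counts = Counter(y_true)                            # true count per class
    p_counts = Counter(y_pred)                            # predicted count per class
    labels = set(t_counts) | set(p_counts)
    numerator = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    var_true = s * s - sum(v * v for v in t_counts.values())
    var_pred = s * s - sum(v * v for v in p_counts.values())
    denominator = sqrt(var_true * var_pred)
    # By convention, return 0.0 when the coefficient is undefined
    # (for example, when every sample is assigned to a single class).
    return numerator / denominator if denominator else 0.0


# Toy labels for illustration only; these are not data from the study.
truth = ["brain", "liver", "lung", "brain"]
print(multiclass_mcc(truth, truth))
```

With two classes this expression reduces to the familiar binary MCC, so the same function covers both cases.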
The 25 tissue samples.
| Tissue | Number of Samples (Training Dataset) | Number of Samples (Test Dataset) |
|---|---|---|
| Adipose tissue | 577 | 237 |
| Adrenal gland | 145 | 50 |
| Blood | 511 | 54 |
| Blood vessel | 689 | 242 |
| Brain | 1259 | 455 |
| Breast | 214 | 84 |
| Colon | 345 | 169 |
| Esophagus | 686 | 348 |
| Heart | 412 | 201 |
| Liver | 119 | 58 |
| Lung | 320 | 123 |
| Muscle | 430 | 155 |
| Nerve | 304 | 122 |
| Ovary | 97 | 39 |
| Pancreas | 171 | 82 |
| Pituitary | 103 | 82 |
| Prostate | 106 | 48 |
| Skin | 890 | 342 |
| Small intestine | 88 | 52 |
| Spleen | 104 | 60 |
| Stomach | 192 | 75 |
| Testis | 172 | 91 |
| Thyroid | 323 | 139 |
| Uterus | 83 | 32 |
| Vagina | 96 | 27 |
| Total | 8436 | 3367 |
Figure 1. The incremental feature selection (IFS) curve illustrating the performance of the classification models using different numbers of features. Red diamonds mark the performance when the top 23 genes and the 432 features were used to build the classification models.
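The IFS procedure behind the curve in Figure 1 can be illustrated with a short sketch: given features already ranked by some importance measure, a classifier is trained on the top 1, top 2, ..., top k features, each model is scored, and the k with the best score is kept. The study used an SVM scored by MCC; to stay dependency-free, the sketch swaps in a nearest-centroid classifier scored by plain accuracy. That swap, every function name, and the toy data are assumptions for illustration, not the authors' implementation.

```python
def take_columns(X, cols):
    """Keep only the listed feature columns, in the given order."""
    return [[row[j] for j in cols] for row in X]


def nearest_centroid_fit(X, y):
    """Stand-in classifier (NOT the study's SVM): one mean vector per class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def nearest_centroid_predict(centroids, X):
    """Assign each row to the class with the closest centroid."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda k: dist2(centroids[k], row)) for row in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def incremental_feature_selection(X_train, y_train, X_val, y_val, ranked, score=accuracy):
    """Return (best_k, curve), where curve[k - 1] is the score with the top-k features."""
    curve = []
    for k in range(1, len(ranked) + 1):
        cols = ranked[:k]
        model = nearest_centroid_fit(take_columns(X_train, cols), y_train)
        predictions = nearest_centroid_predict(model, take_columns(X_val, cols))
        curve.append(score(y_val, predictions))
    # Ties are resolved in favour of the smallest feature set.
    best_k = max(range(len(curve)), key=curve.__getitem__) + 1
    return best_k, curve


# Toy expression matrix: column 0 is uninformative, column 1 separates the classes.
X_train = [[5.0, 1.0], [4.0, 1.2], [5.0, 3.0], [4.0, 3.2]]
y_train = ["liver", "liver", "lung", "lung"]
X_val = [[4.5, 1.1], [4.5, 2.9]]
y_val = ["liver", "lung"]
print(incremental_feature_selection(X_train, y_train, X_val, y_val, ranked=[0, 1]))
```

Passing a different `score` callable (for instance an MCC function) or a different classifier changes the curve without changing the loop itself.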
Figure 2. Performance of the optimal support vector machine (SVM) classification model, the SVM model using all tissue-enriched genes, and the SVM model using the top 432 tissue-enriched genes, including the accuracy on each tissue and the overall accuracy. The optimal SVM classification model performed best.
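Figure 2 compares models by accuracy on each tissue and by overall accuracy. A minimal sketch of how those two quantities can be derived from true and predicted labels follows; the function name `per_class_accuracy` and the toy labels are hypothetical, and per-tissue accuracy is taken here to mean the fraction of samples of a tissue that were predicted correctly.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return ({label: fraction of that label's samples predicted correctly}, overall accuracy)."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    per_class = {label: hits[label] / totals[label] for label in totals}
    overall = sum(hits.values()) / len(y_true)
    return per_class, overall


# Toy labels for illustration only; these are not data from the study.
truth = ["brain", "brain", "liver", "liver", "liver", "lung"]
predicted = ["brain", "liver", "liver", "liver", "liver", "lung"]
print(per_class_accuracy(truth, predicted))
```

Because tissues differ greatly in sample count, a high overall accuracy can hide a weak result on a small tissue, which is why the two views are reported side by side.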
The top 23 genes selected for further investigation via a literature review.
| Rank | Gene | Description | The Human Protein Atlas | Expression Atlas of EMBL-EBI |
|---|---|---|---|---|
| 1 | | A-Raf Proto-Oncogene, Serine/Threonine Kinase | Expressed in all | Multiple tissues |
| 2 | | Integrin Subunit Alpha 3 | Mixed | Multiple tissues |
| 3 | | SLAIN Motif Family Member 2 | Expressed in all | Multiple tissues |
| 4 | | Zinc Finger Protein 532 | Mixed | Multiple tissues |
| 5 | | Peptidylprolyl Isomerase C | Mixed | Multiple tissues |
| 6 | | KDEL Endoplasmic Reticulum Protein Retention Receptor 1 | Expressed in all | Multiple tissues |
| 7 | | Neuroblastoma 1, DAN Family BMP Antagonist | Expressed in all | Multiple tissues |
| 8 | | Proteolipid Protein 2 | Expressed in all | Multiple tissues |
| 9 | | Signal Transducer and Activator of Transcription 6 | Expressed in all | Multiple tissues |
| 10 | | Rho GTPase Activating Protein 23 | Mixed | Multiple tissues |
| 11 | | Leucine Rich Repeats And Immunoglobulin Like Domains 3 | Tissue enhanced (thyroid gland) | Multiple tissues |
| 12 | | Mannosidase Beta Like | Expressed in all | Multiple tissues |
| 13 | | Protein Tyrosine Phosphatase, Receptor Type A | Expressed in all | Multiple tissues |
| 14 | | Yes Associated Protein 1 | Mixed | Multiple tissues |
| 15 | | Chloride Intracellular Channel 1 | Expressed in all | Multiple tissues |
| 16 | | Transmembrane Protein 109 | Expressed in all | Multiple tissues |
| 17 | | Molybdenum Cofactor Synthesis 2 | Expressed in all | Multiple tissues |
| 18 | | Protein Tyrosine Phosphatase, Receptor Type F | Mixed | Multiple tissues |
| 19 | | Myosin IC | Expressed in all | Multiple tissues |
| 20 | | Family with Sequence Similarity 127 Member B | Expressed in all | Multiple tissues |
| 21 | | Thyroid Hormone Receptor Interactor 10 | Expressed in all | Multiple tissues |
| 22 | | Serpin Family G Member 1 | Expressed in all | Multiple tissues |
| 23 | | Target of Myb1 Like 2 Membrane Trafficking Protein | Expressed in all | Multiple tissues |