Ming Li1, Yu Guo1, Yuan-Ming Feng1,2, Ning Zhang1.
Abstract
Triple-negative breast cancer (TNBC) is a special subtype of breast cancer that is difficult to treat. It is crucial to identify breast cancer-related genes that could provide new biomarkers for breast cancer diagnosis and potential treatment targets. To develop our new high-risk breast cancer prediction model, we used seven raw gene expression datasets from the NCBI Gene Expression Omnibus (GEO) database (GSE31519, GSE9574, GSE20194, GSE20271, GSE32646, GSE45255, and GSE15852). We selected significant genes with the maximum relevance minimum redundancy (mRMR) method. We then mapped the transcripts of these genes onto the protein-protein interaction (PPI) network from the Search Tool for the Retrieval of Interacting Genes (STRING) database and traced the shortest path between each pair of proteins. Genes with higher betweenness values were selected from the shortest path proteins. To ensure validity and precision, a permutation test was performed: we randomly selected 248 proteins from the PPI network for shortest path tracing and repeated the procedure 100 times. We also removed genes that appeared more frequently in randomized results. As a result, 54 genes were selected as potential TNBC-related genes. Using 14 of the 54 genes as input features for a support vector machine (SVM), we trained a novel model to predict high-risk breast cancer. The prediction accuracy for normal tissues versus TNBC tissues reached 95.394%, and that for Stage II versus Stage III TNBC reached 86.598%, indicating that such genes play important roles in distinguishing breast cancers and that the method could be promising in practical use. Some of the 54 genes we identified from the PPI network have already been reported in the literature as associated with breast cancer. Several other genes have not yet been reported but functionally resemble known cancer genes. These may be novel breast cancer-related genes and need further experimental validation. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed to appraise the 54 genes. The results indicated that the cellular response to organic cyclic compounds has an influence on breast cancer and that most of the genes may be related to viral carcinogenesis.
Keywords: SVM; gene; protein-protein interaction network; proteins; triple-negative breast cancer
Year: 2019 PMID: 30930932 PMCID: PMC6428707 DOI: 10.3389/fgene.2019.00180
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
FIGURE 1. The analysis flowchart for this study. The method integrates breast cancer gene expression data and PPI data. First, each gene was treated as a feature in the data, and mRMR was used to rank the importance of the genes. The top 248 genes from the mRMR results were then selected. The shortest paths between every pair of the 248 encoded proteins were searched in the PPI network with the Dijkstra algorithm. Shortest path proteins were retrieved and ranked in descending order of betweenness. After that, 54 of the shortest path proteins were selected and considered as potential triple-negative breast cancer-related genes. Finally, a C-SVC model was used for classification, and the grid search method was used to select appropriate parameters.
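The shortest-path step of the flowchart can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the PPI network is held as a weighted adjacency dictionary, that betweenness is the number of seed-to-seed shortest paths passing through a protein as an interior node, and that only one shortest path per pair is traced. The function names, the toy network, and its edge weights are hypothetical.

```python
import heapq
from collections import Counter
from itertools import combinations


def dijkstra_path(graph, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                prev[neighbor] = node
                heapq.heappush(heap, (nd, neighbor))
    if target not in dist:
        return None
    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def path_betweenness(graph, seeds):
    """Count how many seed-to-seed shortest paths pass through each interior node."""
    counts = Counter()
    for a, b in combinations(sorted(seeds), 2):
        path = dijkstra_path(graph, a, b)
        if path:
            counts.update(path[1:-1])  # assumption: endpoints are not counted
    return counts


# Toy undirected network with hypothetical protein names and weights.
toy_graph = {
    "A": {"X": 1.0},
    "B": {"X": 1.0},
    "C": {"Y": 1.0},
    "X": {"A": 1.0, "B": 1.0, "Y": 1.0},
    "Y": {"X": 1.0, "C": 1.0},
}
toy_counts = path_betweenness(toy_graph, {"A", "B", "C"})
ranked = sorted(toy_counts.items(), key=lambda item: item[1], reverse=True)
```

In the toy network, `X` lies on all three seed-to-seed paths and `Y` on two, so `ranked` lists `X` first. On a real STRING network the 248 seeds would be the proteins encoded by the top mRMR genes, and the edge weights would have to be derived from the interaction scores in whatever way the study chose.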
The 54 candidate breast cancer-related genes and betweenness.
| hgnc_symbol | ensp | Betweenness | Reference |
|---|---|---|---|
| | ENSP00000360525 | 1777 | |
| | ENSP00000264033 | 1548 | |
| | ENSP00000433821 | 1380 | |
| | ENSP00000393312 | 1286 | |
| | ENSP00000400175 | 1237 | |
| | ENSP00000263253 | 1048 | |
| | ENSP00000348461 | 871 | |
| | ENSP00000378699 | 852 | |
| | ENSP00000261769 | 848 | |
| | ENSP00000275493 | 815 | |
| | ENSP00000360266 | 811 | |
| | ENSP00000277541 | 803 | |
| | ENSP00000309555 | 795 | |
| | ENSP00000362824 | 786 | |
| | ENSP00000296122 | 786 | |
| | ENSP00000003084 | 767 | |
| | ENSP00000269571 | 763 | |
| | ENSP00000338018 | 762 | |
| | ENSP00000206249 | 745 | |
| | ENSP00000362649 | 705 | |
| | ENSP00000272317 | 672 | |
| | ENSP00000384273 | 658 | |
| | ENSP00000387699 | 582 | |
| | ENSP00000256442 | 565 | |
| | ENSP00000353483 | 561 | |
| | ENSP00000350941 | 559 | |
| | ENSP00000263036 | 558 | |
| | ENSP00000303351 | 553 | |
| | ENSP00000341885 | 549 | |
| | ENSP00000226574 | 512 | |
| | ENSP00000354632 | 508 | |
| | ENSP00000354982 | 508 | |
| | ENSP00000282050 | 508 | |
| | ENSP00000351446 | 466 | |
| | ENSP00000262367 | 466 | |
| | ENSP00000376176 | 445 | |
| | ENSP00000365439 | 429 | |
| | ENSP00000359206 | 408 | |
| | ENSP00000228307 | 406 | |
| | ENSP00000317159 | 394 | |
| | ENSP00000307786 | 391 | |
| | ENSP00000401303 | 383 | |
| | ENSP00000346389 | 381 | |
| | ENSP00000384018 | 362 | |
| | ENSP00000447488 | 347 | |
| | ENSP00000368438 | 336 | |
| | ENSP00000282441 | 335 | |
| | ENSP00000261681 | 331 | |
| | ENSP00000361027 | 331 | |
| | ENSP00000348577 | 323 | |
| | ENSP00000306245 | 316 | |
| | ENSP00000354394 | 313 | |
| | ENSP00000363822 | 308 | |
| | ENSP00000405965 | 297 | |
FIGURE 2. The protein-protein interaction network of the proteins encoded by the 54 candidate genes. Shortest path proteins were retrieved from the shortest paths between every pair of proteins encoded by the top 248 genes selected from the mRMR table. The shortest path between every protein pair was searched in the network with the Dijkstra algorithm. Finally, the 54 shortest path proteins were obtained, and their corresponding genes were considered candidate genes. The PPI network of the 54 shortest path proteins is depicted; nodes represent proteins, and lines between nodes represent protein interactions.
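The abstract states that candidates were checked with a permutation test: 248 proteins were drawn at random from the PPI network, the shortest path tracing was repeated 100 times, and genes that appeared more frequently in the randomized results were removed. One possible reading of that filter is sketched below. The exact removal criterion is not given in the source, so the `max_fraction` cutoff, the function names, and the toy scoring function are all assumptions made for illustration.

```python
import random


def permutation_filter(observed, all_nodes, score_fn, n_seeds,
                       n_rounds=100, max_fraction=0.05, rng=None):
    """Keep candidates whose observed score is rarely matched by random seed sets.

    observed:     dict mapping candidate -> score obtained from the real seeds.
    all_nodes:    iterable of every node in the network.
    score_fn:     callable taking a list of seeds and returning a dict of scores.
    max_fraction: hypothetical cutoff; the source does not state the one used.
    """
    rng = rng or random.Random(0)
    nodes = sorted(all_nodes)  # fixed order keeps the sampling reproducible
    exceed = {node: 0 for node in observed}
    for _ in range(n_rounds):
        seeds = rng.sample(nodes, n_seeds)
        random_scores = score_fn(seeds)
        for node, real_score in observed.items():
            if random_scores.get(node, 0) >= real_score:
                exceed[node] += 1
    return {node: score for node, score in observed.items()
            if exceed[node] / n_rounds <= max_fraction}


def toy_score_fn(seeds):
    """Hypothetical scorer: a hub that scores highly whatever seeds are drawn."""
    return {"HUB": 10}


toy_nodes = ["P%d" % i for i in range(20)]
kept = permutation_filter({"HUB": 8, "SPECIFIC": 5}, toy_nodes,
                          toy_score_fn, n_seeds=5)
```

In this toy run the hub is matched or beaten in every random round and is dropped, while the seed-specific candidate is kept. In the study's setting, `score_fn` would be the shortest-path betweenness computed on the STRING network, with `n_seeds=248` and `n_rounds=100`.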
Results of the GO enrichment analysis.
| GO term entry ID | Description | P-value | Count |
|---|---|---|---|
| GO:0071407 | Cellular response to organic cyclic compound | 1.217E-12 | 16 |
| GO:0006979 | Response to oxidative stress | 9.108E-11 | 13 |
| GO:0048511 | Rhythmic process | 1.467E-10 | 12 |
| GO:0071396 | Cellular response to lipid | 1.897E-10 | 14 |
| GO:0007507 | Heart development | 2.055E-10 | 14 |
| GO:0048732 | Gland development | 2.349E-10 | 13 |
| GO:0009612 | Response to mechanical stimulus | 3.549E-10 | 10 |
| GO:0009314 | Response to radiation | 3.666E-10 | 13 |
| GO:0038095 | Fc-epsilon receptor signaling pathway | 5.244E-10 | 9 |
| GO:0000302 | Response to reactive oxygen species | 6.401E-10 | 10 |
Results of the KEGG enrichment analysis.
| KEGG term entry ID | Description | P-value | Count |
|---|---|---|---|
| hsa05200 | Pathways in cancer | 7.835E-13 | 20 |
| hsa05167 | Kaposi’s sarcoma-associated herpesvirus infection | 1.406E-10 | 13 |
| hsa05161 | Hepatitis B | 2.393E-10 | 12 |
| hsa05168 | Herpes simplex infection | 3.262E-10 | 13 |
| hsa05203 | Viral carcinogenesis | 9.155E-10 | 13 |
| hsa04520 | Adherens junction | 1.430E-09 | 9 |
| hsa05215 | Prostate cancer | 7.949E-09 | 9 |
| hsa04024 | cAMP signaling pathway | 9.423E-09 | 12 |
| hsa05205 | Proteoglycans in cancer | 1.249E-08 | 12 |
| hsa01522 | Endocrine resistance | 1.915E-08 | 9 |
FIGURE 3. The GO enrichment analysis. The top 10 terms from the GO enrichment analysis, ranked by p-value, are shown as a bar chart. The GO term names are listed on the y-axis. The count for each term is shown as the length of its bar. The different colors represent the different p-values.
FIGURE 4. The KEGG enrichment analysis. The top 10 pathways from the KEGG enrichment analysis, ranked by p-value, are shown as a bar chart. The KEGG pathway names are listed on the y-axis. The count for each pathway is shown as the length of its bar. The different colors represent the different p-values.
The performance of the high-risk breast cancer classification model.
| | ACC | Precision | Recall | F-measure |
|---|---|---|---|---|
| Normal and TNBC | 95.394% | 88.889% | 100% | 94.118% |
| II and III | 86.597% | 80.952% | 100% | 89.474% |
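As a quick consistency check on the table, the F-measure can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The sketch below assumes that definition is the one used in the study; the function name is hypothetical, and the inputs are the rounded percentages from the table, so agreement is only expected to within rounding.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table, converted to fractions.
reported = {
    "Normal and TNBC": (0.88889, 1.0),
    "II and III": (0.80952, 1.0),
}
recomputed = {name: 100 * f_measure(p, r) for name, (p, r) in reported.items()}
```

Both recomputed values fall within 0.01 percentage points of the F-measure column.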