
Identification of thyroid carcinoma related genes with mRMR and shortest path approaches.

Yaping Xu, Yue Deng, Zhenhua Ji, Haibin Liu, Yueyang Liu, Hu Peng, Jian Wu, Jingping Fan.

Abstract

Thyroid cancer is a malignant neoplasm originating from thyroid cells. It can be classified into papillary carcinomas (PTCs) and anaplastic carcinomas (ATCs). Although ATCs are very aggressive and cause more deaths than PTCs, the difference between the two is poorly understood at the molecular level. In this study, we focused on the transcriptome differences among PTCs, ATCs and normal tissue in a published dataset comprising 45 normal tissues, 49 PTCs and 11 ATCs. By applying a machine learning method, maximum relevance minimum redundancy (mRMR), we identified 9 genes (BCL2, MRPS31, ID4, RASAL2, DLG2, MYO1B, ZBTB5, PRKCQ and PPP6C) and 1 miscRNA (miscellaneous RNA, LOC646736) as important candidates involved in the progression of thyroid cancer. We further identified the protein-protein interaction (PPI) sub-network formed by the shortest paths among the 9 genes in a PPI network constructed from the STRING database. Our results may provide insights into the molecular mechanism of thyroid cancer progression.

Year:  2014        PMID: 24718460      PMCID: PMC3981740          DOI: 10.1371/journal.pone.0094022

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Thyroid tumors include encapsulated benign tumors and carcinomas, and carcinomas can be classified into papillary carcinomas (PTC) and anaplastic carcinomas (ATC). Although ATC is rare (<5%), it is a very aggressive form of thyroid carcinoma: it is responsible for about half of thyroid carcinoma deaths, and its patients have a short survival time after diagnosis (6 months on average) [1]. ATC evolves from PTC, and the two share genetic alterations [2]. However, only a limited number of studies have reported their differences at the transcriptome level [2]–[5], so a systematic analysis of this tumor evolution is lacking. To gain insight into the progression of thyroid carcinomas at the systems level, we adopted a two-step computational strategy [6]. First, using an effective machine learning method, mRMR (maximum relevance, minimum redundancy), we identified genes responsible for the progressive transcriptome differences among normal tissue, PTC and ATC, based on the mRNA microarray data from Hebrant et al.'s study [5]. mRMR does not only identify genes with an independent effect, but also takes the redundancy among the selected genes into account. In addition to the pipeline used by Li et al. [6], we applied several validation methods (leave-one-out validation, 10-fold cross validation and stratified 10-fold cross validation) to determine the number of genes that separate the three tissue statuses, because a single validation method may give a biased estimate of the prediction accuracy of the machine learning model. Second, we addressed the function of these genes at the systems level by integrating known protein-protein interactions (PPI) from the STRING database, which allowed us to reveal a network of shortest paths among the genes within a background PPI network.

Materials and Methods

Transcriptome Array Dataset

We adopted the gene expression data of thyroid cancer from Hebrant et al.'s study [5], which include the transcriptome array data of 11 anaplastic thyroid carcinomas (ATC), 49 papillary thyroid carcinomas (PTC) and 45 normal thyroids (Normal), measured on the Affymetrix Human Genome U133 Plus 2.0 Array. The dataset was retrieved from the NCBI Gene Expression Omnibus (GEO) under accession number GSE33630. The array platform contains 54,675 probes corresponding to 20,283 protein-coding genes. The array signals were normalized with RMA using the Affymetrix Bioconductor package. As the expression value of a gene, we used the average of the normalized signals of its corresponding probes.
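The probe-to-gene averaging described above can be sketched as follows. This is a minimal illustration, not the authors' code: the probe identifiers, gene names and signal values are hypothetical, and the input is assumed to be already RMA-normalized.

```python
from collections import defaultdict


def average_probes_per_gene(probe_values, probe_to_gene):
    """Average normalized probe signals that map to the same gene.

    probe_values:  dict probe_id -> list of per-sample normalized signals
    probe_to_gene: dict probe_id -> gene symbol (unmapped probes are skipped)
    """
    grouped = defaultdict(list)
    for probe_id, signals in probe_values.items():
        gene = probe_to_gene.get(probe_id)
        if gene is not None:
            grouped[gene].append(signals)

    gene_values = {}
    for gene, rows in grouped.items():
        n_probes = len(rows)
        # Column-wise mean: one averaged value per sample.
        gene_values[gene] = [sum(column) / n_probes for column in zip(*rows)]
    return gene_values


# Hypothetical probes and signals for two samples.
probe_values = {
    "probe_1": [2.0, 4.0],
    "probe_2": [4.0, 8.0],
    "probe_3": [1.0, 1.0],
    "probe_4": [9.0, 9.0],  # has no gene annotation, so it is ignored
}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}
gene_values = average_probes_per_gene(probe_values, probe_to_gene)
```

Here `GENE_A` receives the mean of its two probes for each sample, while `GENE_B` keeps the values of its single probe.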

STRING PPI data

The PPI data were retrieved from the STRING database (version 9.0) (http://string.embl.de/) [7]. The PPI data include both known and predicted protein interactions. We constructed a PPI network based on the STRING PPI data using the R package ‘igraph’ [8]. In the network, proteins are represented as nodes and protein-protein interactions as edges.

The mRMR algorithm

We used the mRMR (maximum relevance minimum redundancy) method to define a gene set that can separate the three sample sets (ATC, PTC and Normal). mRMR was first applied to microarray data by Peng et al. [9]. Its idea is to rank features according to their relevance to the target variable while also taking the redundancy among the features into consideration, so that the genes in the selected set offer the best trade-off between maximum relevance to the phenotype and minimum redundancy with one another. We quantified both relevance and redundancy using mutual information (MI), defined in equation (1),

$$I(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \, dx \, dy \qquad (1)$$

where p(x, y) is the joint probability density of vectors x and y, and p(x) and p(y) are the marginal probability densities. The relevance D between a gene f and the target variable c is defined as

$$D = I(f, c) \qquad (2)$$

and the redundancy R between gene f and the genes in gene set T is defined as

$$R = \frac{1}{m} \sum_{f_i \in T} I(f, f_i) \qquad (3)$$

where m is the number of genes in T. The trade-off between relevance and redundancy is obtained by choosing, among the genes not yet in T, the gene f that attains

$$\max_{f \notin T} \left[ I(f, c) - \frac{1}{m} \sum_{f_i \in T} I(f, f_i) \right] \qquad (4)$$

Repeating this calculation selects, under the mRMR criterion, a gene set of a given size N that distinguishes the target variables. The number N is determined by incremental feature selection (IFS), which compares the prediction accuracy (defined in the following sections) across different values of N and chooses the one with the highest accuracy.
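The greedy selection described above can be illustrated with a short sketch. It is not the authors' implementation: it assumes the expression values have already been discretized (the text does not state how), so that mutual information reduces to a sum over observed value pairs, and the gene names and data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


def mrmr_rank(features, target, n_select):
    """Greedy mRMR: repeatedly add the feature maximizing relevance - redundancy."""
    relevance = {name: mutual_information(values, target)
                 for name, values in features.items()}
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]  # first pick: relevance only
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < n_select:
        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical discretized data: gene_a_copy duplicates gene_a, gene_b is weaker
# on its own but less redundant with gene_a.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "gene_a":      [0, 0, 0, 1, 1, 1, 1, 1],
    "gene_a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "gene_b":      [0, 0, 1, 1, 0, 1, 1, 1],
}
ranking = mrmr_rank(features, target, 3)
```

In this toy case the duplicate of the top gene is ranked last even though its relevance is high, because its redundancy with the already selected gene is maximal; this is the behavior the trade-off in equation (4) is meant to produce.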

Prediction of phenotypes

We used the widely used Nearest Neighbor Algorithm (NNA) to predict the target variable [10]. “Nearness” is calculated as

$$N(x_1, x_2) = 1 - \frac{x_1 \cdot x_2}{\|x_1\| \, \|x_2\|}$$

where x1 and x2 are two vectors of genes representing two samples, x1 · x2 is their inner product and ‖x‖ is the norm of vector x. The smaller N(x1, x2) is, the more similar the two samples are [11], [12].
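A minimal sketch of nearest-neighbor prediction follows, assuming that "nearness" is one minus the cosine similarity of the two sample vectors, which is consistent with the statement that smaller values mean more similar samples. The training vectors and labels are hypothetical.

```python
import math


def nearness(x1, x2):
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return 1.0 - dot / (norm1 * norm2)


def predict_nearest_neighbor(train_vectors, train_labels, query):
    """Return the label of the training sample with the smallest nearness."""
    best_index = min(range(len(train_vectors)),
                     key=lambda i: nearness(query, train_vectors[i]))
    return train_labels[best_index]


# Hypothetical two-gene expression vectors, one per tissue status.
train_vectors = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]
train_labels = ["Normal", "PTC", "ATC"]
prediction = predict_nearest_neighbor(train_vectors, train_labels, [0.9, 0.2])
```

Because the measure depends only on the direction of the vectors, samples that differ by a constant scaling factor are treated as identical.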

Model Validation

In Li et al.'s study [6], leave-one-out validation was applied to estimate the prediction accuracy. Although the advantages of this validation method have been explained in some studies [6], [13], other theoretical studies have demonstrated that leave-one-out validation gives biased accuracy estimates in many circumstances [14], [15]. To provide more information on the accuracy of the prediction model and a more reliable estimate of the number of genes needed to separate the tumor statuses, we applied two additional validation methods: 10-fold cross validation [14] and, because the samples are stratified by tumor status (normal, PTC and ATC), stratified 10-fold cross validation [15].
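The three validation schemes differ only in how samples are assigned to test folds. The sketch below shows one way to generate those folds for the 45/49/11 class sizes of this dataset; the function names, the random seed and the round-robin assignment are illustrative assumptions rather than the procedure actually used.

```python
import random
from collections import defaultdict


def leave_one_out_indices(n_samples):
    """Each sample forms its own test fold."""
    return [[i] for i in range(n_samples)]


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the samples and deal them into k folds, ignoring class labels."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def stratified_k_fold_indices(labels, k, seed=0):
    """Deal the samples of each class round-robin so folds keep class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    offset = 0  # continue the round-robin across classes to balance fold sizes
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for j, index in enumerate(members):
            folds[(offset + j) % k].append(index)
        offset += len(members)
    return folds


labels = ["Normal"] * 45 + ["PTC"] * 49 + ["ATC"] * 11
loo_folds = leave_one_out_indices(len(labels))
plain_folds = k_fold_indices(len(labels), 10)
stratified_folds = stratified_k_fold_indices(labels, 10)
atc_per_fold = [sum(labels[i] == "ATC" for i in fold) for fold in stratified_folds]
```

With only 11 ATC samples, a plain 10-fold split can by chance produce folds with no ATC sample at all, whereas the stratified split places one or two ATC samples in every fold.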

Shortest paths tracing

Genes do not function in isolation, but through their interactions with other genes and with environmental factors. A protein-protein interaction (PPI) network can provide insight into the biological system as a whole. We sought such insight by searching for the shortest paths that link the genes selected using mRMR and IFS in the PPI network constructed from the STRING PPI data. The shortest paths were computed using Dijkstra's algorithm [16].
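The shortest-path tracing can be sketched in a few lines. The study used the R package ‘igraph’; the Python version below is only an illustration of the same idea on a hypothetical toy network. Edge weights are an assumption here, since the text does not say whether the STRING edges were weighted.

```python
import heapq
from itertools import combinations


def dijkstra_path(graph, source, target):
    """Return one shortest path from source to target, or None if unreachable.

    graph: dict node -> dict neighbor -> non-negative edge weight
    """
    dist = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1]


def intermediates_on_shortest_paths(graph, seeds):
    """Collect non-seed nodes lying on a shortest path between any two seeds."""
    found = set()
    for a, b in combinations(sorted(seeds), 2):
        path = dijkstra_path(graph, a, b)
        if path:
            found.update(node for node in path[1:-1] if node not in seeds)
    return found


# Hypothetical undirected toy network (not real STRING interactions).
edges = [("seed_A", "mid_X", 1), ("mid_X", "seed_B", 1), ("seed_B", "mid_Y", 1),
         ("mid_Y", "seed_C", 1), ("seed_A", "far_Z", 3), ("far_Z", "seed_C", 3)]
graph = {}
for u, v, w in edges:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w

seeds = {"seed_A", "seed_B", "seed_C"}
shortest_path_nodes = intermediates_on_shortest_paths(graph, seeds)
```

In the toy network, `far_Z` connects two seeds but is not reported because the route through it is longer than the alternative; only nodes on a shortest path are kept.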

Enrichment analysis

GO (Gene Ontology) term enrichment and KEGG pathway enrichment analyses were performed using the DAVID tools [17]. For each functional or pathway term, we estimated the P value, the P value corrected with the Benjamini multiple-testing correction (which controls the false discovery rate), and the fold enrichment.
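For readers unfamiliar with these two quantities, the following sketch shows how they are commonly computed. It assumes that the Benjamini correction refers to the Benjamini-Hochberg step-up procedure and that fold enrichment is the ratio of the term's frequency in the gene list to its frequency in the background; all numbers are hypothetical and are not taken from the tables in this paper.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def fold_enrichment(hits_in_list, list_size, hits_in_background, background_size):
    """Term frequency in the gene list divided by its background frequency."""
    return (hits_in_list / list_size) / (hits_in_background / background_size)


# Hypothetical raw P values for four terms.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)

# Hypothetical counts: 4 of 25 list genes vs. 100 of 20,000 background genes.
example_fold = fold_enrichment(4, 25, 100, 20000)
```

The adjusted values depend on how many terms were tested in total, so a correction recomputed from only the terms shown in a results table would not reproduce the values reported by the tool.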

Results

Ten candidate genes identified by mRMR, NNA and IFS

Based on the mRMR ranking, we tested the NNA predictor described in the Materials and Methods section with one feature, two features, and so on up to 400 features. Figure 1 shows the IFS curves, i.e. the prediction accuracy estimated by leave-one-out, 10-fold and stratified 10-fold cross validation plotted against the number of features. Although the estimated accuracies differ among the three validation methods, the minimum number of genes required to separate the tumor statuses is approximately the same, about 9 or 10 (Figure 1 and Table S1). We selected 10 genes to include more candidates for further analysis; with these, the accuracy was 0.848, 0.857 and 0.877 for leave-one-out, 10-fold and stratified 10-fold cross validation, respectively. The top 10 genes selected using mRMR include 9 known genes (BCL2, MRPS31, ID4, RASAL2, DLG2, MYO1B, ZBTB5, PRKCQ, PPP6C) and a miscRNA (miscellaneous RNA, LOC646736) (Table 1). Interestingly, the 10 candidate genes do not overlap with the 9 genes differentially expressed between ATC and PTC identified in Hebrant et al.'s study. One possible reason is that our analysis considered the transcriptome differences among normal tissue, PTC and ATC together.
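The IFS procedure behind such a curve can be sketched as follows: for each N, the top N ranked features are used by a nearest-neighbor predictor and the accuracy is recorded. This is an illustrative sketch on hypothetical toy data using leave-one-out validation and one minus cosine similarity as the distance; it is not the authors' code.

```python
import math


def cosine_distance(x1, x2):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return 1.0 - dot / (norm1 * norm2)


def loo_accuracy(samples, labels, feature_idx):
    """Leave-one-out accuracy of a nearest-neighbor predictor on chosen features."""
    correct = 0
    for i, query in enumerate(samples):
        q = [query[j] for j in feature_idx]
        best_label, best_dist = None, float("inf")
        for k, other in enumerate(samples):
            if k == i:
                continue  # the held-out sample cannot be its own neighbor
            d = cosine_distance(q, [other[j] for j in feature_idx])
            if d < best_dist:
                best_dist, best_label = d, labels[k]
        correct += best_label == labels[i]
    return correct / len(samples)


def ifs_curve(samples, labels, ranked_features):
    """Accuracy obtained with the top 1, top 2, ... ranked features."""
    return [loo_accuracy(samples, labels, ranked_features[:n])
            for n in range(1, len(ranked_features) + 1)]


# Hypothetical expression values (rows: samples, columns: ranked features).
samples = [
    [5.0, 1.0, 3.0], [5.2, 1.1, 2.0],   # Normal
    [1.0, 5.0, 3.1], [1.2, 5.1, 2.1],   # PTC
    [3.0, 3.0, 9.0], [3.1, 2.9, 8.5],   # ATC
]
labels = ["Normal", "Normal", "PTC", "PTC", "ATC", "ATC"]
curve = ifs_curve(samples, labels, [0, 1, 2])
best_n = curve.index(max(curve)) + 1  # smallest N reaching the best accuracy
```

Note that with a single feature a cosine-based distance cannot distinguish samples, because all one-dimensional vectors of the same sign are parallel; the curve only becomes informative from two features onward.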
Figure 1

IFS curve of the classification of ATCs, PTCs and normal tissue samples.

The X-axis indicates the number of genes used for classification/prediction, and the Y-axis shows the prediction accuracy of NNA evaluated using leave-one-out (orange), 10-fold (green) and stratified 10-fold (blue) cross validation.

Table 1

The 10 Genes selected using mRMR and IFS.

Gene Name    Entrez Gene ID    mRMR score
BCL2         596               1.09662945
MRPS31       10240             0.222372096
ID4          3400              0.32164204
RASAL2       9462              0.390513354
DLG2         1740              0.334284222
MYO1B        4430              0.354486787
ZBTB5        9925              0.384452316
LOC646736                      0.339571667
PRKCQ        5588              0.359410448
PPP6C        5537              0.340892868


Shortest path genes

We constructed an undirected network from the STRING PPI data using ‘igraph’ [8]. We then traced the shortest path between each pair of the 9 candidate genes identified using mRMR in the PPI network, using Dijkstra's algorithm [16]. There are 16 genes located on the shortest paths among the 9 candidate genes; we list them according to their betweenness in the PPI sub-network composed of the shortest paths (Table 2 and Figure 2).
Table 2

Proteins selected on the shortest paths among the mRMR selected proteins.

Ensembl Gene ID    Ensembl Protein ID    Associated Gene Name    Betweenness
ENSG00000091831    ENSP00000206249       ESR1                    5
ENSG00000010610    ENSP00000011653       CD4                     4
ENSG00000150991    ENSP00000344818       UBC                     4
ENSG00000143933    ENSP00000272298       CALM2                   3
ENSG00000132170    ENSP00000287820       PPARG                   3
ENSG00000029363    ENSP00000031135       BCLAF1                  2
ENSG00000050820    ENSP00000162330       BCAR1                   2
ENSG00000100906    ENSP00000216797       NFKBIA                  2
ENSG00000106588    ENSP00000223321       PSMA2                   2
ENSG00000112365    ENSP00000230122       ZBTB24                  2
ENSG00000115956    ENSP00000234313       PLEK                    2
ENSG00000141510    ENSP00000269305       TP53                    2
ENSG00000204519    ENSP00000282296       ZNF551                  2
ENSG00000154342    ENSP00000284523       WNT3A                   2
ENSG00000158092    ENSP00000288986       NCK1                    2
ENSG00000147044    ENSP00000367408       CASK                    2
ENSG00000074071    ENSP00000380531       MRPS34                  2
Figure 2

17 shortest paths genes among the 9 genes identified with mRMR methods.

We identified 17 genes located on the shortest paths of STRING PPI network among the 9 mRMR identified genes. Genes in blue are those identified with mRMR methods, and genes in red are located on their shortest paths. The network is constructed based on STRING PPI data.


Enrichment of the 9 candidate genes and shortest paths genes

Using the DAVID tools [17], we analyzed the functional enrichment of the 9 candidate genes together with the 16 shortest-path genes, for KEGG pathways and GO terms separately. For KEGG enrichment, the 25 genes are enriched in 7 KEGG pathways, listed with their P values and fold enrichment values in Table 3. Interestingly, most of these pathways are important cancer-related pathways, such as T cell receptor signaling pathway, apoptosis, pathways in cancer, small cell lung cancer, prostate cancer, and thyroid cancer. T cell receptor (TCR) activation promotes several important signals that determine cell fate by regulating cytokine production, cell survival, proliferation, and differentiation. T cells are especially important in cell-mediated immunity, which is the defense against tumor cells. The functions of TCR in cancer are reviewed in more detail in [18]. Moreover, the thyroid cancer pathway was also found to be enriched in the set of 25 genes. For GO term enrichment, 262 GO terms are enriched (Table S2). Several of them are related to cancer progression, such as GO:0042127 regulation of cell proliferation, GO:0042980 regulation of apoptosis and GO:0043067 regulation of programmed cell death. These results provide circumstantial evidence supporting our data analysis pipeline.
Table 3

KEGG pathway enrichment of the 25 genes selected on the shortest paths.

Term                                 Gene Count    P Value        Fold Enrichment
T cell receptor signaling pathway    4             0.002282004    13.45238095
Neurotrophin signaling pathway       4             0.003385035    11.71658986
Pathways in cancer                   5             0.007626354    5.536803136
Small cell lung cancer               3             0.018690317    12.97193878
Apoptosis                            3             0.019971101    12.52463054
Prostate cancer                      3             0.02084525     12.24317817
Thyroid cancer                       2             0.071736805    25.04926108

Discussion

Genes identified by mRMR and IFS

We identified 9 genes (BCL2, MRPS31, ID4, RASAL2, DLG2, MYO1B, ZBTB5, PRKCQ and PPP6C) and a miscRNA, LOC646736, related to thyroid carcinoma in this study. Many of them are already known to be important in thyroid development or cancer progression. BCL2, B-cell CLL/lymphoma 2, is a protein-coding gene that prevents cell apoptosis and is found in many eukaryotic species. In our mRMR results, it has the highest mRMR score (1.097), suggesting that it is the most important feature for separating ATC, PTC and normal tissues. Damage to BCL2 has been identified as a cause of a number of cancers, including ovarian [19], breast [20] and prostate [21] cancers and chronic lymphocytic leukemia [22]. BCL2 has also been found to be differentially expressed between PTCs and normal tissues [23], and genetic variants in BCL2 could contribute to the risk of thyroid cancer [24]. Inhibitor of DNA binding/Inhibitor of differentiation 4 (ID4) is a critical factor for cell proliferation and differentiation in normal vertebrate development [25]. Its protein belongs to a family of helix-loop-helix (HLH) proteins (ID1, ID2, ID3 and ID4). ID proteins contain an HLH domain that enables interaction with other basic HLH (bHLH) proteins, and they act as dominant negative inhibitors of gene transcription [26]. Members of the ID gene family have critical roles in the tumorigenesis of thyroid cancer. For example, ID1 regulates growth and differentiation in thyroid cancer cells [27], and ID3 was identified as an early response protein and tumor marker for thyroid carcinomas [26]. ID4 is the most recently discovered member of the ID genes; it is mainly expressed in thyroid and several other tissues [28], and a previous study has already reported it as a marker in breast cancer [25].

Genes identified on PPI shortest paths

ESR1, estrogen receptor 1, is the gene with the largest betweenness in the PPI network of shortest paths. It encodes estrogen receptor alpha (ERα), which, together with ERβ, mediates the interaction between estrogens and their target sites. ERα and ERβ are both expressed in thyroid cancer cells, and the proliferation of thyroid cancer cells is promoted by an ERα agonist and reduced by enhanced expression of ERβ or by an ERβ agonist [29]. Polymorphisms in ESR are also involved in oncogenesis in several tissues (e.g. breast, prostate, ovary and thyroid), and may alter the responsiveness of these tissues to estrogens [30]–[33]. PPARG, peroxisome proliferator-activated receptor gamma, encodes a member of the peroxisome proliferator-activated receptor (PPAR) subfamily of nuclear receptors. It is a regulator of adipocyte differentiation, and has been implicated in the pathology of numerous diseases. Alterations of PPARG have been discovered in a large number of thyroid cancer samples, such as PAX8/PPARG fusion oncogene in follicular thyroid carcinoma and PTCs [34]–[37], and another PPARG agonist (RS54444) in ATCs [38].

Conclusion

In this study, we focused on the transcriptome changes during the progression of thyroid cancer by applying a machine learning method to identify candidate genes separating three thyroid statuses: normal, PTC and ATC. The transcriptome data include 11 ATCs, 49 PTCs and 45 normal tissues. We identified 9 genes (BCL2, MRPS31, ID4, RASAL2, DLG2, MYO1B, ZBTB5, PRKCQ and PPP6C) and a miscRNA (LOC646736) related to thyroid cancer progression, in addition to the genes identified previously [5]. We further revealed the PPI network of the proteins encoded by these genes by estimating the shortest paths among them in a background PPI network constructed from the STRING database. Our results may provide important insights into the mechanism of thyroid cancer progression at the transcriptome level.

Supporting Information

Table S1. Prediction accuracy of three validation methods. (XLSX)

Table S2. GO enrichment of the 25 genes on the shortest paths. (TXT)
References (34 in total)

1.  Study of gene expression in thyrotropin-stimulated thyroid cells by cDNA expression array: ID3 transcription modulating factor as an early response protein and tumor marker in thyroid carcinomas.

Authors:  S Deleu; V Savonet; J Behrends; J E Dumont; C Maenhaut
Journal:  Exp Cell Res       Date:  2002-09-10       Impact factor: 3.905

2.  Predicting eukaryotic protein subcellular location by fusing optimized evidence-theoretic K-Nearest Neighbor classifiers.

Authors:  Kuo-Chen Chou; Hong-Bin Shen
Journal:  J Proteome Res       Date:  2006-08       Impact factor: 4.466

3.  DNA polymorphism in B-domain of the estrogen receptor-alpha among Japanese women.

Authors:  J Fujimoto; R Hirose; S Ichigo; H Sakaguchi; T Tamaya
Journal:  Steroids       Date:  1998-03       Impact factor: 2.668

4.  Genotype distribution of estrogen receptor-alpha gene polymorphisms in Italian women with surgical uterine leiomyomas.

Authors:  F Massart; L Becherini; L Gennari; V Facchini; A R Genazzani; M L Brandi
Journal:  Fertil Steril       Date:  2001-03       Impact factor: 7.329

5.  Molecular profiling related to poor prognosis in thyroid carcinoma. Combining gene expression data and biological information.

Authors:  C Montero-Conde; J M Martín-Campos; E Lerma; G Gimenez; J L Martínez-Guitarte; N Combalía; D Montaner; X Matías-Guiu; J Dopazo; A de Leiva; M Robledo; D Mauricio
Journal:  Oncogene       Date:  2007-09-17       Impact factor: 9.867

Review 6.  The role of the PAX8/PPARgamma fusion oncogene in the pathogenesis of follicular thyroid cancer.

Authors:  Norman L Eberhardt; Stefan K G Grebe; Bryan McIver; Honey V Reddi
Journal:  Mol Cell Endocrinol       Date:  2009-10-31       Impact factor: 4.102

7.  An estrogen receptor genetic polymorphism and a history of spontaneous abortion--correlation in women with estrogen receptor positive breast cancer but not in women with estrogen receptor negative breast cancer or in women without cancer.

Authors:  S P Lehrer; R K Schmutzler; J M Rabin; B S Schachter
Journal:  Breast Cancer Res Treat       Date:  1993       Impact factor: 4.872

8.  cDNA cloning, tissue distribution and chromosomal localization of the human ID4 gene.

Authors:  M Rigolet; T Rich; M S Gross-Morand; D Molina-Gomes; E Viegas-Pequignot; C Junien
Journal:  DNA Res       Date:  1998-10-30       Impact factor: 4.458

9.  Some remarks on protein attribute prediction and pseudo amino acid composition.

Authors:  Kuo-Chen Chou
Journal:  J Theor Biol       Date:  2010-12-17       Impact factor: 2.691

10.  mRNA expression in papillary and anaplastic thyroid carcinoma: molecular anatomy of a killing switch.

Authors:  Aline Hébrant; Geneviève Dom; Michael Dewaele; Guy Andry; Christophe Trésallet; Emmanuelle Leteurtre; Jacques E Dumont; Carine Maenhaut
Journal:  PLoS One       Date:  2012-10-24       Impact factor: 3.240
