
LncRNA functional annotation with improved false discovery rate achieved by disease associations.

Yongheng Wang^1,2, Jincheng Zhai^1, Xianglu Wu^3, Enoch Appiah Adu-Gyamfi^2, Lingping Yang^3, Taihang Liu^1,2, Meijiao Wang^2, Yubin Ding^2,3, Feng Zhu^4, Yingxiong Wang^2, Jing Tang^1,2.

Abstract

Long non-coding RNAs (lncRNAs) play critical roles in various biological processes and are associated with many diseases. Functional annotation of lncRNAs in diseases has attracted great attention as a means of understanding disease etiology. However, traditional co-expression-based analysis usually produces a large number of false positive function assignments. It is thus crucial to develop a new approach that achieves a lower false discovery rate in the functional annotation of lncRNAs. Here, a novel strategy termed DAnet, which combines disease associations with the cis-regulatory network between lncRNAs and neighboring protein-coding genes, was developed, and its performance was systematically compared with that of the traditional differential expression-based approach. In a gold standard analysis based on experimentally validated lncRNAs, the proposed strategy identified experimentally validated lncRNAs better than the traditional approach. Moreover, the majority of biological pathways (40%-100%) identified by DAnet have been reported to be associated with the studied diseases. In sum, DAnet is expected to be useful for identifying the function of specific lncRNAs in a particular disease or in multiple diseases.


Keywords: Coefficient of variation; Disease-associated SNPs; Functional prediction; Long non‐coding RNA; WGCNA

Year: 2021 PMID: 35035785 PMCID: PMC8724965 DOI: 10.1016/j.csbj.2021.12.016

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Introduction

Long non-coding RNA (lncRNA) is broadly defined as a type of non-coding RNA longer than 200 nucleotides [1]. Substantial evidence has shown that lncRNAs carry out diverse functions in biological processes [2] and are associated with many diseases [3], such as cancers [4], cardiovascular diseases [5], neurodegenerative diseases [6], metabolic diseases [7], and inflammatory diseases [8]. Many computational methods for predicting lncRNA function have been developed [9], for instance, differential expression analysis (DEA) combined with weighted correlation network analysis (WGCNA) [10]. This method has frequently been employed to identify co-regulatory relationships between lncRNAs and mRNAs in polycystic ovary syndrome [11] and to discover cis-regulatory lncRNAs involved in vascular inflammation [12]. However, analysis based on co-expression usually results in a large number of false positive function assignments [9]. In addition, the experimentally supported lncRNA-disease association data available in the literature are quite limited [13]. Specifically, only about 6,000 of over 90,000 human lncRNAs have been experimentally characterized as “disease-associated” [14], [15]. This may be attributed to the complex characteristics of lncRNAs, including their higher expression variability across disease conditions [16], [17], [18], the susceptibility of their expression and secondary structure to genetic variants [19], [20], [21], and the various levels (cis/trans) at which they regulate coding genes [2], [18]. Incorporating disease specificity into lncRNA functional annotation can improve the discovery of disease-associated lncRNAs [16]. In particular, lncRNA-disease associations can be established via single nucleotide polymorphisms (SNPs), a type of genetic variant, located within lncRNAs [16] and via condition-specific analysis based on the coefficient of variation (CV) [17], [22].
Moreover, many lncRNAs have been reported to regulate the expression of their neighboring genes (i.e., to act in cis) [23], [24], [25]. The co-expression of cis-regulatory lncRNAs and their neighboring protein-coding genes has led to the discovery of functional lncRNAs in given diseases [26]. It is therefore crucial to develop a new approach that integrates disease associations to obtain a lower false discovery rate (FDR) [16]. In this study, a novel strategy termed DAnet, which combines disease associations with the cis-regulatory network, was developed. First, disease-associated SNPs were integrated to screen for disease-associated lncRNAs. Then, the CV of these lncRNAs was estimated to assess their condition-specific expression in a specific disease. Next, a WGCNA-based co-expression network between the lncRNAs and their neighboring protein-coding genes was constructed, and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis was conducted to identify the function of the lncRNAs involved. Finally, experimentally verified lncRNA-disease associations were curated to evaluate the performance of the proposed strategy across 24 datasets covering eight types of disease according to the ICD-11 classification. Overall, the findings of this study can facilitate the discovery of disease-associated lncRNAs and their functions in specific diseases.

Methods

Collection of the benchmark datasets for the analysis

For the functional analysis of lncRNAs in different types of disease, a variety of microarray and RNA-seq data were collected by searching disease names in the Gene Expression Omnibus (GEO) [27] and The Cancer Genome Atlas (TCGA) [28]. Datasets were selected according to the following criteria: (1) the gene expression profiling was conducted using high-throughput sequencing or lncRNA microarray for “Homo sapiens”; (2) the dataset consisted of patient and control groups; (3) the raw or normalized data were available; (4) at least one lncRNA was identified by disease-associated SNPs; (5) experimentally validated disease-associated lncRNAs, obtained from five public databases (LncRNAWiki [29], LncRNADisease [14], LncRNA2Target [30], Lnc2Cancer [31], and EVLncRNAs [32]), were available for the disease; and (6) the datasets together covered multiple types of disease according to the ICD-11 classification. In total, 22 benchmark datasets were collected from GEO and two from TCGA, covering 16 diseases divided into 8 disease types according to the ICD-11 classification. The lncRNA and mRNA expression matrices obtained from these 24 case-control datasets were used for subsequent analysis. Table 1 lists the disease type (ICD-11 code), dataset ID, number of samples, expression unit, and number of lncRNAs and mRNAs for each dataset.

Table 1

Twenty-four datasets of eight disease types were collected for function analysis of lncRNA. The first 22 datasets were collected from GEO and the last two datasets were collected from TCGA. MDD: major depressive disorder; VHD: valvular heart disease; AF-VHD: valvular heart disease with atrial fibrillation; SLE: systemic lupus erythematosus; ALL: acute lymphoblastic leukemia; TPM: Transcripts Per Million; Normalized: DESeq normalized; nRPKM: normalized Reads Per Kilobase of transcript, per Million mapped reads; FPKM: Fragments Per Kilobase of exon per Million; RPKM: Reads Per Kilobase of transcript per Million reads mapped; Normalized signal intensity: Quantile normalization using the GeneSpring software.

Type of Disease (ICD-11 code)	Dataset ID	No. of samples in the dataset	Expression unit (experiment type)	No. of lncRNAs & mRNAs
8A20	GSE113524 [72]	19 Alzheimer disease; 20 Healthy controls	TPM (RNA-Seq)	12,937 lncRNAs & 18,969 mRNAs
8A20	GSE104704 [73]	12 Alzheimer disease; 10 Healthy controls	Normalized (RNA-Seq)	2,199 lncRNAs & 17,965 mRNAs
8A20	GSE125583 [74]	219 Alzheimer disease; 70 Healthy controls	nRPKM (RNA-Seq)	2,803 lncRNAs & 18,852 mRNAs
6A70	GSE101521 [75]	30 MDD; 29 Healthy controls	Normalized (RNA-Seq)	11,109 lncRNAs & 18,754 mRNAs
6A70	GSE102556 [76]	26 MDD; 22 Healthy controls	FPKM (RNA-Seq)	12,718 lncRNAs & 18,793 mRNAs
6A20	GSE112523 [77]	29 Schizophrenia; 28 Healthy controls	Reads Count (RNA-Seq)	12,179 lncRNAs & 18,437 mRNAs
BA41	GSE65705 [78]	32 Myocardial infarction; 2 Healthy controls	RPKM (RNA-Seq)	1,351 lncRNAs & 17,801 mRNAs
BA41	GSE127853 [79]	3 Myocardial infarction; 3 Healthy controls	FPKM (RNA-Seq)	503 lncRNAs & 10,216 mRNAs
BD40	GSE97210 [80]	3 Atherosclerosis; 3 Healthy controls	Normalized signal intensity (Microarray)	10,347 lncRNAs & 18,604 mRNAs
BD40	GSE120521 [81]	4 Atherosclerosis unstable; 4 Atherosclerosis stable	FPKM (RNA-Seq)	10,343 lncRNAs & 18,381 mRNAs
BC81	GSE113013 [27]	5 AF-VHD; 5 VHD	Normalized signal intensity (Microarray)	10,347 lncRNAs & 18,604 mRNAs
BC81	GSE108660 [27]	5 Atrial fibrillation; 5 Non-atrial fibrillation	Normalized signal intensity (Microarray)	8,090 lncRNAs & 18,807 mRNAs
CA23	GSE106388 [82]	15 Mild asthma; 4 Healthy controls	Reads Count (RNA-Seq)	8,036 lncRNAs & 17,244 mRNAs
CA23	GSE96783 [83]	21 Asthma; 30 Healthy controls	Reads Count (RNA-Seq)	10,451 lncRNAs & 18,324 mRNAs
DD71	GSE128682 [84]	14 Ulcerative colitis; 16 Healthy controls	Reads Count (RNA-Seq)	1,756 lncRNAs & 17,355 mRNAs
4A40	GSE131525 [85]	3 SLE; 3 Healthy controls	Reads Count (RNA-Seq)	6,031 lncRNAs & 16,972 mRNAs
5A10	GSE131526 [85]	12 Type-1 diabetes; 3 Healthy controls	Reads Count (RNA-Seq)	6,798 lncRNAs & 16,458 mRNAs
5B81	GSE129398 [86]	12 Obesity; 10 Controls	Reads Count (RNA-Seq)	822 lncRNAs & 14,300 mRNAs
5B81	GSE145412 [87]	8 Obesity; 8 Controls	TPM (RNA-Seq)	6,896 lncRNAs & 16,595 mRNAs
5A11	GSE133099 [27]	6 Type-2 diabetes; 6 Lean controls	Reads Count (RNA-Seq)	8,843 lncRNAs & 17,480 mRNAs
2B33	GSE141140 [88]	13 ALL; 4 Healthy controls	Reads Count (RNA-Seq)	867 lncRNAs & 16,297 mRNAs
2B91	GSE144259 [89]	6 Colorectal cancer; 3 Healthy controls	FPKM (RNA-Seq)	3,249 lncRNAs & 18,604 mRNAs
2C6Z	TCGA_BC [28]	115 Breast cancer; 113 Healthy controls	FPKM (RNA-Seq)	14,097 lncRNAs & 19,631 mRNAs
2D10	TCGA_TC [28]	510 Thyroid cancer; 58 Healthy controls	Reads Count (RNA-Seq)	13,618 lncRNAs & 19,493 mRNAs

Collection of the SNP-disease association data for the identification of potential disease-associated lncRNAs

The SNP-disease association data were collected and used to identify potential disease-associated lncRNAs. First, we collected the SNPs associated with the 16 diseases, together with their genomic locations, from three well-known sources: GRASP2 [33], the NHGRI-EBI GWAS Catalog [34], and GWASdb [35]. A significance level of p < 5.0 × 10^-8 is widely accepted in genome-wide association studies [34]. However, since many susceptibility loci may show only moderate significance in association analysis, a threshold of p < 1.0 × 10^-3 was applied when collecting the disease-associated SNPs [35]. Then, we downloaded the chromosomal coordinates of lncRNAs from GENCODE (v31, human reference genome hg38) [36] to map the disease-associated SNPs to lncRNA regions. In total, we collected 124,428 associations between 101,360 SNPs and the 16 diseases for further analysis, and 4,435 unique lncRNAs were found to be potentially associated with these diseases. Details on the number of disease-associated SNPs and lncRNAs are shown in Supplementary Table S1. Finally, we extracted the expression levels of these lncRNAs in each dataset from the raw lncRNA expression matrix; the number of lncRNAs extracted on the basis of disease-associated SNPs for each dataset is listed in Table 2.
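The mapping of disease-associated SNPs to lncRNA regions can be pictured as a simple interval lookup. The following is a minimal Python sketch and not the authors' implementation (their code is in Supplementary Method S2, which is not reproduced here). The function name, the tuple layouts, and the toy identifiers are hypothetical; coordinates are assumed to be 1-based and inclusive, and only the p < 1.0 × 10^-3 cutoff is taken from the text.

```python
from bisect import bisect_right
from collections import defaultdict


def map_snps_to_lncrnas(snps, lncrnas, p_cutoff=1e-3):
    """Return {lncRNA id: [SNP ids]} for SNPs that fall inside a lncRNA region.

    snps:    iterable of (snp_id, chrom, pos, p_value)   (hypothetical layout)
    lncrnas: iterable of (lnc_id, chrom, start, end), 1-based inclusive
    """
    # Group lncRNA intervals by chromosome and sort them by start position.
    by_chrom = defaultdict(list)
    for lnc_id, chrom, start, end in lncrnas:
        by_chrom[chrom].append((start, end, lnc_id))
    for intervals in by_chrom.values():
        intervals.sort()

    hits = defaultdict(list)
    for snp_id, chrom, pos, p_value in snps:
        if p_value >= p_cutoff:  # keep only SNPs below the significance cutoff
            continue
        intervals = by_chrom.get(chrom, [])
        # Only intervals that start at or before pos can contain the SNP.
        idx = bisect_right(intervals, (pos, float("inf"), ""))
        for start, end, lnc_id in intervals[:idx]:
            if end >= pos:
                hits[lnc_id].append(snp_id)
    return dict(hits)


# Toy data (all identifiers are made up for illustration).
toy_lncrnas = [
    ("LNC_A", "chr1", 100, 500),
    ("LNC_B", "chr1", 400, 900),
    ("LNC_C", "chr2", 100, 200),
]
toy_snps = [
    ("rs1", "chr1", 450, 1e-5),  # inside LNC_A and LNC_B
    ("rs2", "chr1", 950, 1e-6),  # outside every lncRNA
    ("rs3", "chr2", 150, 0.01),  # fails the p-value cutoff
    ("rs4", "chr2", 100, 5e-4),  # on the first base of LNC_C
]
toy_hits = map_snps_to_lncrnas(toy_snps, toy_lncrnas)
```

The lncRNAs that appear as keys of the result would then be the candidates whose expression is extracted from each dataset.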

Table 2

Optimization of KCV and CD across the datasets. When Nexp reached its maximum, the lowest KCV/CD achieving it was taken as the optimal value. Nexp: the number of experimentally verified lncRNAs; KCV: the number of top-ranked lncRNAs with the highest variability; CD: chromosome distance; NA: not available.

Disease name	Dataset ID	No. of lncRNAs in the dataset	No. of lncRNAs based on disease-associated SNPs	No. of experimentally verified lncRNAs	KCV cutoff	CD cutoff
Alzheimer disease	GSE113524	12,937	1,680	5	400	400 kb
Alzheimer disease	GSE104704	2,199	407	5	200	5 kb
Alzheimer disease	GSE125583	2,803	537	5	400	50 kb
Major depressive disorder	GSE101521	11,109	1,043	2	600	5 kb
Major depressive disorder	GSE102556	12,718	1,098	2	1,000	5 kb
Schizophrenia	GSE112523	12,179	917	3	300	5 kb
Myocardial infarction	GSE65705	1,351	35	2	35	100 kb
Myocardial infarction	GSE127853	503	16	2	16	NA
Atherosclerosis	GSE97210	10,347	163	1	100	NA
Atherosclerosis	GSE120521	10,343	120	1	100	5 kb
Atrial fibrillation	GSE113013	10,347	38	1	38	NA
Atrial fibrillation	GSE108660	8,090	33	1	33	NA
Asthma	GSE106388	8,036	291	2	200	5 kb
Asthma	GSE96783	10,451	352	2	100	5 kb
Lupus erythematosus	GSE131525	6,031	64	1	64	5 kb
Ulcerative colitis	GSE128682	1,756	20	1	20	70 kb
Type-1 diabetes mellitus	GSE131526	6,798	283	3	200	5 kb
Obesity	GSE129398	822	46	1	46	5 kb
Obesity	GSE145412	6,896	197	1	100	5 kb
Type-2 diabetes mellitus	GSE133099	8,843	1,075	5	600	5 kb
Acute lymphoblastic leukemia	GSE141140	867	12	1	12	NA
Colorectal cancer	GSE144259	3,249	43	6	43	300 kb
Breast cancer	TCGA_BC	14,097	528	12	500	5 kb
Thyroid cancer	TCGA_TC	13,618	8	1	8	NA

Detection of the expression variability of lncRNA by condition-specific expression

LncRNAs show higher expression variability in disease than in normal conditions. A lncRNA with relatively high expression variability may have a disease-related function, whereas one with relatively low variability may function under normal conditions [16], [22]. The CV is a standard measure of expression variability [16], [22] and is defined as “the ratio between the standard deviation of the lncRNA expression levels across the patients and its mean” [22]. In this study, we used the CV to assess the variability of potential disease-associated lncRNAs. The CV was calculated for each lncRNA across the disease samples, and lncRNAs with relatively high CV values were taken to be disease-associated. Specifically, we ranked the lncRNAs by CV from high to low and identified the top-ranked lncRNAs as the disease-associated ones. The number of top-ranked lncRNAs retained, KCV, was determined by the following optimization procedure. For each candidate KCV in each dataset, the number of experimentally validated lncRNAs among the top KCV lncRNAs (Nexp) was computed. When the number of lncRNAs identified by SNPs (Nsnp) was less than 100, KCV was set equal to Nsnp; otherwise, KCV ranged from 100 to Nsnp in steps of 100. The smallest KCV at which Nexp reached its maximum was taken as the optimal value.
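The CV ranking and the KCV optimization described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the sample standard deviation is assumed (the text does not say which estimator was used), lncRNAs with zero mean expression are assigned a CV of 0, and the candidate grid is assumed to stop at the largest multiple of 100 that does not exceed Nsnp.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = standard deviation / mean of expression across disease samples."""
    m = mean(values)
    # Assumption: a lncRNA with zero mean expression gets CV 0 and ranks last.
    return stdev(values) / m if m else 0.0


def rank_by_cv(expr):
    """expr: {lncRNA id: [expression per disease sample]} -> ids, highest CV first."""
    return sorted(expr, key=lambda g: coefficient_of_variation(expr[g]), reverse=True)


def candidate_k_values(n_snp, step=100):
    """K = Nsnp when Nsnp < 100, otherwise 100, 200, ... up to Nsnp."""
    if n_snp < step:
        return [n_snp]
    return list(range(step, n_snp + 1, step))


def optimal_k(ranked, verified, step=100):
    """Smallest K whose top-K list contains the most verified lncRNAs (Nexp)."""
    best_k, best_n_exp = None, -1
    for k in candidate_k_values(len(ranked), step):
        n_exp = len(set(ranked[:k]) & set(verified))
        if n_exp > best_n_exp:  # strict '>' keeps the smaller K on ties
            best_k, best_n_exp = k, n_exp
    return best_k, best_n_exp


# Toy usage with made-up identifiers: 250 ranked lncRNAs, two of them verified.
toy_ranked = [f"g{i}" for i in range(250)]
toy_best = optimal_k(toy_ranked, {"g5", "g150"})
```

In the toy example the grid is 100 and 200; the second verified lncRNA only enters the list at K = 200, so that value is selected.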

Construction of the cis-regulatory network based on lncRNAs’ neighboring genes

Co-expressed genes are more likely to be co-regulated and functionally associated, so identifying the co-expressed neighboring protein-coding genes can help in assigning lncRNA functions [16], [37], [38]. First, we collected the genomic locations of all 16,840 lncRNAs and 19,975 protein-coding genes from GENCODE (v31, human reference genome hg38) [36]. We then compiled 10 candidate chromosome distances (CDs) from publications reporting the genomic distance between lncRNAs and the neighboring genes they regulate: 5 kb [39], 10 kb [40], 20 kb [41], 50 kb [42], 70 kb [43], 100 kb [44], 200 kb [45], 300 kb [46], 400 kb [47], and 500 kb [12]. Second, based on the collected location information, we identified the neighboring genes within each CD upstream or downstream of every lncRNA. This yielded the set of neighboring genes for the disease-associated lncRNAs identified by SNPs and the optimal KCV. Third, for each dataset, we used WGCNA [10] to construct the co-expression network between the identified disease-associated lncRNAs and their neighboring genes at each CD. An optimization procedure was then performed to determine the optimal CD for each benchmark dataset: the number of experimentally validated lncRNAs among the lncRNAs co-expressed with neighboring genes (Nexp) was computed, and the smallest CD at which Nexp reached its maximum was regarded as optimal. Finally, for functional prediction, the co-expression network based on the optimal KCV and CD was constructed with WGCNA for each dataset. The network of the selected WGCNA module was visualized with Cytoscape 3.7.2 (http://www.cytoscape.org/) [48].
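The neighbor search for a given CD can be illustrated with a short Python sketch. It is a simplified illustration and not the authors' implementation: the function name and tuple layouts are hypothetical, strand is ignored, and a gene is assumed to count as a neighbor when any part of it overlaps the window extending CD base pairs upstream and downstream of the lncRNA body. The co-expression step itself (WGCNA, an R package) is not reproduced.

```python
# Candidate chromosome distances from the text, in kilobases.
CANDIDATE_CDS_KB = [5, 10, 20, 50, 70, 100, 200, 300, 400, 500]


def neighbors_within(lnc, genes, cd_bp):
    """Return ids of genes overlapping the window [start - cd, end + cd].

    lnc:   (chrom, start, end) of the lncRNA             (hypothetical layout)
    genes: iterable of (gene_id, chrom, start, end)
    cd_bp: chromosome distance in base pairs
    """
    chrom, start, end = lnc
    lo, hi = start - cd_bp, end + cd_bp
    return [
        gene_id
        for gene_id, g_chrom, g_start, g_end in genes
        if g_chrom == chrom and g_start <= hi and g_end >= lo
    ]


# Toy coordinates (made up for illustration).
toy_lnc = ("chr1", 100_000, 110_000)
toy_genes = [
    ("G1", "chr1", 112_000, 120_000),  # 2 kb downstream
    ("G2", "chr1", 150_000, 160_000),  # 40 kb downstream
    ("G3", "chr2", 100_000, 105_000),  # different chromosome
    ("G4", "chr1", 50_000, 96_000),    # ends 4 kb upstream
]
neighbors_by_cd = {
    cd: neighbors_within(toy_lnc, toy_genes, cd * 1_000) for cd in CANDIDATE_CDS_KB
}
```

Running the search over every candidate CD gives the per-distance gene sets on which the co-expression networks would then be built.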

Annotating the lncRNA function based on KEGG pathway

Groups of transcripts identified through clustering need to undergo functional enrichment analysis to help reveal the biological processes in which these genes are involved [16]. KEGG pathways [49] are widely used to characterize the function of disease-associated lncRNAs. Here, we performed KEGG enrichment analysis on the mRNAs that were co-expressed with disease-associated lncRNAs. The statistical significance of KEGG pathway enrichment was determined with the hypergeometric test, and a p value below 0.05 was taken to indicate significant enrichment. A chord diagram was constructed with the R package “circlize” [50] to illustrate the enrichment results.
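For readers unfamiliar with the hypergeometric enrichment test, the following Python sketch shows the usual one-sided (over-representation) form. It is an assumption that the authors used this upper-tail form; the text only names the test. The function and parameter names are hypothetical, and the gene counts in the example are made up.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """Upper-tail hypergeometric p value, P(X >= k).

    N: number of background genes
    K: background genes annotated to the pathway
    n: co-expressed mRNAs submitted for enrichment
    k: submitted mRNAs that are annotated to the pathway
    """
    upper = min(n, K)
    # Count the ways to draw i pathway genes and n - i non-pathway genes.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Toy example: 10 background genes, 4 in the pathway, 3 submitted, 2 of them hits.
toy_p = hypergeom_enrichment_p(k=2, n=3, K=4, N=10)
is_significant = toy_p < 0.05
```

In the toy case the p value is 40/120 = 1/3, so the pathway would not be called enriched at the 0.05 threshold used in the text.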

Evaluating the ability of DAnet on the function annotation of lncRNA

As a gold standard for verifying the DAnet analysis, 9,949 experimentally verified lncRNA-disease association pairs were integrated from five databases: LncRNAWiki [29], LncRNADisease [14], LncRNA2Target [30], Lnc2Cancer [31], and EVLncRNAs [32]. Two metrics, both based on experimentally validated disease-associated lncRNAs, were employed to evaluate the ability of DAnet to characterize the function of disease-associated lncRNAs: (1) the percentage of successful prediction (Rate) and (2) the enrichment factor (EF). The Rate (%) of DAnet and DEA (Supplementary Method S1) in characterizing the experimentally verified lncRNAs was the first metric. The EF compares the concentration of experimentally verified lncRNAs among the lncRNAs identified by DAnet/DEA with their concentration among all lncRNAs in the expression matrix. False discoveries can be effectively evaluated by fully considering the experimentally validated disease-associated lncRNAs [51]. The EF is given by:

$$\mathrm{EF} = \frac{N_{\mathrm{truesuc}} / N_{\mathrm{suc}}}{N_{\mathrm{true}} / N_{\mathrm{all}}}$$

where Ntruesuc denotes the number of experimentally verified lncRNAs successfully characterized as ‘disease-associated’ by DAnet or DEA; Nsuc is the number of lncRNAs characterized as ‘disease-associated’ by DAnet or DEA; Ntrue is the number of experimentally verified lncRNAs in the integrated experimentally verified lncRNA-disease associations; and Nall is the total number of lncRNAs in the expression matrix. An EF of no less than 1 indicates enrichment, and a larger EF value represents a lower FDR [51].
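As a worked illustration of the EF, the Python sketch below takes the EF to be the ratio (Ntruesuc / Nsuc) / (Ntrue / Nall), which is the reading implied by the definitions of the four counts. The function name and the zero-denominator handling are assumptions. The example inputs are read from Table 2 for GSE113013 (10,347 lncRNAs in the dataset, 38 SNP-based lncRNAs, one experimentally verified lncRNA); treating the 38 SNP-based lncRNAs as Nsuc is itself an assumption.

```python
def enrichment_factor(n_truesuc, n_suc, n_true, n_all):
    """EF = (n_truesuc / n_suc) / (n_true / n_all).

    n_truesuc: verified lncRNAs among those called disease-associated
    n_suc:     lncRNAs called disease-associated by the method
    n_true:    verified lncRNAs for the disease
    n_all:     all lncRNAs in the expression matrix
    """
    # Assumption: with nothing predicted or nothing verified, report EF = 0.
    if n_suc == 0 or n_true == 0:
        return 0.0
    return (n_truesuc / n_suc) / (n_true / n_all)


# Assumed inputs taken from Table 2 for GSE113013.
ef_example = enrichment_factor(n_truesuc=1, n_suc=38, n_true=1, n_all=10_347)
ef_rounded = round(ef_example, 1)
```

Under these assumed inputs the EF is about 272.3, which agrees with the value reported for GSE113013 in the Results.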

Results

Identification of disease-specific lncRNA by SNPs across the benchmark datasets

More than 90% of disease-associated SNPs are located in non-coding regions (e.g., lncRNAs). SNPs located in lncRNAs can either modify their secondary structure or affect their expression level [20]. As described in the Methods section, potential disease-associated lncRNAs in the 24 benchmark datasets were identified by disease-associated SNPs for the DAnet analysis, whereas differentially expressed lncRNAs were regarded as disease-associated lncRNAs for DEA (Supplementary Method S1). The Rate was then used to measure the performance of DAnet and DEA in identifying experimentally verified lncRNAs. As shown in Supplementary Fig. S1, for DEA the Rate obtained for each dataset with the adjusted p value (from 0% for 18 datasets to 16.7% for GSE125583) was lower than that obtained with the raw p value (from 0% for 11 datasets to 33.3% for GSE106388). Among the 24 datasets, 8 had no differentially expressed genes at an FDR below 0.05. Thus, the raw p value (p < 0.05) was used to identify differentially expressed lncRNAs across the 24 datasets. As shown in Fig. 1, the Rate of DAnet varied (from 2.6% for TCGA_TC to 100% for GSE113013 and GSE108660), and the Rate of DEA also differed greatly (from 0% for 11 datasets to 33.3% for GSE106388). The Rate of DAnet was generally no lower than that of DEA across the 24 benchmark datasets. Moreover, among the 24 benchmark datasets, the two atherosclerosis datasets, GSE97210 and GSE120521, were generated by microarray and RNA-Seq, respectively. We therefore compared the microarray and RNA-Seq data in terms of the originally detected lncRNAs, the potential disease-associated lncRNAs, and the experimentally validated lncRNAs. As shown in Supplementary Fig. S2, the total number of originally detected lncRNAs was 10,347 for GSE97210 and 10,343 for GSE120521.
The number of lncRNAs detected in both GSE97210 and GSE120521 was 6,836 (highlighted by blue and red lines). The numbers of potential disease-associated lncRNAs for GSE97210 and GSE120521 were 163 and 120, respectively, of which 111 were shared (highlighted by green and red lines). In both datasets, the experimentally validated lncRNA CDKN2B-AS1 was identified by DAnet. These findings indicate that GSE97210 and GSE120521 are consistent in identifying the experimentally validated lncRNA.

Fig. 1

Performance comparison between DAnet and DEA across the 24 benchmark datasets (shown in Table 1) based on the percentage of successful prediction (Rate, %). The Rate measures how well the experimentally verified disease-associated lncRNAs were characterized.

Similarly, the EF was employed to assess the ability of DAnet and DEA to control false characterization. As shown in Fig. 2, the EF of DAnet differed greatly (from 2.2 for GSE125583 to 272.3 for GSE113013), and the EF of DEA also varied (from 0.0 for 11 datasets to 9.2 for GSE106388). For each dataset, the EF of DAnet was generally no lower than that of DEA, and all EFs of DAnet were greater than one.

Fig. 2

Performance comparison between DAnet and DEA across the 24 benchmark datasets (shown in Table 1) based on the enrichment factor (EF). The EF compares the concentration of experimentally verified lncRNAs among the lncRNAs identified by DAnet/DEA with their concentration among all lncRNAs in the expression matrix.

Optimizing the KCV and CD parameters across the benchmark datasets

To identify the lncRNAs most likely to be disease-associated, an optimization procedure was performed to determine the optimal KCV and CD for each benchmark dataset. As shown in Fig. 3, the optimal KCV (red square) varied across the datasets (from 8 for TCGA_TC to 1,000 for GSE102556), and the CV of experimentally verified disease-associated lncRNAs was generally higher. Similarly, as shown in Supplementary Fig. S3, the optimal CD (red square) differed across the datasets (from 5 kb for 13 datasets to 400 kb for GSE113524). Table 2 lists the optimal KCV and CD for each dataset. For six datasets (GSE127853, GSE97210, GSE113013, GSE108660, GSE141140, and TCGA_TC), the CD was not available.

Fig. 3

Optimization of KCV across the benchmark datasets. X axis: the number of top-ranked lncRNAs with the highest variability (KCV); Y axis: the number of experimentally verified lncRNAs (Nexp). When the number of lncRNAs identified by SNPs (Nsnp) was less than 100, KCV was equal to Nsnp; otherwise, KCV ranged from 100 to Nsnp in steps of 100.

The function of lncRNA in disease characterized by DAnet

KEGG enrichment analysis to characterize lncRNA function

The co-expression network of lncRNAs and neighboring mRNAs was constructed with WGCNA under the optimal KCV and CD for each dataset. The network of the module containing the most genes with significant correlation was displayed with Cytoscape [48]. Four networks are shown in Fig. 4 A-D as examples, in which light-yellow squares represent lncRNAs, blue dots represent co-expressed mRNAs in the cis-lncRNA regulatory networks, and red edges represent associations between disease-associated lncRNAs and neighboring mRNAs. The other 14 networks are shown in Supplementary Fig. S4. For each dataset, KEGG enrichment analysis was performed on the co-expressed mRNAs to characterize lncRNA function. A chord diagram was drawn to illustrate the significantly enriched pathways across datasets (Fig. 4 E): enriched pathways reported to be associated with the studied disease are indicated by blue lines, and other pathways by grey lines. The statistics of disease-related pathways in each dataset are shown in Fig. 4 F; the percentage of disease-associated pathways ranged from 40% to 100% across datasets. Detailed descriptions of the relevance between diseases and pathways are provided in Supplementary Table S2.

Fig. 4

The function of lncRNA in disease characterized by DAnet. A-D: co-expression network of the module containing the most genes with significant correlation, constructed by WGCNA for each dataset. A: GSE113524, B: GSE65705, C: GSE131525, D: GSE131526; green square: lncRNA, blue dot: mRNA. E: chord diagram of enriched pathways of 15 benchmark datasets (p < 0.05). F: statistics of disease-associated pathways.

Association between lncRNAs identified by DAnet and the specific disease

Finally, the relationships between lncRNAs and diseases were searched systematically and manually. As illustrated in Fig. 5, 41 directly disease-associated lncRNAs were identified for most diseases (blue lines). In particular, 13 lncRNAs were identified for Alzheimer disease (orange square, 8A20), three for major depressive disorder (brown square, 6A70), four for schizophrenia (brown square, 6A20), 12 for myocardial infarction (blue square, BA41), two for atherosclerosis (blue square, BD40), six for asthma (pink square, CA23), one for lupus erythematosus (purple square, 4A40), one for ulcerative colitis (turquoise square, DD71), five for obesity (yellow square, 5B81), six for type-2 diabetes mellitus (yellow square, 5A11), three for colorectal cancer (green square, 2B91), and six for breast cancer (green square, 2C6Z). Detailed descriptions of the relevance between lncRNAs and the specific diseases are provided in Supplementary Table S3.

Fig. 5

Associations between lncRNAs identified by DAnet and the specific disease. The blue lines indicate the reported associations between lncRNAs and diseases. The squares represent the types of disease. The dots indicate lncRNAs identified by DAnet. Orange square: diseases of the nervous system; brown square: mental, behavioural or neurodevelopmental disorders; blue square: circulatory system disease; pink square: diseases of the respiratory system; purple square: diseases of the immune system; turquoise square: diseases of the digestive system; yellow square: endocrine, nutritional or metabolic diseases; green square: neoplasms; grey dot: lncRNA not reported in the studied disease; green dot: lncRNA associated with a single disease; red dot: lncRNA associated with multiple diseases.

Meanwhile, as illustrated in Fig. 5, lncRNAs associated with multiple diseases (red dots) were also identified. Specifically, two lncRNAs (LINC-PINT and GAS5) were associated with both Alzheimer disease and type-2 diabetes mellitus [52], [53], [54], [55], [56]; SOX2-OT with Alzheimer disease and asthma [57], [58]; CCDC39 with asthma and schizophrenia [59], [60]; HCP5 with asthma and breast cancer [61], [62]; IFNG-AS1 with asthma and ulcerative colitis [63], [64]; and CDKN2B-AS1 with five diseases, namely Alzheimer disease, myocardial infarction, atherosclerosis, type-2 diabetes mellitus, and breast cancer [65], [66], [67], [68], [69], [70].

Discussion

Functional annotation of lncRNAs in diseases has attracted great attention for understanding disease etiology. In this study, we proposed a novel strategy termed DAnet, which combines disease associations with the cis-regulatory network between lncRNAs and neighboring protein-coding genes to improve the functional annotation of lncRNAs. The strategy consists of three main procedures: (1) identifying potential disease-associated lncRNAs based on disease-associated SNPs, (2) detecting the lncRNAs most likely to be disease-associated based on expression variability, and (3) constructing cis-regulatory networks between disease-associated lncRNAs and their neighboring protein-coding genes. To extend DAnet to other RNA-seq or microarray data, the code of DAnet is provided in Supplementary Method S2. DAnet is expected to identify the function of specific lncRNAs in a given disease. First, based on the analysis of 24 datasets involving 16 diseases, the Rate of DAnet was overall higher than that of DEA, which indicates that DAnet may perform better than traditional differential expression-based analysis in identifying experimentally validated lncRNAs. In addition, the EF of DAnet was overall higher than that of DEA, and all EFs of DAnet were higher than 1. These findings indicate the superior capacity of DAnet to control false characterization of lncRNA function. Furthermore, during the optimization of KCV, we found that the experimentally verified disease-associated lncRNAs generally had higher CV values, which is consistent with the findings of other investigators [16], [17], [18]. Under the optimal KCV, the optimal CD was not available for six datasets (GSE127853, GSE97210, GSE113013, GSE108660, GSE141140, and TCGA_TC). This may be attributed to the small number of samples and the small number of lncRNAs/mRNAs in the co-expression analysis [71].
Finally, the KEGG enrichment results indicate that most biological pathways identified by DAnet were associated with the corresponding disease (from 40% to 100%). With DAnet, directly disease-associated lncRNAs were identified for most diseases, and lncRNAs associated with multiple diseases were also identified.

Conclusions

A new strategy integrating disease associations was developed to obtain a lower false discovery rate in the functional annotation of lncRNAs. The analysis of 24 datasets involving 16 diseases indicated that DAnet may perform better than the traditional differential expression-based approach in identifying experimentally validated lncRNAs, and that most biological pathways identified by DAnet were associated with the studied diseases. This provides another way to study the function of lncRNAs in diseases. In sum, DAnet is expected to identify the function of specific lncRNAs in a given disease.

Contributors

J.T. and Y.W. conceived the idea and supervised the work. Y.W., J.Z., and X.W. performed the research. Y.W., J.Z., X.W., E.A.A.-G., L.Y., T.L., M.W., Y.D., and F.Z. prepared and analyzed the data. J.T. and Y.W. wrote the manuscript. All authors reviewed and approved the final version of the manuscript.

CRediT authorship contribution statement

Yongheng Wang: Formal analysis, Writing – original draft, Writing – review & editing, Visualization. Jincheng Zhai: Formal analysis, Investigation. Xianglu Wu: Investigation, Visualization. Enoch Appiah Adu-Gyamfi: Validation, Writing – review & editing. Lingping Yang: Investigation, Validation. Taihang Liu: Validation. Meijiao Wang: Validation. Yubin Ding: Project administration. Feng Zhu: Conceptualization, Project administration. Yingxiong Wang: Conceptualization, Supervision, Funding acquisition. Jing Tang: Conceptualization, Writing – original draft, Writing – review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

88 in total

1. Promoter of lncRNA Gene PVT1 Is a Tumor-Suppressor DNA Boundary Element.

Authors: Seung Woo Cho; Jin Xu; Ruping Sun; Maxwell R Mumbach; Ava C Carter; Y Grace Chen; Kathryn E Yost; Jeewon Kim; Jing He; Stephanie A Nevins; Suet-Feung Chin; Carlos Caldas; S John Liu; Max A Horlbeck; Daniel A Lim; Jonathan S Weissman; Christina Curtis; Howard Y Chang
Journal: Cell Date: 2018-05-03 Impact factor: 41.582

2. The Gene Expression Omnibus Database.

Authors: Emily Clough; Tanya Barrett
Journal: Methods Mol Biol Date: 2016

3. Long noncoding RNA NEXN-AS1 mitigates atherosclerosis by regulating the actin-binding protein NEXN.

Authors: Yan-Wei Hu; Feng-Xia Guo; Yuan-Jun Xu; Pan Li; Zhi-Feng Lu; David G McVey; Lei Zheng; Qian Wang; John H Ye; Chun-Min Kang; Shao-Guo Wu; Jing-Jing Zhao; Xin Ma; Zhen Yang; Fu-Chun Fang; Yu-Rong Qiu; Bang-Ming Xu; Lei Xiao; Qian Wu; Li-Mei Wu; Li Ding; Tom R Webb; Nilesh J Samani; Shu Ye
Journal: J Clin Invest Date: 2019-02-04 Impact factor: 14.808

4. Expression profile analysis of human peripheral blood mononuclear cells in response to aspirin.

Authors: Sunghee Choi; Hae-Sim Park; Myeong Sook Cheon; Kyunglim Lee
Journal: Arch Immunol Ther Exp (Warsz) Date: 2005 Mar-Apr Impact factor: 4.291

5. Gene expression biomarkers in the brain of a mouse model for Alzheimer's disease: mining of microarray data by logic classification and feature selection.

Authors: Ivan Arisi; Mara D'Onofrio; Rossella Brandi; Armando Felsani; Simona Capsoni; Guido Drovandi; Giovanni Felici; Emanuel Weitschek; Paola Bertolazzi; Antonino Cattaneo
Journal: J Alzheimers Dis Date: 2011 Impact factor: 4.472

6. LncRNADisease 2.0: an updated database of long non-coding RNA-associated diseases.

Authors: Zhenyu Bao; Zhen Yang; Zhou Huang; Yiran Zhou; Qinghua Cui; Dong Dong
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

7. Genome-wide identification and functional prediction of cold and/or drought-responsive lncRNAs in cassava.

Authors: Shuxia Li; Xiang Yu; Ning Lei; Zhihao Cheng; Pingjuan Zhao; Yuke He; Wenquan Wang; Ming Peng
Journal: Sci Rep Date: 2017-04-07 Impact factor: 4.379

8. LncRNA HCP5 promotes triple negative breast cancer progression as a ceRNA to regulate BIRC3 by sponging miR-219a-5p.

Authors: Lihong Wang; Tian Luan; Shunheng Zhou; Jing Lin; Yue Yang; Wei Liu; Xiao Tong; Wei Jiang
Journal: Cancer Med Date: 2019-06-18 Impact factor: 4.452

9. LNCipedia 5: towards a reference set of human long non-coding RNAs.

Authors: Pieter-Jan Volders; Jasper Anckaert; Kenneth Verheggen; Justine Nuytens; Lennart Martens; Pieter Mestdagh; Jo Vandesompele
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

10. The Human-Specific and Smooth Muscle Cell-Enriched LncRNA SMILR Promotes Proliferation by Regulating Mitotic CENPF mRNA and Drives Cell-Cycle Progression Which Can Be Targeted to Limit Vascular Remodeling.

Authors: Amira D Mahmoud; Margaret D Ballantyne; Vladislav Miscianinov; Karine Pinel; John Hung; Jessica P Scanlon; Jean Iyinikkel; Jakub Kaczynski; Adriana S Tavares; Angela C Bradshaw; Nicholas L Mills; David E Newby; Andrea Caporali; Gwyn W Gould; Sarah J George; Igor Ulitsky; Judith C Sluimer; Julie Rodor; Andrew H Baker
Journal: Circ Res Date: 2019-07-24 Impact factor: 17.367