No-match ORESTES explored as tumor markers.

Barbara P Mello, Eduardo F Abrantes, César H Torres, Ariane Machado-Lima, Rogério da Silva Fonseca, Dirce M Carraro, Ricardo R Brentani, Luiz F L Reis, Helena Brentani.

Abstract

Sequencing technologies and new bioinformatics tools have led to the complete sequencing of various genomes. However, information regarding the human transcriptome and its annotation is yet to be completed. The Human Cancer Genome Project, using ORESTES (open reading frame EST sequences) methodology, contributed to this objective by generating data from about 1.2 million expressed sequence tags. Approximately 30% of these sequences did not align to ESTs in the public databases and were considered no-match ORESTES. On the basis that a set of these ESTs could represent new transcripts, we constructed a cDNA microarray. This platform was hybridized against 12 different normal or tumor tissues. We identified 3421 transcribed regions not associated with annotated transcripts, representing 83.3% of the platform. The total number of differentially expressed sequences was 1007. Also, 28% of analyzed sequences could represent noncoding RNAs. Our data reinforce the view that the human genome is pervasively transcribed, and point to molecular marker candidates for different cancers. To support these findings, we confirmed, by real-time PCR, the differential expression of three out of eight potential tumor markers in prostate tissues. Lists of the 1007 differentially expressed sequences and of the 291 potentially noncoding tumor markers are provided.

Year: 2009 PMID: 19270067 PMCID: PMC2677862 DOI: 10.1093/nar/gkp074

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Understanding the genetic basis of human development and the mechanisms implicated in the physiopathology of diseases has improved dramatically after the disclosure of the human genome sequence and its encoded genes (1–3). It is now widely accepted that, in mammals, there is no linear correlation between the number of genes, transcripts, and functionally diverse proteins. In the human transcriptome, a myriad of controlling mechanisms involving alternative splicing and a diversity of 5′ and 3′ ends contribute to an as yet unknown universe of transcripts (4). It is known that most of the genome is transcribed in complex patterns of interacting and overlapping transcripts from both strands (5–9), and most mammalian genes also have antisense transcripts (7,9–11). We currently have a great deal of information (4,5,12–14) arising from modern technologies, such as tiling arrays, confirming that the genome is pervasively transcribed and that noncoding regions, such as introns and intergenic regions, play an important role in human genome regulation by acting in cis at the transcriptional level (4,15,16). These approaches have resulted in the discovery of many novel transcribed sequences, and provide a new perspective on the number and extent of transcripts. Noncoding RNAs (ncRNAs) are emerging as key players in transcriptional and translational control, and represent a new level of complexity (17,18). Available data show that the ratio of noncoding to coding RNAs increases from prokaryotes to mammals (6,19). Furthermore, ncRNAs appear to have cell- or condition-restricted expression, and to be expressed at lower levels than the well-characterized coding genes (20–22). In addition, although cross-species conservation of many ncRNA transcribed regions is weak, promoters of these transcripts are generally much more evolutionarily conserved, and the conserved regions extend further than in the promoters of protein-coding RNAs (5 kb versus 500 bp) (5,22,23).
In recent years, the use of bioinformatics tools allied to experimental studies, particularly for the whole genome, has become a common and promising means to predict and screen novel ncRNAs and antisense RNAs (10,14,22,24,25). Although sequencing efforts based on generating cDNA fragments had a major impact on gene discovery, the unspliced human transcripts that map exclusively to introns, and with no similarity to known expressed genes from any organism, were not fully appreciated. Most investigators selected transcripts with evidence of splicing, or ESTs only where both a polyadenylation signal and a poly(A) tail were present (18). It is now accepted that only a small fraction of the sequences generated through EST methods represent mitochondrial transcripts, reverse-transcribed copies of rRNA, bacterial contaminants or immature mRNA molecules (26,27). Large fractions of what were, until recently, considered ‘junk’ DNA are indeed transcribed, and may play a fundamental role in understanding genomes (5,15,28). In addition, the results presented by Ravasi et al. (29) show that most of the cloned noncoding sequences in the RIKEN cDNA collection are expressed and are not, on the whole, derived from genomic or pre-mRNA (premature mRNA) contamination. A major contribution toward identifying ESTs came from the Human Cancer Genome Project (HCGP) (3,26,27,30), which used the ORESTES (open reading frame EST sequences) methodology. ORESTES is a technique that generates ESTs encompassing the midpoints of genes, unlike conventional EST methodologies (5′ and 3′) that cover the ends of transcripts. This characteristic results from cDNA synthesis using arbitrarily selected, nondegenerate primers under low-stringency conditions, which permits sequence analysis of less abundant transcripts and therefore gives access to genes with lower levels of expression (26).
Thus, the HCGP, through ORESTES methodology, generated 1 190 044 open reading frame EST sequences using RNA extracts from 24 types of normal or tumor tissues (3,27). From this total, almost 30% (341 680 sequences) showed no similarity with known transcripts and were considered no-match ORESTES (27). With the aim to explore the potential of ORESTES with no similarities with ESTs in the public databases as tumor markers, we constructed a cDNA microarray. This platform, containing ORESTES with a high probability of representing actively transcribed regions not associated with annotated transcripts, was hybridized against 12 different normal and tumor human tissues. The differential expression observed among distinct tissues or pathological conditions demonstrates that this strategy was very useful for identifying tissue-specific, or tumor-specific RNAs that do not correspond to previously annotated transcripts. These hitherto-uncharacterized transcripts may represent new human genes, splice variants, ncRNAs or natural antisense transcripts (NATs) with a restricted pattern of gene expression. As prostate tumor is the most prevalent cancer in the Brazilian male population (http://www.inca.gov.br), we have explored some of these sequences as potential prostate tumor markers.

MATERIALS AND METHODS

Selection of ORESTES and genome mapping

To construct the array, 4356 ORESTES with a higher probability of representing actively transcribed regions of the human genome not associated with annotated transcripts were randomly selected from the data generated by Fonseca et al. (31), which resulted from the exploration of the 341 680 ORESTES from the Human Cancer Genome Project that showed no similarity to known transcripts (27). In that work, a bioinformatics pipeline was constructed for the sequences mapped on the human genome that were annotated as no-match in the Human Cancer Genome Project. The pipeline started with the removal of sequences derived from libraries containing genomic DNA or immature mRNA contamination, according to Sorek & Safer, 2003 (32). This was followed by the selection of clusters that contained at least one no-match sequence derived from prostate or breast tissues and were formed by ESTs originating from at least two distinct libraries, together with the singletons that showed gaps upon genomic alignment. Clusters that aligned with full-length transcripts or ESTs of other projects were also removed. Genome mapping was done through a local database composed of data downloaded from the UCSC Genome Bioinformatics database (http://genome.ucsc.edu). ORESTES were classified according to their mapping on the human genome using three different gene tracks (Ensembl, KnownGene and RefSeq), and sequences mapped once on the genome were further classified as exonic, intronic or intergenic.
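The exonic, intronic and intergenic classification described above can be illustrated with a minimal sketch. It assumes a single chromosome, ignores strand, and uses a hypothetical in-memory layout for gene and exon coordinates; it is not the local database or the pipeline used in this work.

```python
def classify_mapping(start, end, genes):
    """Classify a half-open interval [start, end) on one chromosome.

    `genes` is a hypothetical list of (gene_start, gene_end, exons) tuples,
    where `exons` is a list of (exon_start, exon_end) pairs.
    """
    def overlaps(a_start, a_end, b_start, b_end):
        # Two half-open intervals overlap when each starts before the other ends.
        return a_start < b_end and b_start < a_end

    inside_gene = False
    for gene_start, gene_end, exons in genes:
        if not overlaps(start, end, gene_start, gene_end):
            continue
        inside_gene = True
        # Any overlap with an annotated exon makes the sequence exonic.
        if any(overlaps(start, end, ex_start, ex_end) for ex_start, ex_end in exons):
            return "exonic"
    # Inside a gene but touching no exon: intronic; otherwise intergenic.
    return "intronic" if inside_gene else "intergenic"


# Illustrative coordinates only.
example_genes = [(100, 1000, [(100, 200), (800, 1000)])]
labels = [classify_mapping(s, e, example_genes)
          for s, e in [(150, 250), (300, 400), (2000, 2100)]]
```

With several gene tracks, the same function could be applied to each track in turn and the results compared.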

cDNA microarrays

Glass arrays with 4356 elements were prepared in our lab with the aid of the Flexys Robot (Genomic Solutions, Ann Arbor, MI, USA), as described by Brentani et al., 2005 (33). Microarray data are deposited at Gene Expression Omnibus (GEO) under accession number GSE12737. Detailed information is provided in Supplementary Data.

RNA extraction and amplification

The institutional research ethics committee approved the current study (REC number 970/07), which was performed in accordance with the principles expressed in the Declaration of Helsinki. For all samples kept in the A.C. Camargo Hospital BioBank, patients provided signed informed consent for use in research. Total RNA derived from 56 normal or tumor tissues, obtained from the A.C. Camargo Hospital BioBank, was extracted with TRIzol (Invitrogen, Carlsbad, CA, USA) (Supplementary Data, Table S1). As a reference, we used a pool of RNAs obtained from 15 distinct human cell lines (Table S2). RNA samples were linearly amplified using a T7-based protocol (34,35). cDNA was prepared with aminoallyl-dUTP (Sigma–Aldrich, St. Louis, MO, USA) (36). Detailed information is provided in Supplementary Data.

Labeling, hybridization and data extraction

cDNA samples were submitted to indirect labeling (36) using Alexa Fluor 555 or Alexa Fluor 647 labels (Invitrogen). Hybridizations were performed in duplicate using the dye-swap method (35,37) in the GeneTAC Hybridization Station (Genomic Solutions). Slides were scanned on a confocal laser scanner (ScanArray Express, PerkinElmer, Waltham, MA, USA), using identical parameters for all slides, and data were extracted with ScanArray Express software (PerkinElmer). The histogram method was used to estimate signal and local background intensities. Detailed information is provided in Supplementary Data.

Selection of bonafide transcripts

After subtracting local background, data was normalized by Lowess (38). For each sample, we determined the correlation between replica hybridizations and the number of spots with signal greater than local background. We also determined, for each sample, the differences between average signal intensity for elements representing intergenic or intragenic (exonic and intronic) sequences and for exonic or intronic sequences. To define a sequence as expressed, and to minimize the risk of a false-positive call, we applied a second level of cutoff for low-intensity spots. First, we determined, for each element, the lowest background-corrected intensity value among the 112 reads (main and swap slides) in each channel. Then, for each channel, we considered, as threshold, the highest value among the 112 lowest reads in each slide. Next, we eliminated, for each channel, all elements with median intensity below this threshold. Elements that survived these criteria were considered bonafide transcripts. For all expression data we applied log2 to the values.
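One possible reading of the low-intensity cutoff described above is sketched below for a single channel: take the lowest background-corrected read on each slide, use the highest of these per-slide minima as the threshold, and keep elements whose median intensity across slides reaches it. The data layout and function name are hypothetical, and the interpretation of the procedure is an assumption.

```python
from statistics import median


def bona_fide_elements(intensities):
    """Apply the per-channel cutoff to background-corrected intensities.

    `intensities` maps an element ID to a list of reads, one per slide,
    with the same slide order for every element (hypothetical layout).
    Returns the threshold and the set of element IDs that pass it.
    """
    n_slides = len(next(iter(intensities.values())))
    # Lowest read observed on each slide, across all elements.
    slide_minima = [min(reads[s] for reads in intensities.values())
                    for s in range(n_slides)]
    # Threshold: the highest value among the per-slide lowest reads.
    threshold = max(slide_minima)
    # Elements with median intensity below the threshold are eliminated.
    kept = {elem for elem, reads in intensities.items()
            if median(reads) >= threshold}
    return threshold, kept


# Illustrative values for three elements on three slides.
example = {"a": [10, 5, 30], "b": [200, 300, 250], "c": [40, 20, 35]}
threshold, kept = bona_fide_elements(example)
```

In the study this would be run once per channel over the 112 reads (main and swap slides).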

Prediction of structured ncRNA candidates

Genomic sequences corresponding to ORESTES were analyzed to predict structured ncRNAs candidates. First, we separated the sequences into three groups: fully exonic, partially exonic, and nonexonic, according to the annotation systems KnownGene and RefSeq (UCSC Genome Bioinformatics). For each group, we combined searches for three features: (i) putative ORF, (ii) coding/noncoding potential of sequences and (iii) sequence and secondary structure conservation. To determine if a sequence is entirely an ORF, we used the getorf program (EMBOSS program suite, http://www.ebi.ac.uk/Tools/emboss), which analyzes if the three reading frames of both strands of the sequence could generate a coding sequence, and checked if the longest ORF identified by this software corresponded to the whole sequence (or its trimmed version of up to 2 bases from each end). Also, we used the Coding Potential Calculator (CPC) software (39), with default parameters, which classifies sequences in coding and noncoding (weak-coding, coding, weak noncoding and noncoding), to refine our initial ORF prediction. This software takes into account six features, being three of them based on the predicted ORF extension, quality and integrity, and the other three derived from BLASTX searches (UniRef90, BLAST Assembled Genomes; http://blast.ncbi.nlm.nih.gov/Blast.cgi): the number, quality and frame of the hits. We grouped sequences classified as noncoding or weak noncoding and sequences classified as coding and weak coding. To detect sequence and secondary structure conservation, we searched for multispecies alignments (16 vertebrate genomes with human http://hgdownload.cse.ucsc.edu/goldenPath/hg18/multiz17way) that overlapped the ORESTES sequence locations. These alignments were analyzed using the RNAz software (40) with default parameters, to detect evidence of secondary structure conservation, like compensatory base substitution.
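The test of whether a sequence is entirely an ORF can be illustrated with a small stand-alone sketch. It is not the getorf program: it assumes that an ORF is any stretch free of stop codons in the standard genetic code, and it checks the three frames of both strands, which amounts to trimming up to 2 bases from each end.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def is_entirely_orf(seq):
    """True if some reading frame on either strand has no stop codon.

    Starting at offset 0, 1 or 2 drops up to 2 bases at the 5' end, and
    the incomplete trailing codon drops up to 2 bases at the 3' end.
    """
    seq = seq.upper()
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            if codons and not any(c in STOP_CODONS for c in codons):
                return True
    return False


# Illustrative sequences: one open frame, and one with stops in all six frames.
open_example = is_entirely_orf("ATGGCCAAAGGG")
closed_example = is_entirely_orf("TAGATAGATAGCTATCTATCTA")
```

The second example is its own reverse complement and carries a TAG in each of the three frames, so no frame on either strand is open.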

Validation by RT-PCR

To select sequences for validation by RT-PCR, we first determined the average intensity value for each element in all slides. Using an MA plot (intensity ratios versus average intensities), we randomly selected elements with intensity 20-fold higher than the background (cutoff value of log2 12 for A, average intensities), since we intended to validate highly expressed sequences. Primers for 12 selected sequences were designed using Primer3 software (http://frodo.wi.mit.edu) (Table S3). RNAs from 23 normal or tumor tissues were obtained from the A.C. Camargo Hospital BioBank (Table S1), extracted with TRIzol (Invitrogen) and DNase treated (Illustra RNAspin Mini Isolation Kit, GE Healthcare, Buckinghamshire, ENG, UK). RT-PCR reactions were carried out on a GeneAmp PCR System 9700 (Applied Biosystems, Foster City, CA, USA), and the amplicons were fractionated by electrophoresis through a 3% NuSieve GTG gel (Cambrex, East Rutherford, NJ, USA) and stained with ethidium bromide. Detailed information is provided in Supplementary Data.

Differential expression analysis

To select differentially expressed sequences to be considered as tumor marker candidates we constructed MA plots showing, for each spot, fold differences and median signal intensity for tumor versus normal tissues. For these analyses, three (placenta, lung and testis) out of 12 tissues that were used in cDNA microarray experiments were discarded because we had only normal samples from them, and therefore, we could not perform differential expression analyses with the aim to identify tumor markers for these tissues.

Validation by quantitative real-time PCR

To select sequences to validate by real-time PCR, we determined, for each element, fold differences between median signal intensity for: (i) prostate tumor versus normal prostate tissue and (ii) prostate tumor versus all normal tissues analyzed on cDNA microarray experiments. Using MA plots, we selected elements expressed at least 4-fold more or 4-fold less in prostate tumor relative to normal prostate, and at least 2-fold more or 2-fold less in prostate tumor relative to all normal tissues (values converted to log2). Primers were designed for nine sequences differentially expressed in prostate tissue, using Primer Express software (Applied Biosystems) and the Oligo Tech program (http://www.oligosetc.com/analysis.php) (Table S5). Real-time PCR reactions were optimized using a pool of RNAs from three prostate tumor cell lines (PC-3, DU 145 and LNCaP), provided by the São Paulo branch of the Ludwig Institute for Cancer Research and cultivated by the Laboratório de Investigação Médica/24 from Universidade de São Paulo. Real-time PCR validation was performed in seven paired samples from prostate (prostate adenocarcinoma and its surrounding non-neoplastic tissue), obtained from the A.C. Camargo Hospital BioBank (Table S6), extracted with TRIzol (Invitrogen) and DNase treated (RQ1 RNase-Free DNase, Promega, Madison, WI, USA). Real-time PCR experiments were carried out in duplicate using the SYBR Green detection method (Applied Biosystems). The housekeeping gene HPRT was selected through literature review (41). We used a previously described molecular marker for prostate carcinoma (AMACR) (42) as a positive control for real-time PCR reactions. Real-time PCR was performed on a 7900HT Fast Real-Time PCR System (Applied Biosystems). The relative expression ratio was calculated according to the Pfaffl formula (43). For all expression data we applied log2 to the values. Detailed information is provided in Supplementary Data.
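For reference, the Pfaffl relative expression ratio divides the efficiency-corrected change of the target by that of the reference gene: ratio = E_target^(Ct_control − Ct_sample, target) / E_ref^(Ct_control − Ct_sample, reference). The sketch below is a minimal illustration in which the control is the normal tissue, the sample is the paired tumor and the reference is HPRT; the function name and all Ct and efficiency values are hypothetical.

```python
import math


def pfaffl_ratio(e_target, ct_target_control, ct_target_sample,
                 e_ref, ct_ref_control, ct_ref_sample):
    """Relative expression ratio of a target gene, sample versus control.

    Efficiencies are amplification factors per cycle (2.0 means perfect
    doubling). Ct values are threshold cycles.
    """
    delta_target = ct_target_control - ct_target_sample
    delta_ref = ct_ref_control - ct_ref_sample
    return (e_target ** delta_target) / (e_ref ** delta_ref)


# Hypothetical pair: the target crosses threshold 3 cycles earlier in tumor,
# while the reference gene is unchanged.
ratio = pfaffl_ratio(2.0, 28.0, 25.0, 2.0, 22.0, 22.0)
log2_ratio = math.log2(ratio)
```

With perfect efficiencies the formula reduces to the familiar 2^(ΔΔCt) calculation.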

Sequencing of validated ORESTES

ORESTES validated as real transcripts by RT-PCR had their PCR products sequenced to verify their correspondence to the immobilized sequences, and differentially expressed ORESTES validated by real-time PCR had their original clones sequenced to verify that they corresponded to the expected sequences. Sequencing was carried out on the 3130 Genetic Analyzer (Applied Biosystems). Detailed information is provided in Supplementary Data.

RESULTS

Genomic mapping of the cDNA microarray sequences

An analysis comparing the genomic location of ORESTES and non-ORESTES ESTs, with respect to coordinates of coding genes, was performed. As expected, we found that both non-ORESTES ESTs, as well as ORESTES, were preferentially mapped in transcribed regions of the human genome, using three different gene tracks (RefSeq, Ensembl and KnownGene, UCSC Genome Bioinformatics; http://genome.ucsc.edu) (Figure 1A). The proportion of ORESTES sequences that overlapped annotated exons of coding genes was somewhat reduced in the ORESTES data set (Figure 1B). The preferential mapping of ORESTES to transcriptional units suggests that fully intronic ORESTES may represent valid transcripts instead of genomic DNA contamination of ORESTES libraries.

Figure 1.

Mapping of ESTs on the human genome according to three different data sets. (A) ESTs mapped onto human transcript regions. (B) ESTs mapped onto human exonic regions. Black bar, ESTs; gray bar, ORESTES (open reading frame expressed sequence tags); and white bar, ORESTES that compose the cDNA microarray.

We constructed a cDNA microarray containing 4356 distinct ORESTES, selected using a previously described pipeline developed to maximize the probability of identifying new expressed sequences (31). Our data showed that most ORESTES that composed the array were mapped to transcribed regions of the genome (Figure 1A), and had a fully intronic location (Figure 1B). Only a small fraction of the spotted sequences overlapped annotated exons of coding genes or had intergenic mapping (Figure 1). For further analysis, we considered 3872 sequences that map once to the human genome. We divided these sequences into exonic (335 sequences), intronic (3178 sequences) and intergenic sequences (359 sequences), representing 8.6%, 82.1% and 9.3% of the sequences, respectively. A large proportion of these ORESTES (3767) are unspliced relative to the genome.

Analysis and identification of actively transcribed regions not associated with annotated transcripts, and their evaluation as potential ncRNAs

Many low-expression transcripts, splicing isoforms and ncRNAs are involved in specialized biological functions, and show tissue-specific or even pathology-specific expression patterns. To survey new transcripts associated with ORESTES, 24 tumor and 32 normal RNA samples from 12 different tissues (Table S1) were hybridized with the microarray platform. Some preliminary analyses were performed to determine the overall quality of the data. The Pearson correlation between two replicate slides showed a median value of 0.86, and 76% of the elements that composed this platform had signal greater than local background. We investigated whether there was any bias that could be associated with the different types of sequences immobilized on the array, according to the previous classification: exonic, intronic or intergenic sequences. Using the Wilcoxon test, there were no statistically significant differences in either case, i.e. in the comparison of average signal intensity for elements representing intergenic or intragenic sequences, or in the comparison within intragenic sequences (exonic versus intronic). The spotted sequences showed no systematic bias associated with their classification, supporting the likelihood that sequences mapped on nonexonic regions are transcribed. To be more accurate in defining true hybridization signals we applied a more stringent criterion, described in the ‘Materials and methods’ section, with signal intensity cutoff values of 196 and 65 for channels 1 and 2, respectively. Thus, for each channel, we eliminated all elements with a median signal intensity below these thresholds. For channels 1 and 2 we had 86.6% and 91.1% of slides with more than 3000 valid elements, respectively. Therefore, the total number of actively transcribed regions not associated with annotated transcripts was 3421 (3079 out of 3178 intronic and 342 out of 359 intergenic sequences).
The additional 319 out of 335 exonic elements identified as valid elements corroborated the potential of our approach to identify new, real transcribed regions, since these sequences were deposited by others in public databases while this work was being performed. From this final number of valid elements (3740), 96 sequences (80 intronic, 6 intergenic and 10 exonic sequences) had intensity above our established cutoff value (20-fold higher than the background) and were eligible for RT-PCR validation. From these 96 sequences, we arbitrarily selected nine intronic sequences (roughly 11% of the eligible intronic sequences) and three intergenic sequences (50% of the eligible intergenic group), and validated the existence of all of them as actively transcribed regions not associated with annotated transcripts, in RNAs derived from 10 different tissues (Tables S1 and S3, Figure S1). PCR products of validated sequences were submitted to sequencing and their correspondence to the immobilized sequences on the array was confirmed. Evidence of secondary structures coupled with some sequence conservation at the RNA level can provide important clues that a given ‘locus’ is probably transcribed, and that this transcript may have a biological role (14,40,44,45). RNA secondary structures are known to play an important functional role, not only in many noncoding transcripts, but also in the context of protein-coding mRNAs (46). To analyze the proportion of spotted sequences that may represent structurally conserved putative ncRNAs, we searched for three features: (i) putative ORF, (ii) coding/noncoding potential and (iii) sequence and secondary structure conservation. For this analysis, sequences that did not overlap known exons (intronic and intergenic sequences) were grouped together (3537 sequences) and the exonic sequences were further classified as fully exonic (131) or partially exonic (166).
As for this analysis we only considered the KnownGene and RefSeq gene tracks to classify analyzed sequences, we discarded 38 sequences, previously classified as exonic according to the initial mapping, using the RefSeq, Ensembl and KnownGene gene tracks (UCSC Genome Bioinformatics) (Figure S2). We considered as putative ncRNAs sequences which presented all following features: partially exonic or nonexonic mapping, CPC software prediction of noncoding potential and evidence of secondary structure conservation according to the RNAz software. From the partially exonic sequences, we found 38 putative ncRNAs and from the nonexonic sequences, we found 1040 ncRNAs candidates (Table 1, Figure S2). It is noteworthy that some known ncRNAs possess a subsequence that is not as short as is usual, and resembles an ORF (46). In summary, about 28% (1078 of 3834) of our transcribed regions, not associated with annotated transcripts, are potential ncRNAs (Table 1, Figure S2).

Table 1.

Putative noncoding RNAs and their distribution with respect to differential expression

ncRNAs (noncoding RNAs).

Differential expression analyses and validation by quantitative real-time PCR

We constructed MA plots (intensity ratios versus average intensities) showing, for each spot, fold differences and median signal intensity for tumor versus normal tissues, for all the different tissues used in the cDNA microarray (Figure 2). We observed, in all tissues, a large number of sequences differentially expressed (at least 2-fold) between tumor and normal samples, suggesting the potential to explore uncharacterized molecular markers (about 28% of the intronic and intergenic sequences mapped once on the genome). A sequence was counted as differentially expressed when it showed a fold difference of at least two between tumor and normal samples in one or more tissues and was consistently up- or downregulated in all tissues in which it was expressed. The total number of such sequences was 1007: 11 out of 335 exonic sequences, 885 out of 3178 intronic sequences and 111 out of 359 intergenic sequences (a list of all 1007 differentially expressed sequences is provided on our website, http://www.lbhc.hcancer.org.br/orestes_tumor_markers). Using the same criteria, 291 transcripts were classified as differentially expressed putative ncRNAs by our pipeline. Four percent of these putative noncoding tumor markers were in the NONCODE database (47), or were predicted as an antisense pair by Galante et al. (48), again corroborating the validity of our approach, but they have never been identified as differentially expressed in tumors. In Table 1 we assessed whether these candidates were expressed in one or more tumor tissues and found that at least five putative noncoding tumor markers were upregulated in at least four different tumors (AW803984, BE161676, CV358552, AW814925 and AW935941), compared with normal tissues. This is a very promising result for the search for tumor markers. A list of all putative noncoding tumor markers is provided in Table S4.
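The selection rule for differentially expressed sequences can be sketched as follows. This is one reading of the criterion, not the analysis code used here: a sequence qualifies when its tumor/normal ratio reaches 2-fold in at least one tissue and has the same direction in every tissue in which it is expressed. Function names and the example values are hypothetical.

```python
import math


def log2_fold(tumor_intensity, normal_intensity):
    """M value of the MA plot: log2 of the tumor/normal intensity ratio."""
    return math.log2(tumor_intensity / normal_intensity)


def is_differential(log2_ratios, min_fold=2.0):
    """Apply the fold and direction rule to one sequence.

    `log2_ratios` maps each tissue in which the sequence is expressed to
    its log2(tumor/normal) value.
    """
    values = list(log2_ratios.values())
    if not values:
        return False
    cutoff = math.log2(min_fold)
    reaches_cutoff = any(abs(m) >= cutoff for m in values)
    same_direction = all(m > 0 for m in values) or all(m < 0 for m in values)
    return reaches_cutoff and same_direction


# Hypothetical sequences: consistent upregulation, mixed direction, weak change.
consistent = is_differential({"prostate": 1.5, "breast": 0.3})
mixed = is_differential({"prostate": 1.5, "breast": -0.4})
weak = is_differential({"prostate": 0.5})
```

On the log2 scale the 2-fold and 4-fold lines of the MA plots correspond to M values of ±1 and ±2.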

Figure 2.

MA plot (intensity ratios versus average intensities) showing the fold differences and median signal intensity for tumor versus normal tissues for each spot on the microarray. (A) Prostate tissue. (B) Other tissues used on the cDNA microarray. Gray circles, the sequences from prostate selected for real-time PCR validation (with expression in prostate tumor at least 4-fold higher or lower relative to normal prostate and at least 2-fold higher or lower relative to all normal tissues). Dotted line, 2-fold line; dashed line, 4-fold line.

We constructed MA plots to present an overview of the sequence expression distribution in prostate tissue (Figure 2A). For each spot, we observed the fold differences and median signal intensity for prostate tumor versus normal prostate (Figure 2A), and for prostate tumor versus all normal tissues (Figure S3). The nine sequences from prostate selected for validation by real-time PCR (Table S5) had at least a 4-fold variation in prostate tumor relative to normal prostate, and at least a 2-fold variation in prostate tumor relative to all normal tissues. We observed that, in general, the selected sequences were differentially expressed only in prostate when compared with other tissues (Figure 2, gray circles). Using real-time PCR, we validated eight of the nine sequences as real transcripts. We considered as valid differentially expressed sequences those that presented a 3-fold difference in at least three out of seven paired samples. Using this criterion, three sequences were considered to be potential prostate tumor markers (Table 2). One of the potential tumor markers (BQ373258) was previously described as a ncRNA (DD3PCA3) by Bussemakers et al. (49). Its differential expression was confirmed in five of our seven paired prostate samples, in which it was upregulated in prostate cancer, so it served as a positive control for our real-time PCR experiments. The overexpression of the AW793062 ORESTES in prostate tumor was confirmed in four paired tissues.
Genome mapping of this sequence showed its alignment to the first intron of a putative isoform of the RNF217 gene. The sequence BF910617 was validated in three samples and showed overexpression in prostate cancer. It is an intronic sequence of the KIAA1432 gene. Using our criterion for valid differentially expressed transcripts (3-fold difference in at least three out of seven paired samples), we also validated the overexpression of the AMACR gene. This molecular marker was previously described as having high sensitivity and specificity for prostate carcinoma of different grades and types; its mRNA is overexpressed in about 30% (microarray) to 60% (real-time PCR) of prostate tumors and is low to undetectable in normal tissues (42,50,51).
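The validation rule applied to the paired prostate samples (a 3-fold difference in at least three out of seven pairs) can be written as a short check. The sketch assumes that the qualifying pairs must agree in direction, which the text does not state explicitly, and the ratios shown are hypothetical.

```python
def is_validated(tumor_normal_ratios, min_fold=3.0, min_pairs=3):
    """Check the paired-sample rule on linear tumor/normal ratios.

    Counts pairs that are at least `min_fold` up and pairs that are at
    least `min_fold` down, and requires `min_pairs` in one direction
    (the same-direction requirement is an assumption of this sketch).
    """
    up = sum(1 for r in tumor_normal_ratios if r >= min_fold)
    down = sum(1 for r in tumor_normal_ratios if r <= 1.0 / min_fold)
    return max(up, down) >= min_pairs


# Hypothetical ratios for seven tumor/normal pairs.
passes = is_validated([4.0, 3.5, 1.2, 0.9, 5.0, 1.0, 1.1])
fails = is_validated([4.0, 0.2, 1.2, 0.9, 1.5, 1.0, 1.1])
```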

Table 2.

Results of quantitative real-time PCR validating paired prostate samples with cDNA microarray results

(a) 100% values represent expression only in tumor samples (no detectable signal in normal samples) and were converted to 10-fold to calculate the fold mean.

All values represent log2 of expression values, considering tumor/normal ratios.

A summary of all sequences and samples sets used in each performed assay, as well as obtained results, is provided in Supplementary Data (Table S7).

DISCUSSION

Since a significant set of ORESTES remains unassociated with annotated transcripts, and could potentially represent actively transcribed regions of the human genome, we constructed a cDNA microarray containing ORESTES with a high probability of representing actively transcribed regions of the human genome, and not associated with annotated transcripts. Most of the sequences immobilized on the array map on intronic regions and are unspliced. After hybridization using 12 different tissues, we identified 3421 actively transcribed regions not associated with annotated transcripts. With RT-PCR we validated 100% of actively transcribed regions not associated with annotated transcripts that were evaluated (12 sequences). Based on an ORF detector program (getorf, http://www.ebi.ac.uk/Tools/emboss), only 9% of the sequences mapped once on the genome may represent coding genes, leading us to search for potential noncoding sequences. In spite of the ORESTES methodology being biased to cover transcript midpoints with high probability of representing open reading frames, our data showed that from the sequences mapped to intronic or intergenic location (nonexonic group) only 7.6% presented a putative ORF. In contrast, 47.3% of fully exonic sequences had a putative ORF (Figure S2). Our next step was to look for sequences that could be tumor, tissue or tumor/tissue associated. We observed in all tissues, a large number of differentially expressed (at least 2-fold) sequences between tumor and normal samples, suggesting the potential to explore uncharacterized molecular markers (about 28% of the intronic and intergenic sequences mapped once on the genome). The total number of differentially expressed sequences, with fold differences between tumor and normal samples of at least two and in agreement in respect to these sequences being up- or downregulated in all tissues in which they were expressed, in one or more different tissues, were 1007. 
We investigated the number of intronic ORESTES that mapped to a cancer gene list composed of 382 genes for which mutations have been causally implicated in cancer. This catalog of cancer genes is available from the Sanger Institute (Cancer Gene Census, http://www.sanger.ac.uk/genetics/CGP/Census) and is based on a previously published review (52). We found 189 intronic ORESTES mapped to 97 cancer genes. Of these, 47 were differentially expressed according to the criteria described above. Using a list of cellular signal pathways curated by NCI-Nature (http://pid.nci.nih.gov), we expanded the original list of cancer genes to 1003 cancer-pathway related genes. We found that 287 ORESTES mapped to 170 cancer-pathway related genes; 70 of these 287 ORESTES were differentially expressed according to the same criteria.

De novo computational prediction of ncRNA genes is difficult, since these transcripts lack most of the signatures that make protein-coding gene prediction possible (45). However, ncRNA genes produce a functional RNA rather than a translated protein, and often display a conserved, base-paired secondary structure rather than primary sequence similarity. These features can be combined in analyses, yielding profiles of multiple sequence alignments of ncRNAs that can be captured by statistical models (14,53). Several approaches successfully predict ncRNAs based on the idea that functionally significant RNA structures will be conserved in related species even when the primary sequence is not (54). Secondary structure base pairings are maintained by compensatory base mutations, and these changes can be used as statistical evidence of evolutionary pressure to keep the base pairs at those positions (14,40,44,45). Pedersen et al. (44) predicted, from an initial set of more than 48 000 structured regions, ∼10 000 structured RNA transcripts in the human genome. Washietl et al. (40) estimated that 35 000 structured RNAs are conserved in mammals. The annotation of ncRNAs on a genome-wide scale is currently restricted to searching for homologs of known RNA families. More than 1500 homologs of known classical RNA genes can be annotated in the human genome sequence, and automatic, homology-based methods predict up to 5000 related sequences (45). Major databases containing thousands of annotated ncRNA sequences are RNAdb (10) and NONCODE (47). Using a combination of methods (see ‘Materials and methods’ section), we identified about 28% (1078 of 3834) of our transcripts as potential ncRNAs. These sequences showed a small overlap (4%) with sequences deposited in these ncRNA databases. One of them, ORESTES CV372409, aligns with a sequence in the NONCODE database and was downregulated in three different tumors compared with normal tissues in our cDNA microarray experiments. A common theme seems to be that many ncRNA genes have very restricted expression. Often they have low or no EST coverage, but this does not necessarily mean that they are not expressed or are nonfunctional (14,55).

Microarray technology has dramatically enhanced the discovery of molecular markers for cancer. Prostate cancer is the most prevalent cancer in Brazilian males (http://www.inca.gov.br) as well as in men worldwide (http://www.cancer.gov), and investigators have searched for molecular markers of the disease. The first gene identified by cDNA microarray as suitable for clinical practice, with the potential to improve the diagnosis of prostate cancer, was AMACR (42). AMACR was suggested as a new molecular marker for prostate carcinoma by Xu et al. (42) in 2000, and confirmed by Jiang et al. (51).
This protein is already used clinically as an aid in distinguishing prostate cancer from benign disease (56) and in discriminating different grades and types of prostate cancer (50). Another potential molecular marker for prostate cancer identified through cDNA microarray analysis is the polycomb gene EZH2. Expression of EZH2 indicates poor survival and could be used as a marker for prostate cancer progression and metastasis (57–59). The TMPRSS2-ERG gene fusion, which is involved in the development of prostate cancer, has also been identified as a molecular marker (60).

Increasing evidence shows a relationship between changes in expression levels of ncRNAs and cancer (18,61–63), emphasizing the potential role of ncRNAs in tumorigenesis and the potential of this type of transcript as a tumor molecular marker (62). For example, in breast carcinoma, BC1 is deregulated (64), and overexpression of BC200 RNA was recently evaluated as a new molecular marker for poor prognosis (65). In lung cancer, increased expression of the MALAT-1 gene indicates a poor clinical outcome (66), and in hepatocellular carcinoma, HULC ncRNA is one of the most upregulated genes (62). In prostate cancer, PCGEM is overexpressed (67), and DD3PCA3 (49) is implicated in tumorigenesis (68). These findings present a strong argument for the inclusion of noncoding transcripts in the arsenal of markers used for molecular diagnostics, which thus far has been almost exclusively populated by assays of protein-coding transcripts (11).

We validated three differentially expressed sequences in paired prostate samples as potential tumor markers. Validation of the BQ373258 sequence reinforced the value of our approach to identifying molecular markers, since this sequence maps to the last exon of a described ncRNA (DD3PCA3) (49).
DD3PCA3 has been described as highly overexpressed in prostate cancer tissue compared with adjacent nonmalignant prostatic tissue, and its expression is restricted to the prostate (49). An unusually high density of stop codons has been identified along the entire DD3PCA3 cDNA sequence (49,69). Together with the lack of an extended open reading frame, and after several years of analysis of putative proteins from predicted small ORFs, this has led to the classification of DD3PCA3 as a polyadenylated ncRNA (69–71). Its function is unknown, although there is speculation that DD3PCA3 regulates gene expression or participates in gene splicing (69). Both our cDNA microarray and real-time PCR results show that this sequence is upregulated in prostate cancer relative to normal prostate (fold mean of 5.50 for real-time PCR and 7.41 for cDNA microarray).

An interesting observation arises from the data for two ORESTES, BF910617 and AW793062. ORESTES BF910617 aligns with an intron of the KIAA1432 gene. From analyses of the Lapointe et al. (72) data set performed through Oncomine Research (http://www.oncomine.org), we observed that, in prostate cancer relative to normal prostate, ORESTES BF910617 shows expression opposite to that of the KIAA1432 gene. In the Lapointe et al. (72) cDNA microarray data set, the KIAA1432 gene was highly expressed in normal prostate, and its expression decreased as the aggressiveness of prostate cancer increased; it was lowest in metastatic prostate cancer in the lymph node (72). Therefore, our hypothesis is that ORESTES BF910617, when expressed at high levels, may play a role in regulating the KIAA1432 gene by inhibiting its expression in prostate cancer. ORESTES AW793062 was validated with high fold values in almost 70% of the paired samples. This sequence is located in the first intron of a putative isoform of the RNF217 gene.
Once again, the differential expression of AW793062 in prostate cancer was opposite to that observed for the RNF217 gene in primary and metastatic prostate cancer (Oncomine Research analyses) (73).

Although the –4.32-fold differential expression of ORESTES CV374350 in prostate tumor shown by the cDNA microarray experiments was not confirmed by real-time PCR, we observed that this sequence maps to the last intron of the SGK1 gene, an inducible Ser/Thr kinase activated via the phosphoinositide 3-kinase (PI3K) signal pathway (74,75). It is worth noting that an mRNA sequence (BX649005), also mapped to the SGK1 locus, shows extensive intron retention that includes the last intron of SGK1. It has been suggested that SGK1 may regulate androgen receptor activity, affecting androgen-mediated prostate cancer growth through a positive-feedback mechanism (76). Oncomine Research analysis of the SGK1 gene (73) suggests that this gene is expressed in normal prostate and benign prostate hyperplasia, that its expression is somewhat reduced in primary prostate carcinoma samples, and that it is significantly reduced in metastatic prostate cancer. Further analysis of metastatic tumor samples could reveal whether CV374350 expression follows the pattern of SGK1 gene expression, and whether this sequence represents an SGK1 intron retention event or is associated with another gene-regulation mechanism. This ORESTES was among those mapped to genes in the list of cancer-related cellular signal pathways analyzed as described above.

The power of our data to explore uncharacterized molecular markers was demonstrated by the large number of sequences differentially expressed between tumor and normal samples from all tissues (about 28% of the intronic and intergenic sequences mapped once on the genome). Also, 291 of these differentially expressed transcripts have ncRNA potential, as predicted by our analysis.
It is also very promising that at least five putative noncoding tumor markers are upregulated in at least four different tumors compared with normal tissues. On the basis of these results, we believe in the value of our approach for identifying uncharacterized molecular markers. Our data set contains a large number of actively transcribed regions of the human genome that are not associated with annotated transcripts and have not yet been widely explored. These may represent new genes, splice variants, NATs or ncRNAs, which could be used as molecular markers for other cancers.
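The observation that some putative noncoding markers are upregulated in at least four different tumors can be expressed as a simple count per sequence. The Python sketch below is only an illustration under stated assumptions: the function name, the input layout (log2 tumor/normal ratios keyed by sequence and tissue), the placeholder sequence names and all values are hypothetical, and the log2 threshold of 1.0 simply restates the 2-fold cutoff mentioned in the text.

```python
def recurrent_up_markers(log2_by_seq, min_tumors=4, min_log2=1.0):
    """Return IDs of sequences upregulated in at least `min_tumors` tissues.

    log2_by_seq maps a sequence ID to {tissue: log2(tumor/normal)}.
    A tissue counts as upregulated when its value is >= min_log2,
    i.e. at least 2-fold higher in tumor for the default threshold.
    """
    hits = []
    for seq_id, by_tissue in log2_by_seq.items():
        n_up = sum(1 for value in by_tissue.values() if value >= min_log2)
        if n_up >= min_tumors:
            hits.append(seq_id)
    return sorted(hits)


# Made-up log2 ratios with placeholder IDs, for illustration only.
demo = {
    "seq_1": {"prostate": 2.1, "breast": 1.4, "lung": 1.0,
              "colon": 3.2, "kidney": 0.2},
    "seq_2": {"prostate": 2.5, "breast": 1.2, "lung": 0.9},
    "seq_3": {"prostate": -1.5, "breast": -2.0, "lung": -1.1,
              "colon": -1.3},
}
markers = recurrent_up_markers(demo)
```

In this toy input only `seq_1` reaches the threshold in four tissues; `seq_3` changes in four tissues but in the downregulated direction, so it is not returned.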

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq; 142330/2007-8 to B.P.M.); and Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP; 04/11774-8, 07/55791-1 to B.P.M.; 07/01549-5 to A.M.-L.). Funding for open access charge: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

Conflict of interest statement. None declared.

75 in total

1. Shotgun sequencing of the human transcriptome with ORF expressed sequence tags.

Authors: E Dias Neto; R G Correa; S Verjovski-Almeida; M R Briones; M A Nagai; W da Silva; M A Zago; S Bordin; F F Costa; G H Goldman; A F Carvalho; A Matsukuma; G S Baia; D H Simpson; A Brunstein; P S de Oliveira; P Bucher; C V Jongeneel; M J O'Hare; F Soares; R R Brentani; L F Reis; S J de Souza; A J Simpson
Journal: Proc Natl Acad Sci U S A Date: 2000-03-28 Impact factor: 11.205

2. Identification of human chromosome 22 transcribed sequences with ORF expressed sequence tags.

Authors: S J de Souza; A A Camargo; M R Briones; F F Costa; M A Nagai; S Verjovski-Almeida; M A Zago; L E Andrade; H Carrer; H F El-Dorry; E M Espreafico; A Habr-Gama; D Giannella-Neto; G H Goldman; A Gruber; C Hackel; E T Kimura; R M Maciel; S K Marie; E A Martins; M P Nobrega; M L Paco-Larson; M I Pardini; G G Pereira; J B Pesquero; V Rodrigues; S R Rogatto; I D da Silva; M C Sogayar; M de Fátima Sonati; E H Tajara; S R Valentini; M Acencio; F L Alberto; M E Amaral; I Aneas; M H Bengtson; D M Carraro; A F Carvalho; L H Carvalho; J M Cerutti; M L Corrêa; M C Costa; C Curcio; T Gushiken; P L Ho; E Kimura; L C Leite; G Maia; P Majumder; M Marins; A Matsukuma; A S Melo; C A Mestriner; E C Miracca; D C Miranda; A N Nascimento; F G Nóbrega; E P Ojopi; J R Pandolfi; L G Pessoa; P Rahal; C A Rainho; N da Rós; R G de Sá; M M Sales; N P da Silva; T C Silva; W da Silva; D F Simão; J F Sousa; D Stecconi; F Tsukumo; V Valente; H Zalcbeg; R R Brentani; F L Reis; E Dias-Neto; A J Simpson
Journal: Proc Natl Acad Sci U S A Date: 2000-11-07 Impact factor: 11.205

3. The contribution of 700,000 ORF sequence tags to the definition of the human transcriptome.

Authors: A A Camargo; H P Samaia; E Dias-Neto; D F Simão; I A Migotto; M R Briones; F F Costa; M A Nagai; S Verjovski-Almeida; M A Zago; L E Andrade; H Carrer; H F El-Dorry; E M Espreafico; A Habr-Gama; D Giannella-Neto; G H Goldman; A Gruber; C Hackel; E T Kimura; R M Maciel; S K Marie; E A Martins; M P Nobrega; M L Paco-Larson; M I Pardini; G G Pereira; J B Pesquero; V Rodrigues; S R Rogatto; I D da Silva; M C Sogayar; M F Sonati; E H Tajara; S R Valentini; F L Alberto; M E Amaral; I Aneas; L A Arnaldi; A M de Assis; M H Bengtson; N A Bergamo; V Bombonato; M E de Camargo; R A Canevari; D M Carraro; J M Cerutti; M L Correa; R F Correa; M C Costa; C Curcio; P O Hokama; A J Ferreira; G K Furuzawa; T Gushiken; P L Ho; E Kimura; J E Krieger; L C Leite; P Majumder; M Marins; E R Marques; A S Melo; M B Melo; C A Mestriner; E C Miracca; D C Miranda; A L Nascimento; F G Nobrega; E P Ojopi; J R Pandolfi; L G Pessoa; A C Prevedel; P Rahal; C A Rainho; E M Reis; M L Ribeiro; N da Ros; R G de Sa; M M Sales; S C Sant'anna; M L dos Santos; A M da Silva; N P da Silva; W A Silva; R A da Silveira; J F Sousa; D Stecconi; F Tsukumo; V Valente; F Soares; E S Moreira; D N Nunes; R G Correa; H Zalcberg; A F Carvalho; L F Reis; R R Brentani; A J Simpson; S J de Souza; M Melo
Journal: Proc Natl Acad Sci U S A Date: 2001-10-09 Impact factor: 11.205

Review 4. Diamonds in the rough: mRNA-like non-coding RNAs.

Authors: Linda A Rymarquis; James P Kastenmayer; Alexander G Hüttenhofer; Pamela J Green
Journal: Trends Plant Sci Date: 2008-04-28 Impact factor: 18.313

Review 5. Computational methods in noncoding RNA research.

Authors: Ariane Machado-Lima; Hernando A del Portillo; Alan Mitchell Durham
Journal: J Math Biol Date: 2007-09-04 Impact factor: 2.259

6. Identification of differentially expressed genes in human prostate cancer using subtraction and microarray.

Authors: J Xu; J A Stolk; X Zhang; S J Silva; R L Houghton; M Matsumura; T S Vedvick; K B Leslie; R Badaro; S G Reed
Journal: Cancer Res Date: 2000-03-15 Impact factor: 12.701

7. PCGEM1, a prostate-specific gene, is overexpressed in prostate cancer.

Authors: V Srikantan; Z Zou; G Petrovics; L Xu; M Augustus; L Davis; J R Livezey; T Connell; I A Sesterhenn; K Yoshino; G S Buzard; F K Mostofi; D G McLeod; J W Moul; S Srivastava
Journal: Proc Natl Acad Sci U S A Date: 2000-10-24 Impact factor: 11.205

8. DD3: a new prostate-specific gene, highly overexpressed in prostate cancer.

Authors: M J Bussemakers; A van Bokhoven; G W Verhaegh; F P Smit; H F Karthaus; J A Schalken; F M Debruyne; N Ru; W B Isaacs
Journal: Cancer Res Date: 1999-12-01 Impact factor: 12.701

9. Identification of protein-coding and intronic noncoding RNAs down-regulated in clear cell renal carcinoma.

Authors: Glauber Costa Brito; Angela A Fachel; Andre Luiz Vettore; Giselle M Vignal; Etel R P Gimba; Franz S Campos; Marcello A Barcinski; Sergio Verjovski-Almeida; Eduardo M Reis
Journal: Mol Carcinog Date: 2008-10 Impact factor: 4.784

Review 10. Mechanisms of Disease: biomarkers and molecular targets from microarray gene expression studies in prostate cancer.

Authors: Colin S Cooper; Colin Campbell; Sameer Jhavar
Journal: Nat Clin Pract Urol Date: 2007-12
