Literature DB >> 18547402

Genome-wide identification of coding and non-coding conserved sequence tags in human and mouse genomes.

Flavio Mignone¹, Anna Anselmo, Giacinto Donvito, Giorgio P Maggi, Giorgio Grillo, Graziano Pesole.

Abstract

BACKGROUND: The accurate detection of genes and the identification of functional regions is still an open issue in the annotation of genomic sequences. This problem affects new genomes but also those of very well studied organisms such as human and mouse where, despite the great efforts, the inventory of genes and regulatory regions is far from complete. Comparative genomics is an effective approach to address this problem. Unfortunately it is limited by the computational requirements needed to perform genome-wide comparisons and by the problem of discriminating between conserved coding and non-coding sequences. This discrimination is often based (thus dependent) on the availability of annotated proteins.
RESULTS: In this paper we present the results of a comprehensive comparison of human and mouse genomes performed with a new high throughput grid-based system which allows the rapid detection of conserved sequences and accurate assessment of their coding potential. By detecting clusters of coding conserved sequences the system is also suitable to accurately identify potential gene loci. Following this analysis we created a collection of human-mouse conserved sequence tags and carefully compared our results to reliable annotations in order to benchmark the reliability of our classifications. Strikingly we were able to detect several potential gene loci supported by EST sequences but not corresponding to as yet annotated genes.
CONCLUSION: Here we present a new system which allows comprehensive comparison of genomes to detect conserved coding and non-coding sequences and the identification of potential gene loci. Our system does not require the availability of any annotated sequence thus is suitable for the analysis of new or poorly annotated genomes.

Entities: Chemical Disease Gene Species

Mesh：

Substances：

Year: 2008 PMID： 18547402 PMCID： PMC2442843 DOI： 10.1186/1471-2164-9-277

Source DB: PubMed Journal: BMC Genomics ISSN： 1471-2164 Impact factor: 3.969

Background

One of the main challenges of post-genomic era is the accurate annotation of genes and the improvement of our knowledge of mechanisms of gene expression through the identification of cis-acting non-coding regulatory regions. Comparative genomics has been one of the most successful approaches used to address this task. Indeed it is well known that sequences with functional activity – such as coding sequences or regulatory regions – are subject to selective pressures that prevent the fixation of mutations and conserve sequences during evolution. Conserved non-coding sequences have been shown to act as tissue specific enhancers of gene expression [1] and in particular of genes involved in control of development [2]. Evolutionary conserved sequences have also been successfully used for the identification of new genes [3]. Given the great interest in this area of research and thanks to the availability of the almost complete genome sequences of many organisms, several tools to identify and collect conserved sequences have been proposed [4,5]. The identification of a conserved sequence is only the first step in the identification of functional elements that requires further information, the most obvious being the assessment of its coding potential, i.e. to assess if the conserved sequence is likely to be part of a coding region. Discriminating between coding and non coding conserved sequences is of great importance as the discovery of novel coding sequences may help the detection of unannotated genes or coding exons and the identification of splice variants. Conversely, the study of non-coding conserved sequences may lead to the identification of regions that may have regulatory activity both at DNA or mRNA level by affecting transcription or translation thus modulating gene expression. The usual approach is to classify conserved sequences as coding or noncoding by comparison with annotated protein sequences: if a conserved sequence is not supported by (does not align to) a known protein it is labelled as non-coding. This approach makes classification heavily dependent on the quality of the annotation of the genomes under analysis and it is obviously less applicable to new – poorly annotated – genomes. We previously developed CSTminer, a tool that does not suffer from these limitations as it identifies conserved sequences (Conserved Sequence Tags – CSTs) and classifies them as coding or non-coding by evaluating the presence of evolutionary dynamics specific of coding sequences [6,7]. We recently applied CSTminer to an extensive analysis of human chromosomes 15, 21 and 22 and corresponding mouse syntenic regions [8]. We identified more than 37,000 CSTs. 9,500 of these were labelled as coding and were used to benchmark a novel methodology – based on the identification of clusters of coding CSTs – to detect genomic regions which are likely to contain genes (see [8] for details). One striking result of the work was that, despite the large efforts made towards the annotation of human genome, we were able to identify 25 loci potentially containing unannotated genes using a relatively simple comparative approach. Interestingly 11 of the 25 predicted genes were confirmed by updated genome annotation at the time of publication – confirming the reliability of our approach. The computational problem for the comprehensive comparison of two genomes of the size of human and mouse is not trivial. Indeed, although CSTminer is fairly fast, it is limited by the alignment step which implements a blast like algorithm which is not suitable to compare very long sequences. In this paper we propose a highly parallelized system to perform a complete comparison of large genomes. This system also allows the submission of precomputed genomic alignments (such as blastz) to further improve the speed of the analysis. A preliminary study of sequences conserved between human and mouse genomes allowed the identification of several clusters of coding CSTs that do not correspond to any annotated genes and the creation of a collection of non-coding CSTs possibly endowed of some functional activity.

Results

The first operation performed by the CSTminer algorithm is the identification of high scoring segment pairs (HSPs) through a Blast-like sequence comparison. A solution to the computational problem associated with whole genome comparisons is to split long sequences into smaller fragments (in order to limit the number of CSTs interrupted at the boundaries of the fragments, these cannot be too short). We empirically established a length of 100 Kbp (with an overlap of 1 Kbp) as a good compromise (data not shown). Given the size of human (~3 Gbp) and mouse (~2.6 Gbp) genomes, an exhaustive comparison between all human and mouse 100 Kbp sequences would require nearly 800 M comparisons. Considering an average computation time of 2 sec for each comparison the whole analysis would require many years of computation on a single CPU. We took advantage of the high level of parallelization offered by grid technology and developed a system suitable to perform all comparisons in a "reasonable" time. The number of tasks to be performed is very large and in a distributed environment there are many reasons that individual jobs may fail (Worker Node problems, site configuration, problems due to middleware failures, etc). Accurate job management is therefore essential and to this end a fully automated procedure based on mysql DBMS was developed to launch and monitor jobs, re-run failed jobs and to collect results of the analysis.

Genome-wide detection of human-mouse Conserved Sequence Tags (CSTs)

Following CSTminer analysis we obtained a redundant collection (see below) of nearly 1,500,000 CSTs with an average length of 190 nt as summarized in Table 1 where total number of CSTs, average and maximum length is reported for each chromosome.

Table 1

CST distribution among human (A) and mouse (B) chromosomes.

A
Chromosome	Tot	AvLen	MaxLen

chr1	138165 (66669)	198.01 (187.9948)	4621 (4621)
chr2	117207 (64475)	194.56 (196.3639)	17543 (17543)
chr3	92770 (51821)	192.64 (191.7276)	3378 (6460)
chr4	69640 (28421)	195.49 (190.4989)	4347 (4228)
chr5	80407 (37576)	207.3 (195.852)	7219 (7219)
chr6	85924 (37799)	203.53 (186.8935)	4177 (4177)
chr7	70839 (40048)	189.56 (188.1825)	3829 (3237)
chr8	56955 (31423)	188.25 (190.3644)	5600 (5600)
chr9	62015 (31661)	186.98 (192.5231)	4403 (4403)
chr10	61229 (34765)	193.75 (188.0005)	2997 (3918)
chr11	98006 (38038)	167.79 (191.4565)	6512 (4968)
chr12	73657 (34444)	180.06 (177.3208)	2680 (2680)
chr13	38619 (19341)	194.75 (200.7517)	10176 (10176)
chr14	53848 (25668)	191.36 (194.2269)	3854 (3854)
chr15	49142 (25707)	201.89 (191.1299)	2600 (2600)
chr16	41166 (21161)	185.79 (191.0755)	4020 (4020)
chr17	61341 (29899)	172.99 (179.4735)	3179 (3179)
chr18	27896 (17053)	198.68 (196.9212)	3719 (3719)
chr19	56808 (17624)	144.18 (163.8533)	3738 (3738)
chr20	29061 (16007)	195.18 (184.8446)	2862 (2862)
chr21	13165 (6688)	174.33 (178.0885)	3935 (3935)
chr22	18378 (9414)	180.52 (165.1375)	4710 (4710)
chrX	68893 (38838)	200.7 (196.3681)	4882 (4882)
chrY	6192 (1786)	171.74 (147.0274)	2087 (3180)

B

Chromosome	Tot	AvLen	MaxLen

chr1	96565	193.02	3539
chr2	123127	180.58	17545
chr3	80185	198.2	4615
chr4	82789	194.01	4388
chr5	76557	184.15	3969
chr6	83801	188.59	3251
chr7	93077	176.52	6521
chr8	63198	194.55	4222
chr9	80909	187.01	4222
chr10	66720	183.21	4168
chr11	97295	176.61	3272
chr12	65166	192.03	3857
chr13	64693	198.31	3268
chr14	59365	200.92	10179
chr15	57968	180.62	5600
chr16	48185	192.52	3950
chr17	54366	185.28	3999
chr18	48736	215.77	7206
chr19	42161	188.74	5381
chrX	85471	199.71	4877
chrY	989	145,88	1193

The total CST length (Tot), the average (AvLen) and the maximum length (MaxLen) is reported. Numbers in brackets refer to the non-redundant CST dataset. Minimum length has been limited to 60 nt (see text for details).

CST distribution among human (A) and mouse (B) chromosomes. The total CST length (Tot), the average (AvLen) and the maximum length (MaxLen) is reported. Numbers in brackets refer to the non-redundant CST dataset. Minimum length has been limited to 60 nt (see text for details). The minimum CST length has been limited to 60 nt as shorter sequences would not allow a reliable computation of the coding potential score (see below). Chromosomes 2 and 13 show some uncommonly long CSTs (17,543 and 10,176 nt respectively). These two CSTs are characterized by a high coding potential and correspond to conserved sequences of the long coding exons of TTN and SACS genes. CSTminer assesses the coding potential of CSTs by computing a Coding Potential Score (CPS) based on the evolutionary dynamics observed between the two sequences. Once fixed threshold values for CPS derived from benchmark sets of coding and non-coding sequences a specific label can be assigned to each CST [7]. More than 400,000 CSTs were labelled as coding, 550,000 CSTs were labelled as non-coding, 500,000 CSTs remained undefined as the CPS fell in a twilight zone between the coding and the non-coding thresholds. Finally nearly 20,000 CSTs displayed more than 95% of identity and were labelled as ultra-conserved (Figure 1). Given the low divergence, it is not possible to compute a reliable CPS for these CSTs.

Figure 1

CST Coding Potential distribution. Distribution of CSTs among different coding classes assigned by CSTminer. Uc: ultraconserved sequences (identity >95%), Cod: coding CSTs, NonCod: non-coding CSTs, Undef: CSTs with undefined coding potential. Redundant CST dataset is shown with green bars, non-redundant set with red bars. See text for details. We observed that CSTs labelled as undefined often overlap both coding and non-coding regions corresponding to CDS or to UTR or intron sequences, respectively (data not shown). The global coding potential score assigned by CSTminer is influenced by both subregions and does not allow a clear classification of the sequence.

Human mouse conservation

In order to accurately evaluate whole human-mouse genome conservation it was necessary to consider that CSTs may overlap one another as the same region of a genome may share similarity with more than one region of the other genome under analysis (mainly due to segmental duplications or paralogous sequences). We projected each CST onto the genome and labelled each human and mouse nucleotide according to the CST label. If the same nucleotide was contained in more than one CST it was labelled according to the ranking: ultraconserved > coding > noncoding > undefined, as depicted in Figure 2. In this way we obtained a non-redundant set of CSTs assigning to each conserved nucleotide an unambiguous annotation.

Figure 2

Non redundant CSTs dataset generation. Each nucleotide of human and mouse genomes has been labelled as ultraconserved, coding, non-coding or undefined if contained in a CST. Nucleotides part of overlapping CSTs were labelled according to the ranking ultraconserved > coding > non-coding > undefined. Stretches of at least 60 consecutive nucleotides with the same label were merged into a non-redundant CST (nrCST). We then obtained a cleaned set of more than 720,000 nonredundant CSTs (nrCSTs) distributed across human chromosomes (Table 1 numbers in brackets) and labelled as summarized in Figure 1. We evaluated nucleotide conservation of each human and mouse chromosome as shown in Figure 3. Consistent with previous data [9] we observed that average conservation of human and mouse genomes (evaluated as the frequency of conserved nucleotides part of nrCSTs) is around 5%.

Figure 3

Human and mouse chromosome conservation. Chromosome conservation evaluated on human (A) and mouse (B) genomes. For each chromosome the frequency of conserved nucleotides – evaluated on nrCSTs – is reported. As described in legend each colour corresponds to a coding potential classification. Undef: undefined, NonCod: non-coding, Cod: coding, Uc: ultraconserved sequences (identity>95%). Previous observations showed that the majority of genes on human chromosome 17 have their homologues on mouse chromosome 11. Interestingly our data show that these chromosomes are the most conserved, with 6.9% and 7.2% of conserved nucleotides respectively. Conversely, both human and mouse chromosomes Y are poorly conserved (about 0.5%). This observation could be explained with the degeneration process faced by Y chromosome. Moreover they are unusually rich of repetitive elements that we masked before running our analysis. Only 0.94% of human genome is labelled as coding from our comparison with mouse genome.

CSTs and annotated mRNAs

Each CST is characterized by absolute coordinates on both human and mouse genomes and this allows a comparison with mRNA annotations (extracted by UCSC database). We then classified CSTs as exonic, intronic or intergenic on the basis of their overlap with the available mRNAs annotation. Figure 4 shows that 65% of ultraconserved CSTs and 76% of coding CSTs overlap with known exons, 21% and 15% respectively map within introns while the remaining 13% and 9% map outside of annotated genes. Noncoding CSTs – as expected – show a different pattern of localization: 43% and 29% map within intronic or intergenic regions while 28% overlap with annotated exons, likely corresponding to conserved tracts of untranslated regions (UTRs). A similar pattern is also shown by undefined sequences.

Figure 4

CSTs comparison to human and mouse mRNAs. CSTs have been compared to human and mouse annotated mRNAs and classified as Exonic (E) if overlapping with an exon in either human or mouse genomes. Remaining CSTs have been labelled as Intronic (I) if overlapping with an intron, or Intergenic (O) if not overlapping to any annotated gene in both human and mouse genomes. The distribution among different classes is reported for ultraconserved (Uc), coding, non-coding (NonCod) and undefined (Undef) CSTs.

Coding CST-Clustering

To identify regions with an high density of coding conserved sequences – likely gene loci – we applied an improved version of a clustering procedure previously described [8] (see also Material and Methods) and detected 25,296 clusters containing 141,001 CSTs. By comparing the genomic coordinates of the clusters of coding CSTs with those of annotated mouse and human mRNAs we observed that 22,360 clusters (97.6%) overlapped known mRNAs in at least one organism. Noticeably 15,275 clusters were fully confirmed by human sequences (each CST of the cluster was confirmed by the overlap with an mRNA derived exon) and 15,194 clusters were fully confirmed by mouse sequences (Table 2). These observations underline the reliability of our simple approach to localize gene loci on unannotated genomes and strongly support the idea that clusters not overlapping with known sequences are likely to represent unannotated gene loci.

Table 2

Clusters of coding CSTs have been identified as described in text and have been compared to annotated RefSeq mRNAs.

RefSeq
Tot Clusters	25296
Hs confirmed Clusters (full confirmed)	22360 (15275)
Mm confirmed Clusters (full confirmed)	22324 (15194)
Tot confirmed Clusters	25297

ESTs

Tot Clusters unconfirmed by RefSeq	668
Hs confirmed Clusters (full confirmed)	551 (432)

Clusters not corresponding to any annotated mRNA sequence have been compared to ESTs to find evidence of their expression.

Clusters of coding CSTs have been identified as described in text and have been compared to annotated RefSeq mRNAs. Clusters not corresponding to any annotated mRNA sequence have been compared to ESTs to find evidence of their expression. Moreover, given that our approach does not require the previous availability of annotated features, it seems reasonable to think that it could prove to be a powerful tool in the annotation of genomes lacking a well curated gene annotation. As 668 clusters were confirmed neither by human nor by mouse mRNAs, we compared their chromosome coordinates to those of human ESTs to find evidence of their expression. Indeed, 551 (82%) of these clusters showed an overlap with ESTs (432 clusters were fully supported). Only 117 clusters (comprising 775 CSTs) did not show any overlap with known transcribed sequences (14 of these corresponding to pseudogenes according to the human pseudogene database [10]). One of the CST clusters possibly corresponding to a novel gene locus is shown in Figure 5. Quite strikingly, coding CSTs not only correspond to spliced ESTs but also to coding exons predicted by a variety of gene finding programs thus further supporting the inferred gene prediction

Figure 5

Cluster n.13688 has been aligned to human genome on UCSC genome browser and compared to annotated genes, gene predictions and ESTs. The CSTs of this cluster match well to mapped spliced ESTs and exons predicted from various gene-finding programs in a genomic region where no gene has been annotated.

Noncoding CSTs

Several evidences have been reported about the critical role of non coding conserved sequences in regulation of gene expression [1] and in particular in the regulation of genes involved in control of development [2]. We are aware that – at this stage – our data are limited to the comparison of human and mouse and may not allow the precise localization of short functional motifs. Nonetheless the identification of core sequence elements shared by several non-coding nrCSTs might represent a powerful approach for the detection of conserved sequences that might be involved in chromatin remodelling or in the regulation of the expression of many genes while unique non-coding nrCSTs might be expected to include elements with more gene-specific functions. We performed reciprocal blastn analyses of more than 300,000 human noncoding nrCSTs and observed that nearly 92% of the sequences are unique (ie they do not show any significant similarity with other sequences of the dataset). 7% of sequences show some similarity to up to five different nrCSTs, 1% show up to 10 occurrences while the remaining 0.6% show sequence similarity to more than 10 non-coding nrCSTs. These results are summarized in Figure 6. The observed non-coding repeated elements do not correspond to annotated repeats as the CSTminer analysis was performed on repeat masked sequences.

Figure 6

Abundance of non-coding CSTs in human genome. Distribution of unique and repeated non-coding CSTs in the human genome. Bars report the number of non-coding CSTs grouped on the basis of their occurrenceData are shown in logarithmic scale. Figure 7 shows the frequency of occurrence of sequences corresponding to non-coding nrCSTs in the mouse and human datasets. The plot suggests a remarkable species-specificity of repetitive non coding conserved sequences.

Figure 7

non-coding CSTs occurrence in mouse and human genome. Each dot represents the number of occurrences of each non-coding CST in human (x axis) and mouse (y axis) genomes.

non-coding CSTs occurrence in mouse and human genome. Each dot represents the number of occurrences of each non-coding CST in human (x axis) and mouse (y axis) genomes. To investigate the hypothesis that noncoding conserved sequences (ncCSTs) might correspond to functional regions we made a comparison with specialized databases containing known regulatory elements. The enrichment of known functional regions in our dataset of conserved non-coding sequences would support the possibility that the same dataset could contain new regulatory elements. In particular we considered "Presta-promoter", which contains a curated non-redundant set of human promoter sequences [11], Rfam ncRNA database [12] and "OregAnno" database which contains manually curated known regulatory elements [13]. 306 ncCSTs matched with 238 (43%) sequences in the PrestaPromoter dataset, 347 ncCSTs with 269 (21%) "OregoAnno" sequences and 513 ncCSTs with 1069 (3%) Rfam elements. The finding that a sizable proportion of known functional elements are represented in our conserved non-coding set suggests that additional, still unknown, regulatory elements are represented in our ncCST dataset.

CST comparison with blastz chains

Pre-computed genome alignments are already available for several genomes, including human and mouse and it may make sense to take advantage of this data – provided that information loss is minimal. We compared CSTs obtained from our full genome analysis with the results obtained by comparing genome regions corresponding to blastz [14] chain tracts only. We adapted the grid based system we developed for the full genome comparison to allow the submission of blastz alignments (or any other "query – target" coordinate pairs) to limit the analysis to these regions. Nonetheless many blastz chains are longer than 100 Kbp (with length up to about 80 Mbp) and cannot be efficiently analyzed with a direct comparison. Those sequences were split in 100 Kbp slices (with 1 Kbp overlap) and – to further reduce computational load – only slices showing at least 3 identical sequences of 10 nt on the same diagonal were compared. This procedure – which is similar to the one employed by the BLAT algorithm [15] – remarkably reduced the number of CSTminer comparisons to about 1% of all the possible 100 Kbp comparisons. Despite the striking reduction of comparisons we observed that only 1% of total CSTs were completely missed by blastz chains while and additional 4% escaped detection because they did not pass the filter imposing three decamers on the same diagonal above described. However, our data suggest that the use of blastz chains can provide an acceptable reduction of the complexity of analysis with a limited (about 5%) loss of information. However, the main advantage of blastz chains in this context is their availability as pre-computed features (available for instance at UCSC genome browser website [16]). Indeed, their computation is rather time consuming (481 CPU days for the human – mouse comparison according to [14]). On the other hand, the reduction procedure based on the identification of exact matches on the same diagonal provides a significant speed up of the process as the computational requirements to perform the identification of those regions is limited (data not shown).

Discussion

As highlighted in [6] the CSTminer algorithm measures the coding potential through the evaluation of evolutionary dynamics unique to coding sequences not requiring the availability of any annotated feature. Indeed, many analyses of conserved coding or noncoding sequences have been made by classifying a sequence as coding (or non-coding) following a comparison with protein databases [1,17,18]. It is clear that the reliability of such approach depends on the availability of annotated proteins. If a sequence is not supported by a protein it is difficult to decide if the sequence is really noncoding or whether the corresponding protein has simply not been identified. Moreover very few annotated proteins have been physically sequenced and the vast majority of them are conceptual translation products of available mRNA sequences. This introduces a vicious circle as the hypothetical codingness of a sequence is inferred by the alignment to putative proteins. We have developed and implemented a high performance grid-based system to perform exhaustive full genome comparisons with CSTminer algorithm to identify and discriminate between conserved coding and noncoding sequences. Besides the speed of the whole procedure even when entire large genomes are compared, one of the main advantages of our system is that it does not require any annotated feature for the assessment of the putative coding potential of identified conserved tracts rendering it useful for the comparison of poorly annotated genomes (i.e. when no or few cDNA sequences are available). It is thus possible to identify "interesting" sequences such as putative genes or regulatory regions, and use these data to drive subsequent experimental analysis or to strengthen the reliability of independent computational data (i.e. de novo gene finding data). We demonstrated – using as a reference the well annotated human and mouse genomes – that the observation of clusters of coding CSTs is a good indicator of the existence of a gene locus. This information can be incorporated in a gene prediction pipeline where several gene prediction tools are combined and their results compared to limit the rate of false positives and to strengthen the significance of predictions [19]. Collections of conserved non-coding sequences can also address specific studies on sequences that might regulate, for example, gene expression or chromatin structure. Indeed, these data might also facilitate the identification of novel non-coding RNAs, whose importance and prevalence are currently the subject of much debate [20]. Currently available sets of nonding elements are generated following the comparison of the conserved elements with annotated mRNAs. It can thus be expected that following updates of databases some noncoding elements are reclassified as coding. In Table 3 we show a comparison of publicly available conserved noncoding elements with coding CSTs to detect misannotated elements. It is interesting to notice that – as expected – while recent sets of noncoding elements such as CONDOR [21]or Ancora [22] do not show any correspondence to coding CSTs, older sets (reviewed in [23]) show a sizable (239 to 345) number of elements misannotated as noncoding that today we know to be part of coding region of RefSeq sequences.

Table 3

Comparison of publicly available noncoding datasets to coding CSTs and currently annotated RefSeq coding sequences.

Non Coding Elements datasets	Reference	N° el	CST	Coding CST/RefSeq CDS
Ancora	[22]	267641	96%	0
CONDOR	[21]	4554	98%	0
Berkeley (*)	[23]	939457	51%	239
Penn State (*)	[23]	452517	85%	261
UCSC (*)	[23]	1007256	59%	345

(N° el:) number of noncoding elements of the dataset; (CST) percentage of elements of the set corresponding to a CST; (Coding CST/RefSeq CDS) number of elements of the set which correspond to a coding CST and to the CDS of a human RefSeq mRNA. (*) Elements of these sets refers to hg10 human assembly. Therefore we preventively converted their coordinates to current assembly by using LiftOver conversion tool.

Comparison of publicly available noncoding datasets to coding CSTs and currently annotated RefSeq coding sequences. (N° el:) number of noncoding elements of the dataset; (CST) percentage of elements of the set corresponding to a CST; (Coding CST/RefSeq CDS) number of elements of the set which correspond to a coding CST and to the CDS of a human RefSeq mRNA. (*) Elements of these sets refers to hg10 human assembly. Therefore we preventively converted their coordinates to current assembly by using LiftOver conversion tool. As pointed out by Couronne [23] local alignment tools, beside the identification of orthologous segments, lead to the identification of paralogous relationships and sequence repetitions. This information is often considered "noise" and is thus removed. This seems reasonable if the primary goal is to align genomes to find large scale orthologous regions; nonetheless repetitive elements can have functional relevance in regulation of gene expression and warrant further inspection. We have used repeat masked sequences – thus purging known repetitive elements – before CSTminer analysis. Nonetheless we observed many highly repetitive conserved noncoding elements that we believe to be interesting and may represent novel lineage specific repetitive elements. Our analysis system has been implemented on a grid facility [24], taking advantage of the high parallelization achievable and allowing full genome comparisons in very reasonable amount of time (15 days computation to compare mouse and human genomes). Nonetheless many genomes have already been aligned with very sensitive algorithms like blastz and it is possible to take advantage of this information, limiting CSTminer comparison to those regions only. It is important to notice that although blastz chains are alignments of genomic sequences, the CSTminer alignment step is required to detect local similarities. Indeed the average length of CSTs detected by CSTminer is 190 bases, while blastz tries to extend matches to find large synteny tracts, often resulting in very long alignments. Use of blastz chains only leads to the complete loss of only about 1% of total CSTs. However, chains longer than 100 Kbp must be split into shorter tracts and an exhaustive comparison of all tracts must be performed. To further improve the speed of the analysis it is possible to limit CSTminer comparison of chain fragments to pairs that show at least 3 identical matches of 10 nt on the same diagonal. By applying this restriction it is possible to drastically limit the number of comparisons (nearly 1% of total comparisons) thus reducing the computation time to slightly more than 24 hours (in the case of human and mouse genomes) with a further loss of 4% of CSTs.

Conclusion

In this paper we describe a grid-based system devised to perform full genome comparisons with CSTminer algorithm. The main advantage of this system is that the assessment of coding potential of conserved sequences does not require any annotated feature rendering it useful for the comparative analysis of poorly annotated genomes. The system has been benchmarked on the well-annotated human and mouse genomes where it proved its reliability.

Methods

CSTminer

The CSTminer algorithm has been described in [6] and slightly modified in [7] where a web interface to run CSTminer was made available. A further automatic web system to compare a single sequence to several genomes has also been implemented in [25]. Briefly, given a pair of sequences, CSTminer identifies high scoring segment pairs (HSPs) through a Blast-like sequence comparison. The coding capacity of each CST delimited by an HSP is then assessed by assigning a coding potential score (CPS) which corresponds to the maximum score value obtained from each of the possible reading frames in the forward and reverse orientation. CSTminer also allows the display of the highest-scoring triplet window (default minimum length of 60 nt) by scanning each detected CST. This approach facilitates the detection of potential coding regions located in longer CSTs which might contain both coding and non-coding tracts (through the presence of untranslated mRNA or intronic regions). Following an accurate benchmark on controlled coding and non-coding datasets, CPS thresholds for coding and non-coding CSTs were evaluated. Therefore each CST was labeled as coding (Cod) (if CPS ≥ coding_threshold) or non-coding (NonCod) (if CPS ≤ non_coding threshold or CPS < coding_threshold and highest scoring triplet window CPS < coding threshold). CSTs with CPS non fulfilling these requirements were labeled as Undefined. Finally CSTs with more than 95% of similarity were labelled as Ultrancoserved and no CPS was computed as the low divergence would not allow the computation of a significant score.

GRID

We empirically determined that for large scale comparisons CSTminer gives optimal results with sequences of 100 Kbp with 1 Kbp overlap. Indeed this value allows a good balance between computational speed and the occurrence of CST fragmentation at the border of the submitted sequences. In order to reduce the time needed to execute this large amount of comparisons, we took advantage of grid technology using many machines in parallel. Indeed each CST comparison requires an independent computation thus we split all the 800 M comparisons in smaller subset and we run them on the EGEE grid infrastructure [24]. In order to maximize the level of parallelization, the comparisons were grouped in set of 1000 (10 human 100 K slices vs. 100 mouse 100 K slices). The number of comparisons was chosen in order to have each task running for approximately one hour, giving a good ratio between the time spent in order to set-up the environment and the CPU time spent in running the comparison. This approach also assures something similar to a check-point: even if a job fails only less than one hour of computation is lost.

Clustering

The clustering procedure is an improvement of the procedure described in [8]. The basic idea is to identify genomic regions with a significant concentration of coding CSTs. Given N coding CSTs sorted on their genomic start we computed t3, the genomic span of three consecutive CSTs centered in CST i for i ∈ (2, N - 1). We labeled as pre-cluster three consecutive coding CSTs centered in CST i if 2 * t3i ≤ where is the average genomic span of t3k for k ∈ (i - 30, i + 30). When i<30 or i>N-30, is computed respectively for k ∈ (2,60) or k ∈ (N - 61, N - 1). Overlapping pre-clusters are then merged into clusters. The main difference with the previous clustering procedure is that clustering parameters (average density of surrounding CSTs) are now dynamically computed over a 60 CSTs window in the genomic region under analysis – thus accounting for regions with different gene density. Given that each CST has human and mouse genomic coordinates, the clustering procedure is applied both to human and mouse genomes. Only CSTs belonging to a cluster in both organisms are considered. As already pointed out clusters are computed on coding CSTs only, but syntenic CSTs of all classes (non-coding, undefined and ultraconserved) are included in clusters following a post-processing step.

Authors' contributions

FM carried out the analysis on coding dataset, participated in the design and coordination of the study and helped to draft the manuscript. AA carried out the analysis on non-coding dataset, GG adapted the software to the grid-based application. GD and GPM set up and run the grid-based procedure. GP conceived the study and helped to draft the manuscript. All the authors read and approved the final manuscript.

24 in total

1. Initial sequencing and comparative analysis of the mouse genome.

Authors: Robert H Waterston; Kerstin Lindblad-Toh; Ewan Birney; Jane Rogers; Josep F Abril; Pankaj Agarwal; Richa Agarwala; Rachel Ainscough; Marina Alexandersson; Peter An; Stylianos E Antonarakis; John Attwood; Robert Baertsch; Jonathon Bailey; Karen Barlow; Stephan Beck; Eric Berry; Bruce Birren; Toby Bloom; Peer Bork; Marc Botcherby; Nicolas Bray; Michael R Brent; Daniel G Brown; Stephen D Brown; Carol Bult; John Burton; Jonathan Butler; Robert D Campbell; Piero Carninci; Simon Cawley; Francesca Chiaromonte; Asif T Chinwalla; Deanna M Church; Michele Clamp; Christopher Clee; Francis S Collins; Lisa L Cook; Richard R Copley; Alan Coulson; Olivier Couronne; James Cuff; Val Curwen; Tim Cutts; Mark Daly; Robert David; Joy Davies; Kimberly D Delehaunty; Justin Deri; Emmanouil T Dermitzakis; Colin Dewey; Nicholas J Dickens; Mark Diekhans; Sheila Dodge; Inna Dubchak; Diane M Dunn; Sean R Eddy; Laura Elnitski; Richard D Emes; Pallavi Eswara; Eduardo Eyras; Adam Felsenfeld; Ginger A Fewell; Paul Flicek; Karen Foley; Wayne N Frankel; Lucinda A Fulton; Robert S Fulton; Terrence S Furey; Diane Gage; Richard A Gibbs; Gustavo Glusman; Sante Gnerre; Nick Goldman; Leo Goodstadt; Darren Grafham; Tina A Graves; Eric D Green; Simon Gregory; Roderic Guigó; Mark Guyer; Ross C Hardison; David Haussler; Yoshihide Hayashizaki; LaDeana W Hillier; Angela Hinrichs; Wratko Hlavina; Timothy Holzer; Fan Hsu; Axin Hua; Tim Hubbard; Adrienne Hunt; Ian Jackson; David B Jaffe; L Steven Johnson; Matthew Jones; Thomas A Jones; Ann Joy; Michael Kamal; Elinor K Karlsson; Donna Karolchik; Arkadiusz Kasprzyk; Jun Kawai; Evan Keibler; Cristyn Kells; W James Kent; Andrew Kirby; Diana L Kolbe; Ian Korf; Raju S Kucherlapati; Edward J Kulbokas; David Kulp; Tom Landers; J P Leger; Steven Leonard; Ivica Letunic; Rosie Levine; Jia Li; Ming Li; Christine Lloyd; Susan Lucas; Bin Ma; Donna R Maglott; Elaine R Mardis; Lucy Matthews; Evan Mauceli; John H Mayer; Megan McCarthy; W Richard McCombie; Stuart McLaren; Kirsten McLay; John D McPherson; Jim Meldrim; Beverley Meredith; Jill P Mesirov; Webb Miller; Tracie L Miner; Emmanuel Mongin; Kate T Montgomery; Michael Morgan; Richard Mott; James C Mullikin; Donna M Muzny; William E Nash; Joanne O Nelson; Michael N Nhan; Robert Nicol; Zemin Ning; Chad Nusbaum; Michael J O'Connor; Yasushi Okazaki; Karen Oliver; Emma Overton-Larty; Lior Pachter; Genís Parra; Kymberlie H Pepin; Jane Peterson; Pavel Pevzner; Robert Plumb; Craig S Pohl; Alex Poliakov; Tracy C Ponce; Chris P Ponting; Simon Potter; Michael Quail; Alexandre Reymond; Bruce A Roe; Krishna M Roskin; Edward M Rubin; Alistair G Rust; Ralph Santos; Victor Sapojnikov; Brian Schultz; Jörg Schultz; Matthias S Schwartz; Scott Schwartz; Carol Scott; Steven Seaman; Steve Searle; Ted Sharpe; Andrew Sheridan; Ratna Shownkeen; Sarah Sims; Jonathan B Singer; Guy Slater; Arian Smit; Douglas R Smith; Brian Spencer; Arne Stabenau; Nicole Stange-Thomann; Charles Sugnet; Mikita Suyama; Glenn Tesler; Johanna Thompson; David Torrents; Evanne Trevaskis; John Tromp; Catherine Ucla; Abel Ureta-Vidal; Jade P Vinson; Andrew C Von Niederhausern; Claire M Wade; Melanie Wall; Ryan J Weber; Robert B Weiss; Michael C Wendl; Anthony P West; Kris Wetterstrand; Raymond Wheeler; Simon Whelan; Jamey Wierzbowski; David Willey; Sophie Williams; Richard K Wilson; Eitan Winter; Kim C Worley; Dudley Wyman; Shan Yang; Shiaw-Pyng Yang; Evgeny M Zdobnov; Michael C Zody; Eric S Lander
Journal: Nature Date: 2002-12-05 Impact factor: 49.962

2. JIGSAW: integration of multiple sources of evidence for gene prediction.

Authors: Jonathan E Allen; Steven L Salzberg
Journal: Bioinformatics Date: 2005-08-02 Impact factor: 6.937

3. A new strategy to identify novel genes and gene isoforms: Analysis of human chromosomes 15, 21 and 22.

Authors: Matteo Rè; Flavio Mignone; Michele Iacono; Giorgio Grillo; Sabino Liuni; Graziano Pesole
Journal: Gene Date: 2005-12-15 Impact factor: 3.688

4. PRESTA: associating promoter sequences with information on gene expression.

Authors: Václav Mach
Journal: Genome Biol Date: 2002-08-21 Impact factor: 13.583

5. Human-mouse alignments with BLASTZ.

Authors: Scott Schwartz; W James Kent; Arian Smit; Zheng Zhang; Robert Baertsch; Ross C Hardison; David Haussler; Webb Miller
Journal: Genome Res Date: 2003-01 Impact factor: 9.043

6. CSTminer: a web tool for the identification of coding and noncoding conserved sequence tags through cross-species genome comparison.

Authors: Tiziana Castrignanò; Alessandro Canali; Giorgio Grillo; Sabino Liuni; Flavio Mignone; Graziano Pesole
Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971

7. Computational identification of protein coding potential of conserved sequence tags through cross-species evolutionary analysis.

Authors: Flavio Mignone; Giorgio Grillo; Sabino Liuni; Graziano Pesole
Journal: Nucleic Acids Res Date: 2003-08-01 Impact factor: 16.971

8. Rfam: annotating non-coding RNAs in complete genomes.

Authors: Sam Griffiths-Jones; Simon Moxon; Mhairi Marshall; Ajay Khanna; Sean R Eddy; Alex Bateman
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

9. DG-CST (Disease Gene Conserved Sequence Tags), a database of human-mouse conserved elements associated to disease genes.

Authors: Angelo Boccia; Mauro Petrillo; Diego di Bernardo; Alessandro Guffanti; Flavio Mignone; Stefano Confalonieri; Lucilla Luzi; Graziano Pesole; Giovanni Paolella; Andrea Ballabio; Sandro Banfi
Journal: Nucleic Acids Res Date: 2005-01-01 Impact factor: 16.971

10. Highly conserved non-coding sequences are associated with vertebrate development.

Authors: Adam Woolfe; Martin Goodson; Debbie K Goode; Phil Snell; Gayle K McEwen; Tanya Vavouri; Sarah F Smith; Phil North; Heather Callaway; Krys Kelly; Klaudia Walter; Irina Abnizova; Walter Gilks; Yvonne J K Edwards; Julie E Cooke; Greg Elgar
Journal: PLoS Biol Date: 2004-11-11 Impact factor: 8.029

1 in total

1. A transcriptional sketch of a primary human breast cancer by 454 deep sequencing.

Authors: Alessandro Guffanti; Michele Iacono; Paride Pelucchi; Namshin Kim; Giulia Soldà; Larry J Croft; Ryan J Taft; Ermanno Rizzi; Marjan Askarian-Amiri; Raoul J Bonnal; Maurizio Callari; Flavio Mignone; Graziano Pesole; Giovanni Bertalot; Luigi Rossi Bernardi; Alberto Albertini; Christopher Lee; John S Mattick; Ileana Zucchi; Gianluca De Bellis
Journal: BMC Genomics Date: 2009-04-20 Impact factor: 3.969

1 in total