Philippe Leroy, Nicolas Guilhot, Hiroaki Sakai, Aurélien Bernard, Frédéric Choulet, Sébastien Theil, Sébastien Reboux, Naoki Amano, Timothée Flutre, Céline Pelegrin, Hajime Ohyanagi, Michael Seidel, Franck Giacomoni, Mathieu Reichstadt, Michael Alaux, Emmanuelle Gicquello, Fabrice Legeai, Lorenzo Cerutti, Hisataka Numa, Tsuyoshi Tanaka, Klaus Mayer, Takeshi Itoh, Hadi Quesneville, Catherine Feuillet.
Abstract
In support of the international effort to obtain a reference sequence of the bread wheat genome and to provide plant communities dealing with large and complex genomes with a versatile, easy-to-use online automated tool for annotation, we have developed the TriAnnot pipeline. Its modular architecture allows for the annotation and masking of transposable elements, the structural and functional annotation of protein-coding genes with evidence-based quality indexing, and the identification of conserved non-coding sequences and molecular markers. The TriAnnot pipeline is parallelized on a 712 CPU computing cluster that can run a 1-Gb sequence annotation in less than 5 days. It is accessible through a web interface for small scale analyses or through a server for large scale annotations. The performance of TriAnnot was evaluated in terms of sensitivity, specificity, and general fitness using curated reference sequence sets from rice and wheat. In less than 8 h, TriAnnot was able to predict more than 83% of the 3,748 CDS from rice chromosome 1 with a fitness of 67.4%. On a set of 12 reference Mb-sized contigs from wheat chromosome 3B, TriAnnot predicted and annotated 93.3% of the genes, among which 54% were perfectly identified in accordance with the reference annotation. It also allowed the curation of 12 genes based on new biological evidence, increasing the percentage of perfect gene prediction to 63%. TriAnnot systematically showed a higher fitness than other annotation pipelines that are not improved for wheat. As it is easily adaptable to the annotation of other plant genomes, TriAnnot should become a useful resource for the annotation of large and complex genomes in the future.
Keywords: cluster; gene models; pipeline; plant genome; structural and functional annotation; transposable elements; wheat
Year: 2012 PMID: 22645565 PMCID: PMC3355818 DOI: 10.3389/fpls.2012.00005
Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753
Figure 1. An overview of the workflow supported by the TriAnnot pipeline V3.5. The four main panels are displayed. Each panel contains modules and each module can use one or more bioinformatics programs and databanks. The detailed description of each panel and module is provided in the text. CNSs, conserved non-coding sequences; ncRNA, non-coding RNA; SSRs, simple sequence repeats or microsatellites; TEannot, pipeline for transposable elements annotation (REPET package – Quesneville et al., 2005).
Figure 2. Color-coded system established to provide a quality index for the gene annotation in TriAnnot. Six categories (Cat0–Cat5) have been defined depending on the approach and the biological evidence used for the analysis. FL-cDNAs, full-length cDNAs. The SIMnuc and SIMprot databanks are described in more detail in Table S2 in Supplementary Material.
Figure 3. Schematic representation of the master program (MP). “Tasks list”: list of tasks to be executed and their parameters (XML file). Each task may depend on the results produced by a preceding task and this information is also specified in the XML file. When all the dependencies are satisfied for a given task, it is submitted to the computing cluster by running a “Program Launcher” job (Run tasks). When the “Program Launcher” is completed, a “Parser Launcher” job is submitted (Run parsing) to generate GFF and EMBL files from the program output. These scripts update their status in a MySQL database and write XML files to summarize the execution result. The main program checks both the database (Check Status) and the result files (Check result files) to monitor running jobs. When all tasks are completed, the master program ends the pipeline (Finished).
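The dependency-driven control flow described in this caption can be illustrated with a minimal Python sketch. It is not the TriAnnot master program: the XML element and attribute names, the task identifiers, and the `run_task`/`run_parser` callbacks are all hypothetical, and the jobs run synchronously here instead of being submitted to a cluster and monitored through a database.

```python
import xml.etree.ElementTree as ET

# Hypothetical task list; element names, attribute names and task ids
# are illustrative only and do not reflect the real TriAnnot XML schema.
TASKS_XML = """
<tasks>
  <task id="mask_repeats"/>
  <task id="similarity_search" depends="mask_repeats"/>
  <task id="ab_initio" depends="mask_repeats"/>
  <task id="combine_models" depends="similarity_search ab_initio"/>
</tasks>
"""


def load_tasks(xml_text):
    """Return {task_id: set of task ids it depends on}."""
    root = ET.fromstring(xml_text)
    return {t.get("id"): set(t.get("depends", "").split())
            for t in root.iter("task")}


def run_pipeline(tasks, run_task, run_parser):
    """Run each task once all of its dependencies are done; return the order."""
    done, order = set(), []
    pending = dict(tasks)
    while pending:
        # A task is ready when every dependency has already completed.
        ready = sorted(t for t, deps in pending.items() if deps <= done)
        if not ready:
            raise ValueError("unsatisfiable dependencies: %s" % sorted(pending))
        for task_id in ready:
            run_task(task_id)    # stands in for the "Program Launcher" job
            run_parser(task_id)  # stands in for the "Parser Launcher" job
            done.add(task_id)    # stands in for the status update
            order.append(task_id)
            del pending[task_id]
    return order


log = []
order = run_pipeline(
    load_tasks(TASKS_XML),
    run_task=lambda t: log.append(("run", t)),
    run_parser=lambda t: log.append(("parse", t)),
)
print(order)
```

In this sketch a task is never started before the tasks it depends on have been both run and parsed, which mirrors the ordering the caption describes; a production scheduler would additionally poll job status asynchronously.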
Figure 4. Schematic representation of the different options to access and use TriAnnot for genome sequence annotation. The process for small scale analyses (individual BACs or a few BAC contigs) that are performed directly on the web is represented on the left hand side. The process that enables large scale analysis (several thousand sequences) through the automated download and annotation with direct manual curation in a CHADO database is described on the right hand side. The curation can be performed with the ARTEMIS, GenomeView, or APOLLO graphical editors. Curated annotation can then be displayed with a GBrowse graphical viewer over the internet. The future architecture of the pipeline with seven panels is represented on the Cluster.
Figure 5. GBrowse graphical display of a 117-kb sequence scaffold from the wheat chromosome 3B. The upper part shows the sequence and the window corresponding to the region for which annotation features are displayed in the central part. The bottom part presents the different databases that are used for the annotation. The ticked boxes indicate the databases that were used for the annotation of this sequence. The “Structural and Functional Gene Annotations” track represents the final gene models with the six color index categories described in Figure 2. All other tracks are biological or ab initio evidence. The GBrowse display is available only for a default analysis.
Comparisons of the fitness of TriAnnot with other well known annotation pipelines based on a reference dataset containing 145 genes (17.9 Mb of wheat chromosome 3B).
| Pipelines | Predicted genes | TP | SnG (%) | SpG (%) | SnE (%) | SpE (%) | Fitness (%) |
|---|---|---|---|---|---|---|---|
| FPGP | 304 | 69 | 46.6 | 22.7 | 71.3 | 58.3 | 45.8 |
| MIPS | 215 | 53 | 35.1 | 24.2 | 61.1 | 50.8 | 40.3 |
| RiceGAAS | 848 | 52 | 35.1 | 6.1 | 70.2 | 18.0 | 22.9 |
| TriAnnot, full analysis | 292 | 80 | 54.0 | 27.4 | 76.1 | 53.1 | 49.5 |
| TriAnnot, SIMsearch analysis only | 128 | 72 | 48.6 | 56.2 | 71.2 | 84.4 | 63.7 |
$\mathrm{Fitness} = (\mathrm{SnG} \times \mathrm{SpG} \times \mathrm{SnE} \times \mathrm{SpE})^{0.25}$.
Two analyses are shown for the TriAnnot pipeline: (1) a full analysis that follows the three approaches: SIMsearch (similarities), EuGene (combiner), and .
For SIMnuc and SIMprot see Table S2 in Supplementary Material. SnG, sensitivity at the gene level; SpG, specificity at the gene level; SnE, sensitivity at the exon level; SpE, specificity at the exon level. FPGP, flowering plant gene picker (.
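The fitness score is the geometric mean of the four accuracy measures (SnG, SpG, SnE, SpE). As a quick check, the short Python sketch below recomputes it from the percentages in the wheat comparison table; the function and variable names are illustrative, and the reported values are copied from the table as printed.

```python
def fitness(sn_gene, sp_gene, sn_exon, sp_exon):
    """Fourth root of the product of the four measures (all in percent)."""
    return (sn_gene * sp_gene * sn_exon * sp_exon) ** 0.25


# (SnG, SpG, SnE, SpE, reported fitness) copied from the wheat table.
wheat_rows = {
    "FPGP": (46.6, 22.7, 71.3, 58.3, 45.8),
    "MIPS": (35.1, 24.2, 61.1, 50.8, 40.3),
    "RiceGAAS": (35.1, 6.1, 70.2, 18.0, 22.9),
    "TriAnnot, full analysis": (54.0, 27.4, 76.1, 53.1, 49.5),
    "TriAnnot, SIMsearch analysis only": (48.6, 56.2, 71.2, 84.4, 63.7),
}

for name, (sng, spg, sne, spe, reported) in wheat_rows.items():
    computed = fitness(sng, spg, sne, spe)
    print(f"{name}: computed {computed:.2f}, reported {reported}")
```

Because the inputs are themselves rounded to one decimal place, the recomputed values can differ from the reported ones by about 0.1. A geometric mean penalizes a pipeline that scores poorly on any single measure, which is why a high exon sensitivity alone does not yield a high fitness when gene-level specificity is low.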
Evaluation of the TriAnnot fitness for the annotation of rice chromosome 1 using the IRGSP/RAP build5 dataset.
| Analysis | Predicted genes | TP | SnG (%) | SpG (%) | SnE (%) | SpE (%) | Fitness (%) |
|---|---|---|---|---|---|---|---|
| Analysis 1: 4,632 rice genes – with rice IRGSP and MSU genome annotation | 3,885 | 2,368 | 51.1 | 60.9 | 74.5 | 82.8 | 66.2 |
| Analysis 2: 4,632 rice genes – without rice IRGSP and MSU genome annotation | 3,387 | 2,050 | 44.3 | 60.5 | 69.2 | 81.2 | 62.3 |
| Analysis 3: 3,748 rice genes – without rice IRGSP and MSU genome annotation | 3,121 | 2,017 | 53.8 | 64.6 | 72.2 | 81.9 | 67.4 |
The TriAnnot annotation is compared with different sets of representative rice gene models using Eval as described for wheat. Analyses 1 and 2 were performed on a “corrected” dataset of 4,632 gene models. Analysis 1 included databases for rice comprising the IRGSP and MSU genome annotations, whereas analysis 2 was conducted under less optimal conditions (i.e., without rice IRGSP and MSU genome annotations). A second “corrected” set of 3,748 rice gene models was used to perform analysis 3 without the rice IRGSP and MSU genome annotations. The sensitivity (.
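The definition of sensitivity is cut off in this footnote. As a hedged illustration, the Python sketch below assumes the definitions conventionally used in gene-prediction benchmarks: gene-level sensitivity is the number of true positives (TP) divided by the number of reference genes, and gene-level specificity is TP divided by the number of predicted genes. The gene-level columns of the rice table are consistent with this assumption to within about 0.1 percentage points; the function name and data layout are illustrative.

```python
def gene_level_accuracy(true_positives, predicted, reference):
    """Return (sensitivity, specificity) in percent.

    Assumed definitions (not taken from the truncated footnote):
    sensitivity = TP / reference genes, specificity = TP / predicted genes.
    """
    sensitivity = 100.0 * true_positives / reference
    specificity = 100.0 * true_positives / predicted
    return sensitivity, specificity


# (TP, predicted genes, reference genes, reported SnG, reported SpG)
# copied from the rice chromosome 1 table.
rice_rows = {
    "Analysis 1": (2368, 3885, 4632, 51.1, 60.9),
    "Analysis 2": (2050, 3387, 4632, 44.3, 60.5),
    "Analysis 3": (2017, 3121, 3748, 53.8, 64.6),
}

for name, (tp, predicted, reference, sn_rep, sp_rep) in rice_rows.items():
    sn, sp = gene_level_accuracy(tp, predicted, reference)
    print(f"{name}: SnG {sn:.2f} (reported {sn_rep}), "
          f"SpG {sp:.2f} (reported {sp_rep})")
```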