Philippe Gouret, Vérane Vitiello, Nathalie Balandraud, André Gilles, Pierre Pontarotti, Etienne G J Danchin.
Abstract
BACKGROUND: Two of the main objectives of the genomic and post-genomic era are to structurally and functionally annotate genomes, which consists of detecting the position and structure of genes and inferring their function (as well as those of other genomic features). Structural and functional annotation both require the complex chaining of numerous software tools, algorithms and methods under the supervision of a biologist. Automating these pipelines is necessary to manage the huge amounts of data released by sequencing projects. Several pipelines already automate some of this complex chaining, but they still require substantial input from biologists to supervise and check the results at various steps.
Year: 2005 PMID: 16083500 PMCID: PMC1188056 DOI: 10.1186/1471-2105-6-198
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. FIGENIX software architecture. FIGENIX software servers can be distributed over several CPUs. Some servers, such as the "Annotation Engine" or the "Expert System", can be cloned and distributed over these CPUs (the Ensembl pipeline includes a similar approach, named "Computer Farm"). This allows load balancing within the FIGENIX platform. Fault tolerance is not yet implemented but could readily be integrated into this kind of architecture.
PROLOG rules: syntax and semantics, with an example
The rule a(X, Y) :- b(X), c, d(X, Y) can be read as follows: a is true for the values X and Y if b is true for X, c is true, and d is true for X and Y. Here is a very short example to explain how a PROLOG program works. Suppose we want to write a program that can verify that an element belongs to a list, or that can enumerate the elements of a list. In PROLOG we only have to describe the concept of "belonging".
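The idea that one description of "belonging" serves both to verify and to enumerate can be approximated outside PROLOG. The sketch below is a Python analogue, not PROLOG and not FIGENIX code; the function name `member` and the recursive-generator formulation are assumptions chosen for illustration.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def member(items: Sequence[T]) -> Iterator[T]:
    """Yield every X for which "X belongs to items" holds.

    Two cases describe the relation: X is the head of the list,
    or X belongs to the tail of the list.
    """
    if not items:
        return  # nothing belongs to the empty list
    yield items[0]                # case 1: X is the head
    yield from member(items[1:])  # case 2: X belongs to the tail


# Enumerate: ask for all solutions of the relation.
all_elements = list(member(["a", "b", "c"]))

# Verify: ask whether one particular value is among the solutions.
b_belongs = any(x == "b" for x in member(["a", "b", "c"]))
```

The same definition answers both questions; only the way it is queried changes, which is the property the PROLOG example is meant to convey.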
Figure 2. Phylogenomic inference pipeline. For more details about all the steps and functionalities automated in the pipeline, see the materials and methods sections of the 2002 and 2003 phylogenomic papers [29, 30]. From the query sequence, a dataset of putative homologous sequences is first built by BLAST [16] on a protein database such as NR. The raw dataset is filtered to eliminate potentially non-homologous sequences, sequences that disturb alignments, and duplicates. The user can choose to focus on a specific scope at any node of the tree of life (the vertebrates, the bilaterians...). In the next step, an alignment is produced with CLUSTALW [19]. The alignment is then modified to eliminate large gaps. Since phylogenetic analysis is done at the domain level, these domains are next detected with HMMPFAM [23]. For each domain alignment (extracted from the original alignment), a bias correction phase is run to eliminate: (i) non-monophyletic "repeats" in a tree built with the NJ [31] algorithm of the CLUSTALW software; (ii) sequences with a diverging composition, using the amino-acid composition test of the TREE-PUZZLE software [22] (with an alpha risk set to 5%); (iii) sites not under neutral evolution [35]. Once the domains are "purified", and after selection of congruent domains with the HOMPART test from the PAUP package [20], a new alignment is built by merging the preserved parts of the domain alignments. From this alignment, three phylogenetic trees are generated using the NJ, ML (with TREE-PUZZLE [22]) and MP (with the PAUP [20] package) methods. The topologies of these trees are compared with the PSCORE command ("Templeton winning sites" test) from the PAUP package and the KISHINO-HASEGAWA [34] test from the TREE-PUZZLE package, and the trees are then fused into a single consensus tree. By comparing this consensus protein tree with a reference species tree (the tree of life from NCBI [26]), we then deduce the proteins orthologous to the query sequence.
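The last step of the pipeline, deducing orthologs from a gene tree and species information, can be illustrated with a deliberately simplified heuristic. The sketch below does not reproduce FIGENIX's comparison with the NCBI tree of life; it swaps in a species-overlap rule (a node whose two sides share a species is treated as a duplication, otherwise as a speciation). The tree encoding, the function names, and the toy gene names are all assumptions.

```python
from typing import List, Set, Tuple

Leaf = Tuple[str, str]  # (gene name, species); internal nodes are lists


def is_leaf(node) -> bool:
    return isinstance(node, tuple)


def leaves(node) -> List[Leaf]:
    """All (gene, species) leaves below a node."""
    if is_leaf(node):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def species(node) -> Set[str]:
    return {sp for _, sp in leaves(node)}


def orthologs(tree, query: str) -> List[str]:
    """Genes separated from `query` only by speciation-like nodes.

    Walk from the root towards the query. At each internal node, a
    sibling subtree sharing no species with the query's subtree is
    treated as a speciation partner (its genes are reported); shared
    species are treated as evidence of a duplication (genes skipped).
    """
    found: List[str] = []
    node = tree
    while not is_leaf(node):
        with_query = next(
            c for c in node if query in {g for g, _ in leaves(c)}
        )
        for sibling in node:
            if sibling is with_query:
                continue
            if species(sibling).isdisjoint(species(with_query)):
                found.extend(g for g, _ in leaves(sibling))
        node = with_query
    return found


# Toy tree (hypothetical names): a duplication in the vertebrate
# ancestor gave A1 and A2; the fly has a single copy.
toy_tree = [
    [
        [("A1_human", "human"), ("A1_mouse", "mouse")],
        [("A2_human", "human"), ("A2_mouse", "mouse")],
    ],
    ("A_fly", "fly"),
]
result = orthologs(toy_tree, "A1_human")
```

On this toy tree the A2 genes are skipped as paralogs of the query, while the fly gene and the mouse A1 gene are reported. A real implementation would reconcile the gene tree with a reference species tree rather than rely on species overlap alone.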
Figure 3. Task creation and running (GUI). A phylogenomic inference task on the human Notch1 protein is shown. The graph associated with the phylogenomic pipeline is displayed on the left part of the figure as a graphical tree. We introduced a virtual concept of "step of work" that allows a cyclic directed graph to be shown as a tree. At each step, one or several units can act (e.g. at the step named "Protein", the unit "sequenceProvider", whose role is to read protein sequences from a file, will work). At the level immediately following the current unit are the units that will be activated as its continuation (e.g. the "BLAST" unit follows "sequenceProvider" because the first treatment executed on a protein is the BLAST search). At the same graphical level as the nodes related to a unit are the parameters that can be customized for this unit (e.g. for the "sequenceProvider" unit, the parameter "taxeid" (the query sequence's taxon) or the parameter "$filePath" (path to the file containing the proteins to be analyzed)). The task given as an example in the figure was running when the screenshot was taken. Units that have finished their work are shown in green, those that are running in red, and those that are not running in blue. The buttons on the right part of the figure indicate that the presented task is an instance of the pipeline model named "ProtPhyloGenix" (the one that produces phylogenomic inference studies for proteins), can be interrupted at any time, can be cloned (when the user wants to run it again after modifying only a few parameters), and can be explored through the scientific results web pages already produced during execution.
Data allowing the expert system to decide what kind of fusion must be done
In this conceptual example, Templeton and Kishino-Hasegawa tests' results indicate that the Maximum Parsimony phylogenetic tree's topology is the "best" one and only the Neighbor Joining topology is congruent with this one.
All possible cases provided by tree topologies comparison tests
| Best | >=0.05 | >=0.05 | Best | Best | Best | |
| >=0.05 | Best | >=0.05 | <0.05 | >=0.05 | <0.05 | |
| >=0.05 | >=0.05 | Best | >=0.05 | <0.05 | <0.05 | |
T indicates support from the Templeton test. K indicates support from the Kishino-Hasegawa test. Test values >=0.05 indicate that fusion is possible for the reconstruction method considered.
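The threshold rule in the note above can be written as a small decision function. This is a minimal sketch for a single test, assuming the best topology is marked with `None` and the others carry the p-value of their comparison with it; how FIGENIX combines the Templeton and Kishino-Hasegawa results is not specified here, and the function name and example p-values are hypothetical.

```python
from typing import Dict, List, Optional

ALPHA = 0.05  # threshold stated in the table note


def fusable_methods(p_values: Dict[str, Optional[float]]) -> List[str]:
    """Return the methods whose topologies can be fused.

    `p_values` maps a reconstruction method to the p-value of its
    topology against the best one; the best topology itself is
    marked with None. A value >= ALPHA means fusion is possible.
    """
    best = [m for m, p in p_values.items() if p is None]
    if len(best) != 1:
        raise ValueError("exactly one topology must be marked as best")
    congruent = [
        m for m, p in p_values.items() if p is not None and p >= ALPHA
    ]
    return best + congruent


# Illustrative values only: MP is best, NJ is congruent, ML is not.
example = fusable_methods({"NJ": 0.12, "MP": None, "ML": 0.01})
```

With these illustrative values the function returns the MP and NJ topologies, matching the conceptual example in which only the Neighbor Joining topology is congruent with the best (Maximum Parsimony) one.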
Interpretation of phylogenetic trees topologies comparison tests
| Case 1: | Case 3: | Case 5: | |
| Case 2: | Case 9: | Case 6: | |
| Case 4: | Case 7: | Case 8: |
"T" indicates support from the Templeton test. "K" indicates support from the Kishino-Hasegawa test. "A" means support from all the tests. Suffix "1" indicates full congruence, "2" partial congruence, "3" non-congruence.
Figure 4. Intermediate domain tree (NJ). This tree, built with the Neighbor Joining method, is used by the expert module to detect paralogy groups. The reconstruction was made with human Notch1 as the query on the NCBI NR database. Here there are three significant groups, tagged "G" on the figure (each label ends with the species taxon).
The 8 pipeline models currently available in FIGENIX
| Pipeline Name | Pipeline Purpose |
| --- | --- |
| ProtPhyloGenix | The phylogenomic functional inference pipeline shown in this paper and detailed in the supplement. |
| TwinBaseMatix | Builds a FASTA database, eliminating redundant sequences obtained from two different query databases. For example, it merges proteins from the NR and Ensembl databases and eliminates duplicates. |
| BaseProtPhylogenix | Composition of the two previous pipelines. This pipeline first builds a temporary protein database (merging two different databases and eliminating duplicates). The phylogenomic inference process is then run using the resulting database. |
| TwinESTMatix | Builds a FASTA database, combining sequences obtained from a filtered database on the one hand and from a database of automatically clustered ESTs on the other. For example, it allows proteins from NR to be combined with translations of EST contigs from the NCBI dbEST database. |
| BaseESTPhylogenix | Composition of the TwinESTMatix and ProtPhyloGenix pipelines. Phylogenomic inference on FASTA databases built with TwinESTMatix. This allows construction of phylogenetic trees mixing proteins and translated EST contigs. |
| GenePredix | Runs our structural annotation method (combining ab initio and homology information) on DNA sequences of up to ~50 kb (due to current computational power limitations) to predict genes. For larger DNA sequences, SlidingGenePredix can be used. |
| SlidingGenePredix | Applies the GenePredix pipeline on a sliding window. This allows gene prediction on larger DNA sequences and bypasses the ~50 kb limitation. |
| PhyloGenix | Composition of the GenePredix and ProtPhyloGenix pipelines. This model allows automatic structural and functional annotation of DNA sequences: it predicts genes in DNA sequences using GenePredix, and then performs phylogenomic functional inference for each putative gene using ProtPhyloGenix. |
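The sliding-window idea behind SlidingGenePredix can be sketched as a generator of window coordinates. Only the ~50 kb window size comes from the table; the step size (and therefore the overlap), the function name, and the half-open coordinate convention are assumptions, and the sketch says nothing about how predictions from overlapping windows are merged.

```python
from typing import Iterator, Tuple


def sliding_windows(
    seq_len: int, window: int = 50_000, step: int = 25_000
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) intervals covering seq_len bases.

    Consecutive windows overlap by (window - step) bases so that a
    gene cut by one window boundary lies inside a neighbouring window.
    """
    if window <= 0 or step <= 0 or step > window:
        raise ValueError("need 0 < step <= window")
    if seq_len <= 0:
        return
    start = 0
    while True:
        end = min(start + window, seq_len)
        yield start, end
        if end >= seq_len:
            break  # the last window reaches the end of the sequence
        start += step


# A 120 kb sequence is covered by four overlapping windows.
windows = list(sliding_windows(120_000))
```

Each interval could then be passed to a GenePredix-like predictor. Requiring `step <= window` guarantees that no base falls between two windows.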
Performance of two ab initio methods vs. FIGENIX's structural annotation method
| 0.55 | 0.80 | 0.65 | 0.22 | 0.31 | |
| 0.75 | 0.81 | 0.78 | 0.15 | 0.38 | |
| 0.91 | 0.92 | 0.95 | 0.05 | 0.87 |
Performance was measured on a modified version of the HMR195 [53] dataset. The new dataset contains 55 sequences from the mouse and rat genomes. They were annotated with Genscan, HMMGene and FIGENIX (with the human section of Swissprot [54] as the reference database for the homology-based approach).
Figure 5. Consensus phylogenetic tree of the Notch family. The tree is midpoint rooted. At the root of the tree, an "npl_A" label means that the tree results from the fusion of three trees reconstructed independently with the Neighbor Joining, Maximum Parsimony, and Maximum Likelihood methods. In this case, the fusion is done on the NJ topology (branch lengths can be displayed but are not shown here, to keep the tree easily readable). That means that the topologies are strongly congruent. Bootstrap values are given for the three methods when a node exists identically in the three trees. Sometimes a node exists in only two trees or only in the Neighbor Joining tree; e.g. a bootstrap label of 100_*_99 means that the node exists in the NJ tree with a bootstrap value of 100 and in the ML tree with a bootstrap value of 99, but does not exist in the MP tree.
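The combined bootstrap labels described in this caption are easy to decode programmatically. The following is a minimal sketch, assuming the three underscore-separated fields follow the NJ, MP, ML order suggested by the "npl" prefix and by the 100_*_99 example; the function name is hypothetical.

```python
from typing import Dict, Optional

# Field order assumed from the "npl" prefix and the caption's example.
METHODS = ("NJ", "MP", "ML")


def parse_support(label: str) -> Dict[str, Optional[int]]:
    """Split a label such as "100_*_99" into per-method bootstrap values.

    A "*" field means the node does not exist in that method's tree
    and is returned as None.
    """
    parts = label.split("_")
    if len(parts) != len(METHODS):
        raise ValueError(f"expected {len(METHODS)} fields, got {label!r}")
    return {
        method: (None if part == "*" else int(part))
        for method, part in zip(METHODS, parts)
    }


support = parse_support("100_*_99")
```

For the caption's example, the NJ and ML entries hold 100 and 99 and the MP entry is `None`, which makes it straightforward to filter nodes supported by all three methods.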
Figure 6. Human Notch1 functional report. The browser window shows a Web page with part of an automatically generated functional report. One of the orthologs (NTC1_MOUSE) of the query sequence (NOTCH_HSA) is shown, including some associated functional terms. At the end of each phylogenomic pipeline (Figure 5), after orthologs have been detected on the consensus tree, an additional process is run. The goal of this process is to search the Web for experimentally verified functional data on proteins orthologous to the studied sequence. An HTML report synthesizing the retrieved functional data is then built. It includes links to the Web databases and publications associated with the retrieved functional terms. The current implementation of this system manages data coming from GENE ONTOLOGY [51], MGD [52], and EST expression data available on the NCBI Web site. The system is open to the integration of other data sources.
Specific differences between FIGENIX's phylogenomic inference pipeline and other software
| Homologous sequences search on any NCBI-formatted database including nr, Swissprot and Ensembl. | Homologous sequences search limited to Swissprot and trEMBL. | Homologous sequences search on any NCBI-formatted database including nr. |
| Choice of the scope of phylomes by the user (root = all phylomes by default) | No choice of the scope of phylomes by the user. | Choice of the scope of phylomes by the user. |
| Automatic detection of domains on the query sequence. | Manual input of a domain that must be present in pfam and for which pairwise distances must have been precalculated. | Phylogenetic reconstruction at BLAST's high scoring pairs (HSPs) level converted after corrections in multiple sequence alignment (MSA). |
| Expert system selection of domains and repeats whose evolutionary behaviour is congruent. | Phylogenetic reconstruction on a single domain provided by the user. | No test for domain congruence. Phylogenies constructed on an alignment corrected with a HMM profile. |
| When no domain is found, phylogenetic reconstruction on the "alignable" portion of the query sequence. | No reconstruction possible when no known domain is present on the query sequence. | Phylogenetic reconstruction possible regardless of the presence of a known domain on the query sequence. |
| Elimination of sites not evolving under neutral evolution. | No elimination of sites producing biases in phylogenetic reconstruction. | No elimination of sites producing biases in phylogenetic reconstruction. |
| Elimination of sequences having a divergent amino acid composition. | No elimination of sequences with divergent composition. | No test for sequence composition, but selection of sequences producing significant alignments with the query HMM. |
| Phylogenetic reconstruction with three different methods and projection on a consensus tree. | Phylogenetic reconstruction with one single method (NJ). | Choice of reconstruction method (NJ by default) but only one method at a time and no fusion with multiple methods. |
| Comparison of the consensus tree with NCBI reference tree of life containing around 200,000 taxa. | Comparison of the NJ tree with a reference tree of life containing around 2,500 taxa. | Comparison of the one-method tree with NCBI reference tree of life containing around 200,000 taxa. |
| Automatic detection of speciation and duplications, of orthologs and paralogs. | Automatic detection of speciation and duplications, of orthologs and paralogs. | Functionality not available. Possibility to scan a database of trees for a given topology. |
| Automatic extraction of experimentally verified functional information for all detected orthologs and paralogs. | Functionality not available | Functionality not available |
Comparison of homology inference between FIGENIX's pipeline and Homologene
| Notch | Human Notch1 | Notch2, Notch3, and Notch4 are not detected as paralogs of Notch1. | Notch2, Notch3, and Notch4 are not detected as co-orthologous to Drosophila N. | Amphibian Ray-finned fish Cephalochordata Arachnida | 3 different C.elegans genes are detected for Hs Notch1, Notch2, and Notch3, suggesting that duplications giving rise to this family took place before the divergence between protostomes and deuterostomes, and that Notch2, and Notch3 were lost in Drosophila. |
| Calnexin/Calreticulin | Human Calnexin | Calmegin and Calreticulin are not detected as paralogs of Calnexin. | Calmegin is not detected as a Human co-ortholog to Drosophila CG9906 gene. | Amphibian Ray-finned fish | Calmegin is detected to be orthologous to a Drosophila gene other than CG9906, suggesting that Calmegin and Calnexin already existed as two duplicates before the divergence between protostomes and deuterostomes and that Calmegin was secondarily lost in C. elegans. |
| ENPEP/TRHDE/LNPEP/ERAP/LRAP/ANPEP | Human TRHDE | ENPEP, LNPEP, ERAP, LRAP, and ANPEP are not detected as paralogous to TRHDE. | None | None | Each human gene of this family has been assigned a distinct ortholog in protostomes (e.g. Drosophila) suggesting this multigenic family emerged before the separation of Protostomes and Deuterostomes. |
| PSMB5/PSMB8 | Human PSMB5 | PSMB8 is not detected as paralogous to PSMB5. | PSMB8 is not detected as co-orthologous to the same Drosophila gene as PSMB5. | Ray-finned fish Avian Cephalochordata. Amphibian | PSMB5 and PSMB8 are each assigned a distinct Drosophila ortholog, suggesting they already existed as two copies in the last common ancestor of human and Drosophila. |
| PSMB7/PSMB10 | Human PSMB7 | PSMB10 is not detected as paralogous to PSMB7. | PSMB10 is not detected as co-orthologous to the same Drosophila gene as PSMB7. | Ray-finned fish Avian Cephalochordata. Amphibian | PSMB7 and PSMB10 are each assigned a distinct Drosophila ortholog, suggesting they already existed as two copies in the last common ancestor of human and Drosophila. |
| Cathepsins L, M, P, R | Human Cathepsin R | Cathepsins L, M and P are not detected as paralogous to Cathepsin R. | None | Amphibian Avian Ray-finned fish | Each cathepsin gene is assigned a distinct Drosophila ortholog, suggesting the cathepsin family emerged before the separation between human and Drosophila. |
| Tpp2 | Human Tpp2 | None (not a multigenic family) | None | Drosophila | None |
| ERP57 (GRP58) | Human GRP58 | None (not a multigenic family) | None | Fungi Bovine Schistosoma Avian | None |
| HSPA5 (GRP78) | Human HSPA5 | None (not a multigenic family) | None | Amphibian Aplysia Lepidopteran Avian Schistosoma | None |
| TAP1, TAP2, ABCB9, MDR1 | Human TAP1 | TAP2, ABCB9, and MDR1 are not detected as paralogous to TAP1. | None | Drosophila Avian Amphibian Ray-finned fish | TAP1 and TAP2 are each assigned a distinct C.elegans ortholog and none in Drosophila, suggesting there were already two copies of these genes in the last common ancestor of these two species, and that the two copies were secondarily lost in the Drosophila lineage. |
| PSME1, PSME2, PSME3 | Human PSME1 | PSME2, and PSME3 are not detected as paralogous to PSME1. | None | Protostomes Ray-finned fish | None |
| THOP1, NLN | Human THOP1 | NLN is not detected as paralogous to THOP1. | NLN is not detected as co-orthologous to the same N.crassa gene as THOP1. | Amphibian Bacteria | None |
*Query gene is identical to the Query gene we used for phylogenetic reconstruction with FIGENIX's phylogenomic inference pipeline.