Intikhab Alam, André Antunes, Allan Anthony Kamau, Wail Ba Alawi, Manal Kalkatawi, Ulrich Stingl, Vladimir B Bajic.
Abstract
BACKGROUND: Next-generation sequencing technologies have substantially increased the throughput of microbial genome sequencing. A variety of experimental and computational methods are used to functionally annotate newly sequenced microbial genomes, and integrating information from different sources is a powerful way to enhance such annotation. Functional analysis of microbial genomes, which is necessary for downstream experiments, depends crucially on this annotation, but it is hampered by the current lack of suitable information integration and exploration systems for microbial genomes.
Year: 2013 PMID: 24324765 PMCID: PMC3855842 DOI: 10.1371/journal.pone.0082210
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Workflow of annotation process and data warehousing.
Section (A) shows the steps of the annotation process. Section (B) shows the Perl-based conversion of annotations into XML, validated against the class attributes and data types defined in the genomic model. Section (C) shows the data warehouse development steps.
A comparison of features from different microbial data warehouses.
| Feature | INDIGO | InterMine | Integrated Microbial Genomes | Microbes Online | MicroScope |
|---|---|---|---|---|---|
| Chromosome/Contigs | Yes | Yes | Yes | Yes | Yes |
| Genes | Yes | Yes | Yes | Yes | Yes |
| Proteins | Yes | Yes | Yes | Yes | Yes |
| Expression data | No | Yes | Yes | Yes | Yes |
| Gene Ontology | Yes | Yes | Yes | Yes | Yes |
| KEGG Pathways | Yes | Yes | Yes | Yes | Yes |
| Interpro Domains | Yes | Yes | Yes | Yes | Yes |
| Cross references | Yes | Yes | Yes | Yes | Yes |
| Showing assigned KEGG pathway diagrams | Yes | No | No | No | No |
| Individual Feature (Gene/Protein/Pathway) list generation | Yes | Yes | Yes | Yes | Yes |
| Multiple Feature (Gene/Protein/Pathway) list generation | Yes | Yes | No | Yes, limited | Yes |
| Keyword search | Yes | Yes | Yes | Yes | Yes |
| Keyword search against all attributes | Yes | Yes | No | Yes | No |
| Filter keyword search results based on categories | Yes | Yes | No | Yes | Yes |
| Keyword search for feature list generation | Yes | Yes | No | No | Yes |
| BLAST search to feature list generation | Yes | No | Yes | Yes | Yes |
| Query builder to user selected all/multiple feature list generation | Yes | Yes | No | No | Yes |
| Save / share queries | Yes | Yes | No | Yes | Yes |
| Feature list analysis; GO enrichment | Yes | Yes | No | No | No |
| Feature list analysis; Pathway enrichment | Yes | Yes | No | No | No |
| Feature list analysis; Protein enrichment | Yes | Yes | No | No | No |
| Adding additional attribute to generated lists | Yes | Yes | No | No | No |
| List summary functions | Yes | Yes | No | No | No |
| List filtering functions | Yes | Yes | Yes | Yes | Limited |
| List export | Yes | Yes | Yes | Yes | Yes |
| Save / share lists | Yes | Yes | No | Yes | Yes |
| Genome Browser | Yes | Yes | Yes | Yes | Yes |
| Compare different genomic features, e.g. via keyword search | Yes | Yes | Yes | Yes | Yes |
| Compare sequences via BLAST | Yes | No | Yes | Yes | Yes |
| Compare genomes based on other tools | No | No | Yes | Yes | Yes |
| Web server based data access | Yes | Yes | Yes | Yes | Yes |
| Remote access via API (Perl, Java, Ruby, Python) | Yes | Yes | No | Yes | No |
| Bulk Download | Yes | Yes | Yes | Yes | Yes |
| User selected single feature list based download | Yes | Yes | Yes | Yes | Yes |
| User integrated feature list based download | Yes | Yes | No | No | Yes, limited. |
| Public microbial genome annotation | Yes | No | Yes | Limited; uses RAST and takes six months | Annotation_editor (manual) |
| User genome annotation job history | Yes | No | Yes | Yes | Manual |
| Operon finding | No | No | Yes | Yes | Yes |
| Promoter/terminator finding | No | No | Yes | No | Yes |
| RNA detection (rRNA/tRNA) | Yes | No | Yes | No | Yes |
| Protein gene prediction (multiple methods) | Yes | No | Yes | No | Yes |
| RNA vs. Protein overlap resolution | Yes | No | Yes | No | Yes |
| HPC BLAST for Proteins to UniProt | Yes | No | No | Yes | Yes |
| HPC BLAST for Proteins to NCBI NR | Yes | No | No | Yes | No |
| HPC BLAST for Proteins to NCBI COG | Yes | No | Yes | Yes | Yes |
| HPC BLAST for Proteins to NCBI CDD | Yes | No | No | No | No |
| HPC BLAST for Proteins to KEGG | Yes | No | Yes | Yes | Yes |
| HPC Interproscan domain finding for Proteins | Yes | No | Yes | Yes | Yes |
| Global Best Taxonomy (GBT) distribution analysis | Yes | No | No | No | No |
| Annotation data integration to GFF format | Yes | No | Yes | No | No |
| Annotation data integration to GenBank format | Yes | No | No | Yes | Yes |
| Annotation data integration to TBL format | Yes | No | No | No | Yes |
| Annotation data checking using tbl2asn | Yes | No | No | No | No |
| Annotation data process to NCBI sqn submission format | Yes | No | No | No | No |
| Annotation data packing into validated xml for data warehouse | Yes | No | No | No | No |
| Hierarchical classification of COG annotations and visualization | Yes | No | No | Yes | No |
| Hierarchical classification of GO annotations and visualization | Yes | No | No | No | No |
| Hierarchical classification of GBT annotations and visualization | Yes | No | No | No | No |
| Hierarchical classification of InterPro domains annotations and visualization | Yes | No | No | Yes | Yes |
| Hierarchical classification of ALL annotations and visualization | Yes | No | No | No | No |
| Immediate access to all data files and visualizations | Yes | No | No, SSO accounts | Yes | Yes |
Figure 2. Annotation comparison for E. coli O104 (TY2482) among the AAMG pipeline, BG7 and the reference annotation set from the Broad Institute.
For CDS annotation, AAMG ranks second, with only two fewer CDS regions annotated than BG7. AAMG performs best both in the annotation of orphan (hypothetical) CDS products, where fewer is better, and in the annotation of functional (non-hypothetical) CDS products, where more is better.
Results of AAMG annotations compared with NCBI or Broad Institute sets.
| Gene calls | AAMG | NCBI | AAMG | BROAD | AAMG | NCBI |
|---|---|---|---|---|---|---|
| CDS | 4340 | 4337 | 5208 | 5164 | 190 | 207 |
| rRNA | 22 | 22 | 22 | 22 | 3 | 3 |
| tRNA | 82 | 86 | 97 | 97 | 27 | 28 |
| Total | 4444 | 4445 | 5327 | 5288 | 220 | 238 |
| False Negatives | 235 | 236 | 50 | 11 | 4 | 22 |
| Functional genes | 3866 | 3730 | 4591 | 3502 | 182 | 191 |
| Orphan genes | 578 | 715 | 736 | 1786 | 38 | 47 |
| | Count | % | Count | % | Count | % |
| Detected* | 3876 | 87.20 | 5172 | 97.81 | 205 | 86.13 |
| Detected similar** | 333 | 7.49 | 105 | 1.99 | 11 | 4.62 |
| Not Detected | 236 | 5.31 | 11 | 0.21 | 22 | 9.24 |
| Total | 4445 | | 5288 | | 238 | |
* Genes are identical when both start and stop positions are exactly the same.
** Genes are similar if start or stop positions are in the same region with an offset up to 50 bases.
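
The two footnote definitions can be sketched as a small classifier. This is a minimal illustration, not the authors' comparison script: the `Gene` type and the `classify` function are hypothetical names, strand is ignored, and "similar" is read as "at least one boundary lies within 50 bases of the reference".

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """A gene call reduced to its coordinates (hypothetical helper type)."""
    start: int
    stop: int


def classify(pred: Gene, ref: Gene, max_offset: int = 50) -> str:
    """Compare a predicted gene with a reference gene.

    'identical': start and stop both match exactly.
    'similar':   start or stop lies within max_offset bases of the reference
                 (an assumed reading of the footnote).
    'different': neither boundary is close.
    """
    if pred.start == ref.start and pred.stop == ref.stop:
        return "identical"
    start_close = abs(pred.start - ref.start) <= max_offset
    stop_close = abs(pred.stop - ref.stop) <= max_offset
    if start_close or stop_close:
        return "similar"
    return "different"


print(classify(Gene(100, 400), Gene(100, 400)))
print(classify(Gene(130, 400), Gene(100, 400)))
print(classify(Gene(1000, 2000), Gene(100, 400)))
```

A stricter reading would require both boundaries to fall within the offset; swapping `or` for `and` implements that variant.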
Data warehouse development stages using InterMine.
| Stage | Task | Time taken |
|---|---|---|
| build database tables | build-db | 2 |
| data integration | prokredsea-HLPCO-largexml | 59 |
| data integration | prokredsea-HLRTI-largexml | 61 |
| data integration | prokredsea-SSPSH-largexml | 68 |
| data integration | Sequence ontology | 56 |
| data integration | interpro | 164 |
| data integration | Gene ontology | 1043 |
| post-processing | create-references | 28 |
| post-processing | make-spanning-locations | 21 |
| post-processing | create-chromosome-locations-and-lengths | 40 |
| post-processing | transfer-sequences | 89 |
| post-processing | create-bioseg-location-index | 15 |
| post-processing | create-attribute-indexes | 38 |
| post-processing | summarise-objectstore | 31 |
| post-processing | create-autocomplete-index | 20 |
| post-processing | create-search-index | 59 |
| Total time taken | | 1794 |
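
The build times in the table can be aggregated per stage to see where the time goes. The sketch below only re-adds the figures listed above; the variable and function names are illustrative, and the time unit is left unspecified because the table does not state it.

```python
# (stage, task, time taken) rows copied from the table above.
STAGE_TIMES = [
    ("build database tables", "build-db", 2),
    ("data integration", "prokredsea-HLPCO-largexml", 59),
    ("data integration", "prokredsea-HLRTI-largexml", 61),
    ("data integration", "prokredsea-SSPSH-largexml", 68),
    ("data integration", "Sequence ontology", 56),
    ("data integration", "interpro", 164),
    ("data integration", "Gene ontology", 1043),
    ("post-processing", "create-references", 28),
    ("post-processing", "make-spanning-locations", 21),
    ("post-processing", "create-chromosome-locations-and-lengths", 40),
    ("post-processing", "transfer-sequences", 89),
    ("post-processing", "create-bioseg-location-index", 15),
    ("post-processing", "create-attribute-indexes", 38),
    ("post-processing", "summarise-objectstore", 31),
    ("post-processing", "create-autocomplete-index", 20),
    ("post-processing", "create-search-index", 59),
]


def time_by_stage(rows):
    """Sum the time taken for each stage."""
    totals = {}
    for stage, _task, taken in rows:
        totals[stage] = totals.get(stage, 0) + taken
    return totals


TOTALS = time_by_stage(STAGE_TIMES)
GRAND_TOTAL = sum(TOTALS.values())
SLOWEST = max(STAGE_TIMES, key=lambda row: row[2])

for stage, taken in TOTALS.items():
    print(f"{stage}: {taken} ({100 * taken / GRAND_TOTAL:.1f}%)")
print(f"slowest task: {SLOWEST[1]} ({100 * SLOWEST[2] / GRAND_TOTAL:.1f}%)")
```

The per-stage sums reproduce the reported total of 1794, and loading Gene Ontology alone accounts for 1043 of those units, roughly 58% of the build.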
Figure 3. A) Keyword and B) query builder search interfaces to INDIGO.
The keyword search interface shows an example search for “benzoate degradation”. Results are categorized on the left side of the results page, showing the number of hits found for genes, domains, pathways, etc. These results are further categorized into hits per genome for the different organisms. Clicking on any of these categories shows the filtered results. The query builder interface has an option to include or constrain an annotation class attribute; here, the pathway name is constrained to “benzoate degradation” and the organism attribute ‘short name’ is constrained to “SSPSH”. The annotation class attributes included in the result list are the gene DB identifier, symbol, organism short name and pathway name. Users can select any of the available annotation class attributes, which makes it possible to integrate annotation from several different sources. Results of a constrained query builder search are shown as a list. Summary and filter options on the list page allow a user to analyze these results further.
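
The constrained query described in the caption can be written down as a small XML document in the style of InterMine path queries. The sketch below is only an illustration: the attribute paths (`Gene.primaryIdentifier`, `Gene.pathways.name`, `Gene.organism.shortName`) and the model name `genomic` are assumptions about INDIGO's data model, and `build_query` is a hypothetical helper rather than part of any client library.

```python
import xml.etree.ElementTree as ET


def build_query(view, constraints, model="genomic"):
    """Serialize output columns and constraints as a path-query style XML string.

    view:        list of attribute paths to show in the result list
    constraints: list of (path, operator, value) tuples
    """
    query = ET.Element("query", {"model": model, "view": " ".join(view)})
    for path, op, value in constraints:
        ET.SubElement(query, "constraint", {"path": path, "op": op, "value": value})
    return ET.tostring(query, encoding="unicode")


# Paths below are assumed names, chosen to mirror the caption.
QUERY_XML = build_query(
    view=[
        "Gene.primaryIdentifier",
        "Gene.symbol",
        "Gene.organism.shortName",
        "Gene.pathways.name",
    ],
    constraints=[
        ("Gene.pathways.name", "=", "benzoate degradation"),
        ("Gene.organism.shortName", "=", "SSPSH"),
    ],
)
print(QUERY_XML)
```

Keeping output columns (`view`) separate from constraints mirrors the two choices the query builder offers for each attribute: show it, constrain it, or both.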
Figure 4. Region search interface.
This figure shows features (genes) for a region specified by coordinates (Contig3:198625-229704) from the organism Haloplasma contractile (HLPCO). This region contains the cell division and cell wall (DCW) biosynthesis gene cluster. An integrated genome browser view, available via the Region search results page, shows the arrangement of genes in this region of the HLPCO contig. The table below this section shows the genome region, data export options, basic details of the features (genes), the feature types and their locations on the genome. The "create list by feature" link saves this gene list in the data warehouse for further analysis. The list is stored permanently if the user is logged in.
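
A region search of this kind reduces to parsing a `contig:start-end` string and keeping the features that overlap the interval. The following is a minimal sketch under that assumption; the `Feature` type, the function names and the sample gene names and coordinates are all hypothetical and do not come from INDIGO.

```python
import re
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    contig: str
    start: int
    end: int


REGION_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text):
    """Split 'Contig3:198625-229704' into (contig, start, end)."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a contig:start-end region: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"region start exceeds end: {text!r}")
    return match["contig"], start, end


def features_in_region(features, region):
    """Return features on the same contig that overlap the region (inclusive)."""
    contig, start, end = parse_region(region)
    return [
        f for f in features
        if f.contig == contig and f.start <= end and f.end >= start
    ]


# Made-up features for illustration only.
SAMPLE = [
    Feature("geneA", "Contig3", 198700, 199900),   # inside the region
    Feature("geneB", "Contig3", 229000, 230500),   # overlaps the right edge
    Feature("geneC", "Contig3", 300000, 301000),   # outside the region
    Feature("geneD", "Contig7", 200000, 201000),   # different contig
]
HITS = features_in_region(SAMPLE, "Contig3:198625-229704")
print([f.name for f in HITS])
```

The overlap test keeps features that only partly fall inside the window, which is the usual behaviour for a browser-style region view; requiring full containment would be a stricter alternative.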
Figure 5. A) Gene Ontology, B) protein domain and C) pathway enrichment analysis.
The figure shows a snapshot obtained when the term “cell cycle” was searched through the keyword search option and the resulting genes were saved as a list. The list page shows the enrichment of GO terms, protein domains and pathways in comparison to the rest of the data in INDIGO. The hits shown for each category can be saved as lists for further analysis.
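
Enrichment of a term in a gene list relative to a background is commonly scored with a one-sided hypergeometric test. The caption does not state which statistic INDIGO applies, so the sketch below is an assumption about the general technique rather than a description of the system; the function name and the example counts are illustrative, and no multiple-testing correction is included.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for a hypergeometric draw without replacement.

    N: genes in the background
    K: background genes annotated with the term
    n: genes in the user's list
    k: list genes annotated with the term
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Illustrative numbers only: 8 of 40 list genes carry a term that
# annotates 60 of 3000 background genes.
p_value = hypergeom_enrichment_p(k=8, n=40, K=60, N=3000)
print(f"p = {p_value:.3g}")
```

A small p-value indicates that the term occurs in the list more often than expected by chance; when many terms are tested at once, a correction such as Bonferroni or Benjamini-Hochberg would normally follow.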
Basic annotated features of the three Red Sea extremophiles in INDIGO.
| | | | | | |
|---|---|---|---|---|---|
| | 34 | 347868 | 3036 | 4 | 27 |
| | 72 | 58136 | 3287 | 3 | 40 |
| | 41 | 129079 | 3530 | 3 | 46 |
Top 10 pathways from each of the three extremophiles.
| Pathway | Genes | Pathway | Genes | Pathway | Genes |
|---|---|---|---|---|---|
| ABC transporters | 115 | Two-component system | 182 | ABC transporters | 88 |
| Purine metabolism | 69 | ABC transporters | 131 | Purine metabolism | 74 |
| Two-component system | 67 | Purine metabolism | 96 | Ribosome | 64 |
| Pyrimidine metabolism | 56 | Methane metabolism | 78 | Pyrimidine metabolism | 60 |
| Ribosome | 56 | Oxidative phosphorylation | 75 | Oxidative phosphorylation | 55 |
| Tyrosine metabolism | 52 | Butanoate metabolism | 73 | Amino sugar and nucleotide sugar metabolism | 53 |
| Amino sugar and nucleotide sugar metabolism | 50 | Benzoate degradation | 71 | Two-component system | 50 |
| Starch and sucrose metabolism | 49 | Fatty acid metabolism | 70 | Methane metabolism | 46 |
| Methane metabolism | 46 | Arginine and proline metabolism | 63 | Starch and sucrose metabolism | 40 |
| Histidine metabolism | 40 | Pyruvate metabolism | 60 | Cysteine and methionine metabolism | 39 |
Figure 6. Benzoate degradation in Salinisphaera shabanensis.
The genes from Salinisphaera shabanensis that INDIGO associates with the benzoate degradation pathway are shown in red. INDIGO provides a function, available for all pathways present in INDIGO, that generates a specific URL to automatically display KEGG Orthologs from INDIGO on pathway diagrams at the KEGG web server.
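
Generating such a URL amounts to joining a pathway map identifier with the list of KEGG Ortholog (KO) identifiers to highlight. The sketch below is hedged accordingly: the base URL and the `+`-separated query layout are assumptions about one KEGG URL style and may differ from what INDIGO emits, `ko00362` is assumed here to be the map identifier for benzoate degradation, and the KO identifiers are placeholders rather than the actual S. shabanensis hits.

```python
from urllib.parse import quote


def kegg_pathway_url(map_id, ko_ids,
                     base="https://www.kegg.jp/kegg-bin/show_pathway"):
    """Build a URL of the assumed form base?map_id+KO1+KO2+...

    Duplicate KO identifiers are dropped and the rest are sorted so the
    same gene list always yields the same URL.
    """
    parts = [map_id] + sorted(set(ko_ids))
    return base + "?" + "+".join(quote(part) for part in parts)


# Placeholder KO identifiers, for illustration only.
URL = kegg_pathway_url("ko00362", ["K00002", "K00001", "K00002"])
print(URL)
```

Because the mapping is encoded entirely in the URL, nothing needs to be stored on the KEGG side; the link can be regenerated from the gene list whenever the annotation changes.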