Christopher S Henry1,2, Claudia Lerma-Ortiz3, Svetlana Y Gerdes4,3, Jeffrey D Mullen4, Ric Colasanti4, Aleksey Zhukov3, Océane Frelin5, Jennifer J Thiaville3, Rémi Zallot3, Thomas D Niehaus5, Ghulam Hasnain5, Neal Conrad4, Andrew D Hanson5, Valérie de Crécy-Lagard6.
Abstract
BACKGROUND: Gene fusions are the most powerful type of in silico-derived functional associations. However, many fusion compilations were made when <100 genomes were available, and algorithms for identifying fusions need updating to handle the current avalanche of sequenced genomes. The availability of a large fusion dataset would help probe functional associations and enable systematic analysis of where and why fusion events occur.
Keywords: B vitamin pathways; Bottlenecks; Escherichia coli; Essential reactions; Gene fusions; Metabolic modeling
Year: 2016 PMID: 27342196 PMCID: PMC4921024 DOI: 10.1186/s12864-016-2782-3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Previous analyses of gene fusions
| No. of genomes | Organisms analyzed | No. of detected fused proteins | No. of predicted functional linkages** | Ref | Website | Fusion detection method*** | Homology or orthology-based? *** |
|---|---|---|---|---|---|---|---|
| 2 | EC, SC | - | 6,809 in EC; 45,502 in SC | [ | - | Gene fusion (BLAST) & domain fusion (ProDom) | All homologs (5 % most promiscuous domains removed) |
| 3 | EC, PH, SC | - | 854 in EC; 107 in PH; 918 in SC | [ | - | Gene fusion (BLAST) | All homologs |
| 4 | EC, HI, MJ, SC | 64 | - | [ | List of fusions a | Gene fusion (BLAST & S-W) | All homologs |
| 17 | Bact, Arch | 229 | - | [ | - | Gene fusion (S-W) | Orthologs only (BBH) |
| 24 | Bact, Arch (+SC) | 2,365 (621 families) | - | [ | - | Gene fusion (BLAST, component overlap <10 %) | All homologs |
| 30 | Bact, Arch (+SC) | 4,515 | - | [ | DB (not maintained) b; Fusion stats c | Gene fusion (BLAST) | Orthologs only (one link between each COG) |
| 89 | Bact, Arch | ∼20,000 | - | [ | FusionDB (not maintained) d | Gene fusion (BLAST) | Orthologs only (BBH) |
| 184 | Bact, Arch, Eukar | 130,229 | 2,192,019 | [ | Results for download e | Domain fusion (Pfam) | All homologs (promiscuous domains removed) |
| 20 | Bact, Arch, Eukar | 49 | - | [ | SAFE software; FED DB (not maintained) f | Gene fusion (BLAST) | All homologs (promiscuous domains removed) |
| 30 | Bact, Arch | 2,490 by MF; 5,339 by FT | - | [ | MosaicFinder; FusedTriplets software g | Gene fusion (BLAST) | Graph topology of seq. similarity network is used for scoring |
| 1,895* | Bact, Arch | user set-dependent, 2,193 in EC | - | [ | MicroScope h | n/a | Synteny based fusion detection |
| 2,031* | Bact, Arch, Eukar | user set-dependent | - | [ | String DB i | n/a | n/a |
| 2,291* | Bact, Arch (+SC) | - | 2,209,622 | [ | Prolinks j | Gene fusion (BLAST) | All homologs (promiscuous domains removed) |
| 31,442* | Bact, Arch, Eukar | user set-dependent, 397 in EC | - | [ | JGI IMG k | Gene fusion (USEARCH) | All homologs (as in [ |
| user set | Eukar | - | user set-dependent | [ | CODA software l | Domain fusion (Pfam) | All homologs (scoring immune to promiscuous domains) |
| 2 | Eukar (HS, SC) | 235 in HS; 189 in SC | - | [ | Domain Fusion DB m | Domain fusion (Pfam) | All homologs (promiscuous domains removed) |
| 1 | Eukar (TT) | 80 in TT | - | [ | DeFuser n | Domain fusion (KOG) | Compares N and C termini of query sequence to KOG DB |
The Table is modified and extended from Table 1 in Reid et al. [24]
Abbreviations: DB database, MF MosaicFinder software, FT FusedTriplets software, n/a information not available, S-W Smith-Waterman. Organisms: Bact Bacteria, Arch Archaea, Eukar Eukaryota, EC E. coli, HI H. influenzae, HS H. sapiens, MJ M. jannaschii, PH P. horikoshii, SC S. cerevisiae, TT T. thermophila
* Statistics as of November 2015
** Predicted potential protein-protein interactions (‘functional links’) based on gene fusion events; the actual fused proteins were NOT reported in some studies
*** Two main bioinformatics approaches to identify fusion events were used: whole protein sequence comparisons (‘gene fusion’) or domain family comparisons (‘domain fusion’)
a http://www.nature.com/nature/journal/v402/n6757/extref/402086a0-s2.html
b http://fusion.bu.edu
c http://www.pnas.org/content/98/14/7940/T1.expansion.html
d http://www.igs.cnrs-mrs.fr/FusionDB/
e http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2248599/#S8
f Contact Sofia KOSSIDA (sofia.kossida@igh.cnrs.fr)
g http://sourceforge.net/projects/mosaicfinder/
h https://www.genoscope.cns.fr/agc/microscope/compgenomics/fusfis.php?
i http://string-db.org/
j http://prl.mbi.ucla.edu/prlbeta/
k https://img.jgi.doe.gov
l ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/v12.0.0/coda/
m http://calcium.uhnres.utoronto.ca/pi/no_flash.htm
Fig. 1 Distribution of functions associated with gene fusion events in E. coli. Each of the 121 fused genes identified in E. coli was manually assigned to one of six categories based on its annotated function. The distribution of fusions among these categories is shown in the pie chart. Red numbers represent the total fusion counts and black numbers their respective percentages
Fig. 2 Distribution, variety, and frequency of fusion events in riboflavin biosynthesis. Riboflavin gene fusions and their identification data are given in Additional file 1: Tables S2A and S2B. Pathway enzymes are shown as yellow boxes, and their abbreviations are given in the bottom central panel. The domain name is given in parentheses inside each yellow box. Fusion partners are listed in the white boxes immediately below the corresponding pathway enzyme. Of these, domains of the riboflavin pathway are in black font, and unknown domains or domains belonging to other pathways are in blue font. Compounds are shown in ovals; their abbreviations are given in the bottom left panel. Highly reactive compounds are marked with a red oval. Participation in triple fusion events is flagged with colored squares inserted in the white boxes of the corresponding enzymes. The key to these squares is given in the bottom right panel. The variety of fusions of each riboflavin gene is shown in the top left inset, as a color range in which the orange color intensity is proportional to the number of binary fusion events in which each gene participates (see Additional file 1: Table S2B). The frequency of fusions of each riboflavin gene is shown in the top right inset. This frequency is expressed as a percentage, calculated as described in Methods, and is represented by a color range in which the blue color intensity is proportional to the percentage for each riboflavin gene. Enzymes that participate in only a few fusions are colored grey in both insets
Fig. 3 Distribution, variety, and frequency of fusion events in thiamin biosynthesis. The fusion architectures involving these genes and their identification data are given in Additional file 1: Tables S3A and S3B. The thiamin biosynthesis pathway enzymes, reactions, intermediates, and fusion architectures are illustrated following the same representation rules as in Fig. 2. The variety of fusions of each thiamin biosynthesis gene is shown in the top left inset, as a color range in which the orange color intensity is proportional to the number of binary fusion events in which each gene participates (see Additional file 1: Table S3B; color key in the left panel under the compound abbreviations). The frequency of fusions of each thiamin biosynthesis gene is shown in the top right inset. This frequency is expressed as a percentage, calculated as described in Methods, and is represented by a color range in which the blue color intensity is proportional to the percentage for each thiamin gene (color key in the right panel under the compound abbreviations)
Criteria used to filter true fusions from false positives
| ID | Criteria | Biological meaning |
|---|---|---|
| 1 | Protein length must exceed 600 amino acid residues | Fusion proteins should be longer than single-domain proteins |
| 2 | All non-overlapping CDDs together must align to at least 40 % of the gene length | Fused-domains should cover the full length of the fused gene |
| 3 | A minimum alignment length of 50 for all non-overlapping CDDs | Fused-domains should represent entire genes and should not be overly short |
| 4 | Gap between fused domains must be at least 60 residues and 10 % of gene length from end of gene | Point of fusion should be fairly centrally located in fused gene |
| 5 | At least two distinct CDD sets represented in the gene | Fused domains should not belong to the same CDD |
| 6 | Less than half of the CDD alignments for the gene should cross the gap between fused domains | A fused gene should be characterized more as a fusion of multiple domains than as a match to a single domain |
| 7 | All non-overlapping CDDs must co-occur with fewer than 1500 different CDD sets | Fused domains should not be overly promiscuous |
| 8 | Fewer than 1000 matches among the non-overlapping CDDs | Fused domains should be different from one another |
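The length- and coverage-based criteria in the table above lend themselves to a small illustration. The following is a minimal sketch, not the authors' implementation: it checks only criteria 1, 2, 3, and 5, and the `DomainHit` record, the `failed_criteria` function, and the use of a plain `cdd_id` string as a stand-in for a "CDD set" are assumptions made for this example. Criteria 4 and 6 to 8 are omitted because they depend on details (gap definition, co-occurrence counts) not given in this section.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DomainHit:
    """One non-overlapping CDD alignment on a protein (hypothetical record)."""
    cdd_id: str  # stand-in for the CDD set the hit belongs to (assumption)
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def failed_criteria(protein_length: int, hits: List[DomainHit]) -> List[int]:
    """Return the IDs of the table criteria (1, 2, 3, 5 only) that are violated."""
    failed = []
    # Criterion 1: protein length must exceed 600 residues.
    if protein_length <= 600:
        failed.append(1)
    # Criterion 2: non-overlapping hits together cover at least 40 % of the gene.
    covered = sum(h.length for h in hits)
    if covered * 100 < 40 * protein_length:
        failed.append(2)
    # Criterion 3: every non-overlapping hit has an alignment length of at least 50.
    if any(h.length < 50 for h in hits):
        failed.append(3)
    # Criterion 5: at least two distinct CDD sets are represented.
    if len({h.cdd_id for h in hits}) < 2:
        failed.append(5)
    return failed


candidate = [DomainHit("cddA", 10, 300), DomainHit("cddB", 380, 700)]
print(failed_criteria(720, candidate))
```

Returning the list of violated criterion IDs, rather than a single boolean, makes it easier to see which filter rejects a candidate when tuning thresholds.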
Fig. 4 Workflow of our fusion prediction algorithm. Previous protein-domain-based algorithms (see Table 4) overlap with the first three steps of our own algorithm, and other algorithms often include length-based (step 6) or domain promiscuity-based (step 11) criteria. Our algorithm is unique in its application of all these criteria with these specific parameters
Cases where a fusion of a domain of unknown function to a B vitamin gene led to a functional discovery
| Domain | Vitamin pathway | Molecular function | ref |
|---|---|---|---|
| COG3236 | Riboflavin | | [ |
| DUF89 | CoA | Phosphatase | [ |
| DUF1537 | PLP | Kinase | [ |
| Tnr3/Nudix | Thiamin | Pyrophosphatase | [ |
| COG1058 | Niacin | Pyrophosphatase | [ |
| Human CoaD | CoA | Adenyl transferase | [ |
| TenA-HAD | Thiamin | Hydrolase | unpublished |
| HAD-IA | Thiamin | Hydrolase | [ |
| HAD-IB | Thiamin | Hydrolase | [ |
Fig. 5 Fusion occurrences across genomes and subsystems. (a) Our fusion algorithm predicted 3.9 million fusions across 11,473 genomes, with the number of fusion events per genome being broadly proportional to the number of genes in the genome. (b) The annotations of these predicted fusions come from a wide range of SEED subsystem classes. Here we show the distribution of predicted fusion events among the 32 prominent subsystem classes. Two distinct measures of fusion prevalence are displayed: (i) the fraction of distinct functional roles in the subsystem that are classified as frequently fused (red bars); and (ii) the fraction of genes associated with any role in the subsystem that are fused (blue bars)
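The two prevalence measures in panel b of Fig. 5 can be written down compactly. The sketch below is illustrative only: the function name `fusion_prevalence`, the `(role, is_fused)` input format, and the toy role names are assumptions, and the set of "frequently fused" roles is taken as a given input because its definition is in Methods rather than in this section.

```python
from typing import Iterable, Set, Tuple


def fusion_prevalence(
    genes: Iterable[Tuple[str, bool]],
    frequently_fused_roles: Set[str],
) -> Tuple[float, float]:
    """Compute both prevalence measures for one subsystem class.

    genes: (functional_role, is_fused) pairs, one per gene in the subsystem.
    Returns (fraction of distinct roles that are frequently fused,
             fraction of genes that are fused).
    """
    gene_list = list(genes)
    if not gene_list:
        return 0.0, 0.0
    roles = {role for role, _ in gene_list}
    # Measure (i): share of distinct roles classified as frequently fused.
    role_fraction = len(roles & frequently_fused_roles) / len(roles)
    # Measure (ii): share of genes in the subsystem that are fused.
    gene_fraction = sum(1 for _, fused in gene_list if fused) / len(gene_list)
    return role_fraction, gene_fraction


# Toy input: four genes spread over three roles (hypothetical role names).
toy_genes = [("roleA", True), ("roleA", False), ("roleB", True), ("roleC", False)]
print(fusion_prevalence(toy_genes, {"roleA"}))
```

The two numbers can diverge: a subsystem may have few frequently fused roles yet many fused genes if one such role is encoded in many genomes.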
Fig. 6 Functional analysis of frequently fused reactions. We identified 841 reactions as being frequently associated with gene fusion events. We manually assigned one of nine possible mechanistic explanations for the frequent fusion events associated with each of these reactions. The distribution of these mechanistic explanations is plotted as a pie chart (data extracted from Additional file 1: Table S16). Red numbers represent the number of reactions associated with a fusion event in a given category and the black numbers their respective percentages
Fusions of neighboring enzymes in metabolic pathways and their unstable substrates/products
| Metabolism area | Enzyme roles | EC numbers | SEED gene identifier | Metabolite involved | References |
|---|---|---|---|---|---|
| Aromatic amino acids | Cyclohexadienyl dehydratase/Periplasmic chorismate mutase I precursor | 4.2.1.51/5.4.99.5 | fig\|325240.9.peg.4134 | Prephenate | [ |
| Aromatic amino acids | Indole-3-glycerol phosphate synthase/Phosphoribosylanthranilate isomerase | 4.1.1.48/5.3.1.24 | fig\|991999.3.peg.2431 | 1-(2-Carboxyphenylamino)-1-deoxyribulose 5-phosphate | [ |
| Histidine | Phosphoribosyl-AMP cyclohydrolase/Phosphoribosyl-ATP pyrophosphatase | 3.5.4.19/3.6.1.31 | fig\|751585.3.peg.1763 | Phosphoribosyl-AMP | [ |
| Glyoxylate | Isocitrate lyase/Malate synthase | 4.1.3.1/2.3.3.9 | fig\|404589.10.peg.3099 | Glyoxylate | [ |
| Sulfur | Adenylylsulfate kinase/Sulfate adenylyltransferase subunit 1 | 2.7.1.25/2.7.7.4 | fig\|349163.14.peg.1814 | Adenosine 5′-phosphosulfate | [ |
| Folate | Aminodeoxychorismate lyase/Para-aminobenzoate synthase, aminase component | 4.1.3.38/2.6.1.85 | fig\|257309.4.peg.1776 | 4-Amino-4-deoxychorismate | [ |
| Phosphonate | 2-Aminoethylphosphonate:pyruvate aminotransferase/Phosphonoacetaldehyde hydrolase | 2.6.1.37/3.11.1.1 | fig\|691161.5.peg.2163 | Phosphonoacetaldehyde | [ |
| Siderophore | 2,3-Dihydroxybenzoate-AMP ligase/Isochorismatase/Isochorismate synthase | 2.7.7.58/3.3.2.1/5.4.4.2 | fig\|306537.3.peg.2089 | Isochorismate | [ |
| Heme and siroheme biosynthesis | Precorrin-2 oxidase/Sirohydrochlorin ferrochelatase/Uroporphyrinogen-III methyltransferase | 1.3.1.76/4.99.1.4/2.1.1.107 | fig\|644335.4.peg.2909 | Precorrin 2 | [ |
| Heme and siroheme biosynthesis | Uroporphyrinogen-III methyltransferase/Uroporphyrinogen-III synthase | 2.1.1.107/4.2.1.75 | fig\|479834.4.peg.2988 | Uroporphyrinogen III | [ |
| Heme and siroheme biosynthesis | Porphobilinogen deaminase/Uroporphyrinogen-III synthase | 2.5.1.61/4.2.1.75 | fig\|1049939.3.peg.1307 | Hydroxymethylbilane | [ |
Fusions of genes encoding neighboring enzymes were extracted computationally from the SEED database as described in Methods. Each metabolite listed is the product of one functional role in the row and the substrate of the functional role fused to it. The References column gives citations documenting the chemical instability of the intermediates
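The relationship described in the note above (one fused role produces the metabolite, the other consumes it) can be expressed as a set intersection. The sketch below is a toy illustration under stated assumptions, not the extraction procedure from Methods: the `Reaction` type, the `shared_intermediates` function, the hand-written reaction table, and the `currency` set of ubiquitous metabolites to ignore are all introduced here for the example.

```python
from typing import Dict, FrozenSet, NamedTuple, Set


class Reaction(NamedTuple):
    """Minimal reaction record (hypothetical): metabolite names only."""
    substrates: FrozenSet[str]
    products: FrozenSet[str]


def shared_intermediates(
    role_a: str,
    role_b: str,
    reactions: Dict[str, Reaction],
    currency: FrozenSet[str] = frozenset({"H2O"}),
) -> Set[str]:
    """Metabolites produced by one role and consumed by the other.

    Ubiquitous "currency" metabolites are excluded so that, for example,
    water does not appear as a pathway intermediate.
    """
    a, b = reactions[role_a], reactions[role_b]
    linked = (a.products & b.substrates) | (b.products & a.substrates)
    return set(linked - currency)


# Hand-written toy table for the last row of the table above.
toy_reactions = {
    "Porphobilinogen deaminase": Reaction(
        frozenset({"porphobilinogen", "H2O"}),
        frozenset({"hydroxymethylbilane", "NH3"}),
    ),
    "Uroporphyrinogen-III synthase": Reaction(
        frozenset({"hydroxymethylbilane"}),
        frozenset({"uroporphyrinogen III", "H2O"}),
    ),
}

print(shared_intermediates(
    "Porphobilinogen deaminase", "Uroporphyrinogen-III synthase", toy_reactions
))
```

Checking both directions means the result does not depend on the order in which the two fused roles are listed.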
Fig. 7 The use of fusions to infer functions of unknown domains. (a) Once a fusion of an unknown gene with a characterized gene is discovered, the function of the latter and the clustering pattern of the fusion gene help to propose functions for the unknown gene, especially when combined with structure analysis of the unknown and the fused product. If specific compounds are bound to the unknown protein and can be associated with the metabolic area of the known enzyme, mechanisms such as channeling or repair might be inferred. The position of the known enzyme in the pathway, combined with flux balance and thermodynamics analysis, can give clues about the function of the unknown gene. (b) Examples of the application of the ModelSEED fusions exploration tool. Beveled rectangles represent the genes that participate in the fusions used as starting points for our analysis. On the beveled rectangles, Cys H stands for phosphoadenylyl-sulfate reductase (EC 1.8.4.8)/adenylyl-sulfate reductase (EC 1.8.4.10); A-B stands for acetyl-coenzyme A carboxyl transferase alpha chain (EC 6.4.1.2)/acetyl-coenzyme A carboxyl transferase beta chain (EC 6.4.1.2); NUDIX stands for Nudix_15. These genes are shown in the same colors as the arrows that represent them in the genome sections illustrated immediately below them. The rows of arrows depict the gene clustering areas given by the SEED platform for the genes analyzed in our examples. The genes in each organism's genome section are represented by color-coded arrows and identified by letters. The functional roles represented by these letters for each organism are given in the printed section below the illustration. Examples of the stand-alone genes and their clustering patterns are also given