Hayda Almeida, Adrian Tsang, Abdoulaye Baniré Diallo.
Abstract
MOTIVATION: Precise identification of Biosynthetic Gene Clusters (BGCs) is a challenging task. The performance of BGC discovery tools is limited by their capacity to accurately predict the components belonging to candidate BGCs, and they often overestimate cluster boundaries. To support optimizing the composition and boundaries of candidate BGCs, we propose a reinforcement learning approach relying on protein domains and functional annotations from expert-curated BGCs.
Year: 2022 PMID: 35762945 PMCID: PMC9364373 DOI: 10.1093/bioinformatics/btac420
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. Computation of majority vote pre-processing for candidate BGCs: regions are merged according to the average score of predicted labels
Fig. 2. Example of functional annotation strategies applied to a candidate BGC
Distribution of A. niger BGC components in dataset genes
| Component type | Training: BGCs | Training: Non-BGCs | Test: Gold BGCs | Test: Non-gold BGCs |
|---|---|---|---|---|
| Backbones | 17.0% | 2.0% | 15.9% | 2.2% |
| Tailoring enzymes | 30.5% | 7.8% | 9.9% | 11.9% |
| Transcription factors | 4.8% | 2.1% | 5.9% | 4.3% |
| Transporters | 5.6% | 2.8% | 7.4% | 4.6% |
| Non-component domains | 44.7% | 46.93% | 49.3% | 58.9% |
| No domains | 14.6% | 41.15% | 15.5% | 23.2% |
| Total # genes | 2833 | 1781 | 624 | 11239 |
Performance on A. niger candidate BGCs from TOUCAN, fungiSMASH and DeepBGC
| Model | Gene P | Gene R | Gene F-m | Cluster P | Cluster R | Cluster F-m | Average F-m | % gold-std. genes: Negative | % gold-std. genes: Skipped |
|---|---|---|---|---|---|---|---|---|---|
| | 0.269 | 0.906 | 0.414 | 0.963 | 0.929 | 0.946 | 0.68 | 12.6% | — |
| | 0.402 | 0.68 | 0.506 | 0.963 | 0.929 | 0.946 | 0.726 | 12.6% | 26.4% |
| | 0.409 | 0.74 | 0.527 | 0.963 | 0.929 | 0.946 | 0.737 | 12.6% | 16.2% |
| | 0.341 | 0.665 | 0.451 | 0.649 | 0.741 | 0.692 | 0.571 | 33.2% | — |
| | 0.521 | 0.516 | 0.519 | 1 | 0.741 | 0.851 | 0.685 | 33.2% | 22.3% |
| | 0.495 | 0.575 | 0.532 | 1 | 0.741 | 0.851 | 0.691 | 33.2% | 13.8% |
| | 0.371 | 0.713 | 0.488 | 1 | 0.729 | 0.844 | 0.666 | 34.13% | — |
| | 0.523 | 0.508 | 0.515 | 1 | 0.729 | 0.844 | 0.680 | 34.13% | 22.11% |
| | 0.523 | 0.508 | 0.515 | 1 | 0.729 | 0.844 | 0.680 | 34.13% | 22.11% |
| | 0.351 | 0.481 | 0.406 | 0.732 | 0.612 | 0.667 | 0.536 | 52.4% | — |
| | 0.574 | 0.42 | 0.485 | 1 | 0.612 | 0.759 | 0.622 | 52.4% | 12.2% |
| | 0.538 | 0.46 | 0.496 | 1 | 0.612 | 0.759 | 0.627 | 52.4% | 7.1% |
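The P, R and F-m columns can be related with a short calculation. The sketch below assumes that F-m is the standard F1 score (harmonic mean of precision and recall) and that the "Average F-m" column is the arithmetic mean of the gene-level and cluster-level F-m; both assumptions are inferred from the reported numbers rather than stated in this record, and the function names are illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F-m)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def average_f_measure(gene_f: float, cluster_f: float) -> float:
    """Arithmetic mean of gene and cluster F-m (assumed definition of 'Average F-m')."""
    return (gene_f + cluster_f) / 2


# Values taken from the first row of the A. niger performance table.
cluster_f = round(f_measure(0.963, 0.929), 3)       # 0.946, as reported
avg_f = round(average_f_measure(0.414, cluster_f), 3)  # 0.68, as reported
print(cluster_f, avg_f)
```

Because the table reports precision and recall rounded to three decimals, recomputing an F-m from them can differ from the published value in the last digit.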
Fig. 3. Comparison between gold-standard and candidate BGC composition for four A. niger clusters. Non-BGC genes are shown in dark blue. (A) Candidate BGCs for which the reinforcement learning agent correctly skipped most non-BGC genes, compared to their polyketide (left) and fatty acid (right) gold-standard BGCs. (B) Candidate BGCs for which the agent kept most non-BGC genes, compared to their two non-ribosomal peptide gold-standard BGCs. This is possibly due to the ambiguous protein domains of these genes, more than half of which are associated with BGC component roles but do not belong to neighboring clusters. (A color version of this figure appears in the online version of this article.)
Distribution of A. nidulans pseudo BGC components in dataset genes
| Pseudo-component type | Training: BGCs | Training: Non-BGCs | Test: Gold BGCs | Test: Non-gold BGCs |
|---|---|---|---|---|
| Backbones | 17.5% | 2.13% | 20% | 2.45% |
| Tailoring enzymes | 36% | 3.70% | 31.63% | 4.5% |
| Transcription factors | 4.83% | 2.35% | 5.92% | 3.92% |
| Transporters | 5.82% | 3.65% | 7.55% | 5.2% |
| Non-component domains | 33.15% | 48.28% | 35.3% | 62.12% |
| No domains | 14.6% | 41.15% | 12.65% | 22.8% |
| Total # genes | 2833 | 1781 | 490 | 10002 |
Performance on A. nidulans candidate BGCs from the three tools.
| Model | Gene P | Gene R | Gene F-m | Cluster P | Cluster R | Cluster F-m | Average F-m | % gold genes: Negative | % gold genes: Skipped |
|---|---|---|---|---|---|---|---|---|---|
| | 0.272 | 0.681 | 0.389 | 1 | 0.685 | 0.813 | 0.601 | 32.24% | — |
| | 0.441 | 0.591 | 0.505 | 1 | 0.681 | 0.810 | 0.657 | 32.24% | 13.47% |
| | 0.402 | 0.646 | 0.495 | 1 | 0.681 | 0.810 | 0.653 | 32.24% | 7.55% |
| | 0.319 | 0.727 | 0.443 | 0.817 | 0.795 | 0.806 | 0.624 | 30.61% | — |
| | 0.479 | 0.592 | 0.53 | 1 | 0.781 | 0.877 | 0.703 | 30.61% | 15.92% |
| | 0.469 | 0.605 | 0.529 | 1 | 0.736 | 0.848 | 0.688 | 30.61% | 13.88% |
| | 0.318 | 0.762 | 0.449 | 1 | 0.792 | 0.884 | 0.666 | 28.16% | — |
| | 0.484 | 0.581 | 0.528 | 1 | 0.778 | 0.875 | 0.702 | 28.16% | 19.18% |
| | 0.484 | 0.581 | 0.528 | 1 | 0.778 | 0.875 | 0.702 | 28.16% | 19.18% |
| | 0.328 | 0.493 | 0.394 | 0.723 | 0.466 | 0.567 | 0.480 | 50.61% | — |
| | 0.491 | 0.441 | 0.465 | 1 | 0.466 | 0.636 | 0.550 | 50.61% | 8.57% |
| | 0.473 | 0.492 | 0.482 | 1 | 0.472 | 0.642 | 0.562 | 50.61% | 2.86% |