Rohan V Koodli, Benjamin Keep, Katherine R Coppess, Fernando Portela, Rhiju Das.
Abstract
Emerging RNA-based approaches to disease detection and gene therapy require RNA sequences that fold into specific base-pairing patterns, but computational algorithms generally remain inadequate for these secondary structure design tasks. The Eterna project has crowdsourced RNA design to human video game players in the form of puzzles that reach extraordinary difficulty. Here, we demonstrate that Eterna participants' moves and strategies can be leveraged to improve automated computational RNA design. We present an eternamoves-large repository consisting of 1.8 million player moves on 12 of the most-played Eterna puzzles as well as an eternamoves-select repository of 30,477 moves from the top 72 players on a select set of more advanced puzzles. On eternamoves-select, we present a multilayer convolutional neural network (CNN), EternaBrain, that achieves test accuracies of 51% and 34% in base prediction and location prediction, respectively, suggesting that top players' moves are partially stereotyped. Pipelining this CNN's move predictions with single-action-playout (SAP) of six strategies compiled by human players solves 61 out of 100 independent puzzles in the Eterna100 benchmark. EternaBrain-SAP outperforms previously published RNA design algorithms and achieves similar or better performance than a newer generation of deep learning methods, while being largely orthogonal to these other methods. Our study provides useful lessons for future efforts to achieve human-competitive performance with automated RNA design algorithms.
Year: 2019 PMID: 31247029 PMCID: PMC6597038 DOI: 10.1371/journal.pcbi.1007059
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 1. Eterna and EternaBrain.
(A-C) Puzzle-solving interface presented to human players of Eterna including the state of the puzzle (whether it is solved or not) in the top left corner (red/green outline), the puzzle itself (in the middle), and the toolbar (bottom) with which the players can mutate the RNA sequence to make it fold into the desired state; yellow, blue, red, and green symbols represent A, U, G, and C nucleotides. (A) The desired target structure for the RNA molecule, as indicated by the bullseye in the bottom left (orange highlight). (B) Nature mode, as indicated by the leaf in the bottom left (orange highlight), gives the predicted minimum free energy structure for the current sequence. Since the bases in the top right should be paired with each other (orange circle), this puzzle is not yet folding correctly; this status is shown by the red indicator in the top left corner. (C) The solved puzzle. The nature-mode structure matches the target structure, and the indicator in the top left corner turns green, meaning the puzzle has been solved. (D) (left) Wide distribution of contributed Eterna solutions across different players. For preparing the eternamoves-select data set, we selected any player who had solved more than 3000 distinct puzzles, which left us with 72 players. (right) In EternaBrain, we tested whether information on players’ moves could be used to train a convolutional neural network. (E) For solving new puzzles, the final EternaBrain-SAP framework first uses the EternaBrain convolutional neural net model to predict sequence changes (‘moves’) for new RNA puzzles. In a second stage, the Single Action Playout (SAP), six additional hand-coded strategies are applied to complete the solution.
Input features used for training and testing EternaBrain convolutional neural network.
| Description | Example | Encoded example |
|---|---|---|
| Base Sequence | CCAGAAAAAAAAACUGG | [[0, 0, 0, 1], [0, 0, 0, 1], [1, 0, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 1, 0]] |
| Predicted (‘Nature Mode’) Structure in dot-bracket notation | ((((.........)))) (dot-bracket) | 2,2,2,2,1,1,1,1,1,1,1,1,1,1,3,3,3,3 |
| Target Structure in dot-bracket notation | (((((.........))))) (dot-bracket) | 2,2,2,2,2,1,1,1,1,1,1,1,1,3,3,3,3,3 |
| Predicted Structure Energy | -2.2 kcal/mol | -2.2 |
| Target Energy | 4.9 kcal/mol | 4.9 |
| Predicted Structure in pairmap notation* | ((((.........)))) (pairmap) | 16,15,14,13,-1,-1,-1,-1,-1,-1,-1,-1,-1,3,2,1,0 |
| Target Structure in pairmap notation* | (((((.........))))) (pairmap) | 16,15,14,13,12,-1,-1,-1,-1,-1,-1,-1,4,3,2,1,0 |
| Locked bases | xooooooooooooooox | 2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2 |
* Additional features used for training on eternamoves-select and not eternamoves-large data set.
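The encodings in the table above can be reproduced with a short sketch. The function names are hypothetical and not taken from the EternaBrain code. The one-hot column order (A, U, G, C) and the integer codes for dot-bracket symbols are inferred from the table's example rows, so treat them as assumptions.

```python
# Illustrative sketch only; names are hypothetical, not from the EternaBrain code.
BASES = "AUGC"  # one-hot column order inferred from the table's example row


def one_hot_sequence(seq):
    """Encode each nucleotide as a 4-element one-hot vector in A, U, G, C order."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


def encode_dot_bracket(structure):
    """Map '.' -> 1, '(' -> 2, ')' -> 3, as in the table's encoded examples."""
    codes = {".": 1, "(": 2, ")": 3}
    return [codes[ch] for ch in structure]


def pairmap(structure):
    """For each position, give the 0-based index of its partner, or -1 if unpaired."""
    partners = [-1] * len(structure)
    stack = []  # indices of '(' still waiting for a matching ')'
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced dot-bracket string")
            j = stack.pop()
            partners[i] = j
            partners[j] = i
    if stack:
        raise ValueError("unbalanced dot-bracket string")
    return partners


print(pairmap("((((.........))))"))
```

For the 17-nucleotide hairpin `((((.........))))`, `pairmap` returns the same list as the pairmap row of the table: the four 5' stem positions point to indices 16 through 13, the nine loop positions are -1, and the four 3' stem positions point back to 3 through 0.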
EternaBrain CNN accuracies on eternamoves-select with different splits of training and test sets.
| Model | Training Accuracy | Test Accuracy |
|---|---|---|
| EternaBrain—base | 0.71 | 0.51 |
| EternaBrain—location | 0.31 | 0.34 |
| Half experts—base | 0.66 | 0.38 |
| Half experts—location | 0.30 | 0.11 |
| Half puzzles—base | 0.70 | 0.27 |
| Half puzzles—location | 0.33 | 0.02 |
| One expert—base | 0.79 | 0.25 |
| One expert—location | 0.5 | 0.01 |
EternaBrain CNN accuracies on eternamoves-select, grouped by length of puzzle and paired/unpaired status of nucleotide at which move was applied.
| Length of Puzzles | Number of Moves | Location Accuracy (correct moves) | Location Accuracy (%) | Base Accuracy (correct moves) | Base Accuracy (%) |
|---|---|---|---|---|---|
| | 1034 | 398 | 38% | 537 | 52% |
| | 771 | 232 | 30% | 385 | 50% |
| | 1329 | 346 | 26% | 624 | 47% |
| | 313 | 32 | 10% | 141 | 42% |
| | 3447 | 1008 | 29% | 1687 | 49% |
| | 687 | 257 | 37% | 337 | 49% |
| | 674 | 201 | 29% | 324 | 48% |
| | 1206 | 320 | 27% | 557 | 46% |
| | 264 | 23 | 8% | 114 | 43% |
| | 2831 | 801 | 28% | 1332 | 47% |
| | 348 | 141 | 40% | 200 | 57% |
| | 96 | 30 | 31% | 61 | 63% |
| | 122 | 26 | 21% | 67 | 55% |
| | 50 | 9 | 18% | 27 | 54% |
| | 616 | 206 | 33% | 355 | 58% |
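As a consistency check on the table above, the percentage columns can be recomputed from the counts. The sketch below assumes that, in each row, the first number is the number of moves and the two unlabelled counts are the correctly predicted locations and bases. It uses the fifth row of each five-row group, which equals the sum of the four rows above it. The variable and function names are hypothetical.

```python
# Rows taken from the table: (number of moves, correct locations, correct bases).
# Interpreting the counts this way is an assumption based on the column headers.
summed_rows = [
    (3447, 1008, 1687),
    (2831, 801, 1332),
    (616, 206, 355),
]


def accuracy_percent(correct, total):
    """Accuracy as a whole-number percentage."""
    return round(100 * correct / total)


for moves, correct_locations, correct_bases in summed_rows:
    print(
        moves,
        f"location {accuracy_percent(correct_locations, moves)}%",
        f"base {accuracy_percent(correct_bases, moves)}%",
    )
```

The recomputed values (29% and 49%, 28% and 47%, 33% and 58%) agree with the percentages printed in the table.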
Fig 2. The 6 strategies included in the SAP.
(A) The original state of the puzzle before SAP. This represents a puzzle initiated with an arbitrary sequence of nucleotides; panel displays the target structure, where mismatched nucleotides (C-A) are highlighted. (B) The first step of the SAP is to correct mismatched pairs. Here, the cytosine nucleotides are switched to uracil to pair with adenine. (C) Changing end pairs to G-C. Changing base pairs that are at the edges of stems and flank loops to G-C pairs lowers the free energy of the molecule. (D) G-internal loop boost. The first nucleotide in an internal loop on either side is switched to a guanine. (E) U-G-U-G super boost. In an internal loop with 2 unpaired bases on either side, the 2 bases are changed to uracil and guanine, in that order, on either side. (F) G-hairpin boost. The first nucleotide in each strand of a hairpin loop is changed to a guanine. (G) Reorienting base pairs. Target base pairs that are not predicted to be folded correctly are ‘flipped’ to lower the energy of the structure. Here, alternating the A-U pairs lowers the energy of the stack. The 5’ end of each puzzle is at the top left, with the puzzle drawn counter-clockwise from that point.
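The first SAP step described in panel (B), correcting mismatched pairs, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names are hypothetical, G-U wobble pairs are assumed to count as valid, and the choice to mutate the 5' base of a mismatched pair (rather than the 3' base) is arbitrary.

```python
# Hypothetical sketch of a mismatch-correction step; not the authors' implementation.
# Assumption: Watson-Crick pairs and G-U wobble pairs are treated as valid.
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def target_pairs(structure):
    """Return (i, j) index pairs for matching brackets in a dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def correct_mismatches(sequence, structure):
    """Mutate the 5' base of each invalid target pair to complement its 3' partner."""
    seq = list(sequence)
    for i, j in target_pairs(structure):
        if (seq[i], seq[j]) not in VALID_PAIRS:
            seq[i] = COMPLEMENT[seq[j]]  # arbitrary choice: keep the 3' base
    return "".join(seq)


print(correct_mismatches("CCAAAAA", "((...))"))
```

With the sequence `CCAAAAA` and target `((...))`, both C-A mismatches become U-A pairs, mirroring the cytosine-to-uracil change in the caption. The remaining five strategies would need similar structure-aware rules, for example locating stem ends or loop boundaries before mutating.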
Fig 3. EternaBrain performance.
(A) Performance of EternaBrain and 6 previously published algorithms on the Eterna100 benchmark. EternaBrain solves 61/100, followed by MODENA (54/100), INFO-RNA (50/100), NUPACK (48/100), DSS-Opt (47/100), RNAinverse (28/100), and RNA-SSD (27/100). (B) Performance of alternative model constructions. The CNN alone could solve only 20/100, and the SAP alone could solve 50/100. Removing various input features passed into the CNN resulted in drops in performance, confirming the importance of these features.
EternaBrain-SAP performance on Eterna100 upon five additional playouts on the 61 puzzles it solved in its first run.
| Puzzle | Number of Times Solved out of 5 | Puzzle | Number of Times Solved out of 5 |
|---|---|---|---|
| Simple_hairpin | 5 | medallion | 4 |
| Arabidopsis_Thaliana_6_RNA_-_Difficulty_Level_0 | 5 | [RNA]_Repetitious_Sequences_8_10 | 5 |
| Prion_Pseudoknot_-_Difficulty_Level_0 | 5 | Documenting_repetitious_behavior | 5 |
| Human_integrated_adenovirus_-_Level_0 | 4 | 7_multiloop | 5 |
| The_Gammaretrovirus_Signal_-_Diffuculty_Level_0 | 5 | Kyurem_7 | 0 |
| Saccharomyces_Cerevisiae_-_Difficulty_Level_0 | 5 | JF1 | 5 |
| The_fractal | 5 | multilooping_fun | 0 |
| G-C_Placement | 5 | Multiloop … | 3 |
| The_Sun | 5 | hard_Y | 0 |
| Frog_Foot | 5 | Mat_-_Elements_&_Sections | 0 |
| InfoRNA_test_16 | 5 | Chicken_feet | 0 |
| Mat_–_Martian_2 | 5 | Bug_18 | 0 |
| square | 5 | Fractal_star_x5 | 5 |
| Six_legd_turtle_2 | 5 | Crop_Circle_2 | 0 |
| Small_and_Easy_6 | 0 | Branching_Loop | 0 |
| Fractile | 5 | Bug_38 | 0 |
| Six_legd_turtle_2 | 5 | Simple_Single_Bond | 0 |
| snoRNA_SNORD64 | 5 | Taraxacumr_officinale | 5 |
| Chalk_Outline | 4 | Headless_Bug_on_Windshield | 0 |
| InfoRNA_bulge_test_9 | 4 | Pokeball | 1 |
| Tilted_Russian_Cross | 5 | Variation_of_a_crop_circle | 3 |
| This_is_ACTUALLY_Small_And_Easy_6 | 5 | Loop_next_to_a_Multiloop | 0 |
| Shortie_4 | 5 | Snowflake_4 | 0 |
| Shape_Test | 3 | Mat_-_Cuboid | 3 |
| The_Minitsry | 5 | Misfolded_Aptamer_6 | 2 |
| stickshift_ | 5 | Snowflake_3 | 0 |
| U | 4 | Hard_Y_and_a_bit_more | 0 |
| Still_Life_(Sunflower_In_A_Vase) | 5 | Mat_-_Lot_2–2_B | 0 |
| Quasispecies_2–2_Loop_Challenge | 4 | Shapes_and_Energy | 0 |
| Corner_bulge_training_ | 5 | Spiral_of_5s | 0 |
| Spiral | 5 | Campfire | 0 |
| InfoRNA_bulge_test | 5 | Anemone | 3 |
| Worm_1 | 4 | Fractal_3 | 5 |
| just_down_to_1_bulge | 3 | Kyurem_5 | 0 |
| Iron_Cross | 5 | Snowflake_Necklace_(_or_v2.0_) | 0 |
| loops_and_stems | 5 | Methaqualone_C16H14N2O_Structural_Representation | 0 |
| Water_Strider | 5 | Cats_Toy_2 | 0 |
| The_turtle(s)_move(s) | 5 | Zigzag_Semicircle | 0 |
| Adenine | 5 | Short_string_4 | 0 |
| Tripod_5 | 4 | Gladius | 0 |
| Shortie_6 | 5 | Thunderbolt | 4 |
| Runner_ | 5 | Mutated_chicken_feet | 0 |
| Recoil | 5 | Chicken_Tracks | 2 |
| [CloudBeta]_An_Arm_and_a_Leg_1.0_ | 5 | Looking_Back_Again | 0 |
| [CloudBeta]_5_Adjacent_Stack_Multi-Branch_Loop | 5 | Multilooping_6 | 0 |
| Triple_Y | 0 | Cesspool | 0 |
| Misfolded_Aptamer | 4 | Hoglafractal | 0 |
| Flower_power | 0 | Bullseye | 0 |
| Kudzu | 0 | Shooting_Star | 0 |
| 1,2,3and4bulges | 0 | Teslagon | 0 |
Fig 4. Example EternaBrain-SAP solutions to Eterna100 puzzles.
(A) U solution highlights the fact that the EternaBrain CNN alone can solve puzzles with short stems. (B) Chicken Tracks solution: EternaBrain-SAP can solve puzzles with three stems intersecting in one internal loop. (C) Thunderbolt solution demonstrates that EternaBrain-SAP can solve large puzzles (400 nucleotides long) and solve loops and stems in combination. (D) Shortie 4 solution shows EternaBrain-SAP can solve puzzles with multiple short stems (2 nucleotides long). (E) Shortie 6 is quite similar to Shortie 4, but with the same motif (short stems) repeated. The other algorithms mentioned could not solve Shortie 6 because of the repeated motifs. (F) Hard Y—target structure (left) vs nature-mode (right) structure. EternaBrain-SAP could not solve Hard Y because it required use of a little-used strategy to solve a motif called a zigzag. Since the strategy is not often used by players, the EternaBrain CNN did not learn the strategy and the strategy was not included in the SAP. In each panel, the 5’ end of each puzzle is at the top left, with the puzzle drawn counter-clockwise from that point.