Literature DB >> 32683444

Machine Boss: rapid prototyping of bioinformatic automata.

Jordi Silvestre-Ryan¹, Yujie Wang¹, Mehak Sharma¹, Stephen Lin¹, Yolanda Shen¹, Shihab Dider¹, Ian Holmes¹.

Abstract

MOTIVATION: Many software libraries for using Hidden Markov Models in bioinformatics focus on inference tasks, such as likelihood calculation, parameter-fitting and alignment. However, construction of the state machines can be a laborious task, automation of which would be time-saving and less error-prone.
RESULTS: We present Machine Boss, a software tool implementing not just inference and parameter-fitting algorithms, but also a set of operations for manipulating and combining automata. The aim is to make prototyping of bioinformatics HMMs as quick and easy as the construction of regular expressions, with one-line 'recipes' for many common applications. We report data from several illustrative examples involving protein-to-DNA alignment, DNA data storage and nanopore sequence analysis.
AVAILABILITY AND IMPLEMENTATION: Machine Boss is released under the BSD-3 open source license and is available from http://machineboss.org/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Entities: Chemical

Mesh：

Year: 2021 PMID： 32683444 PMCID： PMC8034524 DOI： 10.1093/bioinformatics/btaa633

Source DB: PubMed Journal: Bioinformatics ISSN： 1367-4803 Impact factor: 6.931

1 Introduction

Bioinformatics is a field littered with state machines, many of them still functional. The venerable Needleman–Wunsch, Smith–Waterman and Gotoh algorithms from the 1970s and early 1980s can be thought of as aligning pairs of sequences to input–output automata (Gotoh, 1982; Needleman and Wunsch, 1970; Smith and Waterman, 1981). Gribskov’s protein profiles of the late 1980s are state machines too (Gribskov ). The 1990s saw the probabilistic interpretation of both these kinds of machine as Hidden Markov Models (HMMs); respectively, the ‘pair HMM’ and the ‘profile HMM’ (Brown ; Durbin ). This inspired new applications of HMMs in emerging areas of sequence analysis, such as computational gene prediction (Burge and Karlin, 1997). Further evolution of these ideas including HMMs for multiple sequence alignment (Do ; Holmes and Bruno, 2001), comparative genefinding (Alexandersson ; Meyer and Durbin, 2004) and phylogenetics (Siepel and Haussler, 2003; Siepel ; Suchard and Redelings, 2006) occurred in the 2000s. HMMs (and automata more generally) continue to represent the state of the art for many bioinformatic tasks; for example, when reconstructing the indel histories of ancestral sequences (Holmes, 2017; Löytynoja and Goldman, 2005; Westesson ) or aligning protein to DNA (Birney ). Meanwhile, the growing field of Deep Learning drew from HMM model-fitting algorithms to train Recurrent Neural Networks (RNNs) with sequential inputs and outputs; specifically, using Connectionist Temporal Classification (CTC) which is based on the Forward-Backward algorithm (Graves ). Inevitably, RNNs have found application in bioinformatics, sometimes surpassing HMMs; for example, RNN basecallers for nanopore sequence data (Boza ) have been shown to outperform the corresponding HMMs (David ). Even in cases such as this, where RNNs have displaced HMMs, useful connections to automata theory can sometimes still be made due to the underlying parallels between CTC and HMM dynamic programming (Silvestre-Ryan and Holmes, 2018). Over this period, a number of software libraries have been developed for parsing, annotating and aligning biological sequences using generic state machines. Examples include Dynamite (Birney and Durbin, 1997), DART (Holmes and Bruno, 2001), GHMM (Schliep ), C4 (Slater and Birney, 2005), HMMoC (Lunter, 2007), HMMConverter (Lam and Meyer, 2009), StochHMM (Lott and Korf, 2014), MuxStep (Veličković and Liò, 2016) and ham (Ralph and Matsen, 2016). These various libraries all have slightly different capabilities but typical features include the ability to work with state machines of arbitrary topology, implementations of common dynamic programming algorithms (like the Viterbi and Forward-Backward algorithms) and generation of optimized code implementing those algorithms. Some newer libraries for deep learning that are frequently used in bioinformatics, such as TensorFlow (Abadi ), often include implementations of algorithms with close state-machine analogs, even though the libraries themselves are not explicitly founded on automata theory. Examples include CTC loss-minimization, or beam search to find the most likely output sequence of a RNN. Despite all of these libraries for doing automata-based inference on sequences, there are few (if any) general-purpose tools for working with the automata themselves as manipulable mathematical objects. As one illustration of why this is useful, we consider GeneWise, one of the most successful (and elaborate) automata of bioinformatics. The underlying state machine of GeneWise aligns an amino acid sequence to the (unspliced and imperfectly observed) protein-coding genomic DNA. In doing so, it simultaneously models translation, splicing and sequencing errors—as detailed in the GeneWise paper (Birney ). The machine developed to do this in full is highly intricate, containing 21 states and 93 transitions when expressed as a Moore (1956) machine; prototyping and subsequent optimization yielded a reduced-size machine with 6 states and 23 transitions. (In using these descriptions to assess the mathematical complexity of GeneWise and the computational complexity of its algorithms, one should note that the transition output labels are strings, not just individual characters, and the transition weights include contributions from all the constituent submodels: translation, splicing and sequencing errors.) GeneWise also includes an algorithm that aligns profile HMMs of protein domains to genomic DNA, accepting HMMER-format profiles as used by the PFAM database (El-Gebali ). All these state machines were laboriously developed, validated and debugged by hand (albeit with the help of Dynamite to generate the dynamic programming code once the state machines were specified). In principle—given that the GeneWise model can notionally be ‘factorized’ into independent component machines for translation, splicing and sequencing error—this approach should be amenable to incremental variations or enhancements; e.g. different profile HMM architectures (as may be found in later releases of HMMER), alternate genetic codes or richer models of the context-dependent error profile of later-emerging sequencing technologies like that of Oxford Nanopore Technologies (ONT) (Jain ). In practice, however, because this factorization of the GeneWise state machine into submachines is performed manually, these kinds of upgrade would take a significant amount of work—and quick prototyping would be impossible. More generally, many bioinformatic automata can be viewed as being derived from simpler state machines by operations such as concatenation, composition (or multiplication), intersection, union (or addition), reversal, complementation, substitution or other well-defined mathematical transformations. This is particularly useful in cases which inherently involve transforming one sequence into another via several steps. GeneWise is one example. Another, very simple, example is the conversion of a DNA motif to an RNA motif, as shown inFigure 1.

Fig. 1.

A state machine that generates DNA sequences matching the TaqI restriction enzyme binding site (left) can be converted, by multiplication with a DNA-to-RNA conversion machine (center), to a state machine that generates the corresponding RNA motifs (right) A more elaborate example occurs in the context of encoding binary information into DNA as a storage medium: in doing this, it is desirable to avoid repeated nucleotides (which are easily misread by DNA sequencers) and this can be conceived of as converting a binary sequence to a base-3 (ternary) sequence, followed by a conversion from ternary to DNA. Each conversion can be formulated as an input–output state machine (Fig. 2); related coding operations, such as the introduction of parity bits for error correction, can similarly be formulated using state machines (Supplementary Fig. S1).

Fig. 2.

A non-repeating DNA storage code can be factored into two separate state machines: one for converting binary sequences to ternary, and one for converting ternary to DNA (Goldman ). In this diagram, a state machine transition is annotated x/y if it inputs x and outputs y; the symbol ϵ denotes the empty string. (A) Machine for (imperfectly) converting a binary input sequence into ternary, batching the input into groups of three binary digits and outputting pairs of ternary digits. This machine is inefficient in that the output is times longer than it would be for a perfect conversion from base 2 to base 3 (because no triplet of input bits is ever converted the pair of output trits ‘22’, which means one of the nine possible output–trit pairs is wasted; more fundamentally, perfect conversion between indivisible integer bases is not possible with a finite state machine). In applications where the length of the input is not known in advance and so must be signaled with an end-of-file character (EOF), the ternary sequence ‘22’ can be used to encode this EOF. (B) Machine for converting a ternary input sequence into a non-repeating DNA sequence. The output of this machine is times longer than the DNA would be if repeated nucleotides are allowed. (AB) Machine for a binary input sequence into a non-repeating DNA sequence, obtained by ‘multiplying’ A and B. The output of this machine is 4/3 times longer than the Shannon limit (obtained by multiplying the inefficiencies of the two constituent machines). The machine diagrams in this figure were generated automatically using Machine Boss and GraphViz The approach of formally composing automata is well-documented in other applications of automata in computer science, for example, in linguistics (Mohri ). The automata, and the operations to combine or transform them, can be expressed compactly using the notation of linear algebra; in this view, an automaton formally represents an infinite matrix whose rows and columns are indexed by input and output sequences (Bouchard-Côté, 2013). To take one example, the GeneWise combination of three translation, splicing and error submachines corresponds straightforwardly to a three-way matrix multiplication. For some tasks, such as statistical phylogenetic alignment (where such automata generalize the idea of the ‘substitution matrix’ to whole sequences, allowing indels as well as substitutions), this view of state machines as algebraic objects that can be systematically combined on the branches of a tree is absolutely central to the underlying probabilistic framework (Holmes, 2017; Holmes and Bruno, 2001; Redelings and Suchard, 2005, 2007; Suchard and Redelings, 2006; Westesson ). Even without adopting the linear algebraic view, there is clear utility to being able to transform automata by simple operations like reverse-complementation or concatenation. Yet, for all the general-purpose state-machine libraries, this ability to formally operate on the state machines themselves is not generally available. Certainly, most libraries allow state machines to be constructed programmatically, by building appropriate data structures directly in the source code that links to the library. However, this is an intricate and error-prone procedure, and is a far cry from being able to construct state machines from modular components using reliable, general-purpose implementations of elementary operations such as ‘multiply’, ‘concatenate’ or ‘reverse-complement’. Motivated by this gap in the bioinformatics tool chain, and finding ourselves repeatedly in need of reference implementations of automata-theoretic algorithms (for prototyping and debugging purposes in both HMM- and RNN-based applications) that allowed for algebraic manipulation of the underlying state machines, we developed Machine Boss, an open source software package that meets this need. In Section 2 and the Supplementary Information, we review the representation of state machines used throughout Machine Boss, and outline its capabilities. In the Results section, we describe non-trivial example applications of Machine Boss to several problems of interest; these include incorporating context-dependent error models (appropriate for nanopore sequencing) into GeneWise-like protein-to-DNA aligners, decoding the output of neural network basecallers for ONT sequencing instruments, and prototyping modular codes for encoding binary information in DNA. In the Discussion, we discuss how this sort of prototyping fits into a bioinformatics tool development workflow, and briefly mention several further applications.

2 Materials and methods

Detailed descriptions of state machines and sequence data may be found in the Supplementary Information to this article.

2.1 Weighted finite-state machines

The following uses notation introduced in the studies by Mohri and Westesson . For our purposes, a machine is a tuple where is an input alphabet, is an output alphabet, is a non-empty ordered list of states (of which the first element is the start state and the last is the end state), is a set of transitions between states (labeled with input and/or output symbols), is a set of named parameters (a subset of which are assigned non-negative real values), is a set of constraints (partitioning into rates, mutually exclusive probability and other parameters) and is the transition weight function, where represents the set of closed-form differentiable expressions over (with an expression grammar that allows real numbers, arithmetic operators, powers, exponentials and logarithms). Let denote the set of all such possible machines. For a given input sequence and output sequence , let be the total weight of all paths through the transition graph of T that have input label x and output label y. The sequence weight can be computed by the Forward algorithm in time and memory . The derivatives for can be computed using the Forward–Backward algorithm. (Machine Boss implements this algorithm only for the case when all transition weights can be computed as real values, i.e. all relevant parameters are specified.) This notation encourages us to think of T as being like an infinite matrix, indexed by sequences, with being the element in row x and column y. We can then, for example, multiply machines like matrices: if , then we can readily find a machine such that . Other matrix expressions such as T + U, transpose(T) or αT (for some scalar α) are also straightforward to implement. A machine for which is called a generator. A machine for which is called a recognizer. We can, for example, think of profile HMMs as generators (because they generate sequences as output) and regular expressions as recognizers (because they accept sequences as input). The labeling of one sequence as ‘input’ and the other as ‘output’ is arbitrary and, for many purposes, largely irrelevant. In the linear algebra analogy, the distinction between a generator and a recognizer corresponds to the choice as to whether to represent a vector in row or column form, and exchanging the ‘input’ and ‘output’ labels corresponds to taking the transpose.

2.2 Machine boss capabilities

Machine Boss defines a (validatable) JSON format for machines in , and implements the following operations: Matrix-like operations such as multiplication, transposition, addition, intersection (a.k.a. point product), the matrix identity (for a given alphabet) and scalar multiplication; String-like operations such as concatenation, reversal, reverse-complement, repetition, Kleene closure, local matching (padding with flanking states), HMM transition graph-related operations such as topological sorting, elimination of ϵ-transitions, elimination of redundant or inaccessible states, downsampling, normalization and various probabilistic weightings; Constructing generators and recognizers for sequences, elementary patterns (e.g. wildcards) or regular expressions; Import of models from various sources such as HMMER files (Eddy, 2009), FASTA, CSV or HTTP fetches from PFAM (El-Gebali ) or DFAM (Hubley ), and export to GraphViz format; Useful built-in ‘preset’ machines such as probabilistic Smith-Waterman (Bucher and Hofmann, 1996), GeneWise-like models (Birney ), DNA storage codes (Goldman ), the Jukes-Cantor model (Jukes and Cantor, 1969) and the TKF model (Thorne ); Implementations of common dynamic programming (Forward, Forward–Backward, Viterbi) and HMM-related algorithms (Baum-Welch and other forms of EM, respecting user-specified constraints), including linear-space Forward/Viterbi (without traceback), and banding heuristics; Implementation of search algorithms for finding the highest-weighted input or output sequences, or sampling such sequences probabilistically by weight; including prefix search, beam search, MCMC and simulated annealing; Generation of C++ code (32- or 64-bit) and/or JavaScript code; Multiple ways to specify input and output sequences (as strings from the command line, JSON arrays or FASTA files); A flexible logging system for debugging and progress reports. To compute the sum over all state paths, the Forward algorithm requires that the ‘silent’ (i.e. ϵ-labeled) subset of the transition graph is acyclic and topologically sorted (Durbin ). Most Machine Boss operations attempt to maintain this property in the state machines that they construct, automatically topo-sorting and eliminating silent cycles by marginalization. However, for large state machines (particularly when the transition weights are expressed symbolically as closed-form algebraic formulae, rather than as real numbers), these operations can become computationally expensive. For such cases, Machine Boss also offers inexact versions of the operations that either attempt to break silent cycles (by deleting silent transitions where j < i, until no cycles remain) or just leave them in place (acknowledging that the Forward algorithm may then give a technically incorrect, albeit stable, result). With the operations described, prototyping and evaluating new machines with Machine Boss is a relatively quick process that can take place interactively on the command line. In fact, many of these operations can be accessed in multiple ways: from the command line, via the JSON API, or by interfacing directly to the C++ API. Machine Boss compiles on Apple Mac and Linux systems to a command-line executable with limited dependencies (GSL and SSL if using the network capabilities) and can also be compiled to WebAssembly using emscripten.

3 Results

3.1 Aligning protein sequences to nanopore reads with a context-dependent error model

Our first test of Machine Boss was an experiment to see whether richer error models might benefit a GeneWise-like protein-to-DNA search. Specifically, we sought to prototype an application to search amino acid sequences against individual nanopore reads, to see if they contained coding genes for known E.coli proteins. To do this, we combined a protein-to-DNA alignment model with two alternate error models: a ‘symmetric context-independent’ error model parameterized by a substitution matrix and gap opening/extension parameters, and a richer ‘asymmetric context-dependent’ error model with separate parameters for insertion and deletion (hence ‘asymmetric’), that also allows the error probabilities at a particular position of the genome to depend on the neighboring bases (hence ‘context-dependent’). For this application, we were primarily interested in the power of the DNA substitution model; to reduce computation time, we did not incorporate an amino acid substitution model (as the analogous GeneWise algorithm does), but this can easily be incorporated. The datasets and the construction and parameterization of the constituent state machines are described in more detail in Section 2 and Supplementary Information. The protein-to-DNA model with symmetric context-independent errors has 242 states and 803 transitions (329 IO-conditioned). The model with asymmetric context-dependent errors has 1349 states and 7464 transitions (2558 IO-conditioned). After constructing the state machines, we used Machine Boss to generate custom C++ code for the Forward algorithm, compiled this code, and ran it to scan for a representative E.coli IS26 transposase protein (insB1, 167aa). Figure 3A shows the results of this experiment. Broadly, both error models performed similarly, scoring on average around 1.4 bits per codon aligned. (This is a rather low alignment score, reflecting the extremely noisy nature of the training alignments.) We observed a small (3%) but significant improvement in log-odds scores for positives when using the asymmetric context-dependent model, with negligible effect on log-odds scores for negatives.

Fig. 3.

A richer error model slightly improves the discriminative power of both protein sequence (A) and profile HMM (B) alignments to noisy sequencing reads. The first plot (A) shows the smoothed density of log-odds ratios of global alignments of E.coli protein insB1 to nanopore reads that fully contain a gene for that protein (Positives), versus those that do not contain that protein or close homologs (Negatives). The second plot (B) shows the smoothed density of log-odds ratios of the PFAM domain DDE Tnp IS1 (PF03400) for the IS1 transposase, aligned to nanopore reads that fully contain a gene for the corresponding insB1 protein (‘Positives’), versus those that do not contain that protein or close homologs (‘Negatives’). These alignments were done using error models with and without insertion/deletion asymmetry and context dependence (SCI = symmetric context-independent; ACD = asymmetric context-dependent). The log-odds ratio for a read is where is the hypothesis that the read contains the insB1 gene or domain and the hypothesis that it does not. The mean of L is indicated for each group with a dashed line. For the sequence-DNA alignment (A), using the asymmetric context-dependent error model increases the mean of the positives ( bits, a relative increase of around 3%) with negligible effect on the negatives ( bits). Similarly, for the profile-DNA alignment (B), using the asymmetric context-dependent error model increases the mean of the positives ( bits) with a smaller effect on the negatives ( bits) To investigate how much of this improvement arose from the separate insertion and deletion probabilities, we prototyped a third error model, based on the symmetric context-independent model (and having similar state and transition counts), but relaxing the symmetry constraint between insertions and deletions. Results for this model are not shown in Figure 3 but its log-likelihoods for protein-to-DNA alignment generally lie in between the other two models, with a relative improvement of around 1% over the symmetric context-independent model. Thus, the improvement from allowing context-dependence appears to be roughly double the improvement from allowing symmetry-breaking, for this task.

3.2 Aligning protein domain profiles to nanopore reads

In the previous section, we developed a state machine to model nanopore-specific sequencing errors and used that to better align protein sequences to nanopore reads. Machine Boss is able to combine arbitrary state machines, so the previous error models can be easily combined with other bioinformatics state machines. We next sought to investigate whether a richer error model would also benefit a profile HMM search. Instead of aligning a single protein sequence to a nanopore read, we combined a profile HMM with our nanopore-specific error model and aligned it to the same set of nanopore reads from the previous section. For this experiment, we again focused on the insB1 transposase from the previous example and used the Pfam DDE_Tnp_IS1 domain (accession PF03400), which profiles its catalytic domain. We imported the HMMER-formatted profile HMM from the Pfam database into Machine Boss, and composed it with each of two error models: the symmetric context-independent error model, and the asymmetric context-dependent model. We did not use the asymmetric context-independent model in this experiment. For this experiment, we did not generate custom C++ code and instead relied on Machine Boss’s internal Forward algorithm implementation (see Supplementary Information). Results are shown in Figure 3B. Proceeding from the symmetric context-independent model to the richer asymmetric context-dependent model, we see a relative increase in the log-likelihood of 4.1% for positives and 2.2% for negatives.

3.3 Decoding the most likely output sequence of a neural network basecaller

Our third experiment tests the decoding algorithms used for basecalling on the Oxford Nanopore Technologies (ONT) sequencing platform. As a single strand of DNA (or RNA) passes through the protein nanopore, it perturbs an electrical current signal in a sequence-dependent way. A neural network trained with Connectionist Temporal Classification (CTC, Graves ) outputs a probability distribution over sequences, which requires an additional decoding step to find the most likely sequence. CTC was developed for speech recognition and first applied to nanopore sequencing by Chiron (Teng ), and was later adopted by various ONT basecallers. Bonito (https://github.com/nanoporetech/bonito) is ONT’s most recent research basecaller: it uses a convolutional architecture based on QuartzNet (Kriman ), and is trained with CTC loss. In practice, Bonito uses Viterbi decoding, which simply takes the argmax of the logits and concatenates the resulting nucleotide and gap characters. In this test, we compare the use of a Viterbi decoding scheme, which just finds the single most probable path through the data, with a beam search, a heuristic search algorithm which looks for the best label sequence. The CTC probability outputs are similar to profile HMMs and as such can be interpreted as state machines (Silvestre-Ryan and Holmes, 2018). We evaluated these algorithms on a small sample of 100 reads from a publically available R9.4 Klebsiella pneumoniae dataset (Wick ). Accuracy was evaluated by aligning basecalled reads with minimap2 (Li, 2018) to a reference genome from the same study. Each read was basecalled with a version of Bonito modified to save the network output, which was then loaded as a state machine into Machine Boss using the recognize-merge-csv option, which constructs a state machine that merges repeated characters in the same manner as the CTC loss, as described in (Graves ). Results are shown in Figure 4. We found that the beam search yielded an increase of 0.2% median accuracy over Viterbi, though in practice its greater computational cost would likely not be worth such a slight improvement. These results were obtained with a beam width of 5; a larger beam size of 50 did not noticeably improve the results.

Fig. 4.

A beam-search decoding of the maximum likelihood sequence of Oxford Nanopore’s Bonito basecaller slightly outperforms a Viterbi best-path decoding on a sample of 100 Klebsiella pneumoniae reads. The percent accuracy is defined as the number of identities in the alignment divided by the total alignment length. Median accuracy with Viterbi was 92.8% while beam search yielded a median accuracy of 93.0%. This slight increase in accuracy does incur a computational cost: the beam search (width of 5) takes roughly 1.25 times as long as the Viterbi decoding. We further observe that a bespoke Python implementation of Viterbi decoding (optimized for this model architecture) was roughly five times as fast as Machine Boss’s generic C++ implementation of Viterbi decoding (which spends most of its time constructing and topo-sorting the state machine). This reinforces the conclusion that Machine Boss is better suited to development-stage prototyping, than to computationally intensive end-user applications In addition to the decoding of single reads, more elaborate dynamic programming algorithms for consensus decoding (Silvestre-Ryan and Holmes, 2018) can also easily be implemented and tested in Machine Boss. In this case, performance is too slow for practical application to large datasets, though these reference implementations can be used to debug domain-specific software. In our case, Machine Boss has helped with testing our own consensus basecalling software PoreOver (https://github.com/jordisr/poreover, Silvestre-Ryan and Holmes, 2020).

3.4 Constructing a repeat-avoiding code for DNA data storage

Our last computational experiment is motivated not by sequence analysis, but by analysis of the state machines themselves. We sought to investigate the complexity of error-tolerant codes for storing information in DNA. We developed a state machine for converting binary information to DNA sequences using, as a starting point, the DNA storage code developed by Goldman . The central idea of this code is that DNA homopolymers are often misread by many sequencing technologies, so this class of errors can be avoided altogether by never repeating a base in the encoded sequence. This leaves three nucleotides available for encoding information at any given position in the sequence, which corresponds to a radix-3 (ternary) representation. We implement this as a multiplication of two machines: one that converts binary to ternary (with some unavoidable inflation of the message size), and a second machine that converts ternary to non-repeating DNA. This configuration is shown in Figure 2. Using Machine Boss, we were able to successfully prototype this machine and to confirm that it accurately encodes binary messages to non-repeating DNA strings and decodes in the opposite direction, with the ratio of binary to DNA message lengths asymptotically approaching the expected limit of 3/2. This ratio is calculated as follows. The conversion of binary to ternary is approximate, batching the input bits into triplets (with possibilities per batch) and outputting trits—i.e. ternary digits—in pairs (with possibilities per batch). Thus, the input sequence has 3/2 as many characters as the output. The ternary-to-DNA conversion converts each ternary digit to a single nucleotide, so the overall input/output ratio for the full binary-to-DNA conversion is also 3/2. This is slightly wasteful given that the Shannon information content of a non-repeating DNA sequence is bits/symbol, slightly greater than 3/2. The wastage is incurred by the batched binary-to-ternary conversion, since there are more output possibilities than input possibilities for each batch. As can be seen in machine A of Figure 2, there is no triplet of input bits that will ever output the pair of trits ‘22’. This reflects a more general phenomenon that a finite-state machine cannot perfectly convert a radix-2 input to a radix-3 output (essentially, it can only compute the last digit of this conversion, which amounts to dividing the input by 3 and outputting the remainder; the quotient must then be fed into a similar machine to compute the second-last digit, and so on). It is possible to get quite close to the limit, though, by batching input bits in this way, and a batch size of 3 bits is a reasonably efficient compromise in terms of the number of states required by the machine; no improvement can be gained by improving the batch size until one reaches 11 bits, whereupon the ratio of input/output message lengths is , but this requires states to track each batch. Such a machine can readily be prototyped with Machine Boss, but the algorithms to manipulate and use the state machines become quite cumbersome for large machines (in addition to the well-known time complexity of dynamic programming to state machines, Machine Boss performs operations like topological sort and state elimination that can be slow for very large machines). The input/output ratio of 3/2 is approached asymptotically from below, because there is a necessary overhead involved in encoding the message length itself; our machine encodes this using the otherwise-unused pair of output trits ‘22’ as an end-of-message terminator sequence. For simplicity, this mechanism is not included in Figure 2; when it is included, the combined machine for binary-to-non-repeating-DNA conversion has 85 states and 132 transitions (44 IO-conditioned). The component machines were constructed with a short JavaScript program, and are available as presets in Machine Boss. We can readily extend the above-described approach to study more elaborate DNA storage codes. For example, we can develop a DNA-encoding machine that avoids not just repeated nucleotides in the output, but also avoids certain nucleotide motifs, such as restriction enzyme sites; briefly, the transition graph of such a machine can be found by starting with a de Bruijn graph over k-mers, from which the prohibited k-mers are then deleted. (Of course, restriction enzyme sites that contain repeated nucleotides would already be excluded.) We might also incorporate error-correction units, such as Hamming codes or indel-resistant ‘watermarks’. Finally, we can incorporate technology-specific models of sequencing error, such as the nanopore error models described in previous sections, when decoding messages. All these variations can be implemented as modular machines and factored into the ‘matrix multiplication’ of Figure 2. For example, introducing a Hamming(7,4) error-correcting parity code (Supplementary Fig. S1) to the non-repeating-DNA code (Fig. 2) yields a machine with 1365 states and 1812 transitions (292 IO-conditioned), whose input/output ratio is . A deeper exploration of these ideas, using state machines to prototype the codes and investigating their error-correcting properties by simulation, is available in a separate preprint (Holmes, 2016).

4 Discussion

Machine Boss can be useful for prototyping, testing, and theoretical analysis of state machines. In most cases, it is not suitable for developing polished bioinformatics tools, since further heuristic or custom optimizations of the generated state machines and code (beyond Machine Boss’s automated capabilities) is often possible. As an example of this further optimization, our context-dependent error model has 50 states: a start state, an end state, and 48 states which consist of match, insert, and delete states repeated in 16 different flanking contexts. However, it is unnecessary to allocate storage for all 50 states during dynamic programming: the flanking context is always exactly determined by the position in the input genomic sequence, so only 5 states are ever accessible at any position in the dynamic programming matrix. An optimized implementation could make use of this, but Machine Boss currently lacks the sophistication to deduce such optimizations automatically. Rather, Machine Boss can be used (as we have done here) to evaluate whether such development is worthwhile, and to provide a robust reference implementation against which the results of a more optimized version can be checked. Another application involves nanopore basecalling. The outputs of deep learning basecallers can be interpreted as machine transition weights (Jain ; Wick ). In building on these results, we have found Machine Boss useful as a debugging and profiling tool (Silvestre-Ryan and Holmes, 2018). Compared to recent deep learning approaches, automata retain some merits: they are highly interpretable, conceptually straightforward and generally predictable. The interpretability is especially appealing when paths through the automaton have clear meaning—as is the case when state machines are used to represent biological processes such as translation and splicing, information-theoretic processes like radix-based coding, or evolutionary processes such as indels (for which purpose Machine Boss includes a reference implementation of the Thorne–Kishino–Felsenstein model, Thorne ). The software development was motivated directly by these cases, but the algorithms implemented are general enough that we have been able to use it for applications in nanopore analysis as well. The README file in the Machine Boss repository describes several further applications, including machines to search for a PROSITE regular expression in a protein sequence and to count copies of this motif in a (translated) DNA sequence. As with the examples in this article, the power of this approach rests on the ability to combine such state machines in a general way, together with new machines as yet undeveloped. Click here for additional data file.

39 in total

1. Evolutionary HMMs: a Bayesian approach to multiple alignment.

Authors: I Holmes; W J Bruno
Journal: Bioinformatics Date: 2001-09 Impact factor: 6.937

2. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model.

Authors: Marina Alexandersson; Simon Cawley; Lior Pachter
Journal: Genome Res Date: 2003-03 Impact factor: 9.043

3. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood.

Authors: Adam Siepel; David Haussler
Journal: Mol Biol Evol Date: 2003-12-05 Impact factor: 16.240

4. Profile analysis: detection of distantly related proteins.

Authors: M Gribskov; A D McLachlan; D Eisenberg
Journal: Proc Natl Acad Sci U S A Date: 1987-07 Impact factor: 11.205

5. A general method applicable to the search for similarities in the amino acid sequence of two proteins.

Authors: S B Needleman; C D Wunsch
Journal: J Mol Biol Date: 1970-03 Impact factor: 5.469

6. Incorporating indel information into phylogeny estimation for rapidly emerging pathogens.

Authors: Benjamin D Redelings; Marc A Suchard
Journal: BMC Evol Biol Date: 2007-03-14 Impact factor: 3.260

7. Automated generation of heuristics for biological sequence comparison.

Authors: Guy St C Slater; Ewan Birney
Journal: BMC Bioinformatics Date: 2005-02-15 Impact factor: 3.169

8. Consistency of VDJ Rearrangement and Substitution Parameters Enables Accurate B Cell Receptor Sequence Annotation.

Authors: Duncan K Ralph; Frederick A Matsen
Journal: PLoS Comput Biol Date: 2016-01-11 Impact factor: 4.475

9. The Dfam database of repetitive DNA families.

Authors: Robert Hubley; Robert D Finn; Jody Clements; Sean R Eddy; Thomas A Jones; Weidong Bao; Arian F A Smit; Travis J Wheeler
Journal: Nucleic Acids Res Date: 2015-11-26 Impact factor: 16.971

10. The Pfam protein families database in 2019.

Authors: Sara El-Gebali; Jaina Mistry; Alex Bateman; Sean R Eddy; Aurélien Luciani; Simon C Potter; Matloob Qureshi; Lorna J Richardson; Gustavo A Salazar; Alfredo Smart; Erik L L Sonnhammer; Layla Hirsh; Lisanna Paladin; Damiano Piovesan; Silvio C E Tosatto; Robert D Finn
Journal: Nucleic Acids Res Date: 2019-01-08 Impact factor: 16.971

1 in total

1. A Model of Indel Evolution by Finite-State, Continuous-Time Machines.

Authors: Ian Holmes
Journal: Genetics Date: 2020-10-05 Impact factor: 4.562

1 in total