
A linear-time algorithm that avoids inverses and computes Jackknife (leave-one-out) products like convolutions or other operators in commutative semigroups.

John L Spouge¹, Joseph M Ziegelbauer², Mileidy Gonzalez³.

Abstract

BACKGROUND: Data about herpesvirus microRNA motifs on human circular RNAs suggested the following statistical question. Consider independent random counts, not necessarily identically distributed. Conditioned on the sum, decide whether one of the counts is unusually large. Exact computation of the p-value leads to a specific algorithmic problem. Given $n$ elements $g_0, g_1, \ldots, g_{n-1}$ in a set $G$ with the closure and associative properties and a commutative product without inverses, compute the jackknife (leave-one-out) products $\bar{g}_j = g_0 g_1 \cdots g_{j-1} g_{j+1} \cdots g_{n-1}$ ($0 \le j < n$).
RESULTS: This article gives a linear-time Jackknife Product algorithm. Its upward phase constructs a standard segment tree for computing segment products like $g_{i,j} = g_i g_{i+1} \cdots g_{j-1}$; its novel downward phase mirrors the upward phase while exploiting the symmetry of $g_j$ and its complement $\bar{g}_j$. The algorithm requires storage for $2n$ elements of $G$ and only about $3n$ products. In contrast, the standard segment tree algorithms require about $n$ products for construction and $\log_2 n$ products for calculating each $\bar{g}_j$, i.e., about $n \log_2 n$ products in total; and a naïve quadratic algorithm using $n - 2$ element-by-element products to compute each $\bar{g}_j$ requires $n(n - 2)$ products.
CONCLUSIONS: In the herpesvirus application, the Jackknife Product algorithm required 15 min; standard segment tree algorithms would have taken an estimated 3 h; and the quadratic algorithm, an estimated 1 month. The Jackknife Product algorithm has many possible uses in bioinformatics and statistics.


Keywords: Commutative semigroup; Data structure; Jackknife products; Leave-one-out; Segment tree

Year: 2020 PMID: 32968428 PMCID: PMC7502207 DOI: 10.1186/s13015-020-00178-x

Source DB: PubMed Journal: Algorithms Mol Biol ISSN: 1748-7188 Impact factor: 1.405

Background

A biological question

Circular RNAs (circRNAs) are single-stranded noncoding RNAs that can inhibit another RNA molecule by binding to it, mopping it up like a sponge. During herpesvirus infection, human hosts produce circRNAs with target sites that may bind herpesvirus microRNA (miRNA) [1] (see Fig. 1). Given a sequence motif, e.g., a target site for a miRNA, researchers counted how many times the motif occurs in each circRNA sequence. They then posed a question: is the motif unusually enriched in any of the circRNAs, i.e., does any circRNA have too many occurrences of the motif to be explained by chance alone? If “yes”, the researchers could then focus their further experimental efforts on those circRNAs.

Fig. 1

A schematic diagram of a herpesvirus miRNA motif occurring on human circRNAs. As indicated in the legend, each thin circle represents a circRNA; each thick line segment, the occurrence of a miRNA motif on the corresponding circRNA. Both circRNAs and the miRNA motif have nucleotide sequences represented by IUPAC codes (A, C, G, U). This figure illustrates occurrences of a single miRNA motif (e.g., UUACAGG) on the circRNAs. The biological question is: “does any circRNA have too many occurrences of the motif to be explained by chance alone?” In the actual application, the circRNAs ranged in length from 69 nt to 158,565 nt.

A statistical answer

Figure 1 illustrates a set of circRNAs with varying length, with a single miRNA motif occurring as indicated on each circRNA. Let $j$ ($0 \le j < n$) index the circRNAs; the random variate $X_j$ count the motif occurrences in the $j$-th circRNA; $x_j$ equal the observed count for $X_j$; and the sum $S = X_0 + X_1 + \cdots + X_{n-1}$ count the total motif occurrences among the circRNAs, with observed total $s = x_0 + x_1 + \cdots + x_{n-1}$.

The following set-up provides a general statistical test for deciding the biological question. Let $X_0, X_1, \ldots, X_{n-1}$ represent independent random counts (i.e., non-negative integer random variates), not necessarily identically distributed, with sum $S$. Given observed values $x_0, x_1, \ldots, x_{n-1}$ with observed sum $s$, consider the computation of the conditional p-values $P(X_j \ge x_j \mid S = s)$ ($0 \le j < n$). The conditional p-values can decide the question: “Is any term in the sum unusually large relative to the others?”

The abstract question in the previous paragraph generalizes some common tests. For example, the standard 2 × 2 Fisher exact test [2, p. 96] answers the question in the special case of $n = 2$ categories: each count has a binomial distribution with a common success probability, conditional on known numbers of trials. Although the Fisher exact test generalizes directly to a single exact p-value for a $2 \times n$ table [3], the generalization can require prohibitive amounts of computation. The abstract question corresponds to a computationally cheaper alternative that also decides which columns in the table are unusual [4].

To derive an expression for the conditional p-value, therefore, let $g_j[k] = P(X_j = k)$ be given, so the array $g_j = (g_j[0], g_j[1], \ldots, g_j[s])$ gives the distribution of $X_j$, truncated at the observed total $s$. Because $g_j$ is a truncated probability distribution, $g_j \in G$, where $G$ is the set of all real $(s+1)$-tuples $g$ that satisfy $g[k] \ge 0$ ($0 \le k \le s$) and $\sum_{k=0}^{s} g[k] \le 1$. The truncation still permits exact calculation of the probabilities below.

To calculate the distribution of the sum $S$ for $S \le s$, define the truncated convolution operation $\ast$, for which $(g \ast g')[k] = \sum_{i=0}^{k} g[i]\, g'[k-i]$ ($0 \le k \le s$). Hereafter, the operation “$\ast$” is often left implicit: $g g' = g \ast g'$. Let $g = g_0 g_1 \cdots g_{n-1}$, so $g[k] = P(S = k)$ ($0 \le k \le s$). Define the “jackknife products” $\bar{g}_j = g_0 g_1 \cdots g_{j-1} g_{j+1} \cdots g_{n-1}$ ($0 \le j < n$) (implicitly including the products $\bar{g}_0 = g_1 g_2 \cdots g_{n-1}$ and $\bar{g}_{n-1} = g_0 g_1 \cdots g_{n-2}$). The jackknife products contain the same products as $g$, except that in turn each skips over $g_j$ ($0 \le j < n$). Like the jackknife procedure in statistics, therefore, jackknife products successively omit each datum in a dataset [5]. With the jackknife products in hand, the conditional p-values are a straightforward computation:

$$P(X_j \ge x_j \mid S = s) = \frac{\sum_{k=x_j}^{s} g_j[k]\, \bar{g}_j[s-k]}{g[s]} \qquad (1)$$

With respect to Eq. (1) and the biological question in Fig. 1, Appendix B gives $g[s]$, the count of the ways that $n$ circRNAs of known but varying length may contain $s$ miRNA motifs of equal length; $g_j[k]$, the count of the ways that the $j$-th circRNA may contain $k$ motifs; and $\bar{g}_j[s-k]$, the count of the ways that all circRNAs but the $j$-th may contain $s-k$ motifs. Appendix B derives the count for circRNAs from the easier count for placing motifs on a linear RNA molecule. For combinatorial probabilities like $P(X_j \ge x_j \mid S = s)$, Eq. (1) remains relevant, even if the $g_j[k]$ are counts instead of probabilities. The biological question therefore exemplifies a commonplace computational need in applied combinatorial probability.

The Discussion indicates that in our application, transform methods can encounter substantial obstacles when computing Eq. (1) (e.g., see [6]), because the quantities in Eq. (1) can range over many orders of magnitude. This article therefore pursues direct exact calculation of $\bar{g}_j$. The product forms of $g$ and $\bar{g}_j$ suggest that any efficient algorithm may be abstracted to broaden its applications, as follows.
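The set-up above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `truncated_convolve` and `conditional_p_value` are hypothetical, the toy distributions are invented, and the p-value is assumed to be the upper tail $P(X_j \ge x_j \mid S = s)$. The jackknife product is formed here by direct convolution, which is only sensible for tiny inputs.

```python
from fractions import Fraction


def truncated_convolve(a, b):
    """Convolve two equal-length tuples, dropping all terms beyond index s."""
    s = len(a) - 1
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(s + 1)]


def conditional_p_value(g_j, g_bar_j, x_j):
    """Upper-tail P(X_j >= x_j | S = s), given g_j and its jackknife product."""
    s = len(g_j) - 1
    # joint[k] is proportional to P(X_j = k, S = s), by independence.
    joint = [g_j[k] * g_bar_j[s - k] for k in range(s + 1)]
    return sum(joint[x_j:]) / sum(joint)


# Hypothetical toy data: three independent counts, each uniform on {0, 1},
# with observed total s = 2, so every tuple has s + 1 = 3 entries.
half = Fraction(1, 2)
g = [[half, half, 0], [half, half, 0], [half, half, 0]]
g_bar_0 = truncated_convolve(g[1], g[2])  # leave out g[0]
p = conditional_p_value(g[0], g_bar_0, x_j=1)
print(p)
```

Exact rational arithmetic is used only to keep the toy example free of rounding; the same functions also accept floats or integer counts.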

Semigroups, groups, and commutative groups

Let $(G, \circ)$ denote a set $G$ with a binary product $\circ$ on its elements. Let “$g \in G$” denote “$g$ is an element of $G$”, and consider the following properties [7].

Closure: $g \circ g' \in G$ for every $g, g' \in G$
Associative: $(g \circ g') \circ g'' = g \circ (g' \circ g'')$ for every $g, g', g'' \in G$
Identity: There exists an identity element $e \in G$, such that $e \circ g = g \circ e = g$ for every $g \in G$
Commutative: $g \circ g' = g' \circ g$ for every $g, g' \in G$

If the Closure and Associative properties hold, $(G, \circ)$ is a semigroup. Without loss of generality, we assume below that the Identity property holds. If not, adjoin an element $e$ to $G$, such that $e \circ g = g \circ e = g$ for every $g \in G$. In addition, if the Commutative property holds for every $g, g' \in G$, the semigroup is commutative. Unless stated otherwise hereafter, $G$ denotes a commutative semigroup. The Jackknife Product algorithm central to this article is correct in a commutative semigroup.

Inverse: For every $g \in G$, there exists an inverse $g^{-1} \in G$, such that $g \circ g^{-1} = g^{-1} \circ g = e$

As shown later, the Jackknife Product algorithm does not require the Inverse property. In passing, note that the convolution semigroup relevant to the circRNA–miRNA application lacks the Inverse property, as does any convolution semigroup for calculating p-values, e.g., the ones relevant to sequence motif matching [6]. To demonstrate, let $X$ and $Y$ be independent integer random variates. The identity for convolution corresponds to the constant variate $0$, because $X + 0 = X$ for every variate $X$. If $X + Y = 0$, however, the independence of $X$ and $Y$ implies that both are constant and therefore, for non-negative counts, $X = Y = 0$. In the relevant convolution semigroup, therefore, all elements except the identity lack an inverse.

The non-zero real numbers under ordinary multiplication form a commutative semigroup with the Inverse property. They provide a familiar setting for discussing some algorithmic issues when computing $\bar{g}_j$. Let $g = g_0 g_1 \cdots g_{n-1}$ be the usual product of real numbers, and consider the toy problem of computing all jackknife products $\bar{g}_j$ that omit a single factor $g_j$ ($0 \le j < n$) from $g$. Inverses are available, so an obvious algorithm computes $g$ and then $\bar{g}_j = g\, g_j^{-1}$ with $n$ inverses and about $2n$ products. If the inverses were unavailable, however, the naïve algorithm using $n - 2$ element-by-element products to compute each $\bar{g}_j$ would require $n(n - 2)$ products. The quadratic time renders the naïve algorithm impractical for many applications.

Figure 2 illustrates a standard data structure called a segment tree, omitting the root at the top of the segment tree. Algorithms based solely on a segment tree can calculate the jackknife products in time $O(n \log n)$, fast enough for many applications. The segment tree computes segment products like $g_{i,j} = g_i g_{i+1} \cdots g_{j-1}$ without using the commutative property, so it can similarly compute jackknife products like $\bar{g}_j = g_{0,j}\, g_{j+1,n}$. If the semigroup is commutative, however, a Jackknife Product algorithm can avoid inverses and reduce the computational time further, from $O(n \log n)$ to $O(n)$. With in-place computations requiring only the space for the segment tree, the Jackknife Product algorithm avoids inverses yet still requires only about $3n$ products and storage for $2n$ numbers. It is therefore surprisingly economical, even when compared to the obvious algorithm using inverses. Indeed, our application to circular RNA required some economy, with its convolution of $n$ distributions, some truncated only after $s + 1$ terms. In a general statistical setting, convolutions form a commutative semigroup without inverses, so our application already indicates that the Jackknife Product algorithm has broad applicability.
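The toy problem can be made concrete with a short Python sketch of the two baseline approaches just described. The function names are hypothetical, and the `max` example is our own illustration of a commutative semigroup without inverses; neither appears in the article.

```python
from functools import reduce
import math
import operator


def jackknife_with_inverses(xs):
    """Non-zero reals only: one full product, then one division per element."""
    total = math.prod(xs)
    return [total / x for x in xs]


def jackknife_naive(xs, op, identity):
    """Any semigroup: rebuild each leave-one-out product from scratch.

    Each output costs n - 2 products, so the total is n(n - 2) products.
    """
    out = []
    for j in range(len(xs)):
        rest = xs[:j] + xs[j + 1:]
        out.append(reduce(op, rest) if rest else identity)
    return out


reals = [2.0, 4.0, 8.0]
print(jackknife_with_inverses(reals))
print(jackknife_naive(reals, operator.mul, 1.0))
# max on numbers is commutative and associative but has no inverses.
print(jackknife_naive([3, 1, 2], max, float("-inf")))
```

The first function fails as soon as inverses are unavailable; the second always works but scales quadratically, which is the gap the Jackknife Product algorithm is meant to close.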

Theory

Appendix A proves the correctness of the Jackknife Product algorithm given below.

The Jackknife Product algorithm

Let $G$ be a commutative semigroup. The Jackknife Product algorithm has three phases: upward, downward, and transposition. Its upward phase simply constructs a segment tree (see Fig. 2); its downward phase exploits the symmetry of $g_j$ and its complement $\bar{g}_j$ to mirror the upward phase while computing $\bar{g}_j$ (see Fig. 3); and its final transposition phase then swaps successive pairs in an array (not pictured). As Figs. 2 and 3 suggest, the three phases yield a simpler algorithm if $n$ is a binary power. To apply the simpler algorithm to general $n$, pad $g_0, g_1, \ldots, g_{n-1}$ on the right with copies of the identity $e$ up to $n^*$ elements, where $n^*$ is the smallest binary power greater than or equal to $n$, i.e., replace $(g_0, \ldots, g_{n-1})$ with $(g_0, \ldots, g_{n-1}, e, \ldots, e)$, with $n^* - n$ copies of $e$. The algorithm for binary powers can therefore handle any input of $n$ elements by padding up to $n^*$ elements without loss of generality. The algorithm given below for general $n$ is slightly more intricate than the algorithm for binary powers, but it may save almost a factor of 2 in storage and time by omitting the padded copies of $e$. In any case, the simpler algorithm can always be recovered from the phases for general $n$ given here, if desired.

Fig. 2

A (rootless) segment tree. This figure illustrates the rootless segment tree constructed in the upward phase of the Jackknife Product algorithm. The commutative semigroup illustrated is the set of nonnegative integers under addition. The bottom row of squares ($L_0$) contains $L_0[j] = g_j$ ($0 \le j < n$). In the next row up, as indicated by the arrow pairs leading into each circle, the array $L_1$ contains consecutive sums of consecutive disjoint pairs in $L_0$, e.g., $L_1[0] = L_0[0] + L_0[1]$. The rest of the segment tree is constructed recursively upward to $L_2$, just as $L_1$ was constructed from $L_0$. Here, 2 copies of the additive identity 0 pad out $L_0$ on the right. Padded on the right, the copies contribute literally nothing to the segment tree above them. Their non-contributions have dotted outlines.

Fig. 3

A (rootless) complementary segment tree. This figure illustrates the rootless complementary segment tree constructed in the downward phase of the Jackknife Product algorithm from the rootless segment tree in Fig. 2. The downward phase starts by initializing the topmost row ($\bar{L}_2$) with the topmost row ($L_2$) of the rootless segment tree. The row $L_2$ in Fig. 2 and the row $\bar{L}_2$ in Fig. 3, e.g., both contain 22 and 11. For each row $\bar{L}_k$ in Fig. 3, downward arrows run from $\bar{L}_k$ to $\bar{L}_{k-1}$. As they indicate, each node in $\bar{L}_k$ contributes to its 2 “nieces” in $L_{k-1}$ in Fig. 2 to produce the next row down $\bar{L}_{k-1}$ in Fig. 3, e.g., $\bar{L}_2[1] = 11$ contributes to its nieces $L_1[0] = 13$ and $L_1[1] = 9$ in the segment tree, to produce $\bar{L}_1[0] = 24$ and $\bar{L}_1[1] = 20$ in the complementary segment tree. The rest of the complementary segment tree is constructed recursively downward to $\bar{L}_0$, just as $\bar{L}_1$ was constructed from $\bar{L}_2$. In Fig. 2, the elements of $L_0$ (in squares) total 33. To demonstrate the effect of the Jackknife Product algorithm, subtract in turn in Fig. 3 each element (25, 28, 30, 27, 26, 29, 33, 33) in the bottom row $\bar{L}_0$ from the total 33. The result (8, 5, 3, 6, 7, 4, 0, 0) is the bottom row $L_0$ in Fig. 2 with successive pairs transposed, so $\bar{L}_0[j] = \bar{g}_{\tau(j)}$, or equivalently $\bar{g}_j = \bar{L}_0[\tau(j)]$, where $\tau$ swaps each even index $j$ with $j + 1$.

We start with notational preliminaries. Define the floor function $\lfloor x \rfloor$, the largest integer not exceeding $x$, and the ceiling function $\lceil x \rceil$, the smallest integer not less than $x$ (both standard); and the binary right-shift function $j \gg 1 = \lfloor j/2 \rfloor$. Other quantities also smooth our presentation. Given a product $g = g_0 g_1 \cdots g_{n-1}$ of interest, define $m = \lceil \log_2 n \rceil$ and $n_k = \lceil n / 2^k \rceil$ for $0 \le k < m$. Below, the symbol “□” connotes the end of a proof.

The upward phase

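The upward phase starts with the initial array $L_0[j] = g_j$ ($0 \le j < n$) and simply computes a standard (but rootless) segment tree consisting of segment products $L_k[j]$, the product of all $g_i$ with $j 2^k \le i < \min\{(j+1) 2^k, n\}$, for $0 \le j < n_k$ and $0 \le k < m$ (where $m = \lceil \log_2 n \rceil$ counts the rows and $n_k = \lceil n/2^k \rceil$ is the length of row $k$).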
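As a minimal sketch of the upward phase, the following Python builds the rootless segment tree for the simpler binary-power variant, in which the input is padded on the right with copies of the identity. It does not reproduce the truncated arrays of the algorithm for general $n$, and the function name `build_rootless_segment_tree` and the parameters `op` and `identity` are illustrative assumptions.

```python
import operator


def build_rootless_segment_tree(g, op, identity):
    """Return rows L_0, L_1, ..., stopping at the row of length 2 (no root).

    Simplification: the input is padded with the identity up to a binary power.
    """
    size = 1 << max(1, (len(g) - 1).bit_length())  # next binary power, at least 2
    rows = [list(g) + [identity] * (size - len(g))]
    while len(rows[-1]) > 2:
        below = rows[-1]
        # Each node is the product of a disjoint consecutive pair in the row below.
        rows.append([op(below[2 * j], below[2 * j + 1])
                     for j in range(len(below) // 2)])
    return rows


# The six elements recovered from the Fig. 2 example, under addition.
for row in build_rootless_segment_tree([5, 8, 6, 3, 4, 7], operator.add, 0):
    print(row)
```

With the nonnegative integers under addition, the rows reproduce the values quoted in the figure captions, including the top row (22, 11).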

Comments

(1) If $n$ is a binary power, every row has even length and the final line in the upward phase can be omitted.
(2) Of some peripheral interest, Laaksonen [8] gives the algorithm for binary powers in a different context, embedding a binary tree in a single array of length $2n$. If any $g_j$ changes, he also shows how to update the single array with $O(\log n)$ multiplications. If the downward phase (next) does not overwrite the segment tree by using in-place computation, it permits a similar update.

The downward phase

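The transposition function $\tau$ transposes adjacent indices, i.e., $\tau(j) = j + 1$ for even $j$ and $\tau(j) = j - 1$ for odd $j$, e.g., $\tau(0) = 1$ and $\tau(1) = 0$. We also require, for each node below the top row, the index of its “aunt”, i.e., the sibling of its parent in the row above, as illustrated by Fig. 3. Just as Fig. 2 illustrates a rootless segment tree in the upward phase, Fig. 3 illustrates the corresponding rootless complementary segment tree in the downward phase. The downward phase computes complementary segment products $\bar{L}_k[j]$ for every index $j$ in row $k$ and every row $k$ of the rootless tree.

Comments

(1) If $n$ is a binary power, every row of the tree is complete, and the final line in the downward phase can be omitted.
(2) The downward phase can be modified in the obvious fashion to permit in-place calculation of $\bar{L}_k$ from $L_k$, reducing total memory allocation by about a factor of 2.

As Appendix A proves, the final array $\bar{L}_0$ has elements $\bar{L}_0[j] = \bar{g}_{\tau(j)}$ for the indices $j$ in complete pairs, with an additional final element if $n$ is odd, so the Jackknife Product algorithm ends with a straightforward transposition phase. The transposition phase can permit an in-place calculation of $\bar{g}_j$ to overwrite $\bar{L}_0$.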
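The three phases can be combined in a short Python sketch. It assumes the simpler binary-power variant with identity padding rather than the truncated arrays of the algorithm for general $n$, and it keeps every row instead of computing in place. The name `jackknife_products` and its parameters are hypothetical. The aunt of node `j` is taken to be the sibling of its parent, `(j >> 1) ^ 1`, which matches the numbers quoted for Figs. 2 and 3.

```python
import operator


def jackknife_products(g, op, identity):
    """Leave-one-out products in a commutative semigroup, without inverses."""
    n = len(g)
    if n == 0:
        return []
    if n == 1:
        return [identity]  # the empty product
    # Upward phase: rootless segment tree over the identity-padded input.
    size = 1 << (n - 1).bit_length()
    rows = [list(g) + [identity] * (size - n)]
    while len(rows[-1]) > 2:
        below = rows[-1]
        rows.append([op(below[2 * j], below[2 * j + 1])
                     for j in range(len(below) // 2)])
    # Downward phase: comp[j] is the product of everything outside the
    # segment of node j's sibling; the top row is copied unchanged.
    comp = list(rows[-1])
    for row in reversed(rows[:-1]):
        comp = [op(comp[(j >> 1) ^ 1], row[j]) for j in range(len(row))]
    # Transposition phase: swap successive pairs and drop the padding.
    return [comp[j ^ 1] for j in range(n)]


print(jackknife_products([5, 8, 6, 3, 4, 7], operator.add, 0))
print(jackknife_products([2, 3, 5, 7], operator.mul, 1))
```

Because commutativity is used when an aunt's complement is multiplied by a niece, the sketch should not be expected to give correct results for a non-commutative product.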

Computational time and storage

With row lengths $n_k = \lceil n/2^k \rceil$, note $n 2^{-k} \le n_k < n 2^{-k} + 1$, so the row lengths sum to less than $2n$ plus the number of rows. To compute a row of the segment tree from the row below it, or to compute a row of the complementary segment tree from the row above it and the matching row of the segment tree, the Jackknife Product algorithm requires at most one product per element computed. For large $n$, therefore, the upward phase computing the segment tree requires about $n$ products; the downward phase, about $2n$ products. Likewise, if the downward and transposition phases compute in place by replacing each $L_k$ with $\bar{L}_k$ and $\bar{L}_0$ with the jackknife products, the algorithm storage is about $2n$ semigroup elements. Each of the three estimates just given for products and storage has an error of at most order $\log_2 n$, because each row contributes only a bounded rounding error.

Although the case of general $n$ could be handled by the algorithm for binary powers by padding the input with copies of the identity up to the next binary power, the truncated arrays in the Jackknife Product algorithm for general $n$ save up to about a factor of 2 in both products and storage.

As written, the conditional copy statements at the end of the upward and downward phases replicate elements already in storage. If the downward phase of the Jackknife Product algorithm is implemented with in-place computation of $\bar{L}_k$ from $L_k$, the copy statements ensure that the algorithm never overwrites any array element it needs later. Some statements may copy some elements more than once (and therefore unnecessarily), but at most a negligible number of copies are unnecessary.

The complementary segment tree in Fig. 3 implicitly indicates the nodes in the segment tree required to compute $\bar{g}_j$ for each $j$, i.e., exactly one node in each row $L_k$. Alone, the segment tree therefore requires at least one fewer multiplication than the number of rows, i.e., about $\log_2 n$ multiplications, to compute each $\bar{g}_j$.
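The product counts can be checked with a little arithmetic. The Python sketch below tallies products row by row for the padded binary-power variant (so its count is about $3n^*$ for the padded length $n^*$, not the tighter $3n$ of the truncated arrays) and compares it with the estimates quoted in the text for the competing algorithms. The function name and dictionary keys are hypothetical.

```python
import math


def product_counts(n):
    """Approximate product counts for n >= 2 inputs under three strategies."""
    size = 1 << max(1, (n - 1).bit_length())  # pad up to a binary power
    row_lengths = []
    length = size
    while length >= 2:  # rootless: the top row has length 2
        row_lengths.append(length)
        length //= 2
    upward = sum(row_lengths[1:])     # one product per node above the bottom row
    downward = sum(row_lengths[:-1])  # one product per node below the top row
    return {
        "jackknife_padded": upward + downward,
        "segment_tree_only": n + n * math.log2(n),  # estimate from the text
        "naive": n * (n - 2),
    }


for n in (8, 1024):
    print(n, product_counts(n))
```

For a binary power $n$, the padded count works out to $3n - 6$, in line with the estimate of about $3n$ products.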

Results

Appendix B gives the combinatorics relevant to the circRNA–miRNA application described in the “Background” section. As is typical in combinatorial probability, the quantities $g_j[k]$ were counts of configurations, here, the ways of placing miRNA motifs on circRNAs. The length of each motif was m = 7; the largest circRNA (hsa-circ-0003473) contained 158,565 nt, and the most abundant motif (CCCAGCU, for the m12-9star miRNA family) appeared 997 times, so the counts spanned thousands of orders of magnitude in Eq. (13) of Appendix B. In Eq. (1), the dimension $s + 1$ controls the number of terms in the convolutions. In the application, the dimension was largest for the miRNA motif with the most occurrences on the circRNAs.

An Intel Core i7-3770 CPU computed the p-value relevant to the biological application on June 17, 2015. To compare later with estimated times for competing algorithms, the Jackknife Product algorithm computed the relevant p-values in about 45 min, requiring about $3n$ products. In the application, therefore, $n$ products required about 15 min. The application of this article to circRNA–miRNA data appears elsewhere [1].

Discussion

This article has presented a Jackknife Product algorithm, which applies to any commutative semigroup $G$. The biological application to a circRNA–miRNA system exemplifies a general statistical method in combinatorial probability. In turn, the application in combinatorial probability exemplifies an even more general statistical test for whether a term in a sum of independent counting variates (not necessarily identically distributed) is unusually large.

Many biological contexts lead naturally to sums of independent counting variates. Domain alignments of proteins from cancer patients, e.g., display point mutations in their columns. For a given domain, a column with an excess of mutations might be inferred to cause cancer [9]. The Background section gives the pattern: let $X_j$ represent the mutation count in column $j$, with total mutations $S$. Given observed mutation counts $x_j$ with observed sum $s$, the conditional p-values $P(X_j \ge x_j \mid S = s)$ ($0 \le j < n$) can decide the question: “Does any column have an excess of mutations?” The actual application used other, very different statistical methods [9]. Unlike those methods, however, our methods can incorporate information from control (non-cancer) protein sequences to set column-specific background distributions for $X_j$.

The Benjamini–Hochberg procedure for controlling the false discovery rate in multiple tests requires either independent p-values [10] or dependent p-values with a positive regression dependency property [11]. Loosely, the positive regression dependency property means that the p-values tend to be small together, i.e., under the null hypothesis, given that one p-value is small, the other p-values tend to be smaller also. Inconveniently, our null hypothesis posits a fixed sum of independent counting variates, so if one variate is large and has a small p-value, it tends to reduce the other variates and increase their p-values. The circRNA–miRNA application therefore violates the statistical hypotheses of the Benjamini–Hochberg procedure. Fortunately, in the circRNA–miRNA application, a Bonferroni multiple test correction [12] sufficed because empirically, any p-value was either close to 1 or extremely small.

The Results state that the Jackknife Product algorithm computed the relevant p-values in about 45 min, with $n$ products requiring about 15 min of computation. In contrast, the naïve algorithm avoiding inverses and requiring $n(n-2)$ products would have taken about $15(n-2)$ min, i.e., about 1 month. As explained under the “Computational time and storage” heading in the Theory section, without exploiting the special form of the jackknife products $\bar{g}_j$, a segment tree requires about $n$ products for its construction and at least about $n \log_2 n$ products for the computation of the products $\bar{g}_j$. Alone, segment tree algorithms would therefore have taken a minimum of about $15(1 + \log_2 n)$ min, i.e., about 3 h.

The convolutions in Eq. (1) might suggest that jackknife products are susceptible to computation with Fourier or Laplace transforms, which convert convolutions into products. The “Results” section notes that in the biological application, however, the quantities in Eq. (1) spanned thousands of orders of magnitude, obstructing the direct use of transforms (e.g., see [6]). On one hand, the widely varying magnitudes necessitated an internal logarithmic representation of the quantities in the computer, a minor inconvenience for direct computation with the Jackknife Product algorithm. On the other hand, they might have presented a substantial obstacle for transforms. The famous Feynman anecdote about Paul Olum’s problem indicates the reason [13]:

So Paul is walking past the lunch place and these guys are all excited. “Hey, Paul!” they call out. “Feynman’s terrific! We give him a problem that can be stated in ten seconds, and in a minute he gets the answer to 10 percent. Why don’t you give him one?” Without hardly stopping, he says, “The tangent of 10 to the 100th.” I was sunk: you have to divide by pi to 100 decimal places! It was hopeless.

The Jackknife Product algorithm also abstracts to any commutative semigroup $G$, broadening its applicability enormously. As usual, abstraction eases debugging. Consider, e.g., the commutative semigroup consisting of all bit strings of length $n$ under the bitwise “or” operation. If the bit string $g_j$ has 1 in the $j$-th position and 0s elsewhere, then the segment product $g_{i,k} = g_i g_{i+1} \cdots g_{k-1}$ equals the bit string with 1s in positions $i, i+1, \ldots, k-1$ and 0s elsewhere. Similarly, the complementary segment product equals the bit string with 0s in positions $i, i+1, \ldots, k-1$ and 1s elsewhere. The Jackknife Product algorithm is easily debugged with output consisting of the segment and complementary segment trees for the bit strings.

As a final note, even if a semigroup lacks the Commutative property, the general product algorithm for a segment tree can still compute the jackknife products $\bar{g}_j$ ($0 \le j < n$) in time $O(n \log n)$. In a commutative semigroup $G$, however, the downward phase of the Jackknife Product algorithm exploits the special form of the products $\bar{g}_j$ to decrease the time to $O(n)$.
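The bit-string debugging idea can be sketched as follows, using Python integers as bit strings under bitwise “or”. The helper `jackknife_trees` is a hypothetical name, and it again assumes the padded binary-power variant rather than the truncated arrays for general $n$. Every complementary node should show 0s exactly where the segment of its sibling shows 1s.

```python
import operator


def jackknife_trees(g, op, identity):
    """Return (segment rows, complementary rows), both indexed from the bottom."""
    size = 1 << max(1, (len(g) - 1).bit_length())
    seg = [list(g) + [identity] * (size - len(g))]
    while len(seg[-1]) > 2:
        below = seg[-1]
        seg.append([op(below[2 * j], below[2 * j + 1])
                    for j in range(len(below) // 2)])
    comp = [list(seg[-1])]  # the top row is copied unchanged
    for row in reversed(seg[:-1]):
        above = comp[-1]
        comp.append([op(above[(j >> 1) ^ 1], row[j]) for j in range(len(row))])
    comp.reverse()  # align comp[k] with seg[k]
    return seg, comp


n = 6
full = (1 << n) - 1
bits = [1 << j for j in range(n)]  # g_j has a single 1 in position j
seg, comp = jackknife_trees(bits, operator.or_, 0)
for k in range(len(seg)):
    print(k, [format(x, "06b") for x in seg[k]],
          [format(x, "06b") for x in comp[k]])
```

Printed in binary, each row makes an indexing error visible at a glance, because a misplaced bit identifies both the row and the input element involved.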

Conclusions

This article has presented a Jackknife Product algorithm, which applies to any commutative semigroup $G$. The biological application to a circRNA–miRNA system uses a commutative semigroup of truncated convolutions to exemplify a specific application to combinatorial probabilities. In turn, the specific application in combinatorial probability exemplifies an even more general statistical test for whether a term in a sum of independent counting variates (not necessarily identically distributed) is unusually large. The general statistical test can evaluate the results of searching for a sequence or structure motif, or several motifs simultaneously. As the “Discussion” section explains, the test violates the hypotheses of the Benjamini–Hochberg procedure for estimating false discovery rates, but fortunately the Bonferroni and other multiple-test corrections remain available to control familywise errors. Abstraction from convolutions to commutative semigroups broadens the algorithm’s applicability even further. If an application only requires jackknife products and their number is large enough, the “Results” and “Theory” sections show that the linear time of the Jackknife Product algorithm can make it well worth the programming effort.

4 in total

1. FAST: Fourier transform based algorithms for significance testing of ungapped multiple alignments.

Authors: Niranjan Nagarajan; Uri Keich
Journal: Bioinformatics Date: 2008-01-06 Impact factor: 6.937

2. Note on an exact treatment of contingency, goodness of fit and other problems of significance.

Authors: G H Freeman; J H Halton
Journal: Biometrika Date: 1951-06 Impact factor: 2.445

3. Discovery of Kaposi's sarcoma herpesvirus-encoded circular RNAs and a human antiviral circular RNA.

Authors: Takanobu Tagawa; Shaojian Gao; Vishal N Koparde; Mileidy Gonzalez; John L Spouge; Anna P Serquiña; Kathryn Lurain; Ramya Ramaswami; Thomas S Uldrick; Robert Yarchoan; Joseph M Ziegelbauer
Journal: Proc Natl Acad Sci U S A Date: 2018-11-19 Impact factor: 11.205

4. Oncodomains: A protein domain-centric framework for analyzing rare variants in tumor samples.

Authors: Thomas A Peterson; Iris Ivy M Gauran; Junyong Park; DoHwan Park; Maricel G Kann
Journal: PLoS Comput Biol Date: 2017-04-20 Impact factor: 4.475
