Literature DB >> 31508512

The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer.

Guan-Da Huang¹, Xue-Mei Liu¹, Tian-Lai Huang¹, Li-C Xia².

Abstract

Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase of sequencing depth, hundreds of thousands of contigs are routinely assembled from metagenomics studies, which challenges alignment-based HGT analysis by overwhelming the known reference sequences. Detecting HGT by k-mer statistics thus becomes an attractive alternative. These alignment-free statistics have been demonstrated in high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics T s u m S and T s u m * , which subsample metagenome contigs by their representative regions, and summarize the regional D 2 S and D 2 * metrics by their upper bounds. We systematically studied the aggregative statistics' power at different k-mer size using simulations. Our analysis showed that, in general, the power of T s u m S and T s u m * increases with sequencing coverage, and reaches a maximum power >80% at k = 6, with 5% Type-I error and the coverage ratio >0.2x. The statistical power of T s u m S and T s u m * was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.

Entities: Disease Species

Keywords: Alignment-free sequence comparison; Horizontal gene transfer; Statistical power; k-mer

Year: 2019 PMID： 31508512 PMCID： PMC6723412 DOI： 10.1016/j.synbio.2019.08.001

Source DB: PubMed Journal: Synth Syst Biotechnol ISSN： 2405-805X

Introduction

Horizontal gene transfers (HGT) is the transversal exchange of genetic material between different organisms, cells, or organelles, as well as that between organisms and environment. It accelerates the speed of biological adaptation to the environment, promotes the evolution of organisms, and promotes the convergence of species [1]. The main modes of HGT in prokaryotic organisms are conjugation, transformation and transduction [[2], [3], [4]], while the mode of HGT for eukaryotic organisms are complex, with increased potential in the presence of viral transfection, host parasites or direct contacts of symbiotic organisms [5]. Researchers are highly interested in identifying and analyzing HGT in metagenomics sequence data, so as to understand how it reshapes and rebuilds the microbial community in response to a changing environment. The computational methods for predicting HGT genes from biological sequences include: (1) phylogenetic methods, which uses bootstrap resampling [6,7] and maximum likelihood framework [8,9] for identifying the significant differences between phylogenetic trees; (2) parametric methods, which extract the characteristics of genes from a genome as its tags [10]. These tags can include sequence structure-based features [11], statistical features [12,13], and information entropy-based features [14]; and (3) sequence distance-based methods, which detects HGT using sequence comparison. The distance-based methods were predominated by alignment-based metrics. These methods require computing the optimal pairwise or multiple alignment (locally or globally) using dynamic programming, such as algorithms implemented by ClustalW [15] and Muscle [16], or performing alignment-based database search, such as algorithms implemented by BWA [17], Blat [18], Blast [19] and Fasta [20]. However, in either cases, finding the optimal alignment demands a significant amount of computation both in time and memory space. Consequently, these alignment-based methods are limited by their throughput in comparatively analyzing a large number of contigs as seen in today's metagenomics studies. Alternatively, alignment-free statistics are potential useful distance metrics, which are more efficient computationally and do not require high sequence homology [21,22]. In principle, alignment-free methods decompose the DNA or protein sequences into short subsequences (or k-mers). They first compute k-mer frequency vectors based on k-mer occurrence and then calculate the sequence distances based on these vectors. These methods originated from the application of comparing computer programs [23] and were introduced to molecular sequence analysis by the early works of Hao et al. [24] and others [25] dating back to the early 2000s. Notably, Dr. Hao's [26] work in CV-Tree methods represents the first systematic whole-genome phylogenetic attempt using alignment-free k-mer statistics. Nowadays, k-mer based statistics have become a cornerstone in alignment-free whole-genome and gene-based sequence comparisons [[27], [28], [29], [30]]. It was proved useful in predicting protein structure and function [31], in predicting HGT [32,33], and in constructing phylogenetic trees [34]. Those applications typically involve comparing conserved sites or homology regions between two sequences and quantifying molecular evolutionary relationships at the sequence level. Those successful applications were built upon the good performance that k-mer based statistics and distance metrics have in clustering and classifying amino acid and nucleic acid sequences. In this paper, we specifically extended the aggregative k-mer statistics previously developed by Liu et al. [35] to the task of detecting HGT in metagenomics sequence data. We developed a sequence subsampling scheme that computes and summarizes the k-mer statistic metrics from the regional contigs using their upper bounds with aggregative statistics. The rationale is that the identified regions of the two contigs that maximize such k-mer metrics will also have the highest potential for being HGT. We parameterized the subsampling based on realistic assumptions of sequencing platform and HGT mechanisms. We tested the developed aggregative statistics in the cases of seven different k-mer metrics and with a k-mer size ranging from 4 to 8 by simulations. We found and , as parameterized by and metrics respectively, at the k-mer size 6 have the best power for detecting HGT events.

Materials and methods

k-mer based alignment-free statistics, first count the k-mer frequency and then correlate these k-mer frequency vectors (or their transformations), to evaluate the similarity, or homology, between two biological sequences. A k-mer is defined as a short subsequence of the DNA (nucleic acid) or protein (amino acid) sequence. The length of a k-mer is defined by the value of k (e.g., if k = 4, then the k-mer length is 4 base pair). In the case of DNA sequences, a k-mer is composed of k random letters from the alphabet . When a k is chosen, there are 4 possible k-mers and their occurrence in a sequence can be recorded in a 4 dimension k-mer frequency vector. All alignment-free k-mer statistics based themselves on the observed fact that these k-mer frequency vectors will show higher similarity if the two biological sequences being compared are more evolutionarily related.

Alignment-free k-mer statistics

Formally, we can define two molecular sequences as and and their shared sequence length n. For a given k, the occurrence of all possible k-mer words, as denoted by , is counted for the sequence and recorded in the k -mer frequency vector . Similarly, the k-mer words are counted and recorded for the sequence in . Note, both and are 4 dimensional vectors made up of the occurrence numbers of all possible 4-mers, which is denoted by . Therefore, the similarity, or relatedness, between the two sequences can be measured by correlating the two k-mer frequency vectors and . This similarity or relatedness metric can be transformed to obtain proper dissimilarity or distance metrics. In this setting, Torney et al. [36], derived the similarity metric statistic as: Kantorovitz et al. [37] extended to , which allows a generalized subtraction of background distribution from the statistic. The technique proved very effective for predicting the regulatory region of seemingly unrelated genes. E.g., Foret et al. [38,39] found that the statistic is superior to both Blast and exact k-mer matching in the evolutionary comparison of sequences. It can be seen from Eq. (1) that correlates the occurrence numbers of all k-mers directly, without adjusting for the total number of k-mers presented in individual sequences, which means that is subject to bias in sequence length as well as background sequencing noise incorporated in individual sequence. Reinert et al., [40] then identified that the performance of in large-scale biological sequences comparison could be improved by adjusting for the noise as introduced by individual sequences. They developed two generalized versions of , namely and [41,42] and proved that they are advantageous to in comparing sequences. These statistics were derived following the theorem by Shepp et al. [43], which states that if the independent variables U and V are normally distributed and their means are zero, then is also normally distributed. Based on that fact, Reinert and Liu standardized the statistic for raw k-mer frequency vectors and , as the follows:Here, is the probability of the word () occurs in the sequence and under an independent and identical distribution (i.i.d.) model.

Aggregative measures and

To extend and to whole-genome sequence comparison, in which high intra-sequence heterogeneity could alter local k-mer distributions, Liu (2011) et al. introduced an aggregative measure . subsamples the sequence distances using sliding windows and summarizes them using local upper bounding of and statistics. The subsampling strategy was configurable by the window and shift sizes. Using simulations, Liu et al. found that the statistical power of , which aggregates and statistics, was maximized when the whole-genome sequence was subsampled with non-overlapping sliding windows. Following the procedure of Liu et al. (see Fig. 1), we defined F1 to F as the subsampled fragments, and G the size of subsampling gap. We defined as the maxima of between the ith subsampled fragment of and all subsampled fragments of . The same was true for . Thus,

Fig. 1

The resampling scheme. (On Seq and Seq , F1 to FN are the subsampled fragments, and G is gap. is the maxima of between the ith subsampled fragment of and all subsampled fragments of . The same for is the maxima of between the ith subsampled fragment of and all subsampled fragments of .) Next, by summarizing over all ’s and ’s, we defined the aggregative k-mer statistic between and as: Similarly, we can derive , and -- the aggregative k-mer statistic for ’s and ’s when is used. To simply the simulation, we defined the parameter genome coverage rate R as:and the combined gap and fragment length, i.e. the entire genome segment length E as: It is thus convenient to vary E and R in our simulation with this new parameterization, where R represents the fraction of unknown parts of sequences and E represents the expected genome segment given full sequence information. We then evaluated the statistical power of and by simulating a large number of HGT and non-HGT sequence pairs with a range of parameters representing real NGS metagenomics data.

Simulation model and statistical power

We built two computer models to simulate the background and foreground biological sequences respectively. We simulated the background sequences as pairs of independently and identically distributed (i.i.d.) random letter sequences of the same length from the nucleic acid alphabet . These pairs of sequences were considered as unrelated and assigned to the control group. For the foreground simulation, we considered both uniform and non-uniform distributions of random letters. In the case of uniform distribution, the probability of observing in a sequence is all the same at 1/4. We refer to this simulated distribution as the equally independent and identical distribution (e.i.i.d) scheme. In the case of non-uniform distribution, we assign a probability of 1/3 to observing and , and a probability of 1/6 to observing and . We refer to this simulated non-equally independent and identical distribution (n.i.i.d) as the n.i.i.d (gc-rich) scheme because of its enrichment of and bases in the resultant sequences. Reinert and Sun et al. (2009) proposed two procedures to simulate related sequence pairs representing true biological relationships, one for mimicking the real sequence compositions as seen in cis-regulatory modules (CRM) and the other for mimicking that as seen in horizontal gene transfer. As we are studying the HGT in prokaryotes, we simulated the foreground sequence pairs by adapting their HGT procedure (see Fig. 2). Specifically, we randomly selected a set of motif sites in the first background sequence of a pair. We then transferred these motifs to the other background sequence by using them to replace the corresponding sites in the second sequence. These site positions were randomly selected by generating the Bernoulli random numbers (1s for being selected) with the probability that a position gets selected was 0.05. The length of a motif L was set to 5.

Fig. 2

Simulating the foreground sequence pairs using the Horizontal Gene Transfer (HGT) procedure with motif length L = 5.

Simulating the foreground sequence pairs using the Horizontal Gene Transfer (HGT) procedure with motif length L = 5. We computed the and statistics for all such generated background and foreground sequence pairs. To compute the statistical power, we regarded background sequence pairs as true negatives and foreground pairs as true positives. We set the required statistical significance level (Type-I error rate) to ɑ = 5%, which is the fraction of true negatives that were also declared positive by the statistic. The statistical significance threshold t was learned from a statistic's 1-ɑ = 95% percentile value (ranked in ascendance) from 10,000 simulated background sequence pairs. As we applied the threshold t to the collection of (either or ) statistics computed for the foreground pairs, we estimated the fraction of true positives which were not declared positive by the threshold, which is the Type-II error, i.e. 1-beta. Beta is then the statistical power with a range from 0 to 1 and we computed it as follows: The individual simulation element is analogous to the hypothesis testing for each pair by which the H0 states the two sequences is not related (null), while the H1 states otherwise. When the statistic computed is larger than t, the alternative hypothesis H1 is accepted, which is correct if the pair is foreground, or wrong if the pair is background. Conversely, if the statistic is less than t, the null hypothesis H0 is retained, which is correct if the pair is background, or wrong if it is foreground. Our overall simulation process was as follows: first, we set the common motif length L in the range of (4,5,6,7,8) and k-mer size k in the range of (4,5,6,7,8); Next, we calculated the power values of and for each k and L parameter combination with the coverage rate R ranging from 25% to 75%; Then, we iterated the power computation for the genome segment length ranging from 1000 to 10,000 with the increment of 1000; In each iteration, we simulated 10,000 pairs of background and foreground sequences to compute power; Finally, the obtained power values were plotted in curves where the -axis is the genome length and the -axis is the power. Finally, we identified the optimal k for different L values based on genome segment length and the obtained maximum statistical power.

Simulation and statistical software

We developed a software package for carrying out the power analysis, termed SeqPowerK, using the C# language. SeqPowerK can compute the statistical power for generic k-mer statistics with versatile parameter specifications as detailed in its user's manual (https://github.com/liuxuemeiscut/SeqPowerK).

Results and discussion

The effect of subsampling coverage rate R on the power of and

The subsampling coverage rate R is an important parameter because it represents the fraction of genomes available for alignment-free analysis. Under different combinations of k and L, also considering both the e.i.i.d and n.i.i.d (gc-rich) background schemes, we explored the statistical power of and with a varying subsampling coverage rate R from 25%, 50%, to 75% (see Fig. 3). For both and , when the coverage rate R is high, the statistical power is high. For example, at R = 25%, the statistical power is only 65%, which is significantly lower than that of R = 50% and 75%, when the powers are >85%. The statistical power at R = 50% and 75% are reasonably close, suggesting that the statistic efficiency is saturated by a reasonable high coverage rate at ∼50%.

Fig. 3

Statistical power of and when the coverage rate R = 25%, 50%, 75%. (Two statistics and are used in this figure. Full length is the length of whole sequence, R is subsampling coverage rate, k is k-mer's length and L is the length of motif. The probability of four bases is P(A) = P(C)=P(G) = P(T) = 1/4 in e.i.i.d model, and P(A) = P(T) = 1/6, P(G) = P(C) = 1/3 in n.i.i.d (gc-rich) model.).

The optimal k-mer size for achieving the best power of and

The k-mer size choice is typically the first and foremost important parameter influencing alignment-free sequence comparison. Power simulations are useful ways to find the optimal k value given realistic parameter settings. We simulated HGT sequence data with motif length L in (4,5,6,7,8) and fixed E = 800. We compared and identified the optimal k when the statistical power is the highest given these parameters (see Fig. 4). We first observed that if the k is too small, the statistics power is low. When k = 4, the highest statistical power is <70%. When the k is larger than 5, the statistical power is generally better than 80%. And the optimal k depends on coverage rate and the underlying sequence relatedness as represented by L.

Fig. 4

Statistical power of and when k = 4, 5, 6, 7, 8. (Two statistics and are used in this figure. E is the entire genome segment length, R is subsampling coverage rate, k is k-mer's length and L is the length of motif. The probability of four bases is P(A) = P(C)=P(G) = P(T) = 1/4 in e.i.i.d model, and P(A) = P(T) = 1/6, P(G) = P(C) = 1/3 in n.i.i.d (gc-rich) model.) We also observed that the statistical power of and differed slightly on k values. For both e.i.i.d and n.i.i.d (gc-rich) background models, no matter of the values of L, the optimal k for the highest power of is either 5 or 6 (Fig. 4a–d), while the optimal k of is 6 or 7 (Fig. 4e–h). Based on these facts, k = 6 is likely the best choice for general alignment-free sequence analysis, for example, the statistical power at k = 6 is consistently >80% when the coverage rate is >20%, which is at least comparable to and often times far better than what were achievable by other k values for either or . We also compared the classification performance between and . Overall, the power of was higher than that of . As we known, was based on and was based on , which could mean that is better than in this type of application. This is because the subsampling procedure divides the sequence into a series of genome segments. Previous research based on statistics showed that, in general, and have higher power than and that is more suitable for longer sequence comparisons, while is more suitable for shorter sequences. However, in practice, we often do not know the genome segment length of the related subsequences. In this respect, the robustness of a statistic is required, and is generally better than .

The effect of genome segment length E on the power of and

We also studied the statistical power with different segment length E under k = 6 and L = 6. We varied genome segment size E form 400, 600, 800 to 1000. Fig. 5 showed that the statistical power of and positively correlates with E. However, compared with the coverage rate R, the genome segment length E has only a modest effect on the power. The maximal difference in power is < 0.1. Segment length E represents the degree of fragmentations in the subsampling process. Segment length E determines the amount of sequences per fragment that is available for estimating k-mer frequency. Thus the smaller E the lessser accurate such estimates are, which in turn reduces statistical power.

Fig. 5

Statistical power of and when E = 400, 600, 800, 1000. (Two statistics and are used in this figure. E is the entire genome segment length, R is subsampling coverage rate, k is k-mer's length and L is the length of motif. The probability of four bases is P(A) = P(C)=P(G) = P(T) = 1/4 in e.i.i.d model, and P(A) = P(T) = 1/6, P(G) = P(C) = 1/3 in n.i.i.d (gc-rich) model.)

Conclusions

HGT describes how organisms transfer genetic material to other cells rather than offspring. HGT plays a key role in the evolution of species and microbial genome diversity. With the increasing number of NGS data, the prediction of HGT is of great practical significance for better understanding the impact of environment on community structure. So far, the many methods relying upon sequence alignment to identify HGT were burdened by computation challenges. k-mer based alignment-free sequence comparison were effective in comparing multiple sequences and identifying HGTs without incurring significant computational cost. In this paper, we proposed two new aggregative k-mer statistics and by subsampling and finding the upper bounds of underlying and statistics to identify HGT. We conducted an extensive simulation benchmark to evaluate the statistical power of various k-mer distances using these aggregative metrics. Our power analysis showed that, in general, the statistical power of k-mer statistics increases with sequencing coverage. We found that the optimal k-mer size for sequence comparison is 6 for both and , which ensures a statistical power >80% given the assumed Type-I error rate of 5% and the genome coverage rate >0.2x. To summarize, we propose and and metrics and k-mer size 6 as the best aggregative statistics for detecting HGT events, a conclusion may help guide further research in this direction.

37 in total

Review 1. Phylogenetic classification and the universal tree.

Authors: W F Doolittle
Journal: Science Date: 1999-06-25 Impact factor: 47.728

2. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

3. Whole proteome prokaryote phylogeny without sequence alignment: a K-string composition approach.

Authors: Ji Qi; Bin Wang; Bai-Iin Hao
Journal: J Mol Evol Date: 2004-01 Impact factor: 2.395

4. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.

Authors: Stéphane Guindon; Olivier Gascuel
Journal: Syst Biol Date: 2003-10 Impact factor: 15.683

Review 5. Shaping bacterial genomes with integrative and conjugative elements.

Authors: Vincent Burrus; Matthew K Waldor
Journal: Res Microbiol Date: 2004-06 Impact factor: 3.992

6. MUSCLE: multiple sequence alignment with high accuracy and high throughput.

Authors: Robert C Edgar
Journal: Nucleic Acids Res Date: 2004-03-19 Impact factor: 16.971

Review 7. Lateral gene transfer in eukaryotes.

Authors: J O Andersson
Journal: Cell Mol Life Sci Date: 2005-06 Impact factor: 9.261

1. Exploring short k-mer profiles in cells and mobile elements from Archaea highlights the major influence of both the ecological niche and evolutionary history.

Authors: Ariane Bize; Cédric Midoux; Mahendra Mariadassou; Sophie Schbath; Patrick Forterre; Violette Da Cunha
Journal: BMC Genomics Date: 2021-03-16 Impact factor: 3.969

1 in total