
A k-mer scheme to predict piRNAs and characterize locust piRNAs.

Yi Zhang, Xianhui Wang, Le Kang.

Abstract

MOTIVATION: Identifying piwi-interacting RNAs (piRNAs) in non-model organisms is a difficult and unsolved problem because piRNAs lack conserved secondary structure motifs and show little sequence homology between species.
RESULTS: In this article, a k-mer scheme is proposed to identify piRNA sequences, relying on training sets of non-piRNA and piRNA sequences from five sequenced model species: rat, mouse, human, fruit fly and nematode. Compared with the existing 'static' scheme based on position-specific base usage, our novel 'dynamic' algorithm performs much better, with a precision of over 90% and a sensitivity of over 60%; the precision is verified by 5-fold cross-validation in these species. To test its validity, we use the algorithm to identify piRNAs of the migratory locust from 603 607 deep-sequenced small RNA sequences. In total, 87 536 locust piRNAs are predicted, and 4426 of them match existing locust transposons. The transcriptional difference between solitary and gregarious locusts is described. We also revisit the position-specific base usage of piRNAs and find conservation at the end of piRNAs. Therefore, the method we developed can be used to identify piRNAs of non-model organisms without complete genome sequences.
AVAILABILITY: The web server implementing the algorithm and the software code are freely available to the academic community at http://59.79.168.90/piRNA/index.php.

Year:  2011        PMID: 21224287      PMCID: PMC3051322          DOI: 10.1093/bioinformatics/btr016

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 INTRODUCTION

Non-coding RNAs (ncRNAs) are functional RNA molecules that are not translated into proteins. They include highly abundant and functionally important RNAs such as transfer RNA (tRNA) and ribosomal RNA (rRNA), as well as snoRNAs, microRNAs, siRNAs, piRNAs and long ncRNAs. Among them, those typically 20–30 nt in length are called small RNAs. Piwi-interacting RNAs (piRNAs) are the largest class of small RNA molecules expressed in animal cells, especially in germ cells, and are generally 25–32 nt long (Aravin et al.; Girard et al.; Grivna et al.). piRNAs form RNA–protein complexes through interactions with PIWI proteins, have no clear secondary structure motifs (Kandhavelu et al.) and are slightly longer than miRNAs. Compared with miRNAs, piRNAs lack primary sequence conservation, although a 5′ uridine is common in both vertebrates and invertebrates. piRNAs in the nematode have a 5′ monophosphate and a 3′ modification that blocks either the 2′ or 3′ oxygen (Ruby et al.), and piRNAs have been confirmed to exist in fruit fly (Yin and Lin, 2007; Vagin et al.), zebrafish (Houwing et al.), mice (Kirino and Mourelatos, 2007; Watanabe et al.) and rats (Houwing et al.). The PIWI/ARGONAUTE (also known as PAZ-PIWI domain or PPD) protein family is evolutionarily conserved owing to its functional significance in stem cell self-renewal and germline development (Vagin et al.). piRNAs derive from the post-transcriptional amplification ‘Ping-Pong Model’, and may be involved in germ cell formation, germline stem cell maintenance, spermiogenesis and oogenesis (Brennecke et al.; Cox et al.; Thomson and Lin, 2009). At present, available piRNA data mainly come from model species with complete genome sequences. A general approach to detecting piRNAs combines immunoprecipitation and deep sequencing in sequenced model organisms (Yin and Lin, 2007). However, lowly expressed or tissue-specific piRNAs might be missed by this method.
In addition, some piRNAs are not produced by the ‘Ping-Pong Model’ (Das et al.; Robine et al.). Thus, computational methods may provide an alternative approach to detecting piRNAs: they can summarize general properties of known piRNAs and be trained on them to predict novel piRNAs. Betel et al. first used the position-specific usage of the 10 bases upstream and 10 bases downstream of the 5′ U to construct a vector with 21 × 4 components, with which they characterized and identified mouse piRNAs with a precision of 61–72%. They also found that mouse piRNAs have some position-specific properties, such as G or A at the +1 position, A at the +4 position and a slight underrepresentation of G at the −1 position. However, their method has limitations in predicting piRNAs from organisms without genome information (Lakshmi and Agrawal, 2008). Moreover, this method cannot efficiently detect piRNAs derived from the 3′ UTRs of mRNAs, which are not produced by the ‘Ping-Pong Model’ (Das et al.; Robine et al.). Furthermore, piRNA sequences are quite divergent among species (Lakshmi and Agrawal, 2008; Seto et al.). Most general methods, such as BLAST and MEME, are inappropriate for piRNA prediction. For example, we could not find any homologous piRNAs between fruit fly and other species with BLAST, and no conserved motifs were found in piRNAs with MEME. Therefore, more efficient computational methods are urgently needed. The k-mer scheme is widely used to characterize biosequences (Burge et al.; Gutierrez, 1993; Karlin and Ladunga, 1994), because k-mer usage patterns have been found to be species or taxon specific (Karlin et al.; Karlin and Mrazek, 1997; Madera et al.). It is important to note that Solexa small RNA data may include miRNAs and piRNAs as well as short fragments of snoRNAs, snRNAs, long ncRNAs and unannotated mRNAs, which have lengths similar to piRNAs.
In fact, 9257/207 765 = 4.46% of the ncRNAs in NONCODE version 2.0 (Liu et al.) are shorter than 25 nt, suggesting that most ncRNAs can produce sequence fragments similar in length to piRNAs. Moreover, background noise may also exist, including noise introduced by Solexa sequencing, degraded RNA fragments from sample preparation and noise caused by random matches of the data to the genome. Wei et al. demonstrated that about 30–40% of the short sequences among the 603 607 candidate locust small RNAs are unannotated. In other words, there are about 200 000 unannotated small RNAs, among which there would be a large number of piRNAs and other short sequences. These remaining sequences could include real piRNAs, fragments of long ncRNAs, unannotated mRNAs and noise produced by Solexa. Therefore, piRNAs in Solexa small RNA data cannot be identified merely by their lengths. To predict piRNAs, an efficient algorithm is required to distinguish real piRNAs from other sequences in a similar length range. Here, we used all 1364 strings of 1–5 nt and an improved Fisher discriminant algorithm to characterize piRNA sequences in five model species: rat, mouse, human, fruit fly and nematode. The novel algorithm reached a prediction precision of over 90% and a sensitivity of over 60% in these species. We applied this algorithm to the deep-sequenced small RNA data (Wei et al.) of the migratory locust (Locusta migratoria), which is an important agricultural pest and a model species for physiology, neuroscience and behavior. This method identified 87 536 locust piRNAs with the same precision and sensitivity as in the five model species. Therefore, the method proposed in this study can be used to predict piRNAs for both model and non-model organisms.

2 METHODS

2.1 k-mer string

In bioinformatics, k-mers usually refer to specific k-tuples or k-grams of nucleic acid or amino acid sequences that can be used to identify certain regions within biomolecules such as DNA or proteins. Either the k-mer strings themselves are used to find regions of interest, or k-mer statistics, which give discrete probability distributions over the possible k-mer combinations, are used. To characterize piRNA sequences, we use all strings of 1–5 nt: 4 1-mers (A, G, C and T), 16 2-mers, 64 3-mers, 256 4-mers and 1024 5-mers, for a total of 1364 strings. A biosequence can then be characterized by a vector consisting of the frequencies of the 1364 k-mer (k = 1, 2, 3, 4, 5) strings. Because string usage differs significantly between piRNA and non-piRNA sequences, these 1364 D vectors provide a novel approach to distinguishing piRNAs from non-piRNAs.
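The construction of the 1364 D vector can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the names `KMERS`, `INDEX` and `kmer_vector` are hypothetical, and the normalization (each count divided by the number of windows of that length) is an assumption, since the text only says "frequencies".

```python
from itertools import product

ALPHABET = "ACGT"
# All 4 + 16 + 64 + 256 + 1024 = 1364 strings of length 1..5, in a fixed order.
KMERS = ["".join(p) for k in range(1, 6) for p in product(ALPHABET, repeat=k)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def kmer_vector(seq):
    """Return the 1364-D vector of k-mer frequencies (k = 1..5) for one sequence.

    Assumption: for each k, counts are divided by the number of length-k
    windows in the sequence, so the entries for each k sum to 1.
    """
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    vec = [0.0] * len(KMERS)
    for k in range(1, 6):
        n_windows = len(seq) - k + 1
        if n_windows <= 0:
            continue
        for i in range(n_windows):
            j = INDEX.get(seq[i:i + k])
            if j is not None:  # skip windows containing N or other symbols
                vec[j] += 1.0 / n_windows
    return vec
```

For example, `kmer_vector("TGCTG")` assigns frequency 0.4 to each of T and G, 0.2 to C, and 1.0 to the single 5-mer TGCTG.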

2.2 Construction of training set

To construct a training set for detecting piRNAs computationally with the Fisher discriminant algorithm (Fisher, 1936), we use two groups of samples: a positive group consisting of true piRNA sequences from five model species and a negative group of non-piRNA sequences. The positive dataset consists of known piRNA sequences of five species: rat, mouse, human, fruit fly and nematode. piRNAs from the first three species were downloaded from NONCODE version 2.0 (Liu et al.), and piRNAs from the last two species were obtained from NCBI (nematode: gi222138841 ∼ 222138290; fruit fly: gi157362817 ∼ 157361675). In total, we obtained 173 090 positive samples, including 32 046 human piRNAs, 72 747 mouse piRNAs, 66 758 rat piRNAs, 552 Caenorhabditis elegans piRNAs and 987 Drosophila piRNAs. The negative samples are derived from NONCODE version 2.0 (Liu et al.). NONCODE is a database of a wide variety of ncRNA classes (small and long ncRNAs) from 861 organisms covering all kingdoms of life (eukaryotes, eubacteria, archaea and viruses). Its data come from three sources: (i) manual extracts from the literature, (ii) automatically filtered and manually confirmed GenBank sequences and (iii) experimental data from Chen's laboratory (Giulia et al.). In detail, it includes ‘miRNA’, ‘piRNA’, ‘mlRNA’, ‘snoRNA’, ‘snRNA’, ‘tmRNA’, ‘SRP RNA’, ‘gRNA’, ‘sbRNA’, ‘snlRNA’, etc. Thus, it is a suitable source for an ncRNA study. It contains 34 675 non-piRNA ncRNA sequences, most of which are much longer than the positive sequences. First, the 34 675 non-piRNA ncRNAs were selected as negative samples. Then, to make the number of negative samples close to that of the positive samples, we generated 158 646 random sequences as additional negative samples from the 34 675 non-piRNA ncRNA sequences by the following method.
Each of the 34 675 non-piRNA sequences was shuffled 10 000 times to destroy any potential functional structures; we then randomly selected start points and generated no more than five subsequences with a length of 18–32 nt. Since 9257/207 765 = 4.46% of the ncRNAs in NONCODE version 2.0 (Liu et al.) are shorter than 25 nt, we randomly produced 8678/193 321 = 4.49% sequences shorter than 25 nt among the 193 321 negative samples (34 675 + 158 646), so that the length distribution of the negative samples is similar to that of a real database. In detail, the random process generating the 158 646 negative samples covers three steps. First, we divided each sequence into non-overlapping 40 nt blocks and chose no more than five blocks as random candidate blocks. Second, the length was confined to 18–32 nt, which has little effect on the result because we only use string frequencies. Finally, a negative sequence can start at any possible position in a selected block.
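The three steps above can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' implementation: the function name `negative_fragments` and its parameters are hypothetical, a single uniform shuffle stands in for the 10 000 repeated shuffles, and fragment lengths are drawn uniformly from 18–32 nt instead of being matched to the 4.49% short-sequence proportion.

```python
import random


def negative_fragments(seq, rng, max_blocks=5, block_len=40, min_len=18, max_len=32):
    """Generate up to max_blocks random negative fragments from one ncRNA.

    Assumptions: one uniform shuffle replaces the repeated shuffling in the
    text, and fragment lengths are uniform on [min_len, max_len].
    """
    shuffled = list(seq)
    rng.shuffle(shuffled)  # destroy any functional structure
    shuffled = "".join(shuffled)

    # Step 1: non-overlapping 40-nt blocks, at most five chosen at random.
    blocks = [shuffled[i:i + block_len]
              for i in range(0, len(shuffled) - block_len + 1, block_len)]
    chosen = rng.sample(blocks, min(max_blocks, len(blocks)))

    fragments = []
    for block in chosen:
        # Step 2: confine the fragment length to 18-32 nt.
        length = rng.randint(min_len, max_len)
        # Step 3: the fragment may start at any position that fits in the block.
        start = rng.randint(0, block_len - length)
        fragments.append(block[start:start + length])
    return fragments
```

Passing an explicit `random.Random(seed)` instance as `rng` keeps the sampling reproducible; sequences shorter than one block yield no fragments.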

2.3 Improved Fisher Algorithm in a 1364 D space

The Fisher discriminant algorithm uses a training set formed by these two groups of samples to obtain a discriminant vector $w$ and a threshold $y_0$. The Fisher linear discriminant equation in this case represents a hyperplane in the 1364 D space, described by a vector $w$, which is extremely simple in the two-class case. Let Group 1 (denoted by $G_1$) correspond to piRNA samples and Group 2 (denoted by $G_2$) to non-piRNA samples, and let $x_k^{(g)} = (x_{k,1}^{(g)}, x_{k,2}^{(g)}, \ldots, x_{k,1364}^{(g)})$ be the 1364 D vector defined above of the k-th sample in group g (g = 1, 2), where $k = 1, 2, \ldots, n_g$ ($n_1$, $n_2$ are the numbers of samples in $G_1$ and $G_2$, respectively). We calculate the average vector $m_g$ for each group: $m_g = \frac{1}{n_g}\sum_{k=1}^{n_g} x_k^{(g)}$. Denoting by $S$ the sum of the covariance matrices of the two groups, we have $S = \sum_{g=1}^{2}\sum_{k=1}^{n_g}(x_k^{(g)} - m_g)(x_k^{(g)} - m_g)^{\tau}$. The Fisher vector $w$ is simply determined by the following equation: $w = S^{-1}(m_1 - m_2)$, where $S^{-1}$ is the inverse of the matrix $S$. Thus, for any 1364 D vector $x_k^{(g)}$, $k = 1, 2, \ldots, n_g$, its projective point is $y_k^{(g)} = w^{\tau} x_k^{(g)}$. Notice that $w$ is not unique in the sense that $w$ multiplied by a constant is still acceptable. Without loss of generality, we choose $w$ satisfying $\|w\| = 1$. Based on the data in the training set, an appropriate threshold $y_0$ is determined to make the piRNA/non-piRNA decision. The threshold $y_0$ is determined by a formula that is missing from this copy of the text. Once the Fisher vector $w$ and the threshold $y_0$ are obtained, the decision of piRNA/non-piRNA in the test set is simply performed by the criterion $f(x) > 0$ / $f(x) < 0$, where $f(x) = w^{\tau} x - y_0$. To improve the Fisher algorithm, we introduce the ‘cutoff’, which in theoretical physics means the maximal (or minimal) value of energy, momentum or length, so that objects with larger (or smaller) values of these physical quantities are ignored. A popular method for increasing precision is to set higher cutoff values (Candolfi et al.; Huang et al.). Let $\bar{N}$ be the mean value of the projective points [i.e. $w^{\tau} x$] of non-piRNAs.
Here, we set $N_{std}$ to be the SD of the projective points of non-piRNAs. With these two variables, we may improve the Fisher discriminant algorithm. In detail, we change the discriminant formula into $f(x) = w^{\tau} x - (\bar{N} + t\,N_{std})$, and once the Fisher vector $w$ is obtained, the decision of piRNA/non-piRNA is performed by the criterion $f(x) > 0$ and $f(x) < 0$, respectively. The cutoff value is therefore $\bar{N} + t\,N_{std}$.
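A small pure-Python sketch of the Fisher vector and the cutoff rule may make the procedure concrete. It is an illustration under stated assumptions, not the published implementation: all function names are hypothetical, the cutoff is taken as the mean of the non-piRNA projections plus t times their SD (population SD is assumed), and a small `ridge` term is added to the diagonal of S so that the linear system stays solvable when features are linearly dependent, which the text does not discuss. For the full 1364 D problem a numerical library would be the practical choice.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def scatter_matrix(rows, mean):
    """Sum over samples of (x - m)(x - m)^T."""
    d = len(mean)
    s = [[0.0] * d for _ in range(d)]
    for row in rows:
        diff = [x - m for x, m in zip(row, mean)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def project(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_fisher(pos, neg, t=2.0, ridge=1e-9):
    """Return (w, cutoff): unit Fisher vector and mean + t * SD of negatives."""
    m1, m2 = mean_vector(pos), mean_vector(neg)
    s1, s2 = scatter_matrix(pos, m1), scatter_matrix(neg, m2)
    d = len(m1)
    # ridge is an assumption added for numerical stability only
    s = [[s1[i][j] + s2[i][j] + (ridge if i == j else 0.0) for j in range(d)]
         for i in range(d)]
    w = solve(s, [a - b for a, b in zip(m1, m2)])  # w = S^-1 (m1 - m2)
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]                      # enforce ||w|| = 1
    proj = [project(w, x) for x in neg]
    n_mean = sum(proj) / len(proj)
    n_std = (sum((p - n_mean) ** 2 for p in proj) / len(proj)) ** 0.5
    return w, n_mean + t * n_std


def is_pirna(x, w, cutoff):
    """Predict piRNA when the projection exceeds the cutoff."""
    return project(w, x) > cutoff
```

Raising `t` moves the cutoff further from the bulk of the negative projections, which trades sensitivity for precision.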

3 RESULTS AND DISCUSSION

3.1 Different string usage of piRNA and non-piRNA

piRNA and non-piRNA sequences differ significantly in string usage. First, for each sequence (piRNA or non-piRNA), we calculate the frequencies of all 1364 k-mer (k = 1, 2, 3, 4, 5) strings and construct a 1364 D vector to characterize the sequence. Then, we use the rank sum test to determine which strings are used significantly differently between piRNAs and non-piRNAs. With a significance level of $10^{-300}$, we found that the usage of 1337 strings (Supplementary Material S1) is significantly different between piRNAs and non-piRNAs. Therefore, the k-mer string scheme can detect the difference between piRNAs and non-piRNAs, and the difference can be visualized by comparing the frequencies of each string in piRNAs and non-piRNAs (Fig. 1). To identify the strings whose usage differs most between piRNAs and non-piRNAs, we define the string frequency relative difference as the ratio of the absolute value of the string frequency difference to the sum of the string frequencies in piRNAs and non-piRNAs. For example, for the string ‘TGCTG’, the string frequency relative difference is $|f_{piRNA}(TGCTG) - f_{non\text{-}piRNA}(TGCTG)| / (f_{piRNA}(TGCTG) + f_{non\text{-}piRNA}(TGCTG))$, where $f_{piRNA}(TGCTG)$ is the frequency with which the string TGCTG appears in piRNAs and $f_{non\text{-}piRNA}(TGCTG)$ is the corresponding frequency in non-piRNAs. There are 32 strings with a string frequency relative difference larger than 0.7 (Supplementary Material S2), of which only ‘TGCTG’ has a higher frequency in piRNAs than in non-piRNAs, perhaps because ‘TGCTG’ is the first five bases of many piRNAs. The remaining 31 strings are all less frequent in piRNAs than in non-piRNAs, but their biological significance requires further exploration.
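The relative difference defined above is straightforward to compute. The sketch below is illustrative only: the function names are hypothetical, and the frequencies in the usage example are made-up values, not figures from the study.

```python
def relative_difference(f_pirna, f_non_pirna):
    """|f_piRNA - f_non| / (f_piRNA + f_non); 0.0 if the string never occurs."""
    total = f_pirna + f_non_pirna
    if total == 0:
        return 0.0
    return abs(f_pirna - f_non_pirna) / total


def strongly_biased(freq_pirna, freq_non_pirna, threshold=0.7):
    """Return, sorted, the strings whose relative difference exceeds threshold."""
    return sorted(
        s for s in freq_pirna
        if relative_difference(freq_pirna[s], freq_non_pirna.get(s, 0.0)) > threshold
    )


# Hypothetical average frequencies, for illustration only.
example_pirna = {"TGCTG": 0.009, "AAAAA": 0.001, "ACGTA": 0.004}
example_non = {"TGCTG": 0.001, "AAAAA": 0.008, "ACGTA": 0.005}
biased = strongly_biased(example_pirna, example_non)
```

The measure lies between 0 (identical usage) and 1 (string present in only one group), so the 0.7 threshold selects strings that are several times more frequent in one group than in the other.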
Fig. 1.

Average frequencies of 1337 strings in piRNAs and non-piRNAs. The 1337 strings are used differently between piRNAs and non-piRNAs, and the difference is visualized by comparing the average frequencies of the 1337 strings in two groups of samples. Here, the red and black lines represent piRNAs and non-piRNAs, respectively.


3.2 Position-specific base usage of piRNA

The sizes of all known piRNAs range widely from 18 nt to 32 nt, but are concentrated at 28, 29, 30 and 31 nt, which together cover 72.32% of known piRNAs (Supplementary Material Fig. S1). By comparing piRNAs with non-piRNAs, we revisited the position-specific properties in detail. We calculated the frequencies of the four bases at each position. In addition to the known G or A at the +1 position, A at the +4 position and slight underrepresentation of G at the −1 position, we identified conserved position-specific bases at the beginning and the end of piRNAs (Fig. 2A), especially at positions 30, 31 and 32. Furthermore, we detected position-specific base usage with the rank sum test. In this analysis, we only considered the first 21 base positions of piRNAs, to cover all possible piRNA sequences. Setting the significance level to $10^{-100}$, we found that 21 position-specific base usages are significantly different between piRNAs and non-piRNAs. They are a1, g1, c1, t1, g2, t2, g3, c3, t3, a4, t4, a5, g5, t5, a6, c6, t6, t7, a10, c10 and c12. The difference can be visualized by comparing the frequencies of these position-specific bases in piRNAs and non-piRNAs (Fig. 2B).
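Position-specific base frequencies of the kind described above can be tabulated as in the following sketch. The function name and the default of 21 positions mirror the text but are otherwise hypothetical; each position's frequencies are normalized over the sequences long enough to reach it, which is an assumption.

```python
def position_base_frequencies(seqs, n_positions=21):
    """Return a list of {base: frequency} dicts, one per position (0-based).

    Assumption: a position's frequencies are normalized by the number of
    sequences that are long enough to cover that position.
    """
    counts = [{b: 0 for b in "ACGT"} for _ in range(n_positions)]
    totals = [0] * n_positions
    for seq in seqs:
        seq = seq.upper().replace("U", "T")
        for pos, base in enumerate(seq[:n_positions]):
            if base in counts[pos]:  # ignore ambiguous symbols such as N
                counts[pos][base] += 1
                totals[pos] += 1
    return [
        {b: (c / totals[p] if totals[p] else 0.0) for b, c in counts[p].items()}
        for p in range(n_positions)
    ]
```

Flattening the resulting 21 × 4 table gives an 84 D vector of the kind used by the position-specific ('static') scheme.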
Fig. 2.

(A) The frequencies of 32 × 4 position-specific bases in piRNAs. Conserved base usage is found in the first 10 and the last three positions of piRNAs. (B) With a significance level of $10^{-100}$, the usage of 21 position-specific bases is different between piRNAs and non-piRNAs. The first 10 positions, except for the 8th and 9th positions, all have conserved base usage.


3.3 Cross-validation tests

Prediction precision and sensitivity are widely used to evaluate the performance of an algorithm. Sensitivity is the ratio of the number of true positive samples to the number of actual positive samples, and precision is the ratio of the number of true positive samples to the number of predicted positive samples. Their definitions are listed in Table 1. We performed five cross-validation tests across the five species: rat, mouse, human, fruit fly and nematode. To evaluate the precision and sensitivity of the algorithm in predicting piRNAs for a new species, we used the piRNAs of four species as the training set and the piRNAs of the remaining species as the test set. Each time, we used 50 000 pairs of piRNA and non-piRNA sequences derived from four species as the training set to predict the piRNAs of the fifth species. The improved Fisher formula, $f(x) = w^{\tau} x - (\bar{N} + t\,N_{std})$, where $\bar{N}$ and $N_{std}$ are the mean and SD of the projective points of non-piRNAs, can raise the prediction precision. As t increases from 0 to 3.4, the precisions of the five cross-validation tests increase significantly. When t = 2, the precisions for all species are above 90% (Fig. 3A). In this study, we used $\bar{N} + 2N_{std}$ as the piRNA cutoff value, to ensure a piRNA prediction precision over 90%. To compare the current algorithm with that proposed by Betel et al., 36 373 mouse piRNAs were taken as the positive training set to predict the remaining 36 374 piRNAs: when t = 0, the precision is 68.41% and the sensitivity is 99.31%; when t = 2, the precision is 95.53% and the sensitivity is 72.47%. By contrast, the prediction precision of Betel et al. is only 61%, suggesting that our algorithm may still be useful for species with full genome information.
Table 1.

Definitions of precision and sensitivity of prediction

                    Predicted positives    Predicted negatives
Actual positives    TP                     FN
Actual negatives    FP                     TN

sn = TP/(TP + FN)
sp = TP/(TP + FP)

TP, true positives; FP, false positives; TN, true negatives; FN, false negatives; sn, sensitivity; sp, precision.
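The definitions in Table 1 translate directly into code. The sketch below is illustrative: `evaluate` is a hypothetical helper that takes projected scores for known positives and negatives together with a cutoff, and returns sensitivity and precision in that order.

```python
def evaluate(pos_scores, neg_scores, cutoff):
    """Return (sensitivity, precision) for the rule 'score > cutoff => piRNA'."""
    tp = sum(s > cutoff for s in pos_scores)   # piRNAs predicted as piRNAs
    fn = len(pos_scores) - tp                  # piRNAs missed
    fp = sum(s > cutoff for s in neg_scores)   # non-piRNAs predicted as piRNAs
    sn = tp / (tp + fn)
    sp = tp / (tp + fp) if tp + fp else float("nan")
    return sn, sp
```

Evaluating the same scores at a series of cutoffs (for instance the mean of the negative scores plus t SDs, for increasing t) reproduces the kind of precision/sensitivity trade-off curve discussed in the text.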

Fig. 3.

The relationship of t to the precision and sensitivity of the 5-fold cross-validation tests. Here, ‘sn’ and ‘sp’ denote the prediction sensitivity and precision, respectively. Nematode: C.elegans; fruit fly: D.melanogaster; rat: R.norvegicus; human: H.sapiens; mouse: M.musculus. (A) The dynamic algorithm based on string usage has an increasing precision and a decreasing sensitivity as t increases. When t = 2, all precisions are above 90% and most sensitivities (except for fruit fly) are 60–70%. (B) The static algorithm based on position-specific base usage has an increasing precision and a decreasing sensitivity as t increases. When t = 2.5, most precisions reach 90%, but sensitivities are only about 10%.


3.4 The method validity tests and locust piRNA prediction

Wei et al. reported the small RNA transcriptome of the migratory locust (Locusta migratoria) from gregarious and solitary phase libraries, containing 603 607 sequences and a subset of small RNAs in a peak at 25–29 nt. These data provide a valuable resource for testing the validity of the new method and for identifying piRNAs of the migratory locust. With the improved Fisher algorithm, using 120 000 piRNAs derived from the five model species mentioned above and 120 000 non-piRNAs as the training set, we identified 87 536 locust piRNAs longer than 24 nt (Supplementary Material), including 12 386 gregarious-specific piRNAs, 69 151 solitary-specific piRNAs and 5999 piRNAs present in both phases. The analysis of prediction sensitivity showed that the sensitivity decreases as t increases (Fig. 3A). In particular, when t = 2, the sensitivities for most species (except for fruit fly) are 60–70%, indicating that the 87 536 predicted piRNAs are only a fraction of all locust piRNAs. We therefore estimated that the total number of locust piRNAs is about 130 000, which is fewer than the number of Drosophila piRNAs (Lakshmi and Agrawal, 2008). After analyzing the usage of the 1337 strings in the 87 536 predicted locust piRNAs, we found that string usage in locust piRNAs is consistent with that of the five model organisms (Fig. 4). The principle of the improved Fisher discriminant algorithm is to detect the different orientations of string usage between positive and negative samples; sequences with amplified positive orientations are predicted as piRNAs. The ‘dynamic’ method we propose, based on k-mer frequencies, differs from the ‘static’ method based on position-specific base frequencies (Betel et al.). When the two methods are compared through the relation of precision/sensitivity to t, the ‘dynamic’ method outperforms the ‘static’ method, which constructs an 84 D vector (21 positions × 4 bases, to include all piRNA sequences), in identifying piRNAs (Fig. 3B).
The ‘dynamic’ method can identify 1337 strings that differ between piRNAs and non-piRNAs at a significance level of $10^{-300}$, while the ‘static’ method can only identify 21 position-specific base usages at a significance level of $10^{-100}$.
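The text does not spell out how the estimate of about 130 000 locust piRNAs is obtained. One plausible reading, assumed here, is that the number of predictions is divided by the sensitivity, ignoring false positives among the predictions; the function name is hypothetical.

```python
def estimated_total(n_predicted, sensitivity):
    """Estimate the total number of true piRNAs as predictions / sensitivity.

    Assumption: false positives among the predictions are ignored.
    """
    if not 0 < sensitivity <= 1:
        raise ValueError("sensitivity must be in (0, 1]")
    return n_predicted / sensitivity


# Range implied by sensitivities of 60-70% at t = 2.
low = estimated_total(87536, 0.70)
high = estimated_total(87536, 0.60)
```

Under this reading the 60–70% sensitivity range gives roughly 125 000 to 146 000 piRNAs, which brackets the quoted figure of about 130 000.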
Fig. 4.

Average frequencies of 1337 strings in piRNAs of the five model organisms and in the 87 536 locust piRNAs. The two groups have similar string usage, as the two curves are very close. In contrast to the significant difference between piRNAs and non-piRNAs shown in Figure 1, this similarity reflects how the algorithm works: it first detects the different orientations of string usage between positive and negative samples, and then identifies sequences with amplified positive orientations as piRNAs.


3.5 The difference of piRNAs between the solitary and gregarious locusts

We found that 15 strings are used differently between the two locust phases, and that the expression of piRNAs in solitary locusts is much higher than in gregarious ones. Because the two phases have the same genome sequence, differences between solitary and gregarious locusts are attributed to gene expression and regulation modulated by piRNAs (Wei et al.). Among the 87 536 predicted locust piRNAs, there are 12 386 gregarious-specific piRNAs and 69 151 solitary-specific piRNAs. Fifteen strings in gregarious- and solitary-specific piRNAs show significantly different usage at a significance level of $10^{-100}$ (Table 2).
Table 2.

The 15 strings that are differently used in two phases with a significance level of $10^{-100}$

k-mer    Base strings
1mer     C
2mer     CC
3mer     ACC, GCC, CCA, CCC, CCT, CTC
4mer     CACC, CCTT
5mer     CTGCA, CTCTG, TCCGA, TTGCT, TTGTA

This difference in string usage can be visualized by comparing the average frequencies of the 15 strings in solitary and gregarious locusts (Fig. 5A). The most significant difference is the high C content of the gregarious locust piRNAs. Based on the 87 536 predicted locust piRNAs, the distribution of piRNA number versus length (Fig. 5B) and the transcriptional profiles of solitary and gregarious locusts are consistent with the results reported by Wei et al. This result further supports the robustness of our method in detecting piRNAs. There are 5999 piRNAs shared by the two phases, and these piRNAs generally have more reads in solitary locusts than in gregarious locusts. In total, 3912 piRNAs have more reads in solitary locusts, 1435 piRNAs have equal reads in the two phases and only 652 piRNAs have more reads in gregarious than in solitary locusts (Supplementary Material Fig. S2). These highly expressed piRNAs may play an important role in maintaining the strong propagation of solitary locusts. We calculated the ratios of piRNA reads in solitary to gregarious locusts, and found that the ratios for 84 piRNAs are above 30 (Supplementary Material). These 84 piRNAs are ideal candidates for further piRNA interference experiments to investigate the mechanism by which piRNAs modulate phase transition in locusts.
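The read comparison for piRNAs shared by both phases can be expressed as a short routine. This is a sketch with hypothetical names and toy read counts, not the study's data; it assumes each shared piRNA has a pair of read counts (solitary, gregarious).

```python
def compare_phase_reads(reads, min_ratio=30):
    """Summarize which phase has more reads and list strongly solitary-biased piRNAs.

    reads maps a piRNA identifier to (solitary_reads, gregarious_reads).
    """
    summary = {"solitary_higher": 0, "equal": 0, "gregarious_higher": 0}
    enriched = []
    for name, (solitary, gregarious) in reads.items():
        if solitary > gregarious:
            summary["solitary_higher"] += 1
        elif solitary == gregarious:
            summary["equal"] += 1
        else:
            summary["gregarious_higher"] += 1
        # Ratio of solitary to gregarious reads, as in the text.
        if gregarious > 0 and solitary / gregarious > min_ratio:
            enriched.append(name)
    return summary, enriched


# Toy read counts, for illustration only.
toy = {"p1": (62, 2), "p2": (5, 5), "p3": (1, 9), "p4": (30, 1)}
toy_summary, toy_enriched = compare_phase_reads(toy)
```

As a consistency check on the reported figures, the three categories add up to the number of shared piRNAs: 3912 + 1435 + 652 = 5999.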
Fig. 5.

The difference between solitary and gregarious locust piRNAs. (A) 15 strings are differently used between two phases with a significance level of $10^{-100}$, and this difference is visualized by comparing the frequencies of the 15 strings in two phases. The most significant difference is the high content of C in gregarious locust piRNAs. (B) The distribution of piRNA reads versus length in solitary and gregarious locusts. The solitary locusts always have more reads at each length than gregarious ones.


3.6 Match the 87 536 predicted piRNAs with transposons

Of the 87 536 locust piRNAs, 4426 match locust transposons from the locust transcriptome data (Kang et al., unpublished data). Not all transposons are transcribed in the locust transcriptome data, so we obtained only 6635 locust transposons. When the locust piRNAs are compared with these transposons, 4426 matches are found and over half of the transposons are hit (Supplementary Material). As expected, most of the matched piRNAs have the largest values of $F(x) = w^{\tau} x$ among all 603 607 candidate sequences (Fig. 6). The figure also presents the framework of this article: it shows the distributions of projective points for the 120 000 pairs of training sequences, the 603 607 deep-sequenced candidate small RNA sequences, the 87 536 predicted locust piRNAs and the 4426 locust piRNAs matched with locust transposons.
Fig. 6.

The distribution of projective points representing small RNAs. A small RNA sequence is first characterized by a 1364 D vector, and then mapped to a projective point by the Fisher formula $F(x) = w^{\tau} x$. This figure shows the distributions of projective points of the 120 000 pairs of training sequences (piRNAs and non-piRNAs), the 603 607 deep-sequenced candidate small RNA sequences, the 87 536 predicted locust piRNAs and the 4426 locust piRNAs matched with locust transposons. From the top down, the framework of this article is presented.


4 CONCLUSION

In this article, we implemented a k-mer algorithm to predict piRNAs. Compared with previous approaches, the new method does not require a reference genome and performs much better on piRNA prediction. We also improved the Fisher algorithm by setting different cutoffs to raise the precision. The basic work is to obtain the Fisher vector $w$ and the mean value $\bar{N}$ and SD $N_{std}$ of the projective points of the negative samples. Then, a sequence represented by a 1364 D vector $x$ can be regarded as a piRNA if its $w^{\tau} x$ is larger than $\bar{N} + t\,N_{std}$ (t = 2 in this study). Using this new scheme, we obtained 87 536 putative piRNAs from the locust, which should be very helpful in studying the phase transition mechanism of insects, especially hemimetabolous insects. Moreover, the 84 locust piRNAs with the largest ratios of solitary to gregarious reads, and the 4426 locust piRNAs matched with existing transposons, may provide excellent candidates for studying phase transition via locust transcriptome and RNAi experiments. The results also provide important clues for understanding the molecular mechanism of the fecundity difference between solitary and gregarious locusts. Most notably, unlike other methods in the literature, the novel prediction approach is based on the general property of string usage in piRNAs, which is extracted from all available piRNAs. We believe that this method can be widely used to predict piRNAs from both model and non-model organisms.