
Intron length distributions and gene prediction.

Scott William Roy, David Penny.

Abstract

Accurate gene prediction in eukaryotes is a difficult and subtle problem. Here we point out a useful feature of expected distributions of spliceosomal intron lengths. Since introns are removed from transcripts prior to translation, intron lengths are not expected to respect coding frame, thus the number of genomic introns that are a multiple of three bases ('3n introns') should be similar to the number that are a multiple of three plus one bases (or plus two bases). Skewed predicted intron length distributions thus suggest systematic errors in intron prediction. For instance, a genome-wide excess of 3n introns suggests that many internal exonic sequences have been incorrectly called introns, whereas a deficit of 3n introns suggests that many 3n introns that lack stop codons have been mistaken for exonic sequence. A survey of genomic annotations for 29 diverse eukaryotic species showed that skew in intron length distributions is a common problem. We discuss several examples of skews in genome-wide intron length distributions that indicate systematic problems with gene prediction. We suggest that evaluation of length distributions of predicted introns is a fast and simple method for detecting a variety of possible systematic biases in gene prediction or even problems with genome assemblies, and discuss ways in which these insights could be incorporated into genome annotation protocols.

Year:  2007        PMID: 17617639      PMCID: PMC1950532          DOI: 10.1093/nar/gkm281

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


INTRODUCTION

Ever since the dawn of the genomic age, accurate prediction of protein coding (and other) genes has been a central problem of biology (1,2). New annotations are continually released, even of the best-annotated and most carefully studied genomes. The statistical task of distinguishing true coding genes from non-coding sequences requires evading an array of pitfalls, among them pseudogenes, coding elements of repetitive sequences, short ORFs that are biologically but not statistically significant, and transcription of non-coding sequences. A wide variety of statistical and bioinformatic approaches to gene prediction have been developed, including comparisons with sequences from transcripts and from other species, statistical analysis of ORF length and Bayesian comparison of gene models (3–7). One of the major obstacles to accurate gene prediction in eukaryotes is the presence of spliceosomal introns. Spliceosomal introns are genomic sequences that are removed from RNA transcripts prior to translation by a very large RNA–protein complex called the spliceosome. Spliceosomal introns show large variations in intron number per gene, typically exhibit no length or sequence conservation either within or between species, and afford opportunities for alternative splicing, further complicating accurate deduction of a species’ protein arsenal. Here we point out a simple aspect of the expected distribution of spliceosomal intron lengths within a genome, which we hope may be helpful to ongoing and future annotation efforts. Due to their removal from transcripts prior to translation, intron sequences are generally not expected to respect the coding frame and meaning of the surrounding coding sequence.
Correspondingly, many predicted introns in the most thoroughly annotated eukaryotic genomes have inframe stop codons, and predicted introns in these genomes are as likely to be a multiple of 3 basepairs (bp) (‘3n’), and thus to conserve reading frame, as to contain an ‘extra’ one (3n + 1) or two (3n + 2) bp. For genomic sequences without exhaustive databases of transcript (EST and/or cDNA) sequences, prediction of introns is a difficult task. Here, examination of the distribution of intron lengths can provide insights into the possibility of intron over/underprediction. We report distributions of predicted intron lengths for 29 fully sequenced eukaryotic species. We find frequent deviations in the number of predicted 3n introns relative to 3n + 1 and 3n + 2 introns. Some species show a pronounced deficit of 3n introns, others an excess of 3n introns. We discuss five different species that show highly skewed length distributions among predicted introns, and suggest ways in which current annotations might be improved.

METHODS

We downloaded genome sequences and predicted gene sequences and coordinates as indicated in Table 1. For each Entamoeba histolytica gene with an annotated intron, we performed BLASTN searches of the corresponding genomic region (predicted intron plus 60 flanking upstream and downstream bases) against all E. histolytica reads in the NCBI Trace Archive and compared the assembled sequence against the best hit. As a negative control, we also BLASTed randomly selected uninterrupted predicted coding sequences of length 180 bp against the sequence reads. Only 132/5000 random sequences (2.6%) showed a gap, tenfold less than for the predicted intronic regions. In order to ensure that our observations were not due to errors in the GenBank files, we downloaded gene predictions from the individual websites for each genome showing deviations from equal proportions that are discussed below. In each case these annotations showed (nearly) identical proportions of introns in the three categories as found from the GenBank files. Novel Perl scripts were written to perform the analyses described.
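The original Perl scripts are not reproduced here. As an illustration of the first step such an analysis needs, the following Python sketch derives intron lengths from the exon coordinates of a single predicted transcript. It assumes 1-based inclusive coordinates, as used in GenBank feature tables; the function name and the input layout (a list of start/end pairs) are hypothetical choices made for this example.

```python
def intron_lengths(exons):
    """Return intron lengths for one transcript.

    exons: iterable of (start, end) pairs, 1-based and inclusive,
    all on the same strand of the same sequence (an assumption of
    this sketch). The order of the input does not matter.
    """
    ordered = sorted(exons)
    lengths = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end - 1  # bases strictly between two exons
        if gap > 0:  # skip abutting or overlapping exon annotations
            lengths.append(gap)
    return lengths


# Hypothetical three-exon gene: introns of 86 bp (3n + 2) and 84 bp (3n).
example = intron_lengths([(1, 100), (187, 300), (385, 450)])
```

Pooling these lengths over all predicted genes gives the input for the length-class counts discussed in the Results.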
Table 1.

Genome annotations for 29 fully sequenced species used in this study

Species                      Version             Source
Anopheles gambiae            AgamP3              GenBank
Apis mellifera               Amel2.0             GenBank
Arabidopsis thaliana         TAIR, version 5     GenBank
Aspergillus fumigatus        NC_007194.1-201.1   GenBank
Bigelowiella natans          DQ158856.1-8.1      GenBank
Caenorhabditis elegans       Wormbase 160        Wormbase
Ciona intestinalis           CINT1.95            GenBank
Cryptococcus neoformans      Version 1           GenBank
Cyanidioschyzon merolae      Version 1           GenBank
Dictyostelium discoideum     AAFI01000000.1      GenBank
Drosophila melanogaster      r4.3                Flybase
Encephalitozoon cuniculi     Version 1           GenBank
Entamoeba histolytica        AAFB01000000.1      GenBank
Fugu rubripes                FUGU4               EnsEMBL
Homo sapiens                 NCBI35              EnsEMBL
Oryza sativa                 REFSEQ              GenBank
Ostreococcus tauri           Oct 9 submission    GenBank
Paramecium tetraurelia*      NC_006058.1         GenBank
Phytophthora sojae           Version 1.1         JGI
Phytophthora ramorum         Version 1.1         JGI
Plasmodium falciparum        3/10/2002 Version   PlasmoDb
Plasmodium yoelii            Version 1           PlasmoDb
Saccharomyces cerevisiae     NC_001133-48        GenBank
Schizosaccharomyces pombe    NC_00341.2-4.2      GenBank
Thalassiosira pseudonana     Thaps3              JGI
Toxoplasma gondii            ann3                TIGR
Trichomonas vaginalis        Version 1           GenBank
Ustilago maydis              Version 1           GenBank
Yarrowia lipolytica          CR382127.1-32.1     GenBank

RESULTS AND DISCUSSION

Distribution of intron lengths across genome annotations for 29 species

We studied current genome sequence annotations for 29 different eukaryotic species (Table 2). For most genomes with large numbers of introns, the numbers of 3n + 1 and 3n + 2 introns are very similar: among genomes with >300 introns, the percentages of 3n + 1 and 3n + 2 introns are within 2.8% of each other in 23/25 genomes. In stark contrast, the number of 3n introns varies much more widely, falling within 2.8% of the average of 3n + 1 and 3n + 2 in only about half (13/25) of the genomes. Species are roughly evenly divided between those that show an excess of 3n introns relative to 3n + 1 or 3n + 2 introns, and those that show a deficit of 3n introns. Two species, E. histolytica and the nucleomorph of Bigelowiella natans, show very different patterns: an excess of 3n + 1 introns in B. natans and an excess of 3n + 2 introns in E. histolytica. We next analyzed the sets of predicted introns for several cases that showed pronounced deviations from equal intron numbers in the three classes.
Table 2.

Distribution of lengths of predicted introns in complete genomes from 29 eukaryotic species

Species                      Introns  3n     3n + 1  3n + 2  Excess 3n  (3n + 1) − (3n + 2)
Anopheles gambiae            37901    0.405  0.298   0.297   0.108      0.001
Apis mellifera               145454   0.376  0.309   0.316   0.063      −0.007
Arabidopsis thaliana         91222    0.334  0.336   0.330   0.001      0.006
Aspergillus fumigatus        18293    0.303  0.346   0.350   −0.045     −0.004
Aspergillus nidulans         24772    0.319  0.342   0.339   −0.022     0.003
Bigelowiella natans          861      0.110  0.704   0.186   −0.334     0.518
Caenorhabditis elegans       137752   0.332  0.335   0.333   −0.002     0.002
Ciona intestinalis           196139   0.344  0.328   0.327   0.016      0.001
Cryptococcus neoformans      35032    0.321  0.339   0.341   −0.019     −0.002
Cyanidioschyzon merolae      27       0.222  0.333   0.444   −0.167     −0.111
Dictyostelium discoideum     17468    0.335  0.332   0.333   0.002      −0.001
Drosophila melanogaster      19390    0.364  0.317   0.319   0.047      −0.002
Encephalitozoon cuniculi     15       0.267  0.467   0.267   −0.100     0.200
Entamoeba histolytica        3125     0.258  0.281   0.461   −0.113     −0.180
Fugu rubripes                171912   0.314  0.340   0.346   −0.029     −0.006
Homo sapiens                 307019   0.330  0.333   0.337   −0.005     −0.003
Oryza sativa                 100262   0.341  0.330   0.329   0.012      0.002
Ostreococcus tauri           6450     0.200  0.402   0.398   −0.200     0.004
Paramecium tetraurelia       1082     0.170  0.403   0.427   −0.245     −0.024
Phanerochaete chrysosporium  44855    0.306  0.342   0.352   −0.041     −0.011
Phytophthora sojae           34525    0.370  0.301   0.329   0.054      −0.028
Phytophthora ramorum         24896    0.389  0.308   0.303   0.083      0.005
Plasmodium falciparum        7426     0.342  0.323   0.335   0.013      −0.012
Plasmodium yoelii            8143     0.347  0.321   0.333   0.020      −0.012
Saccharomyces cerevisiae     266      0.350  0.286   0.365   0.024      −0.079
Schizosaccharomyces pombe    4730     0.315  0.345   0.340   −0.028     0.005
Thalassiosira pseudonana     15636    0.612  0.194   0.194   0.418      0.001
Toxoplasma gondii            27495    0.336  0.331   0.332   0.004      −0.001
Trichomonas vaginalis        58       0.362  0.448   0.190   0.043      0.259
Ustilago maydis              4900     0.274  0.355   0.371   −0.088     −0.016
Yarrowia lipolytica          829      0.306  0.338   0.356   −0.040     −0.018
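The two skew columns of the table can be recomputed from a plain list of predicted intron lengths. The Python sketch below is a minimal illustration, not the authors' script: 'Excess 3n' is taken to be the 3n proportion minus the mean of the 3n + 1 and 3n + 2 proportions, which matches the tabulated values. The chi-square statistic against equal thirds is an addition made for this example and is not part of the original analysis; all function names are hypothetical.

```python
from collections import Counter


def length_class_summary(intron_lengths):
    """Summarise intron lengths by their remainder modulo 3."""
    counts = Counter(length % 3 for length in intron_lengths)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no intron lengths supplied")
    p0, p1, p2 = (counts[r] / total for r in (0, 1, 2))
    return {
        "introns": total,
        "3n": p0,
        "3n+1": p1,
        "3n+2": p2,
        "excess_3n": p0 - (p1 + p2) / 2,  # 3n minus mean of the other classes
        "diff_12": p1 - p2,               # (3n + 1) minus (3n + 2)
    }


def chi_square_equal_thirds(n0, n1, n2):
    """Pearson chi-square statistic (2 d.f.) against equal expected counts."""
    expected = (n0 + n1 + n2) / 3
    return sum((obs - expected) ** 2 / expected for obs in (n0, n1, n2))


# Class counts quoted in the text for T. pseudonana; the individual
# lengths (84, 121, 86 bp) are placeholders with the right remainders.
lengths = [84] * 9573 + [121] * 3037 + [86] * 3029
summary = length_class_summary(lengths)
chi2 = chi_square_equal_thirds(9573, 3037, 3029)
```

With the T. pseudonana counts the excess of 3n introns comes out at about 0.418, in agreement with the table, and the chi-square statistic is far above 5.99, the 5% critical value for two degrees of freedom. For genomes with very few introns (for instance the 15 predicted introns of Encephalitozoon cuniculi) such a test has little power, so the proportions alone should be read with caution.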

Excess of 3n introns: Thalassiosira pseudonana

Due to the lack of close relatives with sequenced genomes or very large samples of transcript sequences, genes in the Thalassiosira pseudonana genome were largely predicted by homology searches against sequences from deeply diverged eukaryotic species (8). Predicted T. pseudonana genes have on average 1.4 introns per gene. These predicted introns show a strongly skewed length distribution, with 3n introns accounting for 61.2% of all predicted introns (9573 3n introns, 3037 3n + 1, 3029 3n + 2; an example is shown in Figure 1). Such a skew suggests that many predicted 3n introns are not true introns but instead represent exonic sequences. In keeping with this possibility, most 3n introns (75.2%) lack inframe stop codons, in stark contrast to 3n + 1 (29.1%) and 3n + 2 (28.6%) introns.
Figure 1.

Introns 1–3 and flanking sequences from T. pseudonana predicted gene 10 0621. Upper/lowercase sequence indicates predicted exonic/intronic sequence. Asterisks indicate frameshifts introduced by non-3n introns; intronic inframe stop codons are underlined. Intron 1 is an 86 bp intron (3n + 2) with two inframe stop codons. Intron 2 is an 84 bp intron (3n), which lacks inframe stop codons, and thus does not interrupt the ORF. Intron 3 is a 121 bp intron (3n + 1), which lacks inframe stop codons.

From these two observations it is possible to obtain independent estimates of the number of predicted introns that in fact represent coding sequence (i.e. false positive intron predictions). First, based on the assumption of equal numbers of 3n, 3n + 1 and 3n + 2 introns, there is a 3n excess of 6540 introns (note that numbers of 3n + 1 and 3n + 2 introns are nearly identical). Second, roughly equal fractions of 3n, 3n + 1 and 3n + 2 introns are expected to lack inframe stop codons (for instance 29.1% of 3n + 1 introns and 28.6% of 3n + 2 introns lack stop codons). There are 2368 stop-containing 3n introns, thus we expect roughly 972 (=2368 × 29.1%/[1 − 29.1%]) 3n introns without inframe stop codons, 6233 fewer than predicted. Thus, two independent estimates suggest that 6200–6600 (86–90%) of the 7205 predicted 3n stop codon-lacking introns instead represent unspliced coding sequence.
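The two estimates can be retraced with a few lines of arithmetic. The Python sketch below uses only the counts quoted in the text. The stop-lacking fraction is set to 29.1% (the 3n + 1 value), an assumption made here because it reproduces the quoted figure of roughly 972; the function names are hypothetical.

```python
def excess_3n_count(n0, n1, n2):
    """Surplus of 3n introns over the mean of the 3n + 1 and 3n + 2 counts."""
    return n0 - (n1 + n2) / 2


def expected_stopless_3n(stop_containing_3n, stopless_fraction):
    """Expected number of genuine 3n introns lacking inframe stops.

    If a fraction f of genuine introns lack stops, the stop-containing
    ones make up (1 - f) of the genuine set, so the stop-lacking ones
    number stop_containing * f / (1 - f).
    """
    return stop_containing_3n * stopless_fraction / (1 - stopless_fraction)


# Estimate 1: length-class counts for T. pseudonana.
estimate_1 = excess_3n_count(9573, 3037, 3029)

# Estimate 2: 2368 stop-containing 3n introns, 7205 predicted stop-lacking
# 3n introns, and an assumed stop-lacking fraction of 0.291.
expected = expected_stopless_3n(2368, 0.291)
estimate_2 = 7205 - expected

fraction_low = estimate_2 / 7205   # share of stop-lacking 3n introns in doubt
fraction_high = estimate_1 / 7205
```

The first estimate is 6540 and the second about 6233, corresponding to roughly 86.5% and 90.8% of the 7205 stop-lacking 3n predictions. Using the 3n + 2 fraction (28.6%) instead changes the second estimate only slightly.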

Deficit of 3n introns: Paramecium tetraurelia and Ostreococcus tauri

A second case, suggestive of the reverse problem (underprediction of 3n introns), is found in the annotation of the largest somatic chromosome of Paramecium tetraurelia (14). Here there is a striking deficit of predicted 3n introns (185 total) compared to 3n + 1 (436) and 3n + 2 (462) introns. This deficit is likely due to the short intron lengths in P. tetraurelia: all predicted introns are less than 36 bp in length. Whereas long non-coding sequences are likely to contain inframe stop codons by chance, short introns may lack them. A short 3n intron without a stop codon may therefore be mistaken for coding sequence, whereas the presence of a 3n + 1 or 3n + 2 intron can be inferred from the disruption of the coding frame. That many stop codon-lacking 3n introns may have gone unpredicted is underscored by the high frequency of stop codons in the predicted 3n introns (91.3% contain a stop codon), much higher than in 3n + 1 (46.0%) or 3n + 2 (47.7%) introns. If there were 264 3n introns currently incorrectly predicted as coding sequence, the number of 3n introns would be equal to the average of 3n + 1 and 3n + 2, and the fraction of stop codon-containing introns would be similar (37.6% for 3n) across classes. A similar pattern is seen in the predictions for the Ostreococcus tauri genome, where 3n introns (20% of all predicted introns; 1290 total) are only half as frequent as 3n + 1 (2592) or 3n + 2 (2567) introns (9). Again, a much higher fraction of predicted 3n introns contain stop codons (67.5%) than for 3n + 1 (42.1%) or 3n + 2 (38.5%). If there were an additional 1290 unpredicted 3n introns, there would be equal numbers across the classes, and similar fractions of stop-containing introns (33.8% for 3n).
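The figures for the two genomes follow from the same short calculation. The Python sketch below assumes, as the text implies, that every unpredicted 3n intron lacks an inframe stop codon; the function names are hypothetical.

```python
def missing_3n(n0, n1, n2):
    """3n introns needed to bring the 3n count up to the mean of the others."""
    return (n1 + n2) / 2 - n0


def corrected_stop_fraction(n0, stop_fraction_3n, missing):
    """Stop-containing fraction among 3n introns after adding the missing
    ones, assuming that all missing introns lack inframe stop codons."""
    return n0 * stop_fraction_3n / (n0 + missing)


# Paramecium tetraurelia: 185 / 436 / 462 introns, 91.3% of 3n with stops.
para_missing = missing_3n(185, 436, 462)
para_fraction = corrected_stop_fraction(185, 0.913, para_missing)

# Ostreococcus tauri: 1290 / 2592 / 2567 introns, 67.5% of 3n with stops.
ostr_missing = missing_3n(1290, 2592, 2567)
ostr_fraction = corrected_stop_fraction(1290, 0.675, ostr_missing)
```

This gives 264 missing 3n introns and a corrected stop-containing fraction of about 37.6% for P. tetraurelia, and about 1290 missing introns and 33.8% for O. tauri. Both corrected fractions fall somewhat below the values observed for the 3n + 1 and 3n + 2 classes, so the correction should be read as approximate.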

Difficulties associated with genome assemblies: Entamoeba histolytica

The above examples each concern difficulties of predicting introns based on an accurate genome sequence. Alternatively, errors in a genome assembly can lead to overprediction of introns. One example involves the genome of E. histolytica (10). Previous analysis of introns in genes thought to have been laterally transferred from prokaryotes showed that many predicted introns were associated with errors in the assembly in which a single base was missing in the assembly relative to the corresponding individual sequencing reads (11). Insertion of this missing base into the assembly yielded an ORF that continued through the predicted intronic sequence, suggesting that there is no intron present (e.g. Figure 3). Thus, these assembly indels led to frameshifts in coding sequences, which were compensated for by prediction of an intron.
Figure 2.

Bigelowiella natans gene ABA27371.1, introns 1–3 and flanking sequence. Note that introns 1 and 3 are 18 bp, but also note that they contain stops, and therefore disrupt the ORF. As discussed in the text, other 18 bp introns without stops may have escaped prediction.

Further analysis of the E. histolytica genome suggests that this may be a common problem. Among predicted E. histolytica introns, there is an excess of 3n + 2 introns (1449) over 3n (809) or 3n + 1 (878) introns. BLAST searches of the predicted intronic and flanking exonic sequences against individual sequencing reads showed that 23.1% (722/3126) of predicted introns were associated with gaps in the intron or within 120 bp of the intron (Figure 3). Of the gaps, 98.3% (831/845) were single-base gaps, and 81.3% (687/831) were missing bases in the assembly relative to the sequencing reads (the predominance of missing bases in the assembly is consistent with the excess of 3n + 2 introns, since the added base yields a 3n sequence; the smaller number of extra assembly bases relative to reads is consistent with the smaller excess of 3n + 1 introns over 3n introns). Correction of these apparent assembly errors extended the ORF through the intronic sequence in 79.8% (538/649) of these cases, suggesting that the predicted intronic sequence instead represents coding sequence. In an additional 48 cases (7.4%), correction of the apparent assembly error yields an ORF spanning from within (or upstream of) the predicted intron to the predicted stop codon (or to the next predicted intron boundary in the case of multi-intron genes). These results suggest that at least some 20% of predicted E. histolytica introns are not in fact introns but instead coding sequence. Thus, upwards of 6% of predicted genes in the E. histolytica genome appear to have an assembly error within their sequence.

The peculiar case of the B. natans nucleomorph

Finally, it is worth noting that there is at least one apparently bona fide case of a striking genome-wide difference between introns of different length classes. Among predicted introns in the nucleomorph genome of B. natans, 70.3% are 3n + 1 (12,13). However, this is due to the extreme regularity of intron length, with 70.2% of predicted introns having length 19 bp, and 99.1% being of length 18–20 bp (an example is given in Figure 2).
Figure 3.

Assembly indels in E. histolytica lead to artifactual intron predictions. Predicted intron (lowercase) and flanking exon (uppercase) sequence is shown. The underlined base is present in sequencing reads but absent in the assembly. Bold amino acid sequences are unique to the corrected sequence. (A) Gene EAL42479.1. An assembly deletion of a single A led to prediction of a 35 bp intron (3n + 2) spanning the deletion position to correct the coding frame. (B) Gene EAL42572. An assembly deletion of a single G led to prediction of a 32 bp intron (3n + 2) downstream of the deletion.

How accurate is this length distribution likely to be? Introns in B. natans are predicted based on maximizing ORF length. On its face, then, we would expect the rates of detection of 19 and 20 bp introns to be similar (since both impose a frameshift). Thus, the great excess of 19 bp introns over 20 bp introns is likely to be a true feature of intron lengths in B. natans. On the other hand, detection of 18 bp introns will be far more difficult, since these introns will not disrupt the ORF unless they contain an inframe stop codon (which is not likely in an 18 bp sequence). Correspondingly, the fraction of predicted introns containing inframe stop codons is higher for 18 bp introns (89.7%) than for 19 bp introns (48.4%). This suggests that 18 bp introns are underpredicted by perhaps twofold; however, underprediction is unlikely to account for the 7.0-fold excess of predicted 19 bp introns over 18 bp introns.
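Both quantitative statements in this paragraph can be illustrated with a back-of-envelope calculation. The Python sketch below makes two simplifying assumptions that are not in the original analysis: (i) codons are drawn uniformly and independently from the 64 possible triplets, of which three are stops in the standard genetic code, and (ii) all stop-containing 18 bp introns are detected and the true stop-containing fraction among 18 bp introns equals that observed for 19 bp introns. The function names are hypothetical.

```python
def prob_inframe_stop(n_codons, stop_codons=3, total_codons=64):
    """Chance that at least one of n_codons random codons is a stop codon
    (uniform, independent codons: a deliberate simplification)."""
    return 1 - (1 - stop_codons / total_codons) ** n_codons


def underprediction_factor(stop_fraction_predicted, stop_fraction_true):
    """Ratio of true to predicted intron numbers, if every stop-containing
    intron is found and the true stop-containing fraction is known."""
    return stop_fraction_predicted / stop_fraction_true


# An 18 bp intron spans six inframe codons.
p_stop_18bp = prob_inframe_stop(18 // 3)

# 89.7% of predicted 18 bp introns contain stops, against 48.4% for 19 bp.
factor_18bp = underprediction_factor(0.897, 0.484)
```

Under the uniform model only about a quarter of 18 bp sequences would contain an inframe stop; the real figure depends on base composition, and the 48.4% observed for 19 bp introns suggests that it is higher in this genome. The second calculation gives a factor of about 1.85, in line with the estimate of perhaps twofold underprediction.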

Integrating genome assembly and annotation

The example of E. histolytica raises a larger point of the possible utility of integrating genome assembly with gene annotation. In this case, examination of predicted genes indicated errors in the genome assembly itself, thus integration of the two processes should lead to improvements in both assembly and annotation quality. Such considerations are likely to be all the more important with the increased number of partial and low-coverage genome sequencing projects. Analysis of the apparent coding meaning of preliminary assemblies could identify probable indels in the assembly, and corresponding individual sequencing reads could then be scrutinized in order to correct actual errors. This process lends itself well to automation and we think that such feedback between assembly and annotation could potentially substantially diminish assembly errors in coding regions.

Exploiting intron length distributions in genome annotation pipelines

In the absence of extensive transcript data, intron prediction often proceeds by statistical comparison of alternative gene models: introns are called when an intron-containing structure has a higher probability given model parameters than does an intron-lacking structure (3). In the case of non-3n or stop-containing introns, this is comparatively straightforward, with intron prediction being likely if the resulting coding sequence is significantly longer than the intronless alternative. For non-stop-containing 3n introns, the case is quite different. In this case intron calling involves comparison of two structures with identical 5′ and 3′ flanking coding sequences, thus whether an intron is called will depend only on the extent to which the intervening sequence conforms to expected intron sequences. Here, too much/little sensitivity to intron-like structures will lead to over/under-prediction of introns. One way to address such potential problems would be to introduce an additional training step into gene annotation. Intron sensitivity (i.e. prior probability of intron calling) could be automatically fine tuned based on a training set (genomic region) with unknown intron coordinates until predicted intron distributions (both in terms of lengths and in terms of stop frequencies within introns from different categories) became similar to expected values. Such a procedure is likely to be very helpful in species for which very few verified gene structures are known, since in such cases sensitivity in intron calling is otherwise difficult to gauge. Trial and error will no doubt be required to arrive at a fully functioning protocol; however, we suspect that such efforts would be well rewarded by increased accuracy of intron and therefore proteome prediction.
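The proposed training step can be sketched as a simple search over a sensitivity parameter. The Python example below is only a schematic under strong assumptions: the gene predictor is represented by a hypothetical callable that takes a sensitivity value and returns predicted intron lengths, the toy predictor is invented purely so that the example runs, and only the length skew (not the stop codon frequencies also mentioned above) is used as the objective.

```python
def excess_3n(lengths):
    """3n proportion minus the mean of the 3n + 1 and 3n + 2 proportions."""
    total = len(lengths)
    if total == 0:
        return 0.0
    counts = [0, 0, 0]
    for length in lengths:
        counts[length % 3] += 1
    return counts[0] / total - (counts[1] + counts[2]) / (2 * total)


def calibrate_sensitivity(predict_introns, candidates):
    """Return the candidate sensitivity whose predictions show least skew.

    predict_introns: hypothetical callable, sensitivity -> intron lengths.
    """
    return min(candidates, key=lambda s: abs(excess_3n(predict_introns(s))))


def toy_predictor(sensitivity):
    """Invented stand-in for a gene predictor: frame-shifting introns are
    always found, stop-lacking 3n introns in proportion to sensitivity."""
    non_3n = [61] * 100 + [62] * 100
    called_3n = [60] * round(200 * sensitivity)
    return non_3n + called_3n


best = calibrate_sensitivity(toy_predictor, [0.1, 0.3, 0.5, 0.7, 0.9])
```

In the toy setting the search settles on a sensitivity of 0.5, the value at which the three length classes are equally frequent. A real protocol would need a finer search, an objective that also includes stop codon frequencies, and a check that the chosen value generalises beyond the training region.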

Alternative splicing and intron length distribution

One possible actual biological deviation from equal proportions is worthy of discussion. In a subset of alternative splicing events, an intronic sequence from one splicing isoform is retained in another sequence (so-called ‘intron retention’). In such cases, the alternatively spliced intron must be 3n and lack inframe stop codons in order for both isoforms to encode proteins. Thus, frequent alternative splicing could conceivably bias intron lengths towards 3n introns. Four observations suggest that such alternative splicing events are not a major contributor to the skewed distributions reported here. First, the genomes which show the highest frequencies of alternative splicing (for instance Drosophila melanogaster, Caenorhabditis elegans, Homo sapiens and Arabidopsis thaliana) are among the genomes that show the intron length proportions most nearly equal. Second, in all studied genomes, not more than 5% of introns are known to be alternatively spliced in the manner discussed above, and many of these are not 3n introns, thus this effect is unlikely to drive any more than a small bias even in the most highly alternatively spliced genomes. Third, there is no evidence for frequent alternative splicing in any of the genomes that show pronounced intron length skew. Finally, such an effect specifically predicts an excess of 3n introns, and as such is not a contributor to the other observed skews (3n deficit, 3n + 2 excess).

Concluding remarks

Accurate genome annotation is an extremely difficult problem, requiring balancing of false negatives and positives, and accuracy versus time constraints. Even the best annotation sets are subject to improvement. Evaluation of distributions of predicted intron lengths promises rapid and straightforward detection of a variety of possible systematic biases in gene prediction or even, as in the case of E. histolytica, problems with genome assemblies.
REFERENCES

1. Vladimir Pavlović, Ashutosh Garg, Simon Kasif. A Bayesian framework for combining gene predictions. Bioinformatics (2002-01).

2. Jonathan E Allen, Mihaela Pertea, Steven L Salzberg. Computational gene prediction using multiple sources of evidence. Genome Res (2004-01).

3. Michael Q Zhang. Computational prediction of eukaryotic protein-coding genes. Nat Rev Genet (2002-09).

4. Ewan Birney, Michele Clamp, Richard Durbin. GeneWise and Genomewise. Genome Res (2004-05).

5. E Virginia Armbrust, John A Berges, Chris Bowler, Beverley R Green, Diego Martinez, Nicholas H Putnam, Shiguo Zhou, Andrew E Allen, Kirk E Apt, Michael Bechner, Mark A Brzezinski, Balbir K Chaal, Anthony Chiovitti, Aubrey K Davis, Mark S Demarest, J Chris Detter, Tijana Glavina, David Goodstein, Masood Z Hadi, Uffe Hellsten, Mark Hildebrand, Bethany D Jenkins, Jerzy Jurka, Vladimir V Kapitonov, Nils Kröger, Winnie W Y Lau, Todd W Lane, Frank W Larimer, J Casey Lippmeier, Susan Lucas, Mónica Medina, Anton Montsant, Miroslav Obornik, Micaela Schnitzler Parker, Brian Palenik, Gregory J Pazour, Paul M Richardson, Tatiana A Rynearson, Mak A Saito, David C Schwartz, Kimberlee Thamatrakoln, Klaus Valentin, Assaf Vardi, Frances P Wilkerson, Daniel S Rokhsar. The genome of the diatom Thalassiosira pseudonana: ecology, evolution, and metabolism. Science (2004-10-01).

6. P R Gilson, G I McFadden. The miniaturized nuclear genome of eukaryotic endosymbiont contains genes that overlap, genes that are cotranscribed, and the smallest known spliceosomal introns. Proc Natl Acad Sci U S A (1996-07-23).

7. C Burge, S Karlin. Prediction of complete gene structures in human genomic DNA. J Mol Biol (1997-04-25).

8. Scott William Roy, Manuel Irimia, David Penny. Very little intron gain in Entamoeba histolytica genes laterally transferred from prokaryotes. Mol Biol Evol (2006-07-17).

9. S M Hebsgaard, P G Korning, N Tolstrup, J Engelbrecht, P Rouzé, S Brunak. Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res (1996-09-01).

10. Marek Zagulski, Jacek K Nowak, Anne Le Mouël, Mariusz Nowacki, Andrzej Migdalski, Robert Gromadka, Benjamin Noël, Isabelle Blanc, Philippe Dessen, Patrick Wincker, Anne-Marie Keller, Jean Cohen, Eric Meyer, Linda Sperling. High coding density on the largest Paramecium tetraurelia somatic chromosome. Curr Biol (2004-08-10).