Prapaporn Techa-Angkoon, Kevin L Childs, Yanni Sun.
Abstract
BACKGROUND: Gene prediction is a key step in genome annotation. Ab initio gene prediction enables annotation of new genomes regardless of the availability of homologous sequences. A number of ab initio gene prediction tools exist and have been widely used to annotate genes in various species. However, existing tools are not optimized for identifying genes with highly variable GC content. In addition, some genes in grass genomes exhibit a sharp 5′-3′ decreasing GC content gradient, which is not carefully modeled by available gene prediction tools. Thus, there is still room to improve the sensitivity and accuracy of predicting genes with GC gradients.
Keywords: GC contents; Gene finding; Grass genomes; Hidden Markov model; Plant genome gene prediction
Year: 2019 PMID: 31874598 PMCID: PMC6929509 DOI: 10.1186/s12859-019-3047-3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 A gene, LOC_Os03g44820.1, with a GC content gradient from the Oryza sativa data set. The X-axis shows each exon inside the gene. The Y-axis shows the GC content
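A per-exon GC profile like the one plotted here can be computed directly from exon sequences. The following is a minimal sketch, assuming exons are available as plain nucleotide strings ordered 5′ to 3′; the function name `gc_content` and the toy sequences are illustrative and are not taken from the paper.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


# Hypothetical exons ordered 5' to 3' (toy sequences, not real rice data).
exons = ["GCGCGCGCGG", "GCGCATGCAT", "ATATGCATAT"]

# One GC value per exon: the quantity shown on the Y-axis of such a plot.
profile = [gc_content(exon) for exon in exons]

# A 5'-3' decreasing gradient means each exon is no richer in GC than the previous one.
is_decreasing = all(a >= b for a, b in zip(profile, profile[1:]))

print(profile, is_decreasing)
```

Plotting `profile` against the exon index would give a curve of the same kind as the one described in the caption.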
Fig. 2 An overview of training and gene prediction. (a) Training. (b) Prediction
Fig. 3 For each internal exon, three states (E0+,H, E0+,M, E0+,L) are used to model exons of high, medium, and low GC content. This figure illustrates only the three internal exon states for one phase on the plus strand (corresponding to one reading frame). The internal exons of other phases, the initial exon, the terminal exon, and the single exon all have three states for high, medium, and low GC content. Genes of various GC contents and gradients can be represented as various paths through the exon states
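The mapping from an exon's GC content to one of the three state classes can be sketched with the two cutoffs, lowT and highT, that appear in the performance tables. This is an illustrative sketch only: the function name `gc_class` is hypothetical, and the handling of values exactly equal to a cutoff (treated here as medium) is an assumption, since the record does not state it.

```python
def gc_class(gc: float, low_t: float, high_t: float) -> str:
    """Assign an exon to the low (L), medium (M), or high (H) GC class.

    Assumption: values equal to a cutoff fall into the medium class.
    """
    if not 0.0 <= low_t <= high_t <= 1.0:
        raise ValueError("expected 0 <= low_t <= high_t <= 1")
    if gc < low_t:
        return "L"
    if gc > high_t:
        return "H"
    return "M"


# Hypothetical per-exon GC values for one gene, ordered 5' to 3'.
exon_gc = [0.72, 0.55, 0.35]

# With lowT=0.39 and highT=0.61, the gene corresponds to a path
# through a high, then a medium, then a low GC exon state.
path = [gc_class(gc, low_t=0.39, high_t=0.61) for gc in exon_gc]
print(path)
```

A gene with a sharp 5′-3′ decreasing gradient would therefore follow a path such as H, M, L through the exon states, while a gene with uniform GC content stays in one class.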
Fig. 4 The state diagram of GPRED-GC. States beginning with r represent the reverse strand. Single-exon states: one state each for a single exon of high, medium, and low GC content. Einit H: the initial coding exon of a multi-exon gene with high GC content. Einit M: the initial exon of a multi-exon gene with medium GC content. Einit L: the initial exon of a multi-exon gene with low GC content. DSS: a donor splice site. Ishort: an intron emitting at most d nucleotides. Ifixed: the first d nucleotides of a longer intron. Igeo: a longer intron emitting one nucleotide at a time after the first d nucleotides. ASS: an acceptor splice site with branch point. EH: an internal coding exon of a multi-exon gene with high GC content. EM: the internal exon of a multi-exon gene with medium GC content. EL: the internal exon of a multi-exon gene with low GC content. Terminal-exon states: one state each for the last coding exon of a multi-exon gene with high, medium, and low GC content. IR: intergenic region. Diamonds represent states that emit fixed-length strings. Ovals represent states with an explicit length distribution. The numbers on the arrows show the transition probabilities. The transition probabilities incident to the new exon states are derived using equal division (strategy 1). The exponents 0, 1, and 2 represent the reading frame phase. For an exon state, this is the position of the last base of the exon in its codon. For the other states, the exponent is the phase of the preceding exon. The small circles represent silent states
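The equal-division strategy mentioned in this caption can be illustrated as follows: each transition that entered a single exon state in the original model is split into three transitions of equal probability, one per GC class. The sketch below is an assumption-laden illustration, not the authors' implementation; the state names, the `_H`/`_M`/`_L` suffix convention, and the numeric values are all hypothetical.

```python
def split_equally(transitions, exon_states, classes=("H", "M", "L")):
    """Split every transition into an exon state equally among its GC-class copies.

    transitions: dict mapping (source, target) -> probability
    exon_states: set of target state names that get one copy per GC class
    """
    result = {}
    for (source, target), prob in transitions.items():
        if target in exon_states:
            # Each GC-specific copy receives an equal share of the original mass.
            for gc_class in classes:
                result[(source, f"{target}_{gc_class}")] = prob / len(classes)
        else:
            result[(source, target)] = prob
    return result


# Hypothetical outgoing transitions of one state (names and values are illustrative).
original = {("ASS", "E_internal"): 0.9, ("ASS", "E_term"): 0.1}
expanded = split_equally(original, exon_states={"E_internal", "E_term"})

for key, prob in sorted(expanded.items()):
    print(key, round(prob, 4))
```

Because each share is the original probability divided by three, the outgoing probabilities of every source state still sum to the same total after the split.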
Fig. 5 GC content of exons in the A. thaliana data set
Performance comparison of gene prediction tools on A. thaliana with the transition probabilities divided into three equal portions
| Level | Measure | AUGUSTUS | GPRED-GC (lowT=0.47, highT=0.63) | GPRED-GC (lowT=0.30, highT=0.60) | GPRED-GC (lowT=0.30, highT=0.70) | GPRED-GC (lowT=0.60, highT=0.70) |
|---|---|---|---|---|---|---|
| Base level | Sen | 0.968 | 0.962 | 0.963 | 0.962 | 0.963 |
| Base level | Spe | 0.708 | | | | |
| Exon level | Sen | 0.870 | 0.848 | 0.848 | 0.845 | 0.848 |
| Exon level | Spe | 0.666 | | | | |
| Gene level | Sen | 0.554 | 0.548 | 0.548 | 0.548 | |
| Gene level | Spe | 0.352 | | | | |
| Time (Sec.) | | 40.3 | 52.4 | 52.8 | 54.2 | 53.0 |
Bold numbers indicate that the sensitivity or specificity of GPRED-GC is higher than that of AUGUSTUS. Time (Sec.) is the running time, in seconds, of AUGUSTUS and GPRED-GC under different sets of thresholds on the A. thaliana data set. Note: the running time is the total running time of prediction
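The Sen and Spe columns can be read with the convention commonly used in gene-finding evaluation, where sensitivity is the fraction of annotated features that are predicted correctly and specificity is the fraction of predicted features that are correct. This record does not define the measures, so that convention is an assumption here. The sketch below applies it at the exon level, counting an exon as correct only if both boundaries match; the function name and coordinates are hypothetical.

```python
def sensitivity_specificity(predicted, annotated):
    """Return (Sen, Spe) for two collections of (start, end) exon coordinates.

    Assumed convention:
      Sen = TP / (TP + FN)  -> correct predictions / annotated exons
      Spe = TP / (TP + FP)  -> correct predictions / predicted exons
    """
    predicted, annotated = set(predicted), set(annotated)
    true_positives = len(predicted & annotated)  # exact boundary matches only
    sen = true_positives / len(annotated) if annotated else 0.0
    spe = true_positives / len(predicted) if predicted else 0.0
    return sen, spe


# Hypothetical exon coordinates (illustrative values only).
annotated_exons = [(100, 200), (300, 400), (500, 600)]
predicted_exons = [(100, 200), (300, 410), (500, 600), (700, 800)]

sen, spe = sensitivity_specificity(predicted_exons, annotated_exons)
print(f"Sen={sen:.3f} Spe={spe:.3f}")
```

In this toy case two of the three annotated exons are recovered exactly and two of the four predictions are correct. The same function works at the gene level if each gene is represented by a hashable description of its full exon structure.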
Performance comparison of gene prediction tools on A. thaliana with the transition probabilities trained by computing maximum likelihood estimation
| Level | Measure | AUGUSTUS | GPRED-GC (lowT=0.47, highT=0.63) | GPRED-GC (lowT=0.30, highT=0.60) | GPRED-GC (lowT=0.30, highT=0.70) | GPRED-GC (lowT=0.60, highT=0.70) |
|---|---|---|---|---|---|---|
| Base level | Sen | 0.968 | 0.960 | 0.972 | | |
| Base level | Spe | 0.708 | | | | |
| Exon level | Sen | 0.870 | 0.851 | | | |
| Exon level | Spe | 0.666 | 0.677 | | | |
| Gene level | Sen | 0.554 | | | | |
| Gene level | Spe | 0.352 | 0.346 | 0.351 | 0.351 | 0.351 |
| Time (Sec.) | | 40.3 | 51.1 | 57.7 | 56.4 | 57.7 |
The two tools have comparable performance. Time (Sec.) is the running time, in seconds, of AUGUSTUS and GPRED-GC under different sets of thresholds on the A. thaliana data set. Note: the running time is the total running time of prediction
Fig. 6 The change in GC content across all exons in a predicted multi-exon gene of A. thaliana. This gene, SEQ16AC003000G7G8, was predicted by GPRED-GC. The X-axis shows the exon index. The Y-axis shows the GC content
Fig. 7 GC content of the exons in the first O. sativa data set
Fig. 8 GC content of the exons in the second O. sativa data set
Performance comparison of gene prediction on the first O. sativa data set with the transition probabilities divided into three equal parts
| Level | Measure | AUGUSTUS | GPRED-GC (lowT=0.39, highT=0.61) | GPRED-GC (lowT=0.35, highT=0.61) | GPRED-GC (lowT=0.50, highT=0.60) | GPRED-GC (lowT=0.40, highT=0.60) |
|---|---|---|---|---|---|---|
| Base level | Sen | 0.839 | 0.831 | | | |
| Base level | Spe | 0.892 | 0.883 | | | |
| Exon level | Sen | 0.613 | 0.589 | | | |
| Exon level | Spe | 0.694 | 0.692 | | | |
| Gene level | Sen | 0.260 | | | | |
| Gene level | Spe | 0.235 | 0.234 | | | |
| Time (Sec.) | | 37.6 | 58.1 | 58.2 | 57.0 | 56.0 |
Bold numbers indicate that the sensitivity or specificity of GPRED-GC is higher than that of AUGUSTUS. Time (Sec.) is the running time, in seconds, of AUGUSTUS and GPRED-GC under different sets of thresholds on the first O. sativa data set. Note: the running time is the total running time of prediction
Performance comparison of gene prediction on the first O. sativa data set with the transition probabilities trained using maximum likelihood estimation
| Level | Measure | AUGUSTUS | GPRED-GC (lowT=0.39, highT=0.61) | GPRED-GC (lowT=0.35, highT=0.61) | GPRED-GC (lowT=0.50, highT=0.60) | GPRED-GC (lowT=0.40, highT=0.60) |
|---|---|---|---|---|---|---|
| Base level | Sen | 0.839 | | | | |
| Base level | Spe | 0.892 | 0.876 | | | |
| Exon level | Sen | 0.613 | | | | |
| Exon level | Spe | 0.694 | 0.670 | | | |
| Gene level | Sen | 0.260 | 0.253 | 0.253 | | |
| Gene level | Spe | 0.235 | 0.227 | | | |
| Time (Sec.) | | 37.6 | 57.4 | 57.8 | 56.0 | 57.4 |
Bold numbers indicate that the sensitivity or specificity of GPRED-GC is higher than that of AUGUSTUS. Time (Sec.) is the running time, in seconds, of AUGUSTUS and GPRED-GC under different sets of thresholds on the first O. sativa data set. Note: the running time is the total running time of prediction
Fig. 9 Genes of the first O. sativa data set predicted correctly by GPRED-GC but missed or incorrectly annotated by AUGUSTUS. Four genes are shown in the four subplots: (a), (b), (c), and (d). The X-axis shows the exon index inside a gene. The Y-axis shows the GC content
Performance comparison of gene prediction tools on the second O. sativa data set with the transition probabilities divided into three equal parts
| Level | Measure | AUGUSTUS | GPRED-GC (lowT=0.31, highT=0.52) | GPRED-GC (lowT=0.49, highT=0.52) | GPRED-GC (lowT=0.30, highT=0.50) | GPRED-GC (lowT=0.60, highT=0.70) |
|---|---|---|---|---|---|---|
| Base level | Sen | 0.859 | 0.840 | | | |
| Base level | Spe | 0.619 | 0.607 | 0.597 | 0.590 | |
| Exon level | Sen | 0.670 | 0.630 | | | |
| Exon level | Spe | 0.552 | 0.546 | 0.520 | | |
| Gene level | Sen | 0.355 | | | | |
| Gene level | Spe | 0.191 | 0.177 | | | |
| Time (Sec.) | | 48.2 | 60.2 | 60.8 | 60.8 | 59.0 |
The transition probabilities were divided into three equal parts. Bold numbers indicate that the sensitivity or specificity of GPRED-GC is higher than that of AUGUSTUS. Time (Sec.) is the running time, in seconds, of AUGUSTUS and GPRED-GC under different sets of thresholds on the second O. sativa data set. Note: the running time is the total running time of prediction
Performance comparison of gene prediction tools on the second O. sativa data set with the transition probabilities trained using maximum likelihood estimation
| Level | Measure | AUGUSTUS | GPRED-GC (lowT=0.31, highT=0.52) | GPRED-GC (lowT=0.49, highT=0.52) | GPRED-GC (lowT=0.30, highT=0.50) | GPRED-GC (lowT=0.60, highT=0.70) |
|---|---|---|---|---|---|---|
| Base level | Sen | 0.859 | 0.858 | | | |
| Base level | Spe | 0.619 | 0.607 | 0.601 | 0.586 | |
| Exon level | Sen | 0.670 | 0.665 | | | |
| Exon level | Spe | 0.552 | 0.544 | 0.547 | | |
| Gene level | Sen | 0.355 | 0.350 | | | |
| Gene level | Spe | 0.191 | 0.186 | | | |
| Time (Sec.) | | 48.2 | 64.9 | 62.4 | 63.5 | 60.3 |
Bold numbers indicate that the sensitivity or specificity of GPRED-GC is higher than that of AUGUSTUS. Time (Sec.) is the running time, in seconds, of AUGUSTUS and GPRED-GC under different sets of thresholds on the second O. sativa data set. Note: the running time is the total running time of prediction
Fig. 10 Summary of the GC content profiles of six genes from the second O. sativa data set that were correctly predicted by GPRED-GC. The genes in the subplots are (a) LOC_Os03g44820.1, (b) LOC_Os04g52180.1, (c) LOC_Os04g52710.1, (d) LOC_Os05g30860.1, (e) LOC_Os06g11040.1, and (f) LOC_Os10g03830.1. These genes cannot be detected or annotated correctly by AUGUSTUS. The X-axis shows the exon index inside the gene. The Y-axis shows the GC content
The comparison of the corresponding parameters in the two HMMs for these two sets of cutoffs
| From | To | Transition probability (lowT=0.30, highT=0.50) | Training count (lowT=0.30, highT=0.50) | Transition probability (lowT=0.31, highT=0.52) | Training count (lowT=0.31, highT=0.52) |
|---|---|---|---|---|---|
| | | 0.051961 | 212 | 0.045098 | 184 |
| | | 0.125490 | 510 | 0.132353 | 540 |
| | | 0.000980 | 4 | 0.000980 | 4 |
| | | 0.034314 | 140 | 0.027450 | 112 |
| | | 0.124510 | 508 | 0.129412 | 528 |
| | | 0.000980 | 4 | 0.002941 | 12 |
| | | 0.119469 | 216 | 0.101770 | 184 |
| | | 0.316372 | 572 | 0.334071 | 604 |
| | | 0.004425 | 8 | 0.004425 | 8 |
| | | 0.066372 | 120 | 0.055310 | 100 |
| | | 0.130531 | 236 | 0.139381 | 252 |
| | | 0.002212 | 4 | 0.004425 | 8 |
| | | 0.107246 | 148 | 0.092754 | 128 |
| | | 0.272464 | 376 | 0.284058 | 392 |
| | | 0 | 0 | 0.002899 | 4 |
| | | 0.074074 | 96 | 0.067901 | 88 |
| | | 0.379630 | 492 | 0.385802 | 500 |
| | | 0.003086 | 4 | 0.003086 | 4 |
| | | 0.099088 | 348 | 0.077449 | 272 |
| | | 0.407745 | 1432 | 0.428246 | 1504 |
| | | 0.001139 | 4 | 0.002278 | 8 |
Set 1: lowT = 0.30 and highT = 0.50. Set 2: lowT = 0.31 and highT = 0.52. The probabilities before applying pseudocounts and their corresponding training counts are listed
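The relationship between a training count and its transition probability can be sketched as a maximum likelihood estimate: the count of a transition divided by the total count of transitions leaving the same state, optionally smoothed with a pseudocount. The sketch below is illustrative only. The function name, the state labels, and the total of 4080 outgoing transitions are assumptions; the total is back-calculated from the first table row (212 / 0.051961) and is not stated in the record.

```python
def mle_transition_probs(counts, pseudocount=0.0):
    """Estimate outgoing transition probabilities of one state from counts.

    counts: dict mapping target state -> number of observed transitions
    pseudocount: value added to every count before normalizing (0 = plain MLE)
    """
    total = sum(counts.values()) + pseudocount * len(counts)
    if total == 0:
        raise ValueError("no observations and no pseudocount")
    return {target: (count + pseudocount) / total
            for target, count in counts.items()}


# Hypothetical counts for one source state. The total (4080) is an assumption
# chosen so that 212 observations reproduce the probability in the first row.
counts = {"target_state": 212, "all_other_targets": 3868}

plain = mle_transition_probs(counts)
smoothed = mle_transition_probs(counts, pseudocount=1.0)

print(round(plain["target_state"], 6))
print(round(smoothed["target_state"], 6))
```

With a pseudocount of zero, a transition never seen in training gets probability 0, as in the table row with a count of 0. A positive pseudocount keeps such transitions possible during prediction.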