Literature DB >> 30943219

Prediction of Sphingosine protein-coding regions with a self adaptive spectral rotation method.

Zhongwei Li1, Yanan Guan1, Xiang Yuan1, Pan Zheng2, Hu Zhu3.   

Abstract

Identifying protein coding regions in DNA sequences by computational methods is an active research topic. Welan gum produced by Sphingomonas sp. WG has great application potential in oil recovery and concrete construction industry. Predicting the coding regions in the Sphingomonas sp. WG genome and addressing the mechanism underlying the explanation for the synthesis of Welan gum metabolism is an important issue at present. In this study, we apply a self adaptive spectral rotation (SASR, for short) method, which is based on the investigation of the Triplet Periodicity property, to predict the coding regions of the whole-genome data of Sphingomonas sp. WG without any previous training process, and 1115 suspected gene fragments are obtained. Suspected gene fragments are subjected to a similarity search against the non-redundant protein sequences (nr) database of NCBI with blastx, and 762 suspected gene fragments have been labeled as genes in the nr database.

Entities:  

Mesh:

Substances:

Year:  2019        PMID: 30943219      PMCID: PMC6447165          DOI: 10.1371/journal.pone.0214442

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Genetic information is a set of general instructions that directs the translation from DNA to proteins. The vast majority of life on the earth stores genetic information in DNA sequences (some viruses store genetic information in RNA sequences). The information carried by DNA is expressed as proteins to construct cell components and perform genetic instructions for life [1]. Gene is a nucleotide sequence that can encode a substance with a certain biological function, which is the main carrier of the genetic inheritance of biological traits carrying protein information. The coding sequences of eukaryotic genes are not continuously arranged on the DNA molecule but are separated by non-coding introns, and the synthesis of protein is guided by the coding exons. Therefore, after a given genomic sequence, it is one of the central issues in bioinformatics to correctly identify the range of protein coding region in the DNA sequence and the precise position in the genomic sequence [2, 3]. Training the parameters of the biological signal model with a training set of known gene structures is an effective method for gene prediction. In general, computational methods for gene prediction can be generally categorized into three classes: (1)similarity-based prediction methods, (2) ab into techniques based on statistical models, and (3) machine learning-based methods [4-8]. For (1) and (2), various algorithms, including the Dynamic Programming (DP) or the Hidden Markov Model (HMM), are utilized to process the information gathered from these methods and subsequently predict the potential coding region of the genomic sequence [9]. Merging biological knowledge with computational techniques, machine learning aims to build a predictive model by learning the difference between coded and non-coded regions and use the learned model to predict the coding regions in the DNA sequence. The Hidden Markov Model (HMM) is the foundation of many current gene recognition algorithms [10-14]. The Hidden Markov Model considers the DNA sequence as a random process and automatically finds its internal hidden rules based on the difference in the frequency of nucleotide selection between the encoded and non-encoded DNA sequences. There are many gene identification softwares developed based on the HMM model, for example, Augustus, GeneMark.hmm [15], GENSCAN [16-20], GENIE [21], TWINSCAN [22], DOUBLESCAN [23] and Glimmer [24-28]. Among them, Augustus which is one of the most accurate gene prediction programs, adopted the Viterbi algorithm to define a probability distribution in the DNA sequence and gene structure [29-31]. After training the model with the appropriate training set, they predict the coding area with extremely high accuracy. However, training models that depend on known sequences greatly limit the adaptability of these methods, especially for new sequences from unknown organisms with no or small training sets. Therefore, in the absence of available additional information (training set), it is essential to develop some new methods to predict the locations of coding regions without any training process. Deoxyribonucleotide (DNA) is a chain structure arranged in a certain order by four bases of A, G, T, and C. Protein is composed of polypeptide chains that are formed by 20 different amino acids. Encoding an amino acid in a protein, the genetic code requires reading 3 bases of the gene. Because of this coding relationship, the triplet of bases is called a codon. It is noted that, in most species, synonymous codons are used with different frequencies (known as codon bias) and the order with which codons are used for one protein is far from random, which raises a universal property in coding regions [32-34], called the “Triplet Periodicity (TP)”. The TP property of genes is considered to be a simple and universal distinction between coding and non-encoding regions, and studying the application of it may be an effective method to predict gene coding regions and solve other genetic problems [35-39]. In 1982, Fickett first proposed the TP property [40]. After his work, various theoretical tools were applied by researchers to investigate the TP property, such as the hidden Markov chains [41-44], the Fourier transform [17, 18], Neural Networks [45, 46], the information theory [32, 35], the time series [47-50]. Based on the Fourier transform, a method called spectral content measurement (SCM) was developed to study the TP property by Tiwari et al [18]. Since researchers have expanded and improved Tiwari’s original method in many ways, a series of methods have been developed from the original SCM [51-54]. By optimizing four coefficients in the sequence mapping and the short time Fourier transform, Anastassiou [55] proposed the optimized spectral content measure (OSCM) to compute TP property of genes. Based on the information derived from the magnitude of the discrete Fourier transform, Kotlar and Lavner [54] proposed a spectral rotation measurement (SRM) method that uses the information of the phase component such as the phase angles’ expected values and variances, to maximize the differentiation between protein coding and non-coding regions. Marhon and Sajid [56] proposed a spectrum-based technique that uses a dynamic representation scheme to map DNA sequences into a numerical form. And a post-processing method was proposed to detect the period-3 peaks instead of an empirical threshold value. However, most high-performance methods based on TP require known genetic data from the target organism or homologous sequence data for training. Relying on training, such as the HMM-based methods, restricts the application of methods on unknown organisms, which can not offer known genetic data and homologous biological data. In addition, most of SCM related methods employ the moving slide window to investigate the TP properties for local sections of the gene sequence, and then find potential coding regions in the sequence [40, 42]. The sensitivity of these methods is highly dependent on the size of the sliding window. So, how to choose the appropriate size of the moving slide window is also one of the key issues in determining accuracy. The research object of our work is the whole-genome sequences data of Sphingomonas sp. WG with independent intellectual property rights. Due to the lack of known genetic data and homologous biological data, the prediction methods depending on the training process cannot be employed for the whole-genome sequences data of Sphingomonas sp. WG. Chen et al. proposed a method called Adaptive Spectral Rotation (SASR) [57], which is based on the TP properties of the coding region to visualize the DNA sequence without any training process and extra information, realizing the prediction of the gene coding region. When there is insufficient training set or no extra information available, the SASR method is helpful for identifying protein coding regions on unacquainted DNA sequences. Therefore, in our work, we used the SASR method to visualize the coding region of Sphingomonas sp. WG’s whole-genome data without any training process, and the code of the SASR method is obtained from Chen’s paper. Suspected gene fragments are obtained by manually distinguishing the position of the coding region. With comparing suspected gene fragments with the known gene in the NCBI database, we can make judgments about which the suspected gene fragments have been labeled as genes in the NCBI database. Besides, there is a high probability that real genes newly discovered in the whole-genome sequences data of Sphingomonas sp. WG will be found in the unlabeled suspected gene fragments.

Materials and methods

The Sphingomonas sp. WG’s whole genome data

Welan gum is an extracellular polysaccharide produced by aerobic fermentation of Sphingomonas sp [58]. With interesting rheological properties, Welan gum has been widely used as a stabilizing, suspending, emulsifying, and thickening agent in several areas such as coating materials, food, medicine, concrete additives, and enhanced oil recovery [59, 60]. Welan gum has become the hot spot of science research. In the petroleum fields, Welan gum is a new type of biological oil-displacing agent, which is of great value in tertiary oil recovery in the oil fields [61, 62]. With the deepening of the research on biochemical properties of Welan gum, its industrial value has been continuously developed. In the field of oil and natural gas extraction, Welan gum has shown great market value as an excellent tertiary oil-displacement agent. In order to increase the yield of Welan gum and create more commercial profits, it is necessary to analyze the whole-genome sequences data of Sphingomonas sp. WG, a producer of Welan gum, and explore the biological mechanism of Welan gum synthetic route. The Bioengineering and Technology Center, Chinese University of Petroleum (East China), screened a strain of Sphingomonas sp. WG, a high-yielding strain of Welan gum, from the sea mud of Jiaozhou Bay, Qingdao. The whole-genome sequences data of Sphingomonas sp. WG was obtained by whole genome sequencing. The genomic data contains 31 scaffold sequences, 4,042,223 base pairs (bps), and the GC content of 65.88%.

SASR

The TP profile of the base sequence is represented by a triple periodic matrix (TPM), which presented by Frenkel and Korotkov [32, 33]. The TPM is a 3 × 4-sized matrix, each row i (i = 1, 2, 3, 4) stands for a nucleotide base # (# = A, T, C or G), each column stands for a position j (j = 1, 2, 3) in the period and the entry m is the count by which the base i appears at the position j. In the SASR, for a certain base sequence X = {x | t = 1, 2, 3, …, N}, the posterior subsequence of base sequence X at position t0 is expressed as P(t0) = {x | t0 < t ≤ N}. The TP sequence, converted from the base sequence X, is represented by a sequence of TP vectors S(X) = {s | t = 1, 2, 3, …, N}, and TP verctor . That is, for each position t, the TP vector for each base # of the posterior subsequence P(t) of the current position is calculated, ie, M(P(t)), M(P(t)), M(P(t)) and M(P(t)), and s is selected from them, according to the base at the position t. The TPM of the posterior subsequence at each position t is calculated by recursively computing M #(P(t)) from M #(P(t + 1)), with the initial value and the recurrence formula: Remarkably, P(t + 1) is the sequence of the posterior subsequence of P(t). The operation “V >> n” means that the right circular shift operation is performed n times on the triplet row vector V. The TP walk is a movement trajectory in the complex plane generated from the TP sequence. The moving trajectory is represented by the sequence W = {W | t = 0, 1, 2, …, N}, with the initial value W0 = {0, 0, 0}, and for each step t>0: The function L(x) maps the vector s = {Z1, Z2, Z3} into a complex number by: The process of converting a DNA sequence into a TP vector and then generating a TP walk is called a SASR process. The recurrence equation means that, for each step t, the unit length is moved toward the corresponding complex number of the TP vector, in the complex plane. Therefore, TP walk can provide a good visualization of the TP properties in the complex plane. For the coding region (the region with the TP property), TP walk shows clearly and certain movement trends, and for the non-coding area, TP walk moves randomly around the stable points with insignificant movement trends. Stated thus, the difference in the visualization of the TP properties can be exploited as a basis for distinguishing the coding area from the non-coding area.

Data experiments

Verify the reliability and validity of the SASR method

In this work, we first apply SASR to the known gene coding regions and non-coding regions to verify the reliability and validity of the SASR method. Fig 1 shows the TP walk result of the coding region (No. J8VWM6), which encodes the proteins of Sphingomonas sp. LH128 partial outer membrane autotransporter barrel. The gene data is downloaded from the UniProt database and the length of it is 5244 bps. Fig 2 shows the TP walk result of an artificial DNA sequence generated randomly, whose length is 5000 bps. In Fig 1(, TP walk moves 5244 steps in the complex plane, moving rightward from zero (0, 0) to around (1600, -100), but in Fig 2(, after TP Walk moves 5000 steps in the complex plane, the walking result is randomly distributed near the zero with no specific direction. In Fig 1(, the real part value keeps increasing with the increase of the t value, and the imaginary part remains relatively constant. However, in Fig 2(, the real part and the imaginary part change more freely, and there is no fixed change pattern.
Fig 1

The TP walk of the coding region downloaded from the UniProt database.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).

Fig 2

The TP walk of the randomly generated base sequence.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).

The TP walk of the coding region downloaded from the UniProt database.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).

The TP walk of the randomly generated base sequence.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green). To further verify the validity of SASR, the Sphingomonas gene in the above is inserted into a randomly generated artificial DNA sequence, with the insertion position starting from 2000 to 7243, to generate a new base sequence that includes both a coding region and non-coding regions. Then using the SASR method, we get the visualization of the TP properties of the base sequence (Fig 3) in the complex plane.
Fig 3

The TP walk of the new base sequence that includes both a coding region and non-coding regions.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).

The TP walk of the new base sequence that includes both a coding region and non-coding regions.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green). In Fig 3(, in the vicinity of the zero point and the end of the walk (the position of the artificial DNA sequence), the TP walk moves randomly with indefinite direction of movement, however, in the middle part (the position of inserted Sphingomonas gene), the TP walk gradually moves to the right. Meanwhile, in Fig 3(, it is obvious that roughly from 2000 to 7000, with the growth of the t value, the real part is gradually increasing, while the artificial DNA sequence part has a smaller change in the real part value. Similar observations have been obtained after applying the SASR method to other known coding regions and non-coding regions. The reliability and validity of the SASR method have been verified through the above experiments. It can be seen that there is a large difference between the coding and non-coding regions in the graphic output of TP Walk. Therefore, after applying the SASR method to the base sequence being measured, by observing the change of the TP Walk’s real part and referring to the trend of TP walking in the complex plane, the coding and non-coding areas can be distinguished without any training process.

Predicting the protein-coding regions of the Sphingomonas sp. WG’s whole genome data

The SASR method is applied to the 31 scaffolds of the Sphingomonas sp. WG’s whole-genome data respectively to predict probable coding regions. Here, taking the processing of No. 21 scaffold as an example to illustrate the prediction processing of the possible coding regions. First, the SASR method is applied to No. 21 scaffold (the sequence length is 43986), and the visualization of the TP properties is obtained (Fig 4).
Fig 4

The TP walk of scaffold 21.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).

The TP walk of scaffold 21.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green). According to the difference in the graphic output between the coding and non-coding regions, the fragment of the base sequence that corresponds to the characteristics of the coding region is identified, which is called a suspected gene fragment, and the specific position of the fragment is identified. Since one base sequence may contain multiple segments of coding regions, a plurality of base segments that conform to the characteristics of the coding region can be found in Fig 4, such as 2000-7000, 16000-20000 and so on. Because the number of bases is large, the trend of the real part of the partial TP walk may not be obvious. So, it is possible to cut out the invisible part of the fragment and use the SASR method again to obtain the position of the base fragment that corresponds to the coding region. Based on the position of the suspected gene fragment in No. 21 scaffold, the base sequence is divided using Matlab software to obtain multiple suspected gene fragments. Finally, the obtained suspected gene fragments are compared with the known gene in the NCBI nr database with blastx.

Results

The graphic output of the base sequence fragment, whose position starting from 21000 to 24250 (Simply expressed as 21-21000-24250, and other suspected gene fragments are expressed in the same way) in No. 21 scaffold, meets the characteristics of the coding region (Fig 5). The suspected gene fragment is compared with the known gene in the NCBI nr database using Blast software. The basic information of this sequence alignment please refer to the Table 1, such as the type of molecule, the length of the sequence, the name of the database to be compared, and the program used, etc. And the results of sequence alignment are shown in Fig 6 and Table 2.
Fig 5

The TP walk of the suspected gene fragment 21-21000-24250.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).

Table 1

The basic information of the comparison result of the suspected gene fragment 21-21000-24250.

ItemContent
RIDSETZNMME01R
Query IDlcl—Query_217685
DescriptionNone
Database Namenr
DescriptionAll non-redundant GenBank CDS translations+ PDB+ SwissProt+ PIR+ PRF excluding environmental samples from WGS projects
ProgramBLASTX 2.8.0+
Molecule typenucleic acid
Query Length3251
Fig 6

Graphic summary of the sequence alignment result of the suspected gene fragment 21-21000-24250.

(a) Similarity color diagram, in which the similarity is arranged from high to low: red, purple, green, blue, black. (b) The suspected gene fragment has the highest degree of similarity and the best matching degree to the first gene sequence.

Table 2

The first 8 detailed alignment results of the suspected gene fragment 21-21000-24250 and different sequences in the database.

Five metrics are used to evaluate the match results of the query sequence: Max score, Total score, Query coverage, E value, and Ident.

DescriptionMax scoreTotal scoreQuery coverE valueIdent
MULTISPECIES: hybrid sensor histidine kinase/response regulator [Sphingomonas]1927192796%0.0100%
PAS domain S-box protein [Sphingomonas pituitosa]1570157096%0.081%
PAS domain S-box protein [Sphingomonas spermidinifaciens]1523152396%0.079%
PAS domain S-box-containing protein [Sphingomonas jatrophae]1269126996%0.066%
PAS domain-containing sensor histidine kinase [Sphingomonas jatrophae]1052125983%0.066%
PAS domain S-box protein [Phenylobacterium deserti]812111890%0.069%
PAS domain S-box protein [Asticcacaulis sp. CF398]744102884%0.051%
PAS domain S-box protein [Methylobacterium platani]69994194%0.055%

The TP walk of the suspected gene fragment 21-21000-24250.

(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).

Graphic summary of the sequence alignment result of the suspected gene fragment 21-21000-24250.

(a) Similarity color diagram, in which the similarity is arranged from high to low: red, purple, green, blue, black. (b) The suspected gene fragment has the highest degree of similarity and the best matching degree to the first gene sequence.

The first 8 detailed alignment results of the suspected gene fragment 21-21000-24250 and different sequences in the database.

Five metrics are used to evaluate the match results of the query sequence: Max score, Total score, Query coverage, E value, and Ident. In Fig 6, the color keys of different colors represent the magnitude of similarity. The more red regions, the higher the similarity of the fragments that are matched in the database. Simultaneously, the matched gene sequences are ranked by the matching scores from large to small, so the suspected gene fragment has the highest degree of similarity and the best matching degree to the first gene sequence. In Table 2, two of the most noteworthy values are the ident value and the E value. The ident value indicates the degree of similarity between the aligned sequence and the target sequence, that is, the number of bases on the match as a percentage of the total sequence length. The E value indicates the possibility of random matching. The greater the E value, the greater the likelihood of random matching. When the E value is close to zero or zero, it can be considered as an exact match. Table 2 shows, the sequence alignment results of the suspected gene fragment 21-21000-24250 and the alignment of hybrid sensor histidine kinase/response regulator [Sphingomonas] is the best one, whose ident value is 100% and the E value is 0, which can be considered as an exact match. To further verify the reliability of the sequence alignment results, the PSI-BLAST program was used to search for similar sequences of the suspected gene fragments 21-21000-24250 by multiple iterations in the database (the threshold is 0.001). There are several highly similar protein sequences in the summary table (hits reported in the 1^st iteration with E value is 0.0 and sequence identity > 30%; see Table 3 for details). The newly added sequences that were below the threshold in the previous search are indicated as “new” and “old” indicates that the sequence has been searched before this iteration. Table 3 contains the sequence alignment information of the first 5 sequences in the search results of the previous eight iterations. It can be seen that from the third iteration, although the order of the sequences in the search results is not exactly the same, the sequences with high similarity are still in front. After several iterations, the searched new sequence is gradually reduced, and when no new sequences are detected below the defined threshold, the iterative process is terminated. Therefore, through multiple iterative searches of PSI-BLAST, we can find that the suspected gene fragment 21-21000-24250 has distant sequence similarity with the sequence, whose entry name is A0A1A3QEF8_MYCSZ in UniProtKB, and can speculate on the possible structure and function of the protein compiled by the suspected gene fragment.
Table 3

The PSI-BLAST search results of the suspected gene sequence 21-21000-24250.

It contains the sequence alignment information of the first 5 sequences in the search results of the previous eight iterations. The similar sequences in search results sort descending based on score values.

iterationAlignmentDB:IDLengthScore(Bits)IdentitiesPositivesE()
1^st1 newTR:W4ZEQ5_STRPU72461156.032.032.00.0
2 newTR:W4YYP3_STRPU113091053.028.033.00.0
3 newTR:B2HHB3_MYCMM3634962.033.035.00.0
4 newTR:A0A2N2JNV2_9DELT11114946.027.030.00.0
5 newTR:A0A0D2WHT8_CAPO37350914.024.030.00.0
2^st1 oldTR:B2HHB3_MYCMM36342873.030.032.00.0
2 oldTR:A0A1X1REZ8_9MYCO40532804.027.028.00.0
3 oldTR:A0A1A3KX53_MYCGO37392791.027.029.00.0
4 oldTR:A0A1A3QEF8_MYCSZ48432784.027.029.00.0
5 oldTR:A0A1X1SZH0_9MYCO36422759.027.029.00.0
3^st1 oldTR:A0A1A3QEF8_MYCSZ48433685.025.027.00.0
2 oldTR:A0A1X1REZ8_9MYCO40533684.026.027.00.0
3 oldTR:A0A1A3KX53_MYCGO37393653.026.027.00.0
4 oldTR:A0A1A3NH48_MYCAS59613648.025.027.00.0
5 oldTR:A0A1X1WR67_MYCGO59613636.025.026.00.0
4^st1 oldTR:A0A1A3QEF8_MYCSZ48434076.025.027.00.0
2 oldTR:A0A1X1REZ8_9MYCO40534052.024.026.00.0
3 oldTR:A0A1A3NH48_MYCAS59614028.023.025.00.0
4 oldTR:A0A1X1WR67_MYCGO59614018.024.026.00.0
5 oldTR:A0A1A3KPQ5_MYCAS49454010.024.026.00.0
5^st1 oldTR:A0A1A3QEF8_MYCSZ48434188.024.025.00.0
2 oldTR:A0A1X1REZ8_9MYCO40534184.024.025.00.0
3 oldTR:A0A1A3NH48_MYCAS59614183.023.025.00.0
4 oldTR:A0A1X1WR67_MYCGO59614163.024.026.00.0
5 oldTR:A0A0Q2QJK1_MYCGO65694142.024.026.00.0
6^st1 oldTR:A0A1A3QEF8_MYCSZ48434205.022.023.00.0
2 oldTR:A0A1A3NH48_MYCAS59614197.023.025.00.0
3 oldTR:A0A1X1WR67_MYCGO59614234.023.025.00.0
4 oldTR:A0A1A3NH48_MYCAS59614226.024.025.00.0
5 oldTR:A0A0Q2QJK1_MYCGO65694213.023.024.00.0
7^st1 oldTR:A0A1A3QEF8_MYCSZ48434270.022.023.00.0
2 oldTR:A0A1X1REZ8_9MYCO40534240.023.024.00.0
3 oldTR:A0A1X1WR67_MYCGO59614181.023.025.00.0
4 oldTR:A0A1X1REZ8_9MYCO40534177.023.025.00.0
5 oldTR:A0A0Q2QJK1_MYCGO65694164.023.024.00.0
8^st1 oldTR:A0A1A3QEF8_MYCSZ48434293.022.024.00.0
2 oldTR:A0A1X1REZ8_9MYCO40534271.023.024.00.0
3 oldTR:A0A1A3NH48_MYCAS59614263.022.024.00.0
4 oldTR:A0A1X1WR67_MYCGO59614209.023.024.00.0
5 oldTR:A0A0Q2QJK1_MYCGO65694024.022.024.00.0

The PSI-BLAST search results of the suspected gene sequence 21-21000-24250.

It contains the sequence alignment information of the first 5 sequences in the search results of the previous eight iterations. The similar sequences in search results sort descending based on score values. The above can prove that by processing the No. 21 scaffold base sequence using the SASR method, the coding region of No. 21 scaffold is found, and the coding region has been labeled as genes in the NCBI database. After the SASR method is applied to predict the coding regions of the 31 scaffolds of the Sphingomonas sp. WG’s whole genome data, 1115 suspected gene fragments are obtained in total by slicing the base sequences. These results can, to a certain extent, prove that these suspected gene segments have been labeled as genes or there are gene sequences with high similarity to suspected gene fragments in NCBI database. So it can be considered that they are base sequences with the function of protein coding, further illustrating the reliability and validity of the SASR method. There are 353 suspected gene fragments not matched against any known gene sequences in the NCBI database. On the basis of that the reliability and validity of the SASR method have been verified, we can consider that the 353 suspected gene fragments are newly discovered suspected gene fragments with the function of protein coding that have not been included in the NCBI database in high probability. But, whether or not the 353 suspected gene fragments are truly gene sequences with the function of protein coding, and what their corresponding biological functions are, it is necessary to do corresponding biological experiments to further verify. However, due to differences in specialized fields, the lack of relevant biological theory knowledge, biological experimental procedures, and professional equipment, and the too high cost of manpower and material resources to complete biometric verification experiments, we can not independently carry out follow-up verification against the suspected gene fragments, and it is necessary to cooperate with other specialized biological laboratories.

Discussion

Without any preceding training process, the SASR method based on the TP property of the coding region provides a visualized presentation of unannotated protein-coding regions in DNA sequences, which implements the prediction of the coding regions in the DNA sequence. According to the visualization of the DNA sequence, the starting and ending positions of the suspected gene fragment are manually determined with certain errors. The error range can be controlled within 100 bases. Since the DNA base sequence has a large number of bases, mostly in the order of 10,000 and 100,000 digits, the magnitude of the error can reach 10−3 or even smaller, which can be accepted. The results of our work provide great reference value for biological experimental workers to identify protein coding regions in the whole-genome sequences data of Sphingomonas sp. WG, and it will greatly reduce their workload and improve their efficiency. In the follow-up work, we hope to cooperate with biological experimental workers to find real genes with protein-coding functions in unlabeled suspected gene fragments through biological experiments and to make gene function annotations, and further develop an efficient new algorithm that can extract the numerical results of the coding region prediction from the SASR’s graphical output, instead of manual segmentation, thereby improving the accuracy of the location of suspected gene segments.

DNA sequence dataset to replicate the analyses.

The dataset includes: J8VWM6 (a segment of DNA is downloaded from the UniProt database and the length of it is 5244 bps); 5000test(an artificial DNA sequence generated randomly); 5000test-2000-7243 (the DNA sequence J8VWM6 is inserted into a randomly generated artificial DNA sequence, with the insertion position starting from 2000 to 7243); Scaffold21 (a segment of DNA of the Sphingomonas sp. WG’s whole-genome data); 21-21000-24250 (the base sequence fragment whose position starting from 21000 to 24250 in No. 21 scaffold). (ZIP) Click here for additional data file.
  41 in total

1.  Ab initio gene identification: prokaryote genome annotation with GeneScan and GLIMMER.

Authors:  Gautam Aggarwal; Ramakrishna Ramaswamy
Journal:  J Biosci       Date:  2002-02       Impact factor: 1.826

Review 2.  Comparative genomics: genome-wide analysis in metazoan eukaryotes.

Authors:  Abel Ureta-Vidal; Laurence Ettwiller; Ewan Birney
Journal:  Nat Rev Genet       Date:  2003-04       Impact factor: 53.242

3.  Gene prediction by spectral rotation measure: a new method for identifying protein-coding regions.

Authors:  Daniel Kotlar; Yizhar Lavner
Journal:  Genome Res       Date:  2003-07-17       Impact factor: 9.043

4.  Location of the O-acetyl group in welan by the reductive-cleavage method.

Authors:  J D Stankowski; S G Zeller
Journal:  Carbohydr Res       Date:  1992-02-07       Impact factor: 2.104

5.  Identifying bacterial genes and endosymbiont DNA with Glimmer.

Authors:  Arthur L Delcher; Kirsten A Bratke; Edwin C Powers; Steven L Salzberg
Journal:  Bioinformatics       Date:  2007-01-19       Impact factor: 6.937

6.  [Classification of triplet periodicity in DNA sequences of genes taken from KEGG databank].

Authors:  F E Frenkel'; E V Korotkov
Journal:  Mol Biol (Mosk)       Date:  2008 Jul-Aug

7.  Spiking Neural P Systems With Learning Functions.

Authors:  Tao Song; Linqiang Pan; Tingfang Wu; Pan Zheng; M L Dennis Wong; Alfonso Rodriguez-Paton
Journal:  IEEE Trans Nanobioscience       Date:  2019-02-01       Impact factor: 2.935

8.  Eukaryotic gene prediction using GeneMark.hmm-E and GeneMark-ES.

Authors:  Mark Borodovsky; Alex Lomsadze
Journal:  Curr Protoc Bioinformatics       Date:  2011-09

9.  Microbial gene identification using interpolated Markov models.

Authors:  S L Salzberg; A L Delcher; S Kasif; O White
Journal:  Nucleic Acids Res       Date:  1998-01-15       Impact factor: 16.971

10.  Hidden Markov Models and their Applications in Biological Sequence Analysis.

Authors:  Byung-Jun Yoon
Journal:  Curr Genomics       Date:  2009-09       Impact factor: 2.236

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.