Zhongwei Li1, Yanan Guan1, Xiang Yuan1, Pan Zheng2, Hu Zhu3. 1. College of Computer and Communication Engineering, China University of Petroleum, Qingdao, Shandong, China. 2. Department of Accounting and Information Systems, University of Canterbury, Christchurch, New Zealand. 3. College of Chemistry and Materials, Fujian Normal University, Fuzhou, China.
Abstract
Identifying protein coding regions in DNA sequences by computational methods is an active research topic. Welan gum produced by Sphingomonas sp. WG has great application potential in oil recovery and concrete construction industry. Predicting the coding regions in the Sphingomonas sp. WG genome and addressing the mechanism underlying the explanation for the synthesis of Welan gum metabolism is an important issue at present. In this study, we apply a self adaptive spectral rotation (SASR, for short) method, which is based on the investigation of the Triplet Periodicity property, to predict the coding regions of the whole-genome data of Sphingomonas sp. WG without any previous training process, and 1115 suspected gene fragments are obtained. Suspected gene fragments are subjected to a similarity search against the non-redundant protein sequences (nr) database of NCBI with blastx, and 762 suspected gene fragments have been labeled as genes in the nr database.
Identifying protein coding regions in DNA sequences by computational methods is an active research topic. Welan gum produced by Sphingomonas sp. WG has great application potential in oil recovery and concrete construction industry. Predicting the coding regions in the Sphingomonas sp. WG genome and addressing the mechanism underlying the explanation for the synthesis of Welan gum metabolism is an important issue at present. In this study, we apply a self adaptive spectral rotation (SASR, for short) method, which is based on the investigation of the Triplet Periodicity property, to predict the coding regions of the whole-genome data of Sphingomonas sp. WG without any previous training process, and 1115 suspected gene fragments are obtained. Suspected gene fragments are subjected to a similarity search against the non-redundant protein sequences (nr) database of NCBI with blastx, and 762 suspected gene fragments have been labeled as genes in the nr database.
Genetic information is a set of general instructions that directs the translation from DNA to proteins. The vast majority of life on the earth stores genetic information in DNA sequences (some viruses store genetic information in RNA sequences). The information carried by DNA is expressed as proteins to construct cell components and perform genetic instructions for life [1]. Gene is a nucleotide sequence that can encode a substance with a certain biological function, which is the main carrier of the genetic inheritance of biological traits carrying protein information. The coding sequences of eukaryotic genes are not continuously arranged on the DNA molecule but are separated by non-coding introns, and the synthesis of protein is guided by the coding exons. Therefore, after a given genomic sequence, it is one of the central issues in bioinformatics to correctly identify the range of protein coding region in the DNA sequence and the precise position in the genomic sequence [2, 3].Training the parameters of the biological signal model with a training set of known gene structures is an effective method for gene prediction. In general, computational methods for gene prediction can be generally categorized into three classes: (1)similarity-based prediction methods, (2) ab into techniques based on statistical models, and (3) machine learning-based methods [4-8]. For (1) and (2), various algorithms, including the Dynamic Programming (DP) or the Hidden Markov Model (HMM), are utilized to process the information gathered from these methods and subsequently predict the potential coding region of the genomic sequence [9]. Merging biological knowledge with computational techniques, machine learning aims to build a predictive model by learning the difference between coded and non-coded regions and use the learned model to predict the coding regions in the DNA sequence. The Hidden Markov Model (HMM) is the foundation of many current gene recognition algorithms [10-14]. The Hidden Markov Model considers the DNA sequence as a random process and automatically finds its internal hidden rules based on the difference in the frequency of nucleotide selection between the encoded and non-encoded DNA sequences. There are many gene identification softwares developed based on the HMM model, for example, Augustus, GeneMark.hmm [15], GENSCAN [16-20], GENIE [21], TWINSCAN [22], DOUBLESCAN [23] and Glimmer [24-28]. Among them, Augustus which is one of the most accurate gene prediction programs, adopted the Viterbi algorithm to define a probability distribution in the DNA sequence and gene structure [29-31]. After training the model with the appropriate training set, they predict the coding area with extremely high accuracy. However, training models that depend on known sequences greatly limit the adaptability of these methods, especially for new sequences from unknown organisms with no or small training sets. Therefore, in the absence of available additional information (training set), it is essential to develop some new methods to predict the locations of coding regions without any training process.Deoxyribonucleotide (DNA) is a chain structure arranged in a certain order by four bases of A, G, T, and C. Protein is composed of polypeptide chains that are formed by 20 different amino acids. Encoding an amino acid in a protein, the genetic code requires reading 3 bases of the gene. Because of this coding relationship, the triplet of bases is called a codon. It is noted that, in most species, synonymous codons are used with different frequencies (known as codon bias) and the order with which codons are used for one protein is far from random, which raises a universal property in coding regions [32-34], called the “Triplet Periodicity (TP)”. The TP property of genes is considered to be a simple and universal distinction between coding and non-encoding regions, and studying the application of it may be an effective method to predict gene coding regions and solve other genetic problems [35-39].In 1982, Fickett first proposed the TP property [40]. After his work, various theoretical tools were applied by researchers to investigate the TP property, such as the hidden Markov chains [41-44], the Fourier transform [17, 18], Neural Networks [45, 46], the information theory [32, 35], the time series [47-50]. Based on the Fourier transform, a method called spectral content measurement (SCM) was developed to study the TP property by Tiwari et al [18]. Since researchers have expanded and improved Tiwari’s original method in many ways, a series of methods have been developed from the original SCM [51-54]. By optimizing four coefficients in the sequence mapping and the short time Fourier transform, Anastassiou [55] proposed the optimized spectral content measure (OSCM) to compute TP property of genes. Based on the information derived from the magnitude of the discrete Fourier transform, Kotlar and Lavner [54] proposed a spectral rotation measurement (SRM) method that uses the information of the phase component such as the phase angles’ expected values and variances, to maximize the differentiation between protein coding and non-coding regions. Marhon and Sajid [56] proposed a spectrum-based technique that uses a dynamic representation scheme to map DNA sequences into a numerical form. And a post-processing method was proposed to detect the period-3 peaks instead of an empirical threshold value.However, most high-performance methods based on TP require known genetic data from the target organism or homologous sequence data for training. Relying on training, such as the HMM-based methods, restricts the application of methods on unknown organisms, which can not offer known genetic data and homologous biological data. In addition, most of SCM related methods employ the moving slide window to investigate the TP properties for local sections of the gene sequence, and then find potential coding regions in the sequence [40, 42]. The sensitivity of these methods is highly dependent on the size of the sliding window. So, how to choose the appropriate size of the moving slide window is also one of the key issues in determining accuracy.The research object of our work is the whole-genome sequences data of Sphingomonas sp. WG with independent intellectual property rights. Due to the lack of known genetic data and homologous biological data, the prediction methods depending on the training process cannot be employed for the whole-genome sequences data of Sphingomonas sp. WG. Chen et al. proposed a method called Adaptive Spectral Rotation (SASR) [57], which is based on the TP properties of the coding region to visualize the DNA sequence without any training process and extra information, realizing the prediction of the gene coding region. When there is insufficient training set or no extra information available, the SASR method is helpful for identifying protein coding regions on unacquainted DNA sequences. Therefore, in our work, we used the SASR method to visualize the coding region of Sphingomonas sp. WG’s whole-genome data without any training process, and the code of the SASR method is obtained from Chen’s paper.Suspected gene fragments are obtained by manually distinguishing the position of the coding region. With comparing suspected gene fragments with the known gene in the NCBI database, we can make judgments about which the suspected gene fragments have been labeled as genes in the NCBI database. Besides, there is a high probability that real genes newly discovered in the whole-genome sequences data of Sphingomonas sp. WG will be found in the unlabeled suspected gene fragments.
Materials and methods
The Sphingomonas sp. WG’s whole genome data
Welan gum is an extracellular polysaccharide produced by aerobic fermentation of Sphingomonas sp [58]. With interesting rheological properties, Welan gum has been widely used as a stabilizing, suspending, emulsifying, and thickening agent in several areas such as coating materials, food, medicine, concrete additives, and enhanced oil recovery [59, 60]. Welan gum has become the hot spot of science research. In the petroleum fields, Welan gum is a new type of biological oil-displacing agent, which is of great value in tertiary oil recovery in the oil fields [61, 62]. With the deepening of the research on biochemical properties of Welan gum, its industrial value has been continuously developed. In the field of oil and natural gas extraction, Welan gum has shown great market value as an excellent tertiary oil-displacement agent.In order to increase the yield of Welan gum and create more commercial profits, it is necessary to analyze the whole-genome sequences data of Sphingomonas sp. WG, a producer of Welan gum, and explore the biological mechanism of Welan gum synthetic route. The Bioengineering and Technology Center, Chinese University of Petroleum (East China), screened a strain of Sphingomonas sp. WG, a high-yielding strain of Welan gum, from the sea mud of Jiaozhou Bay, Qingdao. The whole-genome sequences data of Sphingomonas sp. WG was obtained by whole genome sequencing. The genomic data contains 31 scaffold sequences, 4,042,223 base pairs (bps), and the GC content of 65.88%.
SASR
The TP profile of the base sequence is represented by a triple periodic matrix (TPM), which presented by Frenkel and Korotkov [32, 33]. The TPM is a 3 × 4-sized matrix, each row i (i = 1, 2, 3, 4) stands for a nucleotide base # (# = A, T, C or G), each column stands for a position j (j = 1, 2, 3) in the period and the entry m is the count by which the base i appears at the position j. In the SASR, for a certain base sequence X = {x | t = 1, 2, 3, …, N}, the posterior subsequence of base sequence X at position t0 is expressed as P(t0) = {x | t0 < t ≤ N}. The TP sequence, converted from the base sequence X, is represented by a sequence of TP vectors S(X) = {s | t = 1, 2, 3, …, N}, and TP verctor . That is, for each position t, the TP vector for each base # of the posterior subsequence P(t) of the current position is calculated, ie, M(P(t)), M(P(t)), M(P(t)) and M(P(t)), and s is selected from them, according to the base at the position t. The TPM of the posterior subsequence at each position t is calculated by recursively computing M #(P(t)) from M #(P(t + 1)), with the initial value and the recurrence formula:Remarkably, P(t + 1) is the sequence of the posterior subsequence of P(t). The operation “V >> n” means that the right circular shift operation is performed n times on the triplet row vector V.The TP walk is a movement trajectory in the complex plane generated from the TP sequence. The moving trajectory is represented by the sequence W = {W | t = 0, 1, 2, …, N}, with the initial value W0 = {0, 0, 0}, and for each step t>0:The function L(x) maps the vector s = {Z1, Z2, Z3} into a complex number by:The process of converting a DNA sequence into a TP vector and then generating a TP walk is called a SASR process. The recurrence equation means that, for each step t, the unit length is moved toward the corresponding complex number of the TP vector, in the complex plane. Therefore, TP walk can provide a good visualization of the TP properties in the complex plane. For the coding region (the region with the TP property), TP walk shows clearly and certain movement trends, and for the non-coding area, TP walk moves randomly around the stable points with insignificant movement trends. Stated thus, the difference in the visualization of the TP properties can be exploited as a basis for distinguishing the coding area from the non-coding area.
Data experiments
Verify the reliability and validity of the SASR method
In this work, we first apply SASR to the known gene coding regions and non-coding regions to verify the reliability and validity of the SASR method. Fig 1 shows the TP walk result of the coding region (No. J8VWM6), which encodes the proteins of Sphingomonas sp. LH128 partial outer membrane autotransporter barrel. The gene data is downloaded from the UniProt database and the length of it is 5244 bps. Fig 2 shows the TP walk result of an artificial DNA sequence generated randomly, whose length is 5000 bps. In Fig 1(, TP walk moves 5244 steps in the complex plane, moving rightward from zero (0, 0) to around (1600, -100), but in Fig 2(, after TP Walk moves 5000 steps in the complex plane, the walking result is randomly distributed near the zero with no specific direction. In Fig 1(, the real part value keeps increasing with the increase of the t value, and the imaginary part remains relatively constant. However, in Fig 2(, the real part and the imaginary part change more freely, and there is no fixed change pattern.
Fig 1
The TP walk of the coding region downloaded from the UniProt database.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).
Fig 2
The TP walk of the randomly generated base sequence.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).
The TP walk of the coding region downloaded from the UniProt database.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).
The TP walk of the randomly generated base sequence.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).To further verify the validity of SASR, the Sphingomonas gene in the above is inserted into a randomly generated artificial DNA sequence, with the insertion position starting from 2000 to 7243, to generate a new base sequence that includes both a coding region and non-coding regions. Then using the SASR method, we get the visualization of the TP properties of the base sequence (Fig 3) in the complex plane.
Fig 3
The TP walk of the new base sequence that includes both a coding region and non-coding regions.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).
The TP walk of the new base sequence that includes both a coding region and non-coding regions.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).In Fig 3(, in the vicinity of the zero point and the end of the walk (the position of the artificial DNA sequence), the TP walk moves randomly with indefinite direction of movement, however, in the middle part (the position of inserted Sphingomonas gene), the TP walk gradually moves to the right. Meanwhile, in Fig 3(, it is obvious that roughly from 2000 to 7000, with the growth of the t value, the real part is gradually increasing, while the artificial DNA sequence part has a smaller change in the real part value. Similar observations have been obtained after applying the SASR method to other known coding regions and non-coding regions. The reliability and validity of the SASR method have been verified through the above experiments. It can be seen that there is a large difference between the coding and non-coding regions in the graphic output of TP Walk. Therefore, after applying the SASR method to the base sequence being measured, by observing the change of the TP Walk’s real part and referring to the trend of TP walking in the complex plane, the coding and non-coding areas can be distinguished without any training process.
Predicting the protein-coding regions of the Sphingomonas sp. WG’s whole genome data
The SASR method is applied to the 31 scaffolds of the Sphingomonas sp. WG’s whole-genome data respectively to predict probable coding regions. Here, taking the processing of No. 21 scaffold as an example to illustrate the prediction processing of the possible coding regions. First, the SASR method is applied to No. 21 scaffold (the sequence length is 43986), and the visualization of the TP properties is obtained (Fig 4).
Fig 4
The TP walk of scaffold 21.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).
The TP walk of scaffold 21.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).According to the difference in the graphic output between the coding and non-coding regions, the fragment of the base sequence that corresponds to the characteristics of the coding region is identified, which is called a suspected gene fragment, and the specific position of the fragment is identified. Since one base sequence may contain multiple segments of coding regions, a plurality of base segments that conform to the characteristics of the coding region can be found in Fig 4, such as 2000-7000, 16000-20000 and so on. Because the number of bases is large, the trend of the real part of the partial TP walk may not be obvious. So, it is possible to cut out the invisible part of the fragment and use the SASR method again to obtain the position of the base fragment that corresponds to the coding region. Based on the position of the suspected gene fragment in No. 21 scaffold, the base sequence is divided using Matlab software to obtain multiple suspected gene fragments. Finally, the obtained suspected gene fragments are compared with the known gene in the NCBI nr database with blastx.
Results
The graphic output of the base sequence fragment, whose position starting from 21000 to 24250 (Simply expressed as 21-21000-24250, and other suspected gene fragments are expressed in the same way) in No. 21 scaffold, meets the characteristics of the coding region (Fig 5). The suspected gene fragment is compared with the known gene in the NCBI nr database using Blast software. The basic information of this sequence alignment please refer to the Table 1, such as the type of molecule, the length of the sequence, the name of the database to be compared, and the program used, etc. And the results of sequence alignment are shown in Fig 6 and Table 2.
Fig 5
The TP walk of the suspected gene fragment 21-21000-24250.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).
Table 1
The basic information of the comparison result of the suspected gene fragment 21-21000-24250.
Item
Content
RID
SETZNMME01R
Query ID
lcl—Query_217685
Description
None
Database Name
nr
Description
All non-redundant GenBank CDS translations+ PDB+ SwissProt+ PIR+ PRF excluding environmental samples from WGS projects
Program
BLASTX 2.8.0+
Molecule type
nucleic acid
Query Length
3251
Fig 6
Graphic summary of the sequence alignment result of the suspected gene fragment 21-21000-24250.
(a) Similarity color diagram, in which the similarity is arranged from high to low: red, purple, green, blue, black. (b) The suspected gene fragment has the highest degree of similarity and the best matching degree to the first gene sequence.
Table 2
The first 8 detailed alignment results of the suspected gene fragment 21-21000-24250 and different sequences in the database.
Five metrics are used to evaluate the match results of the query sequence: Max score, Total score, Query coverage, E value, and Ident.
PAS domain S-box protein [Sphingomonas spermidinifaciens]
1523
1523
96%
0.0
79%
PAS domain S-box-containing protein [Sphingomonas jatrophae]
1269
1269
96%
0.0
66%
PAS domain-containing sensor histidine kinase [Sphingomonas jatrophae]
1052
1259
83%
0.0
66%
PAS domain S-box protein [Phenylobacterium deserti]
812
1118
90%
0.0
69%
PAS domain S-box protein [Asticcacaulis sp. CF398]
744
1028
84%
0.0
51%
PAS domain S-box protein [Methylobacterium platani]
699
941
94%
0.0
55%
The TP walk of the suspected gene fragment 21-21000-24250.
(a) Plot the walk trace in the complex plane. (b) Plot the real part (red) and imaginary part (green).
Graphic summary of the sequence alignment result of the suspected gene fragment 21-21000-24250.
(a) Similarity color diagram, in which the similarity is arranged from high to low: red, purple, green, blue, black. (b) The suspected gene fragment has the highest degree of similarity and the best matching degree to the first gene sequence.
The first 8 detailed alignment results of the suspected gene fragment 21-21000-24250 and different sequences in the database.
Five metrics are used to evaluate the match results of the query sequence: Max score, Total score, Query coverage, E value, and Ident.In Fig 6, the color keys of different colors represent the magnitude of similarity. The more red regions, the higher the similarity of the fragments that are matched in the database. Simultaneously, the matched gene sequences are ranked by the matching scores from large to small, so the suspected gene fragment has the highest degree of similarity and the best matching degree to the first gene sequence. In Table 2, two of the most noteworthy values are the ident value and the E value. The ident value indicates the degree of similarity between the aligned sequence and the target sequence, that is, the number of bases on the match as a percentage of the total sequence length. The E value indicates the possibility of random matching. The greater the E value, the greater the likelihood of random matching. When the E value is close to zero or zero, it can be considered as an exact match. Table 2 shows, the sequence alignment results of the suspected gene fragment 21-21000-24250 and the alignment of hybrid sensor histidine kinase/response regulator [Sphingomonas] is the best one, whose ident value is 100% and the E value is 0, which can be considered as an exact match.To further verify the reliability of the sequence alignment results, the PSI-BLAST program was used to search for similar sequences of the suspected gene fragments 21-21000-24250 by multiple iterations in the database (the threshold is 0.001). There are several highly similar protein sequences in the summary table (hits reported in the 1^st iteration with E value is 0.0 and sequence identity > 30%; see Table 3 for details). The newly added sequences that were below the threshold in the previous search are indicated as “new” and “old” indicates that the sequence has been searched before this iteration. Table 3 contains the sequence alignment information of the first 5 sequences in the search results of the previous eight iterations. It can be seen that from the third iteration, although the order of the sequences in the search results is not exactly the same, the sequences with high similarity are still in front. After several iterations, the searched new sequence is gradually reduced, and when no new sequences are detected below the defined threshold, the iterative process is terminated. Therefore, through multiple iterative searches of PSI-BLAST, we can find that the suspected gene fragment 21-21000-24250 has distant sequence similarity with the sequence, whose entry name is A0A1A3QEF8_MYCSZ in UniProtKB, and can speculate on the possible structure and function of the protein compiled by the suspected gene fragment.
Table 3
The PSI-BLAST search results of the suspected gene sequence 21-21000-24250.
It contains the sequence alignment information of the first 5 sequences in the search results of the previous eight iterations. The similar sequences in search results sort descending based on score values.
iteration
Alignment
DB:ID
Length
Score(Bits)
Identities
Positives
E()
1^st
1 new
TR:W4ZEQ5_STRPU
7246
1156.0
32.0
32.0
0.0
2 new
TR:W4YYP3_STRPU
11309
1053.0
28.0
33.0
0.0
3 new
TR:B2HHB3_MYCMM
3634
962.0
33.0
35.0
0.0
4 new
TR:A0A2N2JNV2_9DELT
11114
946.0
27.0
30.0
0.0
5 new
TR:A0A0D2WHT8_CAPO3
7350
914.0
24.0
30.0
0.0
2^st
1 old
TR:B2HHB3_MYCMM
3634
2873.0
30.0
32.0
0.0
2 old
TR:A0A1X1REZ8_9MYCO
4053
2804.0
27.0
28.0
0.0
3 old
TR:A0A1A3KX53_MYCGO
3739
2791.0
27.0
29.0
0.0
4 old
TR:A0A1A3QEF8_MYCSZ
4843
2784.0
27.0
29.0
0.0
5 old
TR:A0A1X1SZH0_9MYCO
3642
2759.0
27.0
29.0
0.0
3^st
1 old
TR:A0A1A3QEF8_MYCSZ
4843
3685.0
25.0
27.0
0.0
2 old
TR:A0A1X1REZ8_9MYCO
4053
3684.0
26.0
27.0
0.0
3 old
TR:A0A1A3KX53_MYCGO
3739
3653.0
26.0
27.0
0.0
4 old
TR:A0A1A3NH48_MYCAS
5961
3648.0
25.0
27.0
0.0
5 old
TR:A0A1X1WR67_MYCGO
5961
3636.0
25.0
26.0
0.0
4^st
1 old
TR:A0A1A3QEF8_MYCSZ
4843
4076.0
25.0
27.0
0.0
2 old
TR:A0A1X1REZ8_9MYCO
4053
4052.0
24.0
26.0
0.0
3 old
TR:A0A1A3NH48_MYCAS
5961
4028.0
23.0
25.0
0.0
4 old
TR:A0A1X1WR67_MYCGO
5961
4018.0
24.0
26.0
0.0
5 old
TR:A0A1A3KPQ5_MYCAS
4945
4010.0
24.0
26.0
0.0
5^st
1 old
TR:A0A1A3QEF8_MYCSZ
4843
4188.0
24.0
25.0
0.0
2 old
TR:A0A1X1REZ8_9MYCO
4053
4184.0
24.0
25.0
0.0
3 old
TR:A0A1A3NH48_MYCAS
5961
4183.0
23.0
25.0
0.0
4 old
TR:A0A1X1WR67_MYCGO
5961
4163.0
24.0
26.0
0.0
5 old
TR:A0A0Q2QJK1_MYCGO
6569
4142.0
24.0
26.0
0.0
6^st
1 old
TR:A0A1A3QEF8_MYCSZ
4843
4205.0
22.0
23.0
0.0
2 old
TR:A0A1A3NH48_MYCAS
5961
4197.0
23.0
25.0
0.0
3 old
TR:A0A1X1WR67_MYCGO
5961
4234.0
23.0
25.0
0.0
4 old
TR:A0A1A3NH48_MYCAS
5961
4226.0
24.0
25.0
0.0
5 old
TR:A0A0Q2QJK1_MYCGO
6569
4213.0
23.0
24.0
0.0
7^st
1 old
TR:A0A1A3QEF8_MYCSZ
4843
4270.0
22.0
23.0
0.0
2 old
TR:A0A1X1REZ8_9MYCO
4053
4240.0
23.0
24.0
0.0
3 old
TR:A0A1X1WR67_MYCGO
5961
4181.0
23.0
25.0
0.0
4 old
TR:A0A1X1REZ8_9MYCO
4053
4177.0
23.0
25.0
0.0
5 old
TR:A0A0Q2QJK1_MYCGO
6569
4164.0
23.0
24.0
0.0
8^st
1 old
TR:A0A1A3QEF8_MYCSZ
4843
4293.0
22.0
24.0
0.0
2 old
TR:A0A1X1REZ8_9MYCO
4053
4271.0
23.0
24.0
0.0
3 old
TR:A0A1A3NH48_MYCAS
5961
4263.0
22.0
24.0
0.0
4 old
TR:A0A1X1WR67_MYCGO
5961
4209.0
23.0
24.0
0.0
5 old
TR:A0A0Q2QJK1_MYCGO
6569
4024.0
22.0
24.0
0.0
The PSI-BLAST search results of the suspected gene sequence 21-21000-24250.
It contains the sequence alignment information of the first 5 sequences in the search results of the previous eight iterations. The similar sequences in search results sort descending based on score values.The above can prove that by processing the No. 21 scaffold base sequence using the SASR method, the coding region of No. 21 scaffold is found, and the coding region has been labeled as genes in the NCBI database.After the SASR method is applied to predict the coding regions of the 31 scaffolds of the Sphingomonas sp. WG’s whole genome data, 1115 suspected gene fragments are obtained in total by slicing the base sequences. These results can, to a certain extent, prove that these suspected gene segments have been labeled as genes or there are gene sequences with high similarity to suspected gene fragments in NCBI database. So it can be considered that they are base sequences with the function of protein coding, further illustrating the reliability and validity of the SASR method.There are 353 suspected gene fragments not matched against any known gene sequences in the NCBI database. On the basis of that the reliability and validity of the SASR method have been verified, we can consider that the 353 suspected gene fragments are newly discovered suspected gene fragments with the function of protein coding that have not been included in the NCBI database in high probability. But, whether or not the 353 suspected gene fragments are truly gene sequences with the function of protein coding, and what their corresponding biological functions are, it is necessary to do corresponding biological experiments to further verify. However, due to differences in specialized fields, the lack of relevant biological theory knowledge, biological experimental procedures, and professional equipment, and the too high cost of manpower and material resources to complete biometric verification experiments, we can not independently carry out follow-up verification against the suspected gene fragments, and it is necessary to cooperate with other specialized biological laboratories.
Discussion
Without any preceding training process, the SASR method based on the TP property of the coding region provides a visualized presentation of unannotated protein-coding regions in DNA sequences, which implements the prediction of the coding regions in the DNA sequence. According to the visualization of the DNA sequence, the starting and ending positions of the suspected gene fragment are manually determined with certain errors. The error range can be controlled within 100 bases. Since the DNA base sequence has a large number of bases, mostly in the order of 10,000 and 100,000 digits, the magnitude of the error can reach 10−3 or even smaller, which can be accepted. The results of our work provide great reference value for biological experimental workers to identify protein coding regions in the whole-genome sequences data of Sphingomonas sp. WG, and it will greatly reduce their workload and improve their efficiency. In the follow-up work, we hope to cooperate with biological experimental workers to find real genes with protein-coding functions in unlabeled suspected gene fragments through biological experiments and to make gene function annotations, and further develop an efficient new algorithm that can extract the numerical results of the coding region prediction from the SASR’s graphical output, instead of manual segmentation, thereby improving the accuracy of the location of suspected gene segments.
DNA sequence dataset to replicate the analyses.
The dataset includes: J8VWM6 (a segment of DNA is downloaded from the UniProt database and the length of it is 5244 bps); 5000test(an artificial DNA sequence generated randomly); 5000test-2000-7243 (the DNA sequence J8VWM6 is inserted into a randomly generated artificial DNA sequence, with the insertion position starting from 2000 to 7243); Scaffold21 (a segment of DNA of the Sphingomonas sp. WG’s whole-genome data); 21-21000-24250 (the base sequence fragment whose position starting from 21000 to 24250 in No. 21 scaffold).(ZIP)Click here for additional data file.
Authors: Tao Song; Linqiang Pan; Tingfang Wu; Pan Zheng; M L Dennis Wong; Alfonso Rodriguez-Paton Journal: IEEE Trans Nanobioscience Date: 2019-02-01 Impact factor: 2.935