
Data on the application of the molecular vector machine model: A database of protein pentafragments and computer software for predicting and designing secondary protein structures.

Vladimir Karasev

Abstract

Based on ideas about the molecular vector machine of proteins [1], a database of protein pentafragments has been created and algorithms have been proposed for predicting the secondary structure of proteins according to their primary structure and for designing the primary protein structure for a given secondary structure that it takes on. A comprehensive software suite (Predicto @ Designer) has been developed using the pentafragments database and the said algorithms. For the proteins used to create the pentafragments database, a high accuracy (close to 100%) in predicting the secondary protein structure as well as good prospects for its use for designing secondary structures of proteins have been demonstrated.
© 2019 The Author(s).

Keywords:  Database of protein pentafragments; Molecular vector machine; Software for predicting and designing the secondary protein structure

Year:  2019        PMID: 31871975      PMCID: PMC6911939          DOI: 10.1016/j.dib.2019.104815

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409



Data

In this paper, software based on the model [1] is described. The process of predicting secondary protein structure is described in the patent [2]. An example of a prediction result is given in Table 1, A (a fragment of porcine myoglobin [3]). It illustrates that the whole fragment under consideration can be predicted as a sequence of 10-digit numbers. Comparison with structural experimental data [4], visualized with the “Protein 3D” software [5], showed that the software predicts this structure correctly (Fig. 1).
Table 1

Predicting secondary myoglobin structure without correction (A) and with correction based on the replacement of amino acids in pentafragments (B).

A. Without correction: Pig (Pig without coorection.dbkx) | A. Without correction: Alligator (ALLIGAT without coorection.dbkx) | B. Correction based on the replacement of amino acids: Alligator (ALLIGAT amino acid correction.dbkx)
141 XXX D Asp 1111111111 | 142 XXX D Asp 1111111111 | 142 XXX D Asp 1111111111
140 XXX N Asn 1111111111 | 141 XXX N Asn 1111111111 | 141 XXX N Asn 1111111111
139 XXX R Arg 1111111111 | 140 XXX R Arg 1111111111 | 140 XXX R Arg 1111111111
138 XXX F Phe 1111111111 | 139 XXX F Phe 1111111111 | 139 XXX F Phe 1111111111
137 XXX L Leu 1111111111 | 138 XXX L Leu 1111111121 | 138 XXX L Leu 1111111111
136 XXX E Glu 1111111111 | 137 XXX E Glu | 137 XXX E Glu 1111111111
135 XXX L Leu 1111111111 | 136 XXX L Leu | 136 XXX L Leu 1111111111
134 XXX A Ala 1111111111 | 135 XXX A Ala | 135 XXX A Ala 1111111111
133 XXX K Lys 1111111111 | 134 XXX K Lys | 134 XXX K Lys 1111111111
132 XXX S Ser 1111111111 | 133 XXX R Arg | 133 XXX R Arg 1111111111 ASN
131 XXX M Met 1111111101 | 132 XXX M Met | 132 XXX M Met 1111111101
130 XXX A Ala 1111110101 | 131 XXX A Ala | 131 XXX A Ala 1111110101
129 XXX G Gly 1111010101 | 130 XXX A Ala | 130 XXX A Ala 1111010101 GLY
128 XXX Q Gln 1101010101 | 129 XXX Q Gln | 129 XXX Q Gln 1101010101
127 XXX A Ala 0101010110 | 128 XXX S Ser | 128 XXX S Ser 0101010110 ALA
126 XXX D Asp 0101011000 | 127 XXX D Asp | 127 XXX D Asp 0101011030
125 XXX A Ala 0101100000 | 126 XXX A Ala | 126 XXX A Ala 0101103000
124 XXX G Gly 0110000010 | 125 XXX G Gly | 125 XXX G Gly 0110300000
123 XXX F Phe 1000001011 | 124 XXX F Phe | 124 XXX F Phe 1030000012
122 XXX D Asp 0000101110 | 123 XXX D Asp | 123 XXX D Asp 3000001210
121 XXX G Gly 0010111010 | 122 XXX A Ala 0200000000 | 122 XXX A Ala 0000121010 GLY
120 XXX P Pro 1011101011 | 121 XXX P Pro 0000000000 | 121 XXX P Pro 0012101010
119 XXX H His 1110101111 | 120 XXX Y Tyr | 120 XXX Y Tyr 1210101011 HIS
118 XXX K Lys 1010111111 | 119 XXX K Lys 0000000000 | 119 XXX K Lys 1010101111 ARG
117 XXX S Ser 1011111111 | 118 XXX E Glu | 118 XXX E Glu 1010111111 SER
116 XXX Q Gln 1111111111 | 117 XXX A Ala 0000000000 | 117 XXX A Ala 1011111111 HIS
115 XXX L Leu 1111111111 | 116 XXX I Ile | 116 XXX I Ile 1111111111 LEU
114 XXX V Val 1111111111 | 115 XXX V Val | 115 XXX V Val 1111111111

Bold indicates substitutions of amino acids in the polypeptide chain at which the prediction in column B occurs. The substituted amino acids used are shown in this column to the right.

Fig. 1

Fragment 114–141 of the polypeptide chain of porcine myoglobin [4].

Correction of prediction. Since our approach uses a digital description of pentafragment conformations, replacement of a single amino acid affects prediction accuracy, which is a disadvantage of this method. If some pentafragment is missing from the database for any reason, a gap in the structure is predicted, which is clearly seen in Table 1, A for the alligator myoglobin fragment [5]. However, this disadvantage can be rectified by the correction methods that we have developed [6]. The most interesting among them is the method of amino acid replacement (see the section "Prediction correction method by replacement of amino acids"): when a pentafragment is not found, the search for it is replaced with a search for a pentafragment of similar structure in which one amino acid is changed. The results given by this method are shown for alligator myoglobin, whose primary structure was determined in Ref. [7]. Whereas the results in the middle column of Table 1, to which correction was not applied, show quite low prediction accuracy, the region with amino acids from 115 to 138 (Table 1) was predicted completely as a result of applying this method.

Experimental design, materials, and methods

Creating the database of protein pentafragments

Text files describing hydrogen bonds in the secondary structure of proteins were obtained on the basis of about 2333 PDB-files of the Protein Data Bank (subunits – 2446). The list of proteins is given in the appendix. With the help of the Protein 3D program developed by us [5] (the program is free to download), these files were processed in a step-by-step fashion using mini-programs with a view to obtaining and sorting pentafragments. The steps are listed below.

Obtaining text files

Open the source PDB file using the Protein 3D program. The Rendering icon in the CIHBS settings submenu will show us the type of protein with a specification of its hydrogen bond systems. Next, in the CIHBS icon, check the box against the line item named Trace in memory. Open the bond types table from the Select bond types line item using the dropdown arrow, check the boxes against the NiH … Oi–3 and NiH … Oi–4 bonds, and uncheck the Show all line item. Next, click on the Show selected bonds line item and click OK. This will open a window with information about the H-bonds of the protein. After clicking the Save links button, we will get a text file with a description of these links. Table 2, A shows a sample fragment from a 1MWC text file (Sus scrofa myoglobin).
Table 2

Individual stages of obtaining the pentafragments to be inserted in the database.

A. Fragment from a text file (1MWD text file.txt)
114 VAL
114 VAL N - 110 ALA O
114 VAL O - 118 LYS N
115 LEU
115 LEU N - 111 ILE O
115 LEU O - 119 HIS N
116 GLN
116 GLN N - 112 ILE O
116 GLN O - 120 PRO N
117 SER
117 SER N - 113 GLN O
118 LYS
118 LYS N - 114 VAL O
119 HIS
119 HIS N - 115 LEU O
119 HIS O - 123 PHE N
120 PRO
120 PRO N - 116 GLN O
121 GLY
122 ASP
123 PHE
123 PHE N - 119 HIS O
124 GLY
124 GLY O - 128 GLN N
125 ALA
125 ALA O - 129 GLY N

B. Fragment from an inverted text file (inv_1MWD inverted text file.txt)
125 ALA O - 129 GLY N
125 ALA
124 GLY O - 128 GLN N
124 GLY
123 PHE N - 119 HIS O
123 PHE
122 ASP
121 GLY
120 PRO N - 116 GLN O
120 PRO
119 HIS O - 123 PHE N
119 HIS N - 115 LEU O
119 HIS
118 LYS N - 114 VAL O
118 LYS
117 SER N - 113 GLN O
117 SER
116 GLN O - 120 PRO N
116 GLN N - 112 ILE O
116 GLN
115 LEU O - 119 HIS N
115 LEU N - 111 ILE O
115 LEU
114 VAL O - 118 LYS N
114 VAL N - 110 ALA O
114 VAL

C. Examples of pentafragments obtained by cutting (rezfile cutting.txt)
1MWC
120 PRO N - 116 GLN O
120 PRO
119 HIS O - 123 PHE N
119 HIS N - 115 LEU O
119 HIS
118 LYS N - 114 VAL O
118 LYS
117 SER N - 113 GLN O
117 SER
116 GLN O - 120 PRO N
116 GLN N - 112 ILE O
116 GLN
1MWC
119 HIS O - 123 PHE N
119 HIS N - 115 LEU O
119 HIS
118 LYS N - 114 VAL O
118 LYS
117 SER N - 113 GLN O
117 SER
116 GLN O - 120 PRO N
116 GLN N - 112 ILE O
116 GLN
115 LEU O - 119 HIS N
115 LEU N - 111 ILE O
115 LEU
1MWC
118 LYS N - 114 VAL O
118 LYS
117 SER N - 113 GLN O
117 SER
116 GLN O - 120 PRO N
116 GLN N - 112 ILE O
116 GLN
115 LEU O - 119 HIS N
115 LEU N - 111 ILE O
115 LEU
114 VAL O - 118 LYS N
114 VAL N - 110 ALA O
114 VAL

D. Example of simplified file (sim_2111111211.txt)
1THB
136 GLY
135 ALA
134 VAL
133 VAL
132 LYS
1HDS
134 ALA
133 VAL
132 VAL
131 LYS
130 GLN
1THB
69 ALA
68 ASN
67 THR
66 LEU
65 ALA
1AZI
67 VAL
66 THR
65 GLY
64 HIS
63 LYS

Inverting text files

For the Predicto @ Designer program to work, the amino acid sequences contained in our pentafragments database need to be written from bottom to top. This pattern simulates the protein synthesis process, which proceeds from the N-terminus to the C-terminus. The Invertor program takes the data written in the text file and rearranges them from the bottom up (Table 2, B).
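The inversion step can be illustrated with a minimal Python sketch. It is not the Invertor program itself; it only assumes, as Table 2, A and B suggest, that inversion amounts to reversing the order of the lines in the text file. The function name `invert_lines` is hypothetical.

```python
def invert_lines(text: str) -> str:
    """Return the text with its lines in reverse order (last line first).

    Assumption: the inverted file is a plain line-by-line reversal of the
    H-bond text file, as the fragments in Table 2, A and B suggest.
    """
    return "\n".join(reversed(text.splitlines()))


# Two residues taken from the text-file fragment in Table 2, A.
sample = "\n".join([
    "114 VAL",
    "114 VAL N - 110 ALA O",
    "114 VAL O - 118 LYS N",
    "115 LEU",
    "115 LEU N - 111 ILE O",
    "115 LEU O - 119 HIS N",
])

inverted = invert_lines(sample)
print(inverted)
```

After reversal, the H-bond lines of each residue precede the bare residue line, which is the layout seen in Table 2, B.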

Cutting text files into pentafragments

Using the cutter_u program, cut the inverted files into pentafragments that will store information about the arrangement of H-bonds. Cutting is done by shifting the frame by one amino acid. Table 2, C shows some examples of such pentafragments.
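A minimal Python sketch of the cutting step follows. It is not the cutter_u program; it assumes the inverted layout of Table 2, B, where the H-bond lines of a residue come before its bare "number name" line, and it omits the PDB identifier that heads each pentafragment in Table 2, C. The function names are hypothetical.

```python
def group_by_residue(lines):
    """Group inverted-file lines so that each group holds one residue.

    Assumption: a bare two-token line such as "120 PRO" closes the group
    of the residue whose H-bond lines precede it.
    """
    groups, current = [], []
    for line in lines:
        current.append(line)
        if len(line.split()) == 2:
            groups.append(current)
            current = []
    return groups


def cut_pentafragments(lines, size=5):
    """Slide a frame of `size` residues over the file, one residue at a time."""
    groups = group_by_residue(lines)
    fragments = []
    for start in range(len(groups) - size + 1):
        fragment = [line for group in groups[start:start + size] for line in group]
        fragments.append(fragment)
    return fragments


# Residues 120 to 114 of the inverted fragment in Table 2, B.
inverted_lines = [
    "120 PRO N - 116 GLN O", "120 PRO",
    "119 HIS O - 123 PHE N", "119 HIS N - 115 LEU O", "119 HIS",
    "118 LYS N - 114 VAL O", "118 LYS",
    "117 SER N - 113 GLN O", "117 SER",
    "116 GLN O - 120 PRO N", "116 GLN N - 112 ILE O", "116 GLN",
    "115 LEU O - 119 HIS N", "115 LEU N - 111 ILE O", "115 LEU",
    "114 VAL O - 118 LYS N", "114 VAL N - 110 ALA O", "114 VAL",
]

fragments = cut_pentafragments(inverted_lines)
for fragment in fragments:
    print(fragment[0], "...", fragment[-1])
```

Seven residues yield three overlapping pentafragments, which correspond to the three examples in Table 2, C.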

Sorting and simplifying pentafragments

Use the Selector program to sort the pentafragments obtained as shown above in accordance with the H-bond coding system we have adopted (see Table 3, Table 4). Use the Simplification program to simplify the files obtained (Table 2, D).
Table 3

Notations of bonds in text PDB-files (A), types of H-bonds (B), their coding with Boolean pairs of variables (C), an example of pentafragment (D) and its 10-digit description (E).

A. Notations in text PDB-files | B. Types of H-bonds | C. Coding
X1X2 Abc | No H-bonds | 00
X1X2 Abc O–Y1Y2 Deh N; X1X2 Abc | H-bond only with C=O-group | 01
X1X2 Abc N–Y3Y4 Ehf O; X1X2 Abc | H-bond only with NH-group | 10
X1X2 Abc O–Y1Y2 Deh N; X1X2 Abc N–Y3Y4 Ehf O; X1X2 Abc | H-bonds both with CO and with NH-group | 11

D. An example of pentafragment and its coding
51 Gln O - 55 Glu N
51 Gln
50 Pro
49 Ala
48 Asp
47 Ser
0100000000

E. 10-digit descriptions of PFs and file names
0100000000

In cell D, the selected first two lines correspond to the highlighted designation 01 in cell E.
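The coding of Table 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the Selector program: each residue contributes one pair of digits, the first digit marking an H-bond of the NH-group and the second an H-bond of the C=O-group, and the pairs are concatenated from top to bottom. Only the basic binary codes of Table 3 are covered, not the extended codes of Table 4. The function name `encode_pentafragment` is hypothetical.

```python
def encode_pentafragment(lines):
    """Build a 10-digit description: one (NH, CO) digit pair per residue."""
    order = []   # residue keys in order of first appearance
    bonds = {}   # residue key -> set of bonded atoms, "N" and/or "O"
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue
        key = (tokens[0], tokens[1].upper())
        if key not in bonds:
            bonds[key] = set()
            order.append(key)
        if len(tokens) > 2:  # bond line such as "51 Gln O - 55 Glu N"
            bonds[key].add(tokens[2].upper())
    if len(order) != 5:
        raise ValueError("a pentafragment must contain exactly five residues")
    return "".join(
        ("1" if "N" in bonds[key] else "0") + ("1" if "O" in bonds[key] else "0")
        for key in order
    )


# Example from Table 3, D.
table3_example = [
    "51 Gln O - 55 Glu N", "51 Gln", "50 Pro", "49 Ala", "48 Asp", "47 Ser",
]
# First pentafragment from Table 2, C (residues 120 to 116 of 1MWC).
table2_example = [
    "120 PRO N - 116 GLN O", "120 PRO",
    "119 HIS O - 123 PHE N", "119 HIS N - 115 LEU O", "119 HIS",
    "118 LYS N - 114 VAL O", "118 LYS",
    "117 SER N - 113 GLN O", "117 SER",
    "116 GLN O - 120 PRO N", "116 GLN N - 112 ILE O", "116 GLN",
]

code_table3 = encode_pentafragment(table3_example)
code_table2 = encode_pentafragment(table2_example)
print(code_table3)  # 0100000000
print(code_table2)  # 1011101011
```

The second result coincides with the number listed for residue 120 of pig myoglobin in Table 1, which suggests that the pair order assumed here is consistent with the published data for this region.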

Table 4

Coding of types of H-Bonds in the form of binary combinations for an improved database of pentafragments.

No. | Type of structure | Types of H-bonds | Bonds | Code | Bonds | Code | Bonds | Code | Bonds | Code
1 | α-helix | NiH … Oi-4; Oi-4 … HNi | 00 | 00 | 01 | 01 | 10 | 10 | 11 | 11
2 | Inverted α-helix | NiH … Oi+4; Oi … HNi-4 | 00 | 00 | 10 | 70 | 01 | 07 | 11 | 77
3 | Helix 3₁₀ | NiH … Oi-3; Oi-3 … HNi | 00 | 00 | 01 | 03 | 10 | 30 | 11 | 33
4 | Inverted helix 3₁₀ | NiH … Oi+3; Oi … HNi-3 | 00 | 00 | 10 | 60 | 01 | 06 | 11 | 66
5 | Combination of α-helix and helix 3₁₀ | NiH … Oi-4 … Oi-3; Oi-4 … HNi … HNi-1 | 00 | 00 | 02 | 02 | 20 | 20 | 22 | 22
6 | Combination of inverted α-helix and helix 3₁₀ | NiH … Oi+4 … Oi+3; Oi … HNi-4 … HNi-3 | 00 | 00 | 20 | 40 | 02 | 04 | 22 | 44
An identification system based on the binary coding of H-bonds was developed to sort pentafragments into database folders [8], [9], [10], [11]. An example of describing the structure of pentafragments with this coding is given in Table 3. In this case, the 10-digit numbers describing the conformation of pentafragments were transferred to the file names (Table 3, E). Subsequently, this coding procedure became more complicated (Table 4): additional digits were introduced to identify various types of secondary structures, but the coding retained its binary principle [11]. The structure of the database organized in accordance with the bond coding system of Table 4 is shown in Table 5. It consists of folders containing pentafragment files and designated by the i-th pair of variables (see the Folder numbering column, Table 5), of files enclosed in these folders and named with the 10-digit numbers that describe the structure of the pentafragments (column 2), and of pentafragments contained in these files and associated with their specific positions in proteins (column 3). To speed up the search for pentafragments, the software stores the database in the form of strings (see Ref. [6] for an example).
Table 5

Pentafragment database structure.

Folder numbering (Database.JPG)
No. | Folder | No. | Folder
1 | 00-XX | 20 | 30-XX
2 | 01-XX | 21 | 31-XX
3 | 02-XX | 22 | 32-XX
4 | 03-XX | 23 | 33-XX
5 | 04-XX | 24 | 34-XX
6 | 06-XX | 25 | 36-XX
7 | 07-XX | 26 | 37-XX
8 | 10-XX | 27 | 40-XX
9 | 11-XX | 28 | 43-XX
10 | 12-XX | 29 | 60-XX
11 | 13-XX | 30 | 61-XX
12 | 14-XX | 31 | 62-XX
13 | 16-XX | 32 | 63-XX
14 | 17-XX | 33 | 66-XX
15 | 20-XX | 34 | 70-XX
16 | 21-XX | 35 | 71-XX
17 | 22-XX | 36 | 72-XX
18 | 23-XX | 37 | 74-XX
19 | 27-XX | 38 | 77-XX

Pentafragment files. Folder 37-XX (Pentafragment Files of Folder 37-00.JPG)
3700000270.txt
3700000370.txt
3700003270.txt
3700003370.txt
3700037270.txt
3703000370.txt
3730000373.txt
3730003373.txt

Pentafragments of the file 3730000373.txt (Pentafragment of File 3730000373.JPG)
DKK
23 TYR
22 GLY
21 ARG
20 TYR
19 ASN
2BQA
23 ILE
22 GLY
21 ARG
20 TYR
19 GLY
2JIZ
294 TYR
293 ALA
292 GLU
291 ARG
290 GLY
3D27
65 TYR
64 GLY
63 HIS
62 TYR
61 GLY
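The folder and file layout of Table 5 can be sketched as a path computation. This is a hedged illustration: it assumes that the folder name is formed from the first pair of digits of the 10-digit description followed by "-XX", which is what the folder 37-XX and its file names suggest. The root folder name "Database" and the function name `pentafragment_path` are hypothetical.

```python
from pathlib import PurePosixPath


def pentafragment_path(code: str, root: str = "Database") -> PurePosixPath:
    """Return the assumed location of the file for a 10-digit description.

    Assumptions: folder = first two digits + "-XX"; file = code + ".txt".
    """
    if len(code) != 10 or not code.isdigit():
        raise ValueError("expected a 10-digit description")
    return PurePosixPath(root) / f"{code[:2]}-XX" / f"{code}.txt"


path = pentafragment_path("3730000373")
print(path)  # Database/37-XX/3730000373.txt
```

Keying folders on the leading pair keeps every lookup confined to one folder, which is consistent with the later remark that a corrected search is conducted under the same folder number.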

Program layout

The computer program is named PREDICTO @ DESIGNER. It is written in C++. It has been registered [12] and is described in detail in Ref. [13]. As input, the program accepts files in the .pdb (Protein Data Bank) and .gen (GenBank) formats, which it transforms into the .dbk format (Table 6, A) in which it predicts the secondary structure of the protein. The result of the program is written in the .dbkx format (Table 6, B).
Table 6

Formats used by the program PREDICTO @ DESIGNER.

A. A fragment of the pig myoglobin protein (1MWC file) in .dbk format (1MWD_A.dbk) | B. Recording the result of the program in .dbkx format (1MWD_A.dbkx)
15 XXX G GLY bbbbbbbbbb | 15 XXX G GLY 1112121011 3K9Z 1DMR
14 XXX W TRP bbbbbbbbbb | 14 XXX W TRP 1212101111 3K9Z 1DMR
13 XXX V VAL bbbbbbbbbb | 13 XXX V VAL 1210111111 3K9Z 1MWC
12 XXX N ASN bbbbbbbbbb | 12 XXX N ASN 1011111111 3K9Z 1MWC
11 XXX L LEU bbbbbbbbbb | 11 XXX L LEU 1111111111 3K9Z 1MWC
10 XXX V VAL bbbbbbbbbb | 10 XXX V VAL 1111111101 3K9Z 1MWC
9 XXX L LEU bbbbbbbbbb | 9 XXX L LEU 1111110101 3K9Z 1MWC
8 XXX Q GLN bbbbbbbbbb | 8 XXX Q GLN 1111010101 3K9Z 1DMR
7 XXX W TRP bbbbbbbbbb | 7 XXX W TRP 1101010101 3K9Z 1DMR
6 XXX E GLU bbbbbbbbbb | 6 XXX E GLU 0101010100 3K9Z 1DMR
5 XXX G GLY bbbbbbbbbb | 5 XXX G GLY 0101010000 3K9Z 1DMR
4 XXX D ASP bbbbbbbbbb | 4 XXX D ASP bbbbbbbbbb
3 XXX S SER bbbbbbbbbb | 3 XXX S SER bbbbbbbbbb
2 XXX L LEU bbbbbbbbbb | 2 XXX L LEU bbbbbbbbbb
1 XXX G GLY bbbbbbbbbb | 1 XXX G GLY bbbbbbbbbb
0 ATG M MET bbbbbbbbbb | 0 ATG M MET bbbbbbbbbb
Fig. 2, a shows the startup screen of the PREDICTO @ DESIGNER program. Clicking on the word PREDICTO sets the program to the secondary protein structure prediction mode (Fig. 2, b shows the workspace where digital and structural information is displayed) and clicking on the word DESIGNER sets it to the design mode (Fig. 2, c shows the workspace, control panel, and icons used to display information required for the design).
Fig. 2

The startup screen and workspaces of the PREDICTO @ DESIGNER program. a – program startup screen; b – PREDICTO section workspace; c – DESIGNER section workspace.


The procedure for prediction

The method of predicting secondary protein structure described in the patent [2] consists in isolating pentafragments in a file with the specially formatted primary structure of a protein (.dbk file) and searching for them in the database. Since every pentafragment has a 10-digit identification number in the database, the software reads the code number of the found pentafragment and displays it in the numeric operating field in a bottom-up sequence, progressively, as pentafragments are selected along the protein chain from start to finish. The procedure consists of two stages: an initial pentafragment is found at the first stage and, if it is detected correctly, the remaining protein is predicted at the second stage [2]. With this approach, the secondary structure of all proteins used to develop the database is predicted with an accuracy close to 100%.
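The search described above can be sketched in Python under strong simplifications. This is not the algorithm of the patent [2]: the database is modelled as a dictionary keyed by five three-letter residue names in N-to-C order with a single 10-digit code per key, the two-stage procedure is not modelled, and a missing pentafragment simply yields `None` (a gap). The function name `predict` and the toy database are hypothetical; the single entry is taken from Table 5.

```python
def predict(sequence, database):
    """Slide a five-residue frame along the chain and look up each pentafragment.

    Returns a list of (start_index, pentafragment, code_or_None) tuples.
    """
    results = []
    for start in range(len(sequence) - 4):
        window = tuple(sequence[start:start + 5])
        results.append((start, window, database.get(window)))
    return results


# Toy database with one entry from Table 5 (residues 19 to 23, file 3730000373.txt).
toy_database = {
    ("ASN", "TYR", "ARG", "GLY", "TYR"): "3730000373",
}

# Hypothetical six-residue chain: the first frame is in the database, the second is not.
chain = ["ASN", "TYR", "ARG", "GLY", "TYR", "GLY"]
predictions = predict(chain, toy_database)
for start, window, code in predictions:
    print(start, "-".join(window), code if code is not None else "(gap)")
```

The second frame is reported as a gap, which mirrors the empty entries in the uncorrected alligator column of Table 1.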

Prediction correction method by replacement of amino acids

The method consists in the following [6]. Let us assume that at some i-th stage the software has isolated a pentafragment that is not found under the code number defined by the search algorithm. If a pentafragment was found at the previous, (i-1)-th stage, then the cause is the amino acid that entered the pentafragment at the i-th stage. It is well known that such changes (mutations) are frequently observed in proteins of the same type extracted from different species. Because the search for a pentafragment with a missing i-th amino acid should be conducted under the same folder number as for other pentafragments with a similar structure but with different amino acids in the i-th position, it is possible to replace the original pentafragment search with a search for a pentafragment with a similar structure but with the amino acid in the i-th position changed. For alligator myoglobin, the results to which correction was not applied show quite low prediction accuracy (Table 1, middle column), whereas the region with amino acids from 115 to 138 (Table 1) was predicted completely as a result of applying this method. Comparison of the predicted structure of alligator myoglobin with porcine myoglobin (Table 1, left column) shows that in general both structures have similar positions of α-helices in this fragment. Thus, applying this correction method significantly improves prediction accuracy for the secondary structure of proteins.
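A minimal Python sketch of the replacement idea is given below. It is an illustration, not the method of Ref. [6]: it assumes that the amino acid entering the pentafragment at the i-th stage is the last residue of the frame in N-to-C order, and it omits the restriction of the search to the same folder number. The function name `lookup_with_replacement` and the query are hypothetical; the database entry is taken from Table 5.

```python
def lookup_with_replacement(window, database):
    """Look up a pentafragment; on a miss, accept a different last amino acid.

    Returns (amino_acid_used, code), or None if no candidate is found.
    """
    window = tuple(window)
    if window in database:
        return window[-1], database[window]
    for key, code in database.items():
        if key[:-1] == window[:-1]:  # same first four residues
            return key[-1], code
    return None


# Toy database with one entry from Table 5 (2BQA, residues 19 to 23).
toy_database = {
    ("GLY", "TYR", "ARG", "GLY", "ILE"): "3730000373",
}

# Hypothetical query whose last residue differs from the stored pentafragment.
hit = lookup_with_replacement(("GLY", "TYR", "ARG", "GLY", "TYR"), toy_database)
miss = lookup_with_replacement(("ALA", "ALA", "ALA", "ALA", "ALA"), toy_database)
print(hit)   # ('ILE', '3730000373')
print(miss)  # None
```

Returning the substituted amino acid together with the code corresponds to the layout of column B in Table 1, where the amino acid used for the replacement is shown next to the predicted number.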

Further ways to develop the prediction method

The described prediction correction method is convenient for groups of proteins with similar structure derived from different species (as in the case of myoglobins and other heme-containing proteins). Ideally, it would be better to have a universal database that could be used to predict the secondary structure of any protein with high accuracy. We have shown that creating it is practically possible [14]. However, a large increase in the number of pentafragments in the database significantly increases the number of alternative options for predicting secondary structures. This, in turn, sharply slows down the software and degrades prediction quality. For these reasons, we believe it is more relevant to develop ad hoc databases aimed at predicting structurally close proteins. In this case, a universal database can be built as a hierarchical structure of specialized databases. The prediction algorithm will consist of two stages: a) preliminary search for common elements attributable to certain protein groups; b) final prediction based on a specialized database. There is a lot of work to be done in this respect, but the results seem quite promising.

Developing a design method for secondary structures

Because the proposed approach can predict secondary structures of proteins quite accurately, it would be logical to apply the same approach to designing proteins with a predefined secondary structure. This method is detailed in the patent [15]. It is implemented in the Designer section [13] of the Predicto @ Designer software. The initial protein pentafragment and its description in the form of a 10-digit number in the binary numeral system are set using the control panel. The selected pentafragment is searched for in the database and, if it is found, it is then necessary to specify one new amino acid and the 10-digit description of a new pentafragment, containing the previous four amino acids and the new one, and to run a new search in the database. If the new pentafragment is found, the procedure is repeated. The description presented in the patent is based on data available in the literature and therefore confirms the feasibility of this design. However, before this method is recommended for large-scale implementation, it must pass a more comprehensive experimental validation on the basis of up-to-date scientific and engineering know-how. Studies in this direction are being carried out.

Specifications Table

Subject: biology
Specific subject area: database of protein pentafragments and computer software
Type of data: Table; Figure; Database; Software; Image (x-ray); Text file
How data were acquired: Computer software Protein_3D, Predicto @ Designer
Data format: Raw and Analysed
Parameters for data collection: The primary structure of the protein
Description of data collection: By using a database and computer programs
Data source location: Source of protein isolation (animal or plant species)
Data accessibility: Data are with this article
Related research article: Vladimir Karasev, BioSystems 180 (2019) 7–18, https://doi.org/10.1016/j.biosystems.2019.02.001
Value of the Data

A database of protein pentafragments, sorted according to a binary description of their structure.

A computer program Predicto @ Designer using this database and algorithm has been written.

This program may be useful in the problems of predicting and designing of protein structure.

The obtained data can contribute to the development of a database and computer software.


1.  A model of molecular vector machine of proteins.

Authors:  Vladimir Karasev
Journal:  Biosystems       Date:  2019-03-13       Impact factor: 1.973

2.  Cloning and sequence analysis of porcine myoglobin cDNA.

Authors:  E Akaboshi
Journal:  Gene       Date:  1985       Impact factor: 3.688

3.  Stabilizing bound O2 in myoglobin by valine68 (E11) to asparagine substitution.

Authors:  S Krzywda; G N Murshudov; A M Brzozowski; M Jaskolski; E E Scott; S A Klizas; Q H Gibson; J S Olson; A J Wilkinson
Journal:  Biochemistry       Date:  1998-11-10       Impact factor: 3.162

4.  The amino acid sequence of alligator (Alligator mississippiensis) myoglobin. Phylogenetic implications.

Authors:  H Dene; J Sazy; M Goodman; A E Romero-Herrera
Journal:  Biochim Biophys Acta       Date:  1980-08-21