Jure Pražnikar1,2, Nuwan Tharanga Attygalle1. 1. Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Koper, Slovenia. 2. Department of Biochemistry, Molecular and Structural Biology, Institute Jožef Stefan, Ljubljana, Slovenia.
Abstract
3D protein structures can be analyzed using a distance matrix calculated as the pairwise distance between all Cα atoms in the protein model. Although researchers have efficiently used distance matrices to classify proteins and find homologous proteins, much less work has been done on quantitative analysis of distance matrix features. Therefore, the distance matrix was analyzed as gray scale image using KAZE feature extractor algorithm with Bag of Visual Words model. In this study, each protein was represented as a histogram of visual codewords. The analysis showed that a very small number of codewords (~1%) have a high relative frequency (> 0.25) and that the majority of codewords have a relative frequency around 0.05. We have also shown that there is a relationship between the frequency of codewords and the position of the features in a distance matrix. The codewords that are more frequent are located closer to the main diagonal. Less frequent codewords, on the other hand, are located in the corners of the distance matrix, far from the main diagonal. Moreover, the analysis showed a correlation between the number of unique codewords and the 3D repeats in the protein structure. The solenoid and tandem repeats proteins have a significantly lower number of unique codewords than the globular proteins. Finally, the codeword histograms and Support Vector Machine (SVM) classifier were used to classify solenoid and globular proteins. The result showed that the SVM classifier fed with codeword histograms correctly classified 352 out of 354 proteins.
The analysis of protein structures using the distance matrix of Cα atoms has a long history in structural biology. To date, a protein distance matrix has been used for structural alignment, protein classification, and finding homologous proteins [1, 2]. Recently, tremendous progress has been made in predicting 3D protein structures based on distance matrices and artificial intelligence [3-6]. Various studies have shown that the representation of protein structure in 2D space has the following main advantages: it represents local, short-, medium-, and long-range contacts between Cα-atoms simultaneously and is rotation and translation invariant [7]. The protein distance matrix contains the distances between residues and can be represented as a grayscale image, where the distance between each pair of Cα-atoms is encoded as intensity. Therefore, feature extraction can be applied to obtain points of interest.

In previous studies on distance matrices, the extracted features have been used to classify proteins and explore the universe of protein folding space. For example, the DALI Z-score [1, 8], the Local Feature Profile [9], and the Effective Moment Feature Vector [10], which were projected into three-dimensional maps by dimensionality reduction, have shown that all α-, all β-, α/β-, and α+β-domains form groups. Moreover, multi-view rendering of 3D protein structures has been used to obtain 2D images followed by feature extraction [11]. In this method, the codebook is generated by clustering visual words, and a histogram is used to explore similarities between protein structures. Taken together, these studies have shown that feature vectors or histograms of the distance matrix are valuable for exploring and investigating the space of protein folding in reduced dimensions. The feature vectors or histograms can be obtained in many different ways, but the common approach in the above literature is a method from computer science: Bag-of-Visual-Words.
Sivic and Zisserman first introduced Bag of Features [12] for matching objects in videos, where they investigated whether it is possible to retrieve a video scene that contains a particular object. In addition, Yu et al [13] developed an image retrieval system based on the Bag of Features method.

Several studies have also applied image recognition techniques to successfully classify protein structures using Speeded Up Robust Features (SURF) and Scale-invariant Feature Transform (SIFT) descriptors. The agreement between proteins was measured using the similarity of all feature descriptors rather than codeword histograms [14, 15]. This approach does not introduce a fixed codebook size, but determines the smallest distance between all feature pairs from two images. A high number of matching points indicates that two images have similar patches, from which we can conclude that they were generated from similar 3D protein structures.

Much of the previous research on distance matrix features has focused on optimal vocabulary size, various feature extraction algorithms, representation of the protein folding universe, and classification performance. The basic patterns in a distance matrix corresponding to secondary structure elements (α-helix, parallel and anti-parallel β-sheet) are well known, but less work has been done on quantitative analysis of distance matrix features. For example, Choi et al pointed out that the "null" pattern (the 15th medoid) is the most common [9]. In the mentioned study, one hundred medoids (submatrices of size 10x10) were extracted from the training set (sampled proteins). Thus, 100 medoids represent the local features of the protein distance matrix.
However, no analysis was performed on the frequency of all 100 medoids, on their most frequent positions in the distance matrix (mapped to the image), or on the correlation between the number of unique features and solenoid proteins. Thus, our main motivation was to extend the analysis of distance matrix features instead of focusing only on the speed and accuracy of classification. The focus of the present study is therefore a quantitative analysis of the codewords of the protein distance matrix, and we tried to answer the following questions: What is the relationship between the frequency of a codeword and its position in the distance matrix? What is the distribution of the codewords? And what is the correlation between tandem repeat proteins and the number of unique codewords? For our study, we used bag-of-visual-words, a method commonly used in image classification, in which each image is represented by a histogram of codewords. The aims of this study are to (i) find the optimal vocabulary size of the KAZE algorithm for fast protein classification of SCOPe protein domains, (ii) explore the most frequent positions of visual codewords in the distance matrix, and (iii) investigate the relationship between unique visual codewords, domain size, and solenoid and tandem repeats in proteins.
Methods
SCOPe domain database and (non) solenoid proteins
The set was taken from the SCOPe 2.07 database [16-18] and included four main classes: all α, all β, α/β and α+β. We excluded domains that are not a true family and are labelled as X.X.X.0; we also excluded all individual members at the family level. Thus, each SCOPe family in the analysed database contains at least two members. The dataset contains 7308 distance matrix images, and we extracted 2,929,445 KAZE features. The images of the distance matrix contained the distances between all Cα-atoms. Therefore, each distance matrix was represented as a grayscale image in the range [0, 1], where 0 (black) represents long-range contacts and 1 (white) represents short-range contacts.
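The mapping from Cα coordinates to a grayscale image described above can be sketched as follows. This is a minimal illustration, not the paper's actual preprocessing code; it assumes the stated convention that 0 (black) marks the longest-range contacts and 1 (white) the shortest:

```python
import numpy as np

def distance_matrix_image(ca_coords):
    """Convert an (N, 3) array of C-alpha coordinates into a grayscale
    distance-matrix image in [0, 1], where 1 (white) marks short-range
    contacts and 0 (black) the longest-range contacts."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    return 1.0 - dist / dist.max()             # invert: short distance -> bright

# Toy example: four C-alpha atoms on a straight line
coords = np.array([[0.0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0]])
img = distance_matrix_image(coords)            # 4x4 symmetric image
```

The resulting matrix is symmetric, with a bright main diagonal (self-distances are zero) darkening towards the corners.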
Solenoid and non-solenoid protein data set—Benchmark
For the classification of globular and solenoid proteins, we used a benchmark database consisting of 105 solenoid and 247 globular proteins downloaded from the following website: http://old.protein.bio.unipd.it/raphael/precompiled.html [19, 20]. The dataset contains 105 solenoid proteins for which the maximum sequence identity is 35% (CATH ’S’-level, sequence families). In addition to the solenoid proteins, the dataset contains 247 globular proteins with different topological or folding families and no detectable sequence similarity (CATH ’T’-level). Thus, the dataset contains non-homologous proteins and has been used to evaluate other methods, e.g. RAPHAEL [20], ReUPred [21], and ConSole [22], which are briefly introduced below.

RAPHAEL uses geometric data instead of amino acid sequence profiles. The coordinates of the Cα-atoms are used to build a coordinate profile. To obtain the coordinate profile, two averaging filters are used, the first with a window size of 6 and the second with a window size of 3. Then, the period is calculated from the coordinate profile, defined as the distance between successive local maxima. In the last step, a Support Vector Machine fed with the feature profile is trained to classify proteins.

ReUPred (Repetitive Units Predictor) is a tool for the prediction and classification of repeating units in proteins. The ReUPred algorithm takes as input the Structure Repeat Unit Library (SRUL), derived from RepeatsDB, and the query protein. ReUPred uses an iterative divide-and-evaluate approach to identify structural units by comparing the protein structure to a manually curated library of repeat units. The alignments found by the repeat-protein annotation that satisfy the predefined similarity criteria are considered valid.

ConSole uses the information contained in the protein contact maps.
The 3D protein structure is represented as a contact matrix in which there is a contact between two amino acids if the distance between any pair of heavy atoms is less than 4.5Å. The typical feature of the solenoid contact map is the second diagonal line indicating regular short distance contacts between amino acids. The template matching or normalized cross-correlation performed on the contact map is used as input for classification by a Support Vector Machines.
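A contact map of the kind ConSole uses can be sketched as a thresholded distance matrix. Note the simplification: for brevity the distances below are taken between Cα atoms only, whereas ConSole considers any pair of heavy atoms:

```python
import numpy as np

def contact_map(ca_coords, cutoff=4.5):
    """Binary contact map: 1 where two residues lie closer than `cutoff`
    angstroms. Simplification: distances are measured between C-alpha
    atoms; ConSole itself uses all heavy-atom pairs."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return (dist < cutoff).astype(np.uint8)

# Three residues spaced 3.8 A apart along a line
coords = np.array([[0.0, 0, 0], [3.8, 0, 0], [7.6, 0, 0]])
cmap = contact_map(coords)
```

In a solenoid protein, the regular repeats would appear as an off-diagonal band of ones in such a map.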
Bag of visual words
The earliest reference for Bag-of-Words was in Zellig Harris’ Distributional Structure [23]. Since then, the bag-of-words model has been used in various document classification methods. The frequency of each word (number of occurrences) is used as a feature for training a classifier. The Bag-of-Features model (Bag-of-Visual-Words) is influenced by the Bag-of-Words model and is used to classify images by treating image features as words. The Bag-of-Visual-Words algorithm includes the following four main steps: (i) feature extraction, (ii) visual dictionary, (iii) codebook generation, and (iv) histogram generation. All four steps are shown in Fig 1 and described in the following text. Bag-of-Visual-Words was generated using the Matlab function bagOfFeatures. The image database and Matlab code [24] are provided in the Supplement.
Fig 1
Graphical representation of Bag of Visual Words.
The presented algorithm consists of four steps: feature extraction, creation of a visual dictionary, creation of a codebook and finally hardcoding or histogram creation.
Feature extraction
Features are parts or patterns of an image that can be used to classify it. Instead of storing the entire image, it is efficient to store only its distinctive parts. Feature extraction is the process of extracting these distinctive elements of the image. A feature extraction algorithm detects key points and creates descriptors based on them. Usually, a descriptor is an array of 64 or 128 elements. We can think of these descriptors as small patches extracted from the image. Feature detection algorithms such as Speeded Up Robust Features (SURF), Scale-invariant Feature Transform (SIFT), and KAZE create multiple instances of the same image by blurring and transforming (rotating, scaling, skewing) it. The algorithm then identifies keypoints by searching for similar points across all the generated images. Once the algorithm finds the keypoints, the neighbouring pixels around them form the descriptors. SURF and SIFT use Gaussian blur to create blurred versions of the images. However, Gaussian blur smooths image details, such as object boundaries, and noise to the same extent, and thus degrades the quality of keypoints by reducing localization accuracy and discriminability. In contrast, the KAZE algorithm [25] locally adapts the blur to the image in a nonlinear scale space by using nonlinear diffusion filtering. The KAZE algorithm reduces noise while preserving object boundaries, resulting in better localization accuracy and discriminability. KAZE features are therefore more expensive to compute than SURF and SIFT, but outperform both in terms of feature matching efficiency [26]. For these reasons, we used the KAZE algorithm to obtain features from protein distance matrix images.
Visual dictionary
In this step, we pooled the raw features extracted from all images using the KAZE algorithm. Each extracted feature consists of a 64-dimensional vector representing the neighbouring pixel values around a given key point. We extracted features fi from all images in the dataset and stored them in a collection F = {f1, …, fM}, where M is the total number of features in the entire set. This feature collection F is called a visual dictionary.
Codebook generation
The codebook is a compact version of the visual dictionary: it stores only a representative subset of the features in F instead of all the extracted features. Clustering is an unsupervised machine learning approach that groups similar data points; points in the same group should have similar features, while points in different groups should differ. K-means initialises the centroid locations and then iteratively updates them to best fit the data points. We grouped the similar features of the visual dictionary into clusters using k-means clustering, extracted the centroid of each cluster, and stored the centroids in a new collection called the codebook. Each entry of the codebook is called a codeword.
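The codebook construction can be sketched with a plain k-means loop. This is a minimal illustration, not the paper's implementation (which relies on MATLAB's bagOfFeatures); the deterministic initialisation and the toy two-cluster data are choices made only for this sketch:

```python
import numpy as np

def build_codebook(features, k, n_iter=20):
    """Cluster pooled descriptors (the visual dictionary) with basic k-means
    and return the k centroids -- the codebook."""
    # Deterministic initialisation: k evenly spaced descriptors (toy choice)
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    centroids = features[idx].copy()
    for _ in range(n_iter):
        # Assign every descriptor to its nearest centroid
        d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its members
        for j in range(k):
            members = features[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

# Two well-separated clusters of 64-dimensional "descriptors"
rng = np.random.default_rng(1)
F = np.vstack([rng.normal(0, 0.1, (50, 64)), rng.normal(5, 0.1, (50, 64))])
codebook = build_codebook(F, k=2)
```

Each row of `codebook` is one codeword; in the study, k was the vocabulary size (e.g., 5000).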
Generating a histogram
Using the codebook, we represented each image as a histogram, i.e., one histogram per image, where each bin corresponds to one codeword of the codebook. First, we extracted the features from the image and found the closest codeword for each feature (hard coding), thereby creating the histogram or feature vector. After hard coding, the histograms have the same dimension for all images, regardless of image size or the number of extracted features. The similarity between two images is then calculated as the cosine distance between their histograms (feature vectors).
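The hard-coding and comparison steps above can be sketched as follows; the two-dimensional toy codewords and descriptors are invented for illustration (real descriptors are 64-dimensional):

```python
import numpy as np

def encode_histogram(descriptors, codebook):
    """Hard coding: map each descriptor to its nearest codeword and count
    occurrences, giving one fixed-length histogram per image."""
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=-1)
    nearest = d.argmin(axis=1)
    return np.bincount(nearest, minlength=len(codebook)).astype(float)

def cosine_distance(h1, h2):
    """Cosine distance between two codeword histograms (0 = identical direction)."""
    return 1.0 - h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2))

codebook = np.array([[0.0, 0.0], [5.0, 5.0]])            # two toy codewords
img_a = np.array([[0.1, 0.0], [0.0, 0.2], [4.9, 5.1]])   # 3 descriptors
img_b = np.array([[0.2, 0.1], [5.0, 4.8]])               # 2 descriptors
ha, hb = encode_histogram(img_a, codebook), encode_histogram(img_b, codebook)
```

Note that `ha` and `hb` have the same length (the codebook size) even though the images contributed different numbers of descriptors.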
Support vector machine classification
The labelled (globular and solenoid proteins) distance matrix images were classified using the trainImageCategoryClassifier function in Matlab [24]. The function trainImageCategoryClassifier trains a multiclass classifier (Support Vector Machines) using a bag-of-features object. Support Vector Machines (SVM) is a supervised machine learning model that can be used for linear and nonlinear classification problems. SVM is based on the idea of finding a hyperplane in a high-dimensional space that best separates features into different classes (S1 Fig). The features closest to the hyperplane are called support vector points, while the margin width is the distance between the separator and the closest points of each class. The SVM algorithm finds the optimal hyperplane when margin width is maximum. In the present study, an SVM classifier (linear kernel) was used to discriminate solenoid and globular proteins. The input data were histograms of codewords, where each histogram corresponded to a protein distance matrix.
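Since the paper trains its SVM through MATLAB's trainImageCategoryClassifier, the following is only a hedged sketch of the same idea using scikit-learn's linear SVM. The input histograms here are synthetic stand-ins invented for illustration: "solenoid-like" rows concentrate their mass in a few bins, mimicking the few very frequent codewords of repeat proteins:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic codeword histograms, 50 bins per protein (invented data):
# globular-like rows are flat-ish, solenoid-like rows are peaky.
globular = rng.dirichlet(np.ones(50), size=40)
solenoid = rng.dirichlet(np.array([10.0] * 5 + [0.1] * 45), size=40)

X = np.vstack([globular, solenoid])
y = np.array([0] * 40 + [1] * 40)   # 0 = globular, 1 = solenoid

# Linear-kernel SVM, as in the study
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
train_acc = clf.score(X, y)
```

With clearly separated histogram shapes like these, the linear separator fits the training set almost perfectly; the paper's real evaluation instead uses 2-fold cross-validation on the benchmark set.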
Results and discussion
Vocabulary size of a distance matrix
When creating a vocabulary, it is crucial that its size is neither too small nor too large. If the vocabulary is too small, it will not represent all the features or patches in the image database, so images that differ significantly from each other may still have similar codeword histograms. On the other hand, if the vocabulary is too large, quantization artefacts may occur, i.e., overfitting. To test how classification accuracy depends on the size of the vocabulary, we performed nearest neighbour classification with different vocabulary sizes.

The cosine similarity measure was used to determine the similarity between two feature vectors (codeword histograms). The minimum cosine distance between two codeword histograms thus defined the nearest neighbour between the query domain and all other domains in the database. The prediction was considered correct if the nearest neighbour belonged to the same class (or fold, superfamily, or family). Fig 2A shows the accuracy as a function of vocabulary size for the class, fold, superfamily, and family levels.
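The nearest-neighbour protocol above can be sketched as a leave-one-out evaluation under cosine distance. The four toy histograms and labels below are invented for illustration:

```python
import numpy as np

def nn_accuracy(hists, labels):
    """Leave-one-out nearest-neighbour accuracy under cosine distance:
    a query counts as correct when its closest other histogram shares
    its label (class/fold/superfamily/family in the paper)."""
    H = hists / np.linalg.norm(hists, axis=1, keepdims=True)
    sim = H @ H.T                    # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)   # exclude self-matches
    nearest = sim.argmax(axis=1)
    return (labels[nearest] == labels).mean()

# Two pairs of similar toy histograms with matching labels
hists = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.2],
                  [0.0, 0.9, 0.1]])
labels = np.array([0, 0, 1, 1])
acc = nn_accuracy(hists, labels)
```

Maximising cosine similarity is equivalent to minimising cosine distance, which is the criterion stated in the text.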
Fig 2
(A) Accuracy versus vocabulary size of the SCOPe 2.07 database. (B) The first and second principal components of the codebook (7308 SCOPe domains by 5000 codewords).
It is obvious that the accuracy is highest for class prediction and lowest for family prediction. The explanation is that the differences between classes are much larger than the differences at the family level within the same superfamily, which makes it harder to predict the correct family. For example, if the query protein is classified by SCOPe as a.1.1.1 and the closest member is classified as a.1.2.1, the prediction is correct at the class and fold levels, but incorrect at the superfamily and family levels. Moreover, there are only four distinct classes at the class level (all α, all β, α+β, and α/β), but there are 534, 870, and 1602 distinct classes at the fold, superfamily, and family levels, respectively (Table 1). It follows that classification at the fold, superfamily, and family levels is more challenging than at the class level.
Table 1
SCOPe 2.07 levels and the number of unique classes at each level used in this study.
Level          Unique classes
Class          4
Fold           534
Superfamily    870
Family         1602
Further, we can see that accuracy increases with vocabulary size until a plateau is reached. This is observed at the fold, superfamily, and family levels. The most significant increase is observed at the family level, where the accuracy is 0.72, 0.79, and 0.80 for vocabulary sizes of 1000, 5000, and 10000, respectively. When the vocabulary size increases from 5000 to 10000, the accuracy at the family level increases by only ~1%, and at the class, fold, and superfamily levels the increase is even smaller. It is worth noting that creating and using a larger vocabulary increases the computation time: building the bag-of-visual-words with a vocabulary of 1000, 5000, and 10000 codewords required about 20, 27, and 39 minutes of CPU time (3.4 GHz processor), respectively. Therefore, for further analysis, we set the number of visual words in the codebook to 5000, which gives a good prediction and is not so large as to cause overfitting.

Moreover, principal component analysis of all histograms, i.e., the matrix of size 7308 (proteins) by 5000 (codewords), clearly shows the four principal classes (Fig 2B). The first principal component shows low values for all β-domains and high values for all α-domains, whereas α+β- and α/β-domains are scattered around zero. It follows that the first principal axis discriminates the all α- and all β-classes, but not α/β and α+β. The second principal component discriminates between the α+β- and α/β-classes. It is also observed that three groups (all β, all α, and α/β) cluster on the positive or negative principal axes, while the α+β class is scattered closer to the origin.
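The projection onto the first two principal components can be sketched with an SVD-based PCA; the 10x6 toy matrix stands in for the paper's 7308x5000 histogram matrix:

```python
import numpy as np

def first_two_pcs(hist_matrix):
    """Project a (proteins x codewords) histogram matrix onto its first
    two principal components."""
    X = hist_matrix - hist_matrix.mean(axis=0)   # center each codeword column
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T                          # scores on PC1 and PC2

# Toy stand-in for the 7308 x 5000 histogram matrix
rng = np.random.default_rng(0)
toy = rng.random((10, 6))
scores = first_two_pcs(toy)                      # shape (10, 2)
```

By construction, the variance captured along PC1 is at least that along PC2, so PC1 carries the strongest separation (here, between the all α- and all β-classes).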
We have shown here that the extraction of descriptors in an image of the distance matrix using the KAZE algorithm, followed by the construction of the codebook of 5000 words, gives a good distribution of domains in the low-dimensional space and a good prediction of nearest neighbours, and is consistent with the studies of Hou et al, Choi et al, and Shi et al, who studied the folding space and structural similarity in proteins [8-10].
Relative frequency and spatial distribution of codewords of a distance matrix
The spatial relationship between codewords is lost during the creation of the codebook, because information about the coordinates of the features is not retained in a bag of visual words. However, if we know which cluster (codeword) a feature belongs to, we can recover the coordinates of the codewords and relate them to the codeword's relative frequency. Here, we examined the relationship between the relative frequency of codewords and their coordinates in the protein distance matrix.

Fig 3A shows the relative frequency of codewords ordered from highest to lowest. We see that the codewords with the highest relative frequency are present in more than 30% of SCOPe domains. Only 52 (~1%) codewords have a relative frequency above 0.25, and about 3000 codewords (from 2000 to 5000) are present in less than 5% of the protein distance matrix images. The next question we want to answer is: where are the codewords located in the 2D matrix? Due to the hard coding, each extracted feature belongs to a particular codeword, i.e., is a member of a particular cluster, and each feature also carries x and y coordinates (the centre of the feature). In this way, we can calculate the minimum distance between the feature coordinates and the diagonal of the distance matrix and relate this distance to the relative frequency of a codeword.
Fig 3
(A) Barplot of relative frequency of codewords, (B) distance between feature coordinates and diagonal of distance matrix.
Note that the size of the distance matrix depends on the size of the protein, but the form of the information presented remains the same: the distance matrix always shows local, short-, intermediate-, and long-range contacts between Cα-atoms. Since protein domains differ in size, we normalized the coordinates to values between 0 and 1. Note that the maximum distance between the diagonal and the farthest points is limited by the value √2/2 ≈ 0.7. A zero distance indicates that the feature lies on the diagonal, while a distance of ~0.7 indicates that the feature is far from the diagonal and represents the farthest contact between Cα atoms.

Fig 3B shows the distance (codeword to diagonal) for all 5000 codewords sorted by relative frequency (from highest to lowest). It can be seen that codewords with high relative frequency (from 1 to 500) lie closer to the diagonal than codewords with lower relative frequency (from 500 to 1000). Less frequent codewords (from 1000 to 5000) are even more scattered and lie in the interval from 0.05 to 0.65. To get a better visual impression, we plotted the positions of all features on a normalized distance matrix. The panel plot shows the positions of codewords 1 to 500 (Fig 4A), 501 to 1000 (Fig 4B), 1001 to 2000 (Fig 4C), and 2001 to 5000 (Fig 4D). It can be clearly seen that the codewords with the highest relative frequency are located on the diagonal and represent local protein structure interactions. The highest density is located in the core of the diagonal, while the lowest density is found at its ends. Fig 4B shows features representing short-range contacts; these are non-diagonal codewords. Nevertheless, there are still some feature points at the tails of the diagonal. The tails of the diagonal contain information about local interactions; however, these codewords are relatively less frequent than the codewords in the core of the diagonal (Fig 4A).
The explanation for this is that the codewords at the tails of the diagonal represent flexible termini. Obviously, more codewords are required for disordered regions than for the rigid core region of the protein domain. In addition, Fig 4C and 4D show the mid- and long-range contacts. In Fig 4D, there are no elements near the diagonal, and the highest density of codewords is in the corners of the distance matrix, which represent the longest contacts between Cα-atom pairs in the protein structure.
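The codeword-to-diagonal distance used above is the perpendicular distance from a normalized feature position (x, y) to the line y = x, which is |x − y|/√2 and attains its maximum of √2/2 at the corners (0, 1) and (1, 0):

```python
import numpy as np

def diagonal_distance(x, y):
    """Perpendicular distance from a normalized feature position (x, y),
    both in [0, 1], to the main diagonal y = x of the distance matrix."""
    return np.abs(x - y) / np.sqrt(2.0)

on_diag = diagonal_distance(0.3, 0.3)   # a feature on the diagonal
corner = diagonal_distance(0.0, 1.0)    # the far corner of the matrix
```

This reproduces the bound stated in the text: zero for diagonal features, √2/2 ≈ 0.7 for the corner features representing the longest-range contacts.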
Fig 4
Distribution of codewords in the distance matrix sorted according to the relative frequency (from highest to lowest).
(A) Visual codewords from 1 to 500, (B) 501 to 1000, (C) 1001 to 2000, (D) and from 2001 to 5000.
Unique words ratio
This section focuses on how protein domain size is related to the total number of codewords and the number of unique codewords, and on the relationship between unique codewords and the repetition of structural motifs within the protein.

The correlation between the number of codewords in a protein and the protein domain length was R = 0.96, and the correlation between the domain length and the number of unique codewords was R = 0.93 (Fig 5A and 5B). It was expected that protein length would correlate positively with the number of (unique) codewords, since a larger image contains a larger number of features. However, when constructing a histogram, different features can be encoded in the same codeword. One reason for this is clustering, where similar but not identical features belong to the same cluster. The second reason is that a protein with repeating units in 3D space also has repeating patterns in the contact map. Therefore, a large protein with a high number of repeating units may have a low number of unique codewords, but still a high total number of codewords.
Fig 5
(A) Number of codewords versus domain size, (B) the number of unique codewords versus domain size.
Fig 6A shows the ratio of unique codewords to all codewords for all protein domains in the SCOPe dataset. When all features belong to different codewords (cluster classes), the ratio of unique words is 1, whereas when a large number of features belong to the same codeword, the ratio is low. For example, if the distance matrix image contains 500 features extracted by the KAZE algorithm, and these features belong to 400 unique codewords (taken from the precomputed codebook), then the ratio of unique words is 400 divided by 500, i.e., 0.8, meaning that 80% of the codewords are unique. We can see that the ratio of unique codewords ranges from values as high as 1.0 down to below 0.2 (i.e., only 20% of the features are unique). Moreover, we observe a negative correlation (R = -0.74) between the domain size and the ratio of unique words. The histogram of the ratio of unique words (Fig 6B) shows that most (~96%) protein domains have a ratio between 0.7 and 1.0, and only a few (~0.6%) domains have a ratio below 0.4, indicating that protein domains with a low ratio of unique codewords contain many repeat units in 3D space. We selected three domains with a unique codeword ratio below 0.2 and three domains with a unique codeword ratio above 0.92.
Fig 6
(A) Ratio of unique codewords versus domain size, (B) distribution of the ratio of unique codewords.
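Given a codeword histogram, the unique-word ratio described above can be computed directly: the number of distinct codewords used (non-zero bins) divided by the total number of features (the histogram sum). A minimal sketch, with a toy histogram constructed to reproduce the 400/500 example from the text:

```python
import numpy as np

def unique_word_ratio(histogram):
    """Ratio of unique codewords to all codewords for one domain:
    distinct codewords used / total number of extracted features."""
    return np.count_nonzero(histogram) / histogram.sum()

# Toy histogram over a 5000-word codebook: 500 features in 400 codewords
hist = np.zeros(5000)
hist[:300] = 1      # 300 codewords used once
hist[300:400] = 2   # 100 codewords used twice
ratio = unique_word_ratio(hist)   # 400 / 500 = 0.8
```

A solenoid domain, whose repeats map many features onto the same few codewords, would produce a histogram with a much lower ratio.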
As can be seen in Fig 7, the protein domains with a low ratio of unique codewords are solenoid domains containing repetitive structural units (Fig 7A–7C); in contrast, all three selected domains with a high ratio of unique codewords represent globular proteins (Fig 7D–7F). Solenoid proteins and proteins containing tandem repeats [27] have been, and still are, the subject of extensive research, as they have been shown to play fundamental roles in many biological processes such as signal transduction and molecular recognition [28, 29]. Due to their elongated structures and flexibility, Ankyrin, Leucine-rich, and HEAT-repeat proteins are involved in protein-protein interactions [30]. Fibromodulin, for example, is a Leucine-rich repeat protein of the extracellular matrix that binds to fibrillar collagens and affects fibril velocity [31]. In addition, solenoid proteins play a crucial role in interactions with nucleic acids; Exportin-5, for example, is a transporter for microRNA [29]. Therefore, we linked the RepeatsDB database (https://repeatsdb.bio.unipd.it/) [32] with our SCOPe dataset and searched for domains from our dataset that are also included in RepeatsDB, which contains annotated tandem repeat protein structures. S2 Fig shows that the domains present in RepeatsDB have a lower ratio of unique words than the domains that are not part of RepeatsDB and do not have tandem repeats. In addition, a weaker correlation (in absolute value) between domain size and the ratio of unique words is observed for the repeat protein structures (R = -0.63) than for the domains that are not part of RepeatsDB (R = -0.80), see S2 Fig. Overall, these results suggest that the ratio of unique words of solenoid and tandem repeat proteins is shifted towards lower values, and that a weaker correlation (in absolute value) between domain size and unique codewords is observed for these proteins.
Fig 7
(A, B, and C) Domains with a low number of unique codewords (below 0.2), SCOPe domains from left to right: d1o6va2, d2q4gw_, d4cj9a. (D, E, and F) Domains with a high percentage of a unique word (above 0.92) and a domain size above 200 residues, SCOPe domains from left to right: d1w55a1, d2zxea1, d4kb1a_. All ribbons were plotted by UCSF Chimera [33].
Discrimination between solenoid and globular proteins
To further investigate the typical ratio of unique codewords for globular and solenoid proteins, we used a solenoid and non-solenoid protein dataset that has been used in many studies and serves as a benchmark for distinguishing between globular and solenoid proteins. Using a codebook based on SCOPe domains, we calculated the ratio of unique codewords for the benchmark set. The boxplots in Fig 8 show the ratio of unique codewords for globular and solenoid proteins. Globular proteins have a much higher ratio of unique codewords than solenoid proteins, indicating a more diverse structure without repeated 3D units. Based on the data in Fig 8, a protein with a ratio of unique codewords below ~0.6 is likely a solenoid protein.
Fig 8
The ratio of unique codewords of globular and solenoid proteins for a benchmark dataset.
Furthermore, visual inspection of the codeword frequency histograms of globular and solenoid proteins reveals clear differences (Fig 9A and 9B). The histogram of the globular protein (Fig 9A) contains many codewords with low frequencies, and the frequencies are relatively evenly distributed. In contrast, the histogram of the solenoid protein (Fig 9B) contains a few codewords with very high frequencies and many codewords with very low frequencies.
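The contrast between the two histogram shapes can be reproduced with synthetic data. The sketch below is illustrative Python (the paper used MATLAB's bag-of-features tools); the vocabulary size, sample counts, and "solenoid-like"/"globular-like" assignments are our assumptions, chosen only to mimic peaked versus flat histograms.

```python
import numpy as np

def codeword_histogram(assignments, vocab_size):
    """Normalized codeword-frequency histogram: the feature vector
    that is later fed to the classifier."""
    hist = np.bincount(assignments, minlength=vocab_size).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
vocab = 100
# Solenoid-like: structural repeats concentrate features on few codewords.
solenoid = rng.choice(10, size=500)
# Globular-like: features spread over the whole vocabulary.
globular = rng.choice(vocab, size=500)

h_sol = codeword_histogram(solenoid, vocab)
h_glo = codeword_histogram(globular, vocab)
print(h_sol.max() > h_glo.max())              # True: a few very frequent codewords
print((h_sol > 0).sum() < (h_glo > 0).sum())  # True: fewer distinct codewords
```

The peaked solenoid-like histogram has both a higher maximum frequency and fewer distinct codewords, matching the qualitative difference seen in Fig 9.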
Fig 9
(A) Codeword histogram (feature vector) for a globular protein (PDB ID: 1a79) and (B) for a solenoid protein (PDB ID: 1jl5) from the benchmark dataset.
To further investigate the performance of bag-of-features classification, we performed an independent classification of globular and solenoid proteins. First, we randomly divided the dataset into two equal groups, G1 and G2, so that the training and test sets each contained 50% of the data, and then created a new codebook (using only the training dataset) in two sizes: 500 and 1000 words. We performed 2-fold cross-validation, i.e., training on G1 and validation on G2, followed by training on G2 and validation on G1. We used the default settings of the MATLAB function trainImageCategoryClassifier, which trains a multi-class Support Vector Machine (SVM) classifier on a bag-of-features object; the feature vector (codeword frequency histogram, see Fig 9A and 9B for an example) of each protein domain is fed into the classifier. The results for the solenoid and non-solenoid protein dataset, which contains 105 solenoid and 247 globular proteins, are shown in Table 2. The overall accuracy is over 0.99, higher than that of other state-of-the-art methods, which achieve an accuracy of 0.96 or less: the SVM classifier fed with bag-of-features input correctly classified 350 out of 352 domains. The largest improvement is observed for correctly classified solenoids, with an increase in accuracy of 12 and 22 percentage points compared to RAPHAEL and ReUPred, respectively. Table 2 also shows that the method is robust to vocabulary size: when the vocabulary was reduced from 1000 to 500 words, the accuracy decreased only slightly, from 99.4% to 99.1%.
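The 2-fold evaluation protocol described above can be sketched as follows. This is an illustrative Python version using scikit-learn, not the authors' MATLAB pipeline; the Dirichlet-style synthetic histograms stand in for real codeword histograms and are constructed only so that the two classes differ in shape (peaked versus flat), as in Fig 9.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(42)
vocab = 500

def solenoid_like(n):
    # Repeats concentrate nearly all histogram mass on a few codewords.
    h = np.zeros((n, vocab))
    h[:, :25] = rng.dirichlet(np.ones(25), size=n)
    return h

def globular_like(n):
    # Histogram mass spread thinly over the whole vocabulary.
    return rng.dirichlet(np.ones(vocab), size=n)

# 105 solenoid-like and 247 globular-like "domains", as in the benchmark.
X = np.vstack([solenoid_like(105), globular_like(247)])
y = np.array([0] * 105 + [1] * 247)

# Train on one half, validate on the other, then swap (2-fold CV).
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)
print(scores.mean())
```

On real codeword histograms the classes are of course less cleanly separable; the sketch only shows the cross-validation plumbing and the histogram-to-SVM interface.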
Table 2
Accuracy of globular and solenoid protein classification using a benchmark dataset.
Method                 TP    FP    TN    FN    Solenoid    Globular    Accuracy
RAPHAEL^a              91     1   246    14        86.7        99.6        95.7
ReUPred^b              81     4   243    24        77.1        98.4        92.1
ConSole^c               -     -     -     -       100          95          96
This study:
  Vocabulary = 1000   104     1   246     1        99.0        99.6        99.4
  Vocabulary = 500    103     2   246     1        98.1        99.6        99.1
The data for true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) for RAPHAEL and ReUPred were taken from their original papers.
^a RAPHAEL [20].
^b ReUPred [21].
^c ConSole [22].
Quantitative measure of repeats in protein domains: inter- and intra-comparison
There are several quantitative measures for comparing protein structures: similarity can be summarized by the deviation of the coordinates between two superimposed structures, globularity can be estimated by calculating the radius of gyration [34-36] or eccentricity [37], and the locality of contacts between amino acids is measured by the relative contact order [38]. In this work, the ratio of unique words was used as a quantitative measure of regular repeats in protein domains, suitable for both inter- and intra-comparison of domains. Solenoid proteins belong to different SCOPe classes: WD40 and the Kelch motif belong to the all-β class, HEAT and Armadillo repeats belong to the all-α class, and the Leucine-rich repeat and Ankyrin repeat belong to the α/β and α+β classes, respectively. Table 3 shows the ratio of unique words for six common solenoid domains. The Leucine-rich repeat has the lowest ratio of unique words (0.16), which means that it has the simplest and most regular repeats; in this context, simple repetition means that a relatively small number of visual words is required to describe the repetition in the distance matrix. The WD40 and Kelch motifs were expected to have a higher ratio of unique words (0.46 and 0.53, respectively), because their more spherical shape results in long-range contacts between amino acids that are far apart in the sequence; consequently, each repeat contains both short- and long-range contacts, and more visual words are needed to describe a single repeat. Interestingly, the Ankyrin repeat, which does not have a globular shape, has a relatively high ratio of unique words (0.48) compared to the Leucine-rich motif. Apparently, the repeat consisting of two alpha helices separated by loops, which are also a source of disorder, is more complex than the α/β-horseshoe fold.
Table 3
Unique word ratio for six common protein solenoid domains.
Solenoid domain        SCOPe id    SCOPe family    Ratio
Leucine-rich repeat    d1a4ya_     c.10.1.1        0.16
HEAT repeat            d1b3ua_     a.118.1.2       0.26
Armadillo repeat       d3bcta_     a.118.1.1       0.35
WD40                   d1pgua1     b.69.4.1        0.46
Ankyrin repeat         d1ixva1     d.211.1.1       0.48
Kelch motif            d1gofa3     b.69.1.1        0.53
Domains are sorted by ratio of unique words from lowest to highest.
In addition to inter-comparison, intra-comparison can be used to evaluate repetition within a particular superfamily of the SCOPe database. The fold b.69 (7-bladed beta-propeller) contains 15 superfamilies, which SCOPe lists by date of inclusion in the PDB or publication. We sorted these superfamilies by the ratio of unique words instead (S1 Table). As expected, the sorted list of superfamilies is unrelated to the date of entry into the PDB or publication. The comparison between the highest ratio (0.66) and the lowest ratio (0.33) shows a significant difference between the superfamilies. Visual inspection of the YVTN repeat (superfamily b.69.2) and the N-terminal domain of nitrous oxide reductase (superfamily b.69.3) reveals that the nitrous oxide reductase domain has a more complex structure, as it also contains alpha helices (see S3 Fig). In addition, the high ratio of unique words may indicate that some parts of the oligoxyloglucan-reducing end-specific cellobiohydrolase are disordered, indicating a domain irregularity (S3 Fig). Thus, the ratio of unique words carries information about modularity (extended or globular architecture), complexity, and disorder, and quantitatively measures the repeats in a protein domain.
Conclusion
In this work, we analysed the visual words (codebook) of protein distance matrices. We studied the relationship between the size of the vocabulary and the classification accuracy and found that a codebook with several thousand codewords is required for accurate classification. We also analysed the relationship between the relative frequency of codewords and their coordinates in the distance matrix: codewords with higher relative frequency generally lie closer to the main diagonal. Furthermore, we showed that solenoid domains have a much lower ratio of unique codewords than globular proteins, and that the feature vector (codeword histogram) together with a support vector machine classifier can discriminate between globular and solenoid proteins very efficiently. Further work could investigate whether the codeword histogram is useful for classifying tandem repeats. In addition, a more advanced approach, such as pooling methods, could be used to incorporate spatial information from protein distance matrix patches.
S1 Fig. Support Vector Machine separating hyperplane.
The hyperplane is a function that separates features into multiple classes. In 2D space this function is a line, in 3D space it is a plane, and with more dimensions it is called a hyperplane. The figure shows two classes of data in 2D space, two support vectors, the margin width, and the hyperplane, which in this case is a line.
(TIF)

S2 Fig. (A) Ratio of unique words against the number of features; domains that are members of both RepeatsDB and our dataset are marked with red dots. (B) The ratios of unique codewords for SCOPe domains that are not members of RepeatsDB and for SCOPe domains that are members of RepeatsDB.
(TIF)
S3 Fig. Three 7-bladed beta-propeller domains (SCOPe fold b.69).
The domains are sorted by the ratio of unique words from lowest (left) to highest (right). (A) YVTN repeat (SCOPe: d1l0qa2), unique word ratio uwr = 0.33; (B) domain of oligoxyloglucan-reducing end-specific cellobiohydrolase (SCOPe: d2ebsa1), uwr = 0.62; (C) domain of oligoxyloglucan-reducing end-specific cellobiohydrolase (SCOPe: d2iwka1), uwr = 0.66.
(TIF)
S1 Table. Unique word ratio for the SCOPe fold 7-bladed beta-propeller (b.69), which contains 15 superfamilies.
The SCOPe entries were taken from SCOPe 2.07 and are less than 40% identical. The domains are sorted by the ratio of unique words from lowest to highest.
(DOCX)
Distance matrix images.
The supplemental data includes distance matrix images (grayscale, PNG format) and a MATLAB script.
(DOC)

25 Nov 2021
PONE-D-21-31856
Quantitative analysis of visual codewords of a protein distance matrix
PLOS ONE
Dear Dr. Pražnikar,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

As you will see, both reviewers have issues with the manuscript, both with technical quality and with presentation. One reviewer actually suggested rejection, but I would like to give you the opportunity, if you think you can address the comments, to try to improve the manuscript. The key issues are the scope of the manuscript (would it be better as two manuscripts?), the technical issues with the training, and better articulating the implications of the work.

Kind regards,
Bostjan Kobe, Ph.D.
Academic Editor, PLOS ONE

Reviewers' comments:
1. Is the manuscript technically sound, and do the data support the conclusions? Reviewer #1: No. Reviewer #2: Yes.

2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: N/A. Reviewer #2: Yes.

3. Have the authors made all data underlying the findings in their manuscript fully available? Reviewer #1: Yes. Reviewer #2: Yes.

4. Is the manuscript presented in an intelligible fashion and written in standard English? Reviewer #1: Yes. Reviewer #2: Yes.

5. Review comments to the author:
Reviewer #1: The manuscript submitted by J. Praznikar and A. N. Tharanga describes some features of the protein distance matrix. Although it is potentially interesting, it is rather confused and a drastic revision is necessary. Actually, this manuscript contains two research projects. The first deals with some features of protein distance matrices and the second is a prediction method for solenoid proteins. The first part is described superficially, and it would be difficult for a third party to repeat the calculations. Methods should be described in more detail, describing the algorithms and not just mentioning them. The second part is superficial, too. It is unclear if the problem is interesting and useful. I note that the Authors use a database compiled by Tosatto more than a decade ago. This is not acceptable given the PDB dynamics. As a consequence, the results shown in Table 1 might be biased, too. I strongly suggest the Authors to communicate their results, hopefully in two separate manuscripts, with more details about their methods.

Other issues:
- Line 112: it is unclear to me what "mediodis" means here.
- Line 134: it is unclear why the distance matrices contain the inverse square distances.
- Line 214: did the Authors use other similarity measures beside the cosine similarity measure?
- Line 247: after "computation time", it would be interesting to insert some information about the computational requirements (CPU time etc.).
- Line 258: "other studies". The Authors should mention them.
- Line 317: perhaps "and solenoid protein domains" might be removed.
- Line 336: the "ratio" might be defined by writing an equation.
- Line 342: "most" might be associated with a numerical percentage.
- Line 343: "few" might be associated with a numerical percentage.
- Line 356: "in many biological processes [26-31]" is really very vague. Might it be possible to expand this discussion?
- Line 384: "a database"; which one?
- Line 415: "the gold standard database"; which one?
- Line 421: the methods "RAPHAEL" and "ReUPred" should be described.
- Line 422: "When the vocabulary was reduced from 1000 to 500 words, the accuracy decreased only slightly"; it might be necessary to provide a number.
- Reference 14: is there any other publication beside arxiv.org?
- Figure 4: titles should be provided for the horizontal and vertical axes.
- Figure 5: the expression "(aa)" should follow "Domain size". Moreover, it might be interesting to insert the slopes of the regression lines.

Minor issues:
- Line 87: the reference to the DALI Z-score is probably wrong. This score was designed by Chris Sander and Liisa Holm.
- Line 96: the reference "Sivic J. et al." is misspelled (also in the reference list).
- Line 130: "[16-18]" should be inserted after "database".

Reviewer #2: Authors take a refreshingly different approach to protein structure classification by using a (relatively) novel technique of image analysis by "bag of visual words" to classify distance maps, a particular visualization approach of protein structures. The paper is well written, and the results are presented clearly. I have two problems with the paper.

- Training. No details are given on the creation of the training and testing sets. If, as I assume, the division was done randomly, it is very likely that closely homologous proteins were split between the training and testing sets. Such proteins have practically identical structures and hence distance maps. With thousands of features, many of them rare, it is very likely that the system is simply memorizing the classifications of such homologs; this is probably easy to test. The training and testing sets must be constructed taking into account the relations between proteins. This is critical to the overall validity of the paper.

- Its conclusions are conceptually disappointing. Why introduce a new formalism and do all this work to build a lousy protein structure classifier? The paper provides zero useful information to a structural biologist. What does it tell me that tens of thousands of features are needed to classify a protein? This is a computer science paper written for other computer scientists. If you could appeal to the domain researchers, the paper would be much more valuable.

28 Dec 2021

Dear Editor,

I hereby resubmit a revised and re-edited article. A detailed document containing the authors' responses to the reviewers' comments is attached.
The changes have been incorporated into the manuscript and are highlighted in yellow. We thank all reviewers for their positive evaluation, reading our paper, and valuable comments to improve the manuscript. Following the reviewers' comments, we have addressed the comments and questions in detail below.We have better formulated the implications presented in a new section entitled: “Quantitative measurement of repeats in protein domain: inter and intra comparison". We also stated that the dataset does not contain homologous proteins (sequence identity is less than 35%). We believe that the first part of the paper, which is more related to the development of a new method and the presentation of a new approach to distance matrix analysis, and the second part, which is related to practical examples, are closely related. Moreover, the paper is not too long in our opinion with about 6000 words.*************************************************************************************************************************Reviewer #1:The manuscript submitted by J. Praznikar and A. N. Tharanga describes some features of the protein distant matrix.Although it is potentially interesting, it is rather confused and a drastic revision is necessary. Actually, this manuscript contains two research projects. The first deals with some features of protein distance matrices and the second is a prediction method of solenoid proteins.Methods should be described in more detail, describing the algorithms and not just mentioning them.ANSWER: We have improved the description of the methods, see updated manuscript. We have added a description of the K-means clustering and Support Vector Machines.The first part is described superficially and it would be difficult to repeat the calculations for a third part.ANSWER: For third parties who have Matlab skills, it would be quite easy to repeat the calculations. 
The supplemental data includes all 7308 distance matrix images (grayscale, PNG format) and a Matlab script that calculates the bag of features, histograms (feature vector), and classification accuracy. To repeat the calculations for a vocabulary of 1000 words, third parties can invoke the script (BOW_accuarcy.m) by mouse click in Matlab (or run it from the command window). If the user wants to repeat the calculation with a vocabulary of 5000 words, only one parameter needs to be changed ('vocabulary',5000). We also provided a dataset of globular and solenoid proteins (247+105 images, grayscale, PNG format) and a Matlab script (SVM_2fold.m) to classify proteins using SVM. In summary, the supplementary material contains all the inputs needed to run Matlab scripts.The second part is superficial, too. It is unclear if the problem is interesting and useful. I note that the Authors use a database compiled by Tosatto more than a decade ago. This is not acceptable given the PDB dynamics. As a consequence, the results shown in Table 1 might be biased, too.ANSWER: Indeed, we used data first compiled by Marsella et al. (2009) and available on the website of Tossato's research group. We used this dataset intentionally because it is a gold standard for distinguishing between solenoid and globular proteins and has been used by many others: RAPHEL, ReUPred, ConSole, Repetita, Wavelet, ...Another possibility would be to use a new dataset. In this case, we would have to apply all the other methods and algorithms to the new dataset in order to make a comparison between the different methods/algorithms. Anyway, classification was not the only focus of this work. Rather, we introduced an alternative approach to analyse the distance matrix and introduce a qualitative measure of protein repeats.I strongly suggest the Authors to communicate their results, hopefully in two separate manuscripts, with more details about their methods.ANSWER: We have improved the description of the methods. 
The first part of the paper, which is more related to the development of a new method and the presentation of a new approach to distance matrix analysis, and the second part, which is related to practical examples, are closely related. Moreover, the paper is not too long in our opinion with about 6000 words.Line 112: it is unclear to me what “mediodis” means here.ANSWER: Medoids are representative objects of a dataset or a cluster within a dataset. The idea of K-medoids clustering is to consider the final centres as actual data points rather than mean data points (K-means clustering). In the mentioned study, one hundred medoids (submatrix of size 10x10) were extracted from the training set sampled proteins). Thus, 100 medoids represent a local feature of the protein distance matrix. In calculating the distance matrix, the authors used a threshold of 20 Å. Larger distances were set to "null", so that a submatrix of size 10x10 contained only "null" elements. To make it clearer for the readers, we rephrase the text as follows:For example, Soug-Hou Kim et al. pointed out that the "null" pattern (15th mediodis) is the most common. In the mentioned study, one hundred medoids (submatrix of size 10x10) were extracted from the training set (sampled proteins). Thus, 100 medoids represent a local feature of the protein distance matrix. However, no analysis was performed on the frequency of all 100 mediodis, the most frequent position in the distance matrix (map to image), or the correlation between the number of unique features and the solenoid proteins.Line 134: it is unclear why the distance matrices contain the inverse square distances.ANSWER: This was a (typo) error. We have corrected it as follows:The images of the distance matrix contained the distances between all Cα-atoms. 
Therefore, each distance matrix was represented as a grayscale image in the range [0, 1], where 0 (black) represents long-range contacts and 1 (white) represents short-range contacts.Line 214: did the Authors use other similarity measures beside the cosine similarity measure?ANSWER: No, we used default similarity metric (cosine) of Matlab function “retrieveImages”.Line 247: after “computation time”, it would be interesting to insert some information about the computational requirements (cpu time etc.).ANSWER: We have added the following sentence in the manuscript.The creation of bag-of-visual-words with a vocabulary of 1000, 5000 and 10000 codewords required a CPU time (3.4 GHz processor) of about 20, 27 and 39 minutes, respectively.Line 258: “other studies”. The Authors should mention them.Fixed, we mention them.ANSWER: Fixed, we mention them.We have shown here that the extraction of descriptors in an image of the distance matrix using the KAZE algorithm, followed by the construction of the codebook of 5000 words, gives a good distribution of domains in the low-dimensional space and a good prediction of nearest neighbours, and is consistent with the studies of Hou et al, Choi et al, and Shi et al, who studied the folding space and structural similarity in proteins.Line 317: perhaps “and solenoid protein domains” might be removed.ANSWER: Fixed.Line 336: the “ratio” might be defined by writing an equation.ANSWER: We have added an explanation in the manuscript.For example, if the distance matrix image contains 500 features extracted by the KAZE algorithm, and these features belong to 400 unique codewords (taken from the precomputed codebook), then to calculate the proportion of unique words you need to divide 400 by 500 and get 0.8. 
The value 0.8 means that 80% of the codewords are unique.Line 342: “most” might be associated with a numerical percentage.ANSWER: We add a numerical percentage (~96%), see updated manuscript.Line 343: “few” might be associated with a numerical percentage.ANSWER: We add a numerical percentage (~0.6%), see updated manuscript.Line 356: “in many biological processes [26-31]” is really very very vague. Might it be possible to expand this discussion.ANSWER: We improved text as follows:Solenoid proteins and proteins containing tandem repeats [25] have been and still are the subject of extensive research, as they have been shown to play fundamental roles in many biological processes such as signal transduction and molecular recognition (Andrade, Marcotte). Due to their elongated structures and flexibility, Ankyrin, Leucine-rich, and HEAT -repeat proteins are involved in protein-protein interactions (Kobe). Fibromodulin, for example, is a leucine-rich repeat protein of the extracellular matrix that binds to fibrillar collagens and affects fibril velocity (Svensson). In addition, solenoid proteins also play a crucial role in the interaction with nucleic acids. Exportin-5, for example, is a transporter for microRNA (Fournier).Line 384: “a database”; which one?ANSWER: In Methods, we described the dataset used in previous studies, which serves as the gold standard for distinguishing globular from solenoid proteins. We have expanded the description in the Methods section. We have also added a new section in Methods entitled: "Solenoid and non-solenoid protein dataset - benchmark".Line 415: “the gold standard database”; which one?ANSWER: In Methods, we described the dataset used in previous studies, which serves as the gold standard for distinguishing globular from solenoid proteins. We have expanded the description in the Methods section. We have also added a new section in Methods entitled: "Solenoid and non-solenoid protein dataset - benchmark". 
We also improved the sentence as follows: "The results for the solenoid and non-solenoid protein dataset, which contains 105 solenoid proteins and 247 globular proteins, are shown in Table 2."

Line 421: The methods "RAPHAEL" and "ReUpred" should be described.
ANSWER: Done; see the Methods section entitled "Solenoid and non-solenoid protein dataset - benchmark".

Line 422: "When the vocabulary was reduced from 1000 to 500 words, the accuracy decreased only slightly"; it might be necessary to provide a number.
ANSWER: Done: "When the vocabulary was reduced from 1000 to 500 words, the accuracy decreased only slightly, from 99.4 to 99.1."

Reference 14: Is there any other publication besides arxiv.org?
ANSWER: To our knowledge, there is no other publication.

Figure 4: Titles should be provided for the horizontal and vertical axes.
ANSWER: Done.

Figure 5: The expression "(aa)" should follow "Domain size". Moreover, it might be interesting to insert the slopes of the regression lines.
ANSWER: We added the term "amino acids" to the axis and also added the regression line.

Line 87: The reference to the DALI Z-score is probably wrong. This score was designed by Chris Sander and Liisa Holm.
ANSWER: The reference for DALI and the Z-score (Sander and Holm, 1993 and 2020) is at the very beginning of the Introduction. Hou et al. used the DALI Z-score to construct a similarity matrix of 498 protein folds and create a global representation of the protein folding space. We have included the reference to Sander and Holm, 1993 at line 87.

Line 96: The reference "Sivic J. et al." is misspelled (also in the reference list).
ANSWER: Fixed.

Line 130: "[16-18]" should be inserted after "database".
ANSWER: Fixed.

**************************************************************************************************************************

Reviewer #2:
The Authors take a refreshingly different approach to protein structure classification by using a (relatively) novel technique of image analysis, "bag of visual words", to classify distance maps, a particular visualization of protein structures. The paper is well written and the results are presented clearly. I have two problems with the paper:

- Training. No details are given on the creation of the training and testing sets. If, as I assume, the division was done randomly, it is very likely that closely homologous proteins were split between the training and testing sets. Such proteins have practically identical structures and hence distance maps. With thousands of features, many of them rare, it is very likely that the system is simply memorizing the classifications of such homologs; this is probably easy to test. The training and testing sets must be constructed taking into account the relations between proteins. This is critical to the overall validity of the paper.
ANSWER: The dataset we used is a benchmark dataset that has been used by many other research groups. The dataset contains globular (247) and solenoid (105) proteins, where the maximum sequence identity is 35%. Thus, the dataset does not contain homologous proteins. We have expanded the description in the Methods section and added a new section in Methods entitled "Solenoid and non-solenoid protein dataset - benchmark".

- Its conclusions are conceptually disappointing. Why introduce a new formalism and do all this work to build a lousy protein structure classifier? The paper provides zero useful information to a structural biologist. What does it tell me that tens of thousands of features are needed to classify a protein? This is a computer science paper written for other computer scientists. If you could appeal to the domain researchers, the paper would be much more valuable.
ANSWER: The first part of the paper focused on the general representation and showed that a general classification can also be done using the histogram of codewords (all alpha, all beta, alpha/beta and alpha+beta; see Figure 1). Then we introduced a new quantitative measure, the ratio of unique words. This parameter tends to be low for solenoid proteins and higher for globular proteins, thus quantitatively describing the repeats in the protein domain. Moreover, the histogram of codewords combined with an SVM showed an improvement in discriminating between globular and solenoid proteins. We believe this is valuable for domain researchers. However, we felt that the use of the ratio of unique words, a quantitative measure of repeats in the protein domain, was not adequately represented in the study. Therefore, we have added a section entitled "Quantitative measure of repeats in protein domain: inter and intra comparison" and included two tables and a figure. Please read and view the improved manuscript.

Submitted filename: ResponsetoReviewers.docx

24 Jan 2022

Quantitative analysis of visual codewords of a protein distance matrix
PONE-D-21-31856R1

Dear Dr. Pražnikar,

We're pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you'll receive an e-mail detailing the required amendments. When these have been addressed, you'll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance.
To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Bostjan Kobe, Ph.D.
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

27 Jan 2022

PONE-D-21-31856R1
Quantitative analysis of visual codewords of a protein distance matrix

Dear Dr. Pražnikar:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Professor Bostjan Kobe
Academic Editor
PLOS ONE
References:
Ian Walsh, Francesco G Sirocco, Giovanni Minervini, Tomás Di Domenico, Carlo Ferrari, Silvio C E Tosatto. Bioinformatics. 2012-09-08.
Andrew W Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre, Tim Green, Chongli Qin, Augustin Žídek, Alexander W R Nelson, Alex Bridgland, Hugo Penedones, Stig Petersen, Karen Simonyan, Steve Crossan, Pushmeet Kohli, David T Jones, David Silver, Koray Kavukcuoglu, Demis Hassabis. Proteins. 2019-12.
Lisanna Paladin, Martina Bevilacqua, Sara Errigo, Damiano Piovesan, Ivan Mičetić, Marco Necci, Alexander Miguel Monzon, Maria Laura Fabre, Jose Luis Lopez, Juliet F Nilsson, Javier Rios, Pablo Lorenzano Menna, Maia Cabrera, Martin Gonzalez Buitron, Mariane Gonçalves Kulik, Sebastian Fernandez-Alberti, Maria Silvina Fornasari, Gustavo Parisi, Antonio Lagares, Layla Hirsh, Miguel A Andrade-Navarro, Andrey V Kajava, Silvio C E Tosatto. Nucleic Acids Res. 2021-01-08.