Literature DB >> 35353898

Multiple Profile Models Extract Features from Protein Sequence Data and Resolve Functional Diversity of Very Different Protein Families.

R Vicedomini^1,2, J P Bouly^1,3, E Laine¹, A Falciatore^1,3, A Carbone^1,4.

Abstract

Functional classification of proteins from sequences alone has become a critical bottleneck in understanding the myriad of protein sequences that accumulate in our databases. The great diversity of homologous sequences hides, in many cases, a variety of functional activities that cannot be anticipated. Their identification appears critical for a fundamental understanding of the evolution of living organisms and for biotechnological applications. ProfileView is a sequence-based computational method, designed to functionally classify sets of homologous sequences. It relies on two main ideas: the use of multiple profile models whose construction explores evolutionary information in available databases, and a novel definition of a representation space in which to analyze sequences with multiple profile models combined together. ProfileView classifies protein families by enriching known functional groups with new sequences and discovering new groups and subgroups. We validate ProfileView on seven classes of widespread proteins involved in the interaction with nucleic acids, amino acids and small molecules, and in a large variety of functions and enzymatic reactions. ProfileView agrees with the large set of functional data collected for these proteins from the literature regarding the organization into functional subgroups and residues that characterize the functions. In addition, ProfileView resolves undefined functional classifications and extracts the molecular determinants underlying protein functional diversity, showing its potential to select sequences towards accurate experimental design and discovery of novel biological functions. On protein families with complex domain architecture, ProfileView functional classification reconciles domain combinations, unlike phylogenetic reconstruction. ProfileView proves to outperform the functional classification approach PANTHER, the two k-mer-based methods CUPP and eCAMI and a neural network approach based on Restricted Boltzmann Machines. It overcomes time complexity limitations of the latter.

Entities: Chemical

Keywords: B12-binding domain containing; Haloacid Dehalogenase; Radical SAM; SPASM/twitch domain containing; WW domain; cryptochrome; evolution; functional classification; genome; glycoside hydrolase; metagenome; methylthiotransferase; photolyase; photoreceptor; profile; profile model; protein classification

Mesh：

Substances：
Plant Extracts
Proteins

Year: 2022 PMID： 35353898 PMCID： PMC9016551 DOI： 10.1093/molbev/msac070

Source DB: PubMed Journal: Mol Biol Evol ISSN： 0737-4038 Impact factor: 8.800

Introduction

Functional classification of biological sequences is fundamental to understanding the ever-increasing genomic and metagenomic sequence data accumulating in our databases. This quest depends on the correct domain annotation of coding genes (Ponting and Dickens 2001; De Filippo et al. 2012; Prakash and Taylor 2012) which, in the past, was handled by sequence homology and feature-based approaches. The first and most intuitive approach searches for sequences homologous to already known protein or domain sequences (Hawkins et al. 2006; Wass and Sternberg 2008; Loewenstein et al. 2009; Clark and Radivojac 2011; Törönen et al. 2018) and does so either by direct pairwise sequence alignment or by passing through protein signatures, which are descriptions of protein or domain families derived from multiple sequence alignments. It is based on the “orthology-function conjecture” for which orthologs carry out biologically equivalent functions in different organisms, in contrast to paralogs whose functions typically diverge after duplication (Gabaldón and Koonin 2013). Due to complex processes of evolution, many homologs have diversified their functions and the sequence homology approach should be applied with great awareness: different levels of similarity in homology should induce different levels in functional annotation transfer. This represents a serious pitfall for this approach. A second pitfall is linked to the production of profile models, describing features conserved across sequences. Indeed, these families may consist of a few members highly divergent from each other (rare) or a continuum of thousands of sequences due to the absence of strong functional/evolutionary pressure, which challenges the definition of the family and produces totally degenerated super-family/clan models (most frequent) of restrained use. The second class of methods is based on the selection of an appropriate set of features (such as short sequence segments or wavelet decompositions) (Karchin et al. 2005; Wen et al. 2005; Bonetta and Valentino 2020; Wan and Jones 2020). Other computational schemes use protein structure (Pazos and Sternberg 2004; Pal and Eisenberg 2005; Lee et al. 2007; Dawson et al. 2017), phylogenetics and evolutionary relationships (Eisen 1998; Engelhardt et al. 2005, 2011; Gaudet et al. 2011; Sahraeian et al. 2015; Gumerov and Zhulin 2020), interaction and association data (Deng et al. 2004; Letovsky and Kasif 2003; Vazquez et al. 2003; Nabieva et al. 2005; Sharan et al. 2007; Cao et al. 2014; Pham and Lichtarge 2020), and a combination of these (Shin et al. 2007; Furnham et al. 2012; Boari de Lima et al. 2016; Cao and Cheng 2016; Zhang et al. 2017; Kulmanov and Hoehndorf 2020), with the evident dependence on the availability of different data-types and a large and highly diversified dataset of sequences. Novel computational approaches that classify sequences by function and overcome the intrinsic limitations of existing methods would help screen sequences to design accurate experiments directed to functional testing and to discover new functions. ProfileView is a computational method conceived for this purpose, capable of classifying hundreds/thousands of homologous sequences into functional groups. It is strongly based on the understanding of the structure of sequence data imposed by the evolutionary history of the sequences. The first main step of ProfileView is to encode functional and structural information belonging to the protein family into multiple profile models that capture the diversity of the homologous sequences in the family. Based on the set of different models for the family, the second main step of ProfileView is to define an original sequence space which organizes sequences by function. Biologically interpretable information and functional motifs are extracted from the classification process. In other words, family members are organized in a tree structure, where subfamily delineation is possible thanks to the hierarchical organization. The presence of multiple functions in a family or subfamily makes it desirable to subdivide its members into smaller groups in order to capture differences in function-related features at a lower level than the subfamily. ProfileView representative models and their specific conserved motifs have proven to be good indicators of this functional delineation. ProfileView can be applied on a large scale to a wide variety of datasets. In the past, the usage of multiple profile models demonstrated to be powerful in the context of domain annotation (Bernardes et al. 2016; Ugarte et al. 2018), where they have proven be highly accurate on whole genomes and metagenomic/metatranscriptomic datasets, allowing the discovery of new sequences enriching protein families (Fortunato et al. 2016; Amato et al. 2017). Here, we take on a new challenge and use these models to capture the variety of functional motifs characterizing a protein family. Their construction requires a relatively small number of sequences (a minimum of 20), and therefore, they can encode even functional motifs that are poorly represented within large sets of natural sequences, generating a possibly very large motifs diversification. To highlight its power and generality, we applied ProfileView to seven protein families whose members are characterized by a large functional diversity, multiple members are functionally well-characterized proteins and subfamilies delineations have been validated experimentally together with their functional motifs: the Cryptochrome/Photolyase Family (CPF), the glycoside hydrolase enzymes GH30 family, the enzyme superfamily Haloacid Dehydrogenase (HAD/-PGM/Phosphatase-like subgroup) and four others (the WW domains and three protein subgroups belonging to the enzyme superfamily Radical SAM, the B12-binding domain containing, the Methylthiotransferase and the SPASM/twitch domain containing). These families and subgroups allowed us to demonstrate the power in feature extraction, the simplicity in the interpretability of the results and the methodological approach, and the computational efficiency of ProfileView compared to a recent artificial neural networks approach to sequence classification (Tubiana et al. 2019). Comparisons are also made with the PANTHER classification system (Mi et al. 2012, 2013), the CUPP (Barrett and Lange 2019) and the eCAMI (Xu et al. 2020) platforms. For each protein family, ProfileView agrees with all available experimental data. For those sequences that were not experimentally validated before, ProfileView provided a functional classification supported by functional motifs.

New Approaches

Proteins carry out their functions primarily through their constituent motifs and domains. Motifs and domains are evolutionarily more conserved than other regions of a protein and tend to evolve as units, which are gained, lost, or combined together as one module (Basu et al. 2009). A domain is a conserved sequence pattern, defined as an independent structural unit. It consists of more than 40 and up to 700 residues, with an average length of 100 residues. A motif is a short conserved sequence pattern usually smaller than a domain. It is often associated with a distinct structural site performing a particular function. Our methodological approach to sequence classification, ProfileView, explores domain functions in proteins, possibly with complex domain architectures, and makes the hypothesis that the set of positions in a protein structure that are essential for its functional activity might have evolved within domains in alternative ways leading to functional differentiation within a protein family. Identifying these differences at the residue level then becomes crucial for functional classification, and ProfileView meets the challenge by identifying the diversity of functional motifs from sequences.

Converting Sequences in Multidimensional Vectors with Profile Models

The ProfileView method is outlined hereafter and illustrated in figure 1. ProfileView takes as input a set of homologous sequences and a protein domain, and returns a classification of the sequences in functional subgroups together with functional motifs contained in the domain and characterizing the subgroups.

Fig. 1.

Schema of the ProfileView approach, validation and functional inference. (A) Model library construction in ProfileView: representative sequences of the domain under consideration are selected from the Pfam domain library. For each sequence, ProfileView searches for its close homologs in UniProt (colored disks around the seed Pfam sequence, colored dot) and constructs a profile model seeded from that sequence. (B) Given a set of unaligned sequences to classify, construction of the representation space: homologous protein sequences (light blue dots in sequence space, center) are encoded into multidimensional vectors by the profile models constructed in A. For each sequence, each profile model contributes two numerical scores to the vector, the normalized bit-score and the normalized weighted bit-score (see V and IX in Materials and Methods). By clustering points in the multidimensional representation space, ProfileView outputs a classification tree of the set of sequences where internal nodes are annotated by representative models (black dots), whenever possible. (C) Analysis of the set of sequences in a subtree (see circled root in B) which is endowed with a representative model. The 2-dimensional plot illustrates all sequences in the tree as points defined by their model’s two scores. Note that these scores are described in the two columns of the score matrix in B associated with the model. Sequences in the subtree (pastel blue points) are scored the highest. A functional motif, characterizing sequences in the subtree, is associated with the representative model. (D) ProfileView validation on a set of sequences with a functionally characterized function. Their position is identified in the tree (colored dots) and the subtrees grouping together sequences of the same functional class are used to evaluate ProfileView. The existence of a representative model is indicated by a black dot. (E) Subtrees endowed with representative models in D are used to infer a functional classification of sequences locating near functionally characterized ones. The emerald green subtree, endowed with a representative model (D), groups no characterized sequence, indicating a potentially new functional class. The first main idea of ProfileView is to extract conserved patterns from the space of available sequences through the construction of many profile models for a protein domain family that should sample the diversity of the available homologous sequences and reflect shared structural and functional characteristics. These models, called Clade-Centered Models or CCM (Bernardes et al. 2016; Ugarte et al. 2018), are built as conservation profiles from close sequences. Compared to consensus models (e.g., a pHMM Eddy 1998) which are constructed from large sets of homologous sequences including distant ones, CCMs avoid the loss of functional signals due to distant sequences. To build CCMs, we consider the full set of Pfam sequences associated with a domain (Finn et al. 2014) and, for each sequence , we construct a CCM “seeded” from that sequence; if is too large (e.g., comprising tens of thousand sequences), we sample its sequences by first clustering as explained in Materials and Methods. A CCM seeded from is built as a pHMM from a set of UniProt sequences close to (see Materials and Methods). Such a CCM displays features characteristic of and that might differ for . The more and are divergent, the more the CCMs seeded from them are expected to highlight different features. CCM high specificity, obtained by considering UniProt domain sequences that display a high sequence identity to the seed sequence , captures feature characteristics of protein interaction sites and/or determinants of functional specificity for the protein family. Note that in the past, we constructed CCMs to improve domain annotation (Bernardes et al. 2016; Ugarte et al. 2018) and, for those models, we employed less restrictive conditions for sequence selection in UniProt. In practical terms, this first main idea of ProfileView is implemented into a precompiled library of models associated with the protein domain given as input (fig. 1). The second main idea of ProfileView is to use CCMs to embed input sequences into a multidimensional representation space, where each dimension is associated with a CCM (fig. 1). Namely, for each input sequence to be classified, each model is matched against the sequence, and the value of the match, expressing how close a model is to the sequence, is recorded as a vector entry (see colored rows in the matrix of fig. 1, left). This space is thought of as a “functional space” because nearby sequences, matching similar profile motifs, are supposed to share the same functional motifs. ProfileView clusters sequences (converted into vectors) within this space by hierarchical clustering and provides a classification tree whose internal nodes are, whenever possible, annotated by representative models and by functional motifs (fig. 1, right, and 1). Subtrees endowed with representative models are indicators of functional specificities and we shall argue that they can be used to subdivide family or subfamily members into smaller groups, in order to capture differences in function-related features of the family, that is, creating groups that preferably include only one function (fig. 1). Such subtrees will be described by functional motifs, that is groups of positions, not necessarily consecutive in the protein sequence, that are conserved in most sequences of the subtree. Positions in a functional motif can be specific to the subtree or shared among subtrees allowing for overlapping positions between representative motifs. The steps of ProfileView described in figure 1, identifying subtrees and their representative motifs, do not use any known functional information from experimentally characterized sequences. This latter will be used for validating the method (fig. 1) and for functional inference (fig. 1) while searching for new sequences sharing a known function. All details of the ProfileView pipeline are explained in Method.

Results

Protein Families Analyzed with ProfileView, Validation and Inference

ProfileView was run on seven protein families. Three of the analyzes are detailed below and four in supplementary text 1, Supplementary Material online. The protein sequences to be classified for these families present different characteristics, listed in table 1 (supplementary tables S1, S2, Supplementary Material online) and supplementary text 1, Supplementary Material online. Their sequence length spans from 30aa to 750aa and sequence similarity varies from 30% to more than 50% (supplementary table S1 and text 1, Supplementary Material online). For each family, ProfileView classification is based on one or two similar Pfam domains occurring in their architecture. Length, sequence similarity, and sequence identity for the domain regions contained in the sequences to be classified are reported in supplementary tables S2 and text 1, Supplementary Material online. Number, sequence identity, and sequence similarity of the seed sequences used to construct the ProfileView model libraries are described in supplementary table S3 and text 1, Supplementary Material online. For a comparative view, see supplementary figure S1, Supplementary Material online.

Table 1.

Characteristics of the Seven ProfileView Analyzes.

Superfamily/Family		Characteristics of seqs to be Classified			Information on the Model Library Construction
		# seqs	# filt seqs	# func seqs	Pfam Domain (accession code)	# Pfam seqs	Clust cond	# profile models
Cryptochrome/Photolyase (CPF)*		397	307	72	FAD (PF03441)	4,615	—	4,615
Glycoside hydrolase family 30 (GH30)*		1,803	1,675	695	Glyco-hydro-30 (PF02055)Glyco-hydro-30-2 (PF14587)	1,894	—	1,894
Haloacid Dehalogenase*HAD/β-PGM/Phosphatase-like		391	259	259	HAD (PF12710)HAD_2 (PF13419)	35,416	≥ 40%	4,075
WW domain		349	349	54	WW (PF00397)	5,634	—	5,634
	B12-bindingdomain containing	273	258	258	B12-binding (PF02310)B12-binding_2 (PF02607)	12,241	≥ 60%	3,504
Radical	Methylthiotransferase	400	393	393	Radical_SAM (PF04055)	83,232	≥ 40%	4,501
SAM	SPASM/twitch	128	29	29	SPASM (PF13186)	6,469	≥ 60%	2,663
	domain containing	128	115	115	Radical_SAM (PF04055)	83,232	≥ 40%	4,501

Note.—List of protein families discussed in the main text (starred) and four more discussed in supplementary text 1, Supplementary Material online. For each family, we report some characteristics of the sequences to be classified (number of sequences, number of sequences after filtering (steps II and III of the pipeline), number of sequences with known function) and some information on the model library construction (Pfam domain used in classification, number of Pfam domain sequences, MMseq2 clustering condition (when clustering is applied), number of constructed models). Further features are described in supplementary tables S1, S2, Supplementary Material online. The SPASM/twitch domain containing family is considered twice because classified both with the SPAM domain and the Radical_SAM domain.

Characteristics of the Seven ProfileView Analyzes. Note.—List of protein families discussed in the main text (starred) and four more discussed in supplementary text 1, Supplementary Material online. For each family, we report some characteristics of the sequences to be classified (number of sequences, number of sequences after filtering (steps II and III of the pipeline), number of sequences with known function) and some information on the model library construction (Pfam domain used in classification, number of Pfam domain sequences, MMseq2 clustering condition (when clustering is applied), number of constructed models). Further features are described in supplementary tables S1, S2, Supplementary Material online. The SPASM/twitch domain containing family is considered twice because classified both with the SPAM domain and the Radical_SAM domain. ProfileView was validated on seven independent test sets constituted by human curated functionally characterized sequences belonging to these seven protein families. Within a family, sequences are characterized in several functional subgroups, going from a minimum of four to a maximum of nine (table 2 and supplementary text 1, Supplementary Material online). In particular, the difficulty in classification is expected to be nonuniform over different functional subgroups. Figure 1 describes the validation pipeline (see Materials and Methods).

Table 2.

Summary of ProfileView Performance in Classifying Functionally Characterized Sequences.

Protein Family	# func subgrs	Validation on Subtrees						Table	Figure
		With Models			w/o Models
		U	M	W	U	M	W
Cryptochrome/Photolyase (CPF)	5	1	0	71	1	0	71	S4	S2
Glycoside hydrolase family 30 - CAZy families	9	106	5	584	0	1	694	S6	5
HAD/β-PGM/Phosphatase-like	6	0	0	259	0	0	259	S8	6

Note.—For each protein family, the number of functional subgroups used in the evaluation is reported. The total number of unclassified (U), misplaced (M), and well-classified (W) sequences is identified with respect to subtrees endowed with a representative model or not. Names of supplementary tables and figures where ProfileView performance is described in detail, for each functional subgroup, are given.

Summary of ProfileView Performance in Classifying Functionally Characterized Sequences. Note.—For each protein family, the number of functional subgroups used in the evaluation is reported. The total number of unclassified (U), misplaced (M), and well-classified (W) sequences is identified with respect to subtrees endowed with a representative model or not. Names of supplementary tables and figures where ProfileView performance is described in detail, for each functional subgroup, are given. To validate ProfileView classification, we determined whether functionally characterized sequences of the same functional group localize together in the ProfileView tree. Ideally, one would like ProfileView to split characterized sequences belonging to functional subgroups into distinguished subtrees endowed with representative models (fig. 1). Hence, if sequences belonging to the same functional class are grouped together in a single subtree endowed with a representative model (see Materials and Methods), we consider them well-classified (W in table 2). Some sequences might remain unclassified (U), some others misplaced (M) in subtrees of the wrong functional subgroup, and several sequences of the same functional subgroup might group together in some subtree which is not represented by a model. These different possibilities are indicators of the difficulty in classifying sequences within subgroups. Dropping the condition on the existence of representative models on subtrees, allows to show that, very often, the topology of classification trees groups together unclassified sequences belonging to known functional groups, as for the Glycoside hydrolase family 30 (GH30) for instance (table 2). In table 2 and supplementary text 1, Supplementary Material online, for each protein family, we provide a summary of ProfileView performance by reporting the total number of unclassified, misplaced and well-classified sequences in ProfileView trees. A detailed description of ProfileView performance, for each functional subgroup, is found in supplementary tables and supplementary figures cited in table 2 (supplementary text 1, Supplementary Material online). ProfileView identified a large number of functionally known positions and specific protein residues in interaction with either nucleic acids, amino acids or small molecules. For two families, the CPF and the GH30, we shall show in detail how ProfileView can provide a functional classification for a large number of functionally uncharacterized sequences (fig. 1 and table 2), and novel information on conserved amino acids that could be useful to design testing experiments (see also the detailed analysis of the WW domain family in supplementary text 1, Supplementary Material online). Subtrees endowed with representative models and grouping sequences of a specific function are used to infer the function for all sequences in the subtree, and subtrees endowed with representative models and grouping sequences with no functional characterization (e.g., emerald green tree in fig. 1) are used as indicators of potentially new functional classes.

ProfileView on the CPF Family

The CPF, involved in the interaction with nucleic acids, amino acids, and small molecules, is widely distributed in all kingdoms of life (Sancar 2003; Brettel and Byrdin 2010; Chaves et al. 2011; Jaubert et al. 2017). CPF members share the same fold, yet can perform very different functions and have completely different partners: cryptochromes (CRY) are mainly photoreceptors (PR) using light to activate specific signaling pathways; some CRY also acts as light-independent transcriptional regulators of the circadian clock; photolyases (PL) are light-activated enzymes repairing UV-damaged DNA (cyclobutane pyrimidine dimer (CPD) lesions or (6-4) lesions). There are five main CPF functional groups: circadian, (6-4) photolyase, CPD photolyase, ssDNA photolyase, and photoreceptor. (See supplementary text 2, Supplementary Material online, section 1, for more description.) On the other hand, sequence similarity highlighted a finer classification of CPF proteins splitting the five groups in several subgroups (see, for instance, Emmerich et al. 2020). Based on the current literature, we could identify 11 distinct subgroups: (6-4) photolyase, Animal photoreceptor cryptochrome (PR CRY), transcriptional regulator, CRY DASH, CRY Pro, Classes I, II, III CPD photolyase, Plant-like photoreceptor CRY, Plant photoreceptor CRY, and a new NCRY subgroup. Some of these subgroups of sequences have been experimentally characterized to share the same function, as (6-4) photolyase and CRY Pro (fig. 2, supplementary fig. S2, Supplementary Material online), and some others are associated with very similar sequences, as (6-4) photolyase, Animal photoreceptor CRY, and transcriptional regulator (supplementary figs. S3, S4, Supplementary Material online).

Fig. 2.

ProfileView representation space for the CPF family, classification tree, validation on experimental data, and inference. (A) Two-dimensional projection of the ProfileView representation space for 307 FAD-binding domain CPF sequences obtained by Principle Component Analysis (PCA). The axes correspond to the first and second PCA components explaining the 63.8% and 17.6% of the dispersion, respectively. Seventy-two experimentally functionally classified sequences are colored (legend “Function”). Unclassified sequences are left light gray. When a sequence is known to have a double function, both colors are indicated and the inside color refers to the known primary function. For instance, five of the eight ssDNA photolyase sequences (red purple) located on the top of the plot are double function (compare to the first ring in B). (B) ProfileView classification tree. External colored squares define known functions for the sequences (legend “Function”). Some functionally characterized sequences are known to hold multiple functions and are labeled by two colors. The function “signaling” (dark gray) refers to signaling processes of different nature (photoreceptor, transcription, unknown). Numbers on the internal nodes correspond to the percentage of sequences in the corresponding subtree that are separated from the remaining sequences in the tree by the a representative model occurring in the model library (see supplementary fig. S2, Supplementary Material online for details). Colored subtrees are identified by representative models and they correspond to known CPF classes (legend “Cryptochrome/Photolyase Family”), with the exception of the NCRY subtree. (C) Inferred function for unclassified sequences (gray dots in A), where colors (legend “CPF”) correspond to the identified subtrees endowed with representative models on the ProfileView tree in B. In our analysis, we make the hypothesis that the FAD (flavin adenine dinucleotide) binding domain, occurring in all CPF sequences, contains all functional information leading to a functional diversification of the family. Indeed, all CPFs noncovalently bind FAD and share a mechanism of FAD photoreduction by intra-protein electron transfer (Björn 2015). FAD can be in different oxidation and protonation states (Sancar 2003), specifically associated with different functions. The FAD domain is known to interact specifically either with the damaged DNA, with other domains present in CPF proteins (e.g., C-ter extensions in some photoreceptor cryptochromes) or with other protein partners (Czarna et al. 2013). ProfileView is validated on two different types of data: functionally characterized CPF sequences and functionally characterized positions within CPF sequences. These latter are compiled in a manually curated list of positions (supplementary file, Supplementary Material online “CPF_mutants_used_for_validation.xlsx”) from the literature. Furthermore, we combined them with structural modeling to analyze CPF subgroups in detail.

Validation of ProfileView on the Functional Diversity of CPF Members

The ProfileView representation space shows a coherent organization, where sequences with the same functional characterization (see “Function” in fig. 2 and and supplementary fig. S2, Supplementary Material online) tend to occur together in space (fig. 2, table 2). Their localization is analyzed further in the ProfileView classification tree, comprised of 11 subtrees ( in fig. 2 and supplementary fig. S2, Supplementary Material online). We observe an almost perfect split of 71 out of 72 functionally characterized CPF sequences (see labels in the outer circles of fig. 2, colors as in “Function”) across the eight distinguished subtrees , all endowed with a representative model. Their classification in the five CPF functional classes is described in supplementary table S4, Supplementary Material online. Note that the (6-4) photolyase class divides in two subtrees, distinguishing (6-4) photolyase sequences () from the known CRY Pro sequences (). This split, also recognized in the phylogenetic trees of the CPF family (supplementary figs. S3 and S4, Supplementary Material online; see also figs. 3, 3 and 3), is due to sequence divergence highlighting specific traits of the CRY Pro sequences such as the four FeS-binding cysteines which are missing in the (6-4) photolyase subgroup (Ma et al. 2019). In contrast, the CPD photolyase class divides in two distinguished subtrees corresponding to the subgroups Classes I and III CPD photolyase () and Class II CPD photolyase (), which are not identified in the phylogenetic trees of the CPF family. In particular, ProfileView further divides Classes I and III CPD photolyase into two subtrees, one for Class I CPD photolyase () and the other for Class III CPD photolyase (). Finally, the photoreceptor sequences are divided in two subtrees, one grouping Animal photoreceptor CRY () and the other Plant photoreceptor CRY ().

Fig. 3.

Topological comparison between the ProfileView classification tree and the phylogenetic trees for the CPF family and the FAD-binding domain. (A) Schema illustrating the topological structure of the ProfileView tree in figure 2 and supplementary figure S2, Supplementary Material online. Colors correspond to subtrees where the characterized sequences of the same functional group are over-represented (bottom). The domain architectures known to be characteristic of each subtree are reported (see B for more details). (B) Domain architectures for proteins belonging to different subtrees of A are reported (colors as in A). C- and N-terminal regions are indicated with gray boxes. Dashed border lines indicate terminal regions present only occasionally in an architecture. (C) Scheme of the main topological structure of the CPF phylogenetic tree constructed from the 307 CPF sequences containing the FAD binding domain. Colors as in A. See the CPF phylogenetic tree in supplementary figure S3, Supplementary Material online. (D) Scheme of the main topological structure of the FAD phylogenetic tree constructed from the 307 FAD-binding domain sequences. Colors as in A. See the FAD phylogenetic tree in supplementary figure S4, Supplementary Material online. (E,F) Two zooms on subtrees of the ProfileView classification tree involving classes of CPF sequences described in A. Colors as in A. Most importantly, at the root, the ProfileView tree topology organizes large subtrees consistently with known functional subgroups (fig. 3). Namely, the ProfileView tree separates light-independent circadian transcriptional regulator CRY () from the light-dependent (6-4) photolyase () and Animal photoreceptor CRY (; fig. 3). It also clearly separates the DNA repair (6-4) photolyase from the Animal photoreceptor CRY. It reconciles classes I and III CPD photolyase into a single subtree (), while keeping them distinct (), and it clearly separates them from Plant () and Plant-like photoreceptor CRYs (; fig. 3). For the characterized sequences displaying double function (supplementary fig. S2, Supplementary Material online), their DNA repair/photolyase activity (either CPD or (6-4)) is consistently determined by ProfileView that groups these sequences in the photolyase subtrees. At the best of our knowledge, these sharp separations, in agreement with known functional characterizations, have never been obtained by sequence analysis before. Interestingly, the ProfileView tree allowed for the identification of a yet functionally uncharacterized subtree (, named NCRY; see the light beige subtree in figs. 2 and 3) of proteins showing strong sequence divergence. The same subtree was also identified by sequence similarity network analysis in (Emmerich et al. 2020) without inferring any functional classification for it, and by the phylogenetic tree based on the FAD-binding domain in CPF sequences (FAD tree, for short; fig. 3) which places it close to the Animal photoreceptor CRY () and CRY DASH (). ProfileView positions NCRY close to the Plant photoreceptor CRY and Plant-like photoreceptor CRY (green subtrees in fig. 3). In contrast, the phylogenetic tree of CPF sequences (CPF tree, for short) includes NCRY within Class I CPD photolyase (cyan subtree in fig. 3). To our knowledge only one protein from this family has been characterized and it was shown to bind FAD but to lack DNA repair/photolyase activity (Worthington et al. 2003) which is in accordance with the position of this family in our functional tree. This finding highlights the potential of ProfileView to reveal novel functional classes within a protein family. (See also supplementary fig. S5, Supplementary Material online.) Overall, the 11 subtrees in figure 3 ( in figure 2 and supplementary fig. S2, Supplementary Material online) are uniquely associated with known functional classes (see supplementary table S4, Supplementary Material online). This provides the first proof of the method’s classification power for inferring known functions and suggesting potentially new ones.

Comparison of the ProfileView Tree with the FAD and CPF Phylogenetic Trees

The comparison of ProfileView classification tree (fig. 2) with the CPF tree (supplementary fig. S3, Supplementary Material online) and the FAD tree (supplementary fig. S4, Supplementary Material online) highlights important differences in the topological organization of major functional classes. A drawing in figures 3, 3 and 3 compares the three trees for easy visualization. We notice that the CPF phylogenetic tree (fig. 3): (1) incorrectly groups sequences exhibiting disparate functions, for instance Plant photoreceptor CRY and Plant-like photoreceptor CRY are clustered within Class III CPD photolyase; (2) hides the NCRY subtree within class I CPD photolyase; (3) mixes light-dependent and light-independent proteins in a subtree where Animal photoreceptor CRY and circadian transcriptional regulator are clustered within (6-4) photolyase sequences. It is interesting to notice that the compatibility of domain architectures associated with different functional classes of CPF sequences (fig. 3) is coherent with the ProfileView tree topology (fig. 3, bottom) and much less so with the CPF phylogenetic tree. Compare, for instance, the architectures for the classes Plant-like photoreceptor CRY, Plant photoreceptor CRY and NCRY, or those for classes I and III CPD photolyase. All members of these classes have a PHR domain in which a specific CPF FAD-binding domain is found, but C- and N-terminal extensions of variable sequence or length. The architectures for Plant-like photoreceptor CRY, Plant photoreceptor CRY and NCRY possess N- or C-terminal extensions whereas classes I and III CPD photolyase only possess the PHR domain. Classes which are topologically close in the ProfileView tree preserve sequence/length characteristics of C- and N-terminal regions and agree with what is expected in contrast to the subtrees of the CPF phylogenetic tree. Similar observations can be highlighted by comparing the ProfileView tree with the FAD phylogenetic tree (fig. 3).

Representative Models, Motifs and Validation of ProfileView on Functionally Characterized Positions

ProfileView associates representative models and functional motifs to the subtrees of its classification tree. They are used to highlight subfamily delineations and molecular determinants underlying functions and interactions, respectively. A representative model for a subtree of the ProfileView tree is a profile model that ideally “separates” the sequences in a subtree from all other sequences in the tree (step IX of ProfileView pipeline in Materials and Methods). Representative models can be used to subdivide members of a family or subfamily into smaller groups to capture function-related differences at a lower level of the ProfileView tree, that is, creating groups that preferably include only one function. CPF subtrees corresponding to known functional classes in figure 2 and supplementary figure S2, Supplementary Material online are characterized by representative models that separate at least 50% of the sequences in a subtree from all other sequences in the ProfileView tree (see labels reporting the proportion of sequences supported by a model on the nodes in supplementary fig. S2, Supplementary Material online). An automatic procedure in ProfileView identifies representative models. Given a representative model for a subtree, the set of conserved positions in the model uniquely defines a motif for the subtree (step X of ProfileView pipeline in Materials and Methods). The motifs associated with the 11 CPF functional subtrees are reported in supplementary figures S6, S7, Supplementary Material online with the exception of Classes I and III CPD photolyase, known to share the same function, which we grouped together by considering the representative model of the minimal subtree including both classes. The only subtree associated with two distinct representative models, covering two different regions of the FAD-binding domain sequence, is the light-independent transcriptional regulator tree (supplementary fig. S8, Supplementary Material online, fig. 4 and ). Figure 4 shows the alignment of the two transcriptional regulator motifs with the (6-4) photolyase motif and the Animal photoreceptor CRY motif, where positions 50–126 are not covered by the two transcriptional regulator motifs. These positions comprise the FAD-binding domain region directly involved in proton or electron transfer to the FAD chromophore and provide evidence that proton/electron transfer is not involved in the function of light-independent transcriptional repressors (fig. 4) despite the importance of the FAD chromophore in their regulation (Hirano et al. 2017).

Fig. 4.

Transcriptional regulator motifs and their comparison with (6-4) photolyase and Animal photoreceptor CRY motifs. (A,B) two motifs of conserved residues present in light-independent transcriptional regulator sequences. They are extracted from two representative models (supplementary fig. S8, Supplementary Material online) of the “yellow” subtree of figure 3 and (see bottom). Numbers (under the letters) correspond to positions in a model, and they are not comparable between motifs. A colored dot, piled below a motif, indicates that the corresponding position is well conserved (see Materials and Methods) in the representative model of the subtree of that color in figure 3. Circled dots indicate positions that are less conserved (see Materials and Methods). For each motif, colored dots are ordered, from top to bottom, depending on best E-values given by hhblits to the pairwise model alignments. (C) Representative motifs associated with the transcriptional regulator (yellow), (6-4) photolyase (red) and Animal photoreceptor CRY (orange) subtrees of ProfileView tree (bottom) are aligned. Numbered positions correspond to conserved positions belonging to the associated representative motif. The absence of the number indicates less conserved positions. The alignment has been constructed using transcriptional regulator motifs as template models and all others as query models. The length of a motif depends on the length of the associated model, selected as best representing the sequences in a subtree. (D) PDB structure (4CT0) of the interacting mouse cryptochrome mCRY1 (grey) and Period2 mPER2 (black) involved in the circadian clock. Residues L6, N38, L42, and K44 are specific to light-independent transcriptional regulators (supplementary text 2, Supplementary Material online). To validate ProfileView motifs, we exploited the functional information derived by characterized mutations and looked whether their conserved amino acid positions would identify known functional natural variations, single amino acid residue replacements by site-directed mutagenesis or random mutagenesis, and structural specificity when structures were available. For this purpose, we manually curated a list of experimentally characterized positions in the CPF sequences (see supplementary file, Supplementary Material online “CPF_mutants_used_for_validation.xlsx”). Most of these positions display mutations causing loss of function or phenotypic changes. They are often involved in binding with other proteins, DNA substrates or with the cofactor FAD; active amino acids involved in catalytic or allosteric sites, such as DNA repair for photolyases or post-translational modifications in CRY, are also identified. Supplementary table S5, Supplementary Material online summarizes how many ProfileView positions are validated by current experimental evidence. Interestingly, ProfileView finds a number of highly specific positions for CPF functional classes that have not been reported in the literature before. We discussed these positions together with other observations in supplementary text 2, Supplementary Material online. They illustrate the great deal of functional information that can be extracted from representative motifs and be used to design tailored experiments for discovering new functional activities or novel biological mechanisms involving the FAD-binding domain.

How ProfileView Representative Motifs Distinguish Evolutionary Close Sequences?

ProfileView can distinguish very similar sequences associated with different functions. We illustrate this crucial feature with a concrete example, based on representation models and motifs. CPF sequences U5NDX3 and R7UL99 are grouped together by phylogenetic analysis because they are very similar (sequence identity is 61.8% and sequence similarity is 74.7%) and are classified in different functional groups by ProfileView, as a photolyase and a transcriptional regulator respectively. The conserved positions belonging to the photolyase functional motif (motif called “(6-4) photolyase” in supplementary fig. S6, Supplementary Material online) are shown in the alignment reported in supplementary figure S9, Supplementary Material online. For almost all positions in the motif the corresponding amino acid is conserved in both sequences (green) as expected by the high sequence identity of the alignment. For positions 1 (L), 33 (I), and 135 (K) in the motif, the amino acid is conserved only in the U5NDX3 sequence (the corresponding amino acids in R7UL99 are colored blue). Viceversa, positions 136 (K) and 160 (Y) in the motif are conserved in R7UL99 but not in U5NDX3. Even in the presence of this high sequence conservation, the (6-4) photolyase motif distinguishes the sequences by providing higher matching value for U5NDX3 than for R7UL99. Among the five positions, two of them, 1 and 135, are highly conserved in the photolyase family and variable in the transcriptional regulator family (see missing yellow dots below the (6-4) photolyase motif in supplementary fig. S6, Supplementary Material online) making the U5NDX3 sequence closer in classification space to the photolyase subgroup than R7UL99. Note that these observations concern the dimension of ProfileView classification space which is associated with the “(6-4) photolyase” model. Ultimately, it is the contribution of all profile models, one for each dimension of the space, that will define the position of the sequences bringing them closer either to the photolyase subgroup or the transcriptional regulator subgroup. A second example is reported in supplementary figure S10, Supplementary Material online for sequences Q6MDF3, D8UF46, and Q485Z2. The position of the sequences in the CPF phylogenetic trees (supplementary figs. S3 and S4, Supplementary Material online) could wrongly suggest an ancestral function, conserved in paraphyletic groups separated by clades where neofunctionalization would occur. In contrast, a sequence alignment analysis (see legend in supplementary fig. S10, Supplementary Material online) driven by representative motifs highlights specific positions that explain the functional classification of the three sequences.

ProfileView on the GH30 Family of the CAZy Database

The glycoside hydrosylases (EC 3.2.1.-), in short GH, are a widespread group of enzymes which hydrolyse the glycosidic bond between two or more carbohydrates or between a carbohydrate and a noncarbohydrate moiety. Their classification, based on substrate specificity and occasionally on molecular mechanisms, has proven to be particularly difficult. For this purpose, a vast knowledge on these enzymes has been meticulously curated in the CAZy database (Lombard et al. 2014). The GH30 is one of the GH families that has been organized in subfamilies in CAZy (http://www.cazy.org/GH30.html). It counts nine different subfamilies (GH30-1,…, GH30-9) corresponding to 11 different enzymatic chemical reactions. Some of these subfamilies are functionally classified in CAZy and some others are left unclassified.

Validation of ProfileView on the Functional Diversity of GH30 Sequences

We considered a set of 1675 GH30 sequences and their 695 functionally classified sequences in CAZy (table 2). ProfileView representation space and ProfileView tree for these sequences have been constructed using models coming from two similar PFAM domains: PF02055 (Glyco_hydro_30) and PF14587 (Glyco_hydr_30_2). ProfileView classifies 584 out of 695 sequences well, within eight subtrees ( in fig. 5) and leaves 106 sequences unclassified and five misplaced (supplementary tables S6 and S7, Supplementary Material online, and fig. 5). Noticeably, all unclassified sequences in CAZy subfamilies GH30-3, GH30-4, and GH30-5 are grouped by ProfileView into three subtrees missing a representative model (, respectively). All misplaced sequences in CAZy subfamily GH30-6 are grouped into the same subtree missing a representative model (). Furthermore, the same subtrees separate well the EC numbers in CAZy functional annotation (supplementary table S7, Supplementary Material online).

Fig. 5.

ProfileView tree of GH30 sequences. The tree is based on the construction of models for the two pfam domains PF02055 (Glyco_hydro_30) and PF14587 (Glyco_hydr_30_2). Black dots in the tree indicate the existence of representative models separating at least 75% of the sequences in the subtree (note that lowering the threshold to 50% provides comparable results). The first external ring contains the labels of CAZy subfamilies (GH30-1, …, GH30-9), also indicated in larger characters on the annotated tree for an easier reading. Sequences and their classification correspond to those used in figure 3 of Barrett and Lange (2019). The second ring reports the existence of an “EC number” providing the functional annotation in CAZy. The EC numbers and their associated colors are indicated on the top left (GH30-1: 3.2.1.45 and 3.2.1.21+3.2.1.37; GH30-2: 3.2.1.37; GH30-3: 3.2.1.75; GH30-4: 3.2.1.38; GH30-5: 3.2.1.164; GH30-6: –; GH30-7: 3.2.1.*; GH30-8: 3.2.1.8, 3.2.1.136, 3.2.1.8+3.2.1.136; GH30-9: 3.2.1.31). The third ring reports CUPP clustering (Barrett and Lange 2019). Different colors are used to indicate different CUPP clusters. See supplementary table S6, Supplementary Material online. The fourth and most external ring reports eCAMI clustering (Xu et al. 2020). Different colors are used to indicate different eCAMI clusters. In figure 5, the existence of multiple representative models for the internal nodes of the ProfileView classification tree highlights a possible functional sub-characterization for several CAZy subfamilies. For instance, note that the two CAZy reactions 3.2.1.45 and 3.2.1.21+3.2.1.37 for GH30-1 are identified in distinguished subtrees (green and violet labels are associated with reactions 3.2.1.45 and 3.2.1.21+3.2.1.37 in fig. 5) separated by a representative model. Furthermore, for the GH30-3 subfamily, several sequences labeled by CAZy reaction 3.2.1.75 occur in different subtrees endowed with representative models, highlighting potential functional differences within this subfamily.

ProfileView on the Enzyme Superfamilies of the Structure-Function Linkage Database

The Structure–Function Linkage Database (SFLD) is a manually curated classification resource describing structure–function relationships for functionally diverse enzyme superfamilies (Schnoes et al. 2009; Akiva et al. 2014). Despite their different functions, the members of these superfamilies “look-alike,” which facilitates annotation errors. We challenge ProfileView against these sets of sequences and show that its classification meets the functional information in SFLD. SFLD is organized in superfamilies whose members are subdivided into subgroups using sequence information, and finally into families, that is, sets of enzymes known to catalyze the same reaction using the same mechanistic strategy. Subgroups are not organized by function, and the functional specificity of the sequences is detailed at the family level. We consider two different superfamilies, Haloacid Dehydrogenase and Radical SAM, because of their wide variety of functions. Indeed, the Haloacid Dehydrogenase family is characterized by 25 subgroups organized in 22 families and 20 different reactions, and the Radical SAM family by 58 subgroups organized in 98 families and 85 reactions (see sfld.rbvi.ucsf.edu/archive/django/superfamily/index.html for a detailed description). We analyzed the HAD/-PGM/Phosphatase-like subgroup of Haloacid Dehydrogenase and three subgroups of Radical SAM: B12-binding domain containing, Methylthiotransferase and SPASM/twitch domain containing (see supplementary text 1, Supplementary Material online). ProfileView functional classification has been validated on the SFLD families associated with the four subgroups.

ProfileView on the HAD/-PGM/Phosphatase-like Subgroup

The 259 characterized functions included in this subgroup comprise 2-haloacid dehalogenase, beta-phosphoglucomutase, phosphonoacetaldehyde hydrolase, and phosphatases of various specificities (see sfld.rbvi.ucsf.edu/archive/django/subgroup/1129/index.html). We run ProfileView on a model library constructed from the two similar Pfam domains HAD and HAD_2 (see tables 1 and 2). ProfileView classifies well all-known sequences belonging to known characterized functions in distinguished subtrees endowed with representative models. Neither unclassified nor misplaced sequences were identified, as illustrated in figure 6 and supplementary table S8, Supplementary Material online.

Fig. 6.

ProfileView classification tree of the HAD/-PGM/Phosphatase-like subgroup of Haloacid Dehydrogenase in SFLD. Validation test of ProfileView performance. See supplementary table S8, Supplementary Material online.

Comparison of ProfileView with Other Computational Approaches

ProfileView is compared with the PANTHER classification system (Mi et al. 2012, 2013) and the two k-mer-based platforms CUPP (Barrett and Lange 2019) and eCAMI (Xu et al. 2020). One more comparison with CUPP and the state-of-the-art neural network approach based on Restricted Boltzman Machines (RBM) described in (Tubiana et al. 2019) are reported in supplementary text 1, Supplementary Material online (on the WW domain family). In all comparisons, ProfileView outperforms or is on par with the functional classification considered.

ProfileView and PANTHER

PANTHER (Mi et al. 2012, 2013) is a large curated biological database of gene/protein families and their functionally related subfamilies which has been designed to classify and identify the function of gene products. PANTHER provides data and tools to group sequences in functional clusters. Unlike ProfileView, it does not organize them in a distance tree, missing the possibility to identify large-scale functional properties for groups of sequences that cluster together, such as light-dependent/independent CPF sequences. We annotated the 307 sequences belonging to the CPF family with PANTHER and compared PANTHER to ProfileView on the 72 functionally characterized CPF sequences. For easier visualization, we reported PANTHER classification of the full set of CPF sequences on both the ProfileView classification tree and the CPF distance tree in supplementary figures S11 and S12, Supplementary Material online. Supplementary table S9, Supplementary Material online reports PANTHER classification of the 72 functionally characterized sequences. As ProfileView, PANTHER associates sequences in Class II CPD photolyase and CRY Pro to specific functional classes. In contrast, it associates many functional classes with the Plant Photoreceptor CRY. This implies, on the one hand, that PANTHER does not sharply identify the subset of Plant Photoreceptor CRY sequences and, on the other hand, that it suggests a finer functional delineation within this subset. In this regard, we notice that PANTHER associations in the Plant Photoreceptor CRY subtree of supplementary figure S11, Supplementary Material online correspond to the topology of the ProfileView Plant Photoreceptor CRY subtree. Moreover, several PANTHER classes (e.g., cryptochrome-1, (6-4) photolyase isoform A, and SI:CH1073-390K14.1) are associated with distinct CPF classes. Some associations are clearly faulty as it is the case for the (6-4) PL sequences annotated as circadian regulators, and for the Class I CPD photolyase sequences classified as (6-4) PL. Note that PANTHER class SI:CH1073-390K14.1 recognizes both Plant photoreceptor Cry and Class III CPD photolyase, and that sequences in the NCRY subtree are annotated as (6-4) photolyase while they are PRs according to us and to (Emmerich et al. 2020).

ProfileView and the Two k-mer-based Platforms CUPP and eCAMI

CUPP (Barrett and Lange 2019) and eCAMI (Xu et al. 2020) are two computational approaches designed to classify carbohydrate-active enzymes by using short peptide/k-mer sequences expected to be enzyme specific. In CUPP and eCAMI, proteins sharing the same peptide profile are claimed to share the same function. The set of GH30 sequences used to validate ProfileView (fig. 5) was also used for the evaluation of CUPP (Barrett and Lange 2019). CUPP splits these sequences in 33 groups and organizes them in a dendogram (Barrett and Lange 2019) whose topology is reported in supplementary figure S13, Supplementary Material, online. The dendogram is composed of nine subtrees corresponding to the nine CAZy subfamilies. A schematic comparison of CUPP dendrogram (Barrett and Lange 2019) and ProfileView tree is given in supplementary figure S13, Supplementary Material, online. Both their topologies highlight the separation of the CAZy subfamilies GH30-1, GH30-2, GH30-3, and GH30-9 from the other subfamilies. ProfileView tree separates further subfamilies GH30-4 and GH30-5 from the remaining ones. A detailed analysis of the CAZy subfamilies indicates similar sequence organization for both ProfileView and CUPP. For instance, CUPP organizes GH30-1 sequences by splitting them in five clusters (Barrett and Lange 2019) that are easily identified in ProfileView tree, where three representative models are associated with three of CUPP clusters (purple, fuchsia, and dark blue in third circle of annotation in fig. 5). In contrast, the classification of CAZy subfamilies GH30-4 and GH30-5 (fig. 5) highlights a large number of CUPP clusters while ProfileView groups GH30-5 into three main subtrees and GH30-4 into one. Two of the three ProfileView subtrees grouping GH30-5 are characterized by representative models. Interestingly, the remaining sequences are clusterized by CUPP into several clusters and no representative model is found by ProfileView, indicating the difficulty of both methods in classifying this group of sequences. On the GH30 family, eCAMI performs very similarly to CUPP (compare the two most external layers of fig. 5). eCAMI tends to group sequences in a larger number of clusters than CUPP. This cluster fragmentation corresponds to small ProfileView subtrees and affects the same sets of sequences that are of difficult classification for CUPP. To test the general applicability of ProfileView versus CUPP and eCAMI, both designed to classify carbohydrate-active enzyme sequences, we also compared the three approaches on the CPF sequences. CUPP and eCAMI were run using both FAD and PHR sequences. CUPP tree and its associated clusters are represented in supplementary figure S14, Supplementary Material online for FAD sequences (see also supplementary fig. S15, Supplementary Material online). CUPP: (1) groups all together the CPF classes “Transcriptional regulators,” (6-4) photolyase and Animal photoreceptor CRY. Hence, distinguished functions are shared in the same subtree. In particular, it does not distinguish light dependent from light independent protein sequences; (2) does not distinguish Classes I and III CPD PL; (3) places the CRYPro subtree far from the remaining subtrees while, in ProfileView, CRYPro is located closer to Class II CPD photolyase; (4) splits the CRY DASH tree into two distinguished subtrees, one of which contains no sequence with a known functional annotation. In addition, CUPP successfully classifies a larger number of sequences (corresponding to the leaves left uncolored in supplementary fig. S14, Supplementary Material online) in the CPF family compared to ProfileView, which did not find sufficient confidence among its models to include some input sequences in its tree (Steps II and III of ProfileView pipeline in Materials and Methods). Viceversa, there are sequences that have been classified by ProfileView and that do not belong to CUPP classification (see uncolored sequences within CUPP clusters in supplementary fig. S14, Supplementary Material online). We also notice that, like ProfileView, CUPP: (1) groups Class II CPD photolyase in a single subtree, and (2) distinguishes NCRY sequences. When CUPP considers the whole PHR sequence, the topology of the CUPP tree (supplementary fig. S16B, Supplementary Material online) gets closer to ProfileView topology even though CUPP mixes up Classes I and III CPD photolyase as well as light-dependent (6-4) photolyase and Animal photoreceptor CRY sequences; the NCRY subtree locates close to photolyases (supplementary fig. S15, Supplementary Material online); the higher number of CUPP clusters fragments the functional organization, as for instance for Class II CPD PL. CUPP and eCAMI clustering can be visualized on the ProfileView tree in supplementary figure S17, Supplementary Material online. As observed for the GH30 family, eCAMI tends to group sequences in a larger number of clusters than CUPP. For CPF sequences, it separates some Animal photoreceptor CRY sequences from transcriptional regulators and (6-4) photolyases, clustered together by CUPP. It does not separate transcriptional regulators and (6-4) photolyases though. Furthermore, in agreement with ProfileView, it distinguishes between Classes I and II CPD photolyase sequences based on the FAD domain. In contrast, eCAMI divides the set of characterized sequences of Class I CPD photolyase on FAD and PHR sequences into several clusters (supplementary fig. S17, Supplementary Material online). Overall, this analysis highlights CUPP’s and eCAMI limitations in handling arbitrary protein families.

Discussion

The availability of large amounts of (meta)genomic data allows for a deeper exploration of living organisms and the processes underpinning their genetic, phylogenetic, and functional diversification. Computational approaches, capable of highlighting these diversities and identifying what is functionally novel in sequence information, will make the first fundamental step in the discovery of new candidates whose functional activity will be tested experimentally. Moreover, due to the huge amount of sequences that will be acquired in coming years (1 zetta-bases/year are expected in 2025 Stephens et al. 2015), there will no longer be a way to examine this mass of data with an “expert eye” and computational approaches will play a key role in extracting new information and in functional classification. Today, we can characterize homologs on the basis of their similarity using distance measures that model the evolution of the entire sequences. However, as shown here and elsewhere (Schnoes et al. 2009; Mi et al. 2012; Akiva et al. 2014; Barrett and Lange 2019), this computational approach is insufficient to provide insights on the functional activities of proteins, and a large number of sequences are still not functionally annotated. Some of these protein families, like the seven families discussed in this study, are extremely important in medicine, biology, environmental science and biotechnology due to their key roles in cancer biology, DNA repair, drug delivery strategies, chronobiology, and photobiology, specific enzymatic reactions, the formation of protein–protein interaction networks, optogenetics. Thanks to their key role, over the decades, experiments have accumulated an enormous amount of functional information that we have used to validate the ProfileView approach. ProfileView functional organization of the seven families considered here agrees with experimental evidence. ProfileView highlights that the functional classification of proteins depends on the nonlinear contribution of many profile models and that conserved patterns in sequences are not sufficient alone to discriminate diversified functions of complex protein families. This change of perspective in functional classification underlies the complexity of the question and explains why this problem remains wide open today despite the clear interest in classifying protein families that have been extensively studied in molecular biology, such as transporters, signaling, and transcription factors. Not least, the recent focus on de novo genes points out that the notion of “function” is more complicate than one might expect (Keeling et al. 2019). By constructing multiple profile models characterizing different conserved motifs in homologous domain sequences, ProfileView captures functional signals and, by combining them, is able to successfully classify large datasets. Its main advantages compared to approaches developed before are as follows: (i) ProfileView is alignment-free and avoids errors due to the difficulty of comparing distant homologs; (ii) several profile models represent more precisely than a single consensus model the functional variability of protein families; (iii) large amounts of data are not required to learn features and perform classification because of a relatively small number of profile models, which is reduced to a few thousand, and a construction of profile models from a few dozen sequences; (iv) functional annotation of many sequences does not need to be known to explore with precision the space of sequences and classify them; (v) ProfileView is a general approach applicable to proteins of arbitrary length and function. Moreover, once a domain library is constructed, ProfileView is computationally efficient in screening very large sets of homologous sequences in a reasonable time. ProfileView demonstrated to discover potentially interesting CPF proteins whose function could be tested experimentally to identify new light-responsive proteins with novel features. Photoactive proteins are of interest for biotechnology and any computational approach to finding them is desired. More broadly, ProfileView has the potential to greatly expand our understanding of the mechanisms developed by nature to exploit light for functional purposes. ProfileView also organized the WW domain family in subtrees of sequences, corresponding to a large spectrum of differences in binding affinity to various ligands, which have been experimentally observed. It demonstrated that a large variety of sequence motifs covers this spectrum and it identified these motifs. It could classify protein superfamilies in the manually curated CAZy and SFLD databases by accurately identifying differences in their multiple enzymatic reactions. Compared to Tubiana et al. (2019), a computational approach also based on sequence analysis, ProfileView described differences among binding motifs in much greater detail, opening new avenues in the discovery of alternative binding patterns in protein–protein interaction networks. It has been compared favorably to other classification tools like PANTHER, CUPP, and eCAMI, on the CPF, the WW domains and the GH30 family classified in the Carbohydrate-Active Enzymes database CAZy. The ProfileView method makes no assumptions about the complexity of the domain architecture of a protein family. For the applications discussed here, ProfileView operates under the assumption that the analysis of a single domain is sufficient to functionally classify very different protein families, possibly with complex domain architecture. For CPF sequences, we observed that ProfileView classification tree is in agreement with the known domain architectures for CPF subfamilies and potentially on their evolution. This result highlights an “extra” structure on the CPF family and points to a much more general question about the precise relationships between the evolution of domain architectures and the evolution of functions for a protein family. In principle, we cannot exclude that domain multiplicity and/or domain order might play a role for functional classification of some protein families (Basu et al. 2009; Lees et al. 2016). To test this hypothesis, ProfileView could be used to systematically classify known protein families based on single domains. Such classification will clarify, on the one hand, the large-scale applicability of ProfileView and, on the other hand, will contribute to our understanding of the (combined) evolution of functions and domain architectures. On the methodological side, ProfileView addresses the problem of extracting biological information about protein families from the huge space of natural sequences. Sampling of distant sequences could be realized using different distance measures. This is an important direction of investigation that could lead to finer biological information extracted from sequences. From the algorithmic point of view, ProfileView is surprisingly simple compared to the Restricted Boltzmann Machines (RBM) model used in Tubiana et al. (2019) to classify WW domain homologs. RBM are generative stochastic (single layer) artificial neural networks that can learn a probability distribution over a set of inputs. Once the machine is trained on protein sequences, it can be used to either generate new protein sequences that look “alike” the ones that have been used in the training or to estimate the probability for a sequence to be generated by the model. RBM learn collective modes by extracting short sequence motifs from sets of sequences based on correlation patterns among alignment positions. These motifs might reveal structural, functional and phylogenetic features and they are used to define a representation space where to classify sequences. RBM generative nature makes training challenging by an algorithmic point of view since intensive sampling from large training sets is required. In contrast, ProfileView constructs profile models seeded from distant homologous sequences. To construct a model, ProfileView requires a very small number of natural sequences, 20 or more, that are similar to the seed sequence. Also, ProfileView makes no use of positional correlations nor generates artificial sequences. Its profile models encode conserved patterns ignoring those parts of the homologous sequences appearing variable (see discussion on the two CPF sequences U5NDX3 and R7UL99 above). The number of models is not a restriction for the construction of the classification space. The intrinsic simplicity of ProfileView makes it possible to envision new directions of investigation such as the design of a ProfileView extension that could consider motifs across multiple domains, for proteins of complex domain architectures. Indeed, the fine understanding of functional mechanisms might need more sophisticated computational approaches than ProfileView. For instance, for the CPF classification based on the FAD-binding domain, ProfileView highlights functional differences between large classes of CPF sequences, helping to model the proximity between these classes with an appropriate identification of a functional tree topology. To find functional differences within classes and to anticipate the existence of a double function (see supplementary fig. S2, Supplementary Material online), the interplay between domains in the CPF sequence might have to be considered as highlighted in Rosensweig et al. (2018). ProfileView is deeply rooted in the evolutionary information encoded in genomic sequences. For this reason, it is expected to contribute to fundamental questions of genome evolution, such as accurate reconstruction of gene duplication history. A fundamental question in this regard concerns the functional distinction of paralogous genes within a phylogenetic tree and within a single species. The power of multiple profile models in identifying different functional determinants between homologs should be able to do the same between paralogs. Last, even though ProfileView has been applied here to the classification of entire protein sequences, it can handle metagenomic sequences as well. In this regard, it is important to highlight that the majority of metagenomic and metatranscriptomic data come from organisms that cannot be cultured and may never be isolated. Therefore, new conceptual approaches to explore their biology in complex ecosystems are desperately needed. ProfileView increases knowledge on the biology of organisms whose ecological role has been recognized (e.g., marine microbes) but which are still not accessible to functional investigations, thus opening a new avenue for functional exploration.

Materials and Methods

Datasets Used to Validate the Method

Datasets of Sequences to be Classified

The seven protein families used to evaluate ProfileView performance are listed in table 1 (first column). Their sets of homologous sequences to be classified (see table 1, second column, for their number) were retrieved from publicly available databases (see below). For each family, a subset of sequences was functionally characterized (see table 1, fourth column) and we used it for evaluation. All families show multiple functions (table 2, second column). The different characteristics of the protein families are reported in table 1 and supplementary table S1, Supplementary Material online. CPF sequences were retrieved from UniProt, JGI projects (genome.jgi.doe.gov), and OIST projects (marinegenomics.oist.jp). The set was constructed according to two main criteria and contains: (1) CPF sequences known to have a specific function according to experimental evidence reported in the literature (see supplementary file, Supplementary Material online for bibliographical references); (2) CPF sequences that span the entire tree of life; they belong to species, classes, and phyla (see supplementary file, Supplementary Material online for the detailed list). In the text, a “CPF sequence” refers to the full-length CPF sequence comprising the PHR domain, including the FAD-binding domain, and possibly the C- and N-terminal extensions, whereas a “FAD sequence” refers to the FAD-binding domain sequence exclusively. The set of WW domain sequences was constructed by combining the datasets of natural sequences analyzed in (Otte et al. 2003; Ingham et al. 2005; Russ et al. 2005; Tubiana et al. 2019). Sixty sequences were experimentally characterized (Otte et al. 2003; Ingham et al. 2005; Russ et al. 2005), and the remaining ones were randomly selected in comparable proportion from the three sets classified in (Tubiana et al. 2019) as types I, II/III, and IV. The set of GH30 sequences is the same as that used in (Barrett and Lange 2019) (file GH30.faa provided with the CUPP program v1.0.14 and containing 1803 sequences) and described in the Carbohydrate-Active Enzymes database CAZy (http://www.cazy.org/GH30.html). It is organized into several subfamilies of the CAZy classification. Some of these subfamilies are functionally classified by CAZy and others are left unclassified. We used the annotation files in Barrett and Lange (2019), where 721 of the 1803 sequences have mapping/labeling to subfamilies GH30-1 through GH30-9. Note that, of the 1675 sequences retained for ProfileView analysis after filtering, 695 have a label in the GH30 ProfileView tree (table 1). The set of sequences of the HAD/-PGM/Phosphatase-like subgroup of the Haloacid Dehalogenase (HAD) superfamily and the three subgroups of the Radical-SAM superfamily (B12-binding domain containing, Methylthiotransferase and SPASM/twitch domain containing) were retrieved from the Structure-Function Linkage Database (SFLD) (Schnoes et al. 2009; Akiva et al. 2014). More precisely, each subgroup is defined by the union of the sets of annotated sequences associated with its families in SFLD. Given a subgroup, we considered all of its families, even if they were represented by very few sequences, possibly only one.

Datasets of Sequences Used for Model Library Construction

For all protein families, the domain(s) considered for model construction, their Pfam accession code, and the number of models constructed are shown in table 1. The seed sequences seeding the models were retrieved from the Pfam full set associated with the Pfam domain used for classification. The seed sequences for FAD were retrieved from Pfam v while for all other domains we used Pfam v. For the three families, GH30, HAD/-PGM/Phosphatase-like and B12-binding domain containing, Pfam contains two similar domains (see table 1, fifth column) and we used the union of the Pfam sequences from both these domains. For each seed sequence, the homologous sequences used to build the profile model were retrieved from Uniclust30, which is UniProtKB clustered to 30% identity and for which a HHblits database is provided (Mirdita et al. 2017).

Clade-Centered Models and a Multi-source Functional Annotation

Widely used methods searching for homologous domain sequences (Altschul et al. 1997; Eddy 2011; Remmert et al. 2011) rely on a single-source annotation strategy, where a single profile model (e.g., a pHMM Eddy 1998), generated by the consensus of a set of homologous sequences, is used to represent a protein domain. The single-source strategy usually works well for rather conserved homologous sequences, but when the sequences are highly divergent, the consensus signals become too weak to generate a useful probabilistic representation and the global consensus models do not properly characterize domain features. A multi-source domain annotation strategy (Bernardes et al. 2016), in which protein domains are represented by several profile models, called Clade-Centered Models (CCM), was implemented in CLADE (Bernardes et al. 2016) and MetaCLADE (Ugarte et al. 2018) for genomes and metagenomes/metatranscriptomes, respectively. There, we showed that CCMs significantly improve domain annotation for both complete genomes and metagenomic/metatranscriptomic sequences. Because of their proximity to protein sequences, CCMs are more specific and functionally predictive than canonical global consensus models. Here, we construct and use CCMs differently, with the goal of better resolving the functional organization of sequences within protein families. In order to capture conserved motifs that might be of functional relevance to the family, we build CCMs that are highly specific. The motifs, which consist of conserved positions on subsets of close homologs, will likely belong to protein interaction sites and will be determinants of functional specificity. To construct CCMs (see below, step I of the ProfileView pipeline), we consider the FULL set of sequences associated with a Pfam domain (Finn et al. 2014) and, for each sequence , we construct a profile HMM by retrieving a set of homologous sequences close to from UniProt. Such a model displays features that are characteristic of and that might differ from other sequences . The rationale is that the more divergent and are, the more we expect CCMs to highlight different features within a protein family.

The ProfileView Method

The ten main steps of the ProfileView pipeline are explained in detail below. A flowchart is provided in figure 7. A hands-on description of the ten steps for the CPF family is given in supplementary text 2, Supplementary Material online.

Fig. 7.

ProfileView flowchart. The ProfileView pipeline is organized in ten main steps: (I) construction of the model library for a domain or a few similar domains chosen by the user, (II) sequence filtering based on matching/unmatching of the models on a sequence, (III) sequence selection based on the quality of a match, (IV) filtering of models to reduce model redundancy, (V) association of two scores to each model hit, (VI) construction of ProfileView space of sequences, (VII) dimensionality reduction of ProfileView sequence space, (VIII) construction of ProfileView classification tree, (IX) identification of the best representative models for subtrees, (X) extraction of functional motifs from representative models. User-editable parameters are highlighted in green and those that remain fixed are highlighted in cyan. ProfileView takes as input a Pfam domain and a set of homologous sequences to be classified. If there are similar Pfam domains (Pfam usually names them with a numerical extension, as for instance HAD and HAD_2), the user may decide to provide several alternative domains as input and build the model library from multiple domains accordingly. The output of ProfileView is a classification tree with representative models associated with internal nodes of the tree, when they exist, and functional motifs characterizing the sequences in the corresponding subtree. I. Model Library Construction (. To construct a library of models for the domain , we considered sequences from the FULL dataset in Pfam database (Finn et al. 2014) as “seeds” for the models. For each sequence, we built a CCM (Bernardes et al. 2016) by searching in Uniclust30 for highly significant matches of homologous sequences having at least identity with the seed sequence and covering at least of it. More precisely, a multiple sequence alignment is built using the command hhblits of the HH-suite (Remmert et al. 2011) (with parameters -qid 60 -cov 70 -id 98 -e 1e-10 and database uniclust30_2017_10) and subsequently converted into a pHMM with HMMER (Eddy 1998) in order to perform a sequence-profile comparison. Typically, profile models are built from 50 to 100 sequences; a minimum of 20 sequences (heuristically determined threshold) is required. Note that the FULL set of Pfam sequences associated with a domain might be very large and contain tens of thousands of sequences. If so, we sample a few thousand sequences, by first clustering the set with MMseq2 (https://github.com/soedinglab/MMseq2). The easy-cluster command of mmseqs is used with two parameters: --min-seq-id sets the minimum sequence identity for clustering, and -c 0.8 considers matches above this fraction of aligned/covered query/target residues. MMseq2 is used to cluster close sequences together and for this, we required that sequences in a cluster had more than either 40 or 60% sequence identity (default set at 50%) depending on the protein family, so that several thousands (up to 6000) representative sequences could be identified from different clusters. The selected representative sequences were used as seeds to build profile models, as indicated above. If several similar Pfam domains are considered, as for the GH30 and HAD families, the above procedure is applied to the union of Pfam sequences associated with all domains. (Ultimately, we observed that several similar domains slightly improve classification.) II. Sequence Filtering. After constructing the set of models for the domain , we discarded from the input set of sequences all sequences against which HMMER (version 3.1b2) found no domain hit, regardless of the hit score. For all protein families, table 1 (third column) reports the number of sequences after filtering. Note that this filtering step, based on multiple profile models, can identify domains in divergent sequences where the consensus Pfam model cannot provide a hit. See supplementary figure S18, Supplementary Material online for an illustration of sequence filtering. III. Sequence Selection. Each CCM in is mapped against the set of all input sequences using HMMER. Let be the set of hits provided by hmmsearch, where is a sequence of , is a model of and is the bit-score assigned to . The bit-score is a log-odds ratio score (in base two) comparing the likelihood of the pHMM to the likelihood of a null hypothesis (i.e., an i.i.d. random sequence model). More formally,where is the probability of the pHMM generating the sequence and is the probability of being generated by the null model (Barrett et al. 1997). We partitioned the hit set in three subsets , , , where contains all hits that fully cover the associated model, contains all hits involving the extremes of a sequence covered only partially by the associated model (this situation corresponds to an “incomplete” sequence), and contains all remaining hits. See supplementary figure S18, Supplementary Material online for an illustration of the three matching types. More formally, a hit belongs to if the aligned region of to (excluding gaps) covers at least of the length of . If represents an overlap between and (allowing an overhang length of at most the 10% of the sequence length) then . Otherwise, . To eliminate potentially incomplete sequences, a sequence is retained only if: either at most the of its hits belong to , or, at least the of its hits belong to either or . These two conditions were introduced to take into account that Pfam could also contain partial sequences which could lead to the construction of very short models (which could be fully aligned in potentially incomplete sequences). We refer to the reduced set of sequences as . IV. Data-Driven Selection of Models based on Sequence Best Matching. In order to restrict the analysis to a reduced set of models that remains representative of , we kept only those models that achieve one of the best scores for at least a sequence in , for (default). The rationale of this model filtering is to get rid of “noisy” models and, at the same time, significantly reduce the size of , from some thousands down to a few hundreds. We refer to the reduced set of models as . The parameter can be set by the user. V. Association of Two ProfileView Scores to Model Hits: The Normalized Bit-score and the Normalized Weighted Bit-score. Let be the number of positions in a sequence that match to a model in a sequence-profile alignment (that is, no gap is considered in the counting). Given a hit , we define the following two scores for it: a normalized bit-score ; a normalized weighted bit-score , where is the sum of bit-scores over those positions in the sequence-profile alignment having a bit-score 3 (that is, the positions where and strongly agree). More formally, let be the log-odds ratio of a residue being emitted from a match state with emission probability and with null model background frequency , defined by HMMER during the model construction and differing between amino acids (Eddy 1998). Given the list of the aligned residues of against the model states of and such that the posterior probability, computed by HMMER, of each aligned pair is greater than , we define . Both scores are computed for all hits and used to construct the ProfileView space of sequences. VI. The Construction of a ProfileView Space of Sequences (. For each sequence , we construct a vector , where the dimension of is and is the number of models in . The vector contains the pairs of values and , for each . If a model does not have a hit on the sequence , then we assume that and let and . Hence, we say that the ProfileView space is a -dimensional space, where each dimension is associated with either the normalized bit-score or the normalized weighted bit-score for some model . Each sequence is a point in and its position reflects the proximity of the sequence to CCMs in . VII. PCA and Dimensionality Reduction for ProfileView Space of Sequences. After constructing the ProfileView space for the sequences , Principal Component Analysis (PCA) is performed to reduce its number of dimensions. More precisely, is reduced to a -dimensional space , where is the minimum number of principal components that explain the of variance for the set . By default, . This value should decrease the number of dimensions to a few dozens. If a protein family is characterized by a large diversity of representative sequences, the user may have to loosen the constraints on variance by setting to smaller values. is a parameter that can be set by the user. VIII. The ProfileView Tree Construction (. Sequences are clustered in using a hierarchical agglomerative strategy. Namely, we considered the Euclidean distance between vectors and Ward’s minimum variance method for merging clusters. The logic of this criterion is to select, at each step, the pair of clusters that minimize the total variance within the cluster after the merging. Starting from all clusters being singletons, this bottom-up algorithm completes in agglomerative steps and allows to represent clusters in a hierarchical way and to define a rooted tree. More precisely, it produces a binary tree where each internal node defines a cluster of two or more elements (according to the chosen merge criterion). Moreover, in such a tree, the distances/dissimilarities between the merged clusters are encoded as edge weights. IX. Association of Representative Models to ProfileView Subtrees (. To better explore subtrees in the ProfileView tree, potentially associated with known functions, we associated a representative model to the sets of sequences that label their leaves. Intuitively, a representative model separates a subset of sequences from the rest of the sequences of the tree (this set is designated ) in the ProfileView space (see fig. 1). Given a model in the library, let us call the maximal subset of where the model assigns higher scores to sequences in than to sequences in . This must apply to at least one of the metrics— and *—which define (see step III). For each model in the library, we compute and choose the model with a of largest cardinality as the representative model of . If two models have the same maximum cardinality, we choose the model that provides the best separation, that is, the model that maximizes the distance between the centroids of the sets and (again, computed according to the and metrics). If is the set of sequences of a subtree of the ProfileView tree (which is not the entire tree), then a representative model when the following two conditions are met: (1) includes at least half of the sequences in and (2) contains at least one sequence from each of the child subtrees of . Note that a node in the ProfileView tree might be left without a representative model. When ProfileView returns a representative model for a node of the tree, it also returns a list of suboptimal models covering either the same amount of sequences or 90% of . X. Motif Extraction from Representative Models. A motif extracted from a representative model is the set of all amino acids characterizing well-conserved columns (i.e., match states) in the sequence alignment associated with the model, according to the hhblits’ definition. That is, given a column of the multiple sequence alignment related to the model, an amino acid is well conserved if it occurs with a probability before adding pseudo-counts and including gaps in the fraction count. See figure 1.

On the Size and Diversity of the Model Library

ProfileView model library construction (step I) is designed in such a way that sequences coming (possibly extracted) from the Pfam FULL set would best represent the protein family. To show the relevance of ProfileView’s automatic construction, we have tested ProfileView on smaller sets of Pfam “seed” sequences and have shown that not having a large and diverse set of sequences hinders classification performance. More precisely, we built two model libraries for the CPF family by considering 10 and 100 “seed” sequences, respectively. These sequences have been randomly chosen in the FULL set of CPF sequences. The resulting ProfileView trees are reported in supplementary figures S19 and S20, Supplementary Material online. A straightforward comparison with figure 3 and supplementary figure S2, Supplementary Material online shows the limitations of a functional classification based on a restricted number of models.

Parameters Used in ProfileView Analysis of the Seven Protein Families

ProfileView was run with the same default parameters and on all protein families in table 1, with the exception of the WW domain, which is characterized by a wide variability of sequences. For the WW domain family, we set and (see steps IV and VII). The parameter allowed to obtain a space of dimensions starting from a total of 2,488, versus dimensions obtained with a threshold of . The parameter allowed to increase the number of best matching models to 1,244 versus 845 obtained with . Intuitively, to classify datasets of sequences with high variability, such as the WW domain family, the number of models representing the dataset should be large (>1,000).

Motif Graphical Representation

The model logos were built using the Weblogo python package (Crooks et al. 2004) (version 3.7) which allowed us to easily export sequence logos (Schneider and Stephens 1990). Amino acids are colored according to chemical properties: neutral polar amino acids (G, S, T, Y, C) show in green, acidic polar (Q, N) violet, positively charged (K, R, H) blue, negatively charged (D, E) red, and hydrophobic (A, V, L, I, P, W, F, M) black. The graphical representation of a motif associated with some representative model has been augmented by additional information that helps to easily compare the motif across representative models. Namely, we have highlighted, by a colored “dot,” those positions that are well conserved in other representative models. Given a reference model and a query model , a dot is put under a well-conserved column of , if there exists a column in the query model : (1) aligning in hhblits with a score greater than (i.e., fairly similar amino acid profiles) and posterior probability greater than ; (2) containing a most conserved amino acid which is the same as in and is also well conserved. A circled dot indicates an aligned column in satisfying 1 but not 2. This means that the most conserved amino acid in shows <60% frequency in . Note that, in this case, and might display different most conserved amino acids. It is important to note that given two models and one position, the score assigned to that position in the hhblits pairwise alignment of the models depends on the reliability of the query-template alignment (https://github.com/soedinglab/hh-suite/wiki). Depending on which of the models is considered as template, the scores assigned to the same position may vary (confidence values are obtained from posterior probabilities calculated in the Forward-Backward algorithm of hhblits). In particular, hhblits warns that the confidence score for an aligned position depends on the alignment confidence of the neighboring regions. As a result, the alignment score of some conserved positions may decrease due to the presence of a highly variable region in their vicinity, possibly containing gaps. This explains why, for the aligned positions of two motifs, we may miss to indicate related positions or we may display different color dots. An example of missing related positions is illustrated by position 102 in the NCRY motif and position 103 in the Plant photoreceptor CRY motif of CPF. The two motifs clearly diverge within the region adjacent to positions 102/103, justifying a difficult model alignment and a low confidence score at 102/103. A second example, illustrating the asymmetry of colored dots, is position 102 in the NCRY motif aligned with position 95 in CRY Pro. While the CRY Pro motif records the colored dot for a match with NCRY, this is not true for the NCRY motif. In fact, while the two positions align together with a confidence score of 0.8 for the CRY Pro model taken as a template, they also align when the NCRY model is taken as the template but with a confidence score that drops at 0.6.

Phylogenetic Tree Construction for CPF, FAD, and WW Sequences

The multiple sequence alignments of CPF sequences and FAD sequences were computed using MUSCLE version v3.8.31 (Edgar 2004), and were then trimmed using trimAl version 1.4.rev22 (Capella-Gutiérrez et al. 2009) with a gap cutoff of (i.e., columns containing more than of gaps have been removed). Then, for each sequence alignment, we selected the best evolutionary model using ProtTest (version 3.4.2) (Darriba et al. 2011). More precisely, the evolutionary model best fitting the data was determined by comparing the likelihood of all models according to the Akaike Information Criterion (AIC). The model optimization of ProtTest was run using a maximum-likelihood-tree strategy and the tree generated for the best-fit model (VT+G+F) was considered as input for the construction of the final phylogenetic tree (with parameter ). In particular, the construction of a maximum-likelihood phylogenetic tree has been carried out with PhyML 3.0 (Guindon et al. 2010) that optimized the output tree with Subtree-Prune-Regraft (SPR) moves and considering the SH-like approximate likelihood-ratio test. Finally, branches with a support value smaller than were collapsed. The phylogenetic tree for the set of homologous CPF sequences used to validate ProfileView is reported in supplementary figure S3, Supplementary Material online and contains leaves corresponding to the CPF sequences containing the FAD-binding domain. The phylogenetic tree for the set of 307 FAD sequences is reported in supplementary figure S4, Supplementary Material online. The procedure used to generate the phylogenetic tree for WW domain sequences is the same as that used for CPF and FAD sequences. The best-fit model (computed with ProtTest) is RtREV+I+G, with parameters and . Phylogenetic and ProfileView trees have been generated with iTOL (Letunic and Bork 2019).

Output Files of ProfileView

ProfileView produces several output datasets: the model library, the ProfileView tree, the list of representative models associated with internal nodes of the tree. Additionally, ProfileView offers the user the possibility to choose a list of representative models to compare. The first model on this list is considered a reference model. A first output provides a logo showing all conserved positions together with a list of colored dots (possibly circled) obtained after a pairwise comparison of a model in the list with the reference model (see Materials and Methods above; see for example fig. 1). A second output provides a logo that shows an intermediate representation of the positions in the reference model, that is, it shows all conserved positions in the associated motif and all positions that are not conserved in the reference model but which are conserved in some model in the list (see e.g., fig. 4).

Comparison with Other Tools

CUPP (Barrett and Lange 2019), eCAMI (Xu et al. 2020), and PANTHER (Mi et al. 2012, 2013) were run for comparison with ProfileView. CUPP v1.0.14 was run with CUPPclustering.py and parameter-cluster to execute the clustering (http://www.bioengineering.dtu.dk/CUPP). For the comparison on the GH30 family, we considered the “CUPP groups” predicted with the latest version of the CUPP-Predict tool (v3.2.1) available at https://cupp.info. Unfortunately, from the README, the updated “clustering” version of CUPP is still under development and CUPP results on the CPF family still refer to the application of CUPP version 1.0.14. eCAMI was downloaded from https://github.com/zhanglabNKU/eCAMI (commit 5b00a038). The PANTHER HMM library version 15.0 and the pantherScore2.2 tool (scoring protein sequences against the library) were retrieved at http://www.pantherdb.org. We used pantherScore2.2.pl with parameters -l [PANTHER15.0 library] -D B -n, where -D B allows to visualize the best hit in the output and -n allows to visualize family and subfamily names in the output.

Evaluation

To assess the robustness of ProfileView classification, we compared it to functional grouping in the literature. For each protein family, we considered a set of functionally characterized sequences that have been human curated and, in most cases, tested experimentally. Each family is characterized by several functional subclasses (table 2). These datasets constitute independent sets for testing ProfileView. These test sets are available at http://www.lcqb.upmc.fr/profileview/. We want to determine whether characterized sequences of the same functional group localize together in the ProfileView tree. For this, we identify the largest subtrees of at least 2 sequences, endowed with a representative model and comprising at least 75% of characterized sequences which belong to the same functional class. We use this overrepresentation of a functional subclass in a ProfileView subtree to label the subtree with that function and the characterized sequences as “well-classified” (denoted “W”). Note that some of the ProfileView subtrees will be labeled by a function and others may not be. Within subtrees labeled by a function, there may be characterized sequences belonging to other functional subclasses that we denote as misplaced (denoted “M”). Within subtrees that are not labeled by a function, there might be characterized sequences belonging to functional subclasses that we denote as unclassified (denoted “U”). To resume, for each known functional subgroup of characterized sequences in the dataset, we count: the number of unclassified sequences (denoted “U”), that is the number of characterized sequences that do not belong to some subtree labeled by a function, the number of misplaced sequences (denoted “M”), that is the number of characterized sequences that belong to a subtree labeled by a different functional class, the number of well-classified sequences (denoted “W”), that is the number of characterized sequences that are over-represented in a subtree and allowed for the identification of its functional label. These three numbers allow to evaluate whether ProfileView classifies well or not sequences of a given protein family within different functional subclasses. Also, since experimental functional accuracy might differ across functional groups, we consider subtrees comprising at least 75% of their characterized sequences within a functional subgroup which are not endowed with a representative model. For them, we count W/M/U sequences. These subtrees provide evidence of functional organization while highlighting the difficulty of identifying motifs, described by representative models that are specific to a functional class. Our evaluation criteria were designed to highlight several possible scenarios. If a protein family is known to have functional subclasses and the ProfileView tree is comprised of subtrees labeled with the distinguished functions, the ProfileView classification can be considered to be fully accurate for the protein family. On the other hand, if some ProfileView subtree of sequences remains unlabeled, this could suggest missing functional knowledge for the protein family. Finally, since the experimental functional accuracy between different functional groups might differ, ProfileView might suggest an alternative sequence classification.

Computing Time and Memory Usage

The most costly computational part of the ProfileView pipeline is the construction of the profile models for a protein domain sequence. The program was tested using 16 threads on a single machine equipped with an Intel Xeon E5-2670 CPU running at 2.60 GHz, with 128 GB of RAM, and a Linux operating system (CentOS release 6.5). Supplementary table S10, Supplementary Material online summarizes, for each protein family, the time complexity, and the RAM used for the model library construction and the classification step. The time used for the model library construction depends on the number of models and the length of the domain. Once a library is constructed, it can be used for the analysis of different protein families. Note that the library constructed for the Radical SAM domain was used for both the Methylthiotransferase family and the SPASM/twitch domain containing family analyzes in supplementary text 1, Supplementary Material online. Classification time depends on two main sub-steps: (1) HMMER annotation, which mainly depends on the number of sequence-model comparisons, rather than the sequence length, and (2) the computation of the scores used to construct the ProfileView sequence space. Step 2 is the most time-consuming as it involves the parsing of all sequence-profile alignments. ProfileView current implementation of this part of the code is single-threaded but this sub-step could be performed in parallel, possibly in future versions of ProfileView. Note that, for the WW domain family presented in supplementary text 1, Supplementary Material online, (Tubiana et al. 2019) indicates about 1–2 days of computing time on an Intel Xeon Phi processor with cores to run RBM analysis. ProfileView classifies this family in less than 9 h.

Implementation and Software Availability

ProfileView was developed and tested under a UNIX operating system, using Bash, Python, and R scripts. It exploits GNU parallel (Tange 2018), if available on the system, in order to run some jobs in parallel. It is implemented in three main parts carrying out the following pipeline modules: the construction of a single-domain model library, the generation of the ProfileView tree along with its representative models, the comparison of selected representative models and the identification of conserved positions/motifs. ProfileView is available at http://www.lcqb.upmc.fr/profileview/ under the version 2.1 of the CeCILL Free Software License. Click here for additional data file.

78 in total

1. Genome cartography through domain annotation.

Authors: C P Ponting; N J Dickens
Journal: Genome Biol Date: 2001 Impact factor: 13.583

2. FAD Regulates CRYPTOCHROME Protein Stability and Circadian Clock in Mice.

Authors: Arisa Hirano; Daniel Braas; Ying-Hui Fu; Louis J Ptáček
Journal: Cell Rep Date: 2017-04-11 Impact factor: 9.423

3. Graph sharpening plus graph integration: a synergy that improves protein functional classification.

Authors: Hyunjung Shin; Andreas Martin Lisewski; Olivier Lichtarge
Journal: Bioinformatics Date: 2007-10-31 Impact factor: 6.937

4. Analysis of protein function and its prediction from amino acid sequence.

Authors: Wyatt T Clark; Predrag Radivojac
Journal: Proteins Date: 2011-04-19

Review 5. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

Review 6. Machine learning techniques for protein function prediction.

Authors: Rosalin Bonetta; Gianluca Valentino
Journal: Proteins Date: 2019-11-14

7. Structures of Drosophila cryptochrome and mouse cryptochrome1 provide insight into circadian function.

Authors: Anna Czarna; Alex Berndt; Hari Raj Singh; Astrid Grudziecki; Andreas G Ladurner; Gyula Timinszky; Achim Kramer; Eva Wolf
Journal: Cell Date: 2013-06-06 Impact factor: 41.582

8. Big Data: Astronomical or Genomical?

Authors: Zachary D Stephens; Skylar Y Lee; Faraz Faghri; Roy H Campbell; Chengxiang Zhai; Miles J Efron; Ravishankar Iyer; Michael C Schatz; Saurabh Sinha; Gene E Robinson
Journal: PLoS Biol Date: 2015-07-07 Impact factor: 8.029

9. Marine diatoms change their gene expression profile when exposed to microscale turbulence under nutrient replete conditions.

Authors: Alberto Amato; Gianluca Dell'Aquila; Francesco Musacchia; Rossella Annunziata; Ari Ugarte; Nicolas Maillet; Alessandra Carbone; Maurizio Ribera d'Alcalà; Remo Sanges; Daniele Iudicone; Maria I Ferrante
Journal: Sci Rep Date: 2017-06-19 Impact factor: 4.379

10. Peptide-based functional annotation of carbohydrate-active enzymes by conserved unique peptide patterns (CUPP).

Authors: Kristian Barrett; Lene Lange
Journal: Biotechnol Biofuels Date: 2019-04-30 Impact factor: 6.040