Literature DB >> 36052345

Protocol to identify host-viral protein interactions between coagulation-related proteins and their genetic variants with SARS-CoV-2 proteins.

David D Holcomb¹, Katarzyna I Jankowska², Nancy Hernandez², Kyle Laurie², Jacob Kames², Nobuko Hamasaki-Katagiri², Anton A Komar³, Michael DiCuccio⁴, Chava Kimchi-Sarfaty⁵.

Abstract

Here, we describe a bioinformatics pipeline that evaluates the interactions between coagulation-related proteins and genetic variants with SARS-CoV-2 proteins. This pipeline searches for host proteins that may bind to viral protein and identifies and scores the protein genetic variants to predict the disease pathogenesis in specific subpopulations. Additionally, it is able to find structurally similar motifs and identify potential binding sites within the host-viral protein complexes to unveil viral impact on regulated biological processes and/or host-protein impact on viral invasion or reproduction. For complete details on the use and execution of this protocol, please refer to Holcomb et al. (2021).

Entities: Chemical

Keywords: Bioinformatics; Computer sciences; Genetics; Health sciences; Molecular biology

Mesh：

Substances：
Viral Proteins

Year: 2022 PMID： 36052345 PMCID： PMC9345850 DOI： 10.1016/j.xpro.2022.101648

Source DB: PubMed Journal: STAR Protoc ISSN： 2666-1667

Before you begin

The clinical data from COVID-19 patients clearly display the strong association between coagulation-related proteins and COVID-19 infections (Levi et al., 2020). Elevated circulating D-dimer levels, a prolonged prothrombin time (PT) and low ADAMTS13 plasma levels have been observed in patients with COVID-19 (Al-Samkari et al., 2020), (Levi et al., 2020) (Mancini et al., 2020) (Bazzan et al., 2020). Nevertheless, the detailed mechanism of how the virus dysregulate the coagulation pathway is not fully understood. While there are numerous factors including age, weight and various medical underlying conditions that increase the risk for severe COVID-19 and the extent of thrombotic events or coagulopathy after infection (Zhou et al., 2020) (National Center for Immunization and Respiratory Diseases NCIRD; Division of Viral Diseases, 2022) the interactions between viral proteins and coagulation related proteins may also be responsible for the prothrombotic phenotype that is seen in COVID-19 patients. Some proteins associated with the coagulation cascade that can be directly or indirectly affected by viral proteins has been already identified (Ortega-Bernal et al., 2022), (Pfefferle et al., 2011). The enrichment analysis combined with human proteomics data revealed the direct interactions of virus proteins with coagulation cascade proteins including fibrinogen (FGA and FGG) or protein S (PROS1) (Ortega-Bernal et al., 2022). The interaction of coagulation factor X (FX) that cleaved the SARS-CoV-2 spike protein has been experimentally validated (Du et al., 2007), (Kastenhuber et al., 2022). Through computational modeling, we previously examined the binding of Poly(A) Binding Protein Cytoplasmic 4 (PABPC4), Serine/Cysteine Proteinase Inhibitor Clade G Member 1 (SERPING1) and Vitamin K epOxide Reductase Complex subunit 1 (VKORC1) with SARS-CoV-2 proteins and have shown that some of these protein genetic variants may impact COVID-19 severity (Holcomb et al., 2021). In addition, it has been suggested that VKORC1 variant 1629A confers protection against thrombotic complications of COVID-19 and that differences in its allele frequency are partially responsible for the differences in COVID-19 severity in Asians population (Janssen and Walk, 2020). These studies strongly suggest that in addition to the protein-protein interaction between host-viral system also the proteins' genetic variation should be considered to fully understand the source of coagulation pathway dysregulation during COVID-19 infection. The protocol presented here describes a step-by-step bioinformatics approach to identify host-viral protein interactions. This pipeline identifies the host-proteins that may interact with the virus/virally encoded proteins. The selected host-protein variants are then identified in various subpopulations by searching numerous databases including COVID-19 host genetics initiative (HGI) GWAS data, dbSNP or ClinVar. All identified missense and synonymous variants are rated based on conservation score, change in mRNA minimum free energy, and codon usage using numerous in-silico tools. The selected host-viral proteins are then aligned to find structurally similar sites and to identify potential binding sites. The identification of host-viral protein interactions may help to understand the origin of coagulopathy during or after COVID-19 infection. The evaluation of host-viral protein interactions for different host protein variants can help to understand the distinct disease pathogenesis in minority groups and subpopulations. Moreover, by applying sequence and structural alignments, the pipeline finds and visualizes the similarities in the specific human and viral proteins structures and predicts the specific regions of protein-protein interaction in the viral-host system. The identified interactions may help to understand the viral impact on regulated biological process and/or protein impact on viral invasion or reproduction which can further help to better understand the outcomes of host-viral interactions critical in development of potential novel drugs and treatment options.

Institutional permissions

All resources are publicly available. Please cite any sources used for any publications resulting from this protocol.

Hardware

We strongly suggest at least 16 GB of memory and 2 cores. At least 150 GB of hard drive space will be required. In addition, a network connection is required to query webservers. Linux is strongly recommended, particularly an Ubuntu or Debian distribution.

Installing code

Timing: 10 days Python dependencies are the backbone of the code. Most of the code is written in Python. Python version 3.7.12 (Van Rossum and Drake, 2009). Jupyter Notebook 3.4.3 (Kluyver et al., 2016) is useful for the bulk of the calculations. They have been automated to run in Jupyter Notebook. The code can be run in any environment but has been developed and documented for Jupyter Notebook. The easiest installation method uses Anaconda (Anaconda Documentation, 2022). conda install jupterlab Entrez from BioPython 1.69 (Cock et al., 2009) is necessary to query NCBI servers for information on sequences and variants. conda install -c conda-forge biopython Update your Entrez email and API key. If you do not have an API key for Entrez services, more information is provided here: https://ncbiinsights.ncbi.nlm.nih.gov/2017/11/02/new-api-keys-for-the-e-utilities/. The following Python libraries are needed: Requests, Pandas, Prody, Numpy, BeautifulSoup, urllib, xlrd, subprocess, multiprocessing. Additional libraries and versions used are included in requirements.txt [troubleshooting 1]. conda install requests, pandas, prody, numpy, beautifulsoup, urllib, xlrd, subprocess, multiprocessing Associated code has many dependencies. Any further software requirements are listed here, and there are no further hardware requirements [troubleshooting 2]. Clustal Omega 1.2.4 (Sievers et al., 2011) is necessary for aligning sequences to generate multiple sequence alignments and conservation scores. Download source code from http://www.clustal.org/omega/#Download, and extract. Move into the extracted directory, then configure, make, and make install. ./configure make sudo make install The executable should be installed to a directory in the path, preferably in /usr/bin. If not placed in /usr/bin, you will need to change clustalo_cmd in tools.align_sequences. KineFold (Xayaphoummine et al., 2005) randomly estimates mRNA folding energy for a given nucleotide sequence. This is useful for comparing wild-type and variant sequences. Download the KineFold executable from http://kinefold.curie.fr/download.html. Extract the downloaded file and move into a directory in the $PATH. NUPACK 3.0.6 (Zadeh et al., 2010) also computes mRNA stability. Register at the NUPACK website, and wait to receive a password. Download source files from NUPACK website http://www.nupack.org/downloads/source. Add the bin directory to your $PATH variable. RNAfold (Lorenz et al., 2011) also computes mRNA stability. Download the ViennaRNA (version 2.4.10) source code from https://www.tbi.univie.ac.at/RNA/ and extract the file. Move into the extracted directory, then ./configure, make, and make install. ./configure make sudo make install HH-Suite 3.3.0 (Steinegger et al., 2019). This is required for NetSurfP. The easiest way to install is from conda. conda install -c bioconda hhsuite Alternately, may be installed from GitHub. git clonehttps://github.com/soedinglab/hh-suite.git Uniclust (Mirdita et al., 2017). This is required for HH-Suite and NetSurfP. Download Uniclust30 version 2018_08 for HHsuite from http://gwdu111.gwdg.de/∼compbiol/uniclust/2018_08/. More up-to-date versions may be used, but have not been tested. Update the uniclust and hhsuite_model variables in compute_features.py to the appropriate locations. NetSurfP 2.0 (Klausen et al., 2019) estimates the accessible surface area of the protein at different amino acids in the protein sequence. This is useful for determining which amino acids are accessible to solvent or binding. Install from pip. pip install netsurfp2 Alternately, request download from Downloads, Version 2.0 - Any at https://services.healthtech.dtu.dk/service.php?NetSurfP-2.0. After downloading and extracting file, follow instructions in README.md. Finally, install HH-Suite and Uniclust as described above. NetPhos 3.1b (Blom et al., 1999) is used to estimate the phosphorylation potential of a protein at each position in the sequence. Request download from Downloads, using the appropriate OS or system at https://services.healthtech.dtu.dk/service.php?NetPhos-3.1. Extract the package and move into the resulting directory. Install as directed by the readme file. Add the executable directory to your $PATH, or copy the executable to a bin already in your $PATH. NetOGlyc 3.1 (Hansen et al., 1998) is used to estimate the O-linked glycosylation potential of a protein at each position in the sequence. Request download from Downloads, using the appropriate OS or system at https://services.healthtech.dtu.dk/service.php?NetOGlyc-4.0. Extract the package and move into the resulting directory. Install as directed by the readme file. Add the executable directory to your $PATH, or copy the executable to a bin already in your $PATH. NetNGlyc 1.0 (Gupta and Brunak, 2002) is used to estimate the N-linked glycosylation potential of a protein at each position in the sequence. Request download from Downloads, using the appropriate OS or system at https://services.healthtech.dtu.dk/service.php?NetOGlyc-4.0. Extract the package and move into the resulting directory. Install as directed by the readme file. Add the executable directory to your $PATH, or copy the executable to a bin already in your $PATH. Coarse-grained co-translational folding model (Jacobs and Shakhnovich, 2017), but primarily the rare-codon enrichment calculation. An adapted version of this code is included with the rest of the code for this protocol. This program is useful for locating regions in the nucleotide sequence that are enriched in rare codons, and where such enrichment is conserved across multiple species. Change the path to rc_enrichment in compute_features.py to the appropriate directory where the coarse-grained co-translational folding model is installed. sys.path.insert(0, “/media/home/workspace/Submission/rc_enrichment”) Pymol 1.8.4.0 is useful for visualizing protein crystal structures. This may elucidate similar structural motifs or important binding sites between proteins. The easiest way to install is from the default package manager. sudo apt-get install pymol Copy the text in the gray box from here https://pymolwiki.org/index.php/InterfaceResidues and save as InterfaceResidues.py. Rosetta 3.10 is useful for modeling and designing protein structures. It’s also helpful for docking proteins to identify potential binding sites. Apply for an academic license or purchase a license here: https://www.rosettacommons.org/software/license-and-download. When you receive an academic license, as well as a username and password, enter them here: https://www.rosettacommons.org/software/academic/. Download rosetta_src_3.10_bundle.tgz, rosetta_bin_linux_3.10_bundle.tgz and rosetta_3.10_user_guide.tgz and extract them to an appropriate directory. Update all variables at the top of analyze_all_variants.ipynb, including the directory containing the files (also used as output). Unless specified otherwise, all programs should be installed in a directory added to the $PATH variable (or placed in a main bin directory).

Other resources

Timing: 8 h EVmutation (Hopf et al., 2017). An adapted version of the python wrappers is included in EVmutation folder. Requites PLMC. Can be downloaded from GitHub with git clonehttps://github.com/debbiemarkslab/plmc.git Then, inside the either install with make all if using a single core Linux system. If using openMP, install with make all-openmp Finally, either add the plmc directory to your $PATH or copy the plmc executable to a directory in your $PATH. RemuRNA (Salari et al., 2013). Can be installed from conda with conda install -c bioconda remurna mFold 3.6 (Zuker, 2003). Download mFold 3.6 from http://www.unafold.org/mfold/software/download-mfold.php. After extracting and moving into the resulting directory. ./configure make sudo make install

Download necessary datasets

Timing: Days, dependent on download speeds Download the newest BIOGRID (Oughtred et al., 2021) release (or other protein-protein interaction database such as STRING (Szklarczyk et al., 2019)). From BIOGRID website, select downloads at top. Select Current-Release. Select BIOGRID-ALL-x.x.xxx.tab3.zip, where x.x.xxx is the version. For replication purposes, we used BIOGRID-ALL-4.4.209.tab3.zip. Download file and unzip to location specified in biogrid_file in analyze_all_variants.ipynb. Download all host proteins involved in desired biological process (for example coagulation) from Gene Ontology. Search your biological function on AmiGO (Carbon et al., 2009) at http://amigo.geneontology.org/amigo/landing. A view of the website, including the query and one possible result is shown in Figure 1.

Figure 1

A view of the AmiGO interface

(A) A typical search bar is shown in (A) where a user can search for a gene ontology term or a gene.

(B) The results for the term “coagulation” are shown in (B). A user may query any desired protein function or biological process.

A view of the AmiGO interface (A) A typical search bar is shown in (A) where a user can search for a gene ontology term or a gene. (B) The results for the term “coagulation” are shown in (B). A user may query any desired protein function or biological process. Select Ontology. Choose the closest matching term (blood coagulation). Filter for only host proteins (Homo sapiens). Select Download. Add Gene/product (bioentity_label) to top of the Selected fields. Download and save text to location in go_file variable in analyze_all_variants.ipynb. Download latest GWAS output. At COVID-19 HGI (COVID-19 Host Genetics Initiative, 2021) website, select Downloads, then latest release with data. Select a dataset, and download the GRCh37 liftover and associated tbi file. For replication, we used the Release 6 version of the A2_ALL_leave_23andme study. Decompress the file to the location in the gwas_file variable. As implied by previous steps, set the variables biogrid_file, go_file, and gwas_file appropriately in the analyze_all_variants.ipynb. The initialization of analyze_all_variants.ipynb is shown in Figure 2.

Figure 2

The first section of the analyze_all_variants.ipynb

This region contains the primary variables you should have to change to run this code. This code is available at the FDA GitHub repository listed in the “Data and code availability” section.

The first section of the analyze_all_variants.ipynb This region contains the primary variables you should have to change to run this code. This code is available at the FDA GitHub repository listed in the “Data and code availability” section. Set the host, host_organism, and virus variables appropriately: host and virus should be set to the taxonomy IDs of the host and virus organisms. CRITICAL: analyze_all_variants.ipynb contains variables representing the locations of these datasets, as well as the prefix to be used for output files, and the host and viral names and taxonomy IDs. Please set these variables in the notebook because you attempt to run it, as the program will need the appropriate locations of the files to read them. You can find them near the top of the notebook.

Key resources table

Step-by-step method details

Most steps are managed by analyze_all_variants.ipynb. If the dependencies are properly installed and the variables at the top of analyze_all_variants.ipynb are set, this notebook will run most of the following computations. In addition, locations of the code segments corresponding to each step are stated. A view of the menu of the notebook is shown in Figure 3.

Figure 3

View of the menus of an analyze_all_variants.ipynb notebook

Assuming everything is set correctly, only the circled yellow button needs to be clicked to run the entire pipeline.

View of the menus of an analyze_all_variants.ipynb notebook Assuming everything is set correctly, only the circled yellow button needs to be clicked to run the entire pipeline. Details are included here to further assist at each step. Additionally, timing is included here, but is not important and will be highly variable depending on the datasets involved.

Identify host proteins that bind to viral proteins and are involved in the biological process

Timing: 1 day This step identifies proteins that both interact with viral proteins and are involved in a particular biological process (blood coagulation, in this instance). These proteins may indicate an impact of viral invasion on the relevant biological process, and variants in the genes encoding these proteins may impact viral binding. In the second cell of analyze_all_variants.ipynb, Interactions are filtered from the BIOGRID release to include only those that involve host and viral proteins. In the third cell, genes involved in the biological process (coagulation) are extracted from the AmiGO output, and only those included in the previous step are included. In the fourth cell, each gene from the previous step is queried in NCBI’s Gene database, and the most appropriate RefSeq accession identifier is selected.

Find sequences, synonymous variants, and missense variants for each gene

Timing: 2 days This step finds the associated nucleotide sequences for each gene automatically identified in the previous section, including pre-spliced sequences, transcript fed to ribosome, and open reading frame read by ribosome. Additionally, this will find all synonymous and missense variants for each gene in dbSNP. This step will run BLAST and Clustal Omega to generate multiple sequence alignments of homologous nucleotide and amino acid sequences to compute conservation scores for all variants. Then, this step will query the Variant Effect Predictor to score the effect of any missense variants. Additionally, this step will use codon and codon pair usage from CoCoPUTs (Codon/Codon Pair Usage Tables (CoCoPUTs) (fda.gov)) to score the change in codon and codon pair usage using relative synonymous codon usage (RSCU), relative synonymous codon pair usage (RSCPU), and %MinMax. Additionally, variants are scored with KineFold and NUPACK to access impact on mRNA stability. This step will also score missense variants for the potential effect on post-translational modifications with NetPhos, NetOGlyc, and NetNGlyc. Finally, NetSurfP is used to predict the accessible surface area of the protein to determine if variants occur on the surface or the interior of the protein. Variants in conserved regions or variants that effect biochemical features of the DNA or the protein may impact viral binding or invasion. Synonymous variants may impact protein expression and protein conformation in some circumstances (Sauna and Kimchi-Sarfaty, 2011; Jankowska et al., 2022). This occurs in the fifth cell of analyze_all_variants.ipynb. For each gene and associated RefSeq accession ID: run bio_tools.get_all_seqs with the associated gene name and RefSeq accession ID. seqs = bio_tools.get_all_seqs(genename, nm, write=True)["seqs"] Then, find and score all synonymous variants. tmpd = get_dbSNP.get_syn(genename, seqs["ORF"], nm) Finally, find and score all missense variants. tmpd = get_dbSNP.get_missense(genename, seqs["ORF"], nm)

Query GWAS data and score associated variants

Timing: 2 days This step will find all variants in the host genes identified in steps 1 and 2 with a p-value less than 0.05, and then score the variants. As part of this process, we gather all identifiers for each variant, including the dbSNP accession ID, as well as transcript coordinate and genomic coordinate HGVS identifiers. This is useful when we query them in databases such as dbSNP, ClinVar, and Google Scholar. Because these variants almost exclusively appear in introns and UTRs, we score them with splicing predictors such as ESRseq scores, HEXplorer scores, ESEfinder, FAS ESS, and ExonScan. These variants may impact viral binding to the host protein, or expression of the protein and impact of the viral binding. While the code is specific to COVID-19 HGI formatting and column names, other GWAS studies may be used by changing the relevant column names. This occurs in the sixth cell of analyze_all_variants.ipynb. Give the GWAS file location and the gene names to the gwas_pipeline function. All other columns in the GWAS data can be changed, but are set to defaults for COVID-19 HGI datasets. [troubleshooting 3] [troubleshooting 4]. gwas_data = parse_gwas.gwas_pipeline(gwas_file, genes, prefix=gwas_prefix, index_col="SNP", p_col="all_inv_var_meta_p", chromosome="#CHR", position="POS", ref="REF", alt="ALT", plim=0.05, translation=gwas_prefix + ".trans") Pause point: By default, the program will pause to output genomic HGVS identifiers for all relevant GWAS variants to the file specified as gwas_prefix + “_nc_acc.txt". The user should then manually feed this list to batch Mutalyzer in order to find transcriptomic HGVS identifiers. This result should be saved in the same directory as the list with filename set to gwas_prefix + “.trans".

Query Dali for structurally similar proteins

Timing: 3 days This step will identify any proteins that are structurally homologous to the viral protein and that bind to the host protein. This may elucidate specific motifs common to viral and host proteins, giving insight to the way viral and host proteins bind. Moreover, if the resulting homologs are involved in another biological process, this may indicate other effects of viral proteins on the host. For each viral protein that interacts with one of the host proteins identified in steps 1 and 2: Query the protein sequence for the viral protein on NCBI’s protein database (: https://www.ncbi.nlm.nih.gov/protein/). An example query output is given in Figure 4.

Figure 4

A view of the NCBI Protein database interface

This is the result of a query of “SARS-CoV-2 ORF7a”. Several resulting proteins’ sequences are shown, as well as multiple possible filters (circled in red) to apply to results. Link: https://www.ncbi.nlm.nih.gov/protein/.

In some instances, it may be useful to filter for the organism, as shown in Figure 4. Query the protein in the Protein Data Bank. If you can’t find a quality structure, move on to step c. Otherwise, you may download the structure and skip to step d. You may query the protein name, or the sequence. A good structure should be of high quality, high resolution (low Angstroms), complete (full sequence included), and a realistic environment. Submit the protein sequence from step A to I-TASSER to be modeled. If you do not have an account, you will have to register an account. For replication, we generally ignore the options for modeling. A view of the I-TASSER interface is shown in Figure 5.

Figure 5

A view of the I-TASSER webserver interface

An example sequence and protein ID have been entered. This website requires a user email and password, but optionally allows for more constraints. We don’t use these constraints.

Download the file corresponding to the produced structure. Submit this file to Dali under the PDB search header. When your results are finished from Dali, open a python terminal, then type. import compute_features compute_features.parse_dali(html, gene, path=directory) where ‘html’ is the link given from the Dali run with the results, ‘gene’ is the name of the viral protein, and ‘directory’ is the location to write to. Afterwards, run the final three cells of analyze_all_variants.ipynb. The third-to-last and second-to-last cells create helping dictionaries. The last cell cross-references structural homologs against proteins known to interact with the host protein of interest. A view of the NCBI Protein database interface This is the result of a query of “SARS-CoV-2 ORF7a”. Several resulting proteins’ sequences are shown, as well as multiple possible filters (circled in red) to apply to results. Link: https://www.ncbi.nlm.nih.gov/protein/. A view of the I-TASSER webserver interface An example sequence and protein ID have been entered. This website requires a user email and password, but optionally allows for more constraints. We don’t use these constraints.

Align structures to identify shared motifs

Timing: 1 day This step will visualize the shared structural motifs between viral proteins and homologous host proteins that also bind to other host proteins. In particular, this step will identify the regions containing these structural motifs, which may particularly impact binding between host and viral proteins. This section doesn’t involve variants identified in prior sections. For each viral protein that interacts with one of the host proteins associated with your biological process of interest. An example output is shown in Figure 6.

Figure 6

View of structurally aligned proteins in Pymol

The proteins include SARS-CoV-2 ORF7a protein, and two human proteins with significant structural similarity. Additionally, the Pymol console is shown (at top), and the list of proteins is shown at the right along with additional Pymol options.

Open the I-TASSER modeled structure from the previous section in Pymol. In Pymol, fetch the associated homologous structures. For each homologous structure identified for the viral protein, ABCD_E. fetch ABCD In the ‘all’ section on the right, click ‘H’, then ‘everything’. In the ‘all ‘section on the right, click ‘S’, then ‘cartoon’. Select ‘Display’, then ‘Sequence’. For each homologous structure, highlight all chains not involved (everything except for E here). Then, in the ‘(sele)’ section on the right, click ‘H’, then ‘everything’. For each homologous structure, highlight the chain involved (E). Then, in the Pymol terminal, super ABCD, I-TASSER_model to structurally align the homolog and the I-TASSER viral protein model. Save the Pymol session or the image, as needed. View of structurally aligned proteins in Pymol The proteins include SARS-CoV-2 ORF7a protein, and two human proteins with significant structural similarity. Additionally, the Pymol console is shown (at top), and the list of proteins is shown at the right along with additional Pymol options.

Dock viral and host proteins

Timing: 10 days This step will indicate likely binding sites of viral and host proteins. This may further indicate vital regions that will impact viral binding and its influence on the biological process. Variants included in these regions may strongly impact binding of viral and host proteins. For each viral protein and interacting host protein. Find the protein sequence of the host protein and model this structure in I-TASSER as described above. Submit both the host protein and viral protein structures to ZDock. When Zdock is finished, load each resulting PDB file into Pymol and manually curate each docking for feasibility. If necessary, a literature review may indicate vital positions that are involved in binding. For each acceptable output docking from ZDock, give to Rosetta Prepack asWhere /PATH/TO/ROSETTA/EXECS/ is the global path to the Rosetta executables, $PDB is the path to the docking file from ZDOCK, A and B are the names of the chains (viral and host proteins), $SUFFIX is the suffix of the output files you want to use. /PATH/TO/ROSETTA/EXECS/docking_prepack_protocol.linuxgccrelease -nstruct 1 -detect_disulf true -rebuild_disulf true -ex1 -ex2aro -overwrite -ignore_zero_occupancy false -in:file:s $PDB -unboundrot $pdb -partners A_B -suffix $SUFFIX -out:file:scorefile ${SUFFIX}.sc Then, use. /PATH/TO/ROSETTA/EXECS/docking_protocol.linuxgccrelease -nstruct 1 -out:pdb_gz true -ex1 -ex2aro -in:file:s $PDB -partners A_B -suffix $SUFFIX -out:file:scorefile ${SUFFIX}.sc Again, set $PDB, A and B, and $SUFFIX appropriately. Use scores_to_csv to compile the score files:where directory is the directory containing all the score files (.sc) generated by the previous step and viral_gene_host_gene is the names of the viral and host genes (you can modify this as desired). python scores_to_csv.py --directory directory --col total_score --prefix viral_gene_host_gene Read through scores.tsv. The last number in each score/PDB file name indicates the Rosetta model generated, and the models with the lowest total_score should be selected. For each of the optimal docking models, load the PDB into Pymol. Visualize as desired. We suggest hiding everything then showing cartoon, and displaying sequence as described previously. Then, next to ‘all’ on the right, select ‘C’, then ‘by chain’, then ‘by chain’. A view of the Pymol interface is shown in Figure 7.

Figure 7

View of the menus of Pymol

Yellow indicates the visualization options, red indicates standard menus, and purple shows the protein sequences. A docked structure consisting of two proteins is shown. The two chains are colored differently as described in step 9, part i.

Click ‘File’, then ‘Run’, then browse to the location of ‘InterfaceResidues.py’. Then, in the Pymol terminal, enter.where docked_struct is the docking of the viral and host proteins. interfaceResidues docked_struct All highlighted positions are amino acids that interact between the two proteins and may warrant further focus. View of the menus of Pymol Yellow indicates the visualization options, red indicates standard menus, and purple shows the protein sequences. A docked structure consisting of two proteins is shown. The two chains are colored differently as described in step 9, part i.

Expected outcomes

The code described here provides five primary outputs related to host proteins potentially interacting with the virus: (i) Identification of host proteins that interact with viral proteins. (ii) The subset of synonymous variants of the interacting host genes. (iii) The subset of missense variants of the interacting host genes. (iv) Variants of the interacting host genes that are significantly impactful per the GWAS study. (v) Structural alignments of viral proteins with structural homologs that also interact with the host proteins. This pipeline identifies coagulation related-proteins that interact with SARS-CoV-2 proteins but can be easily modified and adjusted for the search of interactions (and their potential impact) between proteins involved in any specific/defined biological pathway and/or process and proteins encoded by any type of virus. The pipeline identifies the host-proteins that may interact with the virus/virally encoded proteins, including selected host-protein variants (found in various subpopulations) by screening numerous databases including COVID-19 host genetics initiative (HGI) GWAS data, dbSNP or ClinVar. It should be noted here that, the subset of GWAS data includes only data with p-value less than 0.05 (unless specified differently). This may exclude many variants, or even some genes of interest entirely. All identified missense and synonymous variants are then rated. The subset of synonymous variants includes minor allele frequencies, conservation scores, rare codon enrichment, change in mRNA minimum free energy from multiple tools, many codon and codon pair usage scores, and the number of values that are in the extreme. The subset of missense variants includes the same scores as the synonymous variants, except that the conservation scores are computed for amino acid sequences, rather than nucleotide sequences. Evaluation of host-viral interaction for different host protein variants can help understand the distinct disease pathogenesis in minority groups. The Jupyter Notebook rendering of the synonymous variants and scores is shown in Figure 8.

Figure 8

Visualization of the scored synonymous variants in coagulation-involved human proteins that interact with SARS-CoV-2 proteins (the output of the step 4)

This shows the identifier of the variant (with the gene location) as the row name. Primarily minor allele frequencies for different populations from GnomAD are shown. Not all columns are shown here.

Visualization of the scored synonymous variants in coagulation-involved human proteins that interact with SARS-CoV-2 proteins (the output of the step 4) This shows the identifier of the variant (with the gene location) as the row name. Primarily minor allele frequencies for different populations from GnomAD are shown. Not all columns are shown here. In the pipeline outcome, in addition to the GWAS data, the identified variants have genomic and transcriptomic HGVS coordinates, 501 nucleotide flanking sequence (250 nucleotides on left and right) for wild-type and mutant sequences, summaries of results from splicing tools, minor allele frequencies in different populations, as well as any information on clinical significance or publications referencing the variant. The rendering of the computed GWAS data is shown in Figure 9.

Figure 9

Visualization of the filtered and processed GWAS data of coagulation-involved human proteins that interact with SARS-CoV-2 proteins (the output of the step 5)

This shows the location of the variant, and effect size of the variant on severe COVID-19 status. Additional columns are now shown.

Visualization of the filtered and processed GWAS data of coagulation-involved human proteins that interact with SARS-CoV-2 proteins (the output of the step 5) This shows the location of the variant, and effect size of the variant on severe COVID-19 status. Additional columns are now shown. Additionally, the pipeline will produce a list of host proteins that are structurally homologous to the viral proteins and also bind to the associated host proteins. This step identifies viral proteins that may target same host proteins via competition binding thus disturbing the host protein function and the processes that the protein regulates. Additionally, structural alignments of these proteins may elucidate the structurally homologous regions. Finally, the docked structures may imply a more significant impact on variants in amino acids near the protein-protein interface. The sequence and structural alignments may help to identify the host proteins that are critical for viral replication and highlight the host proteins that can be affected by viral invasion. Predicting the host-viral interactions between the host protein and its variants with the viral proteins are critical in development of potential treatments.

Quantification and statistical analysis

For all synonymous and non-synonymous variants, quantify the number of computed features that are above the 95th percentile or below the 5th percentile. For some features such as post-translational modifications that are zero for most amino acids, use the 95th percentile of nonzero values. Variants with more extreme values may be more impactful on the protein structure/function, or more common in the population and more impactful to the population at large. This occurs in the eighth and ninth cell of analyze_all_variants.ipynb. Further analysis of impactful variants is usually manually curated for interesting, extreme, and common variants.

Limitations

While protein-protein interactions databases tend to be reliable, they may include potentially spurious results that have not been peer reviewed. To overcome this limitation, sources for protein-protein interactions should be manually curated to verify for credibility. In addition, this protocol is exclusively in silico, so results should be interpreted as such. In vitro validation on the results may be helpful.

Troubleshooting

Problem 1

Any step: Python package not found.

Potential solution

Additional Python packages may not be installed. If Anaconda is installed, these can usually be installed with conda install [package name] The repository may need to be specified or added in Anaconda to find the appropriate package. Conda-forge and Bioconda are common repositories for many dependencies.

Problem 2

Permission denied when attempting to get sudo privileges for sudo make install You may move the executables to a bin in the $PATH, or add the directory containing the executable to your $PATH variable. The easiest way to add this directory to your path variable is to include this statement in your ∼/.bashrc file. To do this, edit the ∼/.bashrc file. Add the line export PATH="/PATH/TO/EXECUTABLE/FILE:$PATH" to the end of the file. Then, run source ∼/.bashrc to update your $PATH variable.

Problem 3

Step 5: Columns expected but not found in GWAS result file. In the gwas_pipeline function, certain columns are expected in the GWAS file, including the p-value of the variant impact, the chromosome, position, reference and alternate alleles. If the titles of these columns are different from the defaults in gwas_pipeline, feed the appropriate column names as arguments to gwas_pipeline. If these columns are not included in the GWAS file, it may be necessary to process the GWAS file to create these columns.

Problem 4

Step 5: Very little data in GWAS results, or the same to transcriptomic HGVS identifier used for many different variants in results. The gwas_pipeline function includes a section to translate genomic HGVS identifiers to transcriptomic HGVS identifiers. If the batch Mutalyzer translation file is not used or available, the program will manually pass each variant to Mutalyzer, frequently resulting in duplicate entries and errors. To solve this, specify a translation file by passing as a translation argument to gwas_pipeline. Then when the program pauses, feed the list of genomic HGVS identifiers to batch Mutalyzer and save the result as your translation file.

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Chava Kimchi-Sarfaty (chava.kimchi-sarfaty@fda.hhs.gov).

Materials availability

All materials are publicly available from the U.S. Food and Drug Administration GitHub. All other required materials have been outlined above.

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data

BIOGRID 4.4.209Database of protein-protein interaction information. Used in step 1.	(Oughtred et al., 2021)	RRID: SCR_007393
AmiGODatabase of biological processes and the proteins involved. Used in step 2.	(Carbon et al., 2009)	RRID: SCR_002143
COVID-19 HGI Release 6Contains GWAS studies for human variants in different elements of COVID-19 infection and survival. Used in step 5.	(COVID-19 Host Genetics Initiative, 2021)	RRID: SCR_022272
dbSNPDatabase of single nucleotide polymorphisms in the human genome. Used in steps 4b and c.	NCBI (Sherry et al., 2001)	RRID: SCR_002338
ClinVarDatabase of clinical information for variants in the human genome. Used in steps 4b and c.	NCBI (Landrum et al., 2018)	RRID: SCR_006169
Google ScholarSearch tool to find information from academic resources. Used in steps 4b and c.	Google	RRID: SCR_008878
EntrezDatabase of biological data, specifically protein and nucleotide sequences. Used in steps 4a, b, and c.	NCBI (Sayers et al., 2022)	RRID: SCR_016640
Protein Data BankDatabase of protein structural data. Used in the step 6.	Research Collaboratory for Structural Bioinformatics (Berman et al., 2000)	RRID: SCR_012820
Codon and Codon-Pair Usage Tables (September 2021) (CoCoPUTs)Database of codon and codon-pair usage data. Used in the steps 4b, 4c, and 5.	(Alexaki et al., 2019)	RRID: SCR_018504
ESEfinderTool to find exonic splicing enhancers in a nucleotide sequence. Used in step 5.	(Cartegni et al., 2002, Cartegni et al., 2003)	RRID: SCR_007088
FAS ESSPredict and analyze exonic splicing silencers. Used in step 5.	(Wang et al., 2004)	RRID: SCR_022517
ExonScanTool that expects a primary transcript sequence, preferably excluding the first and last exon. Used in step 5.	(Wang et al., 2004, Yeo and Burge, 2004)	RRID: SCR_022516
I-TASSERWebserver to model protein structures. Used in step 6.	(Yang et al., 2015)	RRID: SCR_014627
DaliWebserver to find structural homologs of proteins. Used in step 6.	(Holm, 2019)	RRID: SCR_013433
ZDockFast Fourier Transform based protein docking programs Used in step 9b.	(Pierce et al., 2014)	RRID: SCR_022518

Software and algorithms

Python 3.7.12Programming language and associated packages. Used in all steps.	(Van Rossum and Drake, 2009)	RRID: SCR_008394
Biopython 1.69Library of biologically relevant tools for Python. Used in steps 4 and 5.	(Cock et al., 2009)	RRID: SCR_007173
BLAST, specifically through BioPython and EntrezTool to search for homologous sequences. Used in steps 4b and c.	(Altschul et al., 1990)	RRID: SCR_004870
Clustal Omega 1.2.4Tool to align multiple protein sequences simultaneously. Used in steps 4b and c.	(Sievers et al., 2011)	RRID: SCR_001591
NetNGlyc 1.0Tool to predict N-Glycosylation from protein sequence. Used in steps 4b and c.	(Gupta and Brunak, 2002)	RRID: SCR_001570
NetOGlyc 3.1Tool to predict O-Glycosylation from protein sequence. Used in steps 4b and c.	(Hansen et al., 1998)	RRID: SCR_009026
NetPhos 3.1bTool to predict phosphorylation from protein sequence. Used in steps 4b and c.	(Blom et al., 1999)	RRID: SCR_017975
NetSurfP 2.0Tool to predict protein surface availability from sequence. Used in steps 4b and c.	(Klausen et al., 2019)	RRID: SCR_018781
Coarse-grained Co-translational Folding Analysis and Rare Codon EnrichmentTool to predict co-translational folding energy and rare codon enrichment. Used in steps 4b and c.	(Jacobs and Shakhnovich, 2017)	RRID: SCR_022271
%MinMaxTool to predict rare codon clustering. Used in second step. Used in steps 4b and c.	(Rodriguez et al., 2018)	RRID: SCR_022268
Ensembl Variant Effect PredictorTool to predict effect of genetic variant and also contains minor allele frequencies. Used in steps 4b and c.	(McLaren et al., 2016)	RRID: SCR_007931
AL2CO conservation scoresDescribes multiple conservation scores. Used in steps 4 and 5.	(Pei and Grishin, 2001)	RRID: SCR_022267
RNAfold 2.4.10Tool to compute mRNA stability. Used in steps 4b and c.	(Lorenz et al., 2011)	RRID: SCR_008550
NUPACK 3.0.6Tool to compute mRNA stability. Used in steps 4b and c.	(Zadeh et al., 2010)	RRID: SCR_022274
KineFoldTool to compute mRNA stability. Used in steps 4b and c.	(Xayaphoummine et al., 2005)	RRID: SCR_022273
ESRseq hexamer scoreScores 6-mers for splicing potential. Used in step 5.	(Ke et al., 2011)	RRID: SCR_022270
HEXplorer score (ZEI and ZWS)Scores 6-mers for splicing potential. Used in step 5.	(Erkelenz et al., 2014)	RRID: SCR_022269
Rosetta 3.10Software suite used for prediction of protein structure and docking. Used in step 9.	(Kahraman et al., 2013)	RRID: SCR_015701
Pymol 1.8.4.0Software to visualize proteins, protein complexes, and other molecules. Used in step 8.	(Pymol, 2020)	RRID: SCR_000305

Other

Dell Precision 7730 laptop with Intel Core i7-8850H, 32 GB memory, 500 GB solid state drive	Dell

51 in total

Review 1. Listening to silence and understanding nonsense: exonic mutations that affect splicing.

Authors: Luca Cartegni; Shern L Chew; Adrian R Krainer
Journal: Nat Rev Genet Date: 2002-04 Impact factor: 53.242

Review 2. Understanding the contribution of synonymous mutations to human disease.

Authors: Zuben E Sauna; Chava Kimchi-Sarfaty
Journal: Nat Rev Genet Date: 2011-08-31 Impact factor: 53.242

3. NetSurfP-2.0: Improved prediction of protein structural features by integrated deep learning.

Authors: Michael Schantz Klausen; Martin Closter Jespersen; Henrik Nielsen; Kamilla Kjaergaard Jensen; Vanessa Isabell Jurtz; Casper Kaae Sønderby; Morten Otto Alexander Sommer; Ole Winther; Morten Nielsen; Bent Petersen; Paolo Marcatili
Journal: Proteins Date: 2019-03-09

4. Genomic HEXploring allows landscaping of novel potential splicing regulatory elements.

Authors: Steffen Erkelenz; Stephan Theiss; Marianne Otte; Marek Widera; Jan Otto Peter; Heiner Schaal
Journal: Nucleic Acids Res Date: 2014-08-21 Impact factor: 16.971

5. An approach to cellular tropism of SARS-CoV-2 through protein-protein interaction and enrichment analysis.

Authors: Daniel Ortega-Bernal; Selene Zarate; Maria de Los Ángeles Martinez-Cárdenas; Rafael Bojalil
Journal: Sci Rep Date: 2022-06-07 Impact factor: 4.996

6. Kinefold web server for RNA/DNA folding path and structure prediction including pseudoknots and knots.

Authors: A Xayaphoummine; T Bucher; H Isambert
Journal: Nucleic Acids Res Date: 2005-07-01 Impact factor: 16.971

7. Coagulation abnormalities and thrombosis in patients with COVID-19.

Authors: Marcel Levi; Jecko Thachil; Toshiaki Iba; Jerrold H Levy
Journal: Lancet Haematol Date: 2020-05-11 Impact factor: 18.959

8. Vitamin K epoxide reductase complex subunit 1 (VKORC1) gene polymorphism as determinant of differences in Covid-19-related disease severity.

Authors: Rob Janssen; Jona Walk
Journal: Med Hypotheses Date: 2020-08-25 Impact factor: 1.538

9. Gene variants of coagulation related proteins that interact with SARS-CoV-2.

Authors: David Holcomb; Aikaterini Alexaki; Nancy Hernandez; Ryan Hunt; Kyle Laurie; Jacob Kames; Nobuko Hamasaki-Katagiri; Anton A Komar; Michael DiCuccio; Chava Kimchi-Sarfaty
Journal: PLoS Comput Biol Date: 2021-03-17 Impact factor: 4.475

10. Coagulation factors directly cleave SARS-CoV-2 spike and enhance viral entry.

Authors: Edward R Kastenhuber; Marisa Mercadante; Benjamin Nilsson-Payant; Jared L Johnson; Javier A Jaimes; Frauke Muecksch; Yiska Weisblum; Yaron Bram; Vasuretha Chandar; Gary R Whittaker; Benjamin R tenOever; Robert E Schwartz; Lewis Cantley
Journal: Elife Date: 2022-03-23 Impact factor: 8.140