Protocol for using NoBadWordsCombiner to merge and minimize "bad words" from BLAST hits against multiple eukaryotic gene annotation databases.

Xi Zhang1,2, Yining Hu3, David Roy Smith4.   

Abstract

Annotating protein-coding genes can be challenging, especially when searching for the best hits against multiple functional databases. This is partly because of "bad words" appearing as top hits, such as hypothetical or uncharacterized proteins. To help alleviate some of these issues, we designed a bioinformatics tool called NoBadWordsCombiner, which efficiently merges the hits from various databases, strengthening gene definitions by minimizing functional descriptions containing "bad words." Unlike other available tools, NoBadWordsCombiner is user friendly, but it does require users to have some general bioinformatics skills, including a basic understanding of the BLAST package and dash shell in Linux/Unix environments. For complete details on the use and execution of this protocol, please refer to Zhang et al. (2021a).
© 2021 The Author(s).

Keywords:  Bioinformatics; Genomics; Sequence analysis

Year:  2021        PMID: 34704076      PMCID: PMC8521201          DOI: 10.1016/j.xpro.2021.100888

Source DB:  PubMed          Journal:  STAR Protoc        ISSN: 2666-1667


Before you begin

Next-generation sequencing (NGS) technologies can generate huge amounts of molecular sequence data (Yandell and Ence, 2012). Functional annotations of protein-coding genes from NGS data can be easily acquired via database searches, including NCBI-NR (Pruitt et al., 2005), UniProtKB/Swiss-Prot (Boutet et al., 2007), and TrEMBL (Boeckmann et al., 2003). But the results of these searches often include ‘bad words’, such as best hits to hypothetical proteins or uncharacterized proteins, which can confuse the interpretation of gene annotation results. Indeed, it was reported that 20–30% of the annotations from assembled chlamydomonadalean nuclear genomes are represented by hypothetical proteins, including those from the Chlamydomonas reinhardtii genome (Zhang et al., 2021a). For various other recently sequenced genomes, the percentage of hypothetical proteins can be even higher (Galperin, 2001). It can be time-consuming to manually curate the functional hits from Basic Local Alignment Search Tool (BLAST) searches. This can be especially true if trying to minimize hits containing ‘bad words’ (e.g., hypothetical proteins) when the redundant hits have meaningful functional annotations (i.e., without ‘bad words’). Currently, there are very few user-friendly bioinformatics tools for merging and minimizing ‘bad words’ during functional gene annotation, and those that are available typically involve custom programming scripts with a steep learning curve (De Wit et al., 2012). Here, we present NoBadWordsCombiner, an open-source, user-friendly bioinformatics web tool for efficiently merging and minimizing ‘bad words’ scanned from various functional annotation databases. This tool can plug in to external databases, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000) and InterProScan (Quevillon et al., 2005), to strengthen the definition of gene annotations. NoBadWordsCombiner does require users to have some basic familiarity with bioinformatics. 
They must be comfortable with the BLAST package (Altschul et al., 1997), the dash shell in Linux/Unix environments, and inputting files from third-party tools, such as InterProScan (Quevillon et al., 2005) and KEGG (BlastKOALA and GhostKOALA) (Kanehisa et al., 2016). Recently, we sequenced, assembled, and annotated the nuclear genome of the Antarctic green alga Chlamydomonas sp. UWO241 (Zhang et al., 2021a), hereafter referred to as UWO241. During our analysis of this genome, we designed and applied the NoBadWordsCombiner tool during the functional annotation stage, which greatly minimized descriptions containing ‘bad words’. The protocol presented here describes how to use NoBadWordsCombiner for merging and minimizing ‘bad words’ from eukaryotic gene annotation databases. The model psychrophilic green alga UWO241 is used as a case study.

Overview

NoBadWordsCombiner merges the functional gene annotation BLAST hits from NCBI nr, UniProtKB/Swiss-Prot, and TrEMBL database searches. Specifically, it removes redundancy from descriptions containing hits to hypothetical or uncharacterized proteins (not including instances when all hits are hypothetical/uncharacterized). Then, the definition of the combined hits is strengthened via protein functional domains and pathway information based on data from the InterPro and KEGG databases. Finally, the overview of the gene annotations with the minimized ‘bad words’ is summarized in a mega table.
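The merging rule described above can be illustrated with a short Python sketch. It is not taken from the NoBadWordsCombiner source: the function names, the (description, E-value) tuple layout, and the two-entry word list are assumptions made for illustration, and the choice of the lowest E-value when every hit contains a ‘bad word’ is also an assumption.

```python
# Hypothetical sketch of the 'bad words' rule; all names are illustrative.
BAD_WORDS = ("hypothetical", "uncharacterized")  # assumed word list


def has_bad_word(description):
    """Return True if a hit description contains a 'bad word'."""
    text = description.lower()
    return any(word in text for word in BAD_WORDS)


def pick_best_hit(ncbi_hit, swiss_hit):
    """Choose between an NCBI-NR hit and a Swiss-Prot hit.

    Each hit is a (description, e_value) tuple, or None when the
    database returned no hit for the gene.
    """
    hits = [hit for hit in (ncbi_hit, swiss_hit) if hit is not None]
    if not hits:
        return None
    clean = [hit for hit in hits if not has_bad_word(hit[0])]
    # When every hit contains a 'bad word', nothing can be minimized,
    # so fall back to all hits (assumption: lowest E-value wins).
    candidates = clean or hits
    return min(candidates, key=lambda hit: hit[1])


ncbi = ("hypothetical protein CEUSTIGMA_g3419.t1 [Chlamydomonas eustigma]", 1.15e-39)
swiss = ("DNA mismatch repair protein MSH6 OS=Arabidopsis thaliana", 1.61e-05)
print(pick_best_hit(ncbi, swiss)[0])
```

With the g3.t1 values from the example tables, the sketch prefers the Swiss-Prot description even though the NCBI-NR hit has the lower E-value, because the NCBI-NR description contains ‘hypothetical’.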

Downloading the software and prerequisites

NoBadWordsCombiner can be operated on the web (http://hsdfinder.com/combiner/) or in a local environment (Linux and Python 3) after downloading the software package from GitHub (https://github.com/zx0223winner/NoBadWordsCombiner). To run locally, pre-installed Python (preferably Python 3) and Linux (e.g., Ubuntu 20.04 LTS) environments are required. The BLAST and InterProScan software packages as well as the online KEGG pathways tools BlastKOALA and GhostKOALA (Kanehisa et al., 2016) can be accessed via the links in the key resources table. The minimum requirement is a computer with 2 cores, 4 GB of RAM, and 256 GB of storage, which should allow the ‘bad words’ to be merged and minimized within a few minutes.

Key resources table

Materials and equipment

The software implementation was written in Python 3 using the following custom scripts and platforms: NoBadWordsCombiner.py, which enables the ‘bad words’ to be merged and minimized from BLAST hits against multiple eukaryotic gene annotation databases and protein signature databases (e.g., Pfam); Django (3.1.5), a Python-based web platform, which maintains the web server; and pandas (1.2.2), the software library used for manipulating the data. Blastxml_to_tabular.py (Cock et al., 2015) is a Python script that can convert a BLAST XML file to the desired tabular output. The NCBI-NR and UniProtKB/TrEMBL databases, including the gene annotations, are computationally analyzed, whereas the UniProtKB/Swiss-Prot database is manually curated and, thus, contains fewer annotations of hypothetical proteins. A full list of the utilized packages and databases, including links, can be found in the key resources table. The full NoBadWordsCombiner source code can be found in the GitHub repository. A useful hands-on tutorial (Online_NoBadWordsCombiner Tutorial.pdf) can also be accessed under the tutorial directory of GitHub. The test input data consist of BLAST and protein signature results from InterProScan (Quevillon et al., 2005). Five mandatory tab-delimited tables are needed to run the tool. The first and second input files, the NCBI-NR and Swiss-Prot BLAST results, have 14 columns (Tables 1 and 2). These two tables are parsed from the local BLAST results via a Python script (Blastxml_to_tabular.py). The third and fourth input files are a 1-column gene name list file and a 2-column KEGG annotation file, respectively (Tables 3 and 4). The fifth input file, the InterProScan results, has 13 columns (Table 5). The KO accession with each gene model identifier was retrieved from the KEGG database (Kanehisa and Goto, 2000). 
In the following step-by-step protocol, we use the deduced protein sequences from the UWO241 genome annotation (Zhang et al., 2021a) to show how to generate these tables.
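Before uploading, it can help to confirm that each of the five files has the column count described above. The following is a minimal sketch using only the Python standard library; the function name and the dictionary keys are hypothetical labels, not part of NoBadWordsCombiner, and the expected counts are simply those stated in this section.

```python
import csv

# Column counts stated in this protocol for the five input files.
# The keys are illustrative labels, not names used by the tool.
EXPECTED_COLUMNS = {
    "ncbi_nr_blast": 14,
    "swissprot_blast": 14,
    "gene_name_list": 1,
    "kegg_ko": 2,
    "interproscan": 13,
}


def check_columns(lines, kind):
    """Return the 1-based numbers of rows whose column count is wrong."""
    expected = EXPECTED_COLUMNS[kind]
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    bad_rows = []
    for number, row in enumerate(reader, start=1):
        if row and len(row) != expected:  # blank lines are ignored
            bad_rows.append(number)
    return bad_rows


example = ["g59.t1\tK10849\n", "g60.t2\tK17087\n", "g61.t2\n"]
print(check_columns(example, "kegg_ko"))
```

An open file handle can be passed in place of the list, for example `check_columns(open("BLASTP_UWO241_uniprot.tsv"), "swissprot_blast")`.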
Table 1

Input file example of NCBI nr database BLAST result

QueryAcc	Query_Length	HitDescription	HitName	HitLength	HitBits	HSP_rank	%ID	eValue	Query_Start	Query_End	Hit_start	Hit_end	HSP_length
g1.t1	817	hypothetical protein CEUSTIGMA_g3421.t1 [Chlamydomonas eustigma]	gi|1238995578|dbj|GAX75978.1|	1443	260.766	1	54.2635659	1.41E-75	10	774	11	268	258
g2.t1	399	ankyrin, partial [Anaeromyces robustus]	gi|1183350135|gb|ORX78377.1|	235	65.4698	1	40.2298851	3.61E-10	19	279	18	96	87
g3.t1	3567	hypothetical protein CEUSTIGMA_g3419.t1 [Chlamydomonas eustigma]	gi|1238995576|dbj|GAX75976.1|	1103	172.17	1	38.4615385	1.15E-39	805	1674	330	597	299
g4.t1	963	hypothetical protein CEUSTIGMA_g3418.t1 [Chlamydomonas eustigma]	gi|1238995575|dbj|GAX75975.1|	623	310.457	1	89.5061728	1.17E-97	469	954	172	333	162
g6.t1	291	hypothetical protein CHLRE_10g421079v5 [Chlamydomonas reinhardtii]	gi|1335042461|gb|PNW77074.1|	103	82.8037	1	58.3333333	1.66E-18	103	282	34	93	60
g7.t1	7908	hypothetical protein CEUSTIGMA_g3945.t1 [Chlamydomonas eustigma]	gi|1238994727|dbj|GAX76500.1|	2934	156.377	1	32.6785714	7.48E-34	6334	7761	2313	2860	560
g9.t1	471	hypothetical protein CEUSTIGMA_g3416.t1 [Chlamydomonas eustigma]	gi|1238995573|dbj|GAX75973.1|	139	164.466	1	62.1212121	3.00E-49	76	468	11	139	132
g10.t1	1827	hypothetical protein GPECTOR_108g190 [Gonium pectorale]	gi|1004134917|gb|KXZ42995.1|	463	331.257	1	78.8288288	1.18E-103	580	1245	48	269	222
Table 2

Input file example of SwissProt database BLAST result

QueryAcc	Query_Length	HitDescription	HitName	HitLength	HitBits	HSP_rank	%ID	eValue	Query_Start	Query_End	Hit_start	Hit_end	HSP_length
g2.t1	399	2-5A-dependent ribonuclease OS=Mus musculus OX=10090 GN=Rnasel PE=1 SV=2	sp|Q05921|RN5A_MOUSE	735	48.1358	1	34.8837209	4.14E-06	25	267	125	206	86
g3.t1	3567	DNA mismatch repair protein MSH6 OS=Arabidopsis thaliana OX=3702 GN=MSH6 PE=1 SV=2	sp|O04716|MSH6_ARATH	1324	53.5286	1	41.8181818	1.61E-05	379	543	121	175	55
g4.t1	963	Eukaryotic peptide chain release factor GTP-binding subunit ERF3A OS=Homo sapiens OX=9606 GN=GSPT1 PE=1 SV=1	sp|P15170|ERF3A_HUMAN	499	234.958	1	72.2972973	2.94E-72	511	954	69	216	148
g9.t1	471	Thylakoid-associated protein slr0729 OS=Synechocystis sp. (strain PCC 6803 / Kazusa) OX=1111708 GN=slr0729 PE=4 SV=1	sp|P72673|Y729_SYNY3	101	47.7506	1	29.4736842	1.49E-06	187	468	11	99	95
g10.t1	1827	Threonylcarbamoyl-AMP synthase OS=Schizosaccharomyces pombe (strain 972 / ATCC 24843) OX=284812 GN=sua5 PE=3 SV=1	sp|O94530|SUA5_SCHPO	408	217.238	1	50.8264463	3.67E-63	580	1242	58	299	242
g15.t1	270	Protein transport protein Sec61 subunit beta OS=Chlamydomonas reinhardtii OX=3055 GN=SEC61B PE=1 SV=1	sp|A8I6P9|SC61B_CHLRE	89	65.4698	1	53.7037037	2.15E-14	106	261	36	89	54
g16.t1	897	Probable prolyl 4-hydroxylase 4 OS=Arabidopsis thaliana OX=3702 GN=P4H4 PE=2 SV=1	sp|Q8LAN3|P4H4_ARATH	298	157.147	1	41.0788382	9.49E-45	1	708	69	289	241
g17.t1	1104	GATA transcription factor 3 OS=Arabidopsis thaliana OX=3702 GN=GATA3 PE=2 SV=2	sp|Q8L4M6|GATA3_ARATH	269	62.003	1	56.097561	1.17E-09	79	201	171	211	41
Table 3

Input file example of gene name list

Gene name
g1.t1
g2.t1
g3.t1
g4.t1
g5.t1
g6.t1
g7.t1
Table 4

Input file example of KO accession with each gene model identifier retrieved from the KEGG database

Gene identifier	KO accession
g59.t1	K10849
g60.t2	K17087
g61.t2	N/A
g62.t1	N/A
g63.t2	N/A
g64.t1	N/A
g65.t1	K15172
g66.t1	K02519
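A two-column file of this kind maps naturally onto a dictionary. The sketch below is an illustrative assumption about how such a file could be read, not the tool's own parser; the function name is hypothetical, and genes without a KO accession (shown as N/A above, or with the second column missing) are stored as None.

```python
def read_ko_table(lines):
    """Map gene identifier -> KO accession (None when no KO is assigned)."""
    ko_by_gene = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        gene = fields[0].strip()
        if not gene or gene == "Gene identifier":
            continue  # skip blank lines and the header row
        ko = fields[1].strip() if len(fields) > 1 else ""
        ko_by_gene[gene] = None if ko in ("", "N/A") else ko
    return ko_by_gene


table_4 = [
    "Gene identifier\tKO accession\n",
    "g59.t1\tK10849\n",
    "g61.t2\tN/A\n",
    "g65.t1\tK15172\n",
]
print(read_ko_table(table_4))
```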
Table 5

Input file example of InterProScan database result

Protein accession	Unique code	Sequence length	Protein signature	Signature accession	Signature description	Start location	Stop location	E-value	Status	Date	InterPro accession	InterPro description
g5250.t1	f246997202ceeb0ebfd5ea2f454be9a2	262	SUPERFAMILY	SSF82153	N/A	129	260	9.42E-10	T	31-03-2019	IPR036378	FAS1 domain superfamily
g5250.t1	f246997202ceeb0ebfd5ea2f454be9a2	262	ProSiteProfiles	PS50213	FAS1/BIgH3 domain profile.	111	257	9.579	T	31-03-2019	IPR000782	FAS1 domain
g5250.t1	f246997202ceeb0ebfd5ea2f454be9a2	262	Pfam	PF02469	Fasciclin domain	123	259	5.80E-09	T	31-03-2019	IPR000782	FAS1 domain
g5250.t1	f246997202ceeb0ebfd5ea2f454be9a2	262	ProSitePatterns	PS00183	Ubiquitin-conjugating enzymes active site.	69	84	-	T	31-03-2019	IPR023313	Ubiquitin-conjugating enzyme, active site
g5250.t1	f246997202ceeb0ebfd5ea2f454be9a2	262	SMART	SM00212	N/A	2	148	1.10E-36	T	31-03-2019	N/A	N/A
g5250.t1	f246997202ceeb0ebfd5ea2f454be9a2	262	PANTHER	PTHR44511	N/A	2	119	1.50E-59	T	31-03-2019	N/A	N/A
g5250.t1	f246997202ceeb0ebfd5ea2f454be9a2	262	SMART	SM00554	N/A	145	260	7.20E-07	T	31-03-2019	IPR000782	FAS1 domain
g15700.t1	e9641f2405b85bc4c48a85029514acf0	799	MobiDBLite	mobidb-lite	consensus disorder prediction	347	361	-	T	31-03-2019	N/A	N/A

Step-by-step method details

Preparing the NCBI-NR and UniProtKB/Swiss-Prot protein BLAST-search result files

Timing: ∼2 days (depending on the amount of data, computing power, and Internet speed)

Upload protein BLAST-search result files from your genome of interest in tab-separated values (tsv) format as the input files (Tables 1 and 2) of NoBadWordsCombiner. This protocol goes over how to acquire local BLAST-search results using an example FASTA file. The example file as well as the hands-on tutorial (Online_NoBadWordsCombiner Tutorial.pdf) can be acquired from GitHub under the tutorial directory (Figure 1A).
Figure 1

The NoBadWordsCombiner home page

(A) The GitHub web interface of this tool.

(B) Uploading the necessary input files.

(C) The interface of running the tool.

(D) The output example of the tool.

You can skip this step and proceed with your own protein data set if you know how to acquire the appropriate BLASTP search results.

1. Download the BLAST package and FASTA file. A BLAST-search result example file is found in the ZIP file in the GitHub ‘tutorial’ directory under the name NoBadWordsCombiner_file_examples.zip (Figure 1A). Also, the ‘blastdbv5-user-guide.pdf’ document in the GitHub ‘tutorial’ directory contains complementary vignettes to help guide users.

a. Download the BLAST package via https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/. Select the appropriate version for your operating system (Windows, macOS, or Linux).

b. Unzip the ‘NoBadWordsCombiner_file_examples.zip’ file; the file named ‘Chlamydomonas_UWO241_protein.fasta’ is the example FASTA file.

# Display the first ten rows of the FASTA file.
> head Chlamydomonas_UWO241_protein.fasta
>g1.t1
MAATVENVVERVKSFSSVVRGVKSGKPDGATTQLVQETIEILATYCDFEEVVPVCLKFLDEVLTAAPQTSTLIRLEGGAK
IFPSIIRNFMGVDASILALCAKVMCKCASGSPAMQHHLVKEKGLPTLLLSCCSAHAGEPAVVGPLLEVLVALARYSKGAT
ALSNANLVHACKELLVGLMGHWHAFGMVLKLIKSVMKHEGPCLAALKAGEVVRLLLGVARLVSRMPDQRKLLKRASRTLW
VLSQRSLHPLPEMELNWPHTHTHTHTHTHTHT
>g2.t1
MMMLAYRFGFTTLMYATVKGHADAMRLLLKHPSADTAAMMMLTDIRGCTALMFAAQDGHVNAIRMLLDHPSADVAARIAV
RSTVGISALTSAAGFAAGQPTLSRRASPARSCTPLLFLLRRVAVEPQLCDTQ
>g3.t1
MVPTDGARHGWTATSLPAILGAASHAKITVQQLVVGGPPPSCPYGPEIVGRSLSLFSKSAKTWDRAPGGVVSAFCAATGE

2. Set up the manually curated UniProtKB/Swiss-Prot database and the computationally annotated NCBI-NR database.

a. The ‘uniprot_sprot.fasta.gz’ file can be downloaded directly from https://www.uniprot.org/downloads. When downloading, choose Reviewed (Swiss-Prot) in FASTA format under the parent directory. The NCBI-NR BLAST v5 databases can be accessed via https://ftp.ncbi.nlm.nih.gov/blast/db. Some necessary files (e.g., nr.00.tar.gz, nr.00.tar.gz.md5, and nr.01.tar.gz) can be automatically downloaded via a Perl script at step 2c.

b. The ‘makeblastdb’ command constructs a protein database: it takes the decompressed FASTA file (-in), sets the database type (-dbtype prot), and gives the database a title (-title; e.g., uniprot_prot_database). The ‘-out’ option sets the database output name (e.g., uniprot_db).

# Note: if your FASTA data are nucleotide sequences, change the database type with the parameter (-dbtype nucl).
> ./makeblastdb -in uniprot_sprot.fasta -dbtype prot -title uniprot_prot_database -out uniprot_db

c. To download the NCBI-NR v5 databases, use the Perl script ‘update_blastdb.pl’, which is included in the downloaded BLAST+ package (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/).

# This command downloads the preformatted NCBI-NR database (https://ftp.ncbi.nlm.nih.gov/blast/db/v5) under the name ‘nr’, so there is no need to rebuild it with makeblastdb. Processing can take minutes to hours, depending on Internet speed. Users can first check all available databases via the command below.
> perl update_blastdb.pl --blastdb_version 5 --showall
# This will give results like this:
# Connected to NCBI; downloading BLASTDBv5
# human_genome
# landmark
# …

Users can then run the command below to download the nr database, which includes 55 volumes of data (>100 GB). Alternatively, users can manually download these 110 files (i.e., nr.00.tar.gz, nr.00.tar.gz.md5, etc.) from https://ftp.ncbi.nlm.nih.gov/blast/db/v5.

> perl update_blastdb.pl --blastdb_version 5 nr --decompress
# This will give results like this:
# Connected to NCBI; downloading BLASTDBv5
# Downloading nr (55 volumes) ...
# Downloading nr.00.tar.gz...
# Downloading nr.00.tar.gz.md5
# ...

d. Use the BLASTP search option to blast the amino acid sequences against the uniprot_db and nr_v5 databases. The BLASTP command performs the protein similarity search of the query file (Chlamydomonas_UWO241_protein.fasta) against the protein database created in the former step, with parameters such as ‘-evalue’ (the significance threshold for the BLAST hits), ‘-outfmt’ (the format of the BLAST result), and ‘-out’ (the name of the output file; e.g., BLASTP_UWO241_uniprot.xml).

> ./blastp -query Chlamydomonas_UWO241_protein.fasta -db uniprot_db -out BLASTP_UWO241_uniprot.xml -evalue 1e-5 -outfmt 5

e. The BLAST XML file (-outfmt 5) can include useful information that is absent from the default BLAST tabular file (-outfmt 6), such as the aligned sequence, the sequence of the hit, and the description of hits in the database. However, the XML format is not easy to read. Users will need a commonly used parser, the Python script Blastxml_to_tabular.py (Cock et al., 2015), to convert a BLAST XML file to the desired tabular output (tab-delimited file). First download the script Blastxml_to_tabular.py via the link in the key resources table, then run the command below.

> python blastxml_to_tabular.py -c qseqid,qlen,salltitles,sseqid,slen,bitscore,qframe,pident,evalue,qstart,qend,sstart,send,length BLASTP_UWO241_uniprot.xml > BLASTP_UWO241_uniprot.tsv

The values following the ‘-c’ option set the columns of the output tab-delimited file. For example, ‘qseqid’ refers to the query sequence ID and ‘qlen’ refers to the query length. These 14 values create a 14-column table (e.g., Tables 1 and 2).

f. To speed up the NCBI BLAST search, users can specify one or more (comma-delimited) taxids, or a file containing multiple taxids, on the command line. For example, to search against the Chlamydomonas reinhardtii (Taxonomy ID: 147) and Chlamydomonas eustigma (Taxonomy ID: 57939) nuclear genomes, use the command shown below. Also, we recommend that users browse the ‘blastdbv5-user-guide.pdf’ document in the GitHub ‘tutorial’ directory to familiarize themselves with the creation of either taxids or taxidlist.

> ./blastp -query Chlamydomonas_UWO241_protein.fasta -db nr -taxids 147,57939 -out BLASTP_UWO241_NCBI-NR.xml -evalue 1e-5 -outfmt 5
# Multiple taxonomy IDs are delimited by ','.
# Similar to Step 2e, the BLASTP_UWO241_NCBI-NR.xml file will be converted to BLASTP_UWO241_NCBI-NR.tsv via the command below.
> python blastxml_to_tabular.py -c qseqid,qlen,salltitles,sseqid,slen,bitscore,qframe,pident,evalue,qstart,qend,sstart,send,length BLASTP_UWO241_NCBI-NR.xml > BLASTP_UWO241_NCBI-NR.tsv
# The option ‘-c’ refers to the desired output columns, which can be set in comma-delimited format (e.g., qseqid,qlen,salltitles)

CRITICAL: Make sure to use the BLASTP option, which allows for greater sensitivity. The BLAST output must be in format 5 (XML). Users can adjust the E-value threshold, but we recommend that it be no greater than 1e-5 (to ensure accurate predictions). Troubleshooting 1.

3. This will give two BLAST result files, each a 14-column table including key information such as query name and percentage identity.

a. The 14 columns of the parsed BLAST search result files (Tables 1 and 2) are:
QueryAcc (e.g., g2.t1)
Query_Length (e.g., 399)
HitDescription (e.g., ankyrin, partial [Anaeromyces robustus])
HitName (e.g., gi|1183350135|gb|ORX78377.1|)
HitLength (e.g., 235)
HitBits (e.g., 65.4698)
HSP_rank (e.g., 1)
%ID (e.g., 40.2298851)
eValue (e.g., 3.61E-10)
Query_Start (e.g., 19)
Query_End (e.g., 279)
Hit_start (e.g., 18)
Hit_end (e.g., 96)
HSP_length (e.g., 87)

b. If users want to upload different BLAST files or mistakenly submitted an incorrect file, they can reload the browser page or simply replace the file with another one. Troubleshooting 2
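Because the BLAST flags are easy to mistype (a stray space or a long dash in place of a hyphen will break the command), one option is to assemble the blastp call as an argument list in Python. This is only a sketch under stated assumptions: the helper name `blastp_command` is hypothetical, blastp is assumed to be on the PATH (the protocol calls it as ./blastp), and the flags are the ones used in the commands above.

```python
import shlex


def blastp_command(query, db, out_xml, evalue="1e-5", taxids=None,
                   executable="blastp"):
    """Build the blastp argument list with XML output (-outfmt 5)."""
    args = [executable, "-query", query, "-db", db, "-out", out_xml,
            "-evalue", evalue, "-outfmt", "5"]
    if taxids:
        # -taxids takes a comma-delimited list of NCBI taxonomy IDs.
        args += ["-taxids", ",".join(str(taxid) for taxid in taxids)]
    return args


swissprot_cmd = blastp_command(
    "Chlamydomonas_UWO241_protein.fasta", "uniprot_db",
    "BLASTP_UWO241_uniprot.xml")
print(shlex.join(swissprot_cmd))
```

The resulting list can be passed to `subprocess.run(swissprot_cmd, check=True)` to launch the search; passing a list rather than a single string avoids shell quoting problems.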

Preparing the gene name list and a gene list with KO annotations from the KEGG database

Timing: ∼3 h (depending on the queuing time of GhostKOALA)

Users can retrieve the third and fourth files from the genome FASTA file and the KEGG database, which include the correlation of the KO accession with each gene model identifier (Figure 1B).

4. The gene name file is the baseline for merging all the different functional annotations.

5. Users can acquire the gene name list file with the following commands. We use the FASTA file from the UWO241 genome as an example (Table 3).

# ‘grep’ is used in the dash shell to select each line containing the pattern ‘>’. ‘sed’ then deletes every ‘>’, which generates a clean name list file. Users can first test grep via the command below:
> grep ">" Chlamydomonas_UWO241_protein.fasta | wc
# This should return the following result:
# 16325 16325 168617

Users can then carry out the following step to acquire a gene name file.
> grep ">" Chlamydomonas_UWO241_protein.fasta | sed 's/>//g' > UWO241-gene_name_list.txt

6. For the gene list with KO annotations, users can use either the GhostKOALA (genome size ≥ 300 MB) or the BlastKOALA analysis tool of KEGG to acquire the KO annotation file of the genome (https://www.kegg.jp/ghostkoala/). Below, we provide the necessary steps for using the tools:

7. BlastKOALA accepts a smaller dataset and is suitable for annotating high-quality genomes.
a. Upload the query amino acid sequences in FASTA format.
b. Enter the taxonomy group of your genome.
c. Enter the KEGG GENES database file to be searched.
d. Enter your email address. An email will be sent to you for confirmation of your input data. Click the link in the email to initiate your job; you will receive another email once it is finished.

8. GhostKOALA accepts larger datasets (e.g., 300 Mb) and is suitable for annotating metagenomes.
a. Upload the query amino acid sequences in FASTA format.
b. Enter the KEGG GENES database file to be searched.
c. Enter your email address. Same as above (7d).

9. From the email link of KEGG, users can download the gene list with the associated KO annotations. The format of the output file is shown in Table 4.

Explanation of the 2-column input file for KO accession (Table 4):
Gene identifier (e.g., g59.t1)
KO accession (e.g., K10849)

Use the GhostKOALA or BlastKOALA analysis tool of KEGG to acquire the KO annotation file of your genome (https://www.kegg.jp/ghostkoala/). We provide an example of a KO annotation file under the GitHub directory of the tutorial with the name NoBadWordsCombiner_file_examples.zip. Troubleshooting 3
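For users who prefer to stay in Python, the grep/sed step for the gene name list can be approximated as follows. This is a minimal sketch, not part of the tool; the function name is hypothetical. Note one small difference from the shell version: only the leading ‘>’ is removed, whereas sed 's/>//g' deletes every ‘>’ on the header line.

```python
def gene_names_from_fasta(lines):
    """Yield the header text after '>' for every FASTA record."""
    for line in lines:
        if line.startswith(">"):
            yield line[1:].strip()


fasta_lines = [">g1.t1\n", "MAATVENVVERVKSFSSVVRGVKSGKPDG\n",
               ">g2.t1\n", "MMMLAYRFGFTTLMYATVKGHADAMRLLL\n"]
print(list(gene_names_from_fasta(fasta_lines)))
```

To write the list to disk, iterate over an open FASTA file handle and write one name per line to UWO241-gene_name_list.txt.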

Preparing the InterProScan search result file

Timing: ∼3 h

Upload an InterProScan search result file of your genome in tab-delimited format as the fifth input file (Table 5).

Users must separately download and install InterProScan to generate the input file for the NoBadWordsCombiner tool. The latest InterProScan documentation can be found at https://interproscan-docs.readthedocs.io/en/latest/index.html. Here, we provide the necessary steps for using InterProScan:

Installation requirements: InterProScan is developed to run on Linux, and no versions are planned for Windows or Apple (Mac OS X) operating systems.

Software requirements: 64-bit Linux; Perl 5; Python 3; Java JDK/JRE version 11.

Obtaining the core InterProScan software (direct link: ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.51-85.0/interproscan-5.51-85.0-64-bit.tar.gz). This is a large file (around 8 GB).

Running InterProScan: Once the InterProScan package is uncompressed, it can be run directly from the command line.
# If this script is run with no arguments, the usage instructions will be displayed.
> ./interproscan.sh
Run the shell script below:
# interproscan.sh takes the input file with the parameter (-i) and sets the format of the output file with the parameter (-f) (e.g., tsv format). ‘-dp’ ensures that all the database matches are computed in the local environment.
> ./interproscan.sh -i Chlamydomonas_UWO241_protein.fasta -f tsv -dp

Output files: InterProScan should run without any warning, and it will create a tsv output file (i.e., Chlamydomonas_UWO241_protein.fasta.tsv) containing matches from several member databases, including Pfam. For your convenience, an example of an InterProScan search result is found in the ZIP file under the GitHub directory of tutorial with the name NoBadWordsCombiner_file_examples.zip. Troubleshooting 4

The 13 columns of the InterProScan search result file (Table 5) are:
Protein accession (e.g., g5250.t1)
Sequence unique code (e.g., f246997202ceeb0ebfd5ea2f454be9a2)
Sequence length (e.g., 262)
Protein signature (e.g., Pfam)
Signature accession (e.g., PF02469)
Signature description (e.g., Fasciclin domain)
Start location (e.g., 123)
Stop location (e.g., 259)
E-value (or score) (e.g., 5.80E-09)
Status: the status of the match (T: true)
Date: the date of the run (e.g., 31-03-2019)
InterPro annotations, accession (e.g., IPR000782)
InterPro annotations, description (e.g., FAS1 domain)

Before clicking the submission button, users can select one of nine protein signatures (i.e., Pfam, CDD, Hamap, PRINTS, ProDom, ProSitePatterns, ProSiteProfiles, SFLD, or TIGRFAM). We set Pfam as the default because it yields more database entries and has been widely used in many sequence analysis and genome annotation projects. Users can select other protein signatures, such as CDD, which can utilize 3D structures to decipher sequence, structure, and functional relationships. The descriptions of the various protein signatures are shown below:

Pfam: A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).
CDD: Prediction of Conserved Domains Database.
Hamap: A system for the classification and annotation of protein sequences.
PRINTS: A fingerprint is a group of conserved motifs used to characterize a protein family.
ProDom: A comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
ProSitePatterns and ProSiteProfiles: PROSITE consists of documentation entries describing protein domains, families, and functional sites as well as associated patterns and profiles to identify them.
SFLD: Protein families based on hidden Markov models (HMMs).
TIGRFAM: Protein families based on hidden Markov models (HMMs).
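Selecting one protein signature from the 13-column InterProScan table can be sketched as below. The column positions follow the layout listed above (signature name in column 4, E-value or score in column 9). How NoBadWordsCombiner itself chooses among several matches for the same protein is not described here, so keeping the match with the lowest E-value is an assumption, as is the function name. The rule should not be applied to ProSiteProfiles, where column 9 holds a score rather than an E-value.

```python
def best_signature_hits(lines, signature="Pfam"):
    """Return {protein: fields} with the lowest-E-value match per protein."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[3] != signature:
            continue
        try:
            evalue = float(fields[8])
        except ValueError:  # e.g. "-" for matches reported without a value
            evalue = float("inf")
        protein = fields[0]
        if protein not in best or evalue < best[protein][0]:
            best[protein] = (evalue, fields)
    return {protein: fields for protein, (_, fields) in best.items()}


rows = [
    "g5250.t1\tf246997202ceeb0ebfd5ea2f454be9a2\t262\tPfam\tPF02469\t"
    "Fasciclin domain\t123\t259\t5.80E-09\tT\t31-03-2019\tIPR000782\tFAS1 domain\n",
    "g5250.t1\tf246997202ceeb0ebfd5ea2f454be9a2\t262\tSMART\tSM00554\t"
    "N/A\t145\t260\t7.20E-07\tT\t31-03-2019\tIPR000782\tFAS1 domain\n",
]
pfam = best_signature_hits(rows)
print(pfam["g5250.t1"][4])
```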

Output file of the NoBadWordsCombiner tool

Timing: ∼3 min

Click the submit button, and a pending image will appear (Figure 1C). A run usually takes less than three minutes for a 200 Mb genome-sized file (Figure 1D). Troubleshooting 5

The output is a 23-column tab-delimited mega table (Table 6).
Table 6

Output file example of 23-column mega table via NoBadWordsCombiner

ID	Gene	Length	NoBadName_Hit_Des	NoBadName_Hit_Name	NoBadName_%ID	NoBadName_eValue	NCBI_Hit_Des	NCBI_Hit_Name	NCBI_%ID	NCBI_eValue	Swiss_Hit_Des	Swiss_Hit_Name	Swiss_%ID	Swiss_eValue	KEGG_KO	KEGG_Des	Pfam	Pfam_No	Pfam_Des	Pfam_evalue	Interpro_No	Interpro_domain
0	QueryAcc	Query_Length	HitDescription	HitName	%ID	eValue	HitDescription	HitName	%ID	eValue	HitDescription	HitName	%ID	eValue	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
1	g1.t1	817	hypothetical protein CEUSTIGMA_g3421.t1 [Chlamydomonas eustigma]	gi|1238995578|dbj|GAX75978.1|	54.2635659	1.41E-75	hypothetical protein CEUSTIGMA_g3421.t1 [Chlamydomonas eustigma]	gi|1238995578|dbj|GAX75978.1|	54.2635659	1.41E-75	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
2	g2.t1	399	2-5A-dependent ribonuclease OS=Mus musculus OX=10090 GN=Rnasel PE=1 SV=2	sp|Q05921|RN5A_MOUSE	34.8837209	4.14E-06	ankyrin, partial [Anaeromyces robustus]	gi|1183350135|gb|ORX78377.1|	40.2298851	3.61E-10	2-5A-dependent ribonuclease OS=Mus musculus OX=10090 GN=Rnasel PE=1 SV=2	sp|Q05921|RN5A_MOUSE	34.8837209	4.14E-06	N/A	N/A	Pfam	PF12796	Ankyrin repeats (3 copies)	1.80E-11	IPR020683	Ankyrin repeat-containing domain
3	g3.t1	3567	DNA mismatch repair protein MSH6 OS=Arabidopsis thaliana OX=3702 GN=MSH6 PE=1 SV=2	sp|O04716|MSH6_ARATH	41.8181818	1.61E-05	hypothetical protein CEUSTIGMA_g3419.t1 [Chlamydomonas eustigma]	gi|1238995576|dbj|GAX75976.1|	38.4615385	1.15E-39	DNA mismatch repair protein MSH6 OS=Arabidopsis thaliana OX=3702 GN=MSH6 PE=1 SV=2	sp|O04716|MSH6_ARATH	41.8181818	1.61E-05	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
4	g4.t1	963	Eukaryotic peptide chain release factor GTP-binding subunit ERF3A OS=Homo sapiens OX=9606 GN=GSPT1 PE=1 SV=1	sp|P15170|ERF3A_HUMAN	72.2972973	2.94E-72	hypothetical protein CEUSTIGMA_g3418.t1 [Chlamydomonas eustigma]	gi|1238995575|dbj|GAX75975.1|	89.5061728	1.17E-97	Eukaryotic peptide chain release factor GTP-binding subunit ERF3A OS=Homo sapiens OX=9606 GN=GSPT1 PE=1 SV=1	sp|P15170|ERF3A_HUMAN	72.2972973	2.94E-72	K03267	ERF3, GSPT; peptide chain release factor subunit 3	Pfam	PF00009	Elongation factor Tu GTP binding domain	1.70E-34	IPR000795	Transcription factor, GTP-binding domain
5	g6.t1	291	hypothetical protein CHLRE_10g421079v5 [Chlamydomonas reinhardtii]	gi|1335042461|gb|PNW77074.1|	58.3333333	1.66E-18	hypothetical protein CHLRE_10g421079v5 [Chlamydomonas reinhardtii]	gi|1335042461|gb|PNW77074.1|	58.3333333	1.66E-18	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
6	g7.t1	7908	hypothetical protein CEUSTIGMA_g3945.t1 [Chlamydomonas eustigma]	gi|1238994727|dbj|GAX76500.1|	32.6785714	7.48E-34	hypothetical protein CEUSTIGMA_g3945.t1 [Chlamydomonas eustigma]	gi|1238994727|dbj|GAX76500.1|	32.6785714	7.48E-34	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A	N/A
7	g9.t1	471	Thylakoid-associated protein slr0729 OS=Synechocystis sp. (strain PCC 6803 / Kazusa) OX=1111708 GN=slr0729 PE=4 SV=1	sp|P72673|Y729_SYNY3	29.4736842	1.49E-06	hypothetical protein CEUSTIGMA_g3416.t1 [Chlamydomonas eustigma]	gi|1238995573|dbj|GAX75973.1|	62.1212121	3.00E-49	Thylakoid-associated protein slr0729 OS=Synechocystis sp. (strain PCC 6803 / Kazusa) OX=1111708 GN=slr0729 PE=4 SV=1	sp|P72673|Y729_SYNY3	29.4736842	1.49E-06	N/A	N/A	Pfam	PF11378	Protein of unknown function (DUF3181)	6.90E-26	IPR021518	Protein of unknown function DUF3181
The 23 columns of the mega table (Table 6) are:
ID (e.g., 2)
Gene or QueryAcc (e.g., g2.t1)
Length or Query_Length (e.g., 399)
NoBadName_Hit_Des or HitDescription (e.g., 2-5A-dependent ribonuclease OS=Mus musculus OX=10090 GN=Rnasel PE=1 SV=2)
NoBadName_Hit_Name or HitName (e.g., sp|Q05921|RN5A_MOUSE)
NoBadName_%ID or %ID (e.g., 34.8837209)
NoBadName_eValue or eValue (e.g., 4.14E-06)
NCBI_Hit_Des or HitDescription (e.g., ankyrin, partial [Anaeromyces robustus])
NCBI_Hit_Name or HitName (e.g., gi|1183350135|gb|ORX78377.1|)
NCBI_%ID or %ID (e.g., 40.2298851)
NCBI_eValue or eValue (e.g., 3.61E-10)
Swiss_Hit_Des or HitDescription (e.g., 2-5A-dependent ribonuclease OS=Mus musculus OX=10090 GN=Rnasel PE=1 SV=2)
Swiss_Hit_Name or HitName (e.g., sp|Q05921|RN5A_MOUSE)
Swiss_%ID or %ID (e.g., 34.8837209)
Swiss_eValue or eValue (e.g., 4.14E-06)
KEGG_KO (e.g., K03267)
KEGG_Des (e.g., ERF3, GSPT; peptide chain release factor subunit 3)
Protein signatures (e.g., Pfam)
Pfam_No (e.g., PF12796)
Pfam_Des (e.g., Ankyrin repeats (3 copies))
Pfam_evalue (e.g., 1.80E-11)
Interpro_No (e.g., IPR020683)
Interpro_domain (e.g., Ankyrin repeat-containing domain)

We gave each column two header names (e.g., NoBadName_Hit_Des or HitDescription) to reduce ambiguity, because a uniform header is needed when merging different databases. We also included the percentage identity and E-value for each type of BLAST search result so that they can be easily compared. If two BLAST database hit descriptions are both without ‘bad words’, the one with the lower E-value will be chosen.
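The way per-database results are joined onto the gene name list, with N/A filling the gaps, can be sketched as follows. This is an illustration only and not the tool's implementation (which uses pandas): the function name and the (lookup, width) structure are assumptions. Genes without any annotation are skipped by default because the example output above lists only genes with at least one hit; whether the tool behaves this way in general is not stated.

```python
NA = "N/A"


def join_annotations(gene_names, tables, keep_empty=False):
    """Join annotation lookups onto the gene list, padding gaps with N/A.

    `tables` is a list of (lookup, width) pairs, where `lookup` maps a
    gene name to a list of `width` values for that database.
    """
    rows = []
    for gene in gene_names:
        annotated = any(gene in lookup for lookup, _ in tables)
        if not annotated and not keep_empty:
            continue  # assumption: genes without any annotation are omitted
        row = [str(len(rows) + 1), gene]
        for lookup, width in tables:
            row.extend(lookup.get(gene, [NA] * width))
        rows.append(row)
    return rows


ncbi = {"g6.t1": ["hypothetical protein CHLRE_10g421079v5 [Chlamydomonas reinhardtii]",
                  "gi|1335042461|gb|PNW77074.1|", "58.3333333", "1.66E-18"]}
swiss = {}
merged = join_annotations(["g5.t1", "g6.t1"], [(ncbi, 4), (swiss, 4)])
print("\t".join(merged[0]))
```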

Expected outcomes

NoBadWordsCombiner is a free and straightforward online bioinformatics tool for merging and minimizing hits to hypothetical or uncharacterized proteins from various eukaryotic functional annotation databases. It produces a mega table that combines the BLAST hits with entries from protein annotation databases, such as InterPro, Pfam, and KEGG KO. By comparing the data in the mega table, users can get an overview of each gene’s annotation patterns, including information on pathways, protein domains, and gene families. The aim of this tool is to assist with high-quality gene model annotations of eukaryotic nuclear genomes. We have provided a real example of the tool in action: the file contains the gene models and their functional descriptions for the UWO241 genome (Zhang et al., 2021a). These annotations greatly aided downstream analyses of this genome, such as detecting highly similar duplicated genes (HSDs), horizontally transferred genes, and gene family expansions (Zhang et al., 2021b).

Limitations

NoBadWordsCombiner is limited to presenting gene annotations in a mega table. It does not produce plots or statistical summaries, such as the total number of ‘bad words’, the frequency of genes containing ‘bad words’, or the types of genes that have ‘bad words’. E-values are only used to pick the better BLAST result when both hits contain no ‘bad words’. If the genome is misassembled, relying on E-values alone might yield false positives. In the future, we hope to incorporate thresholds for aligned length, percentage of pairwise identity, and number of domains into the algorithm. For example, the E-value would be used to judge the better BLAST functional hit only when certain criteria are satisfied (e.g., pairwise aligned length ≥ 50 amino acids, percentage identity ≥ 30%, and at least one domain).

The web tool relies on third-party tools to generate the input files, such as InterProScan (Quevillon et al., 2005) and the KEGG tools GhostKOALA or BlastKOALA (Kanehisa et al., 2016). Users need to be familiar with the basic BLAST package and the dash shell in Linux/Unix environments. Notably, there is a steep learning curve for users without any bioinformatics or programming experience. We hope to develop the tool further and remove some of the intermediate steps. For now, we have provided built-in reference files for each input file, as well as example data, to make the tool easier to use.

In the future, NoBadWordsCombiner will be updated continuously to cover more eukaryotic functional databases. It will also be expanded to work on other types of genomic data, such as prokaryotic and organelle genomes.
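The proposed threshold check can be illustrated with a short Python sketch. This is not part of the current tool: the `Hit` record, its field names, and the function names are hypothetical, and the default cutoffs simply mirror the example criteria quoted in the text (aligned length ≥ 50 amino acids, identity ≥ 30%, at least one domain).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Hypothetical record for one BLAST hit of a query gene."""
    description: str
    evalue: float
    aligned_length: int        # pairwise aligned length in amino acids
    percent_identity: float    # percentage of pairwise identity
    domain_count: int          # number of protein domains found for the query


def passes_thresholds(hit: Hit, min_length: int = 50,
                      min_identity: float = 30.0,
                      min_domains: int = 1) -> bool:
    """Return True when the hit satisfies all three example criteria."""
    return (hit.aligned_length >= min_length
            and hit.percent_identity >= min_identity
            and hit.domain_count >= min_domains)


def best_by_evalue(hits: List[Hit]) -> Optional[Hit]:
    """Compare E-values only among hits that pass the thresholds."""
    eligible = [h for h in hits if passes_thresholds(h)]
    if not eligible:
        return None  # no hit is trustworthy enough to be ranked
    return min(eligible, key=lambda h: h.evalue)


# Made-up values for illustration only.
short_hit = Hit("short alignment", 1e-50, 40, 60.0, 1)
solid_hit = Hit("solid alignment", 1e-10, 120, 35.0, 2)
chosen = best_by_evalue([short_hit, solid_hit])
```

In this sketch the hit with the smaller E-value is skipped because its alignment is shorter than 50 amino acids, so the second hit is selected.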

Troubleshooting

Problem 1

Why does BLASTP need to be chosen as an option? What E-value should I choose? (Step 2)

Potential solution

Use the BLASTP option because amino acid sequences are generally more conserved than their corresponding nucleotide sequences. We recommend an E-value cutoff no larger than 1e-5 to help ensure accurate predictions.
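As a minimal sketch of the E-value recommendation, the following Python snippet keeps only the hits at or below 1e-5. It assumes the BLAST results are in the standard 12-column tabular layout, in which the E-value is the 11th field; the function name `filter_hits` is hypothetical, and apart from the hit names, identities, and E-values taken from the example table, the sample numbers are placeholders.

```python
from typing import List

# 0-based index of the E-value in the standard 12-column tabular layout
# (qseqid, sseqid, pident, length, mismatch, gapopen,
#  qstart, qend, sstart, send, evalue, bitscore). This layout is an assumption.
EVALUE_COLUMN = 10


def filter_hits(tabular_text: str, max_evalue: float = 1e-5) -> List[List[str]]:
    """Return the rows whose E-value is no larger than max_evalue."""
    kept = []
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[EVALUE_COLUMN]) <= max_evalue:
            kept.append(fields)
    return kept


# Placeholder alignment coordinates and scores; illustration only.
sample = (
    "g2.t1\tsp|Q05921|RN5A_MOUSE\t34.8837209\t86\t50\t3\t10\t95\t20\t105\t4.14E-06\t50.1\n"
    "g2.t1\tsp|HYPOTHETICAL|EXAMPLE\t25.0\t40\t28\t2\t1\t40\t5\t44\t0.5\t20.3\n"
)
significant = filter_hits(sample)
```

Only the first row survives the filter because its E-value (4.14E-06) is below the cutoff, while the second (0.5) is not.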

Problem 2

Can I resubmit the input files? (Step 5)

Potential solution

Yes. Simply refresh (reload) the browser page.

Problem 3

Why is the KEGG KO annotation file needed and what does it look like? (Step 10)

Potential solution

An example file named ‘Input_4_NoBadWords_ko’ is provided on GitHub. The file records the KO accession assigned to each gene model identifier, which can be used to strengthen the functional categorization of the genes.
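To make the role of the KO file concrete, here is a minimal Python sketch that reads a gene-to-KO table into a dictionary. It assumes a two-column, tab-delimited layout (gene model identifier, then an optional KO accession), as produced by BlastKOALA or GhostKOALA; the exact layout of ‘Input_4_NoBadWords_ko’ should be checked against the example file. The function name and the pairing of g2.t1 with K03267 are illustrative only.

```python
from typing import Dict


def read_ko_table(text: str) -> Dict[str, str]:
    """Map each gene model identifier to its KO accession, or 'N/A' if none."""
    ko_by_gene = {}
    for line in text.splitlines():
        fields = line.split("\t")
        gene = fields[0].strip()
        if not gene:
            continue  # skip empty lines
        ko = fields[1].strip() if len(fields) > 1 else ""
        ko_by_gene[gene] = ko or "N/A"  # genes without a KO get a placeholder
    return ko_by_gene


# Illustrative content: one gene with a KO accession, one without.
sample_ko = "g2.t1\tK03267\ng9.t1\n"
ko_lookup = read_ko_table(sample_ko)
```

Genes without an assigned KO are kept with an ‘N/A’ placeholder, matching how missing KEGG entries appear in the mega table.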

Problem 4

Is it difficult to run InterProScan? (Step 13)

Potential solution

No. The tool is straightforward to run. A real example of an InterProScan result, named ‘Input_5_NoBadWords_Pfam’, is provided on GitHub in the NoBadWordsCombiner_file_examples.zip file. It is a tab-delimited file that includes the protein signatures, such as Pfam domain and InterPro annotations.

Problem 5

How does the tool proceed if the BLAST hits inferring hypothetical or uncharacterized proteins come from multiple databases? (Step 17)

Potential solution

If the BLAST hit descriptions from multiple databases all contain ‘bad words’, the one with the lowest E-value will be chosen.
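The selection rule described in this protocol can be sketched in a few lines of Python. This is an illustration of the stated behavior, not the tool's actual source code: the function names and the `BAD_WORDS` list are assumptions (the protocol names "hypothetical" and "uncharacterized" as examples), and the sample values come from the g9.t1 row of the example mega table.

```python
from typing import List, Optional, Tuple

# Assumed list; the protocol cites these two terms as examples of 'bad words'.
BAD_WORDS = ("hypothetical", "uncharacterized")

# One hit per database: (database name, hit description, E-value)
HitTuple = Tuple[str, str, float]


def has_bad_word(description: str) -> bool:
    """Return True if the description contains any 'bad word'."""
    text = description.lower()
    return any(word in text for word in BAD_WORDS)


def choose_hit(hits: List[HitTuple]) -> Optional[HitTuple]:
    """Prefer hits without 'bad words'; break ties by the lowest E-value.

    If every hit contains a 'bad word', fall back to the lowest E-value overall.
    """
    if not hits:
        return None
    clean = [h for h in hits if not has_bad_word(h[1])]
    pool = clean if clean else hits
    return min(pool, key=lambda h: h[2])


hits_g9 = [
    ("NCBI", "hypothetical protein CEUSTIGMA_g3416.t1 [Chlamydomonas eustigma]", 3.00e-49),
    ("Swiss", "Thylakoid-associated protein slr0729 OS=Synechocystis sp. "
              "(strain PCC 6803 / Kazusa) OX=1111708 GN=slr0729 PE=4 SV=1", 1.49e-06),
]
selected = choose_hit(hits_g9)
```

Here the Swiss-Prot hit is selected even though the NCBI hit has a much lower E-value, because the NCBI description contains "hypothetical". This agrees with the NoBadName columns of that row in the example table.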

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to, and will be fulfilled by, the lead contact, David Roy Smith (dsmit242@uwo.ca), and the technical contact, Xi Zhang (xzha25@uwo.ca).

Materials availability

This study did not generate new unique reagents.

REAGENT or RESOURCE | SOURCE | IDENTIFIER
Deposited data
Chlamydomonas sp. UWO241 (renamed Chlamydomonas priscuii) | (Zhang et al., 2021a) | GenBank: GCA_016618255
Software and algorithms
BLAST v2.2.26 | (Altschul et al., 1997) | ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
UniProtKB/Swiss-Prot | (Boutet et al., 2007) | https://www.uniprot.org/uniprot/?query=reviewed:yes
UniProtKB/TrEMBL | (Boeckmann et al., 2003) | https://www.uniprot.org/uniprot/?query=reviewed:no
NCBI-NR | (Pruitt et al., 2005) | https://www.ncbi.nlm.nih.gov/refseq/
InterProScan v4.7 | (Quevillon et al., 2005) | http://www.ebi.ac.uk/interpro/download/
BlastKOALA or GhostKOALA | (Kanehisa and Goto, 2000; Kanehisa et al., 2016) | https://www.kegg.jp
NoBadWordsCombiner | This article | http://hsdfinder.com/combiner
Python 3 | N/A | https://www.python.org/downloads/
Django v3.1.5 | N/A | https://www.djangoproject.com/download/
pandas v1.2.2 | N/A | https://pandas.pydata.org
blastxml_to_tabular.py | (Cock et al., 2015) | https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/blastxml_to_tabular.py

References

1.  M Kanehisa; S Goto. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000-01-01.

2.  Brigitte Boeckmann; Amos Bairoch; Rolf Apweiler; Marie-Claude Blatter; Anne Estreicher; Elisabeth Gasteiger; Maria J Martin; Karine Michoud; Claire O'Donovan; Isabelle Phan; Sandrine Pilbout; Michel Schneider. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003-01-01.

3.  Emmanuel Boutet; Damien Lieberherr; Michael Tognolli; Michel Schneider; Parit Bansal; Alan J Bridge; Sylvain Poux; Lydie Bougueleret; Ioannis Xenarios. UniProtKB/Swiss-Prot, the Manually Annotated Section of the UniProt KnowledgeBase: How to Use the Entry View. Methods Mol Biol. 2016.

4.  S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997-09-01.

5.  Minoru Kanehisa; Yoko Sato; Kanae Morishima. BlastKOALA and GhostKOALA: KEGG Tools for Functional Characterization of Genome and Metagenome Sequences. J Mol Biol. 2015-11-14.

6.  E Quevillon; V Silventoinen; S Pillai; N Harte; N Mulder; R Apweiler; R Lopez. InterProScan: protein domains identifier. Nucleic Acids Res. 2005-07-01.

7.  M Y Galperin. Conserved 'hypothetical' proteins: new hints and new puzzles. Comp Funct Genomics. 2001.

8.  Xi Zhang; Marina Cvetkovska; Rachael Morgan-Kiss; Norman P A Hüner; David Roy Smith. Draft genome sequence of the Antarctic green alga Chlamydomonas sp. UWO241. iScience. 2021-01-20.

9.  Xi Zhang; Yining Hu; David Roy Smith. Protocol for HSDFinder: Identifying, annotating, categorizing, and visualizing duplicated genes in eukaryotic genomes. STAR Protoc. 2021-06-23.

10.  Peter J A Cock; John M Chilton; Björn Grüning; James E Johnson; Nicola Soranzo. NCBI BLAST+ integrated into Galaxy. Gigascience. 2015-08-25.
