Xi Zhang, Yining Hu, David Roy Smith.
Abstract
Annotating protein-coding genes can be challenging, especially when searching for the best hits against multiple functional databases. This is partly because of "bad words" appearing as top hits, such as hypothetical or uncharacterized proteins. To help alleviate some of these issues, we designed a bioinformatics tool called NoBadWordsCombiner, which efficiently merges the hits from various databases, strengthening gene definitions by minimizing functional descriptions containing "bad words." Unlike other available tools, NoBadWordsCombiner is user friendly, but it does require users to have some general bioinformatics skills, including a basic understanding of the BLAST package and dash shell in Linux/Unix environments. For complete details on the use and execution of this protocol, please refer to Zhang et al. (2021a).
Keywords: Bioinformatics; Genomics; Sequence analysis
Year: 2021 PMID: 34704076 PMCID: PMC8521201 DOI: 10.1016/j.xpro.2021.100888
Source DB: PubMed Journal: STAR Protoc ISSN: 2666-1667
Input file example of NCBI nr database BLAST result
| QueryAcc | Query_Length | HitDescription | HitName | HitLength | HitBits | HSP_rank | %ID | eValue | Query_Start | Query_End | Hit_start | Hit_end | HSP_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| g1.t1 | 817 | hypothetical protein CEUSTIGMA_g3421.t1 [Chlamydomonas eustigma] | gi\|1238995578\|dbj\|GAX75978.1\| | 1443 | 260.766 | 1 | 54.2635659 | 1.41E-75 | 10 | 774 | 11 | 268 | 258 |
| g2.t1 | 399 | ankyrin, partial [Anaeromyces robustus] | gi\|1183350135\|gb\|ORX78377.1\| | 235 | 65.4698 | 1 | 40.2298851 | 3.61E-10 | 19 | 279 | 18 | 96 | 87 |
| g3.t1 | 3567 | hypothetical protein CEUSTIGMA_g3419.t1 [Chlamydomonas eustigma] | gi\|1238995576\|dbj\|GAX75976.1\| | 1103 | 172.17 | 1 | 38.4615385 | 1.15E-39 | 805 | 1674 | 330 | 597 | 299 |
| g4.t1 | 963 | hypothetical protein CEUSTIGMA_g3418.t1 [Chlamydomonas eustigma] | gi\|1238995575\|dbj\|GAX75975.1\| | 623 | 310.457 | 1 | 89.5061728 | 1.17E-97 | 469 | 954 | 172 | 333 | 162 |
| g6.t1 | 291 | hypothetical protein CHLRE_10g421079v5 [Chlamydomonas reinhardtii] | gi\|1335042461\|gb\|PNW77074.1\| | 103 | 82.8037 | 1 | 58.3333333 | 1.66E-18 | 103 | 282 | 34 | 93 | 60 |
| g7.t1 | 7908 | hypothetical protein CEUSTIGMA_g3945.t1 [Chlamydomonas eustigma] | gi\|1238994727\|dbj\|GAX76500.1\| | 2934 | 156.377 | 1 | 32.6785714 | 7.48E-34 | 6334 | 7761 | 2313 | 2860 | 560 |
| g9.t1 | 471 | hypothetical protein CEUSTIGMA_g3416.t1 [Chlamydomonas eustigma] | gi\|1238995573\|dbj\|GAX75973.1\| | 139 | 164.466 | 1 | 62.1212121 | 3.00E-49 | 76 | 468 | 11 | 139 | 132 |
| g10.t1 | 1827 | hypothetical protein GPECTOR_108g190 [Gonium pectorale] | gi\|1004134917\|gb\|KXZ42995.1\| | 463 | 331.257 | 1 | 78.8288288 | 1.18E-103 | 580 | 1245 | 48 | 269 | 222 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
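
Most of the top hits in this example are "hypothetical protein" entries, the kind of "bad word" description the abstract refers to. The following is a minimal sketch of how such descriptions could be flagged. The `BAD_WORDS` tuple and the `has_bad_word` helper are hypothetical names introduced for illustration; only "hypothetical" and "uncharacterized" are named in the abstract, and the word list used by NoBadWordsCombiner itself may differ.

```python
# Assumed list of uninformative terms; only these two are named in the abstract.
BAD_WORDS = ("hypothetical", "uncharacterized")


def has_bad_word(description, bad_words=BAD_WORDS):
    """Return True if a hit description contains any uninformative term."""
    text = description.lower()
    return any(word in text for word in bad_words)


# Two HitDescription values copied from the example table above.
hits = {
    "g1.t1": "hypothetical protein CEUSTIGMA_g3421.t1 [Chlamydomonas eustigma]",
    "g2.t1": "ankyrin, partial [Anaeromyces robustus]",
}
flagged = {gene: has_bad_word(desc) for gene, desc in hits.items()}
print(flagged)  # {'g1.t1': True, 'g2.t1': False}
```

The check is case-insensitive so that "Uncharacterized protein" and "uncharacterized protein" are treated alike.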
Input file example of SwissProt database BLAST result
| QueryAcc | Query_Length | HitDescription | HitName | HitLength | HitBits | HSP_rank | %ID | eValue | Query_Start | Query_End | Hit_start | Hit_end | HSP_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| g2.t1 | 399 | 2-5A-dependent ribonuclease OS=Mus musculus OX=10090 GN=Rnasel PE=1 SV=2 | sp\|Q05921\|RN5A_MOUSE | 735 | 48.1358 | 1 | 34.8837209 | 4.14E-06 | 25 | 267 | 125 | 206 | 86 |
| g3.t1 | 3567 | DNA mismatch repair protein MSH6 OS=Arabidopsis thaliana OX=3702 GN=MSH6 PE=1 SV=2 | sp\|O04716\|MSH6_ARATH | 1324 | 53.5286 | 1 | 41.8181818 | 1.61E-05 | 379 | 543 | 121 | 175 | 55 |
| g4.t1 | 963 | Eukaryotic peptide chain release factor GTP-binding subunit ERF3A OS=Homo sapiens OX=9606 GN=GSPT1 PE=1 SV=1 | sp\|P15170\|ERF3A_HUMAN | 499 | 234.958 | 1 | 72.2972973 | 2.94E-72 | 511 | 954 | 69 | 216 | 148 |
| g9.t1 | 471 | Thylakoid-associated protein slr0729 OS=Synechocystis sp. (strain PCC 6803 / Kazusa) OX=1111708 GN=slr0729 PE=4 SV=1 | sp\|P72673\|Y729_SYNY3 | 101 | 47.7506 | 1 | 29.4736842 | 1.49E-06 | 187 | 468 | 11 | 99 | 95 |
| g10.t1 | 1827 | Threonylcarbamoyl-AMP synthase OS=Schizosaccharomyces pombe (strain 972 / ATCC 24843) OX=284812 GN=sua5 PE=3 SV=1 | sp\|O94530\|SUA5_SCHPO | 408 | 217.238 | 1 | 50.8264463 | 3.67E-63 | 580 | 1242 | 58 | 299 | 242 |
| g15.t1 | 270 | Protein transport protein Sec61 subunit beta OS=Chlamydomonas reinhardtii OX=3055 GN=SEC61B PE=1 SV=1 | sp\|A8I6P9\|SC61B_CHLRE | 89 | 65.4698 | 1 | 53.7037037 | 2.15E-14 | 106 | 261 | 36 | 89 | 54 |
| g16.t1 | 897 | Probable prolyl 4-hydroxylase 4 OS=Arabidopsis thaliana OX=3702 GN=P4H4 PE=2 SV=1 | sp\|Q8LAN3\|P4H4_ARATH | 298 | 157.147 | 1 | 41.0788382 | 9.49E-45 | 1 | 708 | 69 | 289 | 241 |
| g17.t1 | 1104 | GATA transcription factor 3 OS=Arabidopsis thaliana OX=3702 GN=GATA3 PE=2 SV=2 | sp\|Q8L4M6\|GATA3_ARATH | 269 | 62.003 | 1 | 56.097561 | 1.17E-09 | 79 | 201 | 171 | 211 | 41 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |
Input file example of gene name list
| Gene name |
|---|
| g1.t1 |
| g2.t1 |
| g3.t1 |
| g4.t1 |
| g5.t1 |
| g6.t1 |
| g7.t1 |
| … |
Input file example of KO accession with each gene model identifier retrieved from the KEGG database
| Gene identifier | KO accession |
|---|---|
| g59.t1 | K10849 |
| g60.t2 | K17087 |
| g61.t2 | N/A |
| g62.t1 | N/A |
| g63.t2 | N/A |
| g64.t1 | N/A |
| g65.t1 | K15172 |
| g66.t1 | K02519 |
| … | … |
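
A two-column file like this can be read into a lookup keyed by gene identifier. The sketch below assumes tab-separated text in which a gene without a KO assignment has an empty or missing second field (shown as N/A in the table); the actual delimiter and layout of the KEGG output are not stated here, and `load_ko_map` is a hypothetical helper name.

```python
import csv
import io


def load_ko_map(handle):
    """Map each gene identifier to its KO accession, or "N/A" when none is assigned."""
    ko_map = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        ko = row[1].strip() if len(row) > 1 else ""
        ko_map[row[0].strip()] = ko or "N/A"
    return ko_map


# In-memory stand-in for the input file, using identifiers from the table above.
example = io.StringIO("g59.t1\tK10849\ng60.t2\tK17087\ng61.t2\ng65.t1\tK15172\n")
ko_map = load_ko_map(example)
print(ko_map["g59.t1"], ko_map["g61.t2"])  # K10849 N/A
```

Normalizing missing values to "N/A" at load time keeps the later merge step free of special cases.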
Input file example of InterProScan database result
| Protein accession | Unique code | Sequence length | Protein signature | Signature accession | Signature description | Start location | Stop location | E-value | Status | Date | InterPro accession | InterPro description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| g5250.t1 | f246997202ceeb0ebfd5ea2f454be9a2 | 262 | SUPERFAMILY | SSF82153 | N/A | 129 | 260 | 9.42E-10 | T | 31-03-2019 | IPR036378 | FAS1 domain superfamily |
| g5250.t1 | f246997202ceeb0ebfd5ea2f454be9a2 | 262 | ProSiteProfiles | PS50213 | FAS1/BIgH3 domain profile. | 111 | 257 | 9.579 | T | 31-03-2019 | IPR000782 | FAS1 domain |
| g5250.t1 | f246997202ceeb0ebfd5ea2f454be9a2 | 262 | Pfam | PF02469 | Fasciclin domain | 123 | 259 | 5.80E-09 | T | 31-03-2019 | IPR000782 | FAS1 domain |
| g5250.t1 | f246997202ceeb0ebfd5ea2f454be9a2 | 262 | ProSitePatterns | PS00183 | Ubiquitin-conjugating enzymes active site. | 69 | 84 | - | T | 31-03-2019 | IPR023313 | Ubiquitin-conjugating enzyme, active site |
| g5250.t1 | f246997202ceeb0ebfd5ea2f454be9a2 | 262 | SMART | SM00212 | N/A | 2 | 148 | 1.10E-36 | T | 31-03-2019 | N/A | N/A |
| g5250.t1 | f246997202ceeb0ebfd5ea2f454be9a2 | 262 | PANTHER | PTHR44511 | N/A | 2 | 119 | 1.50E-59 | T | 31-03-2019 | N/A | N/A |
| g5250.t1 | f246997202ceeb0ebfd5ea2f454be9a2 | 262 | SMART | SM00554 | N/A | 145 | 260 | 7.20E-07 | T | 31-03-2019 | IPR000782 | FAS1 domain |
| g15700.t1 | e9641f2405b85bc4c48a85029514acf0 | 799 | MobiDBLite | mobidb-lite | consensus disorder prediction | 347 | 361 | - | T | 31-03-2019 | N/A | N/A |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
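
The output table further down reports one Pfam entry and one InterPro entry per gene, whereas the InterProScan result lists several signatures per protein. The sketch below shows one way to reduce such rows to a single Pfam record per protein. Keeping the Pfam row with the lowest E-value is an assumption made for illustration, not a rule confirmed by the source, and `best_pfam_per_protein` is a hypothetical helper. Column positions follow the 13-column header above.

```python
# Three rows from the example table, joined as tab-separated text (13 columns each).
PREFIX = "g5250.t1\tf246997202ceeb0ebfd5ea2f454be9a2\t262\t"
LINES = [
    PREFIX + "ProSiteProfiles\tPS50213\tFAS1/BIgH3 domain profile.\t111\t257\t9.579\tT\t31-03-2019\tIPR000782\tFAS1 domain",
    PREFIX + "Pfam\tPF02469\tFasciclin domain\t123\t259\t5.80E-09\tT\t31-03-2019\tIPR000782\tFAS1 domain",
    PREFIX + "ProSitePatterns\tPS00183\tUbiquitin-conjugating enzymes active site.\t69\t84\t-\tT\t31-03-2019\tIPR023313\tUbiquitin-conjugating enzyme, active site",
]


def best_pfam_per_protein(lines):
    """Keep, for each protein, the Pfam row with the lowest E-value (assumed rule)."""
    best = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13 or cols[3] != "Pfam":
            continue
        try:
            evalue = float(cols[8])
        except ValueError:
            continue  # "-" means no E-value was reported
        if cols[0] not in best or evalue < best[cols[0]]["evalue"]:
            best[cols[0]] = {
                "pfam_no": cols[4],
                "pfam_des": cols[5],
                "evalue": evalue,
                "interpro_no": cols[11],
                "interpro_domain": cols[12],
            }
    return best


pfam = best_pfam_per_protein(LINES)
print(pfam["g5250.t1"]["pfam_no"], pfam["g5250.t1"]["interpro_no"])
```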
Figure 1. The NoBadWordsCombiner home page
(A) The GitHub web interface of this tool.
(B) Uploading the necessary input files.
(C) The interface of running the tool.
(D) The output example of the tool.
Output file example of 23-column mega table via NoBadWordsCombiner
| ID | Gene | Length | NoBadName_Hit_Des | NoBadName_Hit_Name | NoBadName_%ID | NoBadName_eValue | NCBI_Hit_Des | NCBI_Hit_Name | NCBI_%ID | NCBI_eValue | Swiss_Hit_Des | Swiss_Hit_Name | Swiss_%ID | Swiss_eValue | KEGG_KO | KEGG_Des | Pfam | Pfam_No | Pfam_Des | Pfam_evalue | Interpro_No | Interpro_domain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | QueryAcc | Query_Length | HitDescription | HitName | %ID | eValue | HitDescription | HitName | %ID | eValue | HitDescription | HitName | %ID | eValue | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| 1 | g1.t1 | 817 | hypothetical protein CEUSTIGMA_g3421.t1 [Chlamydomonas eustigma] | gi\|1238995578\|dbj\|GAX75978.1\| | 54.2635659 | 1.41E-75 | hypothetical protein CEUSTIGMA_g3421.t1 [Chlamydomonas eustigma] | gi\|1238995578\|dbj\|GAX75978.1\| | 54.2635659 | 1.41E-75 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| 2 | g2.t1 | 399 | 2-5A-dependent ribonuclease OS=Mus musculus OX=10090 GN=Rnasel PE=1 SV=2 | sp\|Q05921\|RN5A_MOUSE | 34.8837209 | 4.14E-06 | ankyrin, partial [Anaeromyces robustus] | gi\|1183350135\|gb\|ORX78377.1\| | 40.2298851 | 3.61E-10 | 2-5A-dependent ribonuclease OS=Mus musculus OX=10090 GN=Rnasel PE=1 SV=2 | sp\|Q05921\|RN5A_MOUSE | 34.8837209 | 4.14E-06 | N/A | N/A | Pfam | PF12796 | Ankyrin repeats (3 copies) | 1.80E-11 | IPR020683 | Ankyrin repeat-containing domain |
| 3 | g3.t1 | 3567 | DNA mismatch repair protein MSH6 OS=Arabidopsis thaliana OX=3702 GN=MSH6 PE=1 SV=2 | sp\|O04716\|MSH6_ARATH | 41.8181818 | 1.61E-05 | hypothetical protein CEUSTIGMA_g3419.t1 [Chlamydomonas eustigma] | gi\|1238995576\|dbj\|GAX75976.1\| | 38.4615385 | 1.15E-39 | DNA mismatch repair protein MSH6 OS=Arabidopsis thaliana OX=3702 GN=MSH6 PE=1 SV=2 | sp\|O04716\|MSH6_ARATH | 41.8181818 | 1.61E-05 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| 4 | g4.t1 | 963 | Eukaryotic peptide chain release factor GTP-binding subunit ERF3A OS=Homo sapiens OX=9606 GN=GSPT1 PE=1 SV=1 | sp\|P15170\|ERF3A_HUMAN | 72.2972973 | 2.94E-72 | hypothetical protein CEUSTIGMA_g3418.t1 [Chlamydomonas eustigma] | gi\|1238995575\|dbj\|GAX75975.1\| | 89.5061728 | 1.17E-97 | Eukaryotic peptide chain release factor GTP-binding subunit ERF3A OS=Homo sapiens OX=9606 GN=GSPT1 PE=1 SV=1 | sp\|P15170\|ERF3A_HUMAN | 72.2972973 | 2.94E-72 | K03267 | ERF3, GSPT; peptide chain release factor subunit 3 | Pfam | PF00009 | Elongation factor Tu GTP binding domain | 1.70E-34 | IPR000795 | Transcription factor, GTP-binding domain |
| 5 | g6.t1 | 291 | hypothetical protein CHLRE_10g421079v5 [Chlamydomonas reinhardtii] | gi\|1335042461\|gb\|PNW77074.1\| | 58.3333333 | 1.66E-18 | hypothetical protein CHLRE_10g421079v5 [Chlamydomonas reinhardtii] | gi\|1335042461\|gb\|PNW77074.1\| | 58.3333333 | 1.66E-18 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| 6 | g7.t1 | 7908 | hypothetical protein CEUSTIGMA_g3945.t1 [Chlamydomonas eustigma] | gi\|1238994727\|dbj\|GAX76500.1\| | 32.6785714 | 7.48E-34 | hypothetical protein CEUSTIGMA_g3945.t1 [Chlamydomonas eustigma] | gi\|1238994727\|dbj\|GAX76500.1\| | 32.6785714 | 7.48E-34 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| 7 | g9.t1 | 471 | Thylakoid-associated protein slr0729 OS=Synechocystis sp. (strain PCC 6803 / Kazusa) OX=1111708 GN=slr0729 PE=4 SV=1 | sp\|P72673\|Y729_SYNY3 | 29.4736842 | 1.49E-06 | hypothetical protein CEUSTIGMA_g3416.t1 [Chlamydomonas eustigma] | gi\|1238995573\|dbj\|GAX75973.1\| | 62.1212121 | 3.00E-49 | Thylakoid-associated protein slr0729 OS=Synechocystis sp. (strain PCC 6803 / Kazusa) OX=1111708 GN=slr0729 PE=4 SV=1 | sp\|P72673\|Y729_SYNY3 | 29.4736842 | 1.49E-06 | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
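
In the rows shown, the NoBadName columns repeat the Swiss-Prot hit whenever one exists and otherwise fall back to the NCBI nr hit, even when that hit is a "hypothetical protein". The sketch below reproduces that pattern. It is an inference from the example rows rather than the tool's documented algorithm: the Swiss-Prot-first ordering, the `BAD_WORDS` list, and the helper names `has_bad_word` and `pick_no_bad_name` are all assumptions.

```python
# Assumed list of uninformative terms; only these two are named in the abstract.
BAD_WORDS = ("hypothetical", "uncharacterized")


def has_bad_word(description):
    """Return True if a hit description contains any uninformative term."""
    text = description.lower()
    return any(word in text for word in BAD_WORDS)


def pick_no_bad_name(swiss_hit, ncbi_hit):
    """Prefer the first hit without a bad word; otherwise keep the first available hit."""
    candidates = [hit for hit in (swiss_hit, ncbi_hit) if hit is not None]
    for hit in candidates:
        if not has_bad_word(hit["des"]):
            return hit
    return candidates[0] if candidates else None


# Hits copied from the example tables for two genes.
ncbi = {
    "g1.t1": {"des": "hypothetical protein CEUSTIGMA_g3421.t1 [Chlamydomonas eustigma]",
              "name": "gi|1238995578|dbj|GAX75978.1|"},
    "g3.t1": {"des": "hypothetical protein CEUSTIGMA_g3419.t1 [Chlamydomonas eustigma]",
              "name": "gi|1238995576|dbj|GAX75976.1|"},
}
swiss = {
    "g3.t1": {"des": "DNA mismatch repair protein MSH6 OS=Arabidopsis thaliana OX=3702 GN=MSH6 PE=1 SV=2",
              "name": "sp|O04716|MSH6_ARATH"},
}

merged = {gene: pick_no_bad_name(swiss.get(gene), ncbi.get(gene)) for gene in ("g1.t1", "g3.t1")}
for gene, hit in merged.items():
    print(gene, hit["name"])
```

For g1.t1 only a "hypothetical protein" hit is available, so it is kept, which matches row 1 of the table; for g3.t1 the Swiss-Prot description replaces the hypothetical NCBI one, as in row 3.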
| REAGENT or RESOURCE | SOURCE | IDENTIFIER |
|---|---|---|
| ( | Genbank: GCA_016618255 | |
| BLAST v2.2.26 | ( | |
| UniProtKB/Swiss-Prot | ( | |
| UniProtKB/TrEMBL | ( | |
| NCBI-NR | ( | |
| InterProScan v4.7 | ( | |
| BlastKOALA or GhostKOALA | ( | |
| NoBadWordsCombiner | This article | |
| Python 3 | N/A | |
| Django v3.1.5 | N/A | |
| pandas v1.2.2 | N/A | |
| blastxml_to_tabular.py | ( | |