David Toomey, Heinrich C Hoppe, Marian P Brennan, Kevin B Nolan, Anthony J Chubb.
Abstract
BACKGROUND: Genome sequencing and bioinformatics have provided the full hypothetical proteome of many pathogenic organisms. Advances in microarray and mass spectrometry have also yielded large output datasets of possible target proteins/genes. However, the challenge remains to identify new targets for drug discovery from this wealth of information. Further analysis includes bioinformatics and/or molecular biology tools to validate the findings. This is time-consuming and expensive, and could fail to yield novel drugs if protein purification and crystallography prove impossible. To pre-empt this, a researcher may want to rapidly filter the output datasets for proteins that show good homology to proteins that have already been structurally characterised or that are already targets for known drugs. Critically, researchers developing novel antibiotics need to exclude proteins that show close homology to any human protein, as future inhibitors are likely to cross-react with the host protein, causing off-target toxicity later in clinical trials. METHODOLOGY/PRINCIPAL
Year: 2009 PMID: 19593435 PMCID: PMC2704375 DOI: 10.1371/journal.pone.0006195
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Schema of data processing.
Genomes2Drugs is a free online resource. The web interface was written using open-source Java Enterprise Edition, BioJava 1.6 and NetBeans IDE 6.0. Input sequences are aligned against the human proteome, the PDB dataset and the DrugBank target proteins dataset. For each query, only the best alignment against each dataset (the one with the lowest BLASTp E value) is preserved. The resulting output files are parsed using BioJava and entered into a MySQL 5.1 database, where the results are sorted and ranked. Output XML files are generated from this data.
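The "best result" filter can be illustrated with a minimal sketch. It assumes, following the column key, that "best" means the lowest BLASTp E value per query. The `Hit` record, the `best_hits` function and the example identifiers are hypothetical and are not taken from the Genomes2Drugs code base, which is written in Java.

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """One query/target alignment (hypothetical record layout)."""
    query_id: str
    target_id: str
    expect: float  # BLASTp E value; lower means a better match


def best_hits(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep, for each query, only the alignment with the lowest E value."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.query_id)
        if current is None or hit.expect < current.expect:
            best[hit.query_id] = hit
    return best


# Hypothetical alignments of two query proteins against a single dataset.
example_hits = [
    Hit("q1", "pdb_A", 1e-50),
    Hit("q1", "pdb_B", 1e-120),
    Hit("q2", "pdb_C", 0.5),
]
best = best_hits(example_hits)
```

Under these assumptions the same filter would be applied three times, once per dataset (human, PDB, DrugBank), before the ratios are calculated.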
Key for output file column headings.
| Column title | Explanation |
| --- | --- |
| query_id | Unique query entry number. |
| query_accession | First word of input protein title. |
| query_title | Input protein title after ‘>’. |
| query_length | Number of residues in input sequence. |
| RhuDB | Logarithm (base 10) of the ratio of 〈human_expect〉 to 〈drugbank_expect〉. |
| RhuDBRank | Entries ranked by descending RhuDB. |
| RhuPDB | Logarithm (base 10) of the ratio of 〈human_expect〉 to 〈pdb_expect〉. |
| RhuPDBRank | Entries ranked by descending RhuPDB. |
| RDBPDB | Logarithm (base 10) of the ratio of 〈drugbank_expect〉 to 〈pdb_expect〉. |
| RDBPDBRank | Entries ranked by descending RDBPDB. |
| human_accession | First word of human protein title. |
| human_title | Extracted from target sequence name in BLASTp output. |
| human_expect | Only optimal human/query alignment is returned, i.e. lowest BLASTp E value. |
| human_rank | Query vs Human genome alignments are ranked by descending 〈human_expect〉. I.e. poor/no match to the human genome is scored well and given a low rank number. |
| human_identities | Number of identical residues in query and human sequences. |
| human_percent_identities | (〈human_identities〉/〈query_length〉)*100. |
| human_positives | Number of homologous residues in query and human sequences. |
| human_percent_positives | (〈human_positives〉/〈query_length〉)*100. |
| pdb_accession | Protein Data Bank accession number: pdb¦xxxx¦x |
| pdb_title | Name of protein 3-D structure. |
| pdb_expect | Only optimal PDB/query alignment is returned, i.e. lowest BLASTp E value. |
| pdb_rank | Query vs Protein Data Bank sequence alignments are ranked by ascending 〈pdb_expect〉. I.e. excellent matches with very low E values are scored well and given a low rank number. |
| pdb_identities | Number of identical residues in query and PDB sequences. |
| pdb_percent_identities | (〈pdb_identities〉/〈query_length〉)*100. |
| pdb_positives | Number of homologous residues in query and PDB sequences. |
| pdb_percent_positives | (〈pdb_positives〉/〈query_length〉)*100. |
| drugbank_accession | DrugBank accession number of target protein: nnnn_all_target_protein.fasta. |
| drugbank_title | Name of DrugBank target protein, including target drug accession numbers in parentheses: (DBnnnnn). |
| drugbank_expect | Only optimal DrugBank/query alignment is returned, i.e. lowest BLASTp E value. |
| drugbank_rank | Query vs DrugBank sequence alignments are ranked by ascending 〈drugbank_expect〉. I.e. excellent matches with very low E values are scored well and given a low rank number. |
| drugbank_identities | Number of identical residues in query and DrugBank sequences. |
| drugbank_percent_identities | (〈drugbank_identities〉/〈query_length〉)*100. |
| drugbank_positives | Number of homologous residues in query and DrugBank sequences. |
| drugbank_percent_positives | (〈drugbank_positives〉/〈query_length〉)*100. |
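The derived columns in this key can be sketched in a few lines. This is an illustration under stated assumptions, not the tool's implementation. The function names are hypothetical. The `floor` parameter is an assumption added to avoid taking the logarithm of an E value reported as 0; its default of 1e-180 is chosen only because, together with the null human E value of 1000, it reproduces the stated ratio range of −183 to 183. The error codes for missing alignments are not modelled here.

```python
import math
from typing import List, Sequence


def log_ratio(numerator_expect: float, denominator_expect: float,
              floor: float = 1e-180) -> float:
    """log10 of the ratio of two E values, e.g. human_expect / drugbank_expect.

    `floor` is an assumed lower bound that prevents log10(0).
    """
    numerator = max(numerator_expect, floor)
    denominator = max(denominator_expect, floor)
    return math.log10(numerator / denominator)


def percent(count: int, query_length: int) -> float:
    """(count / query_length) * 100, as used for the percent columns."""
    return count / query_length * 100


def rank_descending(values: Sequence[float]) -> List[int]:
    """1-based ranks, with rank 1 given to the largest value.

    Ties keep their input order (an assumption; the source does not
    say how ties are resolved).
    """
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Hypothetical query: no human match (E = 1000) and a strong DrugBank match.
rhudb = log_ratio(1000.0, 1e-100)               # approximately 103
ranks = rank_descending([rhudb, -5.0, 20.0])    # [1, 3, 2]
```

A query with no human homologue and a strong DrugBank or PDB match gets a large positive ratio and therefore a low rank number, which is the behaviour the ranking columns describe.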
Definition of ratio ranges and error codes.
| | RhuDB | RhuPDB | RDBPDB |
| --- | --- | --- | --- |
| EBLASTp[hum]ψ vs. EBLASTp[DB/PDB] | −183 to 183 | −183 to 183 | −7000 |
| EBLASTp[hum] | −2000 | −5000 | −8000 |
| ‘Null’ DB/PDB | −3000 | −6000 | −9000 |
BLASTp expect value of the best query/human genome alignment (null = 1000).
BLASTp expect value of the best query/DrugBank alignment or query/protein data bank alignment (not null).
No alignment found between query and either DrugBank or PDB databases (null).
Figure 2. Enrichment of P. falciparum proteome by RhuPDB – PDB targets.
Enrichment curves plot the accumulation of user-defined ‘hits’ as a function of rank number. In an ideal case (red line), each consecutive entry in the ascending ranked list is a hit. Alternatively, if ranking provides no selection, the hits are distributed randomly across the genome (light blue line). The enrichment percentage as a function of rank is shown in dark blue. The 5283 proteins in the P. falciparum 3D7 strain test set were searched using Genomes2Drugs and ranked by RhuPDB. P. falciparum and malaria-related hits from PDB were identified by keyword searching of the 〈pdb_title〉 field, and their positions in the ranked list were recorded. The inset, which highlights the first 500 entries, shows that almost 80% of the entries with close homology to known P. falciparum crystal structures were identified in the first 10% of the genome.
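The three curves described in this caption can be sketched as follows. This is a minimal illustration that assumes the y-axis is the percentage of all hits recovered up to a given rank; the caption does not state the exact normalisation, and the function names and example flags are hypothetical.

```python
from typing import List, Sequence


def enrichment_curve(is_hit: Sequence[bool]) -> List[float]:
    """Cumulative percentage of all hits found at each rank.

    `is_hit` lists the entries in rank order (rank 1 first).
    """
    total = sum(is_hit)
    if total == 0:
        raise ValueError("the ranked list contains no hits")
    found = 0
    curve = []
    for flag in is_hit:
        found += flag
        curve.append(100.0 * found / total)
    return curve


def ideal_curve(n_entries: int, n_hits: int) -> List[float]:
    """Ideal case: every top-ranked entry is a hit until all are found."""
    return [100.0 * min(rank, n_hits) / n_hits
            for rank in range(1, n_entries + 1)]


def random_curve(n_entries: int) -> List[float]:
    """No selection: hits accumulate in proportion to rank."""
    return [100.0 * rank / n_entries for rank in range(1, n_entries + 1)]


# Hypothetical ranked list of 10 entries containing 3 hits.
flags = [True, True, False, True, False, False, False, False, False, False]
observed = enrichment_curve(flags)
ideal = ideal_curve(len(flags), sum(flags))
random_line = random_curve(len(flags))
```

In this toy list all three hits are recovered by rank 4, so the observed curve lies close to the ideal line and well above the random one.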
Figure 3. Enrichment of P. falciparum proteome by RhuDB – PDB targets.
Enrichment curves were plotted as described in Figure 2. The 5283-protein malarial proteome was ranked by RhuDB. P. falciparum and malaria-related hits from PDB were identified by keyword searching of the 〈pdb_title〉 field. The enrichment percentage as a function of rank is shown in dark blue, while the red line shows an ideal case and the light blue line indicates a random distribution. The inset highlights the first 500 entries.
Figure 4. Enrichment of P. falciparum proteome by RhuDB – DrugBank targets.
Enrichment curves were plotted as described in Figure 2. The 5283-protein malarial proteome was ranked by RhuDB. P. falciparum and malaria-related hits from DrugBank were identified by keyword searching of the DrugBank website [4], as shown in supplementary Table S2 online. The 〈drugbank_title〉 field entries were matched to this list of P. falciparum or malaria-related drug targets. The enrichment percentage as a function of rank is shown in dark blue, while the red line shows an ideal case and the light blue line indicates a random distribution. The inset highlights the first 500 entries.