Piotr Deszyński1, Jakub Młokosiewicz1, Adam Volanakis2, Igor Jaszczyszyn1, Natalie Castellana3, Stefano Bonissone3, Rajkumar Ganesan4, Konrad Krawczyk1.
Abstract
Nanobodies, a subclass of antibodies found in camelids, are versatile molecular binding scaffolds composed of a single polypeptide chain. The small size of nanobodies bestows multiple therapeutic advantages (stability, tumor penetration), with the first therapeutic approval in 2018 cementing the clinical viability of this format. Structured data and sequence information of nanobodies will enable the accelerated clinical development of nanobody-based therapeutics. Though nanobody sequence and structure data are deposited in the public domain at an accelerating pace, the heterogeneity of sources and lack of standardization hamper reliable harvesting of nanobody information. We address this issue by creating the Integrated Database of Nanobodies for Immunoinformatics (INDI, http://naturalantibody.com/nanobodies). INDI collates nanobodies from all the major public outlets of biological sequences: patents, GenBank, next-generation sequencing repositories, structures and scientific publications. We equip INDI with powerful nanobody-specific sequence and text search, facilitating access to >11 million nanobody sequences. INDI should facilitate development of novel nanobody-specific computational protocols, helping to deliver on the therapeutic promise of this drug format.
Year: 2022 PMID: 34747487 PMCID: PMC8728276 DOI: 10.1093/nar/gkab1021
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Contents of INDI in May 2021
| Source | Unique sequences | Unique accessions | Main source | Metadata |
|---|---|---|---|---|
| Structures | 535 | 804 PDB codes | Protein Data Bank | PDB code, PDB title, authors, resolution technique, text headers of chain fasta files |
| NCBI GenBank | 1858 | 2070 GenBank ids | GenBank | GenBank ID, GenBank description/definition, reported organism, date, reference title, reference link/pmid, reference journal, reference authors |
| Next-generation sequencing | 11 228 600 | Seven BioProject ids | Sequence Read Archive | BioProject id, SRA id |
| Patents | 14 376 | 687 patent families | Patented Antibody Database | Patent number, applicants, patent title, patent abstract |
| Manual | 1268 | 109 papers | Scientific publications | Publication title, publication abstract, publication link |
Data in INDI are divided into five distinct sources. For each source we provide the reference to the online resource from which we obtained the data (with the exception of scientific publications), the metadata associated with accessions in that source, and the August 2021 statistics of the number of nanobodies we extracted.
Figure 1. Data sources and information organization in INDI. We obtain nanobody data from five distinct sources: structures, GenBank, patents, scientific publications and NGS. Structures, GenBank and patents are suitable for automated identification, which is divided into identification of antibody sequences and subsequent text-based filtering of nanobody sequences. Scientific publications and NGS are not suitable for automated identification and require ad hoc manual curation. Data from all sources are standardized into sequence and metadata indexes. The INDI web utility enables users to query the nanobody sequence and metadata indexes spanning all five repositories.
The most common hallmark residue motifs across five INDI datasets and sdab-DB. We calculated the statistics of combinations of amino acids at IMGT positions 42, 49, 50 and 52 in all INDI datasets as well as sdab-DB.
| Manual (126 motifs) | | Structures (83 motifs) | | Patents (452 motifs) | | NGS (12,307 motifs) | | GenBank (206 motifs) | | sdab-DB (152 motifs) | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Motif | % total | Motif | % total | Motif | % total | Motif | % total | Motif | % total | Motif | % total |
| FERF | 24% | FERF | 27% | FERF | 30% | FERG | 30% | FERF | 20% | FERF | 33% |
| FERG | 16% | FERG | 15% | YQRL | 15% | VGLW | 10% | FERG | 17% | FERG | 14% |
| YERW | 10% | YQRL | 9% | FERG | 11% | YQRL | 9% | VGLW | 13% | YQRL | 8% |
| VGLW | 5% | VGLW | 6% | VGLW | 4% | FERF | 9% | YQRL | 7% | VGLW | 8% |
| YQRL | 5% | YERL | 4% | FERL | 2% | FERA | 2% | YERL | 3% | YERL | 4% |
| YERL | 2% | YERW | 3% | YKRL | 1% | YERL | 2% | YERF | 2% | FERA | 2% |
| YERF | 1% | FERW | 1% | YERL | 1% | FGRG | 1% | IGLW | 1% | YERW | 1% |
| YQRW | 1% | FERA | 1% | FGRF | 1% | FERE | 1% | FERA | 1% | YERF | 1% |
| FERA | 1% | YERG | 1% | YQRF | 1% | FARG | 1% | FERL | 1% | FERL | 1% |
| WQRL | 1% | YERF | 1% | IGLW | 1% | FERR | 1% | FERV | 1% | YQRW | 1% |
The percentage of each motif is given with respect to the total number of sequences in any given source.
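The motif statistics in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: it assumes each sequence has already been IMGT-numbered by some external tool and is available as a dictionary mapping position labels to residues. The names `hallmark_motif`, `motif_percentages` and the toy input are hypothetical.

```python
from collections import Counter

# IMGT positions of the four hallmark residues named in the table caption.
HALLMARK_POSITIONS = ("42", "49", "50", "52")


def hallmark_motif(numbered):
    """Return the four-letter motif, or None if any hallmark position is missing or gapped."""
    residues = [numbered.get(pos, "-") for pos in HALLMARK_POSITIONS]
    if "-" in residues:
        return None
    return "".join(residues)


def motif_percentages(numbered_sequences):
    """Percentage of each motif relative to the total number of sequences in the source."""
    total = len(numbered_sequences)
    counts = Counter(
        motif
        for motif in (hallmark_motif(seq) for seq in numbered_sequences)
        if motif is not None
    )
    return {motif: 100.0 * n / total for motif, n in counts.most_common()}


# Toy input (assumption): only the hallmark positions are filled in.
toy = [
    {"42": "F", "49": "E", "50": "R", "52": "F"},
    {"42": "F", "49": "E", "50": "R", "52": "F"},
    {"42": "F", "49": "E", "50": "R", "52": "G"},
    {"42": "V", "49": "G", "50": "L", "52": "W"},
]
print(motif_percentages(toy))
```

Dividing by the number of sequences in the source, rather than by the number of motifs found, mirrors the note that percentages are given with respect to the total number of sequences in each source.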
The most common hallmark residue motifs across the seven NGS datasets in INDI
| PRJDB7792 (869) | | PRJNA638614 (6061) | | PRJNA321369 (5722) | | PRJEB7678 (4525) | | PRJNA642677 (5478) | | PRJEB25673 (1423) | | PRJDB2382 (1984) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Motif | %Total | Motif | %Total | Motif | %Total | Motif | %Total | Motif | %Total | Motif | %Total | Motif | %Total |
| FERG | 46% | FERG | 37% | FERG | 32% | YQRL | 26% | FERG | 33% | FERG | 44% | VALW | 27% |
| FERA | 8% | FERA | 5% | VGLW | 21% | FERG | 17% | FERF | 13% | VGLW | 14% | YQRL | 16% |
| FKRG | 5% | VGLW | 5% | FERA | 3% | FERF | 15% | VGLW | 8% | YERW | 6% | YERL | 13% |
| YQRL | 4% | FERF | 4% | YECL | 1% | VGLW | 9% | YQRL | 8% | FERA | 5% | VGLW | 3% |
| FGRG | 2% | YERL | 4% | FGLW | 1% | YERL | 2% | FERA | 2% | FERF | 3% | FERG | 2% |
| FQRG | 2% | FARG | 3% | FGRG | 1% | FQRL | 2% | YERL | 1% | AGLW | 2% | FERF | 1% |
| SERG | 1% | FERE | 2% | FERR | 1% | VGPW | 1% | FGRG | 1% | FERE | 2% | YQRM | 1% |
| YERG | 1% | FERW | 1% | FARG | 1% | FERL | 1% | FERR | 1% | FKRG | 1% | YQRV | 1% |
| VERG | 1% | FGRG | 1% | FERE | 1% | FERW | 1% | FERE | 1% | VERG | 1% | YQRW | 1% |
| FDRG | 1% | VGPW | 1% | FQRG | 1% | YERG | 1% | FERV | 0% | VGPW | 1% | YQRF | 1% |
We calculated the statistics of combination of amino acids in IMGT positions 42, 49, 50 and 52 in the seven INDI NGS bioprojects. The percentage of each motif is given with respect to the total number of sequences in any given bioproject.
Number of clusters and sequences falling within the same clusters at 70% sequence identity
| #Sources in clusters | Clusters | Sequences |
|---|---|---|
| 1 | 73 479 | 3 702 072 |
| 2 | 1245 | 1 275 004 |
| 3 | 410 | 914 843 |
| 4 | 161 | 1 050 308 |
| 5 | 89 | 4 303 563 |
For each cluster we noted the number of sources that contributed sequences, with the maximum number being five (NGS, patents, GenBank, structures and manual).
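The tally in the table above amounts to grouping clusters by how many distinct sources they contain. The sketch below shows one way to compute it, assuming the clustering result is available as (cluster id, source) pairs, one per sequence. That input layout and the function name `clusters_by_source_count` are assumptions made for illustration.

```python
from collections import defaultdict


def clusters_by_source_count(assignments):
    """Summarise clusters by the number of distinct sources they contain.

    assignments: iterable of (cluster_id, source) pairs, one per sequence.
    Returns {n_sources: (n_clusters, n_sequences)}.
    """
    sources = defaultdict(set)  # cluster id -> set of contributing sources
    sizes = defaultdict(int)    # cluster id -> number of sequences
    for cluster_id, source in assignments:
        sources[cluster_id].add(source)
        sizes[cluster_id] += 1

    summary = defaultdict(lambda: [0, 0])
    for cluster_id, cluster_sources in sources.items():
        row = summary[len(cluster_sources)]
        row[0] += 1                   # one more cluster with this many sources
        row[1] += sizes[cluster_id]   # add its sequences to the running total
    return {n: tuple(row) for n, row in sorted(summary.items())}


# Toy assignments (assumption): three clusters drawn from four sources.
toy_assignments = [
    (0, "NGS"), (0, "NGS"), (0, "patents"),
    (1, "GenBank"),
    (2, "NGS"), (2, "structures"), (2, "manual"),
]
print(clusters_by_source_count(toy_assignments))
```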
Clustering analysis of non-NGS components of INDI
| | | Manual | | Structure | | GenBank | | Patent | |
|---|---|---|---|---|---|---|---|---|---|
| CD-HIT sequence identity cutoff | Database | %Row sequences | %Column sequences | %Row sequences | %Column sequences | %Row sequences | %Column sequences | %Row sequences | %Column sequences |
| 70 | Manual | - | - | 79 | 89 | 86 | 77 | 94 | 83 |
| 70 | Structure | 89 | 79 | - | - | 93 | 71 | 95 | 76 |
| 70 | GenBank | 77 | 86 | 71 | 93 | - | - | 93 | 85 |
| 70 | Patent | 83 | 94 | 76 | 95 | 85 | 93 | - | - |
| 80 | Manual | - | - | 44 | 49 | 34 | 29 | 48 | 33 |
| 80 | Structure | 49 | 44 | - | - | 53 | 28 | 69 | 34 |
| 80 | GenBank | 29 | 34 | 28 | 53 | - | - | 69 | 40 |
| 80 | Patent | 33 | 48 | 34 | 69 | 40 | 69 | - | - |
| 90 | Manual | - | - | 11 | 9 | 6 | 4 | 14 | 1 |
| 90 | Structure | 9 | 11 | - | - | 16 | 4 | 32 | 4 |
| 90 | GenBank | 4 | 6 | 4 | 16 | - | - | 49 | 8 |
| 90 | Patent | 1 | 14 | 4 | 32 | 8 | 49 | - | - |
| 99 | Manual | - | - | 2 | 5 | 4 | 2 | 8 | 0 |
| 99 | Structure | 5 | 2 | - | - | 11 | 3 | 27 | 1 |
| 99 | GenBank | 2 | 4 | 3 | 11 | - | - | 38 | 6 |
| 99 | Patent | 0 | 8 | 1 | 27 | 6 | 38 | - | - |
We clustered nanobody sequences from manual curation, structures, patents and GenBank using CD-HIT at 70%, 80%, 90% and 99% sequence identity. For each clustering cutoff we indicate the percentage of sequences from any given source that were clustered together with any sequences from another source. For instance, at clustering cutoff 80%, 49% of sequences from structures cluster with manually curated sequences.
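The "%Row sequences" figures can be derived from CD-HIT's `.clstr` output as sketched below. This is an illustration under stated assumptions rather than the authors' code: it assumes each sequence identifier was prefixed with its source (for example `structure|s1`) before clustering, and the helper names `parse_clstr`, `percent_shared` and `source_of` are hypothetical.

```python
import re

# In a .clstr file each member line looks like "0<TAB>120aa, >some_id... *".
ID_PATTERN = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Parse CD-HIT .clstr text into a list of clusters (lists of sequence ids)."""
    clusters = []
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            clusters.append([])
        else:
            match = ID_PATTERN.search(line)
            if match:
                clusters[-1].append(match.group(1))
    return clusters


def source_of(seq_id):
    """Assumed id convention: '<source>|<accession>'."""
    return seq_id.split("|")[0]


def percent_shared(clusters, row_source, column_source):
    """Percentage of row_source sequences that share a cluster with any column_source sequence."""
    total = shared = 0
    for members in clusters:
        member_sources = [source_of(m) for m in members]
        n_row = member_sources.count(row_source)
        total += n_row
        if column_source in member_sources:
            shared += n_row
    return 100.0 * shared / total if total else 0.0


# Toy .clstr content (assumption), written in the layout CD-HIT emits.
CLSTR = """>Cluster 0
0\t120aa, >structure|s1... *
1\t118aa, >manual|m1... at 85.00%
>Cluster 1
0\t121aa, >structure|s2... *
>Cluster 2
0\t119aa, >manual|m2... *
1\t119aa, >patent|p1... at 91.00%
"""
toy_clusters = parse_clstr(CLSTR)
print(percent_shared(toy_clusters, "structure", "manual"))
```

Swapping the two source arguments gives the corresponding "%Column sequences" value for the same table cell.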
Figure 2. Distribution of IMGT CDR-H3 lengths in INDI. We extracted unique IMGT-defined CDR-H3 sequences from each dataset in INDI and noted their lengths.
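A length distribution of this kind can be computed as in the sketch below. It relies on the IMGT convention that CDR3 spans positions 105 to 117, with insertions labelled by a letter suffix such as 111A. The input layout (a list of (position label, residue) pairs per sequence, in sequence order) and the function names are assumptions for illustration, not the format of any particular numbering tool.

```python
from collections import Counter

# IMGT CDR3 spans positions 105-117; insertions carry a letter suffix (e.g. "111A").
CDR3_START, CDR3_END = 105, 117


def cdr_h3(numbered):
    """Extract the CDR-H3 string from (position label, residue) pairs in sequence order."""
    residues = []
    for label, residue in numbered:
        position = int("".join(ch for ch in label if ch.isdigit()))
        if CDR3_START <= position <= CDR3_END and residue != "-":
            residues.append(residue)
    return "".join(residues)


def length_distribution(numbered_sequences):
    """Count lengths over unique CDR-H3 sequences, as described in the caption."""
    unique = {cdr_h3(seq) for seq in numbered_sequences}
    unique.discard("")  # ignore sequences with no CDR-H3 residues
    return dict(sorted(Counter(len(cdr) for cdr in unique).items()))


# Toy numbered sequences (assumption); seq_a appears twice to show de-duplication.
seq_a = [("104", "C"), ("105", "A"), ("106", "R"), ("107", "D"), ("111", "Y"),
         ("111A", "G"), ("112A", "S"), ("112", "W"), ("117", "Y"), ("118", "W")]
seq_c = [("105", "A"), ("106", "K"), ("107", "-"), ("116", "D"), ("117", "Y")]
print(length_distribution([seq_a, seq_a, seq_c]))
```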
Clustering contrast of INDI, sdAB-DB and iCAN
| CD-HIT sequence identity cutoff | Total clusters | Clusters with neither sdAB-DB nor iCAN sequences | #sequences from clusters with neither sdAB-DB nor iCAN sequences | Clusters with sequences only from sdAB-DB | Clusters with sequences only from iCAN |
|---|---|---|---|---|---|
| 70% | 784 | 444 | 1453 | 27 | 11 |
| 80% | 3903 | 2577 | 7493 | 45 | 57 |
| 90% | 7725 | 5528 | 12 356 | 59 | 125 |
| 99% | 13 285 | 10 041 | 13 456 | 73 | 251 |
We clustered 1452 sequences from sdAB-DB and 2391 sequences from iCAN together with three automatically obtainable subsets of INDI: patents, structures and GenBank. Manual sequences and NGS were left out of this comparison so as not to saturate the clustering with 11 million NGS sequences and to avoid manually curated sequences that cannot be obtained automatically. A total of 17 645 sequences were clustered together using CD-HIT. The columns give the total number of clusters, the number of clusters (and of sequences within them) containing neither sdAB-DB nor iCAN sequences, and the number of clusters containing only sdAB-DB or only iCAN sequences.