Cinzia Cantacessi, Aaron R Jex, Ross S Hall, Neil D Young, Bronwyn E Campbell, Anja Joachim, Matthew J Nolan, Sahar Abubucker, Paul W Sternberg, Shoba Ranganathan, Makedonka Mitreva, Robin B Gasser.
Abstract
Transcriptomics (at the level of single cells, tissues and/or whole organisms) underpins many fields of biomedical science, from understanding basic cellular function in model organisms, to the elucidation of the biological events that govern the development and progression of human diseases, and the exploration of the mechanisms of survival, drug-resistance and virulence of pathogens. Next-generation sequencing (NGS) technologies are contributing to a massive expansion of transcriptomics in all fields and are reducing the cost, time and performance barriers presented by conventional approaches. However, bioinformatic tools for the analysis of the sequence data sets produced by these technologies can be daunting to researchers with limited or no expertise in bioinformatics. Here, we constructed a semi-automated, bioinformatic workflow system, and critically evaluated it for the analysis and annotation of large-scale sequence data sets generated by NGS. We demonstrated its utility for the exploration of differences in the transcriptomes among various stages and both sexes of an economically important parasitic worm (Oesophagostomum dentatum) as well as the prediction and prioritization of essential molecules (including GTPases, protein kinases and phosphatases) as novel drug target candidates. This workflow system provides a practical tool for the assembly, annotation and analysis of NGS data sets, including for researchers with limited bioinformatic expertise. The custom-written Perl, Python and Unix shell computer scripts used can be readily modified or adapted to suit many different applications. This system is now utilized routinely for the analysis of data sets from pathogens of major socio-economic importance and can, in principle, be applied to transcriptomics data sets from any organism.
Year: 2010 PMID: 20682560 PMCID: PMC2943614 DOI: 10.1093/nar/gkq667
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Bioinformatic analyses of the Oesophagostomum dentatum data sets. Stars indicate analyses performed using custom-written Perl, Python and/or Unix shell computer scripts, accessible via http://research.vet.unimelb.edu.au/gasserlab/index.html. [1] Individual and combined expressed sequence tags (EST) data sets are assembled using CAP3 (compiled Linux 64-bit executable) to generate consensus sequences. [2] Assembled contigs with high similarity (cut-off: <1E-15) to nucleotide sequences of the vertebrate host (Sus scrofa) are eliminated. [3] Database similarity searches (for individual or combined data sets) are carried out using BLASTn and BLASTx (compiled Linux 64-bit executable; 42), embedded in custom-built Unix shell scripts. [4] Sequences (from the individually and combined assembled data sets) are conceptually translated into peptide sequences using ESTScan (compiled Linux 64-bit executable with a Perl wrapper). [5] Domains/motifs within translated peptides are identified via InterProScan (Perl wrapper) and linked to biological pathways in C. elegans using KOBAS (stand-alone Python application; 44). Functional annotation of the predicted peptides is performed by gene ontology (Perl wrapper; 27). [6] The individually assembled data sets are subtracted from one another (in both directions) using a BLASTn algorithm (42) embedded in a custom-built Unix shell script; proteins inferred from subtracted transcripts are assigned parental (i.e. level 1) InterPro terms and subtracted from one another using a BLASTp algorithm, embedded in a custom-built Unix shell script. [7] Potential drug target candidates for each of the individually assembled and/or in silico subtracted data sets are predicted and ranked according to the ‘severity’ of the non-wild-type RNAi phenotypes observed for the corresponding C. elegans orthologues/homologues (custom-built Unix shell scripts). [8] Probabilistic interaction networks among C. elegans orthologues of subtracted molecules are predicted (command lines).
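
Steps [2] and [6] of the workflow both reduce to the same operation: remove every sequence that has a sufficiently strong BLAST hit against another data set. The following is a minimal Python sketch of that idea, not the authors' scripts. It assumes 12-column, tab-separated BLAST tabular output with the query ID in column 1 and the E-value in column 11. The function names, the example sequence IDs and `SUBTRACTION_CUTOFF` are hypothetical; only the host cut-off of 1E-15 comes from the caption.

```python
from typing import Iterable, List, Set

HOST_EVALUE_CUTOFF = 1e-15   # cut-off quoted in step [2] of the caption
SUBTRACTION_CUTOFF = 1e-5    # hypothetical; the caption gives no value for step [6]


def ids_with_hit(tabular_lines: Iterable[str], max_evalue: float) -> Set[str]:
    """Return query IDs that have at least one hit with an E-value below max_evalue.

    Assumes tab-separated BLAST tabular output: query ID in column 1,
    E-value in column 11. Blank lines and '#' comment lines are skipped.
    """
    matched: Set[str] = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if float(fields[10]) < max_evalue:
            matched.add(fields[0])
    return matched


def subtract(all_ids: Iterable[str], ids_to_remove: Set[str]) -> List[str]:
    """Keep the IDs (in their original order) that are not in ids_to_remove."""
    return [seq_id for seq_id in all_ids if seq_id not in ids_to_remove]


# Step [2]: drop contigs that look like host (Sus scrofa) sequence.
host_hits = [
    "Contig1\tSus_scrofa_seqA\t98.5\t400\t6\t0\t1\t400\t10\t409\t1e-180\t700",
    "Contig2\tSus_scrofa_seqB\t85.0\t60\t9\t0\t1\t60\t5\t64\t2e-05\t40",
]
contigs = ["Contig1", "Contig2", "Contig3"]
kept = subtract(contigs, ids_with_hit(host_hits, HOST_EVALUE_CUTOFF))

# Step [6]: subtraction in both directions between two assembled data sets.
female_vs_male = ["F1\tM1\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-150\t550"]
male_vs_female = ["M1\tF1\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-150\t550"]
female_unique = subtract(["F1", "F2"], ids_with_hit(female_vs_male, SUBTRACTION_CUTOFF))
male_unique = subtract(["M1", "M2", "M3"], ids_with_hit(male_vs_female, SUBTRACTION_CUTOFF))
```

In this toy input, `Contig2` is kept because its only host hit (2e-05) is weaker than the 1E-15 cut-off, and each direction of the subtraction keeps only the sequences without a counterpart in the other set.
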
Summary of the nucleotide sequence data for the adult female, adult male, and third (L3) and fourth (L4) larval stages of Oesophagostomum dentatum prior to and following in silico subtraction as well as detailed bioinformatic annotation and analyses
| | Female | Male | L3 | L4 | Combined |
|---|---|---|---|---|---|
| Number of unassembled ESTs | 336 131 | 490 645 | 503 566 | 496 025 | 1 826 367 |
| Contigs (average length ± SD) | 23 807 (483 ± 290) | 29 043 (484 ± 289) | 30 176 (465 ± 281) | 26 349 (498 ± 308) | 36 233 (516 ± 316) |
| Singletons | 23 303 (233 ± 50) | 37 248 (243 ± 45) | 49 341 (227 ± 57) | 36 875 (242 ± 40) | 452 528 (244 ± 37) |
| Total | 47 110 | 66 291 | 79 517 | 63 224 | 488 761 |
| Containing an open reading frame (%) | 38 504 (81.7) | 52 787 ( | 57 818 ( | 50 533 ( | 85 395 (17.5) |
| Returning InterProScan results (%) | 20 229 ( | 26 496 ( | 27 297 (47.2) | 26 121 (51.7) | 56 940 (66.7) |
| Gene ontology (%) | 9970 (25.9) | 12 386 (23.5) | 12 763 (22.1) | 12 735 (25.2) | 25 216 ( |
| Biological process | 17 031 | 19 510 | 19 705 | 19 645 | 19 346 |
| Cellular component | 8864 | 10 091 | 10 926 | 10 649 | 11 007 |
| Molecular function | 30 482 | 35 934 | 34 904 | 35 241 | 35 182 |
| With orthologues in C. elegans (%) | 23 485 ( | 28 643 (43.2) | 32 904 (41.4) | 30 000 (47.4) | |
| Other parasitic nematodes (%) | 17 533 (37.2) | 21 553 (32.5) | 23 748 (29.9) | 38 634 ( | |
| Other organisms (%) | 12 011 (25.5) | 13 843 ( | 14 731 (18.5) | 14 332 (22.7) | |
| KOBAS (number of biological pathways predicted) | 256 | 254 | 249 | 255 | |
| Following in silico subtraction | | | | | |
| Number of ESTs (contigs + singletons) | 3451 (671 + 2780) | 10 344 (2902 + 7442) | 14 380 (2752 + 11 628) | 7520 (1280 + 6240) | |
| Containing an open reading frame (%) | 2397 ( | 7117 ( | 7222 (50.2) | 4789 (63.7) | |
| Returning InterProScan results (%) | 521 (21.7) | 1179 (16.6) | 1224 ( | 989 (20.7) | |
| Gene ontology (%) | 376 (15.7) | 840 (11.8) | 760 (10.5) | 652 (13.6) | |
| | 314 | 625 | 684 | 527 | |
| | 177 | 355 | 412 | 359 | |
| | 563 | 1259 | 1073 | 948 | |
| With homologues in C. elegans (%) | 824 (23.9) | 1834 (17.7) | 2252 (15.6) | 1589 (21.1) | |
| Other parasitic nematodes (%) | 558 (16.1) | 1212 (11.7) | 1384 (9.6) | 1052 ( | |
| Other organisms (%) | 159 (4.6) | 123 (1.2) | 176 (1.2) | 137 (1.8) | |
| KOBAS (number of biological pathways predicted) | 7 | 16 | 18 | 23 | |
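
The source does not state which denominator each percentage in this table uses. The intact cells are consistent with the following reading, which is an inference from the numbers rather than a statement by the authors: totals are contigs plus singletons, open-reading-frame percentages are relative to that total, and InterProScan and gene ontology percentages are relative to the number of sequences containing an open reading frame. A short Python sketch of this arithmetic, using only counts copied from intact cells (the helper name `percent` is hypothetical):

```python
def percent(count: int, base: int) -> float:
    """Percentage of count relative to base, rounded to one decimal place."""
    return round(100.0 * count / base, 1)


# Counts copied from the table (thousands separators removed).
female = {"contigs": 23807, "singletons": 23303, "total": 47110, "orf": 38504, "go": 9970}
l3 = {"orf": 57818, "interpro": 27297}
combined = {"contigs": 36233, "singletons": 452528, "total": 488761,
            "orf": 85395, "interpro": 56940}

# Totals are the sum of contigs and singletons.
female_total_ok = female["contigs"] + female["singletons"] == female["total"]
combined_total_ok = combined["contigs"] + combined["singletons"] == combined["total"]

# ORF percentages are relative to the total number of sequences.
orf_pct_female = percent(female["orf"], female["total"])            # 81.7
orf_pct_combined = percent(combined["orf"], combined["total"])      # 17.5

# InterProScan and GO percentages are relative to ORF-containing sequences.
interpro_pct_l3 = percent(l3["interpro"], l3["orf"])                # 47.2
interpro_pct_combined = percent(combined["interpro"], combined["orf"])  # 66.7
go_pct_female = percent(female["go"], female["orf"])                # 25.9
```

Each computed value matches the percentage printed in the corresponding cell, which supports the inferred denominators for the cells checked here.
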
The 20 most represented (InterPro) protein domains inferred from peptides conceptually translated from individual contigs for Oesophagostomum dentatum [combined assembly of data for adult female, adult male, and the third (L3) and fourth (L4) larval stages] and InterPro protein domains (level 1) assigned to predicted peptides unique to each stage or sex following in silico subtraction
| InterPro description | InterPro code | Number of predicted peptides (%) |
|---|---|---|
| SCP-like extracellular | IPR014044 | 377 (1.2) |
| NAD(P)-binding domain | IPR016040 | 365 (1.1) |
| Proteinase inhibitor I2, Kunitz metazoa | IPR002223 | 339 ( |
| Zinc finger, LIM-type | IPR001781 | 332 ( |
| WD40 repeat | IPR001680 | 312 (0.9) |
| Ankyrin | IPR002110 | 257 (0.8) |
| EF-HAND 2 | IPR018249 | 247 (0.7) |
| WD40 repeat, subgroup | IPR019781 | 242 (0.7) |
| Allergen V5/Tpx-1 related | IPR001283 | 236 (0.7) |
| Protein kinase-like | IPR011009 | 220 (0.6) |
| RNA recognition motif, RNP-1 | IPR000504 | 216 (0.6) |
| WD40 repeat 2 | IPR019782 | 215 (0.6) |
| Protease inhibitor I4, serpin | IPR000215 | 207 (0.6) |
| Src homology-3 domain | IPR001452 | 201 (0.6) |
| Peptidase C1A, papain C-terminal | IPR000668 | 194 (0.6) |
| C-type lectin | IPR001304 | 183 (0.5) |
| Kelch repeat type 1 | IPR006652 | 183 (0.5) |
| Annexin repeat | IPR018502 | 183 (0.5) |
| Protein kinase, core | IPR000719 | 172 (0.5) |
| EF-HAND 1 | IPR018247 | 168 (0.5) |
| Chitin binding protein, peritrophin-A | IPR002557 | 18 (8.6) |
| Basic-leucine zipper (bZIP) transcription factor | IPR004827 | 10 (4.8) |
| DNA primase, small subunit | IPR002755 | 6 (2.9) |
| p53-like transcription factor, DNA-binding | IPR008967 | 5 (2.4) |
| DNA-binding HORMA | IPR003511 | 4 ( |
| Acyl-CoA dehydrogenase/oxidase | IPR013786 | 3 (1.4) |
| Frizzled-like domain | IPR020067 | 3 (1.4) |
| Lipid transport protein | IPR001747 | 3 (1.4) |
| PreATP-grasp-like fold | IPR016185 | 3 (1.4) |
| UbiA prenyltransferase | IPR000537 | 3 (1.4) |
| PapD-like | IPR008962 | 16 ( |
| Major sperm protein | IPR000535 | 15 (3.7) |
| C-type lectin | IPR018378 | 6 (1.5) |
| Phosphoenolpyruvate carboxykinase | IPR008209 | 6 (1.5) |
| Protein of unknown function DUF236 | IPR004296 | 6 (1.5) |
| Scramblase | IPR005552 | 6 (1.5) |
| ClpX, ATPase regulatory subunit | IPR004487 | 5 (1.3) |
| Galactose oxidase/kelch | IPR011043 | 5 (1.3) |
| Ribosomal protein S2 | IPR001865 | 5 (1.3) |
| Amidinotransferase | IPR003198 | 4 ( |
| RmlC-like jelly roll fold | IPR014710 | 17 (4.5) |
| Six-bladed beta-propeller, TolB-like | IPR011042 | 10 (2.7) |
| Protein of unknown function DUF590 | IPR007632 | 9 (2.4) |
| 7TM GPCR, serpentine receptor class r (Str), Nematode | IPR019428 | 8 (2.1) |
| Acyltransferase ChoActase/COT/CPT | IPR000542 | 7 (1.9) |
| Putative DNA binding | IPR009061 | 7 (1.9) |
| 7TM GPCR, serpentine receptor class e (Sre), Nematode | IPR004151 | 6 (1.6) |
| Nuclear hormone receptor, ligand-binding, core | IPR000536 | 6 (1.6) |
| Coenzyme A transferase | IPR004165 | 5 (1.3) |
| Ion transport | IPR005821 | 5 (1.3) |
| Peptidase M24, methionine aminopeptidase | IPR001714 | 7 (2.2) |
| FAD-binding, type 2 | IPR016166 | 4 (1.3) |
| Oxysterol-binding protein | IPR000648 | 4 (1.3) |
| Translation protein SH3-like | IPR008991 | 4 (1.3) |
| Tubulin/FtsZ, GTPase domain | IPR003008 | 4 (1.3) |
| 6-phosphogluconate dehydrogenase | IPR008927 | 3 ( |
| Peptidase C13, legumain | IPR001096 | 3 ( |
| Aminoacyl-tRNA synthetase | IPR015413 | 3 ( |
| Adenosylcobalamin biosynthesis, ATP | IPR016030 | 3 ( |
| Aspartate/other aminotransferase | IPR000796 | 2 (0.6) |
ᵃ Number of unique InterPro domains assigned to predicted peptides in each data set.
Functions predicted for proteins encoded in the transcriptome of Oesophagostomum dentatum (combined assembly), based on gene ontology (GO)
| GO description (GO code) | Number of predicted peptides (%) |
|---|---|
| Metabolic process (GO:0008152) | 2102 (10.9) |
| Proteolysis (GO:0006508) | 1361 ( |
| Translation (GO:0006412) | 1033 (5.4) |
| Transport (GO:0006810) | 816 (4.2) |
| Protein amino acid phosphorylation (GO:0006468) | 763 ( |
| Intracellular (GO:0005622) | 1925 (17.5) |
| Membrane (GO:0016020) | 1717 (15.6) |
| Nucleus (GO:0005634) | 1279 (11.6) |
| Integral to membrane (GO:0016021) | 1159 (10.5) |
| Ribosome (GO:0005840) | 736 (6.7) |
| ATP binding (GO:0005524) | 2645 (7.5) |
| Catalytic activity (GO:0003824) | 2449 ( |
| Binding (GO:0005488) | 1622 (4.6) |
| Zinc ion binding (GO:0008270) | 1229 (3.5) |
| Oxidoreductase activity (GO:0016491) | 1226 (3.5) |
| Protein binding (GO:0005515) | 1206 (3.4) |
| Nucleic acid binding (GO:0003676) | 919 (2.6) |
| DNA binding (GO:0003677) | 788 (2.2) |
| Structural constituent of ribosome (GO:0003735) | 755 (2.1) |
| Nucleotide binding (GO:0000166) | 717 ( |
ᵃ Total number of unique GO terms assigned to predicted peptides.
The parental (=level 2) GO categories were assigned according to (InterPro) domains inferred from proteins with homology to functionally annotated molecules.
Examples of C. elegans orthologues of contigs unique to each of the Oesophagostomum dentatum adult female, adult male, and third (L3) and fourth (L4) larval stages, following in silico subtraction, ranked according to the ‘severity’ of the RNAi phenotype/s observed, and for which inferred peptides were associated with ‘druggable’ (InterPro) domains and/or Enzyme Commission (EC) numbers as well as examples of candidate compounds linked to these domains, predicted using the BRENDA database. The number of C. elegans orthologues predicted to interact with each of the molecules listed is also indicated.
| Contig code | Gene name | RNAi phenotypes | Protein description | Druggable IPR domain (description) | Examples of BRENDA compounds | No. of predicted interacting genes | |
|---|---|---|---|---|---|---|---|
| Contig722 | T23G5.1 | Embryonic lethal, embryonic defects, larval lethal, larval arrest, sterile | Ribonucleotide reductase | IPR000788 (ribonucleotide) | D-phosphoserine | 35 | |
| Contig18241 | F44F4.2 | Embryonic lethal, maternal sterile, sterile progeny | Protein tyrosine phosphatase | IPR000242 (protein tyrosine) | 4-nitrophenyl phosphate | – | |
| Contig15526 | T21E3.1 | Embryonic lethal, maternal sterile | Protein tyrosine phosphatase | IPR000242 (protein tyrosine) | 4-nitrophenyl phosphate | – | |
| Contig10671 | Y110A7A.4 | Embryonic lethal, reduced brood size | Thymidylate synthase | IPR000398 (thymidylate) | 5,10-methylenetetrahydrofolate + deoxyuridine phosphate | 26 | |
| E6SSEER01EX2TA | F17C8.1 | Embryonic defects, larval arrest | Adenylyl cyclase | IPR001054 (adenylyl) | 3′,5′-cAMP + diphosphate | 2 | |
| Contig12350 | W03A5.1 | Embryonic lethal, embryonic defects | Fibroblast/platelet-derived growth factor receptor and related receptor tyrosine kinase | IPR001254 (serine proteases) | Cleaved azocasein | – | |
| Contig10801 | T04B2.2 | Embryonic lethal, embryonic defects | Protein tyrosine kinase | IPR001245 (tyrosine protein kinase) | ADP + a phosphoprotein | – | |
| Contig13376 | T04B2.2 | Embryonic lethal, embryonic defects | Protein tyrosine kinase | IPR001245 (tyrosine protein kinase) | ADP + a phosphoprotein | – | |
| Contig10782 | ZK354.6 | Embryonic defects | Casein kinase | IPR001245 (tyrosine protein kinase) | ADP + a phosphoprotein | – | |
| Contig13084 | C25A8.5 | Aldicarb resistant | Protein tyrosine kinase | IPR001254 (serine proteases) | Cleaved azocasein | – | |
| Contig10987 | T04D3.4 | Embryonic lethal, larval arrest | Adenylate/guanylate kinase | IPR001054 (guanylate cyclase) | 3′,5′-cAMP + diphosphate | 1 | |
| Contig17117 | B0240.3 | Embryonic lethal, slow growth | Transmembrane guanylate cyclase | IPR001054 (guanylate cyclase) | 3′,5′-cAMP + diphosphate | 27 | |
| Contig10518 | R01E6.1b | Slow growth | Guanylate cyclase | IPR001054 (guanylate cyclase) | 3′,5′-cAMP + diphosphate | – | |
| Contig10600 | C24G6.2b | | Fibroblast/platelet-derived growth factor receptor and related receptor tyrosine kinase | IPR011009 (protein kinase) | Cleaved azocasein | – | |
| Contig1406 | R134.2 | Slow growth | Guanylyl cyclase | IPR001054 (guanylate cyclase) | 3′,5′-cAMP + diphosphate | – | |
| Contig11765 | Y46H3A.1 | Extended life span | 7-transmembrane receptor | IPR011009 (protein kinase) | ADP + a phosphoprotein | – | |
| Contig23920 | T05G5.3 | Embryonic lethal, embryonic defects, maternal sterile | Protein kinase PCTAIRE and related kinases | IPR000719 (protein kinase) | ADP + a phosphoprotein | 139 | |
| Contig1501 | K12D12.1 | Embryonic lethal, embryonic defects, larval arrest | DNA topoisomerase type II | IPR002205 (DNA gyrase) | Catenated DNA networks + ADP + phosphate | 39 | |
| Contig2892 | C46A5.4 | Protruding vulva | | IPR002007 (animal haem peroxidase) | 2-Amino-9,10a-dihydro-3H-phenoxazin-3-one | – | |
| Contig20741 | C46A5.4 | Protruding vulva | | IPR002007 (animal haem peroxidase) | 2-Amino-9,10a-dihydro-3H-phenoxazin-3-one | – | |
| Contig25779 | R11A5.7 | Dumpy | Zinc carboxypeptidase | IPR000834 (Zinc carboxypeptidases) | 4-chlorocinnamic acid + L-β-phenyllactate | 5 | |
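
The table is described as ranked by the ‘severity’ of the RNAi phenotypes, but the record does not say how severity was scored. The Python sketch below shows one way such a ranking could be implemented. The numeric weights in `SEVERITY` and the tie-breaking rule are hypothetical illustrations, not the authors' scheme; the contig codes and phenotype strings are copied from the table.

```python
from typing import List, Tuple

# Hypothetical severity weights (higher = more severe); not taken from the source.
SEVERITY = {
    "embryonic lethal": 4,
    "larval lethal": 4,
    "sterile": 3,
    "maternal sterile": 3,
    "larval arrest": 3,
    "embryonic defects": 2,
    "slow growth": 1,
    "dumpy": 1,
}


def severity_score(phenotypes: str) -> Tuple[int, int]:
    """Score a comma-separated phenotype string.

    Returns (most severe weight, sum of weights), so candidates are compared
    first on their worst phenotype and then on the overall phenotype load.
    Phenotypes missing from SEVERITY count as 0.
    """
    terms = [term.strip().lower() for term in phenotypes.split(",") if term.strip()]
    weights = [SEVERITY.get(term, 0) for term in terms]
    return (max(weights, default=0), sum(weights))


def rank(candidates: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (contig code, phenotypes) pairs from most to least severe."""
    return sorted(candidates, key=lambda item: severity_score(item[1]), reverse=True)


candidates = [
    ("Contig25779", "Dumpy"),
    ("Contig722", "Embryonic lethal, embryonic defects, larval lethal, larval arrest, sterile"),
    ("Contig10987", "Embryonic lethal, larval arrest"),
    ("Contig10518", "Slow growth"),
]
ranked_ids = [contig for contig, _ in rank(candidates)]
```

With these assumed weights, `Contig722` ranks first and `Contig10987` second. The two single-phenotype candidates tie and keep their input order, because Python's sort is stable.
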