Pingchuan Li, Xiande Quan, Gaofeng Jia, Jin Xiao, Sylvie Cloutier, Frank M You.
Abstract
BACKGROUND: Resistance gene analogs (RGAs), such as NBS-encoding proteins, receptor-like protein kinases (RLKs) and receptor-like proteins (RLPs), are potential R-genes that contain specific conserved domains and motifs. Thus, RGAs can be predicted based on their conserved structural features using bioinformatics tools. Computer programs have been developed for the identification of individual domains and motifs from the protein sequences of RGAs, but none offers a systematic assessment of the different types of RGAs. A user-friendly and efficient pipeline is needed for large-scale genome-wide RGA predictions of the growing number of sequenced plant genomes.
Keywords: Genome-wide prediction; Nucleotide binding site (NBS); Pipeline; Receptor like kinase (RLK); Receptor like protein (RLP); Resistance gene analog (RGA)
Year: 2016 PMID: 27806688 PMCID: PMC5093994 DOI: 10.1186/s12864-016-3197-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Workflow of RGAugury. The pipeline was designed to use protein sequences to detect conserved domains and motifs found in genes involved in plant resistance and identify RGAs by integrating results generated from five programs: BLAST, InterProScan, pfam_scan, nCoil and Phobius. The annotated candidates for four different RGA types are exported as plain files. Analyses performed in parallel mode are labelled in blue. Intermediate results are indicated by a dashed-line box. GFF3: Generic Feature Format version 3; CC: coiled-coil; LRR: leucine-rich repeat; NB-ARC: nucleotide binding adapter shared by APAF-1, R gene products and CED-4; STTK: serine/threonine and tyrosine kinase; LysM: lysin motif; TM: transmembrane
Fig. 2 RGA identification based on domain structures of genes. CC: coiled-coil; CN: CC-NBS; CNL: CC-NBS-LRR; LRR: leucine-rich repeat; LysM: lysin motif; NB-ARC: nucleotide binding adapter shared by APAF-1, R gene products and CED-4; NBS: nucleotide-binding site; NL: NBS-LRR; RGA: resistance gene analog; RLK: receptor-like kinase; RLP: receptor-like protein; STTK: serine/threonine and tyrosine kinase; TIR: Toll/Interleukin-1 receptor; TM: transmembrane; TN: TIR-NBS; TNL: TIR-NBS-LRR; TX: TIR-unknown domain
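The class abbreviations in this caption imply a rule-based mapping from detected domains to RGA classes. The following is a minimal Python sketch of such a mapping, not RGAugury's own code. The function name `classify_rga`, the domain labels, the precedence of TIR over CC, and the RLK/RLP rules (a TM region plus an LRR or LysM domain, with or without an STTK kinase domain) are all assumptions made for illustration.

```python
def classify_rga(domains):
    """Map a collection of domain labels to an RGA class label, or None.

    Assumed labels: "CC", "TIR", "NBS", "LRR", "LysM", "TM", "STTK".
    """
    d = set(domains)
    if "NBS" in d:
        # NBS-encoding classes, named after the caption's abbreviations.
        if "TIR" in d:  # assumption: TIR takes precedence over CC
            return "TNL" if "LRR" in d else "TN"
        if "CC" in d:
            return "CNL" if "LRR" in d else "CN"
        return "NL" if "LRR" in d else "NBS"
    if "TIR" in d:
        # TIR without an NBS domain: TIR-unknown domain.
        return "TX"
    # Assumption: membrane receptors carry a TM region plus LRR or LysM.
    if "TM" in d and ({"LRR", "LysM"} & d):
        return "RLK" if "STTK" in d else "RLP"
    return None


print(classify_rga({"CC", "NBS", "LRR"}))   # CNL
print(classify_rga({"LysM", "TM", "STTK"}))  # RLK
```

Keeping the rules in one function makes the precedence between overlapping domain combinations explicit, which matters when a protein carries more domains than any single class name covers.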
Fig. 3 Web user interface pages of RGAugury. a The main page of RGAugury for data input. All parameter values required in the command line version are specified directly on this page. Only protein sequences in FASTA format are required. A GFF3 file corresponding to the input protein sequences is optional but recommended. Databases for InterProScan can be selected by choosing either a predesigned ‘Quick’ mode or a ‘Deep’ mode. The default E-value cut-off for the initial RGA filtering with BLASTP is 1e-5. b The RGA prediction result summary page
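To illustrate what an E-value cut-off of 1e-5 does in an initial filtering step, the sketch below keeps the query proteins that have at least one BLASTP hit at or below the threshold. It assumes BLAST tabular output (`-outfmt 6`), in which the query ID is column 1 and the E-value is column 11. The function name `filter_blast_hits` and the sample lines are hypothetical, and this is not the pipeline's own implementation.

```python
def filter_blast_hits(lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Assumes BLAST tabular output (-outfmt 6): 12 tab-separated columns,
    query ID in column 1 and E-value in column 11.
    """
    kept = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            kept.add(fields[0])
    return kept


# Hypothetical hits: geneA passes the cut-off, geneB does not.
sample = [
    "geneA\tRGA1\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t300",
    "geneB\tRGA2\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t20",
]
print(sorted(filter_blast_hits(sample)))  # ['geneA']
```

A smaller cut-off makes the initial filter stricter, so fewer candidates reach the domain and motif analyses.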
Evaluation of RGA identification accuracy with RGAugury using the Arabidopsis thaliana dataset (TAIR10)
| RGA type | No. of known RGAs | No. of RGAs identified | % identified |
|---|---|---|---|
| NBS | 193 | 190 | 98.5 |
| RLP | 54 | 46 | 85.2 |
| RLK | 456 | 460 | 100.0 |
Summary of RGA identification results for 50 sequenced plant genomes
Note: Two modes for database selection were used: the ‘Quick’ mode (Pfam + Gene3D) and the ‘Deep’ mode (Pfam + Gene3D + SMART + Superfamily). Results were separated by a slash if differences existed between the two modes. Plants were sorted by taxonomic groups which are labelled on the left side of the table. A. coerulea: Aquilegia coerulea; A. halleri: Arabidopsis halleri; A. lyrata: Arabidopsis lyrata; A. thaliana: Arabidopsis thaliana; A. trichopoda: Amborella trichopoda; B. distachyon: Brachypodium distachyon; B. rapa: Brassica rapa; B. stricta: Boechera stricta; C. clementina: Citrus clementina; C. grandiflora: Capsella grandiflora; C. papaya: Carica papaya; C. reinhardtii: Chlamydomonas reinhardtii; C. rubella: Capsella rubella; C. sativus: Cucumis sativus; C. sinensis: Citrus sinensis; C. subellipsoidea: Coccomyxa subellipsoidea; E. grandis: Eucalyptus grandis; E. salsugineum: Eutrema salsugineum; F. vesca: Fragaria vesca; G. max: Glycine max; G. raimondii: Gossypium raimondii; K. laxiflora: Kalanchoe laxiflora; L. usitatissimum: Linum usitatissimum; M. acuminata: Musa acuminata; M. domestica: Malus domestica; M. esculenta: Manihot esculenta; M. guttatus: Mimulus guttatus; M. pusilla: Micromonas pusilla; M. truncatula: Medicago truncatula; O. lucimarinus: Ostreococcus lucimarinus; O. sativa: Oryza sativa; P. hallii: Panicum hallii; P. patens: Physcomitrella patens; P. persica: Prunus persica; P. trichocarpa: Populus trichocarpa; P. virgatum: Panicum virgatum; P. vulgaris: Phaseolus vulgaris; R. communis: Ricinus communis; S. bicolor: Sorghum bicolor; S. italica: Setaria italica; S. lycopersicum: Solanum lycopersicum; S. moellendorffii: Selaginella moellendorffii; S. polyrhiza: Spirodela polyrhiza; S. purpurea: Salix purpurea; S. tuberosum: Solanum tuberosum; T. cacao: Theobroma cacao; V. carteri: Volvox carteri; V. vinifera: Vitis vinifera; Z. mays: Zea mays
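The slash convention described in this note can be read programmatically. The sketch below records the two database sets named in the note and splits a table cell into a pair of counts. The names `MODE_DATABASES` and `parse_count_cell` are hypothetical. The ordering of the two values (‘Quick’ first, ‘Deep’ second) is an assumption, since the note does not state it, and the example cell values are invented.

```python
# Database sets per mode, as given in the table note.
MODE_DATABASES = {
    "Quick": ["Pfam", "Gene3D"],
    "Deep": ["Pfam", "Gene3D", "SMART", "Superfamily"],
}


def parse_count_cell(cell):
    """Return (quick, deep) counts from a cell such as "120" or "120/125".

    Assumption: when a slash is present, the first value is the 'Quick'
    result and the second is the 'Deep' result.
    """
    parts = [int(p) for p in cell.strip().split("/")]
    if len(parts) == 1:
        return parts[0], parts[0]  # both modes gave the same count
    if len(parts) == 2:
        return parts[0], parts[1]
    raise ValueError(f"unexpected cell format: {cell!r}")


print(parse_count_cell("120"))      # (120, 120)
print(parse_count_cell("120/125"))  # (120, 125)
```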
Fig. 4 Performance of RGAugury. Forty-nine sequenced plant genomes (Zea mays was excluded, see text) with varying numbers of protein coding genes were used for RGA identification on a server equipped with 40 CPUs. The time taken to run the entire pipeline on each dataset was recorded as the performance measurement. Performance was compared between the ‘Quick’ mode (Pfam + Gene3D databases) and the ‘Deep’ mode (Pfam + Gene3D + SMART + Superfamily). Dots and R values are colour-coded by mode (‘Quick’ versus ‘Deep’)