Zhenya Liu, Zirui Ren, Lunyi Yan, Feng Li.
Abstract
Members of the leucine-rich repeat (LRR) superfamily play critical roles in multiple biological processes. Because the LRR unit sequence is highly variable, accurately predicting the number and location of LRR units in proteins is a challenging task in bioinformatics. Existing methods, particularly similarity-based ones, leave room for improvement. We introduce DeepLRR, a method based on a convolutional neural network (CNN) model and LRR features, to predict the number and location of LRR units in proteins. We compared DeepLRR with six existing methods on a dataset of 572 LRR proteins, and it outperformed all of them in overall F1 score. In addition, DeepLRR integrates the identification of plant disease-resistance proteins (NLR, LRR-RLK, LRR-RLP) and non-canonical domains. With DeepLRR, 223, 191 and 183 LRR-RLK genes were re-annotated in the Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa ssp. japonica) and tomato (Solanum lycopersicum) genomes, respectively. Chromosome mapping and gene cluster analysis revealed that 24.2% (54/223), 29.8% (57/191) and 16.9% (31/183) of LRR-RLK genes formed gene cluster structures in Arabidopsis, rice and tomato, respectively. Finally, we explored the evolutionary relationships and domain composition of LRR-RLK genes in each plant, as well as the distribution of known receptor and co-receptor pairs. This provides a new perspective for the identification of potential receptors and co-receptors.
Keywords: LRR domain; deep learning; plant disease-resistance genes
Year: 2022 PMID: 35009139 PMCID: PMC8796025 DOI: 10.3390/plants11010136
Source DB: PubMed Journal: Plants (Basel) ISSN: 2223-7747
Figure 1. Framework of the DeepLRR CNN model.
Figure 2. Radar chart of the CNN model and three machine learning models. The radar chart shows four evaluation indicators: Precision, Sensitivity, F1-score and MCC. The brown line represents the average performance of the 5-fold cross validation for each model and the dark blue line represents the performance of each model using the test dataset.
Performance of DeepLRR and six existing tools in predicting LRR units for 572 test protein sequences.
| Method | Precision | Sensitivity | F1 |
|---|---|---|---|
| LRRpredictor | 0.582 | 0.854 | 0.692 |
| LRRsearch | 0.676 | 0.813 | 0.739 |
| LRRfinder | 0.798 | 0.669 | 0.728 |
| Pfam | 0.192 | 0.037 | 0.062 |
| Prosite | 0.836 | 0.379 | 0.522 |
| Smart | 0.398 | 0.167 | 0.235 |
| DeepLRR | 0.744 | 0.783 | 0.763 |
Prosite has the highest precision (0.836) and LRRpredictor has the highest sensitivity (0.854). DeepLRR balances precision and sensitivity (0.744 and 0.783) and achieves the highest F1-score, 0.763.
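The F1 column can be cross-checked against the other two columns, since F1 is the harmonic mean of precision and sensitivity. The following is a minimal sketch of that check, not part of DeepLRR itself; the function name `f1_score` and the dictionary `reported` are illustrative names introduced here, and the numbers are the ones reported in the table and the sentence above.

```python
def f1_score(precision: float, sensitivity: float) -> float:
    """Harmonic mean of precision and sensitivity (recall)."""
    if precision + sensitivity == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * sensitivity / (precision + sensitivity)


# (precision, sensitivity, reported F1) as given in the source table and text.
reported = {
    "LRRpredictor": (0.582, 0.854, 0.692),
    "LRRsearch": (0.676, 0.813, 0.739),
    "LRRfinder": (0.798, 0.669, 0.728),
    "Pfam": (0.192, 0.037, 0.062),
    "Prosite": (0.836, 0.379, 0.522),
    "Smart": (0.398, 0.167, 0.235),
    "DeepLRR": (0.744, 0.783, 0.763),
}

# Recompute F1 from the rounded precision and sensitivity values.
recomputed = {
    method: f1_score(precision, sensitivity)
    for method, (precision, sensitivity, _) in reported.items()
}

for method, (_, _, f1_reported) in reported.items():
    print(f"{method:13s} reported={f1_reported:.3f} recomputed={recomputed[method]:.3f}")

# Method with the highest recomputed F1.
best_method = max(recomputed, key=recomputed.get)
print("Highest F1:", best_method)
```

Because the inputs are already rounded to three decimals, a recomputed value may differ from the reported one in the last digit.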
Figure 3. The homepage of the DeepLRR website. The left side of the main body of the website briefly introduces the research focus of DeepLRR while the right side shows the main functional modules of DeepLRR.
Figure 4. Re-annotation of LRR-RLK genes in the Arabidopsis genome, with chromosome mapping, gene cluster analysis and phylogenetic analysis. (A) The Venn diagram on the left compares the LRR-RLK genes annotated in the Arabidopsis genome by DeepLRR, by the reference genome TAIR10.1 and by the representative paper. The histogram on the right shows the domain composition of the LRR-RLK genes that DeepLRR did not annotate, split into three sets: genes unique to TAIR10.1, genes shared by TAIR10.1 and the representative paper, and genes unique to the representative paper. (B) Distribution on the Arabidopsis chromosomes of the LRR-RLK genes re-annotated by DeepLRR. Green rectangles represent different gene clusters, tandem repeat genes are marked with an asterisk and gene names in red are LRR-RLK genes annotated only by DeepLRR. (C) An unrooted phylogenetic tree of the LRR-RLK genes re-annotated by DeepLRR in Arabidopsis. The tree was built from the amino acid sequences of the kinase domains using the neighbor-joining (NJ) method. Colored circles on the sub-nodes of the tree indicate ranges of bootstrap values: red for 0.9 to 1, gold for 0.7 to 0.9 and dark grey for 0.5 to 0.7. The background color of each leaf node indicates the range of its LRR unit count: dark red for 20 or more LRR units, dark yellow for at least 10 but fewer than 20, and dark blue for fewer than 10. The histogram outside the leaf nodes shows the exact number of LRR units for each gene. The tree also marks the receptor and co-receptor pairs that have been experimentally verified so far: a circle represents a receptor, a triangle represents a co-receptor and a shared color indicates an interaction. Finally, the domain composition of each LRR-RLK gene is shown in detail.
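The color coding described in panel (C) amounts to two threshold lookups. The sketch below shows one way to express them; it is not the authors' plotting code, and the function names `lrr_count_color` and `bootstrap_color` are hypothetical. Two assumptions go beyond the caption: a bootstrap value that falls exactly on a shared boundary (0.7 or 0.9) is assigned to the higher range, and values below 0.5 are treated as unmarked.

```python
from typing import Optional


def lrr_count_color(n_units: int) -> str:
    """Leaf background color for a gene with n_units LRR units."""
    if n_units >= 20:
        return "dark red"      # 20 or more LRR units
    if n_units >= 10:
        return "dark yellow"   # at least 10 but fewer than 20
    return "dark blue"         # fewer than 10


def bootstrap_color(value: float) -> Optional[str]:
    """Circle color for a node with the given bootstrap value (0 to 1)."""
    # Assumption: boundary values 0.9 and 0.7 go to the higher range.
    if value >= 0.9:
        return "red"
    if value >= 0.7:
        return "gold"
    if value >= 0.5:
        return "dark grey"
    return None  # Assumption: nodes below 0.5 carry no circle


if __name__ == "__main__":
    for n in (5, 10, 19, 20, 27):
        print(n, lrr_count_color(n))
    for b in (0.45, 0.5, 0.75, 0.95):
        print(b, bootstrap_color(b))
```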