Tiange Lang, Kangquan Yin, Jinyu Liu, Kunfang Cao, Charles H Cannon, Fang K Du.
Abstract
Predicting protein domains is essential for understanding a protein's function at the molecular level. Until now, however, there has been no direct and straightforward method for predicting protein domains in species without a reference genome sequence. In this study, we developed a programming functionality, a set of programs that predicts protein domains directly from genomic sequence data without a reference genome. Starting from whole-genome sequence data, the functionality comprises three main steps: DNA assembly that combines next-generation sequencing (NGS) assembly methods with traditional methods, peptide prediction, and protein domain prediction. The proposed functionality avoids problems associated with de novo assembly due to micro-reads and small single repeats. We applied it to predict leucine-rich repeat (LRR) domains in four species of Ficus with no reference genome, based on NGS genomic data. We found that the LRRNT_2 and LRR_8 domains are related to plant transpiration efficiency, as indicated by the stomatal index, in the four Ficus species. The programming functionality established in this study provides new insights for protein domain prediction, which is particularly timely in the current age of NGS data expansion.
Year: 2014 PMID: 25269070 PMCID: PMC4182558 DOI: 10.1371/journal.pone.0108719
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Protein domain structure of the protein encoded by the ERECTA gene in Arabidopsis thaliana.
A. From the N-terminus to the C-terminus, the protein is composed of one LRRNT_2 domain, two LRR_8 domains and one Pkinase domain. B. Amino acid sequence of the protein. The LRRNT_2 domain and the two LRR_8 domains are underlined. Leucine repeats can be found in the latter domains.
Figure 2. The proposed programming functionality for predicting protein domains directly from genomic sequence data without a reference genome.
The Illumina reads were first trimmed with quality control methods. The assemblers ABySS, SOAPdenovo and Velvet were then run separately to obtain the initial contigs. Next, a length filter retained only contigs longer than 250 base pairs. The assembly software Phrap was then used to assemble these contigs into the final contigs, and Genscan was used to predict peptides from the final contigs. Finally, Hmmsearch was used to predict protein domains.
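The length-control step in this workflow can be illustrated with a minimal sketch. It assumes the contigs are stored as plain FASTA text and that "larger than 250 base pairs" means strictly more than 250 bp; the function names `read_fasta` and `filter_contigs` and the toy contigs are hypothetical and are not taken from the original programs.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA-formatted text stream."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(handle, min_len=250):
    """Return contigs strictly longer than min_len base pairs (assumed cutoff)."""
    return [(name, seq) for name, seq in read_fasta(handle) if len(seq) > min_len]


# Toy input with hypothetical contig names: 280 bp and 200 bp.
demo = io.StringIO(">contig_a\n" + "ACGT" * 70 + "\n>contig_b\n" + "ACGT" * 50 + "\n")
kept = filter_contigs(demo)
print([name for name, _ in kept])  # prints ['contig_a']
```

In a real run the filtered contigs from each assembler would be pooled before being passed to Phrap; that step is not shown here.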
Results from the assembly software.
| Species | #fastq reads | Coverage | Software | #contig_250 | max_len (bp) | #pep | max_len (aa) | #LRRNT_2 | #LRR_8 |
| FA | 2,185,253,886 | 4.86 | ABySS | 26,816 | 1,968 | 10,846 | 606 | 5 | 72 |
| | | | SOAP | 26,898 | 1,906 | 10,735 | 578 | 11 | 71 |
| | | | Velvet | 123,763 | 6,407 | 23,086 | 702 | 22 | 120 |
| | | | Phrap | 114,596 | 6,914 | 21,901 | 880 | 19 | 132 |
| FT | 2,197,543,362 | 4.88 | ABySS | 54,144 | 2,739 | 8,595 | 436 | 3 | 40 |
| | | | SOAP | 59,831 | 2,524 | 8,419 | 306 | 4 | 33 |
| | | | Velvet | 170,753 | 9,251 | 15,319 | 418 | 6 | 59 |
| | | | Phrap | 154,710 | 10,755 | 14,807 | 467 | 9 | 62 |
| FL | 1,993,136,266 | 4.43 | ABySS | 7,679 | 2,002 | 2,426 | 506 | 1 | 24 |
| | | | SOAP | 8,321 | 3,430 | 2,611 | 506 | 3 | 23 |
| | | | Velvet | 86,717 | 6,718 | 6,479 | 534 | 3 | 32 |
| | | | Phrap | 84,287 | 6,665 | 6,822 | 550 | 3 | 45 |
| FF | 869,615,244 | 1.93 | ABySS | 7,087 | 5,558 | 2,669 | 772 | 2 | 14 |
| | | | SOAP | 7,049 | 7,064 | 2,609 | 772 | 0 | 12 |
| | | | Velvet | 12,129 | 7,511 | 3,092 | 772 | 0 | 17 |
| | | | Phrap | 13,972 | 9,203 | 3,827 | 1,536 | 2 | 19 |
FA, FT, FL and FF stand for Ficus altissima, Ficus tinctoria, Ficus langkokensis and Ficus fistulosa, respectively.
#fastq reads: number of fastq reads from the Illumina HiSeq 2000.
#contig_250: number of predicted contigs longer than 250 base pairs.
max_len (bp): length in base pairs (bp) of the longest predicted contig.
#pep: number of peptides predicted.
max_len (aa): length in amino acids (aa) of the longest predicted peptide.
#LRRNT_2: number of LRRNT_2 domains predicted.
#LRR_8: number of LRR_8 domains predicted.
Figure 3. Maximum length (number of amino acids) of peptides predicted by the programming functionality.
The Illumina reads for F. altissima (FA), F. tinctoria (FT), F. langkokensis (FL) and F. fistulosa (FF) were assembled by ABySS, SOAPdenovo and Velvet. Phrap was used to assemble the contigs from ABySS, SOAPdenovo and Velvet, and then Genscan was used to predict peptides from these contigs. Phrap increased the maximum peptide length in all four species (FA, FT, FL and FF).
Redundancy removed by Phrap.
| Species | #contig_250 from ABySS, SOAP and Velvet | #base pairs | #contig_250 not used by Phrap | #base pairs | #contig_250 used by Phrap | #base pairs | #contig_250 after Phrap | #base pairs | Percent of redundancy removed by Phrap |
| FA | 177477 | 79652086 | 86943 | 37241030 | 90534 | 42411056 | 27653 | 13333742 | 36.51 |
| FT | 284698 | 124099037 | 95706 | 40299597 | 188992 | 83799440 | 59004 | 26821922 | 45.91 |
| FL | 102717 | 40680387 | 75107 | 29210626 | 27610 | 11469761 | 9180 | 3886983 | 18.64 |
| FF | 26265 | 9447858 | 7416 | 2344792 | 18849 | 7103066 | 6556 | 2670964 | 46.91 |
FA, FT, FL and FF stand for Ficus altissima, Ficus tinctoria, Ficus langkokensis and Ficus fistulosa, respectively.
#contig_250: number of contigs longer than 250 base pairs.
#base pairs: number of base pairs.
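The source does not state how the last column of the table is calculated. The reported percentages are, however, reproduced by dividing the base pairs removed by Phrap (base pairs used by Phrap minus base pairs after Phrap) by the total base pairs from ABySS, SOAP and Velvet. The sketch below checks that inferred formula against the table; the function name `redundancy_removed_percent` is hypothetical.

```python
def redundancy_removed_percent(total_bp, used_bp, after_bp):
    """Percent of all input base pairs that Phrap collapsed (inferred formula)."""
    return 100.0 * (used_bp - after_bp) / total_bp


# Base-pair columns from the table:
# (total from the three assemblers, used by Phrap, after Phrap)
TABLE_BP = {
    "FA": (79652086, 42411056, 13333742),
    "FT": (124099037, 83799440, 26821922),
    "FL": (40680387, 11469761, 3886983),
    "FF": (9447858, 7103066, 2670964),
}

percentages = {
    species: round(redundancy_removed_percent(*bp), 2)
    for species, bp in TABLE_BP.items()
}
print(percentages)
# prints {'FA': 36.51, 'FT': 45.91, 'FL': 18.64, 'FF': 46.91}
```

The computed values match the reported column for all four species, which supports the inferred definition without confirming that it is the one the authors used.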
Physiological, anatomical and stomata response data in Ficus.
| Species | Statistic | #stomata | #epidermal cells | Stomatal density | Epidermal cell density | Stomatal index |
| FA | M | 12.91667 | 231.1944 | 326.5458 | 5844.819 | 5.301273 |
| | SD | 2.061553 | 20.15769 | 52.11805 | 509.606 | 0.79305 |
| | SE | 0.343592 | 3.359615 | 8.686342 | 84.93433 | 0.132175 |
| FT | M | 20.84848 | 169.0303 | 527.0699 | 4273.25 | 10.90198 |
| | SD | 4.016538 | 13.41754 | 101.542 | 339.2083 | 1.331365 |
| | SE | 0.699189 | 2.335693 | 17.67619 | 59.04859 | 0.231761 |
| FL | M | 15.66667 | 99.47619 | 396.0685 | 2514.854 | 13.61349 |
| | SD | 1.932184 | 7.35268 | 48.84747 | 185.8829 | 1.467321 |
| | SE | 0.421637 | 1.604486 | 10.65939 | 40.56297 | 0.320196 |
| FF | M | 19.125 | 99.2 | 483.4985 | 2507.872 | 15.90947 |
| | SD | 3.879433 | 10.0584 | 98.07582 | 254.2861 | 1.932021 |
| | SE | 0.969858 | 2.597068 | 24.51895 | 65.65639 | 0.498846 |
FA, FT, FL and FF stand for Ficus altissima, Ficus tinctoria, Ficus langkokensis and Ficus fistulosa, respectively.
M, SD, and SE: mean, standard deviation and standard error, respectively.
#stomata: number of stomata.
#epidermal cells: number of epidermal cells.
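The legend does not define the stomatal index. A common definition is the number of stomata as a percentage of all epidermal units (stomata plus epidermal cells) in the same field of view. The sketch below applies that definition to the mean counts in the table, on the assumption that the authors used it. The results are close to the reported means but not identical, presumably because the reported index was averaged over individual fields rather than computed from the mean counts. The function name `stomatal_index` is hypothetical.

```python
def stomatal_index(n_stomata, n_epidermal_cells):
    """Stomata as a percentage of stomata plus epidermal cells (assumed definition)."""
    return 100.0 * n_stomata / (n_stomata + n_epidermal_cells)


# Mean counts (row M) and reported mean stomatal index from the table.
MEANS = {
    "FA": (12.91667, 231.1944, 5.301273),
    "FT": (20.84848, 169.0303, 10.90198),
    "FL": (15.66667, 99.47619, 13.61349),
    "FF": (19.125, 99.2, 15.90947),
}

for species, (stomata, epidermal, reported) in MEANS.items():
    computed = stomatal_index(stomata, epidermal)
    print(f"{species}: computed {computed:.2f}, reported {reported:.2f}")
```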
Figure 4. Number of LRRNT_2, LRR_8 and actin domains predicted in F. altissima (FA), F. tinctoria (FT), F. langkokensis (FL) and F. fistulosa (FF) (A), and stomatal index in FA, FT, FL and FF (B).
Across the species in the order FA, FT, FL and FF, the number of predicted LRRNT_2 and LRR_8 domains decreased while the stomatal index increased.