YeATSAM analysis of the walnut and chickpea transcriptome reveals key genes undetected by current annotation tools.

Sandeep Chakraborty, Pedro J Martínez-García, Abhaya M Dandekar.

Abstract

Background: The transcriptome, a treasure trove of gene space information, remains severely under-used by current genome annotation methods. 
Methods: Here, we present an annotation method in the YeATS suite (YeATSAM), based on information encoded by the transcriptome, that demonstrates artifacts of the assembler, which must be addressed to achieve proper annotation.
Results and Discussion: YeATSAM was applied to the transcriptome obtained from twenty walnut tissues and compared to MAKER-P annotation of the recently published walnut genome sequence (WGS). MAKER-P and YeATSAM both failed to annotate several hundred proteins found by the other. Although many of these unannotated proteins have repetitive sequences (possibly transposable elements), other crucial proteins were excluded by each method. An egg cell-secreted protein and a homer protein were undetected by YeATSAM, although these did not produce any transcripts. Importantly, MAKER-P failed to classify key photosynthesis-related proteins, which we show emanated from Trinity assembly artifacts potentially not handled by MAKER-P. Also, no proteins from the large berberine bridge enzyme (BBE) family were annotated by MAKER-P. BBE is implicated in the biosynthesis of several alkaloid metabolites, such as the antimicrobial berberine. As further validation, YeATSAM was applied to an RNA-seq-derived chickpea (Cicer arietinum L.) transcriptome assembled using Newbler v2.3, identifying ~1000 genes that are not annotated in the NCBI database by Gnomon.
Conclusions: Since the current version of YeATSAM does not have an ab initio module, we suggest a combined annotation scheme using both MAKER-P and YeATSAM to comprehensively and accurately annotate the WGS.

Keywords:  MAKER-P; RNA-seq; Trinity; berberine bridge enzyme; genome annotation; transcriptome; walnut genome sequence

Year:  2016        PMID: 28105312      PMCID: PMC5200947          DOI: 10.12688/f1000research.10040.1

Source DB:  PubMed          Journal:  F1000Res        ISSN: 2046-1402


Introduction

The genome of a particular organism is static in all cells, unlike the dynamic transcriptome, which is the transcription of the gene space into RNA molecules in a fashion responsive to a variety of factors, such as developmental stage, tissue, and external stimuli. RNA-seq, a high-throughput RNA sequencing method, has radically transformed the identification of transcripts and quantification of transcriptional levels ( Flintoft, 2008; Wang ). It is supported by a diverse set of computational methods for analyzing the resulting data ( Chakraborty ; Chang ; Chu ; Fu ; Grabherr ; Lohse ; Mbandi ; Schulz ; Simpson ; Trapnell ; Trapnell ; Wang ; Zerbino & Birney, 2008). Rapid advances in genome sequencing technologies have generated sequences for a deluge of organisms and species. The task of annotating these sequences has been addressed by several flows. These pipelines are categorized in http://omictools.com/genome-annotation-category and http://genometools.org/ and reviewed in ( Yandell & Ence, 2012). Here, we focus specifically on MAKER-P ( Campbell ; Holt & Yandell, 2011; Law ; Neale ), which was used to annotate the recently published walnut genome sequence (WGS) ( Martínez-García ). In the current study, the YeATS suite ( Chakraborty ) was enhanced to include genome annotation capabilities using RNA-seq-derived transcriptomes (YeATS annotation module - YeATSAM). First, the Trinity-assembled transcriptome obtained from twenty different tissues was compared to the WGS, excluding transcripts emanating from extraneous sources. This step incidentally revealed both biodiversity and plant-microbe interactions in walnut tree(s) from Davis, California ( Chakraborty ). The WGS-derived transcripts were split into three open reading frames (ORFs), which were subjected to BLAST analysis using a plant proteome database obtained from the Ensembl database ( Kersey ). 
Transcripts can contain more than one significant ORF and must be handled differently depending on whether they map to the same or a different protein. The resulting analysis provided the WGS annotation. Both MAKER-P and YeATSAM failed to annotate several hundred proteins annotated by the other. Many of the proteins had repetitive sequences or domains that, although difficult to detect, do not represent critical proteins during annotation. An egg cell-secreted protein ( Sprunck ), a copper chaperone ( Shin ), and a clavata3/ESR-Related protein ( Kinoshita ) were among the proteins not detected through the YeATSAM flow. Some proteins undetected in the MAKER-P flow are more significant in the context of a plant genome: several photosynthesis-related proteins encoded by the chloroplast ( Nelson & Yocum, 2006) and the large family of FAD-binding berberine bridge enzymes (BBE) involved in biosynthesis of antimicrobial benzophenanthridines ( Cheney, 1963; Winkler ). We posited possible reasons for such exclusions and recommend incorporating both flows for comprehensive enumeration of genes in the WGS. As further validation, YeATSAM was applied to chickpea ( Cicer arietinum L.), an important pulse crop with many nutritional and health benefits ( Jukanti ). The RNA-seq-derived transcriptome of chickpea has also been sequenced ( Garg ) and was processed through the YeATSAM pipeline to identify ~1000 proteins that are encoded by these transcripts, but are not annotated in the NCBI database, most of which were annotated using Gnomon ( Souvorov ).

Methods

The input to YeATSAM is a set of post-assembly transcripts (∅) and the walnut genome sequence (WGS) ( Figure 1). Transcripts that do not align to the WGS were removed ( Chakraborty ). A BLAST database of protein peptides (plantpep.fasta: 1M sequences) using ~30 organisms (list.plants) from the Ensembl genome was created ( Kersey ). The three longest open reading frames (ORFs), obtained using the ‘getorf’ utility in the EMBOSS suite ( Rice ), for each transcript in (∅) underwent BLAST analysis ( Camacho ) against ‘plantpep.fasta’. For a cutoff E-value of 1E-8, depending on the number of matches, the transcripts were clustered as:
Figure 1.

YeATSAM flow.

First, transcripts from extraneous organisms are pruned. Next, the three longest open reading frames (ORFs) from each transcript undergo BLAST analysis to a database of plant peptides. Depending on the number of significant matches, the transcripts are clustered as: ( a) None - either a previously unknown gene, or non-coding RNA. ( b) One - Unique ORF ( c) Multiple ORFs matching to the same gene - merge the ORFs if the Evalue of the combined ORF is significantly lower. ( d) Multiple ORFs matching to different genes - duplicate the transcripts, associating each with a different ORF. Subsequently, the ORFs are merged based on overlapping amino acid sequences and exact substrings are removed.

(a) None - either a previously unknown gene or non-coding RNA.
(b) One - unique ORF.
(c) Multiple ORFs matching to the same gene - merge the ORFs if the Evalue of the combined ORF is significantly lower.
(d) Multiple ORFs matching to different genes - duplicate the transcripts, associating each transcript with a different ORF.
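
The four-way clustering above can be sketched in a few lines of Python. This is a minimal illustration and not the YeATSAM Perl implementation: the function name `classify_transcript`, the tuple layout, and the category labels are hypothetical, and only the E-value cutoff of 1E-8 comes from the text.

```python
E_CUTOFF = 1e-8  # cutoff quoted in the Methods; everything else here is illustrative


def classify_transcript(orf_hits, e_cutoff=E_CUTOFF):
    """Cluster one transcript by the number of significant ORF matches.

    orf_hits: list of (orf_id, matched_protein_id, evalue) tuples, one per ORF
    (a hypothetical layout chosen for this sketch).
    """
    significant = [(orf, prot) for orf, prot, evalue in orf_hits if evalue <= e_cutoff]
    if not significant:
        return "none"                # unknown gene or non-coding RNA
    if len(significant) == 1:
        return "one"                 # unique ORF
    proteins = {prot for _, prot in significant}
    # Same protein: candidates for merging; different proteins: duplicate the transcript.
    return "multiple_same" if len(proteins) == 1 else "multiple_different"


# C20727_G1_I1, using the E-values reported in the Results text
print(classify_transcript([
    ("ORF_15", "XP007043420.1", 9e-70),
    ("ORF_36", "XP007043420.1", 6e-96),
]))  # prints "multiple_same"
```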

In vitro methods

Fifteen samples of walnut tissue were gathered from Chandler trees growing in the Stuke block at UC Davis between April and October 2008. Four additional samples were taken from Chandler plant material from the same orchard maintained in tissue culture. Several grams of leaf and root tissue from each plant were frozen in liquid nitrogen and then transferred to a -80 °C freezer. RNA was isolated from each sample using the hot borate method ( Wilkins & Smart, 1996) followed by purification and DNAse treatment using an RNA/DNA Mini Kit (Qiagen, Valencia, CA) per the manufacturer’s protocol. High-quality RNA was confirmed by running an aliquot of each sample on an Experion Automated Electrophoresis System (Bio-Rad Laboratories, Hercules, CA). The cDNA libraries were constructed following the Illumina mRNA-sequencing sample preparation protocol (Illumina Inc., San Diego, CA). Final elution was performed with 16 µL RNase-free water. The quality of each library was determined using a BioRad Experion (BioRad, Hercules, CA). Each library was run as an independent lane on a Genome Analyzer II (Illumina, San Diego, CA) to generate 85 bp paired-end sequences from each cDNA library. Over a billion reads were obtained. Prior to assembly, all reads underwent quality control for paired-end reads and trimming using Sickle v1.33 ( Joshi & Fass, 2011). The minimum read length was 45 bp with a minimum Sanger quality score of 35. The quality-controlled reads were de novo assembled with Trinity v2.0.6 ( Grabherr ). Standard parameters were used and the minimum contig length was 300 bp. Individual assemblies for each library and a combined assembly of all tissues were performed. The walnut genome sequence has been released to the public domain ( http://ucanr.edu/sites/wgig/). The Illumina (Genome Analyzer II) reads for all 20 tissues can be accessed at http://www.ncbi.nlm.nih.gov/sra/PRJNA232394.
The transcriptome of Cicer arietinum (transHybrid.fasta, ICC4958; Desi chickpea) was obtained from http://www.nipgr.res.in/ctdb.html ( Garg ). The dataset ‘represents optimized de novo hybrid assembly of 454 and short-read sequence data.’ About two million 454 reads were assembled using Newbler v2.3 followed by hybrid assembly with 53409 transcripts generated by optimized short-read data assembly using TGICL, as reported previously ( Garg ). The set of annotated proteins from chickpea was obtained from the NCBI database (chickpea.pep.fasta, N=34198). PHYML v3.0 was used to generate phylogenetic trees from alignments ( Guindon ). Multiple sequence alignment was done using ClustalW ( Larkin ) and figures were generated using the ENDscript server 2.0 ( Robert & Gouet, 2014). The source code written in Perl is provided as Dataset 1 (YeATSAM.tgz). A README is provided inside the top-level directory for installation and running the programs.

Results and discussion

The input to YeATSAM was ~111K Trinity-assembled transcripts (Combined TrinityFull.fasta) ( Figure 1). Each transcript was aligned to the WGS (wgs.5d.scafSeq200+.trimmed) using BLAST ( Camacho ). Transcripts that did not align to the WGS (cutoff BLAST bitscore=75) were excluded ( Chakraborty ). Those transcripts that aligned to the WGS (list.transcriptome.clean: 106K) were split into the three longest open reading frames (ORFs) (list.transcriptome.clean.ORFS: 320K). A BLAST database of protein peptides (plantpep.fasta: 1M sequences) using ~30 organisms (list.plants) from the Ensembl genome was created ( Kersey ). The availability of proteomes from related organisms accelerates the annotation. The BLAST results for list.transcriptome.clean.ORFS (320K) against ‘plantpep.fasta’ were processed using the cutoffs bitscore=60 and Evalue~=1E-10.
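
To make the ORF-splitting step concrete, the sketch below extracts the three longest ORFs from a nucleotide sequence. It is an approximation under stated assumptions: an ORF is taken to be any translated stretch between stop codons in one of the six reading frames, the `min_aa` threshold is invented for illustration, and the code does not reproduce the exact behaviour or options of the EMBOSS ‘getorf’ utility.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate in frame 0; unknown codons (e.g. containing N) become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def longest_orfs(nt, n=3, min_aa=30):
    """Return up to n longest stop-free peptide stretches over six frames.

    min_aa is a hypothetical length filter, not a YeATSAM parameter.
    """
    nt = nt.upper()
    pieces = []
    for strand in (nt, revcomp(nt)):
        for frame in range(3):
            for piece in translate(strand[frame:]).split("*"):
                if len(piece) >= min_aa:
                    pieces.append(piece)
    pieces.sort(key=len, reverse=True)
    return pieces[:n]


# Toy sequence (invented): a start codon, 40 alanine codons, and a stop codon
toy = "ATG" + "GCT" * 40 + "TAA"
print([len(p) for p in longest_orfs(toy)])
```

Each returned peptide would then be used as a BLAST query against the plant peptide database.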

Merging ORFs: broken transcripts

There are two instances in which ORFs can be merged to create a longer amino acid sequence. The first scenario occurs when a particular transcript has multiple ORFs that match the same protein with high significance, indicating that a sequencing or assembly error has broken a contiguous ORF ( Chakraborty ). In total, ~5% of the present transcripts (5,000 of 106,000) had two or more ORFs matching the same protein with high significance, mirroring the 5% error rate seen in transcripts restricted to the transcriptome from the tissue at the heartwood/sapwood transition zone in black walnut ( Chakraborty ). While most of these transcripts have repetitive elements, there were also non-repetitive sequences with this problem. C20727_G1_I1 is one example: it has two ORFs, ORF_15 and ORF_36, that match a DNA repair metallo-β-lactamase family protein (Accession number: XP007043420.1) with Evalues of 9E-70 and 6E-96, respectively ( Figure 2a). The two ORFs were merged (inserting the sequence ‘ZZZ’, although the length of the missing fragment is not known) since the Evalue of the combined ORF reduces to 2E-175, and the merged sequence was chosen as representative for the transcript. When the combined ORF did not significantly decrease the Evalue, the ORFs were not merged and the longer ORF was selected to represent the transcript.
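
The merge decision for ORFs from the same transcript could look roughly like the following. The text does not define what counts as a "significant" decrease in Evalue, so the `min_fold` threshold below is purely an assumption, as are the function name and the convention that `orf_a` lies upstream of `orf_b`; only the ‘ZZZ’ spacer and the fallback to the longer ORF come from the description above.

```python
def merge_same_transcript_orfs(orf_a, orf_b, evalue_a, evalue_b, evalue_merged,
                               min_fold=1e10, spacer="ZZZ"):
    """Join two ORFs of one transcript if the merged ORF scores much better.

    min_fold is a hypothetical threshold: the merged Evalue must be at least
    min_fold times smaller than the best individual Evalue.
    """
    best_single = min(evalue_a, evalue_b)
    if evalue_merged * min_fold <= best_single:
        # The spacer marks a gap of unknown length between the two fragments.
        return orf_a + spacer + orf_b
    # No significant gain: keep the longer ORF as the representative.
    return max(orf_a, orf_b, key=len)


# Evalues for C20727_G1_I1 are from the text; the peptide strings are placeholders.
print(merge_same_transcript_orfs("MKT", "LLPQ", 9e-70, 6e-96, 2e-175))  # prints "MKTZZZLLPQ"
```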
Figure 2.

Open reading frames (ORF) that can be merged.

( a) ORFs from the same transcript: C20727_G1_I1 has two ORFs (ORF 15 and ORF 36) matching to a DNA repair metallo-β-lactamase family protein (Accession number: XP007043420.1) with high significance. We merged the two ORFs (inserting ‘ZZZ’) since the Evalue of the combined ORF is significantly reduced. ( b) ORFs from different transcripts: We merged ORFs from two different transcripts (C53209_G8_I1 and C53209_G6_I1), since both transcripts map to the same scaffold (SUPER472), can be overlapped based on the sequence string ‘PNRSSLP’, and yield a merged ORF with a significantly reduced Evalue.

The other scenario occurs when the assembler fails to merge two transcripts into a single one. In this instance, two ORFs emanating from different transcripts with significant overlaps were merged. While the merging of two ORFs was described previously ( Chakraborty ), we introduced an additional filter to select mergeable ORFs based on whether the E-value obtained by merging the two ORFs is significantly reduced. For example, transcripts C53209_G8_I1 and C53209_G6_I1 both map to the scaffold SUPER472 and their corresponding ORFs can be merged based on the sequence string ‘PNRSSLP’ ( Figure 2b). The individual ORFs and the combined ORFs align to an autophagy-related protein (TAIR ID: AT3G49590.2) with Evalues 2e-106, 8e-63, and 1e-180, respectively. The increased significance of the combined ORF, in addition to other checks, like ensuring that mapping is to the same scaffold, adds further support to the fact that these transcripts should have been contiguous in the final assembled transcriptome.
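
The overlap-based merge of ORFs from different transcripts can be illustrated with a simple suffix-prefix search. This is a sketch under assumptions: the function names, the `min_overlap` default of 7 (the length of ‘PNRSSLP’), and the toy peptides are invented, and the subsequent Evalue check on the merged ORF is left out for brevity.

```python
def suffix_prefix_overlap(left, right, min_overlap=7):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0


def merge_across_transcripts(left, right, scaffold_left, scaffold_right, min_overlap=7):
    """Merge two ORFs only if both transcripts map to the same scaffold
    and the peptides share a sufficiently long overlap; otherwise return None."""
    if scaffold_left != scaffold_right:
        return None
    k = suffix_prefix_overlap(left, right, min_overlap)
    if k == 0:
        return None
    return left + right[k:]


# Toy peptides (invented) sharing the overlap string quoted in the text
left_orf = "MAAGTPNRSSLP"
right_orf = "PNRSSLPVVKE"
print(merge_across_transcripts(left_orf, right_orf, "SUPER472", "SUPER472"))
# prints "MAAGTPNRSSLPVVKE"
```

In a fuller version, the merged peptide would be re-scored by BLAST and kept only if its Evalue is significantly lower than those of the individual ORFs.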

Transcripts with multiple ORFs

About 3% of transcripts have ORFs that map to different proteins. Some transcripts should not have been merged. C1089_G1_I1 is an interesting example: a 4574 nt transcript that maps to the chloroplast and encodes two genes. One is highly variable and the other is conserved. The two ORFs, ORF_64 (fwd: 1117-2631) and ORF_108 (fwd: 3195-4271), map to maturase K (TAIR ID: ATCG00040.1) and photosystem II reaction center protein (TAIR ID: ATCG00020.1) with very high significance. Maturase K is a good candidate for barcoding angiosperms because it has highly variable coding sequences ( Yu ), while the photosystem II reaction center protein is completely conserved (100% similarity with Arabidopsis). Another example is C19241_G1_I1 (4702 nt), split into ORF_68 (fwd: 176-3487) and ORF_115 (reverse: 4509-4096), which encode a damaged DNA binding protein (TAIR ID: AT4G05420.1) and photosystem I subunit K (TAIR ID: AT1G30380.1), respectively, with high significance. These transcripts are split in the YeATSAM flow, resulting in one ORF per transcript. Subsequently, this artifact of the Trinity assembly led to several unannotated proteins in the MAKER-P flow.
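
Splitting a transcript so that each record carries exactly one ORF can be sketched as below. The coordinates and TAIR IDs for C1089_G1_I1 are taken from the text; the record layout, the function name, and the rule of assigning letter suffixes in order of position are assumptions made for illustration.

```python
from dataclasses import dataclass
from string import ascii_uppercase


@dataclass
class OrfHit:
    orf_id: str
    strand: str   # "fwd" or "rev"
    start: int    # nucleotide coordinates as printed in the text
    end: int
    protein: str  # best-matching reference protein


def split_transcript(transcript_id, hits):
    """Duplicate a transcript into one record per ORF.

    Returns (record_id, orf_id, protein, codons_spanned) tuples; the letter
    suffix scheme is a guess at a workable naming convention.
    """
    ordered = sorted(hits, key=lambda h: min(h.start, h.end))
    records = []
    for letter, hit in zip(ascii_uppercase, ordered):
        span_nt = abs(hit.end - hit.start) + 1
        records.append((f"{transcript_id}_{letter}", hit.orf_id, hit.protein, span_nt // 3))
    return records


hits = [
    OrfHit("ORF_64", "fwd", 1117, 2631, "ATCG00040.1"),   # maturase K
    OrfHit("ORF_108", "fwd", 3195, 4271, "ATCG00020.1"),  # photosystem II reaction center protein
]
for record in split_transcript("C1089_G1_I1", hits):
    print(record)
```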

Identifying genes not detected by either YeATSAM or MAKER-P

We compared the annotations of walnut by MAKER-P (walnut.wgs.5d.all.maker.proteins.fasta) and YeATSAM (DB.ORFBEST.60). MAKER-P and YeATSAM each failed to annotate several proteins identified by the other (MAKER-P missed ~4000; YeATSAM missed ~700). Although most of these unannotated proteins have repetitive sequences (transposable elements), some vital, non-repetitive proteins were excluded by each method. For example, an egg cell-secreted protein (‘WALNUT 00001389-RA’) ( Sprunck ), a Clavata3/esr-related gene (‘WALNUT 00023705-RA’) ( Kinoshita ) and a copper chaperone (‘WALNUT 00006344-RA’) ( Shin ) were not annotated in the YeATSAM flow. These genes do not have transcripts in the twenty tissues analyzed in the current study and are most likely pseudogenes.
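
One simple way to express such a comparison is as set arithmetic over the reference proteins each pipeline recovers. The text does not spell out how the two annotation sets were matched, so keying each annotation by its best-matching reference protein is an assumption of this sketch, and all identifiers in the example are placeholders.

```python
def compare_annotations(maker_hits, yeatsam_hits):
    """Compare two annotation sets keyed by best-matching reference protein.

    maker_hits / yeatsam_hits: dicts mapping an annotation ID to the ID of its
    best reference match (a hypothetical representation).
    """
    maker = set(maker_hits.values())
    yeatsam = set(yeatsam_hits.values())
    return {
        "shared": maker & yeatsam,
        "maker_only": maker - yeatsam,     # missed by YeATSAM
        "yeatsam_only": yeatsam - maker,   # missed by MAKER-P
    }


# Placeholder identifiers, for illustration only
maker_hits = {"gene_1": "REF_A", "gene_2": "REF_B"}
yeatsam_hits = {"orf_1": "REF_B", "orf_2": "REF_C"}
result = compare_annotations(maker_hits, yeatsam_hits)
print(sorted(result["maker_only"]), sorted(result["yeatsam_only"]))
# prints ['REF_A'] ['REF_C']
```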

Proteins unannotated by MAKER-P

MAKER-P fails to annotate many key photosystem-related proteins ( Table 1). The transcript C59245_G1_I1 has ORF_43 (fwd: 176-1714) and ORF_70 (fwd: 2212-2496) mapping to photosystem II reaction center protein B (PSBB) and photosystem II reaction center protein H (PSBH), respectively. While MAKER-P does annotate PSBB, it failed to detect PSBH. These proteins map to transcripts encoding two significant ORFs (E-value < 1E-10), suggesting that failure to handle such transcripts might have excluded these proteins. Also, these proteins are encoded by the chloroplast. However, this limitation of MAKER-P is not confined to transcripts emanating from the chloroplast. For example, C48031_G3_I1 encodes a leucine-rich repeat transmembrane protein kinase (TAIR ID: AT5G48940.1) and a metallo-β-lactamase family protein (TAIR ID: AT4G33540.1) and is mapped to scaffold ‘SUPER374’. MAKER-P failed to annotate the β-lactamase family protein.
Table 1.

Key photosystem-related proteins in the chloroplast not annotated by MAKER-P and YeATSAM.

These transcripts have multiple open reading frames (ORFs) mapping to different proteins with high significance. For example, C59245_G1_I1 has another ORF (43) which maps to photosystem II reaction center protein B (PSBB). MAKER-P annotates PSBB, but not PSBH. These transcripts all emanate from the chloroplast, although not all genes that MAKER-P failed to annotate were from the chloroplast. Genes predicted by MAKER-P that are not identified by YeATSAM are listed with their homology to corresponding genes in the TAIR database.

TRS | ORF | Len | TAIR | Description | E-value
C52274_G4_I1_B | 189 | 231 | ATCG00720.1 | PETB photosynthetic electron transfer B | 4.00E-155
C52274_G4_I1_C | 231 | 177 | ATCG00730.1 | PETD photosynthetic electron transfer D | 1.00E-108
C53854_G1_I1_A | 45 | 98 | ATCG00070.1 | PSBK photosystem II reaction center protein K precursor | 1.00E-27
C53854_G1_I1_B | 62 | 62 | ATCG00080.1 | PSBI photosystem II reaction center protein I | 3.00E-20
C54343_G2_I1_A | 8 | 91 | ATCG00580.1 | PSBE photosystem II reaction center protein E | 4.00E-54
C59245_G1_I1_B | 70 | 95 | ATCG00710.1 | PSBH photosystem II reaction center protein H | 4.00E-43
WALNUT_00014004-RA | - | 1117 | AT5G16850.1 | TERT Telomerase reverse transcriptase | 0.0
WALNUT_00018632-RA | - | 295 | ATMG00560.1 | RPL2 Nucleic acid-binding, OB-fold-like protein | 9e-152
WALNUT_00019747-RA | - | 326 | AT1G24040.1 | Acyl-CoA N-acyltransferases (NAT) superfamily protein | 5e-121
WALNUT_00031866-RA | - | 311 | AT5G07810.1 | SNF2 domain-containing protein/helicase domain-containing | 9e-115
WALNUT_00020600-RA | - | 155 | ATCG01240.1 | RPS7.2 ribosomal protein S7, chrC:140704-141171 | 1e-108
WALNUT_00016414-RA | - | 231 | AT5G41850.1 | alpha/beta-Hydrolases superfamily protein, chr5:16756698-16757791 | 6e-96
WALNUT_00027509-RA | - | 289 | AT2G43190.3 | ribonuclease P family protein, chr2:17956220-17957833 | 2e-94
WALNUT_00022174-RA | - | 389 | AT2G07707.1 | Plant mitochondrial ATPase, F0 complex, subunit | 5e-86
WALNUT_00018616-RA | - | 124 | ATCG00890.1 | NDHB.1 NADH-Ubiquinone/plastoquinone (complex I) | 1e-79
WALNUT_00007302-RA | - | 924 | AT5G14990.1 | BEST Arabidopsis thaliana protein match is: myosin | 2e-79

Furthermore, MAKER-P failed to annotate any FAD-binding berberine bridge enzymes (BBE) in the WGS ( Kutchan & Dittrich, 1995). These enigmatic enzymes are implicated in the transformation of (S)-reticuline to (S)-scoulerine during benzophenanthridine alkaloid biosynthesis in plants ( Winkler ). This pathway is over-expressed upon osmotic stress and pathogen attack ( Attila ; González-Candelas ), provides resistance in lettuce, sunflower and transgenic tobacco by generating anti-microbial compounds ( Custers ), and has unknown functions at specific developmental stages in Arabidopsis ( Irshad ; Pagnussat ). Moreover, it is expressed in floral nectar (Nectarin V, NtBBE) ( Carter & Thornburg, 2004) and roots of tobacco ( Kajikawa ), and in xylem sap of cabbage ( Ligat ) and grapevine ( Chakraborty ). NtBBE was constitutively expressed in the Phytophthora infestans-resistant potato genotype SW93-1015 ( Ali ). Benzophenanthridines are antimicrobial; the California poppy ( Eschscholzia californica) is used as a traditional medicine ( Cheney, 1963; Oldham ). Oral administration of the alkaloid berberine isolated from a Chinese herb lowered cholesterol in 32 hypercholesterolemic patients over three months ( Kong ). Berberine has also been shown to possess antidiabetic properties ( Lee ).
The number of BBE genes in different plant species varies significantly from one in moss ( Physcomitrella patens) to 64 in western poplar ( Populus trichocarpa) ( Daniel ). A. thaliana has 27 TAIR IDs assigned to BBE enzymes, with two splice variants ( Supplementary Table 1) ( Daniel ). Based on the current transcriptome, there are four full length BBE genes ( JrBBE1 to 4) that map to different scaffolds in the WGS, in addition to other fragmented transcripts ( Table 2 and Table 3). JrBBE1 (C54052_G1_I1) maps to the scaffold JCF7180001213852 and encodes a 564 aa long ORF, which has significant matches to Uniprot:P30986. The closest match of Uniprot:P30986 (with a low significance of 1E-07) to the MAKER-P annotation is ‘WALNUT 00019959-RA’, a 476 aa long cytokinin dehydrogenase. The sequence alignment of JrBBE genes to Uniprot (P30986) is shown ( Figure 3a).
Table 2.

FAD-binding berberine bridge enzymes (BBE) are undetected in MAKER-P.

These oxidases are involved in the benzophenanthridine alkaloid biosynthesis in plants. Arabidopsis has 27 loci for this family (and a splice variant) ( Table 3). Here, there are four full length berberine bridge enzyme (BBE) genes (named JrBBE1-4) identified using the transcriptome. Some of the proteins are truncated (like C54286_G1_I1), which might be an artifact of the Trinity assembler. Thus, this is not a complete enumeration of the JrBBE genes.

Id | Transcript | Length | Scaffold | ORF | TAIR Id
JrBBE1 | C54052_G1_I1 | 564 | JCF7180001213852 | 34 | AT1G26420.1
JrBBE2 | C53871_G1_I1 | 564 | JCF7180001217410 | 28 | AT1G30700.1
JrBBE3 | C55152_G1_I1 | 552 | JCF7180001222284:2429142-2890931 | 37 | AT4G20820.1
JrBBE4 | C7952_G1_I1 | 559 | JCF7180001218369 | 110 | AT2G34790.1
- | C54286_G2_I1 | 307 | JCF7180001217076 | 35 | AT1G11770.1
- | C54286_G1_I1 | 128 | JCF7180001217076 | 7 | AT4G20830.1
- | C12765_G1_I1 | 114 | JCF7180001218369 | 8 | AT4G20840.1
- | C51815_G1_I4 | 168 | JCF7180001218369 | 29 | AT4G20860.1
Table 3.

Expression counts (normalized) of transcripts from the FAD-binding berberine bridge enzyme (BBE) family.

The genes have tissue-specific expression - JrBBE3 is highly expressed in the roots and transition zone. The tissue abbreviations are from Chakraborty .

id | Transcript | normalized counts by tissue (CE CI CK EM FL HC HL HP HU IF LE LM LY PK PL PT RT SE TZ)
JrBBE1 | C54052_G1_I1 | 444136197
JrBBE2 | C53871_G1_I1 | 2321115791
JrBBE3 | C55152_G1_I1 | 4334256212351040346
JrBBE4 | C7952_G1_I1 | 32858551171115824113737123315420160217518
- | C54286_G2_I1 | 332030
- | C54286_G1_I1 | 19724
- | C12765_G1_I1 | 267723944282323195229286
- | C51815_G1_I4 |
Figure 3.

Multiple sequence alignment of BBE from walnut and other organisms.

( a) The JrBBE sequences were aligned to berberine bridge enzyme (BBE) genes from Eschscholzia californica (EcBBE; California poppy), Arabidopsis thaliana (AtBBE15) and Nicotiana tabacum (Nectarin V). Secondary structure information from the structure PDBid:3D2D ( E. californica) was used to annotate the sequences. The signal peptides are different in these proteins, suggesting different localization of these proteins in walnut. ( b) Phylogenetic tree generated from the multiple sequence alignment.

As with the walnut transcriptome, the chickpea transcriptome (transHybrid.fasta: n=34760) ( Garg ) was split into three ORFs, each of which was BLAST’ed to the subset of plant proteins created from the Ensembl database. Subsequently, the ORFs with significant homology to this database (n=29263) were BLAST’ed to the set of annotated chickpea proteins in the NCBI database (n=34198). Most of these annotations were done using Gnomon ( Souvorov ) ( http://www.ncbi.nlm.nih.gov/bioproject/PRJNA190909), which analyzed ~35000 transcripts. There are ~1500 proteins identified by YeATSAM that are absent in the NCBI database (Evalue cutoff 1E-10). Some of these proteins and their corresponding genes in the TAIR database are shown ( Table 4). TC00902 is an interesting example with two merged genes: a hydrogen ion-transporting ATP synthase (TAIR ID: ATMG00640.1) and a cytochrome C biogenesis (TAIR ID: ATMG00900.1). While Gnomon identified the cytochrome C biogenesis protein (Genbank: XP_004500083.1), it failed to identify the ATP synthase. Unlike MAKER-P, Gnomon generates transcripts through predictive algorithms and does not take the transcriptome as an input. Notwithstanding, these chickpea genes remain unannotated despite the presence of a straightforward method to detect them from available transcripts.
Table 4.

Selected genes in chickpea that are not annotated in the NCBI database.

Most of the NCBI genes were predicted using Gnomon. YeATSAM used the publicly available transcriptome from chickpea to identify these genes. The corresponding genes from the TAIR database are shown. Several transcripts (like TC20962) encode multiple genes, while others (like TC01181) have only one significant ORF. TRid, transcript id; TAIRid: Arabidopsis thaliana id.

TRid | TAIRid | Description | Evalue
TC20962 A | ATMG00070.1 | NAD9 NADH dehydrogenase subunit 9, chrM:23663-24235 | 3e-116
TC20962 B | AT2G07687.1 | Cytochrome c oxidase, subunit III, chr2:3311854-3312651 | 3e-107
TC20962 C | AT2G07674.1 | Unknown conserved protein, chr2:3269151-3269906 | 6e-41
TC01181 | ATMG01360.1 | COX1 cytochrome oxidase, chrM:349830-351413 | 0.0
TC11063 | AT3G30841.1 | Cofactor-independent phosphoglycerate mutase, chr3:12591595-12593401 | 0.0
TC06038 | ATMG00090.1 | Structural constituent of ribosome;protein binding, chrM:25482-28733 | 3e-124
TC13206 | AT3G13440.1 | S-adenosyl-L-methionine-dependent methyltransferases superfamily | 1e-118
TC07586 | AT2G07725.1 | Ribosomal L5P family protein, chr2:3448402-3448959 | 2e-113
TC19047 | ATMG00570.1 | Sec-independent periplasmic protein translocase | 8e-107
TC00902 B | ATMG00640.1 | Hydrogen ion transporting ATP synthases, rotational | 3e-104
TC15163 | AT4G28360.1 | Ribosomal protein L22p/L17e family protein, chr4:14029294-14030926 | 1e-100
TC13677 | AT5G05210.1 | Surfeit locus protein 6, chr5:1548198-1549534 | 9e-91
TC13780 A | AT2G07707.1 | Plant mitochondrial ATPase, F0 complex, subunit 8 protein | 2e-90
TC18786 | AT1G73440.1 | Calmodulin-related, chr1:27611418-27612182 | 5e-45

Future work

Among the ~700 genes not detected by YeATSAM, there are ~500 genes with no matches in the complete ‘nr’ database. Of these, ~300 have no transcripts (SetA), while the remaining ~200 have matches among the transcripts (SetB). Considering the sensitivity of RNA-seq and the wide coverage of twenty tissues, it is a definite possibility that SetA are pseudogenes. Future work in YeATSAM will focus on methods to distinguish these two classes of genes.
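
The two classes described above amount to a simple partition, sketched here under assumptions: the function name and the use of plain ID collections are hypothetical, and deciding whether a gene has an ‘nr’ match or a matching transcript is assumed to have been done beforehand (for example by BLAST).

```python
def partition_unmatched(gene_ids, nr_matched, transcript_supported):
    """Split genes without an 'nr' match into SetA (no transcript) and SetB (transcript found).

    gene_ids: genes predicted by MAKER-P but not detected by YeATSAM
    nr_matched: IDs with a match in the 'nr' database
    transcript_supported: IDs with a match among the assembled transcripts
    """
    no_nr = [g for g in gene_ids if g not in nr_matched]
    set_a = [g for g in no_nr if g not in transcript_supported]
    set_b = [g for g in no_nr if g in transcript_supported]
    return set_a, set_b


# Placeholder identifiers, for illustration only
genes = ["g1", "g2", "g3", "g4"]
set_a, set_b = partition_unmatched(genes, nr_matched={"g1"}, transcript_supported={"g3"})
print(set_a, set_b)  # prints ['g2', 'g4'] ['g3']
```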

Conclusions

The availability of an RNA-seq-derived transcriptome from a newly sequenced organism like walnut, for which there are related annotated genomes ( Arabidopsis, Vitis, etc.), immensely simplifies annotation of the genome and influences the choice of annotation software. Here, we introduce a new annotation method in the YeATS suite (YeATS Annotation Module - YeATSAM), which was used to annotate the newly-sequenced walnut genome using a simple workstation. The key differentiating factor in YeATSAM is the splitting of the assembled transcriptome into multiple ORFs ( Chakraborty ). Transcripts often have more than one significant ORF that must be handled differently depending on whether they map to the same or different proteins. We show that YeATSAM failed to annotate ~700 genes identified by MAKER-P, while identifying ~4000 genes missed by MAKER-P. While most of these genes have repetitive stretches, both methods missed vital genes identified by the other. Since many of the additional genes identified by MAKER-P have no known transcripts, we posit that these were identified using ab initio methods. In the absence of such an ab initio module in YeATSAM, we propose a combined method using both MAKER-P and YeATSAM to annotate the WGS. YeATSAM was also applied to the chickpea transcriptome and identified ~1000 proteins that are not annotated in the NCBI database. This transcriptome was assembled using Newbler v2.3 ( Garg ) and most of the 34198 chickpea proteins in the NCBI database were annotated using Gnomon, the standard annotation tool ( http://www.ncbi.nlm.nih.gov/genome/guide/gnomon.shtml).

Software availability

Latest source code: https://github.com/sanchak/YeATSAM

Archived source code at time of publication: DOI: 10.5281/zenodo.165992 (Sanchak, 2016)

License: GNU General Public License

The article presents an annotation method, YeATSAM, that leverages the information contained in RNA-Seq derived transcriptomes. The method was compared with two other annotation methods using two organisms: MAKER-P (an RNA evidence-based and ab initio hybrid method) with walnut, and NCBI Gnomon (a homology-based and ab initio hybrid method) with chickpea. Although YeATSAM and MAKER-P identified some of the same genes, there were also genes that were identified by only one of them (about 4,000 by YeATSAM and about 700 by MAKER-P), as well as genes that both methods failed to identify. Similarly, YeATSAM identified about 1,000 genes that Gnomon failed to identify.

The article is well written, the analysis is technically sound, the tables and figures present the results well, and the conclusions are supported by the data. Nonetheless, I would suggest the following changes: Address discrepancies in the numbers reported, e.g., 20 tissues (in abstract, introduction, results, and future work) v. 19 tissues (in methods: 15 samples + four additional samples); 700 (in results) v. ~700 (in future work and conclusions); ~1,500 chickpea proteins (in results) v. ~1,000 (in conclusions). Instead of approximate values, report actual values. As the tool is designed to be used with other organisms (besides walnut and chickpea), make the method and workflow (Figure 1) independent of any organism (e.g., the input to YeATSAM is the genome sequence rather than the walnut genome). Figure 3 (b) can be resized without losing its readability.

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

This work focuses on a current major challenge in improving genome and transcriptome automated annotation. 
It also deals with difficulties derived from imperfect de novo assemblies, such as transcripts representing fused and split genes. The increasing affordability of generating sequencing data increases the demand for more powerful annotation prediction tools and pipelines, although exact annotations will still require wet-lab experimentation. This paper compares the YeATSAM tool to previously annotated genomes, in which existing de novo assemblies are used as generated and analyzed with blast, interproscan or similar tools for homology-based annotation.

Even though the paper indicates novelty of the method, there are critical points that need modification. The method reported here - YeATSAM - is not clearly different from the work already reported in a previous paper [1]. The method reported here (identify 3 longest ORFs, then blast to known proteins, then merge or split if needed) looks identical to that previously published in F1000 Research [1] - for instance, Figure 1 in the previously published paper is an almost identical replica of the Figure 1 in this paper. The current work does appropriately cite this previous paper. However, if there is a novel algorithm to describe here, it needs to be clearly delineated from this previous work. Otherwise, it should just be cited. This previous publication also compares the annotation of the walnut genome by YeATS and MAKER-P. The previous paper and this paper both profile walnut transcripts where ORFs were merged and transcripts that match multiple proteins; this paper does use different transcripts to demonstrate the methodology and results. To emphasize the novelty of the present paper, the authors should clarify exactly what this paper offers in addition to the previous paper. In this regard, the paper does go a bit further than the previous one by detailing genes that were unannotated by MAKER-P but found via this method; those genes were not reported previously. 
If the algorithm has not changed from the previous work, a new focus for this paper is needed, possibly reporting these novel genes such as the BBEs.

The addition of the chickpea genome annotation is barely described - a single short results paragraph. The author also has an existing F1000 research article describing the use of YeATS on chickpea transcripts and describing the detection of missed genes and describing multiple ORFs mapping to different proteins and fragmented ORFs of the same protein [2]. How does this report differ from that one? That one is not cited in this report.

Data reproducibility and accessibility - the new annotations are not made available for either walnut or chickpea (unless they are the same as the ones provided already in Chakraborty et al. 2015 [1]). It would be very difficult to replicate this experiment. No parameters or commands are provided to determine how PHYML, ClustalW or ENDscript server were utilized. I confirmed that YeATSAM.zip (listed as YeATSAM.tgz in manuscript) with README is available for download and the links to data are functional. I was unable to install YeATSAM; the installation and usage instructions are very vague.

Specifics: The joined results of MAKER-P and YeATSAM look promising for improving genome annotations. However, a figure or table describing the total number of genes predicted by each software and the overlap would be very helpful to visualize the results. The report commonly has words like “several” or “many” and the usage of “~” in front of numbers. Numbers should be reported exactly where they are important to the method and results. Examples: “A BLAST database of protein peptides (plantpep.fasta: 1M seqeunces) using ~30 organisms (list.plants)” - also list.plants does not link to anything. 
“About 3% of transcripts have ORFs that map to different proteins”

“MAKER-P and YeATSAM each failed to annotate several proteins identified by the other (MAKER-P=~4000; YeATSAM=700)”

“Among the ~700 genes not detected by YeATSAM, there are ~500 genes with no matches in the complete ‘nr’ database. Of these, ~300 have no transcripts (SetA), while the remaining ~200 have matches among the transcripts (SetB).”

Based on the content of the manuscript, the introduction focuses adequately on explaining the problem of annotating newly assembled genomes and transcriptomes. However, a deeper introduction to the software utilized may be relevant for a better understanding of their choice and also of their basic mechanics. In relation to the results commented on in the introduction, the relevance of some of the selected genes is not clear. Specifically, the relevance of the three “critical” proteins not detected by YeATSAM, which are not transcribed and are thus considered pseudogenes, is confusing.

In relation to the generation of de novo assemblies, the authors should provide detail on how the assemblies were combined, considering that the simple addition of libraries would lead to high redundancy. For the walnut genome, were the MAKER-P and YeATSAM packages using the same set of RNA-seq reads? This would be an important point to emphasize - a true comparison of the two methods would preferably use the same starting point. The original walnut paper reports using 19 libraries (Martinez-Garcia et al. 2016); this paper reports 20 libraries.

In the results and discussion section, the manipulation of ORFs is an interesting concept, although the difference from the methodology described in Chakraborty et al. 2015 [1] is not clear. The use of the term ORF is confusing here since it appears that the merged sequences are the encoded peptides, while ORFs are nucleotide sequences. 
Moreover, it seems likely the ORFs from the same gene might match different proteins because they are being compared to 30 different organisms. The ORFs could match to the orthologs of the gene in question from different organisms (i.e., they have different matches to database entries, both orthologs, but they are legitimately from the same gene). In this case, merging is the best avenue, but the software would actually split the transcript apart. Was this seen in some transcripts? Also, when referring to significance with similar proteins, values should be provided.

The authors mention that many genes unannotated by MAKER-P have repetitive stretches. What types of repetitive stretches? There is no methodology given for this analysis. This needs to be described/explained. In regards to the sentence “Although most of these unannotated proteins have repetitive sequences (transposable elements)” - does that mean the unannotated proteins originate from within transposable elements, or transposable elements have inserted into the gene itself? The authors do not address the overall differences in proteins detected by each annotation program - is there a pattern that may explain these? Pseudogenes are mentioned twice, but this idea is not fully explained. While 20 tissues will capture many genes, it is probably not exhaustive – is there any additional evidence these “genes” are actually pseudogenes such as premature stop codons or frameshift mutations?

Minor: This sentence needs improved clarity: “The BLAST results of list.transcriptome.clean.ORFS: 320K on ‘plantpep.fasta’ was processed using a cutoff: bitscore=60, Evalue~=1E-10” Heading “Transcripts with multiple ORFs” - the section above also deals with transcripts with multiple ORFs. This heading could be clarified. Some revision on the writing would improve readability. Abbreviations are recommended to be properly specified at first use in the manuscript and always in figures. Also, numbers and units should be spaced. 
In relation to the language, the authors are advised to review the use of scientific English, as well as verb tense consistency. In Table 1, the main line indicates proteins not annotated by either program while the last line indicates listing of genes predicted by MAKER-P. These two sentences in the same caption lead to confusion. In addition, sizing and description of other figures might be improved.

We have read this submission. We believe that we have an appropriate level of expertise to state that we do not consider it to be of an acceptable scientific standard, for reasons outlined above.

In this paper the authors investigate a new annotation method in the YeATS suite (YeATS Annotation Module - YeATSAM), which was used to annotate the newly-sequenced walnut genome using a simple workstation. In YeATSAM the assembled transcriptome is split into multiple ORFs. They show that YeATSAM failed to annotate ~700 genes identified by MAKER-P, while identifying ~4000 genes missed by MAKER-P. While most of these genes have repetitive stretches, both methods missed important genes identified by the other. Since many of the additional genes identified by MAKER-P have no known transcripts, the authors suggest that these were identified using ab initio methods. In the absence of such an ab initio module in YeATSAM, they propose a combined method using both MAKER-P and YeATSAM to annotate the WGS.

This work is very interesting because the results probe the adequacy of this new annotation method. In general, the presentation is clear and the conclusions are adjusted to the results obtained. The figures and tables are also clear. Some comments are listed below: In the abstract, please change the order in the “Results and Conclusions” part, from lines 17 to 21. Consider mentioning first “YeATSAM used a […] chickpea transcriptome assembled using Newbler v2.3” and then that “1000 genes were identified, which were not previously annotated by Gnomon annotation tool”. 
The fourth and fifth paragraphs of the Introduction could be moved to the Discussion, leaving only a few short sentences about this in the Introduction. In the fifth line of the Methods section, correct “seqeunces”. Please consider explaining the “Future work” section further.

We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
  49 in total

1.  DEGseq: an R package for identifying differentially expressed genes from RNA-seq data.

Authors:  Likun Wang; Zhixing Feng; Xi Wang; Xiaowo Wang; Xuegong Zhang
Journal:  Bioinformatics       Date:  2009-10-24       Impact factor: 6.937

2.  Paranoid potato: phytophthora-resistant genotype shows constitutively activated defense.

Authors:  Ashfaq Ali; Laith Ibrahim Moushib; Marit Lenman; Fredrik Levander; Kerstin Olsson; Ulrika Carlson-Nilson; Nadezhda Zoteyeva; Erland Liljeroth; Erik Andreasson
Journal:  Plant Signal Behav       Date:  2012-03-01

3.  Shotgun proteomic analysis of yeast-elicited California poppy (Eschscholzia californica) suspension cultures producing enhanced levels of benzophenanthridine alkaloids.

Authors:  John T Oldham; Marina Hincapie; Tomas Rejtar; P Kerr Wall; John E Carlson; Carolyn W T Lee-Parsons
Journal:  J Proteome Res       Date:  2010-09-03       Impact factor: 4.466

4.  Characterization and mechanism of the berberine bridge enzyme, a covalently flavinylated oxidase of benzophenanthridine alkaloid biosynthesis in plants.

Authors:  T M Kutchan; H Dittrich
Journal:  J Biol Chem       Date:  1995-10-13       Impact factor: 5.157

Review 5.  Nutritional quality and health benefits of chickpea (Cicer arietinum L.): a review.

Authors:  A K Jukanti; P M Gaur; C L L Gowda; R N Chibbar
Journal:  Br J Nutr       Date:  2012-08       Impact factor: 3.718

6.  RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics.

Authors:  Marc Lohse; Anthony M Bolger; Axel Nagel; Alisdair R Fernie; John E Lunn; Mark Stitt; Björn Usadel
Journal:  Nucleic Acids Res       Date:  2012-06-08       Impact factor: 16.971

7.  Bridger: a new framework for de novo transcriptome assembly using RNA-seq data.

Authors:  Zheng Chang; Guojun Li; Juntao Liu; Yu Zhang; Cody Ashby; Deli Liu; Carole L Cramer; Xiuzhen Huang
Journal:  Genome Biol       Date:  2015-02-11       Impact factor: 13.583

8.  CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors:  Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal:  Bioinformatics       Date:  2012-10-11       Impact factor: 6.937

9.  A new picture of cell wall protein dynamics in elongating cells of Arabidopsis thaliana: confirmed actors and newcomers.

Authors:  Muhammad Irshad; Hervé Canut; Gisèle Borderies; Rafael Pont-Lezica; Elisabeth Jamet
Journal:  BMC Plant Biol       Date:  2008-09-16       Impact factor: 4.215

10.  Deep RNA-Seq profile reveals biodiversity, plant-microbe interactions and a large family of NBS-LRR resistance genes in walnut (Juglans regia) tissues.

Authors:  Sandeep Chakraborty; Monica Britton; P J Martínez-García; Abhaya M Dandekar
Journal:  AMB Express       Date:  2016-02-17       Impact factor: 3.298
