
An efficient annotation and gene-expression derivation tool for Illumina Solexa datasets.

Parsa Hosseini, Arianne Tremblay, Benjamin F Matthews, Nadim W Alkharouf.

Abstract

BACKGROUND: An Illumina flow cell with all eight lanes occupied produces well over a terabyte of images, with gigabytes of reads following sequence alignment. The ability to translate such reads into meaningful annotation is therefore of great concern and importance. One can very easily be flooded with such a great volume of textual, unannotated data, irrespective of read quality or size. CASAVA, an optional analysis tool for Illumina sequencing experiments, provides INDEL detection, SNP information, and allele calling. Extracting from such analysis a measure of gene expression in the form of tag-counts, and furthermore annotating the reads, is therefore of significant value.
FINDINGS: We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid tag-counting and annotation software tool specifically designed for Illumina CASAVA sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the homology-based functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations.
CONCLUSIONS: TASE is a powerful tool to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both homology-based annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset. TASE is specially designed to translate sequence data in a CASAVA-build into functional annotations while producing corresponding gene expression measurements. Such analysis is executed in an ultrafast and highly efficient manner, whether for a single-read or a paired-end sequencing experiment. TASE is a user-friendly and freely available application, allowing rapid analysis and annotation of any given Illumina Solexa sequencing dataset with ease.

Year:  2010        PMID: 20598141      PMCID: PMC2908109          DOI: 10.1186/1756-0500-3-183

Source DB:  PubMed          Journal:  BMC Res Notes        ISSN: 1756-0500


Background

In one run, the Illumina Solexa Genome Analyzer II sequencer produces over 50 billion nucleotides of DNA sequence data [1]. The sequencer can be used to sequence genomes as well as DNA reverse transcribed from RNA to provide gene expression information. As the read length of Illumina Solexa sequencing increases, mainly due to advancements in its chemistry, so too does the volume of data generated from sequencing experiments. What may have taken months to sequence many years ago now takes days, with the additional bonus of unprecedented genome depth. However, such rapid turnaround time comes with its own set of challenges. First, terabytes of storage space are required for the resultant data, and high-powered computing infrastructure is required to extract and make sense of it [2,3]. Furthermore, analysis of less commonly sequenced organisms, such as plants, including fruits and vegetables, is not supported by Illumina's GenomeStudio [4], making post-sequencing analysis even more challenging. With Solexa sequencing, the output from the sequencer is initially in the form of .tiff (Tagged Image File Format) images [2]. These images go through a pipeline known as the GenomeAnalyzer (Illumina, Inc), developed specifically for performing three major functions: image analysis, base-calling and genome alignment. Alternatives to the GenomeAnalyzer do exist, such as Swift [5]. By the end of the pipeline, the GenomeAnalyzer has aligned the sequenced reads to a reference genome, with accompanying DNA sequence quality scores [2]. Third-party tools also exist which map sequenced reads onto a reference genome [6,7]. An optional fourth component, CASAVA, takes the newly generated GenomeAnalyzer alignments and performs SNP detection, allele calling and INDEL detection, amongst many other features [2].
From this analysis, a CASAVA-build is produced, containing the sequenced DNA reads separated into folders representing the chromosome on which they are located. The CASAVA-build is compatible with Illumina's GenomeStudio software package, where it can be visualized in greater depth, giving deeper insight into features such as INDELs, SNP information, exon splice variants and junctions. However, the genomes of many organisms lack the prerequisite files needed for compatibility with GenomeStudio. Such compatibility is determined by whether the necessary organism-specific prerequisite files are available on the UCSC Genome Browser [8]. The CASAVA-build organizes and stores reads in directories which represent the chromosomes of the sequenced organism [1]. The directories are further divided into 10 megabase increments, such that the reads found within a given 10 megabase genomic range are placed in the corresponding sub-folder [2]. Manually organizing DNA reads within the build is error-prone, since every chromosome is represented by a directory containing additional sub-folders for the reads in each 10 megabase window. Human error can be eliminated by an automated method that stores all the reads of a chromosome in a single file. Therefore, knowing that each chromosome is represented by a directory, a viable approach to eliminating user error is to traverse the sub-folders of the chromosome's directory and concatenate all the sequenced DNA reads into a single file. This file contains all the reads found in the chromosome's directory, without the need for numerous sub-folders and additional files. Using publicly available genome and functional annotations, sequenced reads are then iteratively annotated.
Following suit, a measure of gene expression known as tag-counting is employed, which calculates the number of sequenced DNA reads found within functionally annotated regions. Herein, we propose TASE, or Tag counting and Analysis of Solexa Experiments, a database-driven Java GUI which accomplishes this by performing read concatenation, tag-counting and the analysis of Illumina datasets in an ultrafast and highly efficient manner. It is especially useful for organisms with genomes not supported by Illumina GenomeStudio.

Methods

Implementation

TASE is written in Java and uses the Java Swing user-interface library. We chose Java and Swing for their ease of use and robustness in developing user-interface applications. TASE uses the Microsoft SQL Server database management system [9] as a data store for both the chromosomes in the given lane and the annotation files for the sequenced organism. TASE interfaces with SQL Server using the jTDS JDBC driver [10], a fast Java database driver used to enable the calculation of tag-counts and the derivation of functional annotations. TASE also graphically represents chromosomal reads per lane using the JFreeChart graphing library [11].

Concatenation of reads

TASE analysis is divided into two distinct yet closely related phases: DNA read concatenation for each chromosome per selected lane of interest, followed by gene expression calculation using tag-count measurements and homology-based annotation. To initiate analysis, a successfully generated CASAVA-build must first be present. Within this build, the 'export' directory contains folders for all the chromosomes of the sequenced organism, and its contents drive the analysis [2]. Once a CASAVA-build is defined, the contents of the 'export' folder are recursively traversed, iterating through all the sub-folders which represent chromosomes. In doing so, all the DNA reads for a given chromosome are appended to that chromosome's own file. The number of reads across all of a chromosome's sub-folders therefore equals the number in the respective chromosome file. The index of each read, which signifies its location within the chromosome, is appended alongside the DNA sequence; this proves crucial in the later stage of deriving functional annotations and calculating tag-counts. Other properties, such as the Illumina Solexa hardware ID, the direction of the sequence (forward or reverse), and the flow cell lane number, are also saved to the file. Bar graphs illustrating the number of DNA reads per chromosome are produced for all lanes selected for analysis (Figure 1).
Figure 1

Reads per chromosome. The number of sequenced reads per CASAVA folder is concatenated, calculated and visually presented before annotation is performed. Dataset from Tremblay, A. (2010).
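
The concatenation phase described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not TASE's Java implementation: the folder names (chr01, 0000, 0010), the file name reads.txt and the four-column read layout (index, sequence, direction, lane) are hypothetical, since the exact CASAVA file layout is not given in the text.

```python
import os
import tempfile


def concatenate_reads(chrom_dir, out_path):
    """Copy every non-empty line found under chrom_dir into out_path.

    Returns the number of lines (reads) written. out_path must live outside
    chrom_dir, otherwise the walk would pick up the output file itself.
    """
    written = 0
    with open(out_path, "w") as out:
        for root, dirs, files in os.walk(chrom_dir):
            dirs.sort()  # sorting in place fixes the order os.walk descends in
            for name in sorted(files):
                with open(os.path.join(root, name)) as handle:
                    for line in handle:
                        if line.strip():
                            out.write(line.rstrip("\n") + "\n")
                            written += 1
    return written


def make_demo_build(base):
    """Create a tiny made-up build: one chromosome folder, two window sub-folders."""
    windows = {
        "0000": ["1200\tACGTACGT\tF\t2", "8100\tTTGACCAA\tR\t2"],
        "0010": ["10000450\tGGCATCGA\tF\t2"],
    }
    chrom_dir = os.path.join(base, "export", "chr01")
    for window, reads in windows.items():
        os.makedirs(os.path.join(chrom_dir, window))
        with open(os.path.join(chrom_dir, window, "reads.txt"), "w") as handle:
            handle.write("\n".join(reads) + "\n")
    return chrom_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        chrom_dir = make_demo_build(base)
        out_path = os.path.join(base, "chr01_2.txt")
        print(concatenate_reads(chrom_dir, out_path), "reads written")
```

The returned count doubles as the per-chromosome read total that a bar graph such as Figure 1 would plot, and it can be compared against the sum over sub-folders as a sanity check.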


Measuring gene expression using tag-counts, and functional annotation

Two tab-delimited text files are required to initiate tag-count analysis and functional annotation, respectively: 1) Genomic start and end sites: this file must contain the genomic start and end sites for the genes of the sequenced organism. The base-pair ranges are used to perform tag-count analysis. 2) Homology-based annotations: this file must contain annotations corresponding to the genomic start and end sites. The annotations are used to assign gene functional annotations based on homology. Both files serve a critical role in analysis: gene expression relies on counting the number of DNA sequences that fall within the range of the start and end sites of a gene, i.e. tag-counting. TASE takes the two user-defined files and performs table-querying between them, producing a joined table containing the start and the end of the translated portion of the gene (ORF), as well as the functional annotation pertaining to that genomic range. There must therefore be attributes common to the two files to enable successful table-joining (Figure 2); otherwise both tag-count analysis and gene annotation will produce inaccurate output. Such annotation files are readily available for many organisms in public repositories, such as organism-specific databases. For example, the files representing the functional annotations and the defined gene-encoding regions for Glycine max (soybean) were both found on the DOE JGI Glycine max ftp [12]. An experimental dataset for use in TASE was obtained from Tremblay et al. [13].
Figure 2

Defining the necessary columns for tag-counting and analysis. There must be a column in both files which have like values. Above, such a column is 'Accession' and 'AccessionNumber'.
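
The role of the shared key column can be illustrated with a small Python sketch. It is not TASE's own code: the gene identifiers, coordinates and annotation strings are made up, and only the column names 'Accession' and 'AccessionNumber' are taken from Figure 2. The join is an inner join, so a range with no matching key value is dropped, which mirrors why mismatched keys would lead to incomplete or inaccurate output.

```python
import csv
import io

# Hypothetical file contents; real files would be opened from disk instead.
RANGES_TSV = (
    "Accession\tChromosome\tStart\tEnd\n"
    "Gene0001\tchr01\t1000\t2500\n"
    "Gene0002\tchr01\t9000\t9800\n"
    "Gene0003\tchr02\t400\t1200\n"
)
ANNOTATIONS_TSV = (
    "AccessionNumber\tAnnotation\n"
    "Gene0001\tputative kinase\n"
    "Gene0003\thypothetical protein\n"
)


def join_on_key(ranges_file, annotations_file, ranges_key, annotations_key):
    """Inner-join two tab-delimited tables on differently named key columns."""
    annotations = {
        row[annotations_key]: row
        for row in csv.DictReader(annotations_file, delimiter="\t")
    }
    joined = []
    for row in csv.DictReader(ranges_file, delimiter="\t"):
        match = annotations.get(row[ranges_key])
        if match is None:
            continue  # no shared key value, so this range cannot be annotated
        merged = dict(row)
        merged.update({k: v for k, v in match.items() if k != annotations_key})
        joined.append(merged)
    return joined


rows = join_on_key(io.StringIO(RANGES_TSV), io.StringIO(ANNOTATIONS_TSV),
                   "Accession", "AccessionNumber")
for row in rows:
    print(row["Accession"], row["Start"], row["End"], row["Annotation"])
```

Here Gene0002 has a genomic range but no annotation row, so it does not appear in the joined result.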

Distinct column names must be present in the first line of both files, because they are used in both tag-counting and annotation derivation. Upon selection of the two files, a dialog is presented which is divided into two halves, one for each of the required files. A total of six selections are made, determining which columns from the first line of the two files represent the start site and end site index, the keys (shared between the two files), the chromosome and, finally, the functional annotations (Figure 2). After the necessary columns are selected, a dialog is presented to enable a connection to an SQL Server instance, prompting for the server username and password. The dialog also prompts for the server instance as well as the driver class of the SQL Server JDBC driver. By default, the driver class for jTDS is inserted automatically. After a successful login, a database is created using Java and jTDS, named after the TASE project name entered when TASE was first started. Using the jTDS JDBC driver, both the gene-encoding ranges file and the functional annotations file are then bulk-imported into the newly created database. All the files representing the chromosomes, concatenated earlier, are also bulk-imported into the same database. Each file, whether it represents a chromosome or one of the user-defined files, has its contents stored in its own physical table. Database bulk-upload time will vary depending on processor speed and system specifications. Once the upload is complete, a dialog appears listing all the chromosome files which were uploaded. Clicking any chromosome name within this list initiates both tag-count calculation and functional annotation derivation.
Upon such a click, SQL code is automatically generated which interacts with the jTDS driver and SQL Server to execute the analysis for the given chromosome. The following algorithm serves as the basis for both tag-count analysis and functional-annotation derivation:

    for each chromosome selected for analysis:
        extract and store its DNA read indices
        tag count = number of times RNA-Seq reads are found between each annotation's start and end sites
        retrieve the homology-based annotation for the corresponding tag-count, based on the columns specified as shared between the two files
        continue
    write output to file

For any selected chromosome, the resultant output is saved as a tab-delimited text file named using the notation {chromosome}_{lane #}.txt. The files are saved in the 'output' folder of the TASE project directory created while running TASE. Generated output is also displayed in tabs, making it possible to view the top 50 annotations sorted by tag-count (Figure 3). Furthermore, the output file contains all the columns of both the functional annotations and gene-encoding region files, with the addition of tag-count measurements to signify gene-expression values per annotation.
Figure 3

Generated output. Resultant output is displayed visually and saved locally.
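
A database-driven tag count of the kind described in the Methods can be sketched with Python's built-in sqlite3 module. This is a stand-in for illustration only: TASE itself uses SQL Server through jTDS, and the SQL it generates is not shown in the text. The table names, column names and rows below are hypothetical. The sketch also assumes that a read is counted when its index falls within the inclusive start and end range, which the source does not specify.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gene_ranges (accession TEXT, chromosome TEXT,
                              start_site INTEGER, end_site INTEGER);
    CREATE TABLE annotations (accession_number TEXT, annotation TEXT);
    CREATE TABLE chr01_reads (position INTEGER, sequence TEXT);
""")

# One table per imported file, filled in bulk (made-up rows).
conn.executemany("INSERT INTO gene_ranges VALUES (?, ?, ?, ?)", [
    ("Gene0001", "chr01", 1000, 2500),
    ("Gene0002", "chr01", 9000, 9800),
])
conn.executemany("INSERT INTO annotations VALUES (?, ?)", [
    ("Gene0001", "putative kinase"),
    ("Gene0002", "hypothetical protein"),
])
conn.executemany("INSERT INTO chr01_reads VALUES (?, ?)", [
    (1200, "ACGT"), (1999, "TTGA"), (2500, "GGCA"), (5000, "CCAT"), (9100, "AGTC"),
])

# Join ranges to annotations on the shared key, then count reads per range.
# The LEFT JOIN keeps annotated genes that received no reads (tag count 0).
TAG_COUNT_SQL = """
    SELECT g.accession, a.annotation, COUNT(r.position) AS tag_count
    FROM gene_ranges AS g
    JOIN annotations AS a ON a.accession_number = g.accession
    LEFT JOIN chr01_reads AS r
           ON r.position BETWEEN g.start_site AND g.end_site
    WHERE g.chromosome = 'chr01'
    GROUP BY g.accession, a.annotation
    ORDER BY tag_count DESC, g.accession
"""
results = conn.execute(TAG_COUNT_SQL).fetchall()
for accession, annotation, tag_count in results:
    print(accession, annotation, tag_count, sep="\t")
```

The read at position 5000 lies outside every annotated range, so it contributes to no tag count; this corresponds to the "reads without annotation" category.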


Findings

TASE has high computational efficiency, both in terms of analysis time and tag-counting. To measure performance, we used soybean (Glycine max) data from a run in which all eight lanes of the Illumina flow cell were used [13] (Table 1).
Table 1

Number of reads per lane

Chromosome                Chrom size (bp)      Lane1      Lane2      Lane3      Lane4      Lane5      Lane6      Lane7      Lane8
1                              55,915,595      44087      56460     144310     125574    CONTROL      73721      77785      90255
2                              51,656,713      58000      74229     190042     165634    CONTROL      96910     102557     117524
3                              47,781,076      50677      64715     164496     142812    CONTROL      84010      88910     101738
4                              49,243,852      50726      64908     165724     144360    CONTROL      84645      89600     102509
5                              41,936,504      44080      56686     144930     125480    CONTROL      73587      78373      89551
6                              50,722,821      61200      77619     199609     173280    CONTROL     100638     106572     122915
7                              44,683,157      50451      64728     165386     144206    CONTROL      84790      89232     102316
8                              46,995,532      75392      95643     245844     214476    CONTROL     125271     133156     152202
9                              46,843,750      48079      61227     157808     138132    CONTROL      80633      84986      97566
10                             50,969,635      61068      77456     198859     172297    CONTROL     101868     107309     122479
11                             39,172,790      60306      77010     195521     170430    CONTROL     100450     105876     120835
12                             40,113,140      43131      55977     141515     123466    CONTROL      72493      76784      88282
13                             44,408,971      72442      92603     235446     204937    CONTROL     119710     126250     146480
14                             49,711,204      49088      63027     161172     140375    CONTROL      82670      86971      99934
15                             50,939,160      57040      73027     186523     162741    CONTROL      95461     100196     114936
16                             37,397,385      41074      52851     134554     117510    CONTROL      68635      72143      82947
17                             41,906,774      56466      72499     186013     161964    CONTROL      94513      99838     113529
18                             62,308,140      59460      76866     195838     170495    CONTROL     100206     105912     121360
19                             50,589,441      45240      57579     146309     127167    CONTROL      74770      79226      90339
20                             46,773,167      46146      58890     150392     131338    CONTROL      76533      80716      92677
Reads aligned to genome                    1,074,153  1,374,000  3,510,291  3,056,674          -  1,791,514  1,892,392  2,170,374
Reads with annotations                       640,467    851,595  2,297,371  1,970,284          -  1,101,214  1,150,125  1,363,662
Reads without annotation                     433,686    522,405  1,212,920  1,086,390          -    690,300    742,267    806,712

Lanes 1 and 2 had one pM of cDNA, lanes 3 and 4 had 4 pM of cDNA while lanes 6, 7 and 8 had 2 pM of cDNA. The number of reads is roughly proportional to the cDNA concentration. The number of reads per lane which aligned to the Soybean genome is provided. The number of reads which had functional annotation is also provided. Figure 1 ratifies the textual data pertaining to lane 2 in this table.

TASE was executed using the soybean genome build 1.0 [14]. A Python script was developed to extract the DNA sequence out of files representing individual chromosomes. Functional annotations and gene locations were retrieved from the DOE JGI Glycine max website [12]. The soybean genome is approximately 1115 megabases [15,16], and 7 of the 8 flow cell lanes had well over one million reads [13]. Lane 5 is an Illumina control [13]. For the other lanes, all reads contained Asian Soybean Rust (ASR), skewing the actual number of soybean-only DNA reads [13]. Regardless, TASE analyzed all eight lanes with ease, handling the analysis of a single lane in no more than 7 minutes (Table 2). All tests were run on a personal notebook with a dual-core 2 gigabyte CPU, 4 gigabytes of RAM, Windows 7 and SQL Server 2008 Developer Edition.
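
The summary rows of Table 1 can be cross-checked with a short calculation. The sketch below uses only the per-lane totals printed in Table 1 (lane 5, the control, is omitted). It confirms that annotated plus unannotated reads equal the aligned total in each lane, and derives the annotated fraction, a quantity the table implies but does not state.

```python
# Per-lane totals copied from the last three rows of Table 1.
aligned = {1: 1074153, 2: 1374000, 3: 3510291, 4: 3056674,
           6: 1791514, 7: 1892392, 8: 2170374}
annotated = {1: 640467, 2: 851595, 3: 2297371, 4: 1970284,
             6: 1101214, 7: 1150125, 8: 1363662}
unannotated = {1: 433686, 2: 522405, 3: 1212920, 4: 1086390,
               6: 690300, 7: 742267, 8: 806712}

fractions = {}
for lane in sorted(aligned):
    # Each aligned read is either annotated or not, so the two must add up.
    consistent = annotated[lane] + unannotated[lane] == aligned[lane]
    fractions[lane] = annotated[lane] / aligned[lane]
    print(f"lane {lane}: consistent={consistent}, "
          f"annotated fraction={fractions[lane]:.3f}")
```

In every lane, between about 59% and 66% of the aligned reads fall within an annotated range.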
Table 2

Performance testing TASE

#/lanes analyzed  Specific lane(s)  #/chromosomes w/reads  Total #/reads  Read concatenation (min:sec)  Tag counting, annotation (min:sec)
1                 2                 20                     1,374,000      0:59                          3:17
1                 4                 20                     3,056,674      2:02                          4:56
4                 1,3,7,8           80                     8,647,210      8:04                          11:52
8                 Entire flow cell  160                    14,869,398     12:13                         15:55

Numerous tests were performed to measure the efficiency of TASE using datasets of varying sizes.

Performance was tested using data from 4 of the 8 lanes of the flow cell. All lanes had well over 1 million reads, with a read length of 39 base-pairs (Table 2). Regardless of the sheer number of reads, TASE performed read concatenation, annotation and tag-counting in less than 20 minutes (Table 2). Analysis time is, however, proportional to genome size, so analysis times will vary for organisms with larger or smaller genomes. The analysis time for one lane was no more than 7 minutes (Table 2). As additional lanes are added to the workload, the time needed for concatenation, tag-counting and annotation increases in a linear fashion. In a traditional Illumina sequencing experiment, one lane is usually dedicated as a control [2]. Because this lane contains minimal DNA reads, TASE analyzes it in a matter of seconds, potentially cutting the tag-counting and annotation time by several minutes or more.
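
The timings in Table 2 can be turned into total run times and throughput with a few lines of arithmetic. The sketch below uses only the values printed in Table 2; the reads-per-second figure is a derived quantity added here for illustration and is not reported by the authors.

```python
def to_seconds(min_sec):
    """Convert a 'min:sec' string such as '11:52' into seconds."""
    minutes, seconds = min_sec.split(":")
    return int(minutes) * 60 + int(seconds)


# (label, total reads, read concatenation, tag counting and annotation)
TABLE_2 = [
    ("lane 2", 1374000, "0:59", "3:17"),
    ("lane 4", 3056674, "2:02", "4:56"),
    ("lanes 1,3,7,8", 8647210, "8:04", "11:52"),
    ("entire flow cell", 14869398, "12:13", "15:55"),
]

summary = []
for label, reads, concat, tagging in TABLE_2:
    total = to_seconds(concat) + to_seconds(tagging)
    summary.append((label, total, reads / total))
    print(f"{label:17s} total {total // 60}:{total % 60:02d}  "
          f"{reads / total:,.0f} reads/s")
```

The single-lane totals come to 4:16 and 6:58, matching the "no more than 7 minutes" figure, and the four-lane run totals 19:56, which appears to be the run behind the "less than 20 minutes" statement. The entire flow cell totals 28:08.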

Conclusions

We developed TASE (Tag counting and Analysis of Solexa Experiments), a rapid, GUI-based tag-counting and annotation software tool specifically designed for Illumina sequencing datasets. Developed in Java and deployed using the jTDS JDBC driver and a SQL Server backend, TASE provides an extremely fast means of calculating gene expression through tag-counts while annotating sequenced reads with the gene's presumed function, from any given CASAVA-build. Such a build is generated for both DNA and RNA sequencing. Although TASE is developed for Windows operating systems with SQL Server, its packaged jTDS JDBC driver provides compatibility with Sybase database management systems on non-Windows operating systems. Analysis is broken into two distinct components: DNA sequence or read concatenation, followed by tag-counting and annotation. The end result is output containing the functional annotation and the respective gene expression measure, signifying how many times sequenced reads were found within the genomic ranges of functional annotations. TASE is a powerful GUI tool, requiring no command-line prompt, intended to facilitate the process of annotating a given Illumina Solexa sequencing dataset. Our results indicate that both functional annotation and tag-count analysis are achieved in very efficient times, allowing researchers to delve deep into a given CASAVA-build and maximize information extraction from a sequencing dataset.

Availability

Project name: TASE (Tag counting and Analysis of Solexa Experiments)
Project homepage: http://sourceforge.net/projects/tase/
Operating Systems: Windows
Programming languages: Java SE 1.6, Java Swing
Other requirements: Microsoft SQL Server 6.5, 7, 2000, 2005, 2008
License: GNU General Public License v3 (GPLv3)

Abbreviations

CASAVA: Consensus Assessment of Sequence and Variation.

Competing interests

The authors have no competing interests. This work was funded in part by the United Soybean Board under grant 7258. Mention of trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the United States Department of Agriculture.

Authors' contributions

PH developed the TASE application, graphical user interface and application logic. NA and BM supervised the work and reviewed the manuscript, application and algorithm. AT tested TASE and provided advice on the algorithm. All authors read and approved the final manuscript.
  4 in total

1.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors:  Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal:  Genome Biol       Date:  2009-03-04       Impact factor: 13.583

2.  Genome sequence of the palaeopolyploid soybean.

Authors:  Jeremy Schmutz; Steven B Cannon; Jessica Schlueter; Jianxin Ma; Therese Mitros; William Nelson; David L Hyten; Qijian Song; Jay J Thelen; Jianlin Cheng; Dong Xu; Uffe Hellsten; Gregory D May; Yeisoo Yu; Tetsuya Sakurai; Taishi Umezawa; Madan K Bhattacharyya; Devinder Sandhu; Babu Valliyodan; Erika Lindquist; Myron Peto; David Grant; Shengqiang Shu; David Goodstein; Kerrie Barry; Montona Futrell-Griggs; Brian Abernathy; Jianchang Du; Zhixi Tian; Liucun Zhu; Navdeep Gill; Trupti Joshi; Marc Libault; Anand Sethuraman; Xue-Cheng Zhang; Kazuo Shinozaki; Henry T Nguyen; Rod A Wing; Perry Cregan; James Specht; Jane Grimwood; Dan Rokhsar; Gary Stacey; Randy C Shoemaker; Scott A Jackson
Journal:  Nature       Date:  2010-01-14       Impact factor: 49.962

3.  Managing and analyzing next-generation sequence data.

Authors:  Brent G Richter; David P Sexton
Journal:  PLoS Comput Biol       Date:  2009-06-26       Impact factor: 4.475

4.  Swift: primary data analysis for the Illumina Solexa sequencing platform.

Authors:  Nava Whiteford; Tom Skelly; Christina Curtis; Matt E Ritchie; Andrea Löhr; Alexander Wait Zaranek; Irina Abnizova; Clive Brown
Journal:  Bioinformatics       Date:  2009-06-23       Impact factor: 6.937
