Francesco Venco, Yuriy Vaskin, Arnaud Ceol, Heiko Muller.
Abstract
BACKGROUND: Life-science laboratories make increasing use of Next Generation Sequencing (NGS) to study bio-macromolecules and their interactions. Array-based methods for measuring gene expression or protein-DNA interactions are being replaced by RNA-Seq and ChIP-Seq. Sequencing is generally performed by specialized facilities that have to keep track of sequencing requests, trace samples, ensure quality, and make data available according to predefined privileges. An integrated tool helps to troubleshoot problems, maintain a high quality standard, and reduce time and costs. Commercial and non-commercial tools called LIMS (Laboratory Information Management Systems) are available for this purpose. However, they often come at prohibitive cost and/or lack the flexibility and scalability needed to adjust seamlessly to the frequently changing protocols employed. To manage the flow of sequencing data produced at the Genomic Unit of the Italian Institute of Technology (IIT), we developed SMITH (Sequencing Machine Information Tracking and Handling).
Year: 2014 PMID: 25471934 PMCID: PMC4255740 DOI: 10.1186/1471-2105-15-S14-S3
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Infrastructure, main tasks, and architecture. A) Infrastructure: Sequencing is performed on an Illumina HiSeq2000 instrument. Data are stored on an Isilon mass storage device and processed on a Sun Grid Engine High Performance Computing cluster (SGE-HPC). Application servers run web applications for genome browsing, data listings, and the SMITH LIMS, and host the MySQL information tier. The user data directories are organized by group leader name, user login name, file type, and run date. B) Sample tracking in SMITH. A sample passes through four states ("requested", "queued", "confirmed", "analysed"). Submitted samples have status "requested". When a sample is added to the virtual flow cell, its status changes to "queued". Upon confirmation by the group leader, the status changes to "confirmed". The sample is then run and analysed by the workflow engine and assumes the final status "analysed". HPC refers to a high performance computing cluster. C) Architecture of the workflow unit. Generated commands invoke Galaxy workflows that subsequently call the un-pluggable core. Some of the tools (proprietary tools and scripts) can remain on the Galaxy side, while the others (open-source tools) are moved to the core.
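The sample life cycle in panel B can be read as a small linear state machine. The following Python sketch only illustrates that reading; the names `SampleStatus` and `advance` are hypothetical and are not taken from the SMITH code base, which may implement the transitions differently.

```python
from enum import Enum


class SampleStatus(Enum):
    """The four sample states named in the Figure 1B caption."""
    REQUESTED = "requested"
    QUEUED = "queued"
    CONFIRMED = "confirmed"
    ANALYSED = "analysed"


# Forward transitions as described in the caption (hypothetical encoding).
_NEXT = {
    SampleStatus.REQUESTED: SampleStatus.QUEUED,    # added to the virtual flow cell
    SampleStatus.QUEUED: SampleStatus.CONFIRMED,    # confirmed by the group leader
    SampleStatus.CONFIRMED: SampleStatus.ANALYSED,  # run and analysed by the workflow engine
}


def advance(status: SampleStatus) -> SampleStatus:
    """Return the next status, or raise if the sample is already in its final state."""
    if status not in _NEXT:
        raise ValueError(f"no transition from final status {status.value!r}")
    return _NEXT[status]
```

Calling `advance` three times on `SampleStatus.REQUESTED` walks a sample through to `SampleStatus.ANALYSED`; a fourth call raises `ValueError`, since the caption describes "analysed" as the final status.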
Figure 2. Data model. The data model of the SMITH database is shown. The user table is omitted to avoid crossing of table connections. The image was generated using MySQL Workbench software.
Description of database tables.
| Table | Description |
|---|---|
| | Includes name and surname, phone number, email, etc. Passwords are not stored in the database but are provided either by a Lightweight Directory Access Protocol (LDAP) server or by a file realm. |
| | Represents the biological sample of a sequencing experiment. Most attributes are used to set up the sequencing machine. A new sample is created for each new sequencing request. |
| | Contains the parameters characterizing a sequencing run: read length, read mode, and sequencing depth. These parameters have been combined into a set of predefined recipes. This approach makes it easier for the user to choose appropriate parameters and reduces the number of possible applications, which in turn facilitates sequencing diverse samples in the same sequencing run. |
| | Contains all the sequencing barcode indices used in the laboratory. Users who prepare their own sequencing library must provide information about the sequencing barcode indices. |
| MultipleRequest | Links samples that are requested together. The web interface allows more than one sample to be requested at the same time. |
| | Groups the samples into projects. A project is associated with a list of users (collaborators). The project creator can set special permissions for collaborators to view or modify the information regarding specific samples. |
| | Connects each sample to custom attributes and values. This approach enriches each sample with specific meta-data that can be used to search for specific samples and for statistical analyses. Note that all the tables connected to the sample table, which represent the results of the sequencing and the subsequent analyses, are thereby connected to the meta-data. |
| | Represents the run of the sequencing machine for a specific sample, connected to the sequencing reagents used. Many samples can run together and are then connected by the same run_id. |
| | Keeps track of the FASTQ files produced. It stores the paths to the files and the samples and runs that originated the data. |
| | Stores the algorithm and the reference genome used, as well as the path to the resulting aligned data (in BAM format). |
| | Stores the analysis steps that follow the alignment. Many algorithms take the output of a preceding step as input. Thus, the table contains a one-to-many reference to itself. |
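
Two of the patterns described in the table, per-sample custom attributes and an analysis table that references itself, can be sketched as follows. This is a minimal illustration and not the SMITH schema: it uses Python's built-in `sqlite3` module instead of MySQL, and every table name, column name, and data value below is a hypothetical placeholder.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL tier (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- Custom attribute/value pairs attached to a sample (meta-data).
CREATE TABLE attribute_value (
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    attribute TEXT NOT NULL,
    value     TEXT NOT NULL
);
-- Post-alignment analysis steps; input_step_id points back into the
-- same table, so one step can feed many later steps.
CREATE TABLE analysis_step (
    id            INTEGER PRIMARY KEY,
    sample_id     INTEGER NOT NULL REFERENCES sample(id),
    algorithm     TEXT NOT NULL,
    input_step_id INTEGER REFERENCES analysis_step(id)
);
""")

# Made-up example rows, for illustration only.
conn.execute("INSERT INTO sample (id, name) VALUES (1, 'sample_A')")
conn.executemany(
    "INSERT INTO attribute_value (sample_id, attribute, value) VALUES (?, ?, ?)",
    [(1, "tissue", "liver"), (1, "antibody", "H3K4me3")],
)
conn.executemany(
    "INSERT INTO analysis_step (id, sample_id, algorithm, input_step_id) "
    "VALUES (?, ?, ?, ?)",
    [(1, 1, "peak_calling", None),
     (2, 1, "peak_annotation", 1),
     (3, 1, "motif_search", 1)],
)


def samples_with(attribute, value):
    """Names of samples carrying the given custom attribute/value pair."""
    rows = conn.execute(
        "SELECT s.name FROM sample s "
        "JOIN attribute_value av ON av.sample_id = s.id "
        "WHERE av.attribute = ? AND av.value = ? ORDER BY s.name",
        (attribute, value),
    )
    return [row[0] for row in rows]


def downstream_steps(step_id):
    """Algorithms of the steps that take the given step's output as input."""
    rows = conn.execute(
        "SELECT algorithm FROM analysis_step WHERE input_step_id = ? ORDER BY id",
        (step_id,),
    )
    return [row[0] for row in rows]
```

In this sketch, `samples_with("tissue", "liver")` searches samples by meta-data, and `downstream_steps(1)` lists the steps that consume the output of step 1, which is the one-to-many self-reference the last table row describes.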