Javier Quilez, Enrique Vidal, François Le Dily, François Serra, Yasmina Cuartero, Ralph Stadhouders, Thomas Graf, Marc A Marti-Renom, Miguel Beato, Guillaume Filion.
Abstract
T47D_rep2 and b1913e6c1_51720e9cf were two Hi-C samples. They were born and processed at the same time, yet their fates were very different. The life of b1913e6c1_51720e9cf was simple and fruitful, while that of T47D_rep2 was full of accidents and sorrow. At the heart of these differences lies the fact that b1913e6c1_51720e9cf was born under a lab culture of Documentation, Automation, Traceability, and Autonomy, and compliance with the FAIR Principles. Their lives are a lesson for those who wish to embark on the journey of managing high-throughput sequencing data.
Keywords: FAIR Principles; bioinformatics; high-throughput sequencing; management and analysis best practices
Year: 2017 PMID: 29048533 PMCID: PMC5714127 DOI: 10.1093/gigascience/gix100
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Table 1: Challenges associated with the management and analysis of high-throughput sequencing data
| Challenge | Impact | Consideration |
|---|---|---|
| Poor sample description | Prevents data processing and quality control; incorrect analysis and results; lack of reproducibility; delays publication | Metadata collection |
| Unsystematic sample naming | Duplicated or similar names; ambiguous identification; precludes computational treatment; data disclosure | Sample identifier scheme |
| Untidy data organization | Data cannot be found; time consumption; inability to automate searches | Structured and hierarchical data organization |
| Yet another analysis | Repeated manual execution of analyses; inability to deconvolute analyses producing different results; compulsory linear execution | Scalability, parallelization, automatic configuration, and modularity |
| Undocumented procedures | Poor understanding of results; irreproducibility; hampers catching errors | Documentation |
| Data overflow | No access to data; size and number of files make individual inspection inefficient | Interactive web applications |
What went wrong with the T47D_rep2 sample? Its description and metadata were not collected, digitized, and stored in a central repository; orphaned by any sample identification scheme, it received a duplicated name; and why this and previous samples were generated, where their data were located, and how and with which methods they were processed were never documented. As told through the lives of b1913e6c1_51720e9cf and T47D_rep2, managing and analyzing the growing amount of sequencing data presents several challenges. This table details their impact on scientific quality and proposes considerations to address them.
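The sample identifier scheme suggested in the table can be made fully automatic. The identifiers in the story (e.g., b1913e6c1_51720e9cf) look like truncated hashes, so here is a minimal sketch assuming an ID is derived by hashing metadata fields; the field names and hashing choices are illustrative, not the authors' actual scheme:

```shell
#!/usr/bin/env bash
# Hypothetical metadata fields for one sample; the real scheme may differ.
user="paul"; cell_line="T47D"; treatment="untreated"; assay="Hi-C"

# Core ID from sample-level fields; full ID adds assay-level fields.
# Truncating an MD5 digest yields short, hard-to-collide names.
core=$(printf '%s' "${user}_${cell_line}_${treatment}" | md5sum | cut -c1-9)
full=$(printf '%s' "${user}_${cell_line}_${treatment}_${assay}" | md5sum | cut -c1-9)
echo "${core}_${full}"
```

Because the identifier is a pure function of the metadata, no human ever has to invent a name, and the same sample always maps to the same identifier.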
Figure 1: A traceable life for b1913e6c1_51720e9cf. (a) The metadata for b1913e6c1_51720e9cf were collected via an online Google Form and stored both online (Google Sheet) and in a local SQL database. A good metadata collection system should be (i) short and easy to complete, (ii) instantly accessible to authorized users, and (iii) easy to parse for humans and computers. (b) b1913e6c1_51720e9cf was sequenced along with other samples, whose raw sequencing data were located in a directory named after the date of the sequencing run. There one could find the FASTQ files containing the sequencing reads from b1913e6c1_51720e9cf, as well as information about their quality; no modified, subsetted, or merged FASTQ file was stored, ensuring that analyses always started from the very same set of reads. In a first step, the raw data of b1913e6c1_51720e9cf were processed with the Hi-C analysis pipeline, which created a “b1913e6c1_51720e9cf” directory at the same level as all other processed Hi-C samples. “b1913e6c1_51720e9cf” had multiple subdirectories that stored the files generated in each step of the pipeline, the logs of the programs, and the integrity verifications of key files. Moreover, such subdirectories accounted for variations in the analysis pipelines (e.g., genome assembly version, aligner) so that data were not overwritten. In a second step, processed data from b1913e6c1_51720e9cf and other samples were used to perform the downstream analyses that Chloe requested from Paul. Within the directory he allocated to her analyses, Paul created a new one called “2017-03-08_hic_validation” containing a description of the analysis, the scripts used, and the tables and figures generated.
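The hierarchy Figure 1b describes comes down to a few directory conventions. A sketch with illustrative names (the run date, subdirectory names, and the assembly/aligner tag are assumptions, not the authors' exact tree):

```shell
#!/usr/bin/env bash
base=$(mktemp -d)                 # stand-in for the lab's data root
run_date="2017-02-15"             # hypothetical sequencing-run date
sample="b1913e6c1_51720e9cf"

# Raw data live under the run date and are never modified afterwards.
mkdir -p "$base/raw/$run_date"

# Each processed sample gets its own directory; a per-variant subdirectory
# (genome assembly + aligner) keeps reprocessed data from overwriting old runs.
for step in fastqs_processed mapped_reads matrices logs checksums; do
  mkdir -p "$base/hic/$sample/$step/hg38_bwa"
done

# Downstream analyses are dated and named, like Paul's for Chloe.
mkdir -p "$base/analysis/chloe/2017-03-08_hic_validation"
```

With a fixed layout like this, finding any file reduces to knowing the sample ID and the pipeline step, which is what makes automated searches possible.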
Figure 2: Automating the analysis and visualization of b1913e6c1_51720e9cf data. (a) Scalability, parallelization, automatic configuration, and modularity of analysis pipelines. Paul launched the Hi-C pipeline for hundreds of samples with a single command (gray rectangle): the submission script (“*.submit.sh”) generated as many pipeline scripts as samples listed in the configuration file (“*.config”). The configuration file also contained the hard-coded parameters shared by all samples, such as the maximum running time, which Paul underestimated for some samples. Processing hundreds of samples was relatively fast because (i) the pipeline script for each sample was submitted as an independent job in the computing cluster, where it was queued (orange) and eventually executed in parallel (green), and (ii) the pipeline code in “*seq.sh” was adapted to run on multiple processors. For further automation, each process retrieved sample-specific information (e.g., species, read length) from the metadata SQL database; in addition, metrics generated by the pipeline (e.g., running time, number of aligned reads) were recorded into the database. Because the pipeline code was grouped into modules, Paul could easily re-run the “generate_matrix” module for the samples that failed in his first attempt. (b) Interactive web application to visualize Hi-C data. b1913e6c1_51720e9cf alone generated ∼70 files of plots and text when passed through the Hi-C pipeline. Inspecting them might have seemed a daunting task for Chloe: she did not feel comfortable navigating the cluster and lacked the skills to manipulate the files anyway; even if she had, examining so many files for dozens of samples seemed endless. Luckily for her, Paul had developed an interactive web application with R Shiny (Table 2) that allowed her to visualize data and metadata and perform specific analyses in a user-friendly manner.
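The one-command submission in Figure 2a boils down to a loop over a sample list: the submit script writes one self-contained pipeline script per sample and hands each to the scheduler. A minimal sketch, where the config format, file names, the second sample ID, and the commented-out qsub call are all assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical config file: one sample identifier per line.
printf '%s\n' "b1913e6c1_51720e9cf" "a7f03d2b4_9c81e0aa3" > hic.config

mkdir -p jobs
while read -r sample; do
  # One script per sample; parameters shared by all samples would come
  # from the config, sample-specific ones from the metadata SQL database.
  cat > "jobs/${sample}.sh" <<EOF
#!/usr/bin/env bash
bash hic_pipeline.seq.sh --sample ${sample}
EOF
  # qsub "jobs/${sample}.sh"   # each sample queued as an independent job
done < hic.config
```

Because every sample is an independent job, the cluster runs them in parallel, and a failed module (such as “generate_matrix”) can be re-run for just the affected samples without touching the rest.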
Table 2: Tools used in the story
| Tool | Usage |
|---|---|
| Docker | Interoperability |
| Docker Hub | Repository for Docker containers |
| GEO | Repository for high-throughput genomics data |
| GitHub | Version control and backup of code |
| Google Forms and Sheets | Online collection and display of metadata |
| Jupyter Notebook | Document procedures and perform analysis |
| R Shiny | Deploy web applications |
| R Studio | Document procedures and perform analysis |
Note that the Jupyter Notebook and R Studio environments are not well suited to analyses that run for a long time and/or require heavy computational power. We therefore recommend them as a way to document how data are processed (even if long or heavy analyses are executed elsewhere) and to perform downstream analyses (e.g., summarizing, plotting) once the long-running ones are done.