Piotr Madanecki1, Magdalena Bałut1, Patrick G Buckley2, J Renata Ochocka1, Rafał Bartoszewski1, David K Crossman3, Ludwine M Messiaen4, Arkadiusz Piotrowski1.
Abstract
High-throughput technologies generate a considerable amount of data, which often requires bioinformatic expertise to analyze. Here we present the High-Throughput Tabular Data Processor (HTDP), a platform-independent Java program. HTDP works on any character-delimited column data (e.g. BED, GFF, GTF, PSL, WIG, VCF) from multiple text files and supports merging, filtering and converting of data produced in the course of high-throughput experiments. HTDP can also use itemized sets of conditions from external files for complex or repetitive filtering/merging tasks. The program is intended to aid global, real-time processing of large data sets through a graphical user interface (GUI); therefore, no prior expertise in programming, regular expressions, or command-line usage is required of the user. Additionally, no a priori assumptions are imposed on the internal file composition. We demonstrate the flexibility and potential of HTDP in real-life research tasks involving microarray and massively parallel sequencing data, namely identification of disease-predisposing variants in next-generation sequencing data and comprehensive concurrent analysis of microarray and sequencing results. We also show the utility of HTDP in technical tasks including data merging, reduction and filtering with external criteria files. HTDP was developed to provide functionality that is missing or rudimentary in other GUI software for processing character-delimited column data from high-throughput technologies. Its flexibility in input file handling gives it long-term potential in high-throughput analysis pipelines, as the program is not limited to currently existing applications and data formats. HTDP is available as open source software (https://github.com/pmadanecki/htdp).
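To make the idea of filtering with an external criteria file concrete, here is a minimal Python sketch. It is not HTDP code (HTDP is a GUI-driven Java program); the function names `load_criteria` and `filter_rows`, the one-value-per-line criteria layout, the `#` comment character, and the sample data are all assumptions made for illustration.

```python
import csv
import io


def load_criteria(handle):
    """Collect accepted values, one per line, skipping blanks and '#' comments."""
    values = set()
    for line in handle:
        value = line.strip()
        if value and not value.startswith("#"):
            values.add(value)
    return values


def filter_rows(handle, column, accepted, delimiter="\t"):
    """Yield rows (as dicts) whose value in `column` is in `accepted`."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    for row in reader:
        if row[column] in accepted:
            yield row


# Hypothetical in-memory stand-ins for a tab-delimited data file
# and an external criteria file.
data = io.StringIO(
    "chrom\tpos\tgene\n"
    "chr17\t100\tNF1\n"
    "chr17\t200\tTP53\n"
    "chr22\t300\tNF2\n"
)
criteria = io.StringIO("# genes of interest\nNF1\nNF2\n")

kept = list(filter_rows(data, "gene", load_criteria(criteria)))
print([row["gene"] for row in kept])
```

Keeping the accepted values in a set makes each row check a constant-time lookup, which matters when the criteria list is long and the same list is reused across many files.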
Year: 2018 PMID: 29432475 PMCID: PMC5809091 DOI: 10.1371/journal.pone.0192858
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. Typical HTDP workflow diagram with main features.
Summary of the main HTDP features.
| Feature | Note |
|---|---|
| Horizontal merging of files | Based on genome location (many location formats and join options are supported); horizontal joining of files based on one or multiple columns (max 4) is also supported. |
| Vertical merging of files (by append) | The files may be in different formats. Columns with the same name are merged. |
| Filtering data by list of values | |
| Operations on samples | One to four user-selectable columns indicate the sample name. Filtering on the content of the other columns that is present in a user-defined percentage of samples is possible. |
| Developed as a Java application | Platform independence: works on all supported architectures and operating systems without emulators. |
| Flexible input file(s) format | Any character-delimited column data file is accepted (including VCF with headers and non-canonical entries in the INFO field). User-defined delimiter and comment characters are allowed for input files. Missing data values are permitted and can be denoted by a user-specified symbol. Column/row names are not mandatory and can be assigned automatically. Multiple files with different delimiters and column/row composition can be opened and combined into one data set (limited by available RAM; see footnote b). |
| Many export formats | Output files are saved as tab-delimited text files. Additionally, export/conversion to the following formats is supported: BED, BED detail, PSL, GFF, Personal Genome SNP, ENCODE RNA elements: BED6 + 3 scores, ENCODE narrowPeak: Narrow (or Point-Source) Peaks, ENCODE broadPeak: Broad Peaks (or Regions), ENCODE gappedPeak: Gapped Peaks (or Regions), ENCODE peptideMapping: BED6+4. |
a This feature is shown in Example 1, which is intended for illustrative purposes. The example contains sets of one-stage tasks demonstrating the most useful and unique features of HTDP on a simple artificial data set (S2 File).
b The Random Access Memory (RAM) requirement is approximately 1 GB per 50 MB of input file size; however, the exact values may vary depending on the specific query composition and internal complexity of the input files. Sample test results and additional information are provided in the user manual (S1 File, page 6).
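The rule of thumb in footnote b can be turned into a quick back-of-the-envelope calculation. The sketch below assumes the relationship is simply linear, which the footnote itself cautions is only approximate; the function name `estimate_ram_gb` and the example file sizes are hypothetical.

```python
def estimate_ram_gb(input_size_mb, gb_per_50_mb=1.0):
    """Rough RAM estimate in GB, assuming about 1 GB per 50 MB of input.

    The linear scaling is an assumption; actual usage depends on the
    query composition and the internal complexity of the input files.
    """
    if input_size_mb < 0:
        raise ValueError("input size must be non-negative")
    return input_size_mb / 50.0 * gb_per_50_mb


# Hypothetical input sizes, chosen only to illustrate the arithmetic.
for size_mb in (50, 200, 750):
    print(f"{size_mb} MB input -> about {estimate_ram_gb(size_mb):.1f} GB RAM")
```

By this estimate, a 200 MB input would need roughly 4 GB of RAM, so the combined size of all opened files is worth checking before loading them into one data set.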