Abstract
Text formats are common in bioinformatics: they can be edited and filtered with standard tools and, because they are often human readable, they allow manual inspection and evaluation of the data. Bioinformatics is a rapidly evolving field, so new techniques, new software tools and new kinds of data often require the definition of new formats. Often, new formats are not formally described in a standard or specification document. Although software libraries are available for accessing the most common formats, writing parsers for text formats for which no library is currently available is a very common, though tedious, task performed by many researchers in the field. This manuscript presents the open source software library and toolset TextFormats (available at https://github.com/ggonnella/textformats), which aims at simplifying the definition and parsing of text formats. Format specifications are written in a simple data description format using an interactive wizard. Automatic generation of data examples and automatic testing of specifications allow for checking their correctness. Given the specification for a text format, TextFormats allows parsing and writing data in that format, using several programming languages (Nim, Python, C/C++) or the provided command line and graphical user interface tools. Although TextFormats is designed as general purpose software, its main target application field, for the reasons mentioned above, is expected to be bioinformatics; thus, the specifications of several common existing bioinformatics formats are included.
Year: 2022 PMID: 35617194 PMCID: PMC9135226 DOI: 10.1371/journal.pone.0268910
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752
Fig 1. View of a TFSL specification as a tree.
An example of a TextFormats specification in YAML format (left) and the information contained in the specification viewed as a tree (right).
Kinds of datatype definitions in the Text Formats Specification Language.
| Structure | Group | Definition key | Description |
|---|---|---|---|
| Scalar | Discrete values | | only one value is valid |
| Scalar | Discrete values | | value must be an element of a set |
| Scalar | Regular expressions | | string matching a regular expression |
| Scalar | Regular expressions | | string matching one of a set of regular expressions (optionally associated with different data transformation rules) |
| Scalar | Numeric intervals | | signed base-10 integer in a specified interval |
| Scalar | Numeric intervals | | unsigned base-2, -8, -10 or -16 integer in a specified interval |
| Scalar | Numeric intervals | | floating point number in a specified open or closed interval |
| Compound | Unordered sequences | | list of elements, each with the same datatype or one of a set of datatypes, not depending on the element position in the list |
| Compound | Unordered sequences | | list of elements, each associated with a string label (in a given set), defining the semantics and datatype of the element |
| Compound | Unordered sequences | | list of elements, each associated with two string labels, defining, respectively, the semantics and the datatype of the element |
| Compound | Ordered sequences | | ordered set of elements, each with a possibly different datatype |
| Scalar/compound | Alternatives | | multiple alternative valid representations |
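To make the kinds of definition in this table more concrete, the following is a minimal Python sketch that models a few of them as validator functions. It is an illustration only, not the TextFormats implementation or its API: all function names (`constant`, `one_of_values`, `matching`, `signed_integer`, `list_of`, `alternatives`) are hypothetical and are not TFSL definition keys.

```python
import re


def constant(expected):
    """Discrete values: only one value is valid."""
    return lambda text: text == expected


def one_of_values(allowed):
    """Discrete values: the value must be an element of a set."""
    allowed = frozenset(allowed)
    return lambda text: text in allowed


def matching(pattern):
    """Regular expressions: the whole string must match the pattern."""
    compiled = re.compile(pattern)
    return lambda text: compiled.fullmatch(text) is not None


def signed_integer(minimum, maximum):
    """Numeric intervals: signed base-10 integer in [minimum, maximum]."""
    def check(text):
        if re.fullmatch(r"[+-]?[0-9]+", text) is None:
            return False
        return minimum <= int(text) <= maximum
    return check


def list_of(check, separator):
    """Compound: list whose elements all have the same datatype."""
    return lambda text: all(check(item) for item in text.split(separator))


def alternatives(*checks):
    """Alternatives: the text is valid if any representation accepts it."""
    return lambda text: any(check(text) for check in checks)


# Hypothetical field: either a placeholder "." or an integer score.
score = alternatives(constant("."), signed_integer(0, 1000))
print(score("."), score("500"), score("abc"))
```

Composing small validators in this way mirrors the tree view of a specification: compound and alternative definitions are inner nodes, and scalar definitions are leaves.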
Time for compilation of TFSL specifications from YAML files and loading of pre-compiled specifications.
| Format | Compile only | Compile and save pre-compiled | Load pre-compiled |
|---|---|---|---|
| Accessions | 0.02 s | 0.02 s | |
| NCBI ID | 0.03 s | 0.03 s | |
| Fasta | | 0.01 s | 0.01 s |
| FastQ | 0.01 s | 0.01 s | |
| SAM | | 0.21 s | 0.27 s |
| EGC | 0.23 s | 0.24 s | |
| GFA1 | | 0.31 s | 0.45 s |
| GFA2 | 1.28 s | 1.40 s | |
| GFF3 | 1.79 s | 1.79 s | |
Bold font indicates the fastest time for obtaining the specification: loading a pre-compiled specification or compiling the YAML specification.
The operations were performed using the TextFormats command line tool tf_spec, with the sub-commands compile (compile and save to file) and info (compile YAML or load pre-compiled).
The reported times are the average over 3 runs of the real time measured by GNU time, on an openSUSE Linux 15.3 workstation (CPU i5-4570 3.2 GHz, RAM 16 GB), using TextFormats 1.2.2.
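The measurement protocol described above (mean wall-clock time over 3 runs of a command) can be approximated with a short Python sketch. This is not the authors' benchmarking script: it uses Python's `time.perf_counter` instead of GNU time, the helper name `average_wall_time` is hypothetical, and the command shown is a placeholder rather than an actual tf_spec invocation.

```python
import statistics
import subprocess
import sys
import time


def average_wall_time(command, runs=3):
    """Return the mean wall-clock time (seconds) over `runs` executions."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


if __name__ == "__main__":
    # Placeholder command: substitute the command line to be benchmarked.
    placeholder = [sys.executable, "-c", "pass"]
    print(f"{average_wall_time(placeholder):.2f} s")
```

Timings obtained this way include process start-up, as does the real time reported by GNU time, so they are comparable in kind but not necessarily in value.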
Operations implemented by TextFormats and corresponding API functions and CLI commands.
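| Input | Operation | Interface | Function/Command |
|---|---|---|---|
| Specification | Compile | Nim | |
| Specification | Compile | C | |
| Specification | Compile | Python | |
| Specification | Compile | CLI | |
| Specification | Load | Nim | |
| Specification | Load | C | |
| Specification | Load | Python | |
| Specification | Load | CLI | |
| Specification | Run tests | Nim | |
| Specification | Run tests | C | |
| Specification | Run tests | Python | |
| Specification | Run tests | CLI | |
| Text representation | Validate | Nim | |
| Text representation | Validate | C | |
| Text representation | Validate | Python | |
| Text representation | Validate | CLI | |
| Text representation | Decode (input: string) | Nim | |
| Text representation | Decode (input: string) | C | |
| Text representation | Decode (input: string) | Python | |
| Text representation | Decode (input: string) | CLI | |
| Text representation | Decode (input: file) | Nim | |
| Text representation | Decode (input: file) | C | |
| Text representation | Decode (input: file) | Python | |
| Text representation | Decode (input: file) | CLI | |
| Data | Check if suitable for representation | Nim | |
| Data | Check if suitable for representation | C | |
| Data | Check if suitable for representation | Python | |
| Data | Check if suitable for representation | CLI | |
| Data | Encode into text representation | Nim | |
| Data | Encode into text representation | C | |
| Data | Encode into text representation | Python | |
| Data | Encode into text representation | CLI | |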
Running times of equivalent programs based on TextFormats or other libraries, implemented in Nim, Python, and C.
| Case study | N. input lines | Library | Nim | Python | (vs Nim) | C | (vs Nim) |
|---|---|---|---|---|---|---|---|
| (SAM) Case study 1 | 100 000 | TextFormats | 5.45 s | 5.70 s | (+4.6%) | 5.61 s | (+2.9%) |
| (SAM) Case study 1 | 500 000 | TextFormats | 26.89 s | 28.14 s | (+4.6%) | 27.76 s | (+3.2%) |
| (SAM) Case study 1 | 1 000 000 | TextFormats | 53.70 s | 55.91 s | (+4.1%) | 55.46 s | (+3.3%) |
| (SAM) Case study 1 | 100 000 | htslib | 0.35 s | 2.44 s | | 0.09 s | |
| (SAM) Case study 1 | 500 000 | htslib | 1.66 s | 12.16 s | | 0.42 s | |
| (SAM) Case study 1 | 1 000 000 | htslib | 3.37 s | 24.33 s | | 0.83 s | |
| (EGC) Case study 3 | 100 000 | TextFormats | 5.38 s | 5.87 s | (+9.1%) | 5.31 s | (-1.3%) |
| (EGC) Case study 3 | 500 000 | TextFormats | 25.78 s | 28.33 s | (+9.9%) | 25.40 s | (-1.4%) |
| (EGC) Case study 3 | 1 000 000 | TextFormats | 52.74 s | 55.70 s | (+5.6%) | 51.62 s | (-2.3%) |
| (EGC) Case study 3 | 100 000 | ad hoc parser | n.a. | 2.19 s | | n.a. | |
| (EGC) Case study 3 | 500 000 | ad hoc parser | n.a. | 11.37 s | | n.a. | |
| (EGC) Case study 3 | 1 000 000 | ad hoc parser | n.a. | 22.69 s | | n.a. | |
| (GFA2) Case study 4 | 363 613 | TextFormats | 93.55 s | 96.83 s | (+3.5%) | n.a. | |
| (GFA2) Case study 4 | 363 613 | GfaPy | n.a. | 191.83 s | | n.a. | |
(SAM) Case study 1: program for collecting information from a SAM file, based on the TextFormats or the htslib library;
(EGC) Case study 3: program for parsing the EGC format (defined in the text), writing the information to JSON and then back to EGC, based on the TextFormats library or on an ad hoc Python parser;
(GFA2) Case study 4: Python program for validating a GFA2 file and collecting basic statistics on the file, based on the TextFormats library or on the GfaPy library;
(vs Nim) Running time difference of the Python or C version (when implemented) of the TextFormats-based programs relative to the Nim version;
The reported times are the average over 3 runs of the real time measured by GNU time, on an openSUSE Linux 15.3 workstation (CPU i5-4570 3.2 GHz, RAM 16 GB), using TextFormats 1.2.2.
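The "(vs Nim)" columns can be read as a relative difference with the Nim time as baseline. The sketch below recomputes it for the 100 000-line SAM row; the function name `percent_vs_baseline` is hypothetical. Because the table lists rounded times, values recomputed this way may differ by a few tenths of a percent from some of the published figures, which were presumably derived from unrounded measurements.

```python
def percent_vs_baseline(time_s, baseline_s):
    """Relative running time difference to the baseline, in percent."""
    return round((time_s - baseline_s) / baseline_s * 100, 1)


# Times (seconds) from the 100 000-line SAM row of the table.
nim, python, c = 5.45, 5.70, 5.61

print(percent_vs_baseline(python, nim))  # 4.6
print(percent_vs_baseline(c, nim))       # 2.9
```

A positive value means the Python or C program was slower than the Nim program, and a negative value means it was faster.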
Accession identifiers of NCBI, DDBJ, ENA/EBI and UniProt sequence databases defined in the spec/accessions.yaml TextFormats specification.
| Database | Data coded in accession |
|---|---|
| INSD read archives (SRA, DRA, ERA) | Institution (NCBI, DDBJ, ENA/EBI), Type of data (study, run, sample, experiment, analysis), Entry |
| UniProtKB | Database name, Entry |
| Trace Archive | Database name, Entry |
| INSD assembled sequence (Nucleotide, Protein, Bulk, MGA) | Database name, Entry |
| INSD metadata (BioProject, BioSample) | Institution (NCBI, DDBJ, ENA/EBI), Type of Record (BioProject, BioSample), Entry |
| RefSeq | Type of molecule (Genomic, RNA, protein), Type of assembly (reference, alternate), Type of annotation (curated, predicted model), Entry |
| Ensembl | Species, Feature type (exon, protein, gene, transcript, etc.), Entry |
The definitions on which the specification is based were obtained from the following documentation pages: https://www.ncbi.nlm.nih.gov/Sequin/acc.html, https://www.ddbj.nig.ac.jp/acc_def-e.html, https://www.ddbj.nig.ac.jp/prefix-e.html#dra, https://www.ncbi.nlm.nih.gov/books/NBK21091/table/ch18.T.refseq_accession_numbers_and_mole/, https://www.uniprot.org/help/accession_numbers and https://www.ensembl.org/info/genome/stable_ids/prefixes.html.
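As an illustration of how data can be coded in an accession, the following Python sketch decodes INSD read archive accessions (first row of the table) into institution, type of data and entry. It is a hand-written example, not the content of spec/accessions.yaml: the regular expression, the dictionaries and the function name are assumptions based on the commonly documented SRA, ERA and DRA prefix convention, and only the data types listed in the table are covered.

```python
import re

# First letter: institution that issued the accession (assumed convention).
INSTITUTIONS = {"S": "NCBI", "E": "ENA/EBI", "D": "DDBJ"}

# Third letter: type of data (only the types listed in the table).
DATA_TYPES = {
    "P": "study",
    "R": "run",
    "S": "sample",
    "X": "experiment",
    "Z": "analysis",
}

READ_ARCHIVE = re.compile(r"([SED])R([PRSXZ])([0-9]+)")


def decode_read_archive_accession(accession):
    """Split a read archive accession into the data coded in it."""
    match = READ_ARCHIVE.fullmatch(accession)
    if match is None:
        raise ValueError(f"not a read archive accession: {accession!r}")
    institution, data_type, entry = match.groups()
    return {
        "institution": INSTITUTIONS[institution],
        "type": DATA_TYPES[data_type],
        "entry": entry,  # kept as a string to preserve leading zeros
    }


print(decode_read_archive_accession("SRR000001"))
```

The other rows of the table would need analogous patterns, each with its own set of coded fields.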
Fasta sequence identifiers used by NCBI, defined in the spec/ncbi_id.yaml TextFormats specification.
| Type of sequence | Accession prefix | Example |
|---|---|---|
| NCBI RefSeq database | | |
| NCBI GenBank database | | |
| NCBI GenBank (third-party annotation) | | |
| EMBL sequence database | | |
| EMBL sequence (third-party annotation) | | |
| DDBJ sequence database | | |
| DDBJ sequence (third-party annotation) | | |
| SWISS-Prot database | | |
| TrEMBL database | | |
| PIR database | | |
| PDB database | | |
| PRF database | | |
| patent sequence | | |
| pre-grant patent sequence | | |
| general database reference | | |
| local sequence | | |
| GenInfo backbone sequence ID | | |
| GenInfo backbone molecule type | | |
| GenInfo import ID | | |
| GenInfo integrated database | | |
| NCBI internal, genome pipeline | | |
| NCBI internal, named annotation track | | |
The format of each type of identifier is described in the documentation of the NCBI Toolkit, at https://ncbi.github.io/cxx-toolkit/pages/ch_demo#ch_demo.id1_fetch.html_ref_fasta.
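A minimal Python sketch of how such identifiers might be split is given below. It is not derived from spec/ncbi_id.yaml: it assumes the pipe-separated layout described in the NCBI Toolkit documentation, in which the first field is a database prefix, and it covers only a handful of prefixes (`ref`, `gb`, `emb`, `dbj`, `sp`, `pdb`) that are supplied here as an assumption. The function name and the returned dictionary layout are hypothetical.

```python
# Assumed mapping from identifier prefix to type of sequence (partial).
DATABASES = {
    "ref": "NCBI RefSeq database",
    "gb": "NCBI GenBank database",
    "emb": "EMBL sequence database",
    "dbj": "DDBJ sequence database",
    "sp": "SWISS-Prot database",
    "pdb": "PDB database",
}


def split_ncbi_identifier(identifier):
    """Split a pipe-separated Fasta identifier into prefix and fields."""
    prefix, *fields = identifier.split("|")
    if prefix not in DATABASES:
        raise ValueError(f"unsupported identifier prefix: {prefix!r}")
    # An empty string in `fields` marks a field left blank in the identifier.
    return {"database": DATABASES[prefix], "fields": fields}


print(split_ncbi_identifier("sp|P12345|AATM_RABIT"))
print(split_ncbi_identifier("ref|NC_000913.3|"))
```

The number and meaning of the fields after the prefix depend on the type of identifier, so a complete parser would validate each type separately.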