Artem Babaian, Anicet Ebou, Alyssa Fegen, Ho Yin Kam, German E Novakovsky, Jasper Wong, Dylan Aïssi, Li Yao.
Abstract
BACKGROUND: Computational biology requires the reading and comprehension of biological data files. Plain-text formats such as SAM, VCF, GTF, PDB and FASTA often contain critical information that is obfuscated by the complexity of the data structure.
Keywords: Command line interface; FASTA; FASTQ; SAM; Sublime; Syntax highlighting; VCF; Vim
Year: 2018 PMID: 30134911 PMCID: PMC6106740 DOI: 10.1186/s12859-018-2315-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Syntax highlighting for the sequence alignment map (.sam) file format. a Terminal screenshot of the ‘less HG00128_hgr1.sam’ command run i. normally or ii. with bioSyntax. Related information in the header and data sections is grouped by colour (genomic coordinates, green; sample information, pale blue ...) to improve legibility. Each data row is an individual sequencing read. iii. CIGAR alignment strings in particular can be highlighted such that they become substantially easier to read. b A broad view of the nucleotide and PHRED-score for 30 reads i. before, and ii. after syntax highlighting. Underlying information about the data becomes intuitively visible, such as PCR duplicates (black arrow) and poor-quality areas and reads (blue arrow), based on iii. PHRED score
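To illustrate why CIGAR strings benefit from highlighting, the following minimal sketch splits a CIGAR string into its length and operation parts, which are the units a highlighter would colour separately. It is not taken from bioSyntax; the function name `parse_cigar` is hypothetical, and the operation letters are those listed in the SAM specification.

```python
import re

# One CIGAR token: a run length followed by one operation letter.
# The letters MIDNSHP=X are the operations defined in the SAM specification.
_CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
_CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs.

    Hypothetical helper for illustration only. "*" means that no
    alignment is available and yields an empty list.
    """
    if cigar == "*":
        return []
    # Reject strings that contain anything other than valid tokens.
    if not _CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in _CIGAR_TOKEN.findall(cigar)]


if __name__ == "__main__":
    # 10 soft-clipped bases, 50 aligned, 2 inserted, 14 aligned.
    print(parse_cigar("10S50M2I14M"))
```

Each pair could then be mapped to a colour per operation, so that insertions, deletions and clipped bases stand out from matches.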
Fig. 2 bioSyntax nucleotide colour scheme. a The four primary bases are coloured in two pairs of contrasting colours. IUPAC ambiguous bases are then coloured in increasingly lighter tones of the approximately mixed colours. To accommodate 4-dimensional bases in 3-dimensional colours, aMino (A or C) and Keto (G or T) bases are darker. b A comparison of nucleotide colour schemes in the literature. c bioSyntax colouring allows for approximation of a sequence's GC content by how warm (high GC) or cool (high AT) it appears
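The quantity that the warm/cool impression in panel c approximates can be computed directly. The sketch below is an illustration only, not part of bioSyntax: the function name `gc_content` is hypothetical, it assumes DNA input, and it ignores ambiguous IUPAC bases (an assumption made here for simplicity).

```python
def gc_content(sequence):
    """Return the fraction of G and C among the unambiguous bases A, C, G, T.

    Hypothetical helper for illustration only. Ambiguous IUPAC codes
    (N, R, Y, ...) and gaps are skipped; an input with no unambiguous
    bases returns 0.0.
    """
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    gc = sum(1 for b in bases if b in "GC")
    return gc / len(bases)


if __name__ == "__main__":
    # A GC-rich sequence would appear "warm", an AT-rich one "cool".
    for seq in ("GGCCGCGC", "ATATTAAT", "ACGTNNAC"):
        print(seq, gc_content(seq))
```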