FasParser: a package for manipulating sequence data.

Abstract

A software package, 'FasParser', was developed for manipulating sequence data. It runs on personal computers and performs a series of analyses, including counting and viewing differences between two sequences at both the DNA and codon levels, identifying overlapping regions between two alignments, sorting sequences by ID or length, concatenating sequences of multiple loci for a particular set of samples, translating nucleotide sequences to amino acids, constructing alignments in several formats, and extracting and filtering data from a FASTA file. The majority of these functions can be run in batch mode, which is very useful for analyzing large data sets. The package is designed for researchers without programming experience in sequence analysis and can be used by a broad audience. The GUI version of FasParser can be downloaded free of charge from https://github.com/Sun-Yanbo/FasParser.

Keywords: Batch processing; Extraction and filtration; FasParser; Sequence comparison

Substances:
RNA
DNA

Year: 2017 PMID: 28409507 PMCID: PMC5396028 DOI: 10.24272/j.issn.2095-8137.2017.017

Source DB: PubMed Journal: Zool Res ISSN: 2095-8137

INTRODUCTION

Recent developments in sequencing technology have generated vast amounts of DNA and RNA sequence data, and analyses of these sequences are among the most important means of biological inference. The sheer volume of data makes manipulation difficult, especially for researchers without programming experience. User-friendly software that offers batch modes for sequence extraction, filtration, translation, and file-format conversion therefore facilitates research. The program package MEGA (Kumar et al., 1994), developed decades ago, is used worldwide. Although it offers manipulation functions such as sequence viewing and format conversion, it focuses mainly on statistical analyses of molecular evolution. Many sequence manipulations still require manual work or other tools (e.g., Microsoft Excel). Examples include concatenating loci from multiple sequence files, extracting gene sequences from a whole genome, and filtering very short sequences out of an alignment. Another package, BioEdit (Hall, 1999), handles most simple sequence editing and manipulation tasks. However, it handles batch processing inefficiently and can deal with only one alignment file at a time. Herein, I provide a new program package, 'FasParser', for manipulating sequence files. It has a user-friendly GUI and batch processing modes, which allow users to handle multiple sequence files in a simple way.
Presently, the package has seven main programs/functions (Figure 1): (1) counting and viewing the differences between two sequences at the DNA and codon levels; (2) identifying overlapping columns of two alignments of the same gene; (3) sorting sequences according to ID, sequence length, or a user-provided ID list; (4) concatenating sequences for a particular set of samples from multiple sequence files; (5) batch-translating protein-coding nucleic acid sequences into amino acids; (6) constructing alignments in different formats; and (7) extracting and filtering sequences according to ID or sequence length. FasParser is a standalone application that has been compiled and tested on Windows 7/10 operating systems. The size of the data that can be analyzed is limited only by available computer memory.

Figure 1

Overview of the functions provided by FasParser

BATCH PROCESSING

This new package can batch process several commonly used procedures, including merging sequences, translating, aligning, and converting formats. For merging, it builds a "super sequence" by concatenating the sequences of all loci for a particular set of samples, which is useful for phylogenetic inference. The translation program derives amino acid sequences under multiple genetic codes. In addition to batch processing, it can read a single FASTA file or a single DNA sequence (manual mode), providing a simple way to obtain the amino acid sequences. Alignment construction is one of the most important sequence manipulations, and the program can use three popular aligners for it: MUSCLE (Edgar, 2004), MAFFT (Katoh et al., 2002), and PRANK (Löytynoja & Goldman, 2005). The first two generate final alignments quickly and automatically recognize the sequence type (DNA or amino acid). Although PRANK is slower than the others, it produces more accurate results (Jordan & Goldman, 2012) and can directly produce codon-level alignments. FasParser can also convert alignments between formats, for example from FASTA to PHYLIP, PAML, or NEXUS. Batch processing of these functions requires only a directory containing all the sequence files to be analyzed.
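The merging, translation, and format-conversion steps can be illustrated with a minimal Python sketch. This is not FasParser's own code: all function names, the `*.fasta` file pattern, the gap-padding of samples missing from a locus, and the use of the standard genetic code only are assumptions made for illustration.

```python
from pathlib import Path

# Standard genetic code (NCBI table 1), codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def read_fasta(path):
    """Read a FASTA file into {sequence_id: sequence} (whole header line as ID)."""
    records, name = {}, None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:].strip()
                records[name] = ""
            elif line and name is not None:
                records[name] += line
    return records


def concatenate_loci(loci, samples, gap="-"):
    """Join per-locus alignments into one 'super sequence' per sample.

    Assumption: a sample missing from a locus is padded with gaps.
    """
    parts = {sample: [] for sample in samples}
    for locus in loci:
        if not locus:
            continue
        width = len(next(iter(locus.values())))
        for sample in samples:
            parts[sample].append(locus.get(sample, gap * width))
    return {sample: "".join(chunks) for sample, chunks in parts.items()}


def translate(dna):
    """Translate a coding sequence; codons with gaps or ambiguity become 'X'."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def to_phylip(alignment):
    """Render {id: aligned_sequence} as relaxed sequential PHYLIP text."""
    width = len(next(iter(alignment.values())))
    lines = [f"{len(alignment)} {width}"]
    lines += [f"{name}  {seq}" for name, seq in alignment.items()]
    return "\n".join(lines) + "\n"


def batch_concatenate(directory, samples, pattern="*.fasta"):
    """Batch mode: concatenate every FASTA file found in one directory."""
    loci = [read_fasta(path) for path in sorted(Path(directory).glob(pattern))]
    return concatenate_loci(loci, samples)
```

For example, `concatenate_loci([{"a": "AC", "b": "AT"}, {"a": "GGG"}], ["a", "b"])` pads sample `b` for the second locus and returns `{"a": "ACGGG", "b": "AT---"}`, which `to_phylip` can then write out.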

SEQUENCE COMPARISON AND MUTATION IDENTIFICATION

After constructing an alignment, it is often desirable to visualize the mutations or substitutions between two sequences, and/or to identify the regions on which different aligners agree for the same gene. The programs "Cmp-2Seq" and "Cmp-2Align" address these issues. Cmp-2Seq counts and displays differences between two sequences at the nucleotide and codon levels. At the codon level, the program estimates the total numbers of synonymous (S) and non-synonymous (N) sites for the first sequence and then calculates the numbers of synonymous and non-synonymous substitutions between the two sequences according to the NG86 method of Nei & Gojobori (1986). This function is useful in analyses such as cancer genomic studies that focus on understanding the selective pressures following cell proliferation (Liu et al., 2012). Cmp-2Align identifies overlapping regions between two alignments using a simple but rigorous algorithm (Figure 2). Briefly, for each base in an alignment column, the program calculates its gap-free position in the raw sequence. Next, it transforms these positions into a string vector such as "1-2-2", which means that there are three sequences and that this column contains the first base of the first sequence, the second base of the second sequence, and the second base of the third sequence. Finally, the program extracts all columns that have identical position vectors in the two alignments (Figure 2). This manipulation is useful for analyses such as the identification of regions informative for phylogenetic inference.
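The codon-level counting performed by Cmp-2Seq can be approximated with the following Python sketch of the NG86 idea. It is an illustration under stated assumptions, not FasParser's implementation: the function names are hypothetical, only the standard genetic code is used, changes to stop codons count as non-synonymous when site fractions are computed, mutational pathways through stop codons are ignored, and codons containing gaps, ambiguity codes, or stops are skipped.

```python
from itertools import permutations

# Standard genetic code (NCBI table 1), codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_sites(codon):
    """Synonymous sites of one codon: s = f1 + f2 + f3, with f_i out of 3 changes."""
    same = 0
    for pos in range(3):
        for base in BASES:
            mutant = codon[:pos] + base + codon[pos + 1:]
            if base != codon[pos] and CODON_TABLE[mutant] == CODON_TABLE[codon]:
                same += 1
    return same / 3


def codon_differences(c1, c2):
    """Return (synonymous, non-synonymous) differences averaged over pathways."""
    changed = [i for i in range(3) if c1[i] != c2[i]]
    paths = []
    for order in permutations(changed):
        current, syn, non = c1, 0, 0
        for pos in order:
            step = current[:pos] + c2[pos] + current[pos + 1:]
            if CODON_TABLE[step] == "*":
                break  # assumption: pathways through stop codons are ignored
            if CODON_TABLE[step] == CODON_TABLE[current]:
                syn += 1
            else:
                non += 1
            current = step
        else:
            paths.append((syn, non))
    if not paths:
        # Assumption: with no valid pathway, treat every change as non-synonymous.
        return 0.0, float(len(changed))
    return (sum(s for s, _ in paths) / len(paths),
            sum(n for _, n in paths) / len(paths))


def compare_sequences(seq1, seq2):
    """Compare two aligned, in-frame coding sequences of equal length."""
    seq1, seq2 = seq1.upper(), seq2.upper()
    result = {"nt_diff": sum(a != b for a, b in zip(seq1, seq2)),
              "S": 0.0, "N": 0.0, "Sd": 0.0, "Nd": 0.0}
    for i in range(0, len(seq1) - len(seq1) % 3, 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        # Skip codons with gaps, ambiguity codes, or stop codons.
        if "*" in (CODON_TABLE.get(c1, "*"), CODON_TABLE.get(c2, "*")):
            continue
        s = synonymous_sites(c1)  # sites are taken from the first sequence
        syn, non = codon_differences(c1, c2)
        result["S"] += s
        result["N"] += 3 - s
        result["Sd"] += syn
        result["Nd"] += non
    return result
```

Here `S` and `N` are the site counts of the first sequence, and `Sd` and `Nd` are the synonymous and non-synonymous differences. `nt_diff` counts every mismatching column, including a gap against a base.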

Figure 2

Algorithm used to compare different alignments
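The position-vector comparison described for Cmp-2Align can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions rather than statements from the source: a gap is encoded as position 0, and columns consisting only of gaps are never reported. Both alignments must list the same sequences in the same order.

```python
def position_vectors(alignment, gap="-"):
    """Return one key per column, e.g. "1-2-2", built from gap-free positions.

    `alignment` is a list of equal-length gapped strings.
    """
    counters = [0] * len(alignment)
    keys = []
    for col in range(len(alignment[0])):
        parts = []
        for i, seq in enumerate(alignment):
            if seq[col] == gap:
                parts.append("0")  # assumption: gaps are encoded as position 0
            else:
                counters[i] += 1   # 1-based position in the raw sequence
                parts.append(str(counters[i]))
        keys.append("-".join(parts))
    return keys


def overlapping_columns(aln_a, aln_b, gap="-"):
    """Indices of columns in aln_a whose position vector also occurs in aln_b."""
    all_gap = "-".join("0" * len(aln_a))
    keys_b = set(position_vectors(aln_b, gap))
    return [
        col
        for col, key in enumerate(position_vectors(aln_a, gap))
        if key in keys_b and key != all_gap
    ]
```

For the three-sequence alignment `["-A", "CA", "CA"]`, the second column has the key `"1-2-2"`, matching the example in the text. For two alignments of the raw sequences ACGT and ACGGT, `["AC-GT", "ACGGT"]` and `["ACG-T", "ACGGT"]`, the columns shared by both are the first, second, and last.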

EXTRACTION AND FILTRATION

FasParser can also extract a set of sequences from a raw FASTA file ("Fas-Filter") based on query IDs, and remove sequences according to a length cutoff. Fas-Filter can also trim a raw alignment by removing gapped columns according to a gap-frequency cutoff. Moreover, the program provides summary statistics of a raw alignment, such as flagging sequences that are too short and calculating the lengths of gap-free blocks.
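These extraction and filtering operations can be sketched in Python as below. The function names and the direction of each cutoff are assumptions for illustration: sequences shorter than `min_length` (counted without gaps) are dropped, and columns whose gap frequency exceeds `max_gap_freq` are removed. FasParser's own conventions may differ.

```python
def filter_by_ids(records, ids):
    """Keep only the sequences whose ID appears in the query list."""
    wanted = set(ids)
    return {name: seq for name, seq in records.items() if name in wanted}


def filter_by_length(records, min_length, gap="-"):
    """Drop sequences whose gap-free length is below min_length."""
    return {
        name: seq
        for name, seq in records.items()
        if len(seq.replace(gap, "")) >= min_length
    }


def drop_gappy_columns(alignment, max_gap_freq, gap="-"):
    """Remove alignment columns whose gap frequency exceeds max_gap_freq."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    keep = [
        col
        for col in range(len(seqs[0]))
        if sum(seq[col] == gap for seq in seqs) / len(seqs) <= max_gap_freq
    ]
    return {name: "".join(seq[col] for col in keep)
            for name, seq in zip(names, seqs)}
```

With the alignment `{"a": "A-CG", "b": "ATC-", "c": "A-CG"}` and `max_gap_freq=0.5`, only the second column (two gaps out of three sequences) is removed.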

COMPARISON OF FASPARSER WITH OTHER PROGRAMS

The FasParser package provides a graphical user interface (GUI) with several commonly used sequence manipulation functions. The package remains limited in that it cannot perform phylogenetic inference, edit alignments, or identify open reading frames (ORFs) (Table 1). Therefore, FasParser is not a replacement for other packages, such as MEGA. Nonetheless, new functions for FasParser are under development.

Table 1

Comparison of FasParser with other programs

	MEGA	BioEdit	FasParser
GUI	√	√	√
Batch processing	×	×	√
Sequence viewing	√	√	√
Alignment comparison	×	×	√
Translation	√	√	√
Format conversion	√	√	√
Extraction & filtration	×	√	√
Sequence editing	√	√	×
Phylogeny inference	√	√	×
ORF identification	×	√	×

ACKNOWLEDGEMENTS

Special thanks to Prof. Robert W. Murphy, Dr. Adeniyi Charles Adeola, and Lotanna Micah Nneji for revising this manuscript, and to our colleagues for their suggestions on improving FasParser.

REFERENCES

1. Kazutaka Katoh, Kazuharu Misawa, Kei-ichi Kuma, Takashi Miyata. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res, 2002-07-15.

2. Gregory Jordan, Nick Goldman. The effects of alignment error and alignment filtering on the sitewise detection of positive selection. Mol Biol Evol, 2011-11-01.

3. Ari Löytynoja, Nick Goldman. An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci U S A, 2005-07-06.

4. M Nei, T Gojobori. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol, 1986-09.

5. S Kumar, K Tamura, M Nei. MEGA: Molecular Evolutionary Genetics Analysis software for microcomputers. Comput Appl Biosci, 1994-04.

6. Jia Liu, Li-Dong Wang, Yan-Bo Sun, En-Min Li, Li-Yan Xu, Ya-Ping Zhang, Yong-Gang Yao, Qing-Peng Kong. Deciphering the signature of selective constraints on cancerous mitochondrial genome. Mol Biol Evol, 2011-11-29.

7. Robert C Edgar. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 2004-08-19.