
CoCoView - A codon conservation viewer via sequence logos.

Beatriz Rodrigues Estevam¹, Diego Mauricio Riaño-Pachón¹.

Abstract

Sequence logos are a simple way to display a set of aligned sequences and are useful for identifying conserved patterns. Since their introduction, several tools have been developed for generating these representations at the single-residue level (amino acids or nucleotides). We have developed a tool that builds sequence logos of protein-coding sequences at the codon level. This allows a more accurate analysis of coding sequences, because codon logos represent both synonymous and non-synonymous changes instead of showing only the changes that result in amino acid substitutions. We built CoCoView on top of the Logomaker Python API. It creates codon sequence logos from a multiple sequence alignment of protein-coding sequences. End users can control some properties of the data and of the generated logos, such as data redundancy, plot type, and alphabet color.
• Split aligned sequences into codon positions;
• For each position, compute codon frequency and information content;
• Use the computed information to plot the graphic.

Entities: Chemical

Keywords: Codons representation; Consensus sequence; Conserved patterns; Information theory

Year: 2022 PMID: 35990812 PMCID: PMC9382315 DOI: 10.1016/j.mex.2022.101803

Source DB: PubMed Journal: MethodsX ISSN: 2215-0161

Specifications table

Method details

Background information

Introduced by Schneider and Stephens (1990), sequence logos are composed of stacks of letters, one stack for each position of the multiple sequence alignment, following the conceptual bases of information theory [1], [2], [3]. The height of the stack (Rseq [Eq. 1]) is proportional to the conservation of the position; it is defined as the difference between the maximum possible entropy (Smax), defined as log2 of the number of symbols, and the observed entropy (H(l) [Eq. 2]). The height of a given base/amino acid/codon within the stack (Height [Eq. 3]) is the product of its frequency and Rseq [1,4].

$$R_{seq}(l) = S_{max} - H(l), \qquad S_{max} = \log_2 N \tag{1}$$

$$H(l) = -\sum_{n=1}^{N} f(n,l)\,\log_2 f(n,l) \tag{2}$$

$$\mathrm{Height}(n,l) = f(n,l)\,R_{seq}(l) \tag{3}$$

Here f(n,l) represents the frequency of the symbol n (nucleotide, codon, or amino acid) at position l, and N is the number of distinct symbols for a given alphabet (nucleotides, codons, or amino acids). It follows that Smax for DNA and RNA, which both have 4 nitrogenous bases, is log2(4) = 2 bits; for proteins with 20 different amino acids it is log2(20) ≈ 4.32 bits; and for 64 codons it is log2(64) = 6 bits. Note that when ambiguous nucleotides are allowed, the number of possible ‘codons’ is higher, and so is Smax.

Codons have an important role in biology: they are the unit of information in protein-coding sequences during translation. Changes in codon usage can have important functional consequences. For instance, even changes between synonymous codons can impact protein folding [5] or affect the rate of protein elongation [6]. Analyzing codon usage on a positional basis allows the identification of consensus/conserved sequences and their variants in DNA regions that encode active, cleavage, and allosteric sites in proteins. It also allows the analysis of regulatory regions, such as mRNA sites that enhance or repress protein translation [7] and mRNA splicing regions [8]. Current, easy-to-use tools to visualize codon variation on a positional basis are lacking, as previous implementations are no longer available [9].

We developed CoCoView, exploiting Logomaker [10] to create codon sequence logos.
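As an illustration of the information-theoretic quantities described above, the following minimal sketch computes Rseq and the per-codon heights for a single alignment column. It is not taken from the CoCoView source; the function name `codon_stack_heights` and the example column are hypothetical, and no small-sample correction is applied.

```python
import math
from collections import Counter

def codon_stack_heights(column, alphabet_size=64):
    """Return (r_seq, heights) for one alignment column given as a list of codons.

    Hypothetical helper for illustration only; alphabet_size=64 assumes
    unambiguous codons, so that Smax = log2(64) = 6 bits.
    """
    counts = Counter(column)
    total = len(column)
    # f(n, l): relative frequency of each codon observed in this column
    freqs = {codon: n / total for codon, n in counts.items()}
    s_max = math.log2(alphabet_size)
    # H(l): observed entropy; codons with zero frequency contribute nothing
    h_obs = -sum(f * math.log2(f) for f in freqs.values())
    r_seq = s_max - h_obs
    # Height(n, l) = f(n, l) * Rseq(l)
    heights = {codon: f * r_seq for codon, f in freqs.items()}
    return r_seq, heights

# Two synonymous glutamate codons at equal frequency: H = 1 bit, Rseq = 5 bits
r_seq, heights = codon_stack_heights(["GAA", "GAA", "GAG", "GAG"])
print(r_seq, heights)
```

A fully conserved column yields the maximum of 6 bits, whereas the two-codon column above yields a stack of 5 bits split evenly between "GAA" and "GAG".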

Materials and methods

We developed CoCoView as a single Python 3 script, tested on v3.7 and v3.9, to generate the codon sequence logos. It is available at https://github.com/labbces/CoCoView and runs on the command-line interface. CoCoView relies on the following libraries: argparse [11], pandas [12], matplotlib [13], logomaker [10], json [14], and biopython [15]. Of these, argparse and json are part of the Python standard library; the others should be installed in advance. We use Logomaker as a base because of its flexibility and because, among other features, it can transform probability matrices into bit matrices and lets us define where each symbol or glyph is located on the plot [10].

Input

CoCoView requires only one input: a FASTA file of aligned nucleotide sequences. The length of the aligned sequences must be a multiple of three, and CoCoView assumes that each sequence starts with a complete codon. Several command-line switches alter the behavior of the program; they are described below. Two output files are produced: the matrix used to build the logo, in either bits or probabilities, and the sequence logo itself, in either png or pdf format.
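The input requirements can be made concrete with a small sketch that reads aligned sequences from FASTA text, checks that each length is a multiple of three, splits the sequences into codons, and tabulates per-position codon proportions. This is an assumption-laden illustration using only the standard library, not the CoCoView implementation (which relies on biopython and pandas); the names `read_fasta`, `split_codons`, and `probability_matrix` are hypothetical.

```python
from collections import Counter

def read_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records

def split_codons(sequence):
    """Split a sequence into codons, assuming it starts with a complete codon."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length is not a multiple of three")
    return [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

def probability_matrix(sequences):
    """Return one {codon: proportion} dict per codon position."""
    rows = [split_codons(seq) for seq in sequences]
    n_positions = len(rows[0])
    if any(len(row) != n_positions for row in rows):
        raise ValueError("aligned sequences must all have the same length")
    matrix = []
    for pos in range(n_positions):
        counts = Counter(row[pos] for row in rows)
        matrix.append({codon: n / len(rows) for codon, n in counts.items()})
    return matrix

fasta_text = ">seq1\nATGGAAGCT\n>seq2\nATGGAGGCT\n"
sequences = list(read_fasta(fasta_text).values())
for position, proportions in enumerate(probability_matrix(sequences), start=1):
    print(position, proportions)
```

In this sketch, gap triplets such as "---" would simply be counted as one more symbol; how CoCoView itself treats gaps is not specified here.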

Command-line arguments for CoCoView

Required, input FASTA file, “fastaFile”: CoCoView only requires a single input file. The script can only deal with single nucleotide symbols following the modern IUPAC nucleotide code nomenclature for incompletely specified bases [16]. Ambiguous nucleotides make it difficult to define the codons, so CoCoView allows the user to filter out sequences based on the fraction of ambiguous nucleotides present, using the argument “degreeOfUncertainty” (see below). We recommend using at least 40 sequences to avoid underestimation of entropy [4].

Optional, –prefixFileName: CoCoView produces two output files. One of them is a matrix, in bits or probabilities (see –matrixLogoType), that is used to build the codon logo. The other output file is the codon logo in figure format (see –logoFormat). The value of this argument is used as a prefix to name these two output files.

Optional, –imageTitle: This argument is a string that will appear as the title at the top of the sequence logo. If it is not provided by the user, a title is automatically generated from the input file name.

Optional, –matrixLogoType: CoCoView builds the codon logo based on a matrix, which can be:
• a probability matrix: a matrix of N rows x M columns, in which the N rows are the codon positions in the multiple sequence alignment and the M columns are the different codons. Each cell holds the proportion (probability) of a given codon at a given position. The codon proportions for a given position must sum to 1.
• a bit matrix (default option): a transformation of the probability matrix, maintaining the same geometry, using the conceptual framework in equations 1 to 3. Each cell in the matrix represents the Height [Eq. 3] of a given codon at a given position, in bits.

Optional, –alphaColor: CoCoView can use four different color palettes for the codon logos. Codons are colored according to the properties of their corresponding amino acids. The options are “weblogo_protein” (default), “charge”, “chemistry”, and “hydrophobicity”.

Optional, –degreeOfUncertainty: Ambiguous nucleotides are allowed in the input sequences; however, when they are present there is uncertainty about the amino acids they code for. With this argument, given as a floating-point number between 0 and 100, the user can filter out sequences whose proportion of ambiguous nucleotides is greater than degreeOfUncertainty. For example, a degreeOfUncertainty of 30% will exclude all sequences of length 12 that have at least 4 ambiguous nucleotides (4/12 ≈ 33%).

Optional, –datasetType: If duplicated sequences are present in the input dataset, setting this argument to ‘nonreduntant’ removes the duplicates from the analysis. This option is useful for small datasets. For very large datasets (thousands of sequences with hundreds or thousands of residues), users are advised to use third-party tools to generate non-redundant sequence sets, e.g., cd-hit [17] or UCLUST [18]. Setting ‘nonreduntant’ may be of interest when the user wants to visualize less frequent codons. The default value is ‘redundant’.
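The filtering behavior described for degreeOfUncertainty and datasetType can be sketched as follows. This is only an illustration under stated assumptions: the ambiguity percentage is computed over the full sequence length, ambiguous symbols are taken to be the IUPAC letters other than A, C, G, T, and U, and the function names are hypothetical rather than CoCoView's own.

```python
# IUPAC codes for incompletely specified bases (assumed set of ambiguous symbols)
AMBIGUOUS = set("RYSWKMBDHVN")

def ambiguity_percent(sequence):
    """Percentage of positions in the sequence holding an ambiguous nucleotide."""
    ambiguous = sum(1 for base in sequence.upper() if base in AMBIGUOUS)
    return 100.0 * ambiguous / len(sequence)

def filter_by_uncertainty(sequences, degree_of_uncertainty):
    """Keep sequences whose ambiguity percentage does not exceed the threshold."""
    return [s for s in sequences if ambiguity_percent(s) <= degree_of_uncertainty]

def remove_duplicates(sequences):
    """Drop exact duplicate sequences while preserving the input order."""
    return list(dict.fromkeys(sequences))

sequences = [
    "ATGGAAGCTTAA",  # 0 of 12 ambiguous: kept
    "ATGNNAGCTNAA",  # 3 of 12 ambiguous (25%): kept at a 30% threshold
    "ATGNNNGCTNAA",  # 4 of 12 ambiguous (about 33%): excluded at a 30% threshold
    "ATGGAAGCTTAA",  # duplicate of the first sequence
]
kept = filter_by_uncertainty(sequences, 30.0)
print(kept)
print(remove_duplicates(kept))
```

With a threshold of 30, the third sequence is removed by the ambiguity filter, and removing duplicates then collapses the two identical sequences into one.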

Method validation - brief example

Transcription factors are proteins that bind DNA and regulate the expression of target genes. AP2 is a transcription factor involved in the regulation of growth and development, fruit ripening, defense response, and metabolism in plants [19]. To illustrate the benefits of a per-codon representation of variation, we generated sequence logos using WebLogo [4] (per-nucleotide analysis, Fig. 1A) and CoCoView (per-codon analysis, Fig. 1B) for a region of the multiple sequence alignment of the coding sequences of AP2 from Nicotiana tabacum. In Fig. 1A, note positions 10 to 12, which represent the 4th codon of that region of the CDS. Based on the conservation of the individual nucleotides, one could incorrectly conclude that the triplet “GAT” is common at that position. However, the codon-based sequence logo in Fig. 1B makes it clear that “GAT” is not common at all at this position.

Fig. 1

Sequence logos based on a multiple sequence alignment of a region of AP2 transcription factor coding sequences from Nicotiana tabacum. (A) Sequence logo generated using WebLogo [4], representing a per-nucleotide analysis. (B) Sequence logo generated using CoCoView (per-codon analysis). A per-nucleotide analysis could erroneously suggest that some codons are common, which can be ruled out with a per-codon visualization. This is exemplified by the codon “GAT” at the position highlighted in gray on both sequence logos: it can be interpreted as a common codon in the per-nucleotide analysis, whereas the per-codon analysis shows that this codon does not occur at this position.
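The effect shown in Fig. 1 can be reproduced with a toy example. The codon column below is synthetic data chosen for illustration, not the AP2 alignment: the most frequent nucleotide at each of the three sites spells "GAT", yet no sequence carries that codon.

```python
from collections import Counter

# Synthetic codon column (one codon per sequence); not taken from the AP2 data
codons = ["GAA", "GCT", "CAT", "GAC", "TAT", "GAG"]

# Per-nucleotide view: most frequent base at each of the three codon sites
per_nucleotide_consensus = "".join(
    Counter(codon[site] for codon in codons).most_common(1)[0][0]
    for site in range(3)
)

# Per-codon view: how often each complete triplet occurs
codon_counts = Counter(codons)

print(per_nucleotide_consensus)                # GAT
print(codon_counts[per_nucleotide_consensus])  # 0
```

Each site taken alone suggests G, A, and T, but counting whole triplets shows that "GAT" is absent, which is the kind of misreading a per-codon logo avoids.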

Conclusion

Here we presented CoCoView, a method to construct sequence logos using codons, which allows for a more detailed analysis of sequence conservation.

CRediT authorship contribution statement

Beatriz Rodrigues Estevam: Software, Writing – original draft, Writing – review & editing. Diego Mauricio Riaño-Pachón: Conceptualization, Software, Resources, Writing – review & editing, Supervision, Project administration, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Subject area	Bioinformatics

More specific subject area	Sequence analysis
Name of your method	CoCoView: A Codon Conservation Viewer via Sequence Logos
Name and reference of original method	Consensus sequence display via Sequence logos [1].
Resource availability	CoCoView.py and additional information are available on the project's GitHub: https://github.com/labbces/CoCoView

13 in total

1. Search and clustering orders of magnitude faster than BLAST.

Authors: Robert C Edgar
Journal: Bioinformatics Date: 2010-08-12 Impact factor: 6.937

2. An evolutionarily conserved mechanism for controlling the efficiency of protein translation.

Authors: Tamir Tuller; Asaf Carmi; Kalin Vestsigian; Sivan Navon; Yuval Dorfan; John Zaborske; Tao Pan; Orna Dahan; Itay Furman; Yitzhak Pilpel
Journal: Cell Date: 2010-04-16 Impact factor: 41.582

3. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984.

Authors: A Cornish-Bowden
Journal: Nucleic Acids Res Date: 1985-05-10 Impact factor: 16.971

4. Silent substitutions predictably alter translation elongation rates and protein folding efficiencies.

Authors: Paige S Spencer; Efraín Siller; John F Anderson; José M Barral
Journal: J Mol Biol Date: 2012-06-12 Impact factor: 5.469

5. Biopython: freely available Python tools for computational molecular biology and bioinformatics.

Authors: Peter J A Cock; Tiago Antao; Jeffrey T Chang; Brad A Chapman; Cymon J Cox; Andrew Dalke; Iddo Friedberg; Thomas Hamelryck; Frank Kauff; Bartek Wilczynski; Michiel J L de Hoon
Journal: Bioinformatics Date: 2009-03-20 Impact factor: 6.937

6. CodonLogo: a sequence logo-based viewer for codon patterns.

Authors: Virag Sharma; David P Murphy; Gregory Provan; Pavel V Baranov
Journal: Bioinformatics Date: 2012-05-17 Impact factor: 6.937

7. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

Review 8. Multiple regulatory roles of AP2/ERF transcription factor in angiosperm.

Authors: Chao Gu; Zhi-Hua Guo; Ping-Ping Hao; Guo-Ming Wang; Zi-Ming Jin; Shao-Ling Zhang
Journal: Bot Stud Date: 2017-01-03 Impact factor: 2.787

9. A Bioinformatics-Based Alternative mRNA Splicing Code that May Explain Some Disease Mutations Is Conserved in Animals.

Authors: Wen Qu; Pablo Cingolani; Barry R Zeeberg; Douglas M Ruden
Journal: Front Genet Date: 2017-04-11 Impact factor: 4.772

10. Logomaker: beautiful sequence logos in Python.

Authors: Ammar Tareen; Justin B Kinney
Journal: Bioinformatics Date: 2019-12-10 Impact factor: 6.937