Mengchi Wang, David Wang, Kai Zhang, Vu Ngo, Shicai Fan, Wei Wang.
Abstract
Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, such as interpreting the motif information or searching for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener's method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as its statistical justification.
Keywords: consensus; information theory; motif; sequence logo; transcription factor binding
Year: 2020 PMID: 32816922 PMCID: PMC7536857 DOI: 10.1534/genetics.120.303597
Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562
Figure 1. Overview of sequence Motto and comparison with sequence logo. Given a motif PWM as the input, Motto outputs a consensus that minimizes information loss. Here we show how the sequence Motto of the human transcription factor P73 is determined.
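The idea of choosing, at each position, the consensus that loses the least information can be illustrated with a minimal sketch. This is not the published Motto implementation: it assumes that each candidate consensus at a position is modeled as a uniform distribution over a subset of the alphabet, and that the subset with the smallest Jensen-Shannon divergence (JSD) from the PWM column is kept. The function names (`entropy`, `jsd`, `best_subset`, `consensus`) are hypothetical, and the ambiguity penalty is not modeled.

```python
import math
from itertools import combinations


def entropy(dist):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def best_subset(column, alphabet="ACGT"):
    """Return the characters whose uniform distribution is closest (by JSD)
    to one PWM column. Assumption: candidates are uniform over a subset."""
    n = len(alphabet)
    best_score, best_chars = None, None
    for k in range(1, n + 1):
        for idx in combinations(range(n), k):
            q = [1 / k if i in idx else 0.0 for i in range(n)]
            score = jsd(column, q)
            if best_score is None or score < best_score:
                best_score = score
                best_chars = "".join(alphabet[i] for i in idx)
    return best_chars


def consensus(pwm, alphabet="ACGT"):
    """Join per-position choices into a wildcard-style consensus string."""
    parts = []
    for column in pwm:
        chars = best_subset(column, alphabet)
        parts.append(chars if len(chars) == 1 else f"[{chars}]")
    return "".join(parts)
```

For example, under these assumptions `consensus([[0.5, 0.0, 0.5, 0.0], [0.0, 0.0, 0.0, 1.0]])` yields `[AG]T`, because a uniform distribution over A and G matches the first column exactly (JSD of zero).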
Figure 3. Converted sequence Mottos recapitulate motif occurrence sites of 1156 common human and mouse transcription factors (TFs) in the human genome (hg19). (A) The averaged area under the precision-recall curve (auPRC) using Motto (default method with minimal JSD, ambiguity penalty at -P = 0.2, and at -P = 0.5) compared with existing alternative methods. P-value determined by paired t-test. (B) Comparison of three example TFs showing the differences in consensus sequences [shown in IUPAC (Johnson 2010) coding for better alignment] and performance.
Figure 2. Example usage with human CTCF (upper panel) and lipoprotein binding sites from Bailey and Elkan (1994) (lower panel). The original PWM is shown as a sequence logo. Different Motto options result in different consensus sequence output at each position. In particular, "-m/--method" specifies the method: Motto (default), MSE (minimal mean square error), Cavener (Cavener 1987), or Max (using maximal frequency at each position); "-s/--style" specifies the output style: IUPAC (Johnson 2010) (single character for nucleotide combinations), regex (regular expression), or compact (convert [ACGT] to N in regex); "-t/--trim" is an option for trimming off the flanking Ns; "-p/--penalty" specifies a weight between 0 and 1 that penalizes ambiguity at each position (for details see Materials and Methods).
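The three output styles described in the caption, and the use of a regex-style consensus to search for motif matches, can be sketched as follows. This is an illustrative sketch rather than Motto's own code: `render` and `find_matches` are hypothetical helper names, the input is assumed to be a list of allowed nucleotides per position, and trimming of flanking Ns is not modeled. The IUPAC table is the standard nucleotide ambiguity code.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, keyed by sorted nucleotide set.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def render(position_sets, style="regex"):
    """Render per-position nucleotide sets in one of three styles:
    'IUPAC' (one letter per position), 'regex' (bracketed classes),
    or 'compact' (regex, but [ACGT] written as N)."""
    out = []
    for chars in position_sets:
        key = "".join(sorted(set(chars)))
        if style == "IUPAC":
            out.append(IUPAC[key])
        elif style == "compact" and key == "ACGT":
            out.append("N")
        elif len(key) == 1:
            out.append(key)
        else:
            out.append(f"[{key}]")
    return "".join(out)


def find_matches(regex_consensus, sequence):
    """Return 0-based start positions of all matches, including overlapping
    ones (a lookahead keeps the regex engine from skipping overlaps)."""
    pattern = re.compile(f"(?=({regex_consensus}))")
    return [m.start() for m in pattern.finditer(sequence)]


sets = ["GC", "AT", "G", "A", "T", "A", "A", "G", "GAC"]
regex_form = render(sets, "regex")
iupac_form = render(sets, "IUPAC")
hits = find_matches("[GC][AT]GATAAG[GAC]", "TTGAGATAAGCTT")
```

Here the regex-style string can be passed directly to a regular-expression engine, whereas the IUPAC and compact forms would first need their ambiguity letters expanded back into character classes before scanning.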