Sophia S Liu, Adam J Hockenberry, Andrea Lancichinetti, Michael C Jewett, Luís A N Amaral.
Abstract
The existence of over- and under-represented sequence motifs in genomes provides evidence of selective evolutionary pressures on biological mechanisms such as transcription, translation, ligand-substrate binding, and host immunity. To identify motifs and other genome-scale patterns of interest accurately, it is essential to generate null models that are appropriate for the sequences under study. While many tools have been developed to create random nucleotide sequences, protein-coding sequences are subject to a unique set of constraints that complicates the process of generating appropriate null models. There are currently no tools available that allow users to create random coding sequences with specified amino acid composition and GC content for the purpose of hypothesis testing. Using the principle of maximum entropy, we developed a method that generates unbiased random sequences with pre-specified amino acid and GC content, which we have implemented as a Python package. Our method is the simplest way to obtain maximally unbiased random sequences that are subject to GC usage and primary amino acid sequence constraints. Furthermore, this approach can easily be expanded to create unbiased random sequences that incorporate more complicated constraints such as individual nucleotide usage or even di-nucleotide frequencies. The ability to generate correctly specified null models will allow researchers to accurately identify sequence motifs, which will lead to a better understanding of biological processes as well as more effective engineering of biological systems.
Year: 2016 PMID: 27835644 PMCID: PMC5106001 DOI: 10.1371/journal.pcbi.1005184
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Target nucleotide composition of test sequences.
| Nucleotide Usage | G% | C% | A% | T% |
|---|---|---|---|---|
| Uniform | 25 | 25 | 25 | 25 |
| GC Rich | 30 | 30 | 20 | 20 |
| AT Rich | 15 | 15 | 35 | 35 |
| C Rich | 20 | 40 | 20 | 20 |
Fig 1. The multinomial method does not generate random sequences with the desired nucleotide composition.
We tested the accuracy of the multinomial method by generating 500 sequences, each 2500 amino acids long, with uniform amino acid usage and four different target nucleotide compositions (unsaturated color). Our results (saturated color) demonstrate that the multinomial method attains neither the specified individual nucleotide composition nor the desired GC content.
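The caption does not spell out how the multinomial method assigns codon probabilities. As a minimal sketch, assume that each synonymous codon is weighted by the product of the target frequencies of its three nucleotides, and that amino acid usage is uniform. The helper names (`synonymous_codons`, `multinomial_expected_composition`) are illustrative and are not taken from the authors' package. Under that assumption the expected composition can be computed exactly, without sampling:

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks stop codons).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_codons():
    """Group sense codons by the amino acid they encode."""
    groups = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            groups.setdefault(aa, []).append(codon)
    return groups


def multinomial_expected_composition(target):
    """Expected nucleotide frequencies under the ASSUMED multinomial scheme.

    Assumption (not confirmed by the source): within each amino acid, a
    codon's weight is the product of the target frequencies of its three
    nucleotides. Amino acids are taken to be equally likely.
    """
    groups = synonymous_codons()
    expected = dict.fromkeys("ACGT", 0.0)
    for codons in groups.values():
        weights = [target[c[0]] * target[c[1]] * target[c[2]] for c in codons]
        total = sum(weights)
        for codon, weight in zip(codons, weights):
            for base in codon:
                # Each amino acid contributes 1/20 of the sequence, 3 bases per codon.
                expected[base] += (weight / total) / (3 * len(groups))
    return expected


# "GC Rich" row of the target composition table.
gc_rich = {"G": 0.30, "C": 0.30, "A": 0.20, "T": 0.20}
observed = multinomial_expected_composition(gc_rich)
print({base: round(freq, 3) for base, freq in observed.items()})
print("expected GC:", round(observed["G"] + observed["C"], 3))
```

Under this assumed scheme, the "GC Rich" target yields an expected GC content of about 50% rather than the intended 60%. The fixed amino acid usage restricts which nucleotides can appear at each codon position, so weighting codons by nucleotide frequency alone does not reproduce the requested composition.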
Fig 2. Random sequences generated using the maximum entropy approach are unbiased with a mean equal to the target GC content.
We generated 500 random sequences, each 2500 amino acids long, with equiprobable amino acid usage. Matching colors mark the target GC content (dashed line) and the observed GC content distribution.
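One way to realise a maximum entropy construction of this kind is sketched here under stated assumptions; it is not the authors' implementation. With the amino acid sequence fixed, the least biased distribution with a prescribed mean GC content weights each synonymous codon by exp(λ × GC count), where a single multiplier λ is tuned so that the expected GC content equals the target. The function names (`solve_lambda`, `sample_sequence`) and the use of bisection to find λ are choices made here for illustration.

```python
import math
import random
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks stop codons).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMS.setdefault(aa, []).append(codon)


def gc_count(codon):
    """Number of G or C bases in a codon (0 to 3)."""
    return sum(base in "GC" for base in codon)


def codon_probs(aa, lam):
    """Exponential-family codon probabilities for one amino acid."""
    weights = [math.exp(lam * gc_count(c)) for c in SYNONYMS[aa]]
    total = sum(weights)
    return [w / total for w in weights]


def expected_gc(protein, lam):
    """Expected GC fraction of a coding sequence for `protein` at multiplier lam."""
    total = 0.0
    for aa, n in Counter(protein).items():
        probs = codon_probs(aa, lam)
        total += n * sum(p * gc_count(c) for p, c in zip(probs, SYNONYMS[aa]))
    return total / (3 * len(protein))


def solve_lambda(protein, target_gc, lo=-30.0, hi=30.0, tol=1e-10):
    """Find lam by bisection; expected_gc increases monotonically with lam."""
    if not expected_gc(protein, lo) < target_gc < expected_gc(protein, hi):
        raise ValueError("target GC content is not attainable for this protein")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_gc(protein, mid) < target_gc:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def sample_sequence(protein, target_gc, rng=random):
    """Draw one random coding sequence for `protein` with mean GC = target_gc."""
    lam = solve_lambda(protein, target_gc)
    return "".join(
        rng.choices(SYNONYMS[aa], weights=codon_probs(aa, lam))[0] for aa in protein
    )


# Hypothetical test protein: 2500 residues with exactly uniform amino acid usage.
protein = "ACDEFGHIKLMNPQRSTVWY" * 125
seq = sample_sequence(protein, 0.50, random.Random(0))
print(len(seq), sum(b in "GC" for b in seq) / len(seq))
```

Each sampled sequence translates back to the input protein, and its GC fraction scatters around the target, since the constraint is imposed on the mean rather than on every individual sequence.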
Fig 3. The GC ratio of random sequences generated using the maximum entropy approach coincides exactly with the desired GC content over a wide range of GC ratios.
When generating nucleotide sequences from an amino acid sequence with uniform amino acid usage, we can accurately achieve a GC content in the range of 30% to 64% (top). By altering the amino acid composition of the translated sequence, lower and higher ranges of GC content can be obtained (middle and bottom). At each target GC content, we took the average GC content of 500 randomly generated sequences, each 2500 amino acids long. The y = x line (gray dotted line) indicates the ideal case. The simulated results for the multinomial and maximum entropy methods are shown as black jagged and solid lines, respectively.
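The dependence of the attainable GC range on amino acid composition can be checked directly from the standard genetic code. The sketch below picks, for every amino acid, the synonymous codon with the fewest or the most G/C bases; the function name `gc_bounds` and the AT-leaning example composition are hypothetical and serve only as illustration.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> one-letter amino acid ('*' marks stop codons).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def gc_bounds(amino_acid_freqs):
    """Lowest and highest mean GC fraction reachable by synonymous codon choice.

    `amino_acid_freqs` maps one-letter amino acid codes to frequencies
    that sum to 1.
    """
    low = high = 0.0
    for aa, freq in amino_acid_freqs.items():
        counts = [
            sum(base in "GC" for base in codon)
            for codon, encoded in CODON_TABLE.items()
            if encoded == aa
        ]
        low += freq * min(counts) / 3
        high += freq * max(counts) / 3
    return low, high


# Uniform usage of the 20 amino acids, as in the top panel.
uniform = {aa: 1 / 20 for aa in set(AMINO) - {"*"}}
print("uniform usage:", gc_bounds(uniform))

# Hypothetical composition restricted to amino acids with AT-rich codons.
at_leaning = {aa: 1 / 5 for aa in "FIKNY"}
print("AT-leaning usage:", gc_bounds(at_leaning))
```

For uniform usage the bounds come out close to the 30% to 64% range quoted in the caption, while the AT-leaning composition shifts the whole range downward, in line with the middle and bottom panels.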