Neil R Adames, Mandy L Wilson, Gang Fang, Matthew W Lux, Benjamin S Glick, Jean Peccoud.
Abstract
Synthetic biologists rely on databases of biological parts to design genetic devices and systems. The sequences and descriptions of genetic parts are often derived from features of previously described plasmids using ad hoc, error-prone and time-consuming curation processes because existing databases of plasmids and features are loosely organized. These databases often lack consistency in the way they identify and describe sequences. Furthermore, legacy bioinformatics file formats like GenBank do not provide enough information about the purpose of features. We have analyzed the annotations of a library of ∼2000 widely used plasmids to build a non-redundant database of plasmid features. We looked at the variability of plasmid features, their usage statistics and their distributions by feature type. We segmented the plasmid features by expression hosts. We derived a library of biological parts from the database of plasmid features. The library was formatted using the Synthetic Biology Open Language, an emerging standard developed to better organize libraries of genetic parts to facilitate synthetic biology workflows. As a proof of concept, the library was converted into GenoCAD grammar files to allow users to import and customize the library based on the needs of their research projects.
Year: 2015 PMID: 25925571 PMCID: PMC4446419 DOI: 10.1093/nar/gkv272
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Description of the different datasets
| Dataset | Numbers | Comment |
|---|---|---|
| SnapGene File Library | 1901 files | The entire collection of annotated sequence files available from the SnapGene web site. |
| Collections | 13 collections | The different groups of files in the SnapGene File Library (Supplementary Table S1). |
| Non-Redundant File Library | 1718 files | The SnapGene File Library after removal of duplicated sequences, feature orientation variants and topological variants of the same plasmid. |
| Non-Redundant Plasmid Library | 1557 plasmid files | Subset of the Non-Redundant File Library after removal of single feature sequence files. |
| Feature Library | 21 594 features | All the features extracted from the files in the Non-Redundant File Library. |
| Non-Redundant Feature Library | 2046 features | Content of the Feature Library after removing duplicate features found in multiple files. |
| Standard Features Library | 1943 features | Content of the Non-Redundant Feature Library after disambiguation of feature variants. |
| Expression Host | 12 hosts | The SnapGene files are associated with 12 different expression hosts but some files do not include any host information. |
| SBOL Files | 14 files | 12 files corresponding to each of the expression hosts, 1 file for features with unspecified hosts, and a file with all the features. |
| GenoCAD Grammar | 1967 parts; 17 parts libraries; 112 categories; 67 rules | The GenoCAD grammar includes all the Standard Features to which we added 24 new parts not annotated in the SnapGene files, removed 14 CDS that differ only by the stop codon from other parts, and added 4 sequence delimiter parts. Parts are organized in 112 categories and 17 libraries. Finally, 67 rules define relations between categories. |
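The step from the Feature Library to the Non-Redundant Feature Library removes duplicate features found in multiple files. The record does not say how duplicates were detected, so the following is only a minimal sketch under an assumption: two features count as duplicates when their sequences match exactly, ignoring case and strand. The function names and the example sequences are hypothetical and are not taken from the SnapGene files.

```python
from collections import Counter

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_key(seq: str) -> str:
    """Strand- and case-independent key (assumed duplicate criterion)."""
    forward = seq.upper()
    return min(forward, reverse_complement(forward))


def deduplicate(features):
    """Collapse (name, sequence) pairs into unique features.

    Returns a dict mapping each canonical sequence to the first name seen,
    and a Counter with the number of occurrences of each unique feature.
    """
    names = {}
    counts = Counter()
    for name, seq in features:
        key = canonical_key(seq)
        names.setdefault(key, name)
        counts[key] += 1
    return names, counts


# Made-up toy input: the first three entries are the same feature written
# in different case or on the opposite strand.
features = [
    ("promoter_x", "TTGACAATTAATCAT"),
    ("promoter_x", "ttgacaattaatcat"),
    ("promoter_x_rev", "ATGATTAATTGTCAA"),
    ("tag_y", "CATCACCATCACCATCAC"),
]

names, counts = deduplicate(features)
for key, name in names.items():
    print(name, counts[key])
```

The occurrence counter kept alongside the unique sequences is the kind of tally that usage statistics per feature would draw on. Near-identical variants are deliberately not merged here; telling them apart from the consensus is a separate disambiguation step.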
Figure 1. Correlation between plasmid length and number of features per plasmid. (A) All plasmids. Types of plasmid are indicated by color in the figure legends. Panels (B)–(F) are grouped by lab host with the specific type of plasmid indicated in color as in panel (A). Outliers with low or high feature densities are labeled. The outlined data points denote plasmids that had three or more additional features detected by SnapGene that were not annotated in the original downloaded files. The outlined circles show the original feature densities for these plasmids and the outlined triangles show the updated feature densities.
Statistics for non-coding and protein coding feature variants
| Feature | No. of variants [a] | No. of occurrences | No. of bp changes [b] | Total length (bp) | Changes/variant | Changes/1000 bp | Length-only variants [c] |
|---|---|---|---|---|---|---|---|
| Non-coding features | |||||||
| AmpR promoter | 12 | 1110 | 12 | 1154 | 1.0 | 10.4 | 3 (25.0%) |
| CMV enhancer | 15 | 519 | 15 | 4954 | 1.0 | 3.0 | 5 (35.7%) |
| CMV promoter | 10 | 511 | 29 | 2039 | 2.9 | 14.2 | 3 (30.0%) |
| SV40 promoter [d] | 23 | 897 | 28 | 4613 | 1.2 | 6.1 | 7 (30.4%) |
| f1/M13 ori | 22 | 651 | 85 | 9773 | 3.9 | 8.7 | 3 (13.6%) |
| ori | 22 | 1490 | 48 | 12 689 | 2.2 | 3.8 | 2 (9.1%) |
| IRES | 16 | 82 | 21 | 8767 | 1.3 | 2.4 | 4 (25.0%) |
| Total | 120 | 5260 | 238 | 43 989 | 2.0 | 5.4 | 27 (22.5%) |
| Coding features | |||||||
| AmpR/bla(M) | 23 | 1065 | 161 | 19 734 | 7.0 | 8.2 | 4 (17.4%) |
| CmR | 16 | 211 | 25 | 10 605 | 1.6 | 2.4 | 1 (6.3%) |
| HygR | 14 | 101 | 282 | 14 376 | 20.1 | 19.6 | 3 (21.4%) |
| KanR | 19 | 119 | 131 | 15 474 | 6.9 | 8.5 | 0 (0.0%) |
| NeoR/KanR | 23 | 354 | 66 | 18 312 | 2.9 | 3.6 | 2 (8.7%) |
| PuroR | 11 | 75 | 131 | 6627 | 11.9 | 19.8 | 0 (0.0%) |
| lacZ-α | 74 | 144 | 14 | 27 102 | 0.2 | 0.5 | 70 (95.0%) |
| MBP | 10 | 36 | 40 | 11 022 | 4.0 | 3.6 | 1 (10.0%) |
| Total [e] (excluding lacZ-α) | **116** | **1961** | **836** | **96 150** | **7.2** | **8.7** | **11 (9.5%)** |
| Total (including lacZ-α) | 190 | 2105 | 850 | 123 152 | 4.5 | 6.9 | 81 (42.6%) |
| | 24 | 280 | 106 | 15 394 | | | |
[a] After consolidation of identical features upon correction of sequence or annotation errors.
[b] Base pair changes relative to the consensus sequence, including missense mutations and indels, but excluding differences in feature borders.
[c] Variants that differ from the consensus only by their borders. This does not include variants missing only START or STOP codons.
[d] Includes all variants of SV40 ori, SV40 enhancer and SV40 promoter.
[e] Values in bold exclude lacZ-α variants as the majority of these differ only in their in-frame multiple cloning sites.
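The two ratio columns and the length-only percentage appear to follow directly from the count columns. As a minimal sketch, assuming that Changes/variant is the number of bp changes divided by the number of variants, that Changes/1000 bp is the number of bp changes divided by the total length and scaled by 1000, and that the length-only percentage is taken over the number of variants, the tabulated values for three rows can be recomputed as follows. The function name `variant_stats` is hypothetical; the input numbers are copied from the table above.

```python
def variant_stats(n_variants, n_bp_changes, total_length_bp, n_length_only):
    """Recompute the derived columns of the variant statistics table.

    The formulas are inferred from the column headers (an assumption),
    and results are rounded to one decimal place as in the table.
    """
    return {
        "changes_per_variant": round(n_bp_changes / n_variants, 1),
        "changes_per_1000_bp": round(1000 * n_bp_changes / total_length_bp, 1),
        "length_only_pct": round(100 * n_length_only / n_variants, 1),
    }


# (variants, bp changes, total length in bp, length-only variants)
rows = {
    "AmpR promoter": (12, 12, 1154, 3),
    "f1/M13 ori": (22, 85, 9773, 3),
    "HygR": (14, 282, 14376, 3),
}

stats = {name: variant_stats(*values) for name, values in rows.items()}
for name, result in stats.items():
    print(name, result)
```

For these three rows the recomputed values agree with the table (for example 1.0, 10.4 and 25.0% for the AmpR promoter), which supports the assumed formulas without proving that every row was derived this way.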
Types of variations in protein coding features
| Feature | Synonymous codon changes [a] | Conservative residue changes [a] | Non-conservative residue changes [a] | Variants with no aa changes [b] |
|---|---|---|---|---|
| AmpR/bla(M) | 73% | 17% | 10% | 39% |
| CmR | 84% | 16% | 0% | 81% |
| HygR | 87% | 1% | 12% | 57% |
| KanR | 54% | 6% | 40% | 32% |
| NeoR/KanR | 62% | 14% | 24% | 65% |
| PuroR | 94% | 1% | 5% | 73% |
| lacZ-α [c] | 93% | 7% | 0% | 95% |
| MBP | 27% | 2% | 71% | 60% |
| Total no. | 641 | 54 | 156 | 134 |
| Total Mean | 75% | 7% | 18% | 71% |
[a] Percentage of all bp changes including mismatches and indels but excluding border differences.
[b] Percentage of variants that produce no changes in the translated protein.
[c] Excluding the multiple cloning sites.
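The three categories in this table can be illustrated with a small classifier that compares a variant with its consensus codon by codon. This is a sketch only: the record does not define "conservative", so the residue groups below are an assumed physicochemical grouping, the function names are hypothetical, and indels (which the table does count) are not handled because the two sequences must be aligned, in frame and of equal length. The codon table is the standard genetic code.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# ASSUMPTION: residues within one group are treated as conservative swaps.
RESIDUE_GROUPS = [
    set("AVLIMGP"),  # nonpolar
    set("FWY"),      # aromatic
    set("STNQC"),    # polar, uncharged
    set("KRH"),      # positively charged
    set("DE"),       # negatively charged
]


def classify_codon_change(ref_codon: str, var_codon: str) -> str:
    """Classify one codon substitution relative to the consensus codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    var_aa = CODON_TABLE[var_codon.upper()]
    if ref_aa == var_aa:
        return "synonymous"
    if any(ref_aa in group and var_aa in group for group in RESIDUE_GROUPS):
        return "conservative"
    return "non-conservative"


def classify_variant(consensus: str, variant: str) -> Counter:
    """Tally the change types between two aligned, in-frame coding sequences."""
    if len(consensus) != len(variant) or len(consensus) % 3 != 0:
        raise ValueError("sequences must be in frame and of equal length")
    tally = Counter()
    for i in range(0, len(consensus), 3):
        ref_codon, var_codon = consensus[i:i + 3], variant[i:i + 3]
        if ref_codon.upper() != var_codon.upper():
            tally[classify_codon_change(ref_codon, var_codon)] += 1
    return tally


# Made-up example: M L D D stop versus M L E V stop.
tally = classify_variant("ATGCTTGATGATTAA", "ATGCTCGAAGTTTAA")
print(dict(tally))
```

A variant whose tally contains only synonymous changes would fall into the last column of the table, since it produces no change in the translated protein. Note that the sketch counts changed codons, whereas the table reports percentages of bp changes.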
Figure 2. Structure of a plasmid to tag S. cerevisiae genes with a fluorescent protein. (A) Map of the empty vector and insert derived from the GenBank files exported from GenoCAD. (B) Structure of the same plasmid represented using SBOLv icons.