Genomic structure of the cyanobacterium Synechocystis sp. PCC 6803 strain GT-S.

Naoyuki Tajima¹, Shusei Sato, Fumito Maruyama, Takakazu Kaneko, Naobumi V Sasaki, Ken Kurokawa, Hiroyuki Ohta, Yu Kanesaki, Hirofumi Yoshikawa, Satoshi Tabata, Masahiko Ikeuchi, Naoki Sato.

Abstract

Synechocystis sp. PCC 6803 is the most popular cyanobacterial strain, serving as a standard in the research fields of photosynthesis, stress response, metabolism and so on. A glucose-tolerant (GT) derivative of this strain was used for genome sequencing at Kazusa DNA Research Institute in 1996, which established a hallmark in the study of cyanobacteria. However, apparent differences in sequences deviating from the database have been noticed among different strain stocks. For this reason, we analysed the genomic sequence of another GT strain (GT-S) by 454 and partial Sanger sequencing. We found 22 putative single nucleotide polymorphisms (SNPs) in comparison to the published sequence of the Kazusa strain. However, Sanger sequencing of 36 direct PCR products of the Kazusa strains stored in small aliquots resulted in their identity with the GT-S sequence at 21 of the 22 sites, excluding the possibility of their being SNPs. In addition, we were able to combine five split open reading frames present in the database sequence, and to remove the C-terminus of an ORF. Aside from these, two of the Insertion Sequence elements were not present in the GT-S strain. We have thus become able to provide an accurate genomic sequence of Synechocystis sp. PCC 6803 for future studies on this important cyanobacterial strain.

Year: 2011 PMID： 21803841 PMCID： PMC3190959 DOI： 10.1093/dnares/dsr026

Source DB: PubMed Journal: DNA Res ISSN： 1340-2838 Impact factor: 4.458

Introduction

The nucleotide sequence of the genome of the cyanobacterium Synechocystis sp. PCC 6803 was determined by Kazusa DNA Research Institute in 1996 as the first genome of a photosynthetic organism.[1] Since then, this strain has served as a standard cyanobacterium in various areas of research, such as photosynthesis, stress response and metabolism.[2] However, the sequenced strain (called Kazusa strain in the present study) is different from the stock in Pasteur Culture Collection (called PCC strain in the present study). In fact, the Kazusa strain is a derivative of a ‘glucose-tolerant’ strain, which was obtained by J.G.K. Williams in DuPont Institute.[3] The published sequence of the Kazusa strain included some genes inactivated by a putative point mutation, a putative frame shift, or an Insertion Sequence (IS) insertion, such as the one in the pilC gene. The mutation within the coding sequence of the pilC gene was pointed out to be a possible reason for the non-motility of the Kazusa strain.[4] A 154 bp deletion was also found in the GT strain with respect to the PCC strain.[5] The location of some IS elements in the Kazusa strain is known to be different with respect to other GT and PCC strains.[6] Even within the PCC strains, different strains having different light responses have been isolated.[2] All these slightly different strains bear the common strain name PCC 6803, but we need to recognize the differences between the exact strains used in various studies. For this purpose, we will have to pinpoint the differences in the genome sequences of the various strains. One of the authors (N.S.) constructed 40 site-directed mutants in a previous work on comparative genomics of plants and cyanobacteria[7] using the laboratory stock of Synechocystis GT strain (called GT-S). We thought that this strain should be identical to the Kazusa strain, because it originated in the late 1980s from the strain owned by Dr T. Omata, which was also the source of the Kazusa strain.
However, in view of the small but significant differences in genome sequence as reported earlier, it was important to establish the genetic background of our strain to assess correctly the phenotype of the above-mentioned mutants. Therefore, we attempted to analyse the genome sequence of the strain GT-S and to compare it with the reference sequence of the Kazusa strain. We found significant differences with respect to the database sequence, but we were finally convinced that the differences in the real sequences were minimal.

Materials and methods

Strain and genomes

Synechocystis GT-S strain was originally a gift from Dr Tatsuo Omata (now at Nagoya University, then at Riken Institute) in the late 1980s, and has since been maintained in the Sato laboratory as frozen glycerol stocks. In the present study, we used the stock originally frozen in the early 1990s. The cells were grown in the BG-11 medium at 32°C with aeration as described before.[8] The cells were harvested by centrifugation, washed twice with 4 M NaI to remove extracellular polysaccharide, and then treated with lysozyme. DNA was released by treatment with proteinase K and sodium N-dodecanoylsarcosinate, extracted with phenol and chloroform and purified by CsCl ultracentrifugation.[9] As a reference, we also used an aliquot of the DNA of the original Kazusa strain, which had been stored as a stock in Kazusa DNA Research Institute.

Sequencing and data analysis

Genomic DNA was sheared by ultrasonic treatment and sequenced by a genome sequencer FLX instrument (Roche Diagnostics, Indianapolis, IN, USA) according to the manufacturer's protocol (this is usually referred to as ‘454 sequencing’). To find its genomic origin, namely, main genome or plasmids, each read was analysed by BLASTN[10] software version 2.2.18 using the sequences of the four plasmids as well as the main genome as targets (the accession numbers are given in Supplementary Table S5). The options were: -F F -e 0.0001 -v 2 -b 2 -m 8 -C F (no filtering, cut-off E-value = 0.0001, output and list sequences = 2, table-formatted output, no compositional adjustments). In the table-formatted output, only the first line corresponding to the highest identity was selected for each read, which was assigned to the genome shown therein. The authentic reads assigned to genomic DNA in this way were mapped onto the reference sequence of the Kazusa strain (GenBank and RefSeq accession numbers: BA000022 and NC_000911 for the main genome) by the inGAP software version 2.3.1.[11] Unfortunately, the details of the internal algorithm of the software are not clear, and there is no option related to the detection of SNPs. Therefore, all putative SNPs detected by default settings were analysed. Plasmids were also analysed by using the respectively assigned reads. A list of putative SNPs was obtained as an output. Homology of affected open reading frames (ORFs) with orthologues in other cyanobacteria was analysed by the cluster data of CyanoClust database[12] prepared by the Gclust software.[13] Processing of DNA and protein sequences was performed with the SISEQ software version 1.59.[14] Sequence alignments were constructed with the Clustal X software version 2.0.9.[15] Genomic sequence was manipulated by the Artemis software version 13.0.[16]
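The read-assignment step described above can be illustrated with a minimal Python sketch. It assumes the standard 12-column tab-separated layout of the BLAST `-m 8` output (query, subject, % identity, ...) and simply keeps the first line reported for each read, as the text describes. The function name `assign_reads` and the example rows are hypothetical; NC_000911 (main genome) and NC_005232 (pSYSX) are the only accession numbers taken from this paper. This is not the authors' script.

```python
import csv
import io


def assign_reads(blast_tabular_text):
    """Map each read ID to the subject ID of its first reported hit.

    Assumes BLAST tabular (-m 8) output: tab-separated, query ID in
    column 1 and subject ID in column 2, hits for a read listed together.
    """
    assignment = {}
    reader = csv.reader(io.StringIO(blast_tabular_text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        read_id, subject_id = row[0], row[1]
        # Keep only the first line seen for each read.
        assignment.setdefault(read_id, subject_id)
    return assignment


# Hypothetical example rows (values invented for illustration only).
example = "\n".join([
    "read1\tNC_000911\t99.50\t400\t2\t0\t1\t400\t1000\t1399\t0.0\t760",
    "read1\tNC_005232\t91.00\t200\t18\t0\t1\t200\t500\t699\t1e-70\t250",
    "read2\tNC_005232\t100.00\t380\t0\t0\t1\t380\t42\t421\t0.0\t750",
])

assignments = assign_reads(example)
main_genome_reads = [r for r, s in assignments.items() if s == "NC_000911"]
print(assignments)
print(main_genome_reads)
```

Only the reads collected in `main_genome_reads` would then be passed on to the mapping step against the main-genome reference.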

Sequence confirmation

For each putative SNP, a genomic region of 200–300 bp was amplified (see Supplementary Table S1 for primer sequences). For each putative IS element, a genomic region of ∼300 or 1500 bp was amplified (see Supplementary Table S2 for primer sequences). The longer regions were amplified to span repeated sequences. DNA templates of both GT-S and Kazusa strains were used. The products were sequenced by conventional Sanger sequencing, using the sequencing services of MACROGEN Japan Corp. (Tokyo, Japan) or FASMAC Co. Ltd. (Atsugi, Japan).

Results

Identification of SNPs

We obtained 197 912 reads having an average length of 399.3 bases for the GT-S strain by 454 sequencing. Without the preliminary classification of reads, 68 single-nucleotide polymorphisms (SNPs) were obtained for the main genome, but many of them were not correct, because of the presence of highly homologous genes in plasmids. Then the reads were allocated to the main genome and the four plasmids by homology analysis as described in Section 2.2. The 173 217 reads that were classified as reads for the main genome were mapped to the reference sequence NC_000911. Using the default settings of inGAP software (see Supplementary data for the list of options), the entire genome was covered by at least one read, except four small regions (Supplementary Table S3). The analysis of such gap regions was performed separately, as described below. As a result, 31 putative SNPs were detected by the inGAP analysis. All of them were selected as highly probable SNPs for experimental validation. Each of the putative SNPs was checked by PCR amplification and Sanger sequencing of both strands. Twenty-two SNPs (Table 1) were finally identified as the differences of the sequence of the GT-S strain with respect to the database sequence NC_000911 (identical to BA000022 with respect to the DNA sequence). To verify that these represent real differences of the two strains, we analysed, by Sanger sequencing, the DNA of the Kazusa strain, which had been stored in small aliquots. Surprisingly, all the putative SNP sites were found identical in the Kazusa strain and the GT-S strain except No. 8 (Table 1). The SNP No. 8 is the mutation within the pilC gene, which had been reported earlier.[4] The two putative SNPs in the psbA3 coding region were identical to the corresponding sites of the psbA2 gene. Since the correct psbA3 sequence had been published before the genome sequence,[17] these putative SNPs are probably sequencing artefacts in NC_000911. 
A putative SNP site in the psaA gene also matches the previously published sequence.[18] For the other cases, we have no clear explanation; they might be sequencing errors and/or mutations in the cosmid clones used in the original sequencing.
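As a rough orientation, the average sequencing depth implied by these numbers can be estimated with a few lines of Python. This is a back-of-the-envelope sketch, not a figure reported by the authors: it assumes that the mean read length of 399.3 bases (given for all reads) also holds for the main-genome reads, and it uses the approximate genome size of 3.6 Mb quoted in the Discussion.

```python
# Values from the text; the two assumptions are flagged in the comments.
reads_main_genome = 173_217   # reads assigned to the main genome
mean_read_length = 399.3      # bases; average over ALL reads (assumed to apply here)
genome_size = 3.6e6           # bases; approximate size quoted in the Discussion

# Average depth = total mapped bases / genome length.
average_depth = reads_main_genome * mean_read_length / genome_size
print(f"approximate average depth: {average_depth:.1f}x")
```

Under these assumptions the depth comes out at roughly 19-fold, which is consistent with almost the entire genome being covered by at least one read.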

Table 1.

List of putative SNPs

No.	Site	Gene	CyanoClust cluster no.	Database	GT-Kazusa	GT-S	Amino acid change	Annotation	Ref.
1	943495	psaA	16	G	A	A	V→I	P700 apoprotein subunit Ia	18
2	1012958	No gene	—	G	T	T	N/A	—
3	1364187	pyrF	784	A	G	G	None	Orotidine 5’ monophosphate decarboxylase
4	1819782	psbA3	18	A	G	G	None	Photosystem II D1 protein	17
5	1819788			A	G	G	None
6	2092571	sll0422	1760	A	T	T	L→ter	Asparaginase
7	2198893	sll0142	15	T	C	C	None	Cation or drug efflux system protein
8	2204584	gspF+pilC	917+7792	G	G	—	Frame shift	Pilin biogenesis protein	4
9	2301721	slr0168	6624	A	G	G	K→E	Hypothetical protein
10	2350285.5	No gene	—	—	A	A	N/A	—
11	2360245.5	slr0364	26765+19649	—	C	C	Frame shift	Hypothetical protein
12	2409244	sll0762	2611	C	—	—	Frame shift	Hypothetical protein
13	2419399	ycf22	779	T	—	—	Frame shift	Hypothetical protein
14	2544044.5	ssl0787	2596	—	C	C	Frame shift	Hypothetical protein
15	2602717	slr0468	31358	C	A	A	H→Q	Hypothetical protein
16	2602734	slr0468	31358	T	A	A	I→N	Hypothetical protein
17	2748897	No gene	—	C	T	T	N/A	—
18	3096187	ssr1175	796	T	C	C	I→T	Transposase
19	3110189	No gene	—	G	A	A	N/A	—
20	3110343	sll0665	1448	G	T	T	P→Q	Transposase
21	3142651	sps	2831	A	G	G	None	Sucrose phosphate synthase
22	3260096	No gene	—	C	—	—	N/A	—

GT-Kazusa and GT-S are Synechocystis sp. PCC 6803 strain GT in Kazusa DNA Research Institute and Sato Laboratory, respectively. ‘Site’ and ‘Database’ refer to the sequences in BA000022 or NC_000911. Insertion site numbers represent the last position before the insertion site + 0.5. N/A indicates that the amino acid change is not applicable because the SNP site is not in an ORF.
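The logic used to interpret Table 1 can be expressed as a short Python sketch: a site where GT-Kazusa and GT-S agree with each other but not with the database points to a discrepancy in the database sequence, whereas a site where the two strains disagree is a real strain difference. The base calls below are transcribed from Table 1; the variable names are hypothetical and "-" stands for an absent base.

```python
# (No., Database, GT-Kazusa, GT-S), transcribed from Table 1.
# "-" marks an absent base (insertion or deletion site).
TABLE1 = [
    (1, "G", "A", "A"), (2, "G", "T", "T"), (3, "A", "G", "G"),
    (4, "A", "G", "G"), (5, "A", "G", "G"), (6, "A", "T", "T"),
    (7, "T", "C", "C"), (8, "G", "G", "-"), (9, "A", "G", "G"),
    (10, "-", "A", "A"), (11, "-", "C", "C"), (12, "C", "-", "-"),
    (13, "T", "-", "-"), (14, "-", "C", "C"), (15, "C", "A", "A"),
    (16, "T", "A", "A"), (17, "C", "T", "T"), (18, "T", "C", "C"),
    (19, "G", "A", "A"), (20, "G", "T", "T"), (21, "A", "G", "G"),
    (22, "C", "-", "-"),
]

# Both strains agree but differ from the database: database discrepancy.
database_discrepancies = [
    no for no, db, kazusa, gts in TABLE1 if kazusa == gts and db != kazusa
]
# The two strains differ from each other: real strain difference.
strain_differences = [no for no, db, kazusa, gts in TABLE1 if kazusa != gts]

print(len(database_discrepancies), strain_differences)
```

This reproduces the counts given in the text: 21 sites are discrepancies in the database sequence, and only site No. 8 (pilC) distinguishes the two strains.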

Unfortunately, the obtained reads did not map perfectly onto the reference genome. In 14 short regions, no reads or at most two reads were mapped (Supplementary Tables S3 and S4). These regions were amplified by PCR for both GT-S and Kazusa strains (results not shown). Conventional sequencing of the PCR products confirmed that there is no sequence difference in 11 of these regions with respect to the database sequence. The remaining three regions having two reads were close to one another and located within a 3 kb region. Clean PCR amplification of this 3 kb region was not successful because of repeated sequences. However, the presence of two reads led us to tentatively conclude that there is no sequence difference in these regions.
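Regions of this kind can be located from a per-base depth profile. The following Python sketch is an illustration only: the function `low_coverage_regions`, the toy depth values and the way the profile is obtained are assumptions, and only the threshold of two reads comes from the text. It reports 1-based, inclusive runs of positions whose depth does not exceed the threshold.

```python
def low_coverage_regions(depth, max_depth=2):
    """Return 1-based inclusive (start, end) runs where depth <= max_depth."""
    regions = []
    start = None
    for pos, d in enumerate(depth, start=1):
        if d <= max_depth:
            if start is None:
                start = pos          # open a new low-coverage run
        elif start is not None:
            regions.append((start, pos - 1))  # close the current run
            start = None
    if start is not None:            # a run reaching the end of the profile
        regions.append((start, len(depth)))
    return regions


# Toy per-base depth profile (invented values).
toy_depth = [12, 9, 2, 0, 1, 8, 15, 2, 2, 20, 0]
regions = low_coverage_regions(toy_depth)
print(regions)
```

Each reported interval would then be a candidate for separate PCR amplification and Sanger sequencing, as was done for the regions listed in the Supplementary Tables.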

Analysis of plasmids

Plasmids were also analysed by inGAP mapping. There were no putative SNPs in pSYSM and pSYSG (Supplementary Table S5). In pSYSA, four sites were reported as putative SNPs, but all of them were sites covered by only two reads, one of which matched the database sequence. Therefore, these were not considered as SNPs in pSYSA. In pSYSX, four sites within or near the ssr6089 gene were detected as putative SNPs. Analysis using the CyanoClust database indicated that this plasmid contains 30 kb homologous regions, ssr6002–slr6038 and ssr6062–slr6094. The ssr6089 gene has a nearly identical homologue, ssr6030. However, the sequences corresponding to the four putative SNPs were identical in the two genes in the database sequence NC_005232. Therefore, the SNP calling was not due to mixing of reads for homologous genes. The SNPs could possibly represent mutations in the strain GT-S, but final validation is hampered by the high similarity of the long homologous regions.
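The reasoning applied to the pSYSA sites (two reads, one of which agrees with the database) amounts to a read-support filter. A minimal Python sketch of such a filter follows. The function name and both thresholds (`min_reads`, `min_fraction`) are hypothetical values chosen for illustration; they are not parameters of inGAP, nor criteria stated by the authors.

```python
def is_supported_variant(total_reads, variant_reads,
                         min_reads=3, min_fraction=0.8):
    """Decide whether a putative SNP has enough read support.

    total_reads   : number of reads covering the site
    variant_reads : number of those reads carrying the non-database base
    min_reads, min_fraction : hypothetical thresholds (assumptions)
    """
    if total_reads < min_reads:
        return False  # too few reads to judge the site
    return variant_reads / total_reads >= min_fraction


# A pSYSA-like site: two reads, one matching the database sequence.
print(is_supported_variant(total_reads=2, variant_reads=1))
# A well-covered site where nearly all reads carry the variant.
print(is_supported_variant(total_reads=20, variant_reads=19))
```

With these illustrative thresholds, a site covered by only two reads is rejected regardless of what the reads show, which mirrors the decision taken for pSYSA.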

Alteration of ORFs due to frame shift

There are five cases in which a single gene is split into a pair of genes as a result of frame shift. Figure 1A shows the site of putative SNP 12, namely the sll0762–sll0763 region. There is an extraneous C in the database sequence, and accordingly, the removal of this C results in fusion of the two ORFs. This new ORF encoding a hypothetical protein has well-conserved orthologues in other cyanobacteria (Anabaena, Cyanothece, Arthrospira etc.) as shown by the alignment of the cluster 2611 of the CyanoClust (Fig. 1B).
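The effect of a single extraneous base on an ORF can be shown with a toy example in Python. The sequence below is invented for illustration and is not the sll0762–sll0763 sequence; the translation uses the standard genetic code and stops at the first stop codon. Inserting one C shifts the reading frame and creates a premature stop, and removing it restores the full-length product.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate from the first base until the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy ORF (invented): ATG GCT GGT GAA CTG AAA TAA
corrected = "ATGGCTGGTGAACTGAAATAA"
# Same sequence with one extraneous C after the start codon.
with_extra_c = corrected[:3] + "C" + corrected[3:]

print(translate(with_extra_c))  # truncated product
print(translate(corrected))     # full-length product
```

In the real case the downstream portion is annotated as a second ORF, so removing the extra base fuses the two annotated ORFs into one, as described for sll0762 and sll0763.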

Figure 1.

Correction of ORFs due to a frame shift in the sll0762–sll0763 region. (A) Output of an SNP site in the reference sequence of the Kazusa strain by the inGAP software. The upper DNA sequence indicates the reference sequence of the Kazusa strain (GenBank and RefSeq accession numbers: BA000022 and NC_000911), and the lower DNA sequence indicates the sequence of the GT-S strain. Each arrow represents a gene. Each arrowhead indicates an SNP site. (B) New alignment with a corrected sequence. Homology of affected ORFs with corresponding sequences in other cyanobacteria was analysed by the CyanoClust database version 4, and the cluster 2611 was found. Sequences were retrieved and a new alignment was obtained by the Clustal X software. ‘New_Sequence’ indicates the corrected sequence. The arrowhead indicates the nucleotide variation detected as a putative SNP site.

To correct the database sequence to obtain the GT-S genome sequence, we should combine (i) slr0162 (gspF) and slr0163 (pilC), (ii) slr0364 and slr0366, (iii) sll0762 and sll0763 (this is described above), (iv) sll0751 (ycf22) and sll0752, and (v) ssl0787 and ssl0788 (Supplementary Figs S1 and S2). In addition, the extended C-terminus of Sll0422 protein should be removed after correction for the nucleotide change (Supplementary Fig. S1). All these changes except (i) also apply to the real sequence of Kazusa strain.

Large indels

We also checked large indels (insertions/deletions). The exact sites of insertion of various IS elements have already been analysed.[6] Among them, the ISY203b insertion between slr1862 and slr1863 and the ISY203g insertion between sll1473 and sll1475 were found in the Kazusa strain but not in the GT-S strain. The ISY203e insertion between ssl2982 and slr1636 was detected in both Kazusa and GT-S strains (Table 2) but not in another GT strain in the Ikeuchi laboratory. It has also been known that a 154 bp element upstream of the slr2031 gene is deleted in the GT strains.[5] This deletion was shared by all GT strains analysed in the present study.

Table 2.

List of ISY203s detected in GT strains

IS name	Transposase gene	Database	GT-Kazusa	GT-S
ISY203b	sll1780	Yes	Yes	No
ISY203e	slr1635	Yes	Yes	Yes
ISY203g	sll1474	Yes	Yes	No
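The comparison summarized in Table 2 can be written as a small Python sketch that lists the IS elements present in GT-Kazusa but absent from GT-S. The presence/absence values are transcribed from Table 2; the dictionary layout and variable names are hypothetical.

```python
# Presence (True) or absence (False) of each ISY203 copy, from Table 2.
IS_PRESENCE = {
    "ISY203b": {"Database": True, "GT-Kazusa": True, "GT-S": False},
    "ISY203e": {"Database": True, "GT-Kazusa": True, "GT-S": True},
    "ISY203g": {"Database": True, "GT-Kazusa": True, "GT-S": False},
}

# Elements found in GT-Kazusa but not in GT-S.
kazusa_only = sorted(
    name for name, present in IS_PRESENCE.items()
    if present["GT-Kazusa"] and not present["GT-S"]
)
print(kazusa_only)
```

The two elements returned, ISY203b and ISY203g, are the IS insertions that distinguish the Kazusa strain from GT-S.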

Finally validated differences of the two strains

All of the preceding description was based on comparison with the database sequence as the sole reference. Given that a number of corrections have to be made to the database sequence, we summarize our results as the differences between the real sequences of GT-S and GT-Kazusa. The two sequences are essentially identical except for a single frame-shift mutation in the pilC gene and two additional insertions of ISY203 in GT-Kazusa with respect to GT-S.

Discussion

The present study revealed that a significant number of differences are present between the database sequence and the genome sequences of laboratory strains of the same ‘species’ Synechocystis sp. PCC 6803. The detailed analysis using the genomic DNA of both Kazusa and GT-S strains indicated that 21 of the detected putative SNPs were, in fact, discrepancies in the database sequence rather than real differences between the two genomes. The final balance sheet indicates that we found a single nucleotide change and two IS insertions between the Kazusa strain and the GT-S strain. The time of separation of the two strains may be estimated as the mid-1980s according to the recollections of the people concerned, which are now rather vague. The time interval until the DNA isolation for sequencing may be roughly estimated as about 10 years for the Kazusa strain. The GT-S strain was stocked in the early 1990s, and re-plated in 2010 for the present analysis. The effective time interval from the separation of the two sub-strains was also about 10 years. The results suggest that nucleotide change (mutation) could be kept to a minimum (only one, in this case) if due attention is paid to the maintenance of strains, but IS mobilization may be more frequent (two events). The rapid mobilization of IS could be limited to the particular element ISY203, but we do not know the actual trigger of activation of this IS element. We, therefore, should be careful about IS activation in the maintenance of laboratory stocks. We will need a convenient way of detecting a mobilized ISY203 to be sure about our research using the GT strain. The nucleotide changes as a result of re-sequencing caused significant effects on gene annotation. As mentioned, five genes had been thought to be split into two by a single nucleotide difference before this analysis. The length of another gene was also changed.
The IS element inserted in the sll1474 (ccaS) gene is known to inactivate it.[6] Altogether, the nucleotide changes (whether sequencing errors or real mutations) have an important impact on molecular biological research using cyanobacteria or other bacteria. A single run of new-generation sequencing with some additional PCR experiments can establish the identity of the organism that is being used in the laboratory. This will become a standard of molecular genetics in microbiology. The genomic database is very important not only in experimental studies but also in computational analysis. The use of a correct sequence is a prerequisite for detailed comparative genomics research. Twenty-one sites in a 3.6 Mb genome is a significantly large number for the present-day level of genome analysis. The correction of the standard sequence will be especially useful in Synechocystis sp. PCC 6803, which is a standard cyanobacterial strain in various areas of research such as photosynthesis and stress response among others. We hope our data deposited as a new separate entry will be useful for all those who are using this cyanobacterium in their research.

Databases

The genome sequence of the strain GT-S was deposited in the DDBJ/GenBank/EMBL database under the accession number AP012205.

Supplementary data

Supplementary data are available at www.dnaresearch.oxfordjournals.org.

Funding

This work was supported in part by the Global Center of Excellence (GCOE) Program ‘From the Earth to “Earths”’ from the MEXT, Japan.

15 in total

1. Experimental analysis of recently transposed insertion sequences in the cyanobacterium Synechocystis sp. PCC 6803.

Authors: S Okamoto; M Ikeuchi; M Ohmori
Journal: DNA Res Date: 1999-10-29 Impact factor: 4.458

2. Artemis: sequence visualization and annotation.

Authors: K Rutherford; J Parkhill; J Crook; T Horsnell; P Rice; M A Rajandream; B Barrell
Journal: Bioinformatics Date: 2000-10 Impact factor: 6.937

3. SISEQ: manipulation of multiple sequence and large database files for common platforms.

Authors: N Sato
Journal: Bioinformatics Date: 2000-02 Impact factor: 6.937

4. Type IV pilus biogenesis and motility in the cyanobacterium Synechocystis sp. PCC6803.

Authors: D Bhaya; N R Bianco; D Bryant; A Grossman
Journal: Mol Microbiol Date: 2000-08 Impact factor: 3.501

5. Nucleotide sequence of the psbA3 gene from the cyanobacterium Synechocystis PCC 6803.

Authors: J Metz; P Nixon; B Diner
Journal: Nucleic Acids Res Date: 1990-11-25 Impact factor: 16.971

6. Orthogenomics of photosynthetic organisms: bioinformatic and experimental analysis of chloroplast proteins of endosymbiont origin in Arabidopsis and their counterparts in Synechocystis.

Authors: Masayuki Ishikawa; Makoto Fujiwara; Kintake Sonoike; Naoki Sato
Journal: Plant Cell Physiol Date: 2009-02-18 Impact factor: 4.927

7. Gclust: trans-kingdom classification of proteins using automatic individual threshold setting.

Authors: Naoki Sato
Journal: Bioinformatics Date: 2009-01-21 Impact factor: 6.937

8. DNA transformation.

Authors: R D Porter
Journal: Methods Enzymol Date: 1988 Impact factor: 1.600

9. CyanoClust: comparative genome resources of cyanobacteria and plastids.

Authors: Naobumi V Sasaki; Naoki Sato
Journal: Database (Oxford) Date: 2010-01-08 Impact factor: 3.451

10. inGAP: an integrated next-generation genome analysis pipeline.

Authors: Ji Qi; Fangqing Zhao; Anne Buboltz; Stephan C Schuster
Journal: Bioinformatics Date: 2009-10-30 Impact factor: 6.937
