Literature DB >> 31964893

Four high-quality draft genome assemblies of the marine heterotrophic nanoflagellate Cafeteria roenbergensis.

Thomas Hackl1,2, Roman Martin3,4, Karina Barenhoff5, Sarah Duponchel5, Dominik Heider3, Matthias G Fischer6.   

Abstract

The heterotrophic stramenopile Cafeteria roenbergensis is a globally distributed marine bacterivorous protist. This unicellular flagellate is host to the giant DNA virus CroV and the virophage mavirus. We sequenced the genomes of four cultured C. roenbergensis strains and generated 23.53 Gb of Illumina MiSeq data (99-282 × coverage per strain) and 5.09 Gb of PacBio RSII data (13-45 × coverage). Using the Canu assembler and customized curation procedures, we obtained high-quality draft genome assemblies with a total length of 34-36 Mbp per strain and contig N50 lengths of 148 kbp to 464 kbp. The C. roenbergensis genome has a GC content of ~70%, a repeat content of ~28%, and is predicted to contain approximately 7857-8483 protein-coding genes based on a combination of de novo, homology-based and transcriptome-supported annotation. These first high-quality genome assemblies of a bicosoecid fill an important gap in sequenced stramenopile representatives and enable a more detailed evolutionary analysis of heterotrophic protists.

Entities:  

Mesh:

Year:  2020        PMID: 31964893      PMCID: PMC6972860          DOI: 10.1038/s41597-020-0363-4

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background & Summary

The diversity of eukaryotes lies largely among its unicellular members, the protists. Yet, genomic exploration of eukaryotic microbes lags behind that of animals, plants, and fungi[1]. One of these neglected groups is the Bicosoecida within the Stramenopiles, which contains the widespread marine heterotrophic flagellate, Cafeteria roenbergensis[2-6]. The aloricate biflagellated cells lack plastids and feed on bacteria and viruses by phagocytosis[2]. C. roenbergensis has been used a model system for many years to study protistan grazing on bacteria[7-9]. The organism appears to be diploid based on sequencing data analysis[10], reproduces by binary fission and has no known sexual cycle. To our knowledge, the only other bicosoecid with a sequenced genome is Halocafeteria seosinensis[11], and the most closely related sequenced organisms from other stramenopile groups are the Placidozoa Blastocystis hominis and Incisomonas marina, as well as members of the Labyrinthulea and the more distant Oomycota and Ochrophyta[12,13]. In addition to these cultured representatives, single-cell genomics of uncultured marine stramenopiles increasingly illuminates the genomic landscape of this diverse group of protistan grazers[14,15]. The phylogenetic position of Cafeteria at the base of stramenopiles and the paucity of genomic data among unicellular heterotrophic grazers thus make C. roenbergensis an interesting object for genomic studies. Heterotrophic flagellates of the Cafeteria genus are subject to infection by various viruses, including the lytic giant Cafeteria roenbergensis virus (CroV, family Mimiviridae) and its associated virophage mavirus (family Lavidaviridae)[16-18]. We recently showed that mavirus can exist as an integrated provirophage in C. roenbergensis and provide resistance against CroV infection on a host-population level[10]. Genomic studies of Cafeteria will reveal new insight into the importance of endogenous viral elements for the evolution and ecology of this group. Here we present whole-genome shotgun sequencing data and high-quality assemblies of four cultured clonal strains of C. roenbergensis: E4-10P, BVI, Cflag and RCC970-E3. The strains were individually isolated from four different locations (Fig. 1a). CrCflag and CrBVI were obtained from coastal waters of the Atlantic Ocean at Woods Hole, MA, USA (1986) and the British Virgin Islands (2012). CrE4-10P was collected from Pacific coastal waters near Yaquina Bay, Oregon, USA (1989). CrRCC970-E3 was obtained from open ocean waters of the South Pacific, collected about 2200 km off the coast of Chile during the BIOSOPE cruise[19] (2004) (Table 1). In addition, we also sequenced strain CrE4-10M1, an isogenic variant of CrE4-10P carrying additional integrated mavirus genomes previously described by Fischer and Hackl[10]. CrE4-10M1 read data was used to support the CrE4-10P genome assembly after mavirus-containing data was removed.
Fig. 1

Sampling locations and phylogenetic relationship of Cafeteria roenbergensis strains. (a) Map representing the sampling sites of the four C. roenbergensis strains around the Americas. (b) Maximum likelihood tree reconstructed from a concatenated alignment of 123 shared single-copy core genes for the four C. roenbergensis strains and their outgroup Halocafeteria seosinensis. Numbers next to internal nodes indicate bootstrap support based on 100 iterations. The branch to the outgroup represented by a dashed line has been shortened for visualization.

Table 1

Strain and sample information.

SpeciesStrainLocationCoordinatesYearBiosampleRoscoff ID
C. roenbergensisE4-10PNorth Pacific, 5 km west of Yaquina Bay, Oregon44.62N 124.06W1989SAMN12216681RCC:4624
C. roenbergensisE4-10M1North Pacific, 5 km west of Yaquina Bay, Oregon44.62N 124.06W1989SAMN12216695RCC:4625
C. roenbergensisBVIThe British Virgin Islands18.42N 64.61W2012SAMN12216698
C. roenbergensisCflagNorth Atlantic Ocean, Woods Hole, MA41.52N 70.67W1986SAMN12216699
C. roenbergensisRCC970-E3South Pacific Ocean, 2200 km off the coast of Chile30.78S 95.43W2004SAMN12216700RCC:4623
Sampling locations and phylogenetic relationship of Cafeteria roenbergensis strains. (a) Map representing the sampling sites of the four C. roenbergensis strains around the Americas. (b) Maximum likelihood tree reconstructed from a concatenated alignment of 123 shared single-copy core genes for the four C. roenbergensis strains and their outgroup Halocafeteria seosinensis. Numbers next to internal nodes indicate bootstrap support based on 100 iterations. The branch to the outgroup represented by a dashed line has been shortened for visualization. Strain and sample information. Overall we generated 23.53 Gbp of raw short read data on an Illumina MiSeq platform with 99–282 × coverage per strain, and 5.09 Gbp of raw long read data on a PacBio RS II platform with 13–45 × coverage per strain (Table 2)[20]. Based on 19-mers frequencies, we estimate a haploid genome size for C. roenbergensis of approximately 40 Mbp (Fig. 2). We first generated various draft assemblies with different assembly strategies and picked the best drafts for further refinement (see Technical Validation)[21]. After decontamination, assembly curation and polishing, we obtained four improved high-quality draft assemblies with 34–36 Mbp in size and contig N50s of 148–460 kbp (Table 3)[22-25]. The genomes have a GC-content of 70–71%, and 28% of the overall sequences were marked as repetitive.
Table 2

Sequencing information and library statistics.

Cafeteria strainInstrumentLibrary layout# LibrariesLibrary size (Gbp)CoverageSRA study accessionSRA run accession
E4-10PIllumina MiSeqpaired26.85171SRP215872SRR9724619
E4-10PPacBio RS IIsingle20.5213SRP215872SRR9724618
E4-10M1Illumina MiSeqpaired24.45111SRP215872SRR9724621
E4-10M1PacBio RS IIsingle21.332SRP215872SRR9724620
BVIIllumina MiSeqpaired14.31108SRP215872SRR9724615
BVIPacBio RS IIsingle31.845SRP215872SRR9724614
CflagIllumina MiSeqpaired13.9499SRP215872SRR9724617
CflagPacBio RS IIsingle20.9524SRP215872SRR9724616
RCC970-E3Illumina MiSeqpaired13.98100SRP215872SRR9724623
RCC970-E3PacBio RS IIsingle20.5213SRP215872SRR9724622
Fig. 2

K-mer frequency distribution and estimated genome size of Cafeteria roenbergensis strain E4-10P. Frequency distribution of 19-mers in the quality-trimmed MiSeq read set of CrE4-10P. The major peak at ~120 × coverage corresponds to the majority of homozygous k-mers of the diploid (2n) genome, the smaller peak at half the coverage comprises haplotype-specific (1n) k-mers. Small peaks at 3n and 4n represent regions of higher copy numbers. Low-coverage k-mers derive from sequencing errors and bacterial contamination. Cumulatively, the k-mer distribution suggests an approximate haploid genome size of 40 Mbp.

Table 3

Assembly and annotation statistics.

Assembly# ContigsTotal size (bp)Contig N50 (bp)% GC% Repeats% BUSCOs# ProteinsGenbank Accession
CrE4-10P21835,335,825402,89270.527.883.88364VLTO01000000
CrBVI17036,327,047460,46770.127.783.28483VLTN01000000
CrCflag27034,521,237231,39470.527.982.88018VLTM01000000
CrRCC970-E339633,988,271148,31170.527.981.87857VLTL01000000
Sequencing information and library statistics. K-mer frequency distribution and estimated genome size of Cafeteria roenbergensis strain E4-10P. Frequency distribution of 19-mers in the quality-trimmed MiSeq read set of CrE4-10P. The major peak at ~120 × coverage corresponds to the majority of homozygous k-mers of the diploid (2n) genome, the smaller peak at half the coverage comprises haplotype-specific (1n) k-mers. Small peaks at 3n and 4n represent regions of higher copy numbers. Low-coverage k-mers derive from sequencing errors and bacterial contamination. Cumulatively, the k-mer distribution suggests an approximate haploid genome size of 40 Mbp. Assembly and annotation statistics. We annotated 82–84% of universal eukaryotic single-copy marker genes in each genome (Fig. 3). The majority of the missing markers are consistently absent from all four genomes, suggesting poor representation in reference databases or the complete lack of these genes from the group rather than problems with the quality of the underlying assemblies as the most likely explanation. A maximum-likelihood phylogeny reconstructed from a concatenated alignment of 123 shared single-copy markers suggests that CrRCC970-E3 and CrCflag diverged most recently (Fig. 1b). The exact placement of the other two strains within the group could only be determined with low bootstrap support. These observations are consistent with average nucleotide identity (ANI) and comparisons of the ribosomal DNA operon. All four strains are within 99% ANI, with CrCflag and CrRCC970-E3 being most similar (99.67%). All other pairwise comparisons are within 99.05% to 99.22% ANI (Supplementary Fig. 1)[21]. The 18S rDNA sequences of all four strains are identical. For the full operon, we observe small differences in the two more divergent strains CrBVI and CrE4-10P (CrBVI: 1 substitution in the 28S, 2 in the ITS1; CrE4-10P: 1 substitution in the ITS2). We also found the gene copy number of the operon to vary widely amongst the strains. Based on read coverage we estimate that CrCflag carries 18 haploid copies of the operon, CrRCC970-E3 21, CrBVI 63 and CrE4-10P 83. Although variation on this scale has been observed as a common feature of marine eukaryotic plankton[26], it is surprising to find this much variation among such closely related strains. Finally, we identified 4 single nucleotide polymorphisms (SNPs) in the rDNA operon across all four strains; 3 unique to CrBVI, 1 unique in CrE4-10P. These 4 SNPs match the sites of the 4 divergent substitutions described above. Interestingly, at all four sites, we observe that the minor alternative variant with a relative frequency of 26–27% corresponds to the nucleotide found at this position in the other 3 strains.
Fig. 3

Completeness assessment of Cafeteria genome assemblies based on single-copy orthologs. (a) Abundance of 303 single-copy core gene markers (BUSCOs) in different categories and assemblies. (b) Distribution of BUSCOs missing in at least one assembly (black tiles).

Completeness assessment of Cafeteria genome assemblies based on single-copy orthologs. (a) Abundance of 303 single-copy core gene markers (BUSCOs) in different categories and assemblies. (b) Distribution of BUSCOs missing in at least one assembly (black tiles). To analyze the genomic capabilities of C. roenbergensis, we annotated 7857–8483 protein-coding genes per strain using a combination of de novo, homology- and RNA-seq-supported gene prediction and homology-based functional assignments against UniProtKB/Swiss-Prot and the EggNOG database. The nuclear genomes contained on average 1.67–1.89 introns per gene. In addition, all four assemblies comprise a curated circular mitochondrial genome that was annotated independently using a non-standard genetic code with UGA coding for tryptophan instead of a stop codon[27]. We anticipate that the genomic data presented here will enable detailed studies of C. roenbergensis to shed more light on the biology of this ecologically important group of marine grazers. For instance, no sexual modes of reproduction have been described for C. roenbergensis, which does not preclude their existence. The Cafeteria genome may thus provide insights into possible sexual processes and how its evolution is influenced by mobile genetic elements.

Methods

Strain maintenance, sample preparation, and sequencing

We selected and sequenced four strains of C. roenbergensis isolated from different locations in the Atlantic (Woods Hole, MA, USA; British Virgin Islands) and the Pacific (Yaquina Bay, OR, USA; South Pacific Ocean, 2200 km off the coast of Chile) (Table 1). All flagellate strains were grown in f/2 artificial seawater medium in the presence of either sterilized wheat grains (stock cultures) or 0.05% (w/v) yeast extract (for rapid growth), to stimulate the growth of a mixed bacterial community, which serve as a food source for C. roenbergensis. Each C. roenbergensis culture was subject to three consecutive rounds of single-cell dilution to obtain clonal strains as described previously[10]. For genome sequencing, suspension cultures of 2 L for each strain were grown to approximately 1 × 106 cells/mL, then diluted two-fold with antibiotics-containing medium (30 µg/mL Streptomycin, 60 µg/mL Neomycin, 50 µg/mL Kanamycin, 50 µg/mL Ampicillin, 25 µg/mL Chloramphenicol) and incubated for 24 h at 22 °C and 60 rpm shaking to reduce the bacterial load. Cultures were filtered through a 100 μm Nitex mesh to remove larger aggregates and centrifuged in various steps to further remove bacteria. First, the cultures were centrifuged for 40 min at 6000 × g and 20 °C (F9 rotor, Sorvall Lynx centrifuge), and the cell pellets were resuspended in 50 mL of f/2 artificial seawater medium, transferred to 50 mL polycarbonate tubes and centrifuged for 10 min at 4500 × g, 20 °C in an Eppendorf 5804 R centrifuge. The supernatant was discarded and the cell pellet was resuspended in 50 mL of PBS (phosphate-buffered saline) medium. This washing procedure was repeated 10 times until the supernatant was clear, indicating that most bacteria had been removed. In the end, the flagellates were pelleted and resuspended in 2 mL of PBS medium. Genomic DNA from approximately 1 × 109 cells of strains was isolated using the Blood & Cell Culture DNA Midi Kit (Qiagen, Hilden, Germany). The genomes were sequenced on an Illumina MiSeq platform (Illumina, San Diego, California, USA) using the MiSeq reagent kit version 3 at 2 × 300-bp read length configuration. The E4-10P genome was sequenced by GATC Biotech AG (Constance, Germany) with the standard MiSeq protocol. The E4-10M1, BVI, Cflag and RCC970-E3 genomes were prepared and sequenced at the Max Planck Genome Centre (Cologne, Germany) with NEBNext High-Fidelity 2 × PCR Master Mix chemistry and a reduced number of enrichment PCR cycles (six) to reduce AT-bias. We also sequenced genomic DNA of all strains on a Pacific Biosciences RS II platform (2–3 SMRT cells each, Max Planck Genome Centre, Cologne, Germany).

Assembly, decontamination, and refinement

MiSeq reads were trimmed for low-quality bases and adapter contamination using Trimmomatic[28]. PacBio reads were extracted from the raw data files with DEXTRACTOR[29]. Proovread[30] was used for the hybrid correction of the PacBio reads with the respective trimmed MiSeq read sets. The K-mer analysis was carried out with jellyfish[31] and custom R scripts to plot the distribution and estimate the genome size[21]. To determine the best assembly strategy, we assessed draft assemblies generated with different approaches as described in the Technical Validation section. The improved high-quality drafts presented here were assembled using Canu v1.8[32] from raw PacBio reads only for CrE4-10P, CrBVI, and CrCflag, and from raw and Illumina-corrected PacBio reads for CrRCC970-E3. For the latter strain, raw and corrected versions of the same PacBio reads were used together to mitigate low PacBio coverage in this particular sample and obtain a more contiguous assembly. Following the initial assembly, we used Redundans[33] to remove redundant contigs, which were reconstructed as individual alleles due to high heterozygosity. To reduce misassemblies, we further broke up contigs at unexpected drops in coverage based on reads mapped with minimap2[34] and identified with the custom Perl script bam-junctions[21] and bedtools[35]. After exploring different approaches (see Technical Validation) bacterial contamination was identified and removed based on taxonomic assignments generated with Kaiju[36] and the script tax-resolve[21] using the ETE 3 python library[37]. To obtain these assignments, each contig was split into 500 bp fragments, which we classified against the NCBI non-redundant protein database. Contigs with more than 50% of fragments annotated as bacteria were excluded from the assembly. Finally, to remove base-level errors we polished the assemblies in two rounds by mapping back first PacBio, then Illumina reads with minimap2[34], and by generating consensus sequences from the mappings with Racon[38].

Gene prediction and functional annotation

Repetitive regions were detected with WindowMasker[39]. Only repetitive regions with a minimum length of 100 bp were retained. tRNA genes were predicted with tRNAscan2.0[40] with a minimum score of 70. Gene prediction was performed with the BRAKER pipeline[41,42], which utilizes BLAST[43,44], Augustus[45,46] and GeneMark-ES[47,48]. Augustus and GeneMark-ES gene models were trained with publicly available transcriptomic data of C. roenbergensis E4-10P as extrinsic evidence (http://datacommons.cyverse.org/browse/iplant/home/shared/imicrobe/projects/104/samples/2026)[49]. Prior to gene prediction, splice sites were detected by HISAT[50] and processed with samtools[51,52]. Protein functions were assigned with two different approaches: (1) by blastx best hit against the UniProtKB/Swiss-Prot database v14/01/2019[53] and (2) by eggNOG-mapper[54] best match against the EggNOG v4.5.1 database[55]. Only results with an E-value of 10−3 or lower were retained. In addition, blastx hits with bitscores below 250, percentage identities below 30, or raw scores below 70 were ignored.

Mitochondrial genome curation and annotation

To obtain a complete and correctly annotated mitochondrial genome, we first mapped the whole genome assemblies against an existing C. roenbergensis mitochondrial reference genome (NCBI accession NC_000946.1) to identify the mitochondrial contig using minimap2[34]. We then extracted the contig, trimmed overlapping ends of the circular sequence with seq-circ-trim and reset the start to the same location as the reference genome - the large subunit ribosomal RNA gene - with seq-circ-restart[21]. Gene annotation was carried out with Prokka[56] and with an adjusted non-standard translation code. Predicted tRNA and coding genes completely overlapping other coding regions were manually removed guided by the reference genome annotation.

PCR and reverse-transcription PCR conditions

Genomic DNA (gDNA) was extracted from 200 μl of suspension culture with the QIAamp DNA Mini kit (Qiagen, Hilden, Germany) following the manufacturer’s instructions for DNA purification of total DNA from cultured cells, with a single elution step in 100 μl of double-distilled (dd) H2O and storage at −20 °C. For extraction of total RNA, 500 μl of suspension culture were centrifuged for 5 min at 4,500 g, 4 °C. The supernatants were discarded and the cell pellets were immediately flash-frozen in N2(l) and stored at −80 °C until further use. RNA extraction was performed with the Qiagen RNeasy Mini Kit following the protocol for purification of total RNA from animal cells using spin technology. Cells were disrupted with QIAshredder homogenizer spin columns and an on-column DNase I digest was performed with the Qiagen RNase-Free DNase Set. RNA was eluted in 50 μl of RNase-free molecular biology grade water. The RNA was then treated with 1 μl TURBO DNase (2 U/μl) for 30 min at 37 °C according to the manufacturer’s instructions (Ambion via ThermoFisher Scientific, Germany). RNA samples were analyzed for quantity and integrity with a Qubit 4 Fluorometer (Invitrogen via ThermoFisher Scientific, Germany) using the RNA Broad Range and RNA Integrity and Quality kit respectively. For cDNA synthesis, 6 μl of each RNA sample was reverse transcribed using the Qiagen QuantiTect Reverse Transcription Kit according to the manufacturer’s instructions. This protocol included an additional DNase treatment step and the reverse transcription reaction using a mix of random hexamers and oligo(dT) primers. Control reactions to test for gDNA contamination were done for all samples by adding ddH2O instead of reverse transcriptase to the reaction mix. The cDNA was diluted fivefold with RNase-free H2O and analyzed by PCR with gene-specific primers (Table 4).
Table 4

Primers used for the validation of the intron-exon structure in two genes of C. roenbergensis strain RCC970-E3.

Primer nameForward (5′ to 3′)Reverse (5′ to 3′)
TBP-PCR#1CCGCGATGCTTCTGCCTCCACGCGCAGTCGAGATTCACAGT
TBP-PCR#2GCCATCACCAAGCACGGGATCACGCGCAGTCGAGATTCACAGT
60sRP-PCR#3CGCAACCAGACCAAGTTCCACGGTACGCCAGAGCATGCGGGA
60sRP-PCR#4CGCACTGAGGAGGTGAACGTCGCGGGTTGGTGTTCCGCTTC
Primers used for the validation of the intron-exon structure in two genes of C. roenbergensis strain RCC970-E3. PCR amplifications were performed using 2 ng of gDNA template or 2 μl of the diluted cDNA in a 20 μl reaction mix containing 10 μl Platinum™ II Hot-Start PCR Master Mix (Invitrogen via ThermoFisher Scientific, Germany), 4 μl Platinum GC Enhancer and 0.2 μM of each primer. The following cycling conditions were used in a ProFlex PCR System (Applied Biosystems via ThermoFisher Scientific, Germany): 2 min denaturation at 94 °C and 35 cycles of 15 s denaturation at 94 °C, 20 s annealing at 60 °C (for all primers) and 20 s extension at 68 °C. For product analysis, 1 μl of each reaction were mixed with loading dye and pipetted on a 1% (w/v) agarose gel supplemented with GelRed. The marker lanes contained 0.5 μg of GeneRuler 100 bp DNA Ladder (Fermentas, Thermo-Fisher Scientific, USA). The gel was electrophoresed for 1 h at 100 V and visualized on a ChemiDoc MP Imaging System (BioRad, Germany).

Comparative and phylogenetic analysis

Phylogenetic relationships among the C. roenbergensis strains and Halocafeteria seosinensis (Genbank accession LVLI00000000) were reconstructed from the concatenated alignment of 123 shared single-copy orthologous genes. Orthologs were identified with BUSCO[57,58] with the eukaryotic lineage dataset. Only orthologs present in all five genomes with a BUSCO score of at least 125 and a minimum covering sequence length of 125 bp were taken into account. Orthologous protein sequences were aligned with MAFFT[59] and trimmed for poorly aligned regions with trimAl[60]. The phylogenetic tree was computed with RAxML using the GAMMA model of rate heterogeneity and automatically determined amino acid substitution models for each partition. The bootstrap confidence values were computed with 100 iterations of rapid bootstrapping. The tree was rooted with H. seosinensis as the phylogenetic outgroup using phytools[61] and visualized with ggtree[62]. The rDNA operon was annotated with barrnap (https://github.com/tseemann/barrnap), sequences processed with seqkit[63] and aligned with muscle[64]. Reads were aligned with minimap2[34] for coverage estimation using samtools and SNP calling with bcftools[65]. The haploid copy number of the rDNA operon was estimated by comparing the median coverage across the operon to the median coverage across the 50 longest contigs per assembly. Average nucleotide identities (ANI) were obtained with fastANI[66].

Data Records

The raw Illumina and PacBio sequencing reads are available from the NCBI Sequence Read Archive[20]. Accession numbers, library size, and coverage statistics can be found in Table 2. The curated and annotated assemblies for CrE4-10P, CrBVI, CrCflag and CrRCC970-E3 have been deposited as Whole Genome Shotgun projects at DDBJ/ENA/GenBank under the accessions VLTO00000000, VLTN00000000, VLTM00000000, VLTL00000000. The versions described in this paper are VLTO01000000[22], VLTN01000000[23], VLTM01000000[24], VLTL01000000[25]. These records also each comprise the curated and annotated mitochondrial genome of the respective strain on a single contig denoted with the suffix “mito”. The final assemblies including annotations and the draft assemblies we initially generated to determine the best assembly strategy together with custom code used in the analysis are available from GitHub (http://github.com/thackl/cr-genomes) and Zenodo[21].

Technical Validation

Overall sequencing quality of MiSeq and PacBio read data was assessed with FastQC v0.11.3 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). To choose the best assemblies for further refinement, we evaluated different assemblers and alternative assembly strategies. In particular, we assembled draft genomes using MiSeq reads and corrected PacBio data with SPAdes[67] in diploid mode, and generated assemblies from raw PacBio reads with Flye[68] and wtdbg2[69]. Using QUAST[70], BUSCO[58], and the misassembly detection procedure described in Material and Methods, we assessed contiguity, completeness, and quality of the different assemblies. We found that the Illumina-based assemblies, in general, were less complete and less contiguous than the PacBio-based assemblies[21]. The PacBio assemblies differed primarily in the number of potential misassembly sites, with Canu being least prone to this potential issue. Therefore, we selected assemblies generated with Canu for further processing and analysis. Because the read data was obtained from non-axenic cultures, we screened the assemblies carefully for contaminations with a custom R script. Initially, we considered four different criteria: tetra-nucleotide frequencies, coverage, GC-content, and taxonomic assignments (Supplementary Figs. 2–5). Tetra-nucleotide frequencies and GC-content were computed with seq-comp[21]. Medium contig coverage was determined based on MiSeq reads mapped with minimap2 using bam-coverage[21]. Taxonomic assignments were generated with Kaiju[36] (See Materials and Methods for details). We found that all four criteria generated similar and consistent results. All identified contaminations were classified as bacterial. Viral signatures could all be attributed to endogenous viral elements expected to be present in Cafeteria genomes. From this analysis, we established a simple rule for decontamination of the assemblies: Contigs with more than 50% of annotated regions classified as bacteria were excluded. To further assess the completeness and quality of our assemblies, we used BUSCO[57,58] to detect universal eukaryotic orthologous genes (Fig. 1). We found that all our PacBio-based assemblies contain 82–84% of the expected 303 orthologs. Moreover, most missing orthologs are absent from all four assemblies suggesting poor representation in the database or the complete lack of some of these markers from this group as a systemic issue, rather than assembly problems, which would affect different genes in different assemblies. To validate the automated gene predictions, we spot-checked the intron-exon structure for two genes using regular and reverse-transcription PCR (Supplementary Fig. 6). We selected two intron-containing genes, namely genes coding for a TATA-binding protein (locus tag: FNF27_01237) and a 60S ribosomal protein (locus tag: FNF28_01226). Primers were designed to amplify the intronic regions and PCR with these primers resulted in long amplicons when using gDNA as the template and a shorter amplicon when using cDNA as the template. The cDNA was obtained after reverse transcription of total RNA from C. roenbergensis strain RCC970-E3.
Measurement(s)DNA • mitochondrial_DNA • sequence_assembly • sequence feature annotation
Technology Type(s)DNA sequencing • genome assembly • sequence annotation
Sample Characteristic - OrganismCafeteria roenbergensis
Sample Characteristic - Environmentmarine water body
Sample Characteristic - LocationNorthwest Atlantic Ocean • Carribean Sea • Southeast Pacific Ocean • North East Pacific Ocean
  48 in total

1.  A survey of flagellate diversity at four deep-sea hydrothermal vents in the Eastern Pacific Ocean using structural and molecular approaches.

Authors:  M S Atkins; A P Teske; O R Anderson
Journal:  J Eukaryot Microbiol       Date:  2000 Jul-Aug       Impact factor: 3.346

2.  Confusing selective feeding with differential digestion in bacterivorous nanoflagellates.

Authors:  J Boenigk; A C Matz; K Jurgens; H Arndt
Journal:  J Eukaryot Microbiol       Date:  2001 Jul-Aug       Impact factor: 3.346

3.  A virophage at the origin of large DNA transposons.

Authors:  Matthias G Fischer; Curtis A Suttle
Journal:  Science       Date:  2011-03-03       Impact factor: 47.728

4.  Ocean plankton. Eukaryotic plankton diversity in the sunlit ocean.

Authors:  Colomban de Vargas; Stéphane Audic; Nicolas Henry; Johan Decelle; Frédéric Mahé; Ramiro Logares; Enrique Lara; Cédric Berney; Noan Le Bescot; Ian Probert; Margaux Carmichael; Julie Poulain; Sarah Romac; Sébastien Colin; Jean-Marc Aury; Lucie Bittner; Samuel Chaffron; Micah Dunthorn; Stefan Engelen; Olga Flegontova; Lionel Guidi; Aleš Horák; Olivier Jaillon; Gipsi Lima-Mendez; Julius Lukeš; Shruti Malviya; Raphael Morard; Matthieu Mulot; Eleonora Scalco; Raffaele Siano; Flora Vincent; Adriana Zingone; Céline Dimier; Marc Picheral; Sarah Searson; Stefanie Kandels-Lewis; Silvia G Acinas; Peer Bork; Chris Bowler; Gabriel Gorsky; Nigel Grimsley; Pascal Hingamp; Daniele Iudicone; Fabrice Not; Hiroyuki Ogata; Stephane Pesant; Jeroen Raes; Michael E Sieracki; Sabrina Speich; Lars Stemmann; Shinichi Sunagawa; Jean Weissenbach; Patrick Wincker; Eric Karsenti
Journal:  Science       Date:  2015-05-22       Impact factor: 47.728

5.  Host genome integration and giant virus-induced reactivation of the virophage mavirus.

Authors:  Matthias G Fischer; Thomas Hackl
Journal:  Nature       Date:  2016-12-07       Impact factor: 49.962

6.  Genome sequence of the stramenopile Blastocystis, a human anaerobic parasite.

Authors:  France Denoeud; Michaël Roussel; Benjamin Noel; Ivan Wawrzyniak; Corinne Da Silva; Marie Diogon; Eric Viscogliosi; Céline Brochier-Armanet; Arnaud Couloux; Julie Poulain; Béatrice Segurens; Véronique Anthouard; Catherine Texier; Nicolas Blot; Philippe Poirier; Geok Choo Ng; Kevin S W Tan; François Artiguenave; Olivier Jaillon; Jean-Marc Aury; Frédéric Delbac; Patrick Wincker; Christian P Vivarès; Hicham El Alaoui
Journal:  Genome Biol       Date:  2011-03-25       Impact factor: 13.583

7.  Osmoadaptative Strategy and Its Molecular Signature in Obligately Halophilic Heterotrophic Protists.

Authors:  Tommy Harding; Matthew W Brown; Alastair G B Simpson; Andrew J Roger
Journal:  Genome Biol Evol       Date:  2016-08-03       Impact factor: 3.416

8.  Single-cell genomics of multiple uncultured stramenopiles reveals underestimated functional diversity across oceans.

Authors:  Yoann Seeleuthner; Samuel Mondy; Vincent Lombard; Quentin Carradec; Eric Pelletier; Marc Wessner; Jade Leconte; Jean-François Mangot; Julie Poulain; Karine Labadie; Ramiro Logares; Shinichi Sunagawa; Véronique de Berardinis; Marcel Salanoubat; Céline Dimier; Stefanie Kandels-Lewis; Marc Picheral; Sarah Searson; Stephane Pesant; Nicole Poulton; Ramunas Stepanauskas; Peer Bork; Chris Bowler; Pascal Hingamp; Matthew B Sullivan; Daniele Iudicone; Ramon Massana; Jean-Marc Aury; Bernard Henrissat; Eric Karsenti; Olivier Jaillon; Mike Sieracki; Colomban de Vargas; Patrick Wincker
Journal:  Nat Commun       Date:  2018-01-22       Impact factor: 14.919

9.  A Phylogenomic Framework to Study the Diversity and Evolution of Stramenopiles (=Heterokonts).

Authors:  Romain Derelle; Purificación López-García; Hélène Timpano; David Moreira
Journal:  Mol Biol Evol       Date:  2016-08-10       Impact factor: 16.240

10.  The others: our biased perspective of eukaryotic genomes.

Authors:  Javier del Campo; Michael E Sieracki; Robert Molestina; Patrick Keeling; Ramon Massana; Iñaki Ruiz-Trillo
Journal:  Trends Ecol Evol       Date:  2014-04-11       Impact factor: 17.712

View more
  6 in total

Review 1.  Emerging Diversity of Channelrhodopsins and Their Structure-Function Relationships.

Authors:  Elena G Govorunova; Oleg A Sineshchekov; John L Spudich
Journal:  Front Cell Neurosci       Date:  2022-01-24       Impact factor: 5.505

2.  Draft Genome Sequence of the Freshwater Diatom Fragilaria crotonensis SAG 28.96.

Authors:  Brittany N Zepernick; Alexander R Truchon; Eric R Gann; Steven W Wilhelm
Journal:  Microbiol Resour Announc       Date:  2022-08-17

3.  Crystal structures of FNIP/FGxxFN motif-containing leucine-rich repeat proteins.

Authors:  Trevor Huyton; Mamta Jaiswal; Waltraud Taxer; Matthias Fischer; Dirk Görlich
Journal:  Sci Rep       Date:  2022-09-30       Impact factor: 4.996

4.  Estimating the quality of eukaryotic genomes recovered from metagenomic analysis with EukCC.

Authors:  Paul Saary; Alex L Mitchell; Robert D Finn
Journal:  Genome Biol       Date:  2020-09-10       Impact factor: 13.583

5.  Gene expression during bacterivorous growth of a widespread marine heterotrophic flagellate.

Authors:  Ramon Massana; Aurelie Labarre; David López-Escardó; Aleix Obiol; François Bucchini; Thomas Hackl; Matthias G Fischer; Klaas Vandepoele; Denis V Tikhonenkov; Filip Husnik; Patrick J Keeling
Journal:  ISME J       Date:  2020-09-12       Impact factor: 10.302

6.  Niche adaptation promoted the evolutionary diversification of tiny ocean predators.

Authors:  Francisco Latorre; Ina M Deutschmann; Aurélie Labarre; Aleix Obiol; Anders K Krabberød; Eric Pelletier; Michael E Sieracki; Corinne Cruaud; Olivier Jaillon; Ramon Massana; Ramiro Logares
Journal:  Proc Natl Acad Sci U S A       Date:  2021-06-22       Impact factor: 11.205

  6 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.