Literature DB >> 31086480

Scanning RNA-Seq and RAD-Seq approach to develop SNP markers closely linked to MALE STERILITY 1 (MS1) in Cryptomeria japonica D. Don.

Saneyoshi Ueno¹, Kentaro Uchiyama¹, Yoshinari Moriguchi², Tokuko Ujino-Ihara¹, Asako Matsumoto¹, Fu-Jin Wei¹, Maki Saito³, Yumi Higuchi⁴, Norihiro Futamura¹, Hiroyuki Kanamori⁵, Yuichi Katayose⁵, Yoshihiko Tsumura^1,6.

Abstract

Cryptomeria japonica is a major forestry tree species in Japan. Male sterility of the species is caused by a recessive gene, which shows dysfunction of pollen development and results in no dispersed pollen. Because the pollen of C. japonica induces pollinosis, breeding of pollen-free C. japonica is desired. In this study, single nucleotide polymorphism (SNP) markers located at 1.78 and 0.58 cM to a male sterility locus (MS1) were identified from an analysis of RNA-Seq and RAD-Seq, respectively. SNPs closely linked to MS1 were first scanned by a method similar to MutMap, where a type of index was calculated to measure the strength of the linkage between a marker sequence and MS1. Linkage analysis of selected SNP markers confirmed a higher efficiency of the current method to construct a partial map around MS1. Allele-specific PCR primer pair for the most closely linked SNP with MS1 was developed as a codominant marker, and visualization of the PCR products on an agarose gel enabled rapid screening of male sterile C. japonica. The allele-specific primers developed in this study would be useful for establishing the selection of male sterile C. japonica.

Entities: Chemical Disease Species

Keywords: conifer; expressed sequences; gymnosperm; marker-assisted selection; non-model organisms; pinales; restriction-site-associated DNA

Year: 2019 PMID： 31086480 PMCID： PMC6507728 DOI： 10.1270/jsbbs.17149

Source DB: PubMed Journal: Breed Sci ISSN： 1344-7610 Impact factor: 2.086

Introduction

Male sterility in sugi (Cryptomeria japonica) was first described by Taira et al. (Taira ) in Toyama prefecture, Japan, and increased efforts to identify sterile trees led to the discovery of 23 male sterile sugi individuals from all over Japan (Saito 2010). These sterile trees have one of four sterile alleles of MS1, MS2, MS3, and MS4, which show Mendelian inheritance (Miyajima , Taira , Yoshii and Taira 2007). Because no mature pollen is produced and dispersed from sterile trees, they help decrease pollen allergy, which causes heavy social and economic losses every year in Japan. The breeding of male sterile trees would be accelerated if marker assisted selection (MAS) was possible. To assure the precision of selection, the distance between molecular markers and a candidate gene has to be close enough. Long-lasting efforts to develop molecular markers in sugi have resulted in a high-density linkage map with 2,560 markers in 11 linkage groups and 1,266.2 cM in length (Moriguchi ). The majority of these markers are based on randomly selected SNPs from expressed sequence tags (ESTs) or transcriptome contigs (Ueno ) by RNA Sequencing (RNA-Seq). However, randomly designing molecular markers and searching for those linked with target genes are inefficient ways to identify male sterile individuals with MAS (Moriguchi ). To overcome such drawback, MMAPPR (Hill ) and a series of MutMap methods (Abe , Fekih , Takagi ) have been developed using the association between a target phenotype and candidate SNPs over the genome before developing genetic markers. In MutMap, SNP index is calculated as the “ratio between the number of reads of a mutant SNP and the total number of reads corresponding to the SNP” (Abe ) in a bulked DNA sequencing of mutant and wild-type phenotypes. When we consider a simple Mendelian recessive trait in a backcross family, the SNP indices from mutant and wild-type individuals are expected to be approximately 1.0 and 0.5, respectively. The original MutMap method utilizes whole genome sequencing reads from bulked samples and needs a reference genome sequence of a target species. Although an increasing number of genome sequences have been published in recent years, construction of a reference genome sequence for sugi, a species with large genome (11 Gb) (Hizume ), and repetitive elements, has been still difficult. There are several methods of analysis to reduce the complexity of a genome. One of these includes the use of ESTs (Adams ) or a transcriptome (Velculescu ), whereby only expressed parts of a genome are sampled for downstream analysis. Another method to reduce the genome complexity is a construction of a reduced representation library from genomic DNA, such as AFLP (Vos ) and RAD-Seq (Baird ) techniques. In these systems, restriction enzymes are used to cut specific recognition sites with four to eight bases, resulting in a collection from a small portion of a genome. These methods of complexity reduction by sampling specific parts of a genome may help genetic analysis for species with a large genome and without a reference genome sequence. In this study, we applied a method similar to MutMap to develop SNP markers that are closely linked to a male sterility gene using transcriptomic reference as well as genomic pseudo-reference constructed from RAD-Seq for sugi. Then, we screened and identified markers closely linked to one of the male sterile genes (MS1) in the genome, without any reference genome sequences available to date.

Materials and Methods

Plant materials and phenotyping of male sterility

The mapping pedigree (T5 family) consisted of 173 individuals and was derived from a backcross between ‘Toyama-1’ and ‘T1NK4F1’ trees (‘T1NK4F1’ is male fertile, Ms1/ms1) (Moriguchi et al. 2014). ‘T1NK4F1’ was an F1 hybrid between ‘Toyama-1’ (male sterile, ms1/ms1) and ‘Nakakubiki-4’ (male fertile, Ms1/Ms1). Male flowers were induced by gibberellic acid according to the manufacturer (KYOWA HAKKO BIO CO., LTD, Tokyo, Japan) instructions. Pollen production in each individual was visually assessed using a stereomicroscope in the following spring. Induction of male flowers and phenotyping of pollen production were conducted at Niigata Prefectural Forest Research Institute (Murakami, Niigata, Japan).

RNA-Seq from male flower buds

The T5 family at the age of three years was planted in a nursery at the Forestry and Forest Products Research Institute, Tsukuba, Japan, in April 2011. Male flowers were induced by applying gibberellic acid according to the manufacturer’s (KYOWA HAKKO BIO CO., LTD) instructions in July. On the 6th, 11th, 14th, 19th, and 24th of October 2011, male flower buds from sterile and fertile individuals were sampled. Due to the limited availability of male flower buds from each individual, we sampled two to three of them from each individual and mixed together with buds from other individuals. We mixed buds up to 50 individuals in each phenotype category (sterile and fertile). Male flower bud samples were stored at −80°C until RNA extraction. They were then powdered in liquid nitrogen with mortar and pestle, and total RNA was extracted by the method of Le Provost et al. (le Provost ). About 10 spoonfuls of the powder were added to 1 mL of CTAB extraction buffer. The extracted RNA was further purified by SV total RNA purification system (Promega, Madison WI). For sequencing, an equal amount of RNA from each sampling date was mixed for each category (sterile and fertile), and the sequencing library was constructed by Hokkaido System Science Co., Ltd. (Sapporo, Japan) and then sequenced on a MiSeq (Illumina, San Diego, CA) in 2 × 250 bp paired ends at the National Agriculture and Food Research Organization, Tsukuba, Japan. Average depth was insufficient in a preliminary read mapping to full length cDNA (FLcDNA) sequences (Futamura ) (Supplemental Text 1). Only 9,478 (42.0%) FLcDNA sequences had more than 20× average depth. Therefore, additional sequence was collected for each category (sterile and fertile) with HiSeq2000 (Illumina) in 2 × 101 bp paired-end sequencing by Hokkaido System Science Co., Ltd.

RAD-Seq

DNA was extracted from each individual of the T5 family according to a modified CTAB method (Doyle and Doyle 1987). A total 20 individuals (8 sterile and 12 fertile trees) and the parents (‘Toyama-1’ and ‘T1NK4F1’) were used in the double digest RAD-Seq (ddRAD) experiment (Peterson ). Briefly, genomic DNA was double-digested by restriction enzymes, SphI and MluCI. The digested fragments were size-selected using BluePippin agarose gel (Sage Science, Beverly, MA) and ligated with adapters (Fluidigm, South San Francisco, CA). After PCR amplification with adapter-specific primer pairs, an equal amount of DNA from each sample was mixed to construct a RAD library. The quality of the library was checked by sequencing on MiSeq in 2 × 300 bp paired ends and finally sequenced in 2 × 101 bp paired ends on HiSeq2000 by Hokkaido System Science Co., Ltd.

Selection of candidate markers

RNA-Seq reads were cleaned with CLC Genomics Workbench (GW) (Qiagen, Hilden, Germany) 7.0.3 or Trim_Galore program (http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/). The cleaned reads were mapped against 22,539 (36.0 Mb) full-length cDNA (FLcDNA) sequences (Futamura ) (Supplemental Text 1) using GW 8.5.1 with more stringent parameters than those of default (length fraction = 0.85 and similarity fraction = 0.9). Read mappings were carried out for each sample and merged into a BAM alignment file for sterile and fertile samples. Candidate SNP sites were detected from the BAM files using VarScan (ver. 2.4.3) (Koboldt ) with default parameter settings (min coverage = 8, min reads 2 = 2, min varfreq = 0.2, min avgqual = 15, and P value threshold = 0.01) to generate a VCF (variant call format) file. Read depths (number of reads) for both alleles (X and Y) at any SNP sites were extracted from the VCF file. SNPs from RNA-Seq were also searched from contigs assembled de novo by Trinity (Grabherr ) using the cleaned RNA-Seq reads. The contig sequences were used as references, onto which the reads were mapped, and the SNPs were detected as described above using the same parameter settings. For RAD-Seq, reads from each individual were combined according to male flower phenotypes (sterile and fertile), which were then processed by dDocent (version 2.17) pipeline (Puritz ). In the reference construction in dDocent, we set “CUTOFF” of four reads within a sample, and “CUTOFF2” of two which indicated that at least two samples (sterile and fertile samples in the present case) had that contig sequence. After mapping the reads onto the references, the resulting BAM files were then used to detect SNPs, and the read depth at each SNP site was extracted in the same manner as done for the abovementioned RNA-Seq data. To select candidate markers linked to MS1, we applied a procedure similar to MutMap. The read depth for each allele at each SNP site was assumed to be proportional to the number of alleles in the genotype. Heterozygous (XY) and homozygous (XX) genotypes were expected to produce reads from X allele in ratios of 50% and 100%, respectively, to total reads mapped at each focal SNP site. The difference (D) in the X allele percentage between heterozygous (fertile) and homozygous (sterile) samples was calculated for each SNP site and averaged over all SNP sites on the same reference sequence (full length cDNA (FLcDNA), RNA-Seq, or RAD-Seq reference). The average value is referred to as Dav hereafter. Only the Dav was calculated for the reference with at least two SNP sites and with a maximum distance among all SNPs of ≥50 bp (Supplemental Fig. 1). The threshold distance among SNPs was determined arbitrarily as 50 bp in the current study. Because C. japonica is allogamous, we expected four possible cases for linkage phase of a two-alleles SNP site and MS1 and the resulting values of Dav in our backcross family (Fig. 1). Theoretically, a maximum of three alleles in a SNP site is possible in backcross family, but these cases are expected to be rare, and we focused on cases for two-alleles SNP sites here. The frequency for three-alleles SNP sites was estimated as 0.66% with four screening panel individuals in our previous re-sequencing study (Uchiyama ). In the current procedure described above, we focused on cases where Dav ≈ 0.5 (Fig. 1B, 1D). Simple simulation, where a Mendelian marker is in complete linkage equilibrium with MS1, indicated that the average difference (D̄) was less than 0.1, when the number of backcross samples exceeded 20 (ms1/ms1:Ms1/ms1 = 10:10) (Supplemental Fig. 2, Supplemental Text 2). In the current study, about 50 male sterile and 50 male fertile samples were included in the RNA-Seq experiment, whereas the RAD-Seq had nine sterile and 13 fertile samples including their parents.

Fig. 1

Schematic representation of calculation of D at a SNP site with two alleles and linkage phase between male sterility locus (MS1) in a backcross family. “X” and “Y” represent SNP alleles. No recombination is assumed between the SNP site and MS1 in this example. Dav is an averaged value of D over SNP sites in a reference sequence (contig). D is defined as the difference in X allele frequency between male sterile and wild-type pedigree members. Four possible cases (A–D) in terms of allelic linkage between an SNP site and MS1 in backcross scheme are presented. In case C, the SNP alleles in the parents are not separated in the mapping family and D cannot be calculated.

SNP assay and linkage mapping

The SNPs on some of the reference sequences with higher Dav values were assessed by the EP1 system (Fluidigm) (Wang ). We selected SNPs with higher Dav values sequentially, but if there were any adjacent SNPs within 30 bp of target SNP, it was not suitable for SNPType Assay Design (Fluidigm) and discarded. In total, 48, 33, and 48 SNP assays from FLcDNA, RNA-Seq, and RAD-Seq reference sequences were tested. We used T5 family (173 individuals) for linkage mapping. DNA was extracted for these individuals from the needle tissue using a modified CTAB method (Doyle and Doyle 1987) and used for SNPType assay by the Fluidigm EP1 system according to the manufacturer’s instructions. Fluidigm SNP Genotyping Analysis (version 4.1.2) was used for genotype calling under the default setting. For linkage mapping, SNP genotypes were combined with 27 marker genotypes we had obtained in our previous study (Moriguchi ). After removing loci with a large number of missing data points (more than 10 individuals) or extreme segregation distortion (P < 0.001), a partial linkage map for the ninth linkage group (LG9) where the male sterility gene (MS1) exists, was constructed by JoinMap v4.0 (Kyazma, Wageningen, The Netherlands) with the parameter BC (backcross) and two rounds of map calculation according to Moriguchi . Map distance was calculated by the Kosambi mapping function, and the linkage map for LG9 was drawn by MapChart ver. 2.0 (Voorrips 2002).

Allele-specific PCR and agarose gel electrophoresis

After identification of the most closely linked markers to MS1 on the partial linkage map, we constructed allele-specific PCR (ASP) system or Amplification Refractory Mutation System (ARMS) (Newton ) for an easy identification of male sterile trees by agarose gel electrophoresis. The ASP primers were designed using Primer3Plus (Untergasser ) based on the method described by Wangkumhang et al. (2007). Template sequence for primer design contained two SNPs, and we designed ASP primers for both of these (Fig. 2). Because two SNPs were located on 3′ and 5′ ends of the sequence, it was not possible to design ARMS primers (Newton ), which target both alleles in a SNP with oppositely directed allele-specific primers. The reaction mixture, totaling 10 μL, contained 3 μL of 2 × Multiplex (Qiagen), 0.2 μM of each forward and reverse primers, and 1 μL (5–10 ng) of DNA, which was then amplified in a GeneAmp PCR System 9700 (Applied Biosystems, Foster City, CA) with the following thermal conditions: initial denaturation at 95°C for 15 min, 4 cycles of 95°C for 30 s, 64°C for 90 s with −1°C per cycle and 72°C for 30 s, and 34 cycles of 95°C for 30 s, 60°C for 90 s, and 72°C for 30 s. The PCR products were visualized by 2% agarose gel electrophoresis and ethidium bromide staining.

Fig. 2

Allele-specific PCR primers to discriminate male sterile and fertile individuals with agarose gel electrophoresis. There are two SNP sites on dD_Conig_3995 measuring 375 bp in length. Each SNP site is used to target a specific SNP allele coupled with either Ms1 (wild-type) or ms1 (male sterile).

Registration of sequence data

All the sequence data used in the present study were registered in DDBJ with the accession numbers mentioned in Supplemental Text 1.

Results

Scanning RNA-Seq

In total, 49.6 Gb of transcriptome reads were collected (5.0 Gb from MiSeq and 44.6 Gb from HiSeq). As a preliminary study, MiSeq collected 2.7 Gb from male sterile and 2.3 Gb from male fertile flower bud samples. Due to adapter or low-quality sequences, 40.1% of bases were removed. Initial mapping of these reads onto FLcDNA reference sequences showed that 67.0% of the reads were successfully mapped, with 32.3× average depth. Only 9,478 (42.0%) FLcDNA sequences had more than 20× average depth. We therefore collected additional reads using HiSeq2000, which produced 22.9 and 23.2 Gb sequencing reads from fertile and sterile samples, respectively. After the removal of 1.41 Gb (3.1%) of reads by Trim_Galore program, the remaining 22.1 and 22.5 Gb reads from sterile and fertile samples, respectively, were mapped against FLcDNA reference, resulting in 30.8 Gb (68.2%) of mapped bases with 507.6× average depth. In addition, 21,457 (95.2%) FLcDNA sequences reached more than 20× average depth. We identified 223,726 candidate variant (223,418 SNPs and 13,369 indels) sites in total, which indicated 9.9 variations per reference, and the ratio of transition- and transversion-type SNPs (Ts/Tv) was 1.49. A total 14,072 (62.4%) references had two or more SNPs, and the maximum distance among the SNPs was ≥50 bp, for which Dav was evaluated, ranging from 0.0 to 0.95 with an average of 0.112. De novo assembly of the cleaned HiSeq reads produced 232,103 contigs with a total length of 177.5 Mbp, onto which cleaned reads from sterile and fertile samples (22.1 and 22.5 Gb, respectively) were separately mapped, resulting in 42.9 Gb (96.1%) of the mapped bases with a 126.7× average depth. In total, 46,303 (19.9%) RNA-Seq contigs had more than 20× average depth. We identified 517,225 candidate variant (517,140 SNPs and 85 indels) sites, indicating 2.2 variations per reference (i.e. 517,225 candidate variants per 232,103 contigs), and the ratio of transition- and transversion-type SNPs (Ts/Tv) was 1.59. A total of 40,573 (17.5%) references had two or more SNPs, and the maximum distance among the SNPs was ≥50 bp, for which Dav was evaluated, ranging from 0.0 to 0.95 with an average of 0.121.

Scanning RAD-Seq

We collected 2.3 and 2.8 Gb of sequences from sterile and fertile trees, respectively. The dDocent pipeline removed 712 Mb (13.8%) of adapter or low-quality bases, producing 16,455 reference contigs (3.3 Mb in length). Most of reference sequences were 212 bp in length with 10 “N”s in the middle, where forward and reverse reads were joined, but 992 of the references were merged contigs from forward and reverse reads. The pipeline then mapped 979.9 Mb (22.0%) of reads onto the reference. In total, 98,887 candidate SNP sites were detected from 9,183 (55.8%) contigs, resulting in an average number of 10.8 SNPs per reference, and the ratio of transition- and transversion-type SNPs (Ts/Tv) was 2.56. A total of 6,638 references had two or more SNPs, and the maximum distance among the SNPs was ≥50 bp, for which Dav was evaluated, ranging from 0.0 to 0.70 with an average of 0.113. The depth for these SNP sites ranged from 5 to 4,013 with 85.9× average depth. Three sets of SNPType assays (129 SNPs including 48, 33 and 48 SNPs from FLcDNA, RNA-Seq, and RAD-Seq references, respectively) were constructed based on Dav, whose ranges were 0.013–0.54, 0.37–0.50, and 0.30–0.56, respectively for FLcDNA, RNA-Seq and RAD-Seq references (data not shown). The assay was successful [number of missing genotypes ≤10, and no severe segregation distortion (P ≥ 0.001)] for 29 (60.4%) 24 (72.7%), and 22 (45.8%) SNPs originating from FLcDNA, RNA-Seq, and RAD-Seq, respectively. Linkage analysis using JoinMap included 26 backcross-type markers (eight, 13, and five from FLcDNA, RNA-Seq and RAD-Seq, respectively) and indicated that 16 markers were mapped onto LG9, where MS1 exists (Table 1, Fig. 3). The closest markers were CSFL022_N13-194, c65725_g1_i1-123, and dDContig_3995-165 from FLcDNA, RNA-Seq, and RAD-Seq, showing 2.95, 1.78, and 0.58 cM to MS1, respectively. The total LG9 map distance increased by 6.8 cM with the 16 markers added to the previous map (Moriguchi ) (Supplemental Fig. 3).

Table 1

SNP markers on linkage group 9 (LG9) around male sterility locus (MS1) in C. japonica

Marker origin	Marker ID	Dav	Map position on LG9 (cM)	Distance to MS1 (cM)
FLcDNA	CLFL025_C07-615	0.346	45.44	23.52
	CFFL008_J05-175	0.231	44.28	22.35
	CFFL025_E16-1334	0.422	5.90	16.03
	CLFL044_G04-1079	0.247	5.90	16.03
	CMFL035_K17-90	0.334	13.71	8.22
	CLFL018_I12-1304	0.515	17.24	4.69
	CSFL023_N16-2950	0.412	17.24	4.69
	CSFL022_N13-194	0.454	18.99	2.94

RNA-Seq	c76407_g8_i1-82	0.371	10.12	11.81
	c76501_g12_i1-230	0.445	10.12	11.81
	c60295_g2_i1-1291	0.480	17.24	4.69
	c66322_g1_i7-1924	0.483	18.99	2.94
	c65725_g1_i1-123	0.432	20.16	1.77

RAD-Seq	dDContig_13452-169	0.329	70.36	48.44
	dDContig_6119-92	0.422	28.01	6.09
	dDContig_3995-165	0.467	22.51	0.58

Fertility	MS1	NA	21.93	0.00

Dav represents levels of linkage between marker and MS1. The Dav value of 0.5 is a theoretical prerequisite for complete linkage with MS1.

NA: Not available

Fig. 3

A partial linkage map for linkage group 9 (LG9) around the male sterility locus MS1. Marker names and centimorgan distances (Kosambi) are indicated to the right and left of the linkage group, respectively.

Allele-specific PCR system using agarose gels

Two pairs of allele-specific PCR primers (Table 2, Fig. 2) were designed targeting dD_Contig_3995, which was the most closely linked to MS1 based on the abovementioned mapping result. First, we used a single allele-specific primer pair in a single PCR reaction and confirmed allele-specific amplification from Ms1/Ms1, Ms1/ms1, and ms1/ms1 individuals (data not shown). Two pairs of allele-specific PCR primers were then combined, and multiplex PCR was performed, which enabled the detection of both male sterile- and wild-type alleles in a single PCR reaction (Fig. 4).

Table 2

Allele-specific PCR primers for dD_Contig_3995 discriminating male sterile and fertile trees on agarose gels (Figs. 2, 4)

Primer ID	Primer sequence#	Expected PCR product size (bp)	Coupling allele
dD3995_47wild	CTGCCTGCCAAAAATAGCACGG	121	Ms1
dD3995_47LSP137	GGCGGCTCATAATGTGGTAAAAA

dD3995_165ms	TCAGGATTCCTTCTCGGTTTTCCG	168	ms1
dD3995_165LSP	GATTTGCATGTCATACCATCACG

Underlined bases indicate artificial mismatch introduced to increase specificity.

Fig. 4

Agarose gel electrophoresis to discriminate male sterile and fertile C. japonica in T5 family. Sample DNAs were amplified using primers shown in Table 2. PCR products were separated on 2% agarose gel and visualized by ethidium bromide staining. M: DNA Ladder One (nacalai tesque, Kyoto, Japan) (numbers on the left are molecular sizes in base pairs), 1: ‘Toyama-1’ with ms1/ms1, 2: ‘T5NK4F1’ with Ms1/ms1, 3: ‘Nakakubiki-4’ with Ms1/Ms1.

Discussion

Detection and validation of candidate SNPs

SNP detection is a fundamental step to develop genetic markers. We detected 223,726, 517,140, and 98,887 candidate SNPs from the FLcDNA, RNA-Seq and RAD-Seq references, respectively. A total of 129 were selected for validation to perform a Fluidigm SNPType assay, which resulted in 75 (58.1%) successful markers. When the genotyping scatter plots for unsuccessful markers were inspected, some of the markers appeared to display a crushed or monomorphic pattern (Supplemental Fig. 4). Markers with a monomorphic pattern (seven, one and one candidate SNP from FLcDNA, RNA-Seq, and the RAD-Seq reference, respectively) indicates that we detected false SNP sites, which may have occurred due to a sequencing or data analysis error, null alleles (amplification failure in PCR due to SNPs at the primer binding site or third SNP allele), duplicated genomic sequences (paralogues), or RNA editing (Piskol ). The higher monomorphic pattern rate in the candidate SNPs from the FLcDNA reference suggests that some of the reads were forced to be mapped onto paralogous sequences. In the coding region of C. japonica, one exon SNP for every 106.7 bp was reported by re-sequencing four genotypes with the Sanger method (Uchiyama ). The candidate SNP frequency from 35.96 Mbp of the FLcDNA reference in the current study based on 223,418 SNPs was roughly estimated as one candidate for every 161 bp. Although we detected 517,140 candidate SNPs from 177.5 Mb of RNA-Seq reference, most (80.1%) of them had less than 20× average depth, for which the SNP frequency calculation may be inappropriate. The ratio of transition- and transversion-type SNPs (Ts/Tv) can be considered an SNP quality metric because if all of the mutations occur at the same probability, we expect a Ts/Tv ratio ≈0.5, while the transversion type mutation rate is lower in general and Ts/Tv ratios of 2.15 and 3.27 has been reported in the final call set for human genome-wide and exome known SNPs, respectively (DePristo ). Similar Ts/Tv ratio values (1.21–1.51) as the current study (1.49 and 1.59) have been reported in RNA-Seq data from coniferous species (Chen , Plomion , Ueno ). To gain further insights into the validity of the candidate SNPs, the RAD-Seq reads were mapped onto the FLcDNA references (hereinafter referred to as “RAD-FL mapping”). SNPs were detected by VarScan (ver. 2.4.3) (Koboldt ) with default parameter settings, and the resulting VCF file was compared with that created from RNA-Seq reads’ mapping onto the same FLcDNA reference (hereinafter referred to as “RNA-FL mapping”). The comparison was made using “bcftools stats” and plot-vcfstats in bcftools ver. 1.7-1 (Li and Durbin 2009). In total, 2,014 candidate SNPs were common between RAD-FL and RNA-FL mapping. These candidates were distributed among 456 (2.0%) FLcDNA reference sequences. The Ts/Tv ratio for these common SNPs was 2.03. The mapping of RAD-Seq reads onto a genic reference helps validate the SNPs detected by RNA-Seq.

Scanning linked markers from pooled data

Male sterility in C. japonica has been investigated with an aim of identifying the responsible genes of the phenotype and for a better understanding of the molecular mechanisms of male sterility (Futamura , Moriguchi , 2016, Tsubomura ). Analysis of EST (Expressed Sequence Tag) or RNA-Seq with high-throughput sequencers are both a method to capture the whole atlas of expressed genes (transcriptome) in a tissue (Weber 2015). Differentially expressed genes between male fertile and sterile strobili have been identified (Futamura , Tsubomura ). Although the male sterility in C. japonica is caused by a single gene, none of the differentially expressed genes has directly dictated the phenotype and linkage analysis indicates that they are different from MS1 (Ihara et al. unpublished results). In the current study, we focused on the SNPs detected from high-throughput sequencing in the mapping family of MS1 and identified closely linked markers (Fig. 3). Due to the backcross family we used, fertile and sterile genotypes are heterozygotes and homozygotes, respectively. If we had used other families such as selfed mapping family, homozygous fertile and sterile individuals would have been separated. Homozygous genotypes are easier to genotype, and larger values of expected score (Dav ≈1) would be preferable to differentiate both phenotypes. Unfortunately, in most of conifer tree species, which are allogamous, selfing leads to serious inbreeding depression, and construction of such mapping families is difficult. Heterozygote genotype calling is therefore required in allogamous species, which needs a higher (15×, for example) average depth (Song ). Because of the expensive sequencing costs to perform individual genotyping, especially for RNA-Seq, pooling strategy helped to reduce the cost, and selecting linked SNPs with appropriate Dav increased the efficiency of developing the markers. Of 21 analyzed SNPs from FLcDNA and RNA-Seq in the current study, 13 (62%) SNPs were mapped onto LG9 (Table 1). If we had randomly selected SNPs, a simple calculation would suggest 9.1% of markers on the chromosome, because C. japonica has 11 chromosomes. Considering the size of LG9 (96.9 cM or 7.7% of total map distance) (Moriguchi ), the number of markers expected on LG9 may be <9.1%. Scanning RNA-Seq data increased the efficiency of partial linkage map construction by more than six times, compared with the method of developing markers randomly from the entire genome. Similarly, RAD-Seq data resulted in three (60%) linked markers of five analyzed SNPs. In the current study, we designed SNP assays with Dav ranges of 0.013–0.54, 0.37–0.50, and 0.30–0.56 for FLcDNA, RNA-Seq and RAD-Seq references, respectively, to design 129 Fluidigm SNPType assay primer pairs in total. If we had selected assays with Dav >0.4, for example, the efficiency to construct the partial linkage map would have been higher. To gain some insight into Dav critical threshold values, the values were calculated for 771 (Moriguchi ) and 2,160 (Moriguchi ) expressed genic markers over 11 linkage groups, which were mapped in the YI mapping family (Supplemental Text 3). As expected, markers in LG9 where MS1 existed revealed higher Dav values, although MS1 was not mapped on the YI map and its exact location was unclear. If we defined the Dav critical threshold values as 1% of the upper tail of the Dav distribution, they were 0.343 and 0.312 for markers from Moriguchi and Moriguchi , respectively, and the rate of false positives was <50%. These results agree well with the observed results (Table 1), where 7 (33.3%) of 21 assayable markers from the FLcDNA and RNA-Seq references were not mapped on LG9. Notably, the empirical estimates of the Dav distribution and the rate of false positives in Supplemental Text 3 were based on RNA-Seq reads. If we apply scanning RAD-Seq for the empirical estimates, these critical threshold values may change. The broader the Dav range considered, the larger the number of markers linked to MS1. However, in practice, the Dav threshold values are determined by the cost and time required for the specific assay method. In the preliminary simulation (Supplemental Fig. 2), the difference (D) (for Fig. 1D) was calculated in cases where there was no linkage between a marker and MS1 in the backcross (BC1) pedigree, indicating that D approached zero as the number of BC1 offspring (N) increased, and that the average difference (D̄) was <0.1 when N > 20. Overall concordance was observed when the observed D values were compared with the simulated values. The observed/ simulated D̄ values were 0.1050/0.0456, 0.0995/0.0456, and 0.1035/0.0987 for SNPs from FLcDNA (N = 100), RNA-Seq (N = 100), and RAD-Seq (N = 20), respectively (Supplemental Fig. 5). The D̄ values for the observed values may be higher than those for the simulated values, because some of the linked markers were included in the real situation. The observed D values from RAD-Seq are closer to the simulated values, while those based on gene expression (FLcDNA and RNA-Seq) are not. This may have been caused by unequal contribution from each allele during gene expression and/or a contribution from paralogues. In the present study, Dav values with at least two SNPs and a maximum distance among all SNPs ≥ 50 bp on a reference were considered (Supplemental Fig. 1), because we were afraid that Dav ≈0.5 may be obtained by chance even though there was no linkage between the markers and MS1. However, as the simple simulations (Supplemental Fig. 2) indicated that the D values (which could be considered as Dav for a reference sequence with a single SNP site) rerely exceeded 0.4 when at least 10 individuals were pooled from each category. If we considered all Dav values including Dav from a sequence with a single SNP site, the number of Dav, from the values of which linkage was evaluated, would have increased by 3,778 (23.6 %), 14,756 (36.4 %), 2,456 (37.0 %) for cases from FLcDNA, RNA-Seq, and RAD-Seq, respectively. To take advantage of many collected reads as possible and to avoid overlooking linked markers to target character (MS1), Dav for a reference sequence with a single SNP site should be considered (Supplemental Text 3). The larger the Dav is, the closer the marker was to MS1 (Table 1). Regression of the distance (cM) to MS1 on the Dav value was significant (P = 0.0119, Supplemental Fig. 6). Markers in which Dav >0.475 had a high probability to link to MS1 within 5 cM. Mishima reported the SNP marker (reCj19250_1927) for the putative causative MS1 gene, which showed homology to AT5G62190.1 (PRH75 DEAD box RNA helicase). The Dav value for reCj19250 (which corresponds to the RNA-Seq contig reference of c66093_g1_i1 in the current study) was 0.517. These data support the validity of the scanning method used in the present study. This method may be valid to search closely linked markers not only to qualitative trait loci, but also to quantitative trait loci with the largest effect. Although pooling the sequence reads according to the phenotype to screen linked SNP candidates reduced the cost of sequencing in this study, the development of new sequencing technologies could lead to a continuous reduction in the sequencing cost in an unprecedented manner (Delseny ) and make individual genotyping by sequencing (GBS) a routine task in breeding research. However, variation in the amount of sequencing reads for each individual causes uncertain genotypes for samples with low data volume, which is still problematic. Similarly, RNA-Seq data include expressional variation, and genes or alleles with low expression also cause biased data. In our preliminary analysis, we attempted to directly call the individual genotypes from RAD-Seq, estimated the error rate assuming Mendelian inheritance (Utsunomiya ), and found an average Mendelian error (allelic drop in) rate of 6.9% (data not shown). Genotyping error inflates map distance and may lead to an incorrect marker order along chromosomes. Developing closely linked markers needs accurate linkage maps, and map construction from genotypes from GBS with a high error rate is probably inappropriate. Instead, scanning closely linked SNPs from pooled sequencing reads, at first, is robust against errors, and the SNP-specific markers developed achieve higher precision and scalability (from one sample to thousands of samples), enabling marker-assisted selection in the downstream applications.

Utility of the linked marker to MS1

The most closely linked marker developed in the present study (dD_Contig3995-165) was located 0.58 cM to the male sterility locus MS1 (Fig. 3). The results were obtained from the genotypes of the mapping family. One of the 173 pedigree members showed recombination between MS1 and the marker. This implies that discrimination of male sterile and fertile individuals based on the marker from 1,000 seedlings produced by backcross contains five to six individuals of false identification (1,000/173 ≈ 5.8). More information with addition of a second marker located on the opposite side of MS1 to dD_Contig3995-165 will increase the accuracy of phenotype prediction. Selection of male sterile individuals at the embryo stage is required for mass production of male sterile trees by tissue culture from immature zygotic embryos, where marker-assisted selection (MAS) is a prerequisite. The marker developed in this study (Table 2) is useful in such a project, which is now in progress. The discovery of two closely linked markers that cross the target gene (MS1) increases the precision of MAS, while the method developed in this study will be useful for identifying other male sterile genes and their closely linked markers.

3 in total

1. Identification and genetic diversity analysis of a male-sterile gene (MS1) in Japanese cedar (Cryptomeria japonica D. Don).

Authors: Yoichi Hasegawa; Fu-Jin Wei; Saneyoshi Ueno; Asako Matsumoto; Kentaro Uchiyama; Tokuko Ujino-Ihara; Tetsuji Hakamata; Takeshi Fujino; Masahiro Kasahara; Takahiro Bino; Katsushi Yamaguchi; Shuji Shigenobu; Yoshihiko Tsumura; Yoshinari Moriguchi
Journal: Sci Rep Date: 2021-01-15 Impact factor: 4.379

2. Construction of a reference transcriptome for the analysis of male sterility in sugi (Cryptomeria japonica D. Don) focusing on MALE STERILITY 1 (MS1).

Authors: Fu-Jin Wei; Saneyoshi Ueno; Tokuko Ujino-Ihara; Maki Saito; Yoshihiko Tsumura; Yuumi Higuchi; Satoko Hirayama; Junji Iwai; Tetsuji Hakamata; Yoshinari Moriguchi
Journal: PLoS One Date: 2021-02-25 Impact factor: 3.240

3. An Improved and Simplified Propagation System for Pollen-Free Sugi (Cryptomeria japonica) via Somatic Embryogenesis.

Authors: Tsuyoshi E Maruyama; Momi Tsuruta; Saneyoshi Ueno; Kiyohisa Kawakami; Yukiko Bamba; Yoshinari Moriguchi
Journal: Front Plant Sci Date: 2022-02-08 Impact factor: 5.753

3 in total