Meifang Qi1,2, Zijuan Li1,2, Chunmei Liu1,2, Wenyan Hu1,2, Luhuan Ye1,2, Yilin Xie1,2, Yili Zhuang1,2, Fei Zhao1,2, Wan Teng2,3, Qi Zheng2,3, Zhenjun Fan1,4, Lin Xu1,2, Zhaobo Lang2,5, Yiping Tong2,3, Yijing Zhang1,2.
Abstract
Genetic diversity in plants is remarkably high. Recent whole genome sequencing (WGS) of 67 rice accessions recovered 10,872 novel genes. Comparison of the genetic architecture among divergent populations or between crops and wild relatives is essential for identifying the functional components that determine crucial traits. However, many major crops have gigabase-scale genomes, which are not well suited to WGS. Existing cost-effective sequencing approaches, including re-sequencing, exome sequencing and restriction enzyme-based methods, all have difficulty obtaining long novel genomic sequences from highly divergent populations with large genomes. The present study presents a reference-independent core genome targeted sequencing approach, CGT-seq, which employs epigenomic information from both active and repressive epigenetic marks to guide the assembly of the core genome, which is mainly composed of promoter and intragenic regions. The method is relatively easy to implement and displays high sensitivity and specificity for capturing the core genome of bread wheat: 95% of intragenic regions and 89% of promoter regions in wheat were covered by CGT-seq reads. We further demonstrate in rice that CGT-seq captured hundreds of novel genes and regulatory sequences from a previously unsequenced ecotype. Together, with specific enrichment and sequencing of regions within and near genes, CGT-seq is a time- and resource-effective approach to profiling functionally relevant regions in sequenced and non-sequenced populations with large genomes.
Year: 2018 PMID: 29931324 PMCID: PMC6182137 DOI: 10.1093/nar/gky522
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Figure 1. Design of CGT-seq. (A) ChIP-seq read distribution of epigenetic marks surrounding genes in rice (japonica cultivar JZ-1560). The y-axis shows read density normalized by sequencing depth (RPM, reads per million mapped reads). Regions ranging from 1 kb upstream to 1 kb downstream of the gene body are shown. TSS, transcription start site. TES, transcription end site. (B) RNA-seq and ChIP-seq read density of H3K4me3 and H3K27me3 marks surrounding 33 808 genes expressed in at least one tissue. Regions ranging from 6 kb upstream to 6 kb downstream of the TSS were used. (C) Workflow for enrichment and de novo assembly of the core genome. ChIP-seq was performed for selected epigenetic marks, followed by DSN-normalized library construction and massively parallel sequencing. The sequencing reads were assembled into contigs based on a De Bruijn graph. Contigs from different modifications were merged and scaffolds were constructed with paired-end information. Further gap filling and extension were guided by reads mapped to the scaffolds. *The DSN-normalized library was introduced in the Discussion as an improvement in uniformity.
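The RPM normalization named in panel A can be illustrated with a minimal Python sketch. The function name `rpm`, the bin layout and the example counts are hypothetical and are not taken from the study; the only thing assumed from the legend is that raw read counts are scaled to reads per million mapped reads.

```python
def rpm(bin_counts, total_mapped_reads):
    """Scale raw per-bin read counts to reads per million mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    return [count * 1_000_000 / total_mapped_reads for count in bin_counts]


# Hypothetical example: three bins around a TSS in a library
# with 20 million mapped reads.
profile = rpm([40, 200, 10], 20_000_000)
print(profile)  # [2.0, 10.0, 0.5]
```

Dividing by the library size makes profiles from libraries of different sequencing depth comparable on the same y-axis.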
Figure 2. Performance of CGT-seq. (A) Circos plot showing the high concordance between wheat genic regions and regions captured by CGT-seq enriched for H3K4me3 and H3K27me3 marks. The outermost circle depicts the ideogram of each chromosome. The rulers indicate the length and position of each chromosome. The next two circles represent the density of genes (track 2) and assembled scaffolds (track 3). Orange indicates high density and blue indicates low density. The two internal circles represent H3K27me3 (track 3) and H3K4me3 (track 4) ChIP-seq read density. (B) Genomic track illustrating the recovery of promoter and intragenic regions by CGT-seq for a representative gene, NAM-A1. (C) Donut chart showing the fraction of all 110 790 annotated genes and promoter regions (3 kb upstream of the TSS) covered by CGT-seq sequencing reads. (D) Fraction of annotated genes whose sequences are recovered by assembled scaffolds. The x-axis represents the length of sequence in promoter (3 kb upstream of the TSS) or gene body regions covered by scaffolds. The y-axis represents the fraction of annotated genes. (E) Box plot showing the distribution of sequencing depth in different genomic regions captured by CGT-seq. The promoter region is defined as above. The transcription termination region (TTR) is defined as 1 kb downstream of the TES. The numbers on top of the boxes represent the median depth. (F) Fraction of scaffolds mapped to different genomic regions. Intergenic regions based on the current annotation were further divided into those mapped by RNA-seq reads, TEs and repetitive sequences, and other regions.
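The per-gene covered length plotted on the x-axis of panel D can be computed by clipping scaffold alignments to the region of interest and summing the union of the clipped intervals. The sketch below is only an illustration: the function name `covered_length`, the half-open coordinate convention and the example coordinates are assumptions, not the authors' implementation.

```python
def covered_length(start, end, hits):
    """Number of bases in [start, end) covered by at least one hit.

    `hits` is a list of (hit_start, hit_end) half-open intervals on the
    same chromosome as the region (an assumed input format).
    """
    # Clip every overlapping hit to the region, then sort by start.
    clipped = sorted(
        (max(s, start), min(e, end)) for s, e in hits if s < end and e > start
    )
    total = 0
    cur_start = cur_end = None
    for s, e in clipped:
        if cur_end is None or s > cur_end:
            # A new, disjoint block begins; bank the previous one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            # Overlapping or adjacent hit: extend the current block.
            cur_end = max(cur_end, e)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Hypothetical 1 kb region with three scaffold alignments.
print(covered_length(100, 1100, [(0, 300), (250, 400), (900, 2000)]))  # 500
```

Taking the union rather than the sum of hit lengths prevents bases covered by several scaffolds from being counted more than once.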
Figure 3. CGT-seq read saturation analysis. Assessment of capture sensitivity for randomly sampled reads. Subsets of 20, 40, 60 and 80 Gb of sequencing reads (half from H3K4me3 and half from H3K27me3) were randomly selected from 100 Gb of sequencing reads. Shown is the fraction of annotated genes with ≥500 bp genic (blue) or promoter (orange) regions recovered by assembled scaffolds from sampled reads.
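The two bookkeeping steps of such a saturation analysis, drawing a balanced random subset of reads and scoring the fraction of genes with at least 500 bp recovered, might be sketched as follows. The assembly step itself is omitted, and the names `balanced_subsample` and `fraction_recovered`, the read-count interface (rather than Gb) and the example values are all hypothetical.

```python
import random


def balanced_subsample(k4_reads, k27_reads, total, seed=0):
    """Draw `total` reads without replacement, half from each mark."""
    half = total // 2
    if half > len(k4_reads) or half > len(k27_reads):
        raise ValueError("not enough reads in one of the pools")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return rng.sample(k4_reads, half) + rng.sample(k27_reads, half)


def fraction_recovered(covered_bp_per_gene, min_bp=500):
    """Fraction of genes with at least `min_bp` covered by scaffolds."""
    if not covered_bp_per_gene:
        return 0.0
    hits = sum(1 for bp in covered_bp_per_gene.values() if bp >= min_bp)
    return hits / len(covered_bp_per_gene)


# Hypothetical read identifiers and per-gene covered lengths.
k4 = [f"k4_{i}" for i in range(100)]
k27 = [f"k27_{i}" for i in range(100)]
subset = balanced_subsample(k4, k27, total=40)
recovered = fraction_recovered({"geneA": 600, "geneB": 100, "geneC": 500})
```

In a full analysis each subset would be assembled separately, and `fraction_recovered` would be evaluated on the covered lengths derived from that assembly.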
Figure 4. Highly accurate detection of intragenic and regulatory sequences and variants in heterogeneous regions. (A) Enriched protein domains for indica-japonica divergent genes recovered by CGT-seq in HHZ. Shown are enrichment P values (x-axis) and fold enrichment (y-axis) for genes with promoter (blue) or intragenic (pink) regions recovered. The size of the circle represents the number of recovered genes containing a given domain. (B) Genomic tracks illustrating the recovery of the indica-specific LRR gene by CGT-seq from HHZ in an indica-japonica highly divergent region. Fifty-base-pair single-end sequencing reads were used for assembly. (C) Genomic tracks illustrating the recovery of the indica rice-specific GW5 locus by CGT-seq. Fifty-base-pair single-end sequencing reads were used for assembly. (D) Concordance of SNVs identified by re-sequencing and SNVs identified from CGT-seq captured sequences. (E) Cumulative fraction of SNVs (y-axis) with re-sequencing depth less than or equal to the value on the x-axis. The re-sequencing only SNVs, CGT-seq only SNVs and common SNVs are plotted separately. The P value reflecting the significance of the difference between distributions is calculated using the Kolmogorov–Smirnov test. The dashed line indicates that >80% of re-sequencing only and common SNVs have re-sequencing depth >10, while only around 45% of CGT-seq only SNVs have re-sequencing depth >10. (F) Distribution of re-sequencing depth with respect to the density of SNVs. All 20 bp genomic regions harboring SNV(s) were collected and grouped by the number of SNVs. (G) PCR validation of a CGT-seq only SNV region in HHZ. The grey area represents the region selected for PCR validation. The dark purple bars on the first track represent re-sequencing depth; the second track represents the pair-wise sequence comparison between the Nipponbare reference and the CGT-seq captured sequence in HHZ; the third track is the Sanger sequencing result. PCR results for three other randomly selected regions harboring CGT-seq only SNVs are shown in Supplemental Figure S7.
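The cumulative curves of panel E and the Kolmogorov–Smirnov comparison between them can be illustrated with a small pure-Python sketch. It computes only the empirical cumulative fraction and the two-sample KS statistic (the largest vertical gap between two curves), not the P value; the function names and depth values are hypothetical.

```python
from bisect import bisect_right


def ecdf(values):
    """Return a function giving the fraction of values <= x."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    n = len(ordered)
    return lambda x: bisect_right(ordered, x) / n


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    cdf_a, cdf_b = ecdf(sample_a), ecdf(sample_b)
    # Step functions only change at observed values, so checking
    # every observed value is sufficient to find the maximum gap.
    points = set(sample_a) | set(sample_b)
    return max(abs(cdf_a(x) - cdf_b(x)) for x in points)


# Hypothetical re-sequencing depths for two SNV classes.
common_depths = [3, 4, 5, 6]
cgt_only_depths = [1, 2, 3, 4]
print(ecdf([5, 12, 20, 30])(10))                     # 0.25
print(ks_statistic(cgt_only_depths, common_depths))  # 0.5
```

A larger statistic means the two depth distributions are further apart; turning it into a P value requires the KS distribution, which a statistics library would normally supply.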
Figure 5. Overlap of assembled contigs between HHZ (indica rice variety) and JZ-1560 (japonica rice variety). H3K36me3 ChIP-seq data were used for assembly.
Figure 6. Donut chart showing the high concordance between CGT-seq captured regions, DHS and TF binding regions. The purple region in the outer circle represents the percentage of CGT-seq captured regions overlapping with DHS, and the blue region in the inner circle represents the percentage of CGT-seq captured regions overlapping with TF binding regions collected from nine ChIP-seq data sets (public data summarized in Supplemental Table S1B).
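The percentages shown in such a donut chart amount to the fraction of captured regions that overlap at least one feature interval. A minimal sketch follows, assuming BED-like half-open `(chrom, start, end)` tuples; the function name and coordinates are hypothetical, and the simple quadratic scan would need an interval index for genome-scale inputs.

```python
def fraction_overlapping(regions, features):
    """Fraction of `regions` overlapping at least one interval in `features`.

    Both inputs are lists of (chrom, start, end) half-open intervals.
    """
    if not regions:
        return 0.0
    # Group features by chromosome so each region is compared only
    # against intervals on its own chromosome.
    by_chrom = {}
    for chrom, start, end in features:
        by_chrom.setdefault(chrom, []).append((start, end))
    hits = 0
    for chrom, start, end in regions:
        if any(start < f_end and f_start < end
               for f_start, f_end in by_chrom.get(chrom, [])):
            hits += 1
    return hits / len(regions)


# Hypothetical captured regions and DHS intervals.
captured = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
dhs = [("chr1", 150, 160), ("chr2", 200, 300)]
share = fraction_overlapping(captured, dhs)
```

With half-open coordinates, intervals that merely touch (as on `chr2` here) are not counted as overlapping.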