Baoxing Song1,2,3, Qing Sang4, Hai Wang5, Huimin Pei1, XiangChao Gan2, Fen Wang1.
Abstract
With the broad application of high-throughput sequencing, more whole-genome resequencing data and de novo assemblies of natural populations are becoming available. For a given species, in general, only the reference genome is well established and annotated. Computational tools based on sequence alignment have been developed to investigate the gene models of individuals belonging to the same or closely related species. During this process, inconsistent alignment often obscures genome annotation lift over and leads to improper functional impact prediction for a genomic variant, especially in plant species. Here, we propose the zebraic striped dynamic programming (ZSDP) algorithm, which assigns different weights to genetic features to refine genome annotation lift over. Testing the ZSDP algorithm on both plant and animal genomic data showed that it complements the standard sequence alignment approach for highly diverse individuals. Using the lifted-over genome annotation as anchors, a base-pair resolution genome-wide sequence alignment and variant calling pipeline for de novo assemblies has been implemented in the GEAN software. GEAN can be used to compare haplotype diversity, refine the functional annotation of genetic variants, annotate de novo assembly genome sequences, detect homologous syntenic blocks, improve the quantification of gene expression levels using RNA-seq data, and unify genomic variants for population genetic analysis. We expect that GEAN will become a standard tool as de novo assembly population genetics comes of age.
Keywords: gene expression level quantification; genetic variants uniformization; genome annotation; genome-wide multiple-sequence alignment; weighted sequence alignment
Year: 2019 PMID: 31850053 PMCID: PMC6902276 DOI: 10.3389/fgene.2019.01046
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Example of inconsistent sequence alignments that affect variant functional inference. (A) In the upper panel, the standard sequence alignment suggests that a 13-bp deletion disturbs the splice site and shifts the ORF of AT5G37190.1. The alignment in the lower panel suggests that the 13-bp deletion is located in the intron and that the splice site is conserved. (B) The upper panel shows the annotation of the a1 allele of AT5G37190.1 obtained by coordinate lift over using standard sequence alignment. The lower panel shows the annotation updated by our ZSDP algorithm; the RNA-seq read mapping supports the ZSDP result.
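To make the effect described in Figure 1 concrete, the following minimal sketch shows how a coordinate lift over can be read off a gapped pairwise alignment, and how the placement of a gap changes whether a reference base still has a counterpart in the target. The function name `lift_over`, the 0-based coordinates, and the toy sequences are illustrative assumptions; this is not the GEAN implementation.

```python
def lift_over(aligned_ref, aligned_target, ref_pos):
    """Map a 0-based reference position to a 0-based target position.

    Both inputs are rows of one pairwise alignment with '-' for gaps.
    Returns None when the reference base is aligned to a gap (deleted
    in the target). Hypothetical helper, for illustration only.
    """
    r = t = 0  # ungapped positions consumed so far in ref and target
    for a, b in zip(aligned_ref, aligned_target):
        if a != "-":
            if r == ref_pos:
                return t if b != "-" else None
            r += 1
        if b != "-":
            t += 1
    raise IndexError("ref_pos lies outside the aligned reference")


# Toy reference: exon "ACGG" (positions 0-3) followed by intron "GGTA".
# The target lacks two G's; both alignments below score equally under a
# uniform scoring scheme, but they differ in where the deletion is placed.
ref_row = "ACGGGGTA"
gap_in_intron = "ACGG--TA"  # deletion placed inside the intron
gap_in_exon = "AC--GGTA"    # deletion placed inside the exon

last_exon_base = 3
print(lift_over(ref_row, gap_in_intron, last_exon_base))  # exon end is kept
print(lift_over(ref_row, gap_in_exon, last_exon_base))    # exon end is lost
```

With the gap in the intron, the last exon base maps to a target position and the exon boundary survives the lift over; with the gap in the exon, the same base maps to nothing, which is the kind of inconsistency that alters the inferred functional impact.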
Figure 2. Correlation between the number of transcripts realigned by the ZSDP method and the identical-by-state index for A. thaliana (A) and D. melanogaster (B).
Figure 3. Schematic showing the pipeline for the protein-coding gene annotation of the non-reference genome. The annotation of upstream function would be supplemented with the downstream modules.
Figure 4. Alignment rate of RNA-seq reads to the pseudo-genome (PG) sequence and the Col-0 reference genome (RG) sequence.
Figure 5. Sequence diversity of genes with significantly different read counts (GDRC) versus the whole-genome-wide (WGW) background.
Figure 6. Dot plot of gene positions in Col-0 against the positions of the genes transformed to other de novo assemblies. Dots in red are the result of standard sequence alignment lift over, and those in blue have been realigned by the ZSDP algorithm. (A) Projection of the Arabidopsis thaliana Col-0 genome annotation onto the Ler-0 accession genome sequence using standard sequence alignment. (B) Projection of the Col-0 genome annotation onto the Ler-0 genome sequence using the ZSDP algorithm, for those genes that could not be transformed using standard sequence alignment. (C) Syntenic blocks between Col-0 and Ler-0, highlighted using the transformed annotations. (D) Projection of the A. thaliana Col-0 annotation onto the Cardamine hirsuta genome sequence using standard sequence alignment. (E) Projection of the Col-0 annotation onto the C. hirsuta genome sequence using the ZSDP algorithm, for those genes that could not be transformed using standard sequence alignment. (F) Syntenic blocks between A. thaliana Col-0 and C. hirsuta, highlighted using the transformed annotations.
Figure 7. ZSDP sequence alignment method. When the reference genome sequence is aligned to the target accession genome sequence, different scoring strategies are used for distinct reference regions to construct the score matrix. The purpose is to align the exon regions preferentially.
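The idea in Figure 7, region-dependent scoring during construction of the score matrix, can be sketched as a global (Needleman-Wunsch style) alignment in which every score involving a reference base is scaled by a per-position weight. The function name, the weight values (3.0 for exon, 1.0 for intron), the linear gap penalty, and the rule used to weight insertions are all assumptions made for illustration; the published ZSDP scoring scheme may differ.

```python
def weighted_global_align(ref, target, weights,
                          match=1.0, mismatch=-1.0, gap=-2.0):
    """Global alignment where scores are scaled by weights[i] of ref[i].

    Illustrative sketch only: a gap opposite ref[i] costs gap * weights[i];
    an insertion in the target is charged with the weight of the preceding
    reference base (an assumption, not taken from the paper).
    """
    n, m = len(ref), len(target)
    first_w = weights[0] if weights else 1.0
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = score[i - 1][0] + gap * weights[i - 1]
        move[i][0] = "del"
    for j in range(1, m + 1):
        score[0][j] = score[0][j - 1] + gap * first_w
        move[0][j] = "ins"
    for i in range(1, n + 1):
        w = weights[i - 1]
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == target[j - 1] else mismatch
            candidates = [
                (score[i - 1][j - 1] + w * sub, "diag"),  # align two bases
                (score[i - 1][j] + w * gap, "del"),       # ref base vs gap
                (score[i][j - 1] + w * gap, "ins"),       # gap vs target base
            ]
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
    # Trace back from the bottom-right corner to recover the alignment.
    ref_row, target_row = [], []
    i, j = n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            ref_row.append(ref[i - 1]); target_row.append(target[j - 1])
            i -= 1; j -= 1
        elif step == "del":
            ref_row.append(ref[i - 1]); target_row.append("-")
            i -= 1
        else:
            ref_row.append("-"); target_row.append(target[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(ref_row)), "".join(reversed(target_row))


# Toy reference: exon "ACGG" followed by intron "GGTA"; the target lacks two G's.
ref = "ACGGGGTA"
target = "ACGGTA"
weights = [3.0] * 4 + [1.0] * 4  # hypothetical exon and intron weights
total, ref_row, target_row = weighted_global_align(ref, target, weights)
print(ref_row)
print(target_row)
```

Under uniform weights, any two of the four consecutive G's could be deleted at equal cost. With the higher exon weight, gaps in the exon are more expensive and matches there are worth more, so the only optimal alignment places the deletion in the intron and keeps the exon intact.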
Figure 8. Sliding window sequence alignment method used for the long sequence alignment.