Yun Zhang, Chanhee Park, Christopher Bennett, Micah Thornton, Daehwan Kim.
Abstract
Sequencing technologies using nucleotide conversion techniques, such as cytosine to thymine in bisulfite-seq and thymine to cytosine in SLAM-seq, are powerful tools to explore the chemical intricacies of cellular processes. To date, no one has developed a unified methodology for aligning converted sequences and consolidating alignment of these technologies in one package. In this paper, we describe hierarchical indexing for spliced alignment of transcripts-3 nucleotides (HISAT-3N), which can rapidly and accurately align sequences consisting of any nucleotide conversion by leveraging the powerful hierarchical index and repeat index algorithms originally developed for the HISAT software. Tests on real and simulated data sets show that HISAT-3N is faster than other modern systems, with greater alignment accuracy, higher scalability, and smaller memory requirements. HISAT-3N therefore becomes an ideal aligner when used with converted sequence technologies.
Year: 2021 PMID: 34103331 PMCID: PMC8256862 DOI: 10.1101/gr.275193.120
Source DB: PubMed Journal: Genome Res ISSN: 1088-9051 Impact factor: 9.043
Performance comparison for HISAT-3N, Bismark, BS-Seeker2, and BSMAP on 10 million simulated 100-bp paired-end BS-seq reads
Performance comparison for HISAT-3N, Bismark, BS-Seeker2, and BSMAP on 78 million real whole-genome paired-end BS-seq reads
Performance comparison between HISAT-3N and SLAM-DUNK on 10 million simulated 100-bp single-end SLAM-seq reads
Performance comparison between HISAT-3N and SLAM-DUNK on 45 million real single-end SLAM-seq reads
Summary of several nucleotide conversion sequencing methods
Figure 1. Repeat index enables faster 3-nt read alignment. (A) HISAT-3N aligns reads using two different strategies: (1) HISAT-3N can directly align reads to the whole genome using the genome index and output their mapped locations (A, left), and (2) HISAT-3N can use a repeat index to uniquely align reads to the repeat sequences, regardless of how many genomic locations they align to (A, right). (B) Runtime comparison between the direct mapping and repeat mapping strategies. The test data are 10 million simulated single-end BS-seq reads (0.2% per-base sequencing error rate).
Scalability comparison between HISAT-3N, Bismark, BS-Seeker2, and BSMAP on 10 million simulated 100-bp paired-end BS-seq reads (0.2% per-base sequencing error rate)
Figure 2. HISAT-3N alignment steps for BS-seq reads. (A) HISAT-3N converts each input read (READ) to two 3N reads: READ-3N and READ-RC-3N. READ-3N is READ with all cytosine replaced by thymine. READ-RC-3N is the reverse complement of READ, plus the replacement of cytosine with thymine. (B) HISAT-3N maps the two 3N reads to both REF-3N and REF-RC-3N references using prebuilt indexes. (C) After the 3-nt alignment, HISAT-3N compares the original read sequence (READ) to the original 4-nt references (REF and REF-RC) to identify unmethylated cytosine positions and recalculate an alignment score accordingly.
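The read conversion in step (A) and the base-by-base comparison in step (C) of the Figure 2 caption can be illustrated with a minimal Python sketch. This is not HISAT-3N source code: the function names (`reverse_complement`, `to_3n`, `convert_read`, `conversion_positions`) are hypothetical, the conversion is assumed to be cytosine to thymine as in BS-seq, and the comparison assumes an ungapped alignment in which the read and the reference segment have equal length. The index-based mapping of step (B) and the alignment rescoring are not modeled.

```python
from typing import List, Tuple

# Illustrative sketch only; all names here are hypothetical, not HISAT-3N APIs.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_3n(seq: str, source: str = "C", target: str = "T") -> str:
    """Collapse the 4-letter alphabet to 3 letters by replacing source with target."""
    return seq.replace(source, target)


def convert_read(read: str) -> Tuple[str, str]:
    """Step (A): build READ-3N and READ-RC-3N for a C-to-T (BS-seq) conversion."""
    read_3n = to_3n(read)
    read_rc_3n = to_3n(reverse_complement(read))
    return read_3n, read_rc_3n


def conversion_positions(read: str, ref: str) -> List[int]:
    """Step (C), simplified: offsets where the 4-nt reference has C and the
    original read has T, i.e. candidate unmethylated cytosines.
    Assumes an ungapped alignment of equal-length sequences."""
    if len(read) != len(ref):
        raise ValueError("read and reference segment must have equal length")
    return [i for i, (r, q) in enumerate(zip(ref, read)) if r == "C" and q == "T"]


if __name__ == "__main__":
    read_3n, read_rc_3n = convert_read("ACGTTCGA")
    print(read_3n, read_rc_3n)
    print(conversion_positions("ATGTTCGA", "ACGTTCGA"))
```

Passing `source="T", target="C"` to `to_3n` would give the analogous conversion for thymine-to-cytosine protocols such as SLAM-seq, under the same simplifying assumptions.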