Training alignment parameters for arbitrary sequencers with LAST-TRAIN.

Michiaki Hamada^1,2,3, Yukiteru Ono^4, Kiyoshi Asai^3,5, Martin C Frith^2,3,5.

Abstract

Summary: LAST-TRAIN improves sequence alignment accuracy by inferring substitution and gap scores that fit the frequencies of substitutions, insertions, and deletions in a given dataset. We have applied it to mapping DNA reads from IonTorrent and PacBio RS, and we show that it reduces reference bias for Oxford Nanopore reads. Availability and Implementation: the source code is freely available at http://last.cbrc.jp/. Contact: mhamada@waseda.jp or mcfrith@edu.k.u-tokyo.ac.jp. Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28039163 PMCID: PMC5351549 DOI: 10.1093/bioinformatics/btw742

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The classic approach to pair-wise sequence alignment is to seek alignments that maximize a score, which is a sum of substitution and gap scores. This is equivalent to seeking alignments with maximum likelihood, using a statistical model with probabilities for each kind of substitution (e.g. c→t), insertion, and deletion. This approach was developed several decades ago, mainly for proteins, but also for nucleotide sequences (Chiaromonte et al., 2002; States et al.). It is arguably least suited to homology search, because different homologs of one protein have different levels of divergence, so that one set of parameters cannot be optimal for all homologs.

Here, we are interested in aligning nucleotide sequences that differ mainly by sequencing error. Compared to homology search, it is more likely that a single set of substitution and gap probabilities will be a universally good fit, for one version of one sequencing technology, applied to one type of DNA. On the other hand, these probabilities may be quite different for different technologies, and even for different versions of the same technology. Moreover, these probabilities will differ for unusual types of DNA, such as 80%-AT Plasmodium genomes or PAR-CLIP data (Kerpedjiev et al., 2014). Thus, it would be useful to have a tool that automatically determines suitable parameters for a given dataset.

Although the score/model-based approach to alignment is well-known and classic, it has been surprisingly neglected in recent high-throughput DNA aligners (Kerpedjiev et al., 2014). It is likely that accuracy is maximized by using scores that fit the substitution and gap frequencies in the data.

In this study, we introduce a novel tool, LAST-TRAIN, to train alignment parameters from sequence data. We use it to train parameters for PacBio RS, IonTorrent and Nanopore. Finally, we show that it mitigates reference bias (the tendency to over-estimate haplotypes that appear in the reference genome) for Oxford Nanopore reads.

2 Methods

LAST-TRAIN takes as input a query sequence dataset (e.g. DNA reads) and a reference sequence dataset (e.g. a genome). It uses a standard iterative approach: it first aligns the sequences using some initial score parameters, then infers better score parameters from the alignments, then re-aligns and repeats until the parameters stop changing (Durbin et al.). It achieves adequate speed by an X-drop heuristic (Altschul et al., 1997; Zhang et al.), depletes paralogs using LAST-SPLIT (Frith and Kawaguchi, 2015), and allows different insertion and deletion parameters and non-strand-symmetric substitution parameters. Details are in Supplementary Material S1.
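The align, re-estimate, repeat loop described above can be illustrated with a toy sketch. This is not LAST-TRAIN's actual implementation (those details are in the Supplementary Material): the function names, the `scale` and `pseudocount` values, and the caller-supplied `align` function are all assumptions made for illustration, and only substitution scores are re-estimated here, using the usual log-likelihood-ratio form.

```python
import math
from collections import Counter

BASES = "ACGT"

def count_pairs(alignments):
    """Count aligned base pairs in gapped alignments given as (ref_row, query_row)."""
    pairs = Counter()
    for ref_row, qry_row in alignments:
        for a, b in zip(ref_row, qry_row):
            if a != "-" and b != "-":  # skip gap columns
                pairs[a, b] += 1
    return pairs

def scores_from_counts(pair_counts, scale=4.0, pseudocount=1.0):
    """Turn pair counts into integer log-odds scores.

    score(a, b) = round(scale * ln(p_ab / (p_a * p_b))).
    `scale` and `pseudocount` are illustrative values, not LAST-TRAIN defaults.
    """
    counts = {(a, b): pair_counts.get((a, b), 0) + pseudocount
              for a in BASES for b in BASES}
    total = sum(counts.values())
    ref_freq = {a: sum(counts[a, b] for b in BASES) / total for a in BASES}
    qry_freq = {b: sum(counts[a, b] for a in BASES) / total for b in BASES}
    return {(a, b): round(scale * math.log(
                (counts[a, b] / total) / (ref_freq[a] * qry_freq[b])))
            for a in BASES for b in BASES}

def train(align, initial_scores, max_iterations=20):
    """Re-align and re-estimate until the scores stop changing.

    `align` is a hypothetical caller-supplied function that maps a score
    dictionary to a list of (ref_row, query_row) alignments.
    """
    scores = initial_scores
    for _ in range(max_iterations):
        new_scores = scores_from_counts(count_pairs(align(scores)))
        if new_scores == scores:  # converged: parameters stopped changing
            break
        scores = new_scores
    return scores
```

In a real setting `align` would call an aligner with the current scores, so the alignments (and hence the counts) change between iterations; gap parameters would be re-estimated from gap counts in the same loop.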

3 Results

3.1 Mitigating reference bias in haplotype phasing

Long-read DNA sequencing is a promising way to determine phasing between DNA variants. Most human cells have two copies, maternal and paternal, of each chromosome (except Y). Suppose that a patient has two variants in different exons of one gene, where each variant destroys the gene’s function but is present in just one chromosome (maternal or paternal). It is important to know whether they are in the same chromosome.

A previous study attempted to determine the haplotypes of CYP2D6, a gene that affects metabolism of clinical drugs, in human sample NA12878, by PCR amplification of the relevant genomic region followed by Oxford Nanopore sequencing (Ammar et al.). This sample is known to have the two haplotypes CYP2D6*3 and CYP2D6*4, shown in Supplementary Table S14 (Numanagić et al., 2015; Twist et al.); however, the original study found a prominent third haplotype. A later re-analysis (Laver et al., 2016) suggested two reasons for this. First, chimeric cross-overs between the two variants appeared during PCR amplification, producing two false haplotypes. Second, because one of the false haplotypes matches the reference genome, it was prominently detected after aligning the DNA reads to the reference. The latter phenomenon is termed reference alignment bias.

We reasoned that, if our trained parameters produce more accurate alignments, they should reduce reference bias. Following Ammar et al., we used high-quality (2D) reads (SRA1748415): 7540 reads with average length 3486. First, we trained alignment parameters using these reads and human reference genome hg19. The resulting substitution score matrix is given after Section 3.2, and the gap costs are 12 + 3k for a length-k deletion and 15 + 3k for a length-k insertion. Second, the haplotypes for two target polymorphism sites (Supplementary Materials S14) were predicted in the following direct and simple manner. (i) Each read was mapped to the reference genome (hg19) and its best alignment was taken. (Multiple mappings are expected because CYP2D6 has a paralog with about 94% similarity.) (ii) Among these alignments, those covering both target polymorphism sites in CYP2D6 were kept, and the count and frequency of each haplotype were computed from them.

The results are shown in Table 1, indicating that reference bias is lessened by aligning with trained parameters, compared to GraphMap (v0.3.0) (Sović et al., 2016), BLASR (Chaisson and Tesler, 2012) and LAST with manually-determined parameters. Specifically, the frequency of the chimeric reference haplotype is reduced, whereas the frequency of the chimeric non-reference haplotype is increased.
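To make the trained parameters concrete, the following sketch scores a gapped alignment using the substitution scores and gap costs reported for these reads. Treating matrix rows as the reference base and columns as the read base is an assumption made here (the text does not state the orientation), and `alignment_score` is a hypothetical helper, not part of LAST.

```python
BASES = "ACGT"

# Trained substitution scores as reported for the Nanopore 2D reads.
# Assumption: rows = reference base, columns = read base.
MATRIX_ROWS = [
    [7, -11, -8, -16],
    [-7, 5, -8, -7],
    [-5, -8, 5, -8],
    [-19, -12, -13, 7],
]
SCORE = {(a, b): MATRIX_ROWS[i][j]
         for i, a in enumerate(BASES) for j, b in enumerate(BASES)}

def deletion_cost(k):
    """Cost of a length-k deletion (bases present in the reference only)."""
    return 12 + 3 * k

def insertion_cost(k):
    """Cost of a length-k insertion (bases present in the read only)."""
    return 15 + 3 * k

def alignment_score(ref_row, read_row):
    """Sum substitution scores and subtract affine gap costs.

    '-' in read_row marks a deletion; '-' in ref_row marks an insertion.
    """
    score, i, n = 0, 0, len(ref_row)
    while i < n:
        if read_row[i] == "-":
            k = 0
            while i < n and read_row[i] == "-":
                k += 1
                i += 1
            score -= deletion_cost(k)
        elif ref_row[i] == "-":
            k = 0
            while i < n and ref_row[i] == "-":
                k += 1
                i += 1
            score -= insertion_cost(k)
        else:
            score += SCORE[ref_row[i], read_row[i]]
            i += 1
    return score

print(alignment_score("ACGTTA", "AC--TA"))  # 7 + 5 - (12 + 6) + 7 + 7 = 8
```

With these costs an insertion is penalized 3 more than a deletion of the same length, so the two gap types are treated asymmetrically.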

Table 1

Results of haplotype phasing with Nanopore long reads (NA12878)

		GraphMap		BLASR		LAST				LAST + LAST-TRAIN
						Manual (q = 1)		Manual (q = 2)		Training		Training+LAMA
Haplotype	Polymorphism	Count	Freq	Count	Freq	Count	Freq	Count	Freq	Count	Freq	Count	Freq
TT: CT	CYP2D6*4	207	11.9%	227	18.4%	340	20.1%	182	21.2%	327	27.3%	343	27.4%
TT: CC	(reference bias)	225	13.0%	329	26.6%	326	19.3%	164	19.1%	160	13.4%	134	10.7%
T−: CT		70	4.0%	31	2.5%	65	3.8%	36	4.2%	75	6.3%	78	6.2%
T−: CC	CYP2D6*3	226	13.0%	281	22.8%	232	13.7%	199	23.1%	199	16.6%	217	17.3%
Other		1006	58.1%	367	29.7%	726	43.1%	279	32.4%	436	36.4%	480	38.3%
Total		1734	100.0%	1235	100.0%	1689	100.0%	860	100.0%	1197	100.0%	1252	100.0%

In the first column, TX:CY indicates the phased haplotype where the 1st position (rs35742686) is ‘X’ (‘T’ in the reference genome) and the 2nd position (rs3892097) is ‘Y’ (‘C’ in the reference genome). See also Supplementary Table S14. The high frequency for TT:CC (the haplotype identical to the reference genome) is known as reference bias (Laver et al., 2016). The values for ‘BLASR’ were computed from the mapping results in Ammar et al., where BLASR was used for mapping Nanopore reads to the reference genome. The column ‘Training+LAMA’ shows the results of probabilistic alignment (Hamada et al., 2011) using forward scores with the parameters trained by LAST-TRAIN. See Supplementary Materials S7 for the detailed command line options for every tool.
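The counts and frequencies in Table 1 come from tallying the bases observed at the two polymorphism sites in each alignment that covers both. A minimal sketch of that tally follows. The input format (one pair of observed bases per alignment, with an ASCII '-' for a deleted base) and the function name are assumptions for illustration, and the example rebuilds the ‘Training’ column from its published counts rather than from real alignments.

```python
from collections import Counter

# Map (base at rs35742686, base at rs3892097) to the labels used in Table 1.
NAMED = {
    ("T", "T"): "TT:CT",  # CYP2D6*4
    ("T", "C"): "TT:CC",  # identical to the reference
    ("-", "T"): "T-:CT",
    ("-", "C"): "T-:CC",  # CYP2D6*3
}

def tally_haplotypes(calls):
    """Return {label: (count, percent)} for an iterable of two-site calls."""
    counts = Counter(NAMED.get(call, "Other") for call in calls)
    total = sum(counts.values())
    return {label: (n, round(100.0 * n / total, 1))
            for label, n in counts.items()}

# Rebuild the 'Training' column of Table 1 from its counts.
calls = ([("T", "T")] * 327 + [("T", "C")] * 160 + [("-", "T")] * 75
         + [("-", "C")] * 199 + [("A", "C")] * 436)
for label, (count, percent) in tally_haplotypes(calls).items():
    print(label, count, f"{percent}%")
```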

In addition, we performed probabilistic alignment (Hamada et al., 2011) with trained parameters, because probabilistic alignment tends to estimate columns in the alignment more accurately than conventional alignment. Specifically, we used LAMA alignment (lastal option ‘-j6’) with γ = 2 (Hamada et al., 2011). In this case, we chose the best alignment using ‘forward’ scores (from summing the probabilities of all alignments in the X-drop algorithm) instead of conventional scores (from the single best alignment). Table 1 suggests that trained parameters with LAMA alignment improve the results further (lower frequency of TT:CC).

3.2 Further results

We applied LAST-TRAIN to PacBio, IonTorrent, and Oxford Nanopore DNA reads using several available datasets (Supplementary Materials S2). It successfully recovers known features, such as PacBio having more insertions than deletions (Supplementary Materials S4), and we established that a query sample size of 1–10 million bases is sufficient (Supplementary Materials S3). Moreover, evaluation on simulated datasets indicates that trained parameters slightly improved alignment accuracy (Supplementary Materials S5). All the trained parameters and their statistics are given in Supplementary Materials S9. Finally, the Supplement discusses LAST-TRAIN versus MarginAlign (Jain et al., 2015), resisting the temptation of over-alignment, and use of sequence quality data (Supplementary Materials S8).

Substitution score matrix trained on the Nanopore 2D reads (Section 3.1):

	A	C	G	T
A	7	−11	−8	−16
C	−7	5	−8	−7
G	−5	−8	5	−8
T	−19	−12	−13	7
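The Methods note that LAST-TRAIN allows non-strand-symmetric substitution parameters. As a small illustrative check (the helper names are hypothetical), the sketch below tests whether the matrix above is symmetric under swapping the two sequences, and whether it is symmetric under reverse-complementing both strands. Neither check depends on which axis is the reference.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# The trained matrix shown above, row by row.
ROWS = [
    [7, -11, -8, -16],
    [-7, 5, -8, -7],
    [-5, -8, 5, -8],
    [-19, -12, -13, 7],
]
S = {(a, b): ROWS[i][j]
     for i, a in enumerate(BASES) for j, b in enumerate(BASES)}

def is_symmetric(s):
    """True if score(a, b) == score(b, a) for every pair of bases."""
    return all(s[a, b] == s[b, a] for a in BASES for b in BASES)

def is_strand_symmetric(s):
    """True if score(a, b) == score(comp(a), comp(b)) for every pair."""
    return all(s[a, b] == s[COMPLEMENT[a], COMPLEMENT[b]]
               for a in BASES for b in BASES)

print(is_symmetric(S))         # False: e.g. S[A,C] = -11 but S[C,A] = -7
print(is_strand_symmetric(S))  # False: e.g. S[A,C] = -11 but S[T,G] = -13
```

The match scores happen to be strand-symmetric (7 for A and T, 5 for C and G); the asymmetry is in the mismatch scores.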

References (13 in total)

1. Scoring pairwise genomic sequence alignments.

Authors: F Chiaromonte; V B Yap; W Miller
Journal: Pac Symp Biocomput Date: 2002

2. Probabilistic alignments with quality scores: an application to short-read mapping toward accurate SNP/indel detection.

Authors: Michiaki Hamada; Edward Wijaya; Martin C Frith; Kiyoshi Asai
Journal: Bioinformatics Date: 2011-10-05 Impact factor: 6.937

3. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Authors: S F Altschul; T L Madden; A A Schäffer; J Zhang; Z Zhang; W Miller; D J Lipman
Journal: Nucleic Acids Res Date: 1997-09-01 Impact factor: 16.971

4. Split-alignment of genomes finds orthologies more accurately.

Authors: Martin C Frith; Risa Kawaguchi
Journal: Genome Biol Date: 2015-05-21 Impact factor: 13.583

5. Cypiripi: exact genotyping of CYP2D6 using high-throughput sequencing data.

Authors: Ibrahim Numanagić; Salem Malikić; Victoria M Pratt; Todd C Skaar; David A Flockhart; S Cenk Sahinalp
Journal: Bioinformatics Date: 2015-06-15 Impact factor: 6.937

6. Adaptable probabilistic mapping of short reads using position specific scoring matrices.

Authors: Peter Kerpedjiev; Jes Frellsen; Stinus Lindgreen; Anders Krogh
Journal: BMC Bioinformatics Date: 2014-04-09 Impact factor: 3.169

7. Improved data analysis for the MinION nanopore sequencer.

Authors: Miten Jain; Ian T Fiddes; Karen H Miga; Hugh E Olsen; Benedict Paten; Mark Akeson
Journal: Nat Methods Date: 2015-02-16 Impact factor: 28.547

8. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory.

Authors: Mark J Chaisson; Glenn Tesler
Journal: BMC Bioinformatics Date: 2012-09-19 Impact factor: 3.169

9. Fast and sensitive mapping of nanopore sequencing reads with GraphMap.

Authors: Ivan Sović; Mile Šikić; Andreas Wilm; Shannon Nicole Fenlon; Swaine Chen; Niranjan Nagarajan
Journal: Nat Commun Date: 2016-04-15 Impact factor: 14.919

10. Pitfalls of haplotype phasing from amplicon-based long-read sequencing.

Authors: Thomas W Laver; Richard C Caswell; Karen A Moore; Jeremie Poschmann; Matthew B Johnson; Martina M Owens; Sian Ellard; Konrad H Paszkiewicz; Michael N Weedon
Journal: Sci Rep Date: 2016-02-17 Impact factor: 4.379
