Literature DB >> 27556417

Improving alignment accuracy on homopolymer regions for semiconductor-based sequencing technologies.

Weixing Feng¹, Sen Zhao¹, Dingkai Xue¹, Fengfei Song¹, Ziwei Li¹, Duojiao Chen¹, Bo He², Yangyang Hao³, Yadong Wang⁴, Yunlong Liu^5,6.

Abstract

BACKGROUND: Ion Torrent and Ion Proton are semiconductor-based sequencing technologies that feature rapid sequencing speed and low upfront and operating costs, thanks to the avoidance of modified nucleotides and optical measurements. Despite of these advantages, however, Ion semiconductor sequencing technologies suffer much reduced sequencing accuracy at the genomic loci with homopolymer repeats of the same nucleotide. Such limitation significantly reduces its efficiency for the biological applications aiming at accurately identifying various genetic variants.
RESULTS: In this study, we propose a Bayesian inference-based method that takes the advantage of the signal distributions of the electrical voltages that are measured for all the homopolymers of a fixed length. By cross-referencing the length of homopolymers in the reference genome and the voltage signal distribution derived from the experiment, the proposed integrated model significantly improves the alignment accuracy around the homopolymer regions.
CONCLUSIONS: Besides improving alignment accuracy on homopolymer regions for semiconductor-based sequencing technologies with the proposed model, similar strategies can also be used on other high-throughput sequencing technologies that share similar limitations.

Entities: Chemical Disease Gene Species

Keywords: Alignment; Bayesian; Homopolymer; Ion Torrent/Proton

Mesh：

Year: 2016 PMID： 27556417 PMCID： PMC5001236 DOI： 10.1186/s12864-016-2894-9

Source DB: PubMed Journal: BMC Genomics ISSN： 1471-2164 Impact factor: 3.969

Background

The rapid development of high-throughput sequencing technologies leads to appearances of many innovative sequencing platforms [1, 2]. Ion Torrent and Ion Proton are semiconductor-based sequencing platforms that are primarily designed for personal genome sequencing [3, 4]. Different from sequencing techniques enriched with substitution errors [5, 6], Ion semiconductor sequencing platforms suffer from the inaccuracy in detecting the length of homopolymers repeats of the same nucleotide [7, 8]. These homopolymer errors often lead to the inaccurate local alignment results, and become a critical barrier against accurate detection of genomic variations [9-11] (http://www.broadinstitute.org/gatk/media/docs/Samtools.pdf). The sequencing chemistry for the Ion semiconductor-based technology is that the incorporation of a deoxyribonucleotide (dNTP) into a strand of DNA couples with the release of a hydrogen ion, which changes the pH of the solution and then leads to the electronic voltage pulse in the ion sensor. Multiple identical bases on the DNA strand often result in the detection of multiple times of the baseline voltage corresponding to the measurements at mononucleotide loci [12]. The difficulty on the homopolymer length identification mainly results from the inaccurate measurement on the magnitude of the voltage pulse, which follows a signal distribution that can be dependent on multiple factors including the type of nucleotide, the length of homopolymer, and the relative position in the DNA template. Thus far, several algorithms have been proposed in correcting the inaccurate homopolymer length identification, based on the raw data from the detected voltage signals for Ion semiconductor sequencing technologies. Lysholm designed a flow-space FAAST tool, where flowpeak information retrieved from detected voltage signals is utilized to improve accuracy of Smith-Waterman-Gotoh local alignment through correction of likely sequencing errors and thus obtain optimized homopolymer length [8]. However, since dedicatedly designed for the naïve Smith-Waterman-Gotoh algorithm, the method undertakes a heavy computing burden and limits the further application with other alignment programs. In addition, the parameter selection in the algorithm is ad hoc, and was not designed for maximizing the performance. Zeng designed a PyroHMMsnp algorithm, where a hidden Markov model (HMM) is built to recognize overcall or undercall status of homopolymers in a realignment process, and is used to deduce the most possible homopolymer lengths [7]. Similar with other refined alignment algorithms, this approach uses an EM-based strategy, which assumes the variant pattern on most of the reads at one specific loci follows the same distribution; this assumption maybe invalid for certain biological applications, such as the variant identification in cancer somatic tissues. In addition, PyroHMMsnp design does not have hidden state for mismatches, and therefore tends to mistakenly convert mismatches into INDELs. In this project, we aim to develop a simple computational strategy for improving the alignment accuracy by using the voltage signals, and relying only on the measurements of individual sequencing read. In addition to the measured electrical voltage signals, it is evident that the reference genome contains significant amount of prior information that are not adequately considered by other methods. This is under the assumption that only a small percentage of nucleotides are different between two individuals; for human, it is about 1 % of whole genome nucleotides [13, 14]. Based on such consideration, we proposed a Bayesian-based integrated model to merge these two information sources to improve performance of homopolyer length identification. We demonstrate that our algorithm significantly outperformed Torrent Suite, the software package coupled with Ion Torrent and Proton Sequencers for accurately identifying the length of the homopolymer repeats, and therefore improved sequence alignment accuracy.

Methods

Ion Torrent sequencing

Different from imaging-based sequencing platforms, Ion semiconductor technology detects nucleotide composition using electronic sensors. During the sequencing process, the sensor detects released hydrogen ions when nucleotide incorporation occurs. The sensor then detects the pH change caused by hydrogen release, and translates such chemical signal to electrical voltage signal, which is proportional to the number of captured ions. Since one type of nucleotides is sequenced in one machine cycle, if homopolymer exists, the detected voltage level should reflect the length of homopolymer. Despite this simple principle, practically, however, the detected electrical voltage follows a distribution, and in many cases, may not accurately recapitulate the length of homopolymer. In order to design a bioinformatics strategy for correcting the length of homopolymers, we first systematically evaluate the signal distribution of the detected electrical voltage for all the nucleotide positions that share the same homopolymer length, same homopolymer nucleotide type (A, C, G, or T), and similar positions in the sequence reads. The original voltage signals for different nucleotides were extracted from the SFF file, which is exported from the Torrent Suite package.

Bayesian inference of homopolymer length

We design a Bayesian-based model to infer the length of homopolymer based on the local genomic sequence context, including the homopolymer nucleotide type (Ni = A, C, G, or T), detected electrical voltage (V), and the nucleotide position in the sequencing reads (Pj). In the current model, nucleotide position were classified into several categories. In the equation, P(V|N,P,L) is the prior possibility of occurrence of a specific voltage V if given homopolymer length L under situation of nucleotide N and read position P, while P(L) and P(V) respectively represent the probability of a specific homopolymer length L, and the probability of a specific voltage V. Both these two probabilities can be statistically derived from the entire sequencing data. In summary, P(L|N,P,V) is the probability of occurrence of a specific homopolymer length L if given sequencing voltage V under sequencing context of nucleotide N and read position P.

Integrated model to identify homopolymer length

The performance of statistical-based inference model highly relies on fully understanding the sources of detection error, and their intervened relationships. Additional biological information can be used to increase the detection accuracy. For most of the biological applications, it is reasonable to assume that only a small percentage of the nucleotide positions represent true variants comparing to the reference genome. Therefore, combining the homopolymer length in the reference genome with the statistically-inferred homopolymer length can potentially improve the detection accuracy. We therefore construct an integrated model by defining a score S for the homopolymer length at a specific homopolymer loci: In the model, Pen(L|Seq_ref) is a penalty value when mismatch occurs between the reference genome sequence and the deduced Ion Torrent sequence for a given homopolymer length L. The penalty value is defined as 0 for perfect match, −1 for substitution, and −2 for insertions/deletions. In order to ensure that the two types of measurements staying in a similar scale, Bayesian posterior probability P(L|N,P,V) is converted into logarithmic form. In Eq. 2, W is weighting factor to balance the contribution of the Bayesian model-derived score, and reference genome-derived penalty. For one homopolymer, its length L can be determined as the candidate with the largest score Si: For a specific assay, the weighting factor w will be determined by minimizing the identification error for the homopolymers whose length is known, such as samples also detected using other technologies.

Results and discussion

Data preparing

We have tested our model on one HapMap human dataset, NA11881, of which both Ion Torrent data and Illumina sequencing data is available. The availability of such dataset enables training and testing a statistical model for refining the identification of homopolymer length. The Ion Torrent dataset was generated in the Center for Medical Genomics at Indiana University, of which a targeted genomic region of 59 genes were sequenced. The overall targeted genomic area covers 90,918 basepairs. To derive the length of the homopolymer repeats, the electrical voltage signal for each influx nucleotide machine cycle was retrieved from the SFF file, where the type of the nucleotide (A, C, G, or T) is determined. Among 452,161 Ion Torrent sequencing reads that passed quality control, our assay detected 1,430,986 homopolymers with >1.5 voltage units; these regions are defined as homopolymer candidates that are used in further analysis. In order to further characterize the homopolymer profiles being identified in our dataset, we further examine the nucleotide composition of all the detected homopolymers, and their relative loci in the sequenced reads (Fig. 1). We observed enrichment of A and T homopolymers in our dataset, and evenly distributed homopolymer locations (except for the last location due to the varying lengths of the Ion Torrent reads).

Fig. 1

Profile of retrieved homopolymers. Profile of retrieved homopolymers according to (a) nucleotide type and (b) position in the sequencing reads

Profile of retrieved homopolymers. Profile of retrieved homopolymers according to (a) nucleotide type and (b) position in the sequencing reads Illumina sequencing data from the same individual, NA11881, is downloaded from the 1000 Genomes database (http://www.1000genomes.org/data). Due to the chemistry differences, Illumina technology is more accurate in detecting homopolymer lengths. We therefore use the dataset from the Illumina platform as gold standard when refining the length of homopolymer repeats.

Distribution of detected voltage signals in homopolymer repeat regions

We examine the three factors that affect the distribution of the voltage signals on the homopolymer regions, the length of the homopolymer repeats, the types of homopolymer nucleotides, and the relative positions of the homopolymer repeats within a read. The homopolymer positions were classified into four zones depending on their distance from the beginning of the reads, Z1: 1–75 bp, Z2: 76–150 bp, Z3: 151–225 bp, and Z4: 226–300 bp. For the homopolymers with A nucleotides and appear in the first 75 bases, the retrieved signal distributions for each homopolymer length was demonstrated in Fig. 2. The ground truth for the homopolymer is derived from the Illumina dataset, which do not have apparent homopolymer issues. In Fig. 2, the horizontal axis is of voltage level and the vertical axis is of probability density for all the homopolymers of a fixed size. From left to right, there are five curves which correspond to the homopolymers with 2, 3, 4, 5, and 6 nucleotides. Here, the probability density of the voltages are fitted as in Gaussian distributions, where the mean values are 1.85, 2.78, 3.68, 4.64 and 5.57 respectively. It is observed that the standard deviation increases with homopolymer length. It increases from 0.14 for 2-base homopolymers to 0.38 for 6-base ones. A similar trend has been reported elsewhere [7]. This shift clearly suggests that the voltage signals become less specific with the homopolymer length increases. It is critical to consider these factors in the model for accurately inferring the homopolymer length. This is especially important for the sequencing reads with longer homopolymers.

Fig. 2

Prior possibilities of the detected voltages. Prior possibilities of the detected voltages when nucleotide type is A and position in the sequencing reads belongs to Z1

Prior possibilities of the detected voltages. Prior possibilities of the detected voltages when nucleotide type is A and position in the sequencing reads belongs to Z1 Besides homopolymer length, we also observed differences in signal distribution for homopolymers with different nucleotide composition (A, C, G, or T) and their positions in the sequencing reads. As shown in Fig. 3a, when fixing the homopolymer length (n = 4) and homopolymer position zone (Z1, position 1–75 in the sequencing reads), we observed slightly different signal distribution for the homopolymers with different nucleotide compositions. Specifically, the C homopolymers tend to have higher signal values with mean signal intensity at 3.74, as comparing to other three nucleotides with average value at 3.68. In addition, the standard deviation for the C homopolymers (stdev = 0.30) are also slightly larger than the other three types (stdev = 0.24). Similar inconsistency was also observed for homopolymers that locate at different positions zones in the sequencing reads (Fig. 3b). Using all the AAAA as an example, the average signals tend to be higher in the beginning of the reads, and decrease toward the end of the reads. The average signal for Z1 to Z4 is 3.68, 3.57, 3.54, and 3.57 respectively. All these results suggest that the derived voltage signal is dependent on the homopolymer nucleotide composition and its relative positions in the sequencing reads, and should be considered while inferring the length of the homopolymers.

Fig. 3

Other factors in Identification of homopolymer length. Other factors in identification of homopolymer length as (a) nucleotide type when homopolymer length is 4 and position in the sequencing reads belongs to Z1 and (b) position in the sequencing reads when homopolymer length is 4 and nucleotide type is A Motivated by these observations, we develop a Bayesian-based model in inferring the length of homopolymer based on the homopolymer length, their relative positions in the reads, and the detected voltage signal. Since the nucleotide composition includes A, C, G, T and the homopolymer positions are classified into four zones (Z1, Z2, Z3, Z4), in total, 16 Bayesian inference models are built based on the aforementioned prior signal distributions. In each model, the homopolymer length is identified if given a specific voltage level under a particular nucleotide type and position in sequencing read. In fact, after calculation of prior signal distributions of different kinds of homopolymer lengths, the length of homopolymer can be simply decided using naïve counting from the measured electrical voltage or the k-nearest neighbors algorithm. That is to identify the length of one homopolymer according to its nearest distance to the mean values of different prior signal distributions. In such way, the number of identification errors is 169,212, or 11.82 % of the whole 1, 430,986 homopolymers. Comparing to k-nearest neighbors algorithm, with our designed Bayesian inference models, the number of identification errors decrease to 71, 460, or 4.99 % of the whole homopolymers. However, our Bayesian inference result cannot outperform that from the Torrent Suite, where the number of identification errors is 29,623, or 2.07 % of the whole homopolymers. This is due to the fact that significant training has been included the Torrent Suite algorithm, which is proprietary, and uses a large amount of genomic features.

Identification of homopolymer length with Bayesian and reference genome information

Despite the superior performance of the Bayesian model comparing to naïve counting from the measured electrical voltage, both our model and output from the Torrent Suite, experience significant inconsistency based on our dataset with ground truth. Since genetic variants should only occur in a small percentage of genomic loci. We therefore hypothesize that using a combination of voltage signal with the guidance of the standard reference genome will significantly increase the detection accuracy. Using our proposed integrated model with Bayesian and reference genome information, we try to identify homopolymer length. In the integrated model, Eq 2, the weight parameter W was firstly optimized when the best identification result acquired (five cross validation) comparing to the results from the Illumina sequencing results. In Fig. 4a, the process of weight optimization is presented for the model under situation of nucleotide A and position Z1. When the weight is equal to 0, only reference genome information is referred in identification, while the weight equaling to 1 means only Bayesian inference information is used. Finally, the best weight value is equal to 0.28 when the least identification errors were found. The distribution of these errors is presented in Fig. 4b. Since the exact lengths of homopolymers were measured through Illumina platform, among 144,230 homopolymers under situation of nucleotide A and position Z1, lengths of 143,870 homopolymers were successfully identified by our proposed method with 360 errors. This is significant improvement comparing to using Bayesian model only. The performance also improved comparing to relying only on the reference genome, which enables to identify homopolymer-related variants from the sequencing data.

Fig. 4

Identification result of homopolymer lengths. Identification result of homopolymer lengths when nucleotide type is A and position in the sequencing read belongs to Z1. The result is presented as (a) frequency of identification errors and (b) distribution of identification result All the optimized weights and corresponding identification errors are listed in Table 1. In Table 1, comparing with other methods, the best identification result is obtained with our proposed approach, which is also presented in Fig. 5.

Table 1

Identification errors of homopolymer length with different methods

No	Nt	Pos	Count	Errors (%)
				KNN	Torrent suite	Bayesian	Reference	Proposed approach
				KNN	Torrent suite	Bayesian	Reference	Weight	Errors
1	A	1–75	144230	7.002	1.119	2.296	0.298	0.28	0.250
2	A	76–150	112776	12.121	1.651	4.722	0.489	0.34	0.453
3	A	151–225	97568	18.733	2.926	8.150	0.423	0.14	0.421
4	A	226–300	48033	22.292	4.655	10.259	0.535	0.24	0.510
5	C	1–75	88732	6.534	1.843	2.779	0.034	0.14	0.027
6	C	76–150	77650	10.382	2.489	4.595	0.556	0.36	0.121
7	C	151–225	63658	18.581	3.187	6.383	0.545	0.28	0.542
8	C	226–300	35736	17.910	4.600	6.159	0.926	0.30	0.923
9	G	1–75	97493	4.141	1.422	1.826	0.609	0.30	0.376
10	G	76–150	78192	14.874	1.623	3.864	0.322	0.32	0.152
11	G	151–225	64680	16.868	2.273	5.683	1.062	0.14	1.062
12	G	226–300	34116	18.754	2.492	7.985	0.147	0.12	0.147
13	T	1–75	156550	5.186	1.106	2.504	0.076	0.14	0.054
14	T	76–150	152034	11.446	1.571	5.780	0.342	0.30	0.297
15	T	151–225	111090	14.720	2.331	7.290	0.419	0.32	0.362
16	T	226–300	68448	13.912	3.315	8.240	0.723	0.28	0.599

“Count” means the number of each class of homopolymers. “KNN” means the method of K nearest neighbors. “Reference” means only reference information is used in the designed model(Weight = 0)

Fig. 5

Comparison of identification results among different identification methods. Comparison of identification results among different identification methods according to (a) all methods and (b) two methods of only using reference information and the proposed method

Identification errors of homopolymer length with different methods “Count” means the number of each class of homopolymers. “KNN” means the method of K nearest neighbors. “Reference” means only reference information is used in the designed model(Weight = 0) Comparison of identification results among different identification methods. Comparison of identification results among different identification methods according to (a) all methods and (b) two methods of only using reference information and the proposed method To show robustness of our proposed method, we also conducted analysis on one Ion Proton data(HapMap human dataset, NA12878) with the same pipeline and obtained the similar result (Additional file 1). Since more homopolymers retrieved in the Ion Proton data, their positions were classified into five zones depending on their distance from the beginning of the reads, Z1: 1–50 bp, Z2: 51–100 bp, Z3: 101–150 bp, Z4: 151–200 bp and Z5: 201–250 bp.

Conclusions

As an important category of sequencing platform, Ion semiconductor-based technology has been widely utilized due to its good performance of faster and cheaper sequencing. However, the technology is far from perfect and suffers from the problem of homopolymer uncertain length. With Bayesian inference and reference genome information, an integrated model was designed to resolve such a problem. Bayesian inference of homopolymer length was first calculated from detected voltage signals. Merged with reference genome sequences information, the homopolymer length was eventually deduced. Compared to several known algorithms, the proposed method presents a greatly improved performance. It should be noted that the proposed method is designed for refining the sequencing alignment based on individual sequencing read information. This is different from other approaches that rely on the coordinated information from all the reads that align to the same genomic region. Our strategy enables mapping the reads that contain variants in only a small percentage of DNA fragments, such as cancer genome. The general framework of our method can also be used for other sequencing technologies that contain significant amount of sequencing error around homopolymer regions, such as nanopore technology.

13 in total

1. Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors: Heng Li; Jue Ruan; Richard Durbin
Journal: Genome Res Date: 2008-08-19 Impact factor: 9.043

2. The development and impact of 454 sequencing.

Authors: Jonathan M Rothberg; John H Leamon
Journal: Nat Biotechnol Date: 2008-10 Impact factor: 54.908

3. The complete genome of an individual by massively parallel DNA sequencing.

Authors: David A Wheeler; Maithreyan Srinivasan; Michael Egholm; Yufeng Shen; Lei Chen; Amy McGuire; Wen He; Yi-Ju Chen; Vinod Makhijani; G Thomas Roth; Xavier Gomes; Karrie Tartaro; Faheem Niazi; Cynthia L Turcotte; Gerard P Irzyk; James R Lupski; Craig Chinault; Xing-zhi Song; Yue Liu; Ye Yuan; Lynne Nazareth; Xiang Qin; Donna M Muzny; Marcel Margulies; George M Weinstock; Richard A Gibbs; Jonathan M Rothberg
Journal: Nature Date: 2008-04-17 Impact factor: 49.962

Review 4. Progress in ion torrent semiconductor chip based sequencing.

Authors: Barry Merriman; Jonathan M Rothberg
Journal: Electrophoresis Date: 2012-12 Impact factor: 3.535

5. FAAST: Flow-space Assisted Alignment Search Tool.

Authors: Fredrik Lysholm; Björn Andersson; Bengt Persson
Journal: BMC Bioinformatics Date: 2011-07-19 Impact factor: 3.169

6. An analysis of the feasibility of short read sequencing.

Authors: Nava Whiteford; Niall Haslam; Gerald Weber; Adam Prügel-Bennett; Jonathan W Essex; Peter L Roach; Mark Bradley; Cameron Neylon
Journal: Nucleic Acids Res Date: 2005-11-07 Impact factor: 16.971

7. Comparison of sequencing platforms for single nucleotide variant calls in a human sample.

Authors: Aakrosh Ratan; Webb Miller; Joseph Guillory; Jeremy Stinson; Somasekar Seshagiri; Stephan C Schuster
Journal: PLoS One Date: 2013-02-06 Impact factor: 3.240

8. PyroHMMsnp: an SNP caller for Ion Torrent and 454 sequencing data.

Authors: Feng Zeng; Rui Jiang; Ting Chen
Journal: Nucleic Acids Res Date: 2013-05-21 Impact factor: 16.971

9. Using quality scores and longer reads improves accuracy of Solexa read mapping.

Authors: Andrew D Smith; Zhenyu Xuan; Michael Q Zhang
Journal: BMC Bioinformatics Date: 2008-02-28 Impact factor: 3.169

10. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.

Authors: Juliane C Dohm; Claudio Lottaz; Tatiana Borodina; Heinz Himmelbauer
Journal: Nucleic Acids Res Date: 2008-07-26 Impact factor: 16.971

9 in total

1. Discriminating a common somatic ASXL1 mutation (c.1934dup; p.G646Wfs*12) from artifact in myeloid malignancies using NGS.

Authors: Michael O Alberti; Sridhar Nonavinkere Srivatsan; Jin Shao; Samantha N McNulty; Gue Su Chang; Christopher A Miller; Jennifer B Dunlap; Fei Yang; Richard D Press; Qingsong Gao; Li Ding; Jonathan W Heusel; Eric J Duncavage; Matthew J Walter
Journal: Leukemia Date: 2018-06-29 Impact factor: 11.528

Review 2. Sequencing the CYP2D6 gene: from variant allele discovery to clinical pharmacogenetic testing.

Authors: Yao Yang; Mariana R Botton; Erick R Scott; Stuart A Scott
Journal: Pharmacogenomics Date: 2017-05-04 Impact factor: 2.533

3. Variable-order sequence modeling improves bacterial strain discrimination for Ion Torrent DNA reads.

Authors: Thomas M Poulsen; Martin Frith
Journal: BMC Bioinformatics Date: 2017-06-12 Impact factor: 3.169

4. Intelligent biology and medicine in 2015: advancing interdisciplinary education, collaboration, and data science.

Authors: Kun Huang; Yunlong Liu; Yufei Huang; Lang Li; Lee Cooper; Jianhua Ruan; Zhongming Zhao
Journal: BMC Genomics Date: 2016-08-22 Impact factor: 3.969

5. NanoAmpli-Seq: a workflow for amplicon sequencing for mixed microbial communities on the nanopore sequencing platform.

Authors: Szymon T Calus; Umer Z Ijaz; Ameet J Pinto
Journal: Gigascience Date: 2018-12-01 Impact factor: 6.524

6. Nanopore sequencing: An enrichment-free alternative to mitochondrial DNA sequencing.

Authors: Roxanne R Zascavage; Kelcie Thorson; John V Planz
Journal: Electrophoresis Date: 2018-12-13 Impact factor: 3.535

Review 7. Update on Molecular Diagnosis in Extranodal NK/T-Cell Lymphoma and Its Role in the Era of Personalized Medicine.

Authors: Ka-Hei Murphy Sun; Yin-Ting Heylie Wong; Ka-Man Carmen Cheung; Carmen Michelle Yuen; Yun-Tat Ted Chan; Wing-Yan Jennifer Lai; Chun David Chao; Wing-Sum Katie Fan; Yuen-Kiu Karen Chow; Man-Fai Law; Ho-Chi Tommy Tam
Journal: Diagnostics (Basel) Date: 2022-02-05

8. Mitochondrial Sequencing of Missing Persons DNA Casework by Implementing Thermo Fisher's Precision ID mtDNA Whole Genome Assay.

Authors: Daniela Cuenca; Jessica Battaglia; Michelle Halsing; Sandra Sheehan
Journal: Genes (Basel) Date: 2020-11-04 Impact factor: 4.096

9. Tatajuba: exploring the distribution of homopolymer tracts.

Authors: Leonardo de Oliveira Martins; Samuel Bloomfield; Emily Stoakes; Andrew J Grant; Andrew J Page; Alison E Mather
Journal: NAR Genom Bioinform Date: 2022-02-02

9 in total