Literature DB >> 32485813

Positional Correlation Natural Vector: A Novel Method for Genome Comparison.

Lily He¹, Rui Dong¹, Rong Lucy He², Stephen S-T Yau¹.

Abstract

Advances in sequencing technology have made large amounts of biological data available. Evolutionary analysis of data such as DNA sequences is highly important in biological studies. As alignment methods are ineffective for analyzing large-scale data due to their inherently high costs, alignment-free methods have recently attracted attention in the field of bioinformatics. In this paper, we introduce a new positional correlation natural vector (PCNV) method that involves converting a DNA sequence into an 18-dimensional numerical feature vector. Using frequency and position correlation to represent the nucleotide distribution, it is possible to obtain a PCNV for a DNA sequence. This new numerical vector design uses six suitable features to characterize the correlation among nucleotide positions in sequences. PCNV is also very easy to compute and can be used for rapid genome comparison. To test our novel method, we performed phylogenetic analysis with several viral and bacterial genome datasets with PCNV. For comparison, an alignment-based method, Bayesian inference, and two alignment-free methods, feature frequency profile and natural vector, were performed using the same datasets. We found that the PCNV technique is fast and accurate when used for phylogenetic analysis and classification of viruses and bacteria.

Entities: Chemical Disease Gene Species

Keywords: alignment-free; genome comparison; phylogenetic analysis; positional correlation natural vector

Mesh：

Year: 2020 PMID： 32485813 PMCID： PMC7312176 DOI： 10.3390/ijms21113859

Source DB: PubMed Journal: Int J Mol Sci ISSN： 1422-0067 Impact factor: 5.923

1. Introduction

Predicting the structures, functions, and evolutionary relationships of genes is a fundamental and vital aspect of modern biological research. Therefore, the comparison of genetic sequences is a pivotal step in many protocols and numerous approaches have been employed for this task. Most researchers use conventional alignment-based techniques for sequence comparison; these techniques involve sequence alignment based on selected scoring systems. The algorithms used are generally precise and highlight correlations among sequences. Several sequence alignment methods have been implemented via software packages, such as MrBayes [1]. However, alignment-based methods have disadvantages: they are slow and require a large amount of memory. Furthermore, based on previous studies, multiple sequence alignment (MSA)-based methods cannot be extended with using the huge datasets currently available [2]. Therefore, alignment-free (AF) methods may be used to overcome these problems [3]. Additionally, AF sequence comparison is drawing great interest driven by data-rich applications [4]. A notable common feature of AF approaches is the analysis of special numerical properties of the sequences being compared. High computational efficiency is observed when such techniques are applied to gene and protein data. A series of AF methods for sequence comparison has been developed. AF approaches include iterated-function systems [5], information theory [6], Fourier transformations [7], sequence representations based on chaos theory [8], and moments of the positions of the nucleotides [9,10]. The most widely used AF method is the k-mer-based method and has been published in many excellent journals [11,12,13,14,15,16,17,18,19]. This method involves the analysis of the frequency of strings of specific length k within sequences [20]. Several k-mer-based methods have been developed and applied for the phylogenetic analysis of bacteria and viruses. A notable example is feature frequency profiles (FFP) [21]. Although k-mer-based methods have been applied widely, they do not include positional correlations of nucleotides. However, it is significant to investigate the location for gene sequence comparison. Therefore, positional correlation is important for computational and analytical approaches. Recently, two methods based on moments of the positions of the nucleotides, namely natural vector (NV) [9] and multiple encoding vector (MEV) method [10], were proposed. They were successfully used for the classification and phylogenetic analysis of sequences. The NV method uses frequency, average site, and variance of site to compare sequences. Based on NV, the MEV method can add information about the chemical and physical properties of a nucleotide. The distribution of four bases is considered independently in these two methods. However, it has been reported that the four bases are correlated; in fact, the correlation of nucleotides is based on the widely applied hidden Markov model (HMM) [22]. In the present study, we propose a novel 18-dimensional numerical feature vector method to characterize DNA sequences. The method is named positional correlation natural vector (PCNV) to characterize DNA sequences. Our vector contains the frequency, average, and variance the locations of four bases. Furthermore, we added the position correlation of each pair of the four bases as important features. We tested the PCNV method using several datasets and compared it with the alignment-based Bayesian inference approach, which can be applied using MrBayes software [1], as well as two AF methods, FFP and NV.

2. Results

To demonstrate that PCNV is effective, we applied it to different datasets: the genomes of hepatitis C virus (HCV), hepatitis B virus (HBV), human papillomavirus (HPV), dengue virus (DENV), and 59 bacterial species. The length of the sequences studied ranged from thousands to millions of base pairs. For each dataset, the PCNVs of the sequences were computed using MATLAB R2016a and phylogenetic trees were reconstructed using MEGA 7. Finally, we evaluated the performance of our methods based on sensitivity, specificity, and accuracy. Computations were performed on a PC with Intel Core i7-6560U CPU @ 2.20 GHz and 8 GB RAM.

2.1. Phylogeny of HCV

Using our PCNV method, 82 HCVs are correctly clustered into six clades, as shown in Figure 1a [23]. Using the FFP method, the value of k for analysis of these viruses is 6. As shown in Figure 1b, some Genotype 6 and 1 HCVs are clustered together incorrectly in the FFP phylogenetic tree. Furthermore, a sequence in Genotype 4 is assigned to Genotype 6. PCNV produces better results for this dataset than FFP. The Bayesian inference method was also utilized for the evolutionary analysis of this dataset. Figure 1c shows that this method divides Genotype 3, shown in violet, into two groups.

Figure 1

(a) The Neighbor-Joining phylogenetic tree of 82 HCV genome sequences based on PCNV method. (b) The Neighbor-Joining phylogenetic tree of 82 HCV genome sequences based on FFP method (k = 6). (c) The phylogenetic tree of 82 HCV genome sequences based on Bayesian inference method.

2.2. Phylogeny of HBV

Using PCNV, 152 HBVs are correctly divided into eight lineages, as shown in Figure 2a. The phylogenetic tree created using the FFP method is shown in Figure 2b. According to the HBV database, “AJ627224” belongs to Genotype D. However, according to the FFP method, it is related to Genotype B. The FFP method also cluster “FJ356715” and “FJ356716”, belonging to Genotype H, to Genotypes F and G, respectively. Therefore, for this dataset, our PCNV method is superior to FFP. The phylogenetic tree created using the traditional NV method is shown in Figure 2c. However, in this tree, three Genotype C viruses are classified into other groups. The phylogenetic tree created using Bayesian inference is shown in Figure 2d. It shows that “AB371164” belongs to Genotype H, as separated from Genotype H. This is an indication that the positional correlation between nucleotides can improve the accuracy of classification.

Figure 2

(a) The Neighbor-Joining phylogenetic tree of 152 HBV genome sequences based on PCNV method. (b) The Neighbor-Joining phylogenetic tree of 152 HBV genome sequences based on FFP method (k = 5). (c) The Neighbor-Joining phylogenetic tree of 152 HBV genome sequences based on NV method. (d) The phylogenetic tree of 152 HBV genome sequences based on Bayesian inference method.

2.3. Phylogeny of DENV

As shown in Figure 3a, the phylogenetic tree constructed using PCNV classifies all viruses into the correct categories. However, as shown in Figure 3b, the NV method divides Genotype 1 into two clusters. Therefore, once again, it is clear that positional correlation between nucleotides can effectively improve the NV method.

Figure 3

(a) The Neighbor-Joining phylogenetic tree of 330 dengue viruses genome sequences based on PCNV method. (b) The Neighbor-Joining phylogenetic tree of 330 dengue viruses genome sequences based on NV method.

2.4. Phylogeny of HPV

We found that PCNV categorizes the dataset into the correct biological groups in 0.78 s (Figure 4a; Table 1); this is much faster than the FFP method, which takes 35 s (Table 1). The Bayesian inference method divide Genotype 11 into two parts, as highlighted in cyan in Figure 4b.

Figure 4

(a) The Neighbor-Joining phylogenetic tree of 326 HPV genome sequences based on PCNV method. (b) The phylogenetic tree of 326 HPV genome sequences based on Bayesian inference method.

Table 1

Running time for PCNV, Bayesian inference, FFP, AFKS, and Muscle methods. “∼”, unable to compute on laptop.

Method	HCV	HBV	Dengue	HPV	Bacteria
	(82)	(152)	(330)	(326)	(59)
PCNV	0.33s	0.27s	0.66s	0.78s	53.71s
Bayesian	1097s	263s	217,353s	217,512s	∼
inference
FFP	11.11s	0.38s	49.40s	35.00s	larger than
	(k = 6)	(k = 5)	(k = 6)	(k = 6)	1 day (k = 11)
AFKS	70.21s	29.62s	429.87s	413.79s	larger than
	(k = 5)	(k = 4)	(k = 5)	(k = 5)	4 day (k = 9)
Muscle	753s	155s	3740s	4002s	∼

2.5. Phylogeny of Bacteria

The dataset consisted of 14 families, as shown in Figure 5a,b, of bacterial species with long genomes that ranged from 0.8 to 5 million bp. Using the PCNV method, the phylogenetic tree of these organisms was reconstructed. As shown in Figure 5a, the 59 bacterial species are divided into 14 families that are separated from each other. The 11-mer FFP method mixed these families (Figure 5b). Additionally, the run time for FFP is more than a day, which is far longer than the time required for PCNV. Bayesian inference takes even longer, to the extent that it is not possible to complete the analysis using this method in Muscle on a server equipped with an Intel Xeon E5-2667 v3 Processor and Linux Home Premium with 384 GB RAM (Table 1).

Figure 5

(a) The Neighbor-Joining phylogenetic tree of 59 bacteria genome sequences based on PCNV method. (b) The Neighbor-Joining phylogenetic tree of 59 bacteria genome sequences based on FFP method (k = 11).

2.6. Classification

Besides evolutionary analysis, PCNV can also be used for classification. Both FFP and Alignment-Free-Kmer-Statistics (AFKS) [16], based on the k-mer approach, can also be used for classification. However, the question of how to choose the value of k is not easily answered. In the present paper, for the FFP method, we set the k value as the minimum integer of , i.e., , where N is the maximum length of the sequences studied [21]. For the AFKS method, we used , where n is the number of sequences in the set S [16]. In the PCNV method, after computing the distance matrix using each approach, the one-nearest neighbor (1-NN) [24] method was used for predictions. The sensitivity, specificity, and accuracy of the predictions made using each method are shown in Table 2. It is clear that PCNV is superior to the other two algorithms in this study.

Table 2

Sensitivity (Sens), Specificity (Spec), and Accuracy (Acc) measures of classification are reported for the four virus datasets. For each dataset, the Ave. line displays average values for each measure.

	Nu-	Sens	Sens	Sens	Spec	Spec	Spec	Acc	Acc	Acc
	Type	mber	PCNV	FFP	AFKS	PCNV	FFP	AFKS	PCNV	FFP	AFKS
			(%)	(%)	(%)	(%)	(%)	(%)	(%)	(%)	(%)
HCV(82)	type1	16	100	62.5	50.0	100	87.9	86.4	100	62.5	50.0
	type2	18	100	55.6	94.4	100	93.8	98.4	100	55.6	94.4
	type3	20	100	80.0	90.0	100	93.5	96.8	100	80.0	90.0
	type4	12	100	50.0	33.3	100	97.1	90.0	100	50.0	33.3
	type5	4	100	50.0	75.0	100	96.2	97.4	100	50.0	75.0
	type6	12	100	50.0	83.3	100	91.4	98.6	100	50.0	83.3
	Ave.		100	58.0	71.0	100	93.3	94.6	100	58.0	71.0
HBV(152)	A	20	100	100	100	100	100	99.2	100	100	100
	B	15	100	100	40.0	100	100	96.4	100	100	40.0
	C	20	100	100	70.0	100	100	96.2	100	100	70.0
	D	13	100	100	76.9	100	100	97.1	100	100	76.9
	E	30	100	100	90.0	100	100	97.5	100	100	90.0
	F	22	100	100	72.7	100	100	93.8	100	100	72.7
	G	17	100	100	94.1	100	100	99.3	100	100	94.1
	H	15	100	100	80.0	100	100	97.1	100	100	80.0
	Ave.		100	100	78.0	100	100	97.1	100	100	78.0
Dengue(330)	type1	72	100	100	76.4	100	100	93.4	100	100	76.4
	type2	75	100	100	73.3	100	100	93.3	100	100	73.3
	type3	83	100	100	78.3	100	100	92.7	100	100	78.3
	type4	100	100	100	87.0	100	100	93.0	100	100	87.0
	Ave.		100	100	78.8	100	100	93.1	100	100	78.8
HPV(326)	6	24	100	100	75.0	100	100	97.7	100	100	75.0
	11	17	100	100	100	100	100	99.7	100	100	100
	16	99	100	100	92.9	100	100	96.5	100	100	92.9
	18	19	100	100	94.7	100	100	100	100	100	94.7
	31	23	100	100	82.6	100	100	99.0	100	100	82.6
	33	22	100	100	86.4	100	100	99.7	100	100	86.4
	35	26	100	100	88.5	100	100	99.0	100	100	88.5
	45	12	100	100	83.3	100	100	99.7	100	100	83.3
	52	22	100	100	81.8	100	100	98.4	100	100	81.8
	53	14	100	100	85.7	100	100	98.7	100	100	85.7
	58	37	100	100	94.6	100	100	99.7	100	100	94.6
	66	11	100	100	90.9	100	100	99.7	100	100	90.9
	Ave.		100	100	88.0	100	100	99.0	100	100	88.0

3. Discussion

In the present paper, we propose a novel 18-dimensional vector method for genome comparison. This PCNV method can be used to successfully define the distribution of nucleotides based on information on the frequency and position of DNA sequences. The correlation of position between two different bases is used in addition to the average position and variance of position of each base. As a result, a high-dimensional DNA genome sequence is reduced to an 18-dimensional numerical vector. Correlations in base distribution play a key role in sequence comparison. Usually, conventional alignment-based methods produce reasonable phylogenetic trees and are therefore widely applied. However, when the dataset volume is large or the sequences analyzed are very long, these methods become ineffective. The phylogenetic analysis results on several distinct datasets show that PCNV can quickly and accurately compare massive datasets of long DNA sequences. We also compared our method with three methods: the popular alignment-based Bayesian inference method, the alignment-free FFP, and Natural Vector methods. The results show that our method can construct more accurate evolutionary relationships among sequences. To demonstrate the computational advantage of PCNV, we compared our running time constructing phylogenetic trees with FFP, AFKS, Bayesian inference, and Muscle. Compared with the two extensively applied alignment-free methods FFP and AFKS, the running time of PCNV is smallest for all datasets and even takes less than 1 s, except on the bacteria dataset. For bacteria dataset, PCNV is extremely fast and only takes about 53 s, while FFP and AFKS take more than one day. Compared with the two alignment-based methods Bayesian inference and Muscle, PCNV takes much less time for all datasets. For bacteria dataset, Bayesian inference and Muscle cannot obtain phylogeny tree within several days. With long DNA sequences, particularly bacterial genomes, the Bayesian inference method was much slower than the PCNV approach, sometimes to the extent that it simply did not work, the main reason being that Bayesian inference method is based on alignment method, such as Muscle (Table 1). Even alignment-free methods such as FFP and AFKS were slow in comparison with PCNV, especially when the analyzed sequences were longer, as shown in Table 1. Furthermore, the MEV was used for comparison. Although both PCNV and MEDV studies aim to solve the problem of genetic sequence comparison, they have significant differences in the features extracted from the sequences. MEV method did not consider the position correlation feature which is an important source of information for genetic sequences. The novelty of our new method is that it designs six suitable features to characterize the correlation among nucleotide position in sequences. The second difference is that our new method does not categorize four types of nucleotides into three groups according to their chemical properties. The third difference is that our new method applies the popular neighbor-joining algorithm to construct phylogenetic trees, while the old study used the UPGMA algorithm which may produce misleading trees. To show the advantages of our new method (PCNV), MEV and PCNV were compared using all datasets studied used in for the present study. The Neighbor-Joining (NJ) algorithm is used in tree construction of both methods. The NJ trees built by the MEV method are shown in Figures S1–S5. As shown in Figure S1, there are six types of HCV dataset. Using the MEV method, the type 6 marked in pink is divided into two parts. In the NJ tree of the HBV dataset shown in Figure S2, the virus KX765843 belonging to clade C (marked in black) is incorrectly assigned to clade F (marked in red). Similarly, as shown in Figure S3, types 1 (marked in blue) and 3 (marked in navy) of dengue are mixed together. In Figure S4, two viruses from type 35 group (marked in gray) are categorized into other groups. Horizontal (or lateral) gene transfer (HGT) is a common phenomenon in bacteria. Due to the problem of HGT, Koski and Golding [25] even stated that genes appearing to be the most similar based on BLAST hits are often not the closest relatives each other phylogenetically. It means when there are HGT, if alignment is used, there may be mistakes. For example, given two distantly related bacteria that have exchanged a gene, a phylogenetic tree including those species will show them to be closely related because that gene is the same, even though most other genes are dissimilar. For bacteria, due to the extensively existing HGT, the phylogenetic tree based on alignment may be misleading. To get correct phylogenetic trees of bacteria, one main method is the 16s rRNA-based method, which constructs trees according to the alignment of 16s ribosomal RNA, i.e., 16s rRNA. The 16s rRNA gene tends to be conserved among bacteria with close phylogenetic distances, and thus is often not affected by HGT, but has enough variable differences. However, the method of 16sRNA loses some information in the whole genome. Our PCNV method uses the whole genome sequence and needs no sequence alignment, thus has the potential to be not affected by the HGT and obtain the correct phylogenetic relationship among bacteria. To show that our method may be not affected by HGT, we used another dataset of eight Yersinia genomes in a previous study [4]. The eight Yersinia genomes are too similar in sequence for classical phylogenetic inference, but share gene segments. The dataset includes two Yersinia pseudotuberculosis and six Yersinia pestis complete genomes. Using PCNV, we get the neighbor-joining tree of the eight bacteria shown in Figure 6. For the figure, we see that the two Yersinia pseudotuberculosis isolates form sisters and are separate from the six Yersinia pestis genomes.

Figure 6

The Neighbor-Joining phylogenetic tree of eight Yersinia genomes based on PCNV method.

Genetic distance is a measure of the genetic divergence between species or between populations within a species, whether the distance measures time from common ancestor or degree of differentiation. Several genetic distances have been proposed based on different evolutionary models. The genetic distance can only be applied to alignment results. A commonly used measure of genetic distance is the fixation index, which varies between 0 and 1. Our PCNA is an alignment-free approach; we measure the distance between species using Euclidean distance and we do not assume any evolutionary model. This distance is positively correlated with their genetic distance, since it can successfully measure the divergence between two species as well. In the neighbor-joining trees constructed by PCNV, sum of the length of the branches traversed from one species to another is equal to their distance and thus positively correlated with their genetic distance. Due to difference in mutation rates for species, for two given datasets, the average internal genetic distance in each dataset may be different enough. In this case, the average distance obtained with PCNV for two datasets may have a big difference, which leads to very different scales in the two derived phylogenetic trees. For the same dataset, different alignment-free method may produce different average distance and thus produce different scales of phylogenetic trees. Our new method has several limitations, for example we cannot work out the time of evolution. These limitations will need to be studied in the future.

4. Materials and Methods

4.1. Dataset

Five datasets were used to test and verify the new technique.

4.1.1. HCV

Hepatitis C is a liver infection caused by HCV. The World Health Organization (WHO) estimates that HCV infects 3% of the world’s population [26]. Because this virus causes few symptoms, diagnosis is difficult in many cases [27]. In the present study, we acquired 82 complete HCV genomes from the Virus Pathogen Database and Analysis Resource (ViPR) [23]. This dataset has a genomic length of 8957–9666 nucleotides. The NCBI accession numbers are shown in Table S1.

4.1.2. HBV

HBV is a hepatotropic virus that can establish a persistent and chronic infection in humans through immune anergy. It exhibits formidable morbidity and mortality in humans and currently infects 3.5% of the global population [28,29]. HBV is genetically diverse and comprises 10 different genotypes, designated A–J [29,30]. Additional subgenotypes exist within Genotypes A–D and F [31]. The HBV genotypes differ in their geographic distributions. Identifying HBV genotypes quickly and accurately is very important for clinical diagnosis. In the present work, 152 complete HBV genomes including eight genotypes (A–H), were downloaded from the Hepatitis B Virus Database (HBVdb) [30]. The NCBI accession numbers are shown in Table S2.

4.1.3. DENV

DENVs are mosquito-borne aviviruses that have plagued humans for centuries [32]. Statistics show that DENVs infect up to 390 million people worldwide each year; 25% of these people suffer from clinical disease. With four antigenically distinct but immunologically cross-reactive serotypes (DENV-1–4), dengue has one of the most complex transmission processes of all infectious diseases in human populations [33]. Therefore, it is important to distinguish which group particular DENVs belong to. In the present study, 330 dengue viruses were used to demonstrate the effectiveness of the method. The NCBI accession numbers are shown in Table S3.

4.1.4. HPV

HPV is a circular double-stranded DNA virus that causes a variety of proliferative epithelial lesions at specific anatomical sites. It is also the most common sexually transmitted virus. There are many different types of HPV, several of which cause health problems such as genital warts and cancer [34]. For example, the low-risk HPV Types 6 and 11 can cause genital warts or benign cell changes, while the high-risk HPV Types 16 and 18 cause about 70% of cervical cancers [35]. Therefore, identification of HPV genotypes in infected patients is particularly important. In the present work, 326 complete genomes of 12 HPV strands (6, 11, 16, 18, 31,33, 35, 45, 52, 53, 58, and 66) were studied. All viral genomes in this HPV dataset are publicly available at GenBank or NCBI databases. The NCBI accession numbers are shown in Table S4.

4.1.5. Bacteria

There is a biomass of bacteria, the main representatives of the prokaryotes, on Earth. Researchers usually investigate evolutionary relationships among bacteria by building phylogenetic trees. Owing to the large genome size (>1 million bp) of bacteria, it is impossible to reconstruct a bacterial phylogenetic tree in a reasonable amount of time using traditional multiple sequence alignment methods with current computational technology. We used 59 bacterial species to test our method (Table 3). The NCBI accession numbers are shown in Table S5.

Table 3

Summary of the datasets HCV, HBV, Dengue, HPV, and Bacteria. The length distribution of each dataset validates that PCNV can work with long sequences.

Dataset	Number	Min	Median	Mean	Max
		(bp)	(bp)	(bp)	(bp)
HCV	82	8957	9442	9427	9666
HBV	152	10161	10669	10606	10780
Dengue	330	10,161	10,669	10,606	10,780
HPV	326	7814	7,905	7895	8051
Bacteria	59	846,214	4,016,947	3,610,938	5,966,919

4.2. Positional Distribution

Let be a DNA sequence of length N. Denote as the number of nucleotides in the sequence, where . Here, is the position of the nucleotide at the jth appearance, , and . Obviously, . For example, given the sequence “ACTGGCAAT”, . We first define the positional distribution of () as follows: The positional distribution of is Example: Given the sequence “ACTGGCAAT”, the positional distribution is as shown in Table 4. Here, we take as an example to show the details of the calculation process. According to the above definition:

Table 4

The positional distribution of “ACTGGCAAT”.

Sequence	A	C	T	G	G	C	A	A	T
Position	1	2	3	4	5	6	7	8	9
UA(i)	16	16	16	16	16	16	71	82	82
UC(i)	0	24	24	24	24	64	64	64	64
UT(i)	0	0	36	36	36	36	36	36	91
UG(i)	0	0	0	41	55	55	55	55	55

When , namely, , so . When , namely, , thus . When , namely, , . Similarly, we can get

4.3. Position Correlation Vector

4.3.1. Average Positional Distribution

The average positional distribution is defined as (): Therefore, For the example sequence “ACTGGCAAT” above, , and likewise for C, G, and T.

4.3.2. Positional Covariance

The positional covariance of nucleotide and ()) are defined as follows: where For the sequence “ACTGGCAAT”, the positional distributions of A and C are and , respectively. . . . The same method is used to compute cov (A;G); cov (A; T); cov (C;G); cov (C; T); and cov (G; T).

4.3.3. Positional variance

The positional variance is described as: Consequently, an 18-dimensional PCNV of a DNA sequence was constructed as follows: Then, we used the Euclidean distance to compute the pairwise distance among the 18-dimensional vectors of the genome sequences. A phylogenetic tree can be built using the NJ algorithm in MEGA 7.0 software [36]. Additionally, the PCNV method can be used to classify organisms. Naturally, sensitivity, specificity, and accuracy were used to evaluate classification performance. The definitions of these measures are as follows: and where TP, TN, FP, and FN are the number of true positive, true negative, false positive, and false negative predictions, respectively. The MATLAB source code in this paper is freely available to the public upon request.

34 in total

1. The closest BLAST hit is often not the nearest neighbor.

Authors: L B Koski; G B Golding
Journal: J Mol Evol Date: 2001-06 Impact factor: 2.395

Review 2. Evolution of dengue in Sri Lanka-changes in the virus, vector, and climate.

Authors: P D N N Sirisena; F Noordeen
Journal: Int J Infect Dis Date: 2013-12-11 Impact factor: 3.623

Review 3. Hepatitis B virus infection.

Authors: Man-Fung Yuen; Ding-Shinn Chen; Geoffrey M Dusheiko; Harry L A Janssen; Daryl T Y Lau; Stephen A Locarnini; Marion G Peters; Ching-Lung Lai
Journal: Nat Rev Dis Primers Date: 2018-06-07 Impact factor: 52.329