Inna Rytsareva, David S Campo, Yueli Zheng, Seth Sims, Sharma V Thankachan, Cansu Tetik, Jain Chirag, Sriram P Chockalingam, Amanda Sue, Srinivas Aluru, Yury Khudyakov.
Abstract
BACKGROUND: Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis is frequently used to study HCV outbreaks and transmission chains: a cluster of sequences is identified as linked by transmission if their genetic distances are below a previously defined threshold. However, HCV exists as a population of numerous variants in each infected individual, and minority variants in the source have often been observed to be the ones responsible for transmission. This precludes the use of a single sequence per individual, because many such transmissions would be missed. The use of Next-Generation Sequencing greatly increases the sensitivity of transmission detection but brings a considerable computational challenge, because all sequences need to be compared among all pairs of samples.
Year: 2017 PMID: 28589864 PMCID: PMC5461558 DOI: 10.1186/s12864-017-3732-4
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 Transmission detection overview. In this example, there are 3 samples: Pi contains 3 different sequences, Pj contains 4 and Pk contains 3. In addition, Pi and Pj are related, whereas Pk is unrelated to the other two. A total of 33 pairwise sequence comparisons must be performed to find the minimal distance between each pair of samples. The rationale of our approach is to quickly remove the sample-pair comparisons with zero probability of having a minimal distance lower than T.
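The brute-force comparison described in the Fig. 1 caption can be sketched as follows. This is only an illustration: the sequences are made up, the function names are hypothetical, and Hamming distance on equal-length aligned sequences is assumed as a stand-in for the distance measure used in the study.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def all_pairs_min_distance(samples: dict[str, list[str]]):
    """Return the minimal distance per sample pair and the comparison count."""
    min_dist = {}
    comparisons = 0
    for (name_a, seqs_a), (name_b, seqs_b) in combinations(samples.items(), 2):
        # Every sequence of one sample is compared with every sequence of the other.
        min_dist[(name_a, name_b)] = min(
            hamming(s, t) for s in seqs_a for t in seqs_b
        )
        comparisons += len(seqs_a) * len(seqs_b)
    return min_dist, comparisons


# Made-up toy data with the sample sizes from the caption: 3, 4 and 3 sequences.
samples = {
    "Pi": ["ACGTACGT", "ACGTACGA", "ACGTACCA"],
    "Pj": ["ACGTACGG", "ACGTTCGA", "ACCTACGA", "ACGTACTT"],
    "Pk": ["TTGTCAGA", "TTGACAGA", "TAGTCAGC"],
}

min_dist, comparisons = all_pairs_min_distance(samples)
print(comparisons)  # 3*4 + 3*3 + 4*3 sequence comparisons
print(min_dist)
```

The comparison count grows with the product of the sample sizes for every sample pair, which is why discarding whole sample pairs before this step pays off.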
Fig. 2 Transmission network density. This is an example of a real HCV transmission network obtained during an outbreak study. A link is drawn if the minimal edit distance between the two samples is smaller than T, whereas the size of the node is proportional to its genetic heterogeneity. In this particular example, only 0.8% of all sample-pairs are linked by transmission.
Fig. 3 Overview of the filtering strategy
Fig. 4 Hamming radius filter. If Sd is higher than LT, these two samples cannot have any sequence-pair with a distance lower than T
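The caption does not define Sd or LT, so the sketch below does not claim to reproduce the authors' filter. It shows one common way a radius-based filter can work, under these assumptions: each sample is summarized by a center sequence and a radius (the largest Hamming distance from the center to any member), the threshold is expressed as a number of differing positions, and the triangle inequality gives a lower bound on every cross-sample distance. All names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def center_and_radius(sample: list[str]) -> tuple[str, int]:
    """Use the first sequence as center (an assumption; any member is valid)."""
    center = sample[0]
    radius = max(hamming(center, s) for s in sample)
    return center, radius


def can_skip(sample_a: list[str], sample_b: list[str], threshold: int) -> bool:
    """True if no cross-sample sequence pair can have distance < threshold."""
    center_a, radius_a = center_and_radius(sample_a)
    center_b, radius_b = center_and_radius(sample_b)
    # Triangle inequality: d(s, t) >= d(center_a, center_b) - radius_a - radius_b
    lower_bound = hamming(center_a, center_b) - radius_a - radius_b
    return lower_bound >= threshold


# Made-up toy samples: `far` is distant from `near_a`, `near_b` is close to it.
near_a = ["AAAAAAAAAA", "AAAAAAAAAT"]
near_b = ["AAAAAAAATT", "AAAAAAAAAT"]
far = ["TTTTTTTTTT", "TTTTTTTTTA"]

print(can_skip(near_a, far, threshold=3))     # sample pair can be discarded
print(can_skip(near_a, near_b, threshold=3))  # pair needs the full comparison
```

The filter costs one center-to-center comparison per sample pair, and it never discards a pair that has a sequence-pair below the threshold, because the bound is a true lower bound.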
Filtering results on the unrelated dataset
| Filter | Individually | Serial workflow |
|---|---|---|
| k-mer bloom filter | 52536 (65.5%) | 52536 (65.5%) |
| Hamming radius filter | 67940 (84.7%) | 68242 (85.1%) |
| Identical sequences filter | 0 (0.0%) | 68242 (85.1%) |
Number of candidate pairs removed by each filtering approach
Fig. 5 Percentage of removed sample-pairs by the k-mer bloom filter
Filtering results on the related dataset
| Filter | Individually | Serial workflow |
|---|---|---|
| k-mer bloom filter | 0 (0.0%) | 0 (0.0%) |
| Hamming radius filter | 0 (0.0%) | 0 (0.0%) |
| Identical sequences filter | 79 (51.6%) | 79 (51.6%) |
Number of candidate pairs removed by each filtering approach