Kiatichai Faksri, Eryu Xia, Jun Hao Tan, Yik-Ying Teo, Rick Twee-Hee Ong.
Abstract
BACKGROUND: Whole-genome sequencing is increasingly used in clinical diagnosis of tuberculosis and study of Mycobacterium tuberculosis complex (MTC). MTC consists of several genetically homogenous mycobacteria species which can cause tuberculosis in humans and animals. Regions of difference (RDs) are commonly regarded as gold standard genetic markers for MTC classification.
Keywords: Mycobacterium tuberculosis complex; Region of difference analysis; Whole-genome sequence analysis
Year: 2016 PMID: 27806686 PMCID: PMC5093977 DOI: 10.1186/s12864-016-3213-1
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 The workflow and output of RD-Analyzer. (a) A schematic representation of the processes in RD-Analyzer. RD-Analyzer accepts sequence reads in FASTQ format. The input reads are mapped to reference RD sequences, after which read depths along the reference sequences are calculated. Normally, an RD is identified as ‘present’ in the MTC isolate if the median ratio of read depth along the reference sequence is above a specified threshold (default of 0.09 for all RDs, except 2.97 for RD12can). The default of using the median ratio (the ratio at the 50th percentile) can be changed to another percentile according to the user’s preference. Small RDs are detected from the CIGAR strings of mapped reads spanning the potential deletion region. The default RD-Analyzer uses the LUSs of 31 RDs and performs lineage prediction using the rules elaborated in Additional file 2: Table S2. The extended version of RD-Analyzer allows for user-defined reference RD sequences, without strain prediction. (b) An example output file of the default RD-Analyzer. (c) An example output file of the extended RD-Analyzer.
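The read-depth rule described for panel a can be sketched in a few lines of Python. This is a minimal illustration, not RD-Analyzer's own code: the function names `percentile` and `call_rd`, their arguments, and the example depths are hypothetical. Only the thresholds (0.09 by default, 2.97 for RD12can) and the use of a user-selectable percentile of the depth ratio come from the caption.

```python
# Thresholds quoted in the caption; all names below are illustrative assumptions.
DEFAULT_THRESHOLD = 0.09
RD_THRESHOLDS = {"RD12can": 2.97}


def percentile(values, pct):
    """Return the pct-th percentile (0 <= pct <= 100) by linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no depth values given")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def call_rd(rd_name, rd_depths, genome_depth, pct=50):
    """Return (ratio, present) for one RD.

    rd_depths    -- per-position read depths along the reference RD sequence
    genome_depth -- estimated genome-wide read depth for the same isolate
    pct          -- percentile of the depth to use (50 = median, the default)
    """
    if genome_depth <= 0:
        raise ValueError("genome_depth must be positive")
    ratio = percentile(rd_depths, pct) / genome_depth
    threshold = RD_THRESHOLDS.get(rd_name, DEFAULT_THRESHOLD)
    return ratio, ratio > threshold


if __name__ == "__main__":
    # Made-up depths: a deleted region, an intact region, and RD12can.
    print(call_rd("RD9", [0, 0, 1, 0, 2], genome_depth=50))
    print(call_rd("RD9", [48, 52, 50, 47, 55], genome_depth=50))
    print(call_rd("RD12can", [100, 100, 100], genome_depth=50))
```

Dividing every position's depth by the same genome depth does not change which position sits at a given percentile, so taking the percentile first and dividing afterwards gives the same ratio as taking the percentile of per-position ratios.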
Fig. 2 Threshold selection in RD-Analyzer. (a) Ratios of read depth for present and absent RDs. The ratio is the median read depth along the RD sequence divided by the estimated genome read depth. The dotted lines indicate the optimal read-depth threshold (0.09 for all RDs, except 2.97 for RD12can). The numbers above the boxes indicate the number of instances included in each box. (b) ROC curve for threshold selection for RDs except RD12can. The ROC curve shows very high TPR and very low FPR at nearly all thresholds, with an area under the curve (AUC) of 0.9907. The dotted diagonal line is the line of no discrimination. The default threshold was selected to be 0.09, which produced a TPR of 0.9949 and a specificity (1 − FPR) of 0.9856. (c) ROC curve for threshold selection for RD12can. The ROC curve has an AUC of 1. The default threshold was selected to be 2.97, producing a TPR of 1.0000 and a specificity (1 − FPR) of 1.0000.
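To make the threshold selection concrete, the sketch below computes the TPR and FPR that a single candidate threshold yields on labelled read-depth ratios. The function name `rates_at_threshold` and the toy data are assumptions for illustration and do not reproduce the dataset behind the figure. Repeating the calculation over many thresholds gives the points of a ROC curve.

```python
def rates_at_threshold(ratios, truly_present, threshold):
    """Return (TPR, FPR) when an RD is called present if ratio > threshold.

    ratios        -- read-depth ratios, one per (isolate, RD) instance
    truly_present -- matching booleans giving the known RD status
    """
    tp = fp = tn = fn = 0
    for ratio, present in zip(ratios, truly_present):
        called_present = ratio > threshold
        if called_present and present:
            tp += 1
        elif called_present and not present:
            fp += 1
        elif not called_present and present:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("need at least one present and one absent instance")
    return tp / (tp + fn), fp / (fp + tn)


if __name__ == "__main__":
    # Toy data only: three present and three absent instances.
    ratios = [0.0, 0.02, 0.5, 1.1, 0.95, 0.12]
    labels = [False, False, True, True, True, False]
    for threshold in (0.01, 0.09, 0.3):
        tpr, fpr = rates_at_threshold(ratios, labels, threshold)
        print(f"threshold={threshold}: TPR={tpr:.3f}, FPR={fpr:.3f}")
```

Specificity is 1 − FPR, so a threshold that separates the two groups well has a TPR near 1 and an FPR near 0.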
Performance of RD-Analyzer for predicting RD and differentiating Mtb lineages and MTC species
Note A absence, P presence, com complete (no deletion), 7D 7 bp deletion, 6D 6 bp deletion, and NA not available. Concordance (Validation, %): concordance in predicting the respective RDs in the validation dataset, in %. Concordance (Lineage): concordance of lineage/sublineage prediction for each lineage. Bold letters with grey shading refer to key markers for species or lineage identification. Italic characters denote where unexpected absence or presence was discovered, with the superscript denoting the number of strains with unexpected predictions
Fig. 3 Detection of potential RDs for sublineage classification in Lineage 4 Mtb isolates. In the detection of potential RDs for a given sublineage, isolates belonging to this sublineage constitute the experimental group while all other isolates constitute the control group. For each sublineage, the p-value reflecting the difference in read depth between the experimental group and the control group was calculated at each position and converted to −log10(p-value), which is plotted on the y-axis. The x-axis is grouped by the studied sublineage, and its values indicate genomic positions along the reference genome. Extremely low p-values indicate a significant difference in read depth between the two groups. Regions of consecutive positions with −log10(p-value) larger than 60 were regarded as candidate RD markers. Sublineages with well-defined RD markers are shaded gray in the background.
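The last step of this procedure, grouping consecutive positions whose −log10(p-value) exceeds 60 into candidate regions, can be sketched as follows. The function name `candidate_regions`, the `min_length` filter, and the toy values are assumptions. The caption does not name the statistical test or a minimum run length, so the per-position values are taken as given input here.

```python
def candidate_regions(neg_log10_p, cutoff=60, min_length=1, start_position=1):
    """Return inclusive (start, end) runs where neg_log10_p > cutoff.

    neg_log10_p    -- -log10(p-value) for consecutive genomic positions
    cutoff         -- 60 in the caption
    min_length     -- hypothetical filter on run length, not from the source
    start_position -- genomic coordinate of the first value (1-based)
    """
    regions = []
    run_start = None
    for offset, value in enumerate(neg_log10_p):
        position = start_position + offset
        if value > cutoff:
            if run_start is None:
                run_start = position  # a new run begins here
        elif run_start is not None:
            # The run ended at the previous position.
            if position - run_start >= min_length:
                regions.append((run_start, position - 1))
            run_start = None
    if run_start is not None:
        # Close a run that extends to the last position.
        end = start_position + len(neg_log10_p) - 1
        if end - run_start + 1 >= min_length:
            regions.append((run_start, end))
    return regions


if __name__ == "__main__":
    values = [10, 70, 80, 65, 5, 61, 3]  # toy values only
    print(candidate_regions(values, start_position=100))
    print(candidate_regions(values, min_length=2, start_position=100))
```

A minimum run length is one simple way to discard isolated single-position signals; whether the original analysis applied such a filter is not stated.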
Identification of potential RDs for lineage classification in Mtb lineages with existing RD markers
| Lineage | Sample size | Existing RD marker name | Existing RD marker start | Existing RD marker end | RD detected no. | RD detected start | RD detected end | RD detected length (bp) |
|---|---|---|---|---|---|---|---|---|
| Lineage 4.8 | 15 | | | | 1 | | | |
| Lineage 4.1.2.1 | 12 | | | | 1 | | | |
| | | | | | 2 | 2,361,910 | 2,363,682 | 1,773 |
| | | | | | 3 | 3,194,709 | 3,194,793 | 85 |
| | | | | | 4 | 4,375,626 | 4,375,708 | 83 |
| Lineage 4.3.3 | 9 | | | | 1 | | | |
| | | | | | 2 | 171,458 | 171,778 | 321 |
| | | | | | 3 | 2,306,444 | 2,306,724 | 281 |
| Lineage 4.1.1.3 | 9 | | | | 1 | | | |
| | | | | | 2 | 2,339,260 | 2,339,402 | 143 |
| Lineage 4.1.1.1 | 7 | | | | 1 | | | |
| | | | | | 2 | 4,370,424 | 4,373,233 | 2,810 |
| | | | | | 3 | 2,866,744 | 2,866,852 | 109 |
| Lineage 4.5 | 6 | | | | 1 | | | |
Note Bold letters emphasize the concordance between existing RD markers and the RDs detected. The start and end positions correspond to genomic positions in the Mtb H37Rv genome
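The coordinates in this table appear to be inclusive at both ends, so that length = end − start + 1. The short check below confirms this for the rows whose values are listed. The function name `region_length` is hypothetical, and the tuples are copied from the table.

```python
def region_length(start, end):
    """Length in bp of a region with inclusive start and end coordinates."""
    return end - start + 1


# (start, end, reported length in bp), copied from the table rows above.
ROWS = [
    (2_361_910, 2_363_682, 1_773),
    (3_194_709, 3_194_793, 85),
    (4_375_626, 4_375_708, 83),
    (171_458, 171_778, 321),
    (2_306_444, 2_306_724, 281),
    (2_339_260, 2_339_402, 143),
    (4_370_424, 4_373_233, 2_810),
    (2_866_744, 2_866_852, 109),
]

if __name__ == "__main__":
    for start, end, reported in ROWS:
        print(start, end, region_length(start, end) == reported)
```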
Identification of potential RDs for lineage classification in Mtb lineages without existing RD markers
| Lineage | Sample size | RD detected no. | Start | End | Length (bp) | Sensitivity | Specificity |
|---|---|---|---|---|---|---|---|
| Lineage 4.3.4.2.1 | 15 | 1 | | | | | |
| Lineage 4.6.1.2 | 11 | 1 | | | | | |
| | | 2 | | | | | |
| | | 3 | 1,190,141 | 1,190,733 | 593 | | |
| | | 4 | 2,634,174 | 2,634,542 | 369 | | |
| | | 5 | 3,594,343 | 3,594,407 | 65 | | |
| Lineage 4.6.2.2 | 4 | 1 | | | | | |
| | | 2 | 3,785,220 | 3,785,638 | 419 | | |
| | | 3 | 3,905,337 | 3,905,721 | 385 | | |
| | | 4 | 3,742,614 | 3,742,895 | 282 | | |
| Lineage 4.4.1.1 | 4 | 1 | | | | | |
| | | 2 | | | | | |
| | | 3 | 3,112,311 | 3,112,459 | 149 | | |
| | | 4 | 1,313,135 | 1,313,278 | 144 | | |
| | | 5 | 3,377,623 | 3,377,670 | 48 | | |
Note Bold letters refer to the potential RD markers whose sensitivity and specificity for classification have been assessed. The start and end positions correspond to genomic positions in the Mtb H37Rv genome