René L Warren, Gina Choe, Douglas J Freeman, Mauro Castellarin, Sarah Munro, Richard Moore, Robert A Holt.
Abstract
The human leukocyte antigen (HLA) is key to many aspects of human physiology and medicine. All current sequence-based HLA typing methodologies are targeted approaches requiring the amplification of specific HLA gene segments. Whole genome, exome and transcriptome shotgun sequencing can generate prodigious data but, due to the complexity of HLA loci, these data have not been immediately informative regarding HLA genotype. We describe HLAminer, a computational method for identifying HLA alleles directly from shotgun sequence datasets (http://www.bcgsc.ca/platform/bioinfo/software/hlaminer). This approach circumvents the additional time and cost of generating HLA-specific data and capitalizes on the increasing accessibility and affordability of massively parallel sequencing.
Year: 2012 PMID: 23228053 PMCID: PMC3580435 DOI: 10.1186/gm396
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Figure 1. Computational predictions of HLA-I from shotgun data by targeted assembly (left) or read alignment (right). For targeted assembly, NGS reads whose first fifteen 5' bases match one of the HLA CDS (RNA-Seq) or genomic (WGS/exon capture) sequences are recruited and assembled de novo with TASR. The resulting sequence contigs are aligned against a database of all predicted HLA CDS (RNA-Seq) or genomic (WGS/exon capture) sequences, tracking the best HLA hit(s). Reciprocal best alignments are considered in the same manner. Putative allele assignments from shotgun datasets (HLAminer) are informed by contig length, depth of coverage and similarity to reference sequences, when applicable. The probability of each prediction being correct is estimated by determining the probability of that prediction being observed by chance.
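The read-recruitment step in the caption can be illustrated with a minimal sketch. It assumes that "matching" means the first 15 bases of a read occur verbatim somewhere in a reference sequence, and it checks the forward strand only; the function names `build_seed_index` and `recruit_reads` are hypothetical and are not part of TASR or HLAminer.

```python
def build_seed_index(references, k=15):
    """Collect every k-mer present in the reference sequences (forward strand only)."""
    seeds = set()
    for seq in references:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            seeds.add(seq[i:i + k])
    return seeds


def recruit_reads(reads, seeds, k=15):
    """Keep reads whose first k bases (5' end) exactly match an indexed k-mer."""
    return [r for r in reads if len(r) >= k and r[:k].upper() in seeds]


# Toy data, for illustration only (not real HLA sequence).
references = ["ACGTACGTACGTACGTACGT"]
reads = [
    "ACGTACGTACGTACGTTTTT",  # first 15 bases occur in the reference
    "TTTTTTTTTTTTTTTTTTTT",  # no seed match
    "ACGT",                  # shorter than the seed length
]
recruited = recruit_reads(reads, build_seed_index(references))
```

Only the recruited reads would then be passed to the assembler, which keeps the de novo assembly restricted to reads likely to originate from HLA loci.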
Output from HLAminer HLA class I predictions from a CRC patient 100-nucleotide RNA-Seq sample
| Allele^a | Score^b | Expect value (Eval) | Confidence (-10 × log10(Eval)) |
|---|---|---|---|
| Prediction^d 1 - A*02 | | | |
| A*02:01P | 64038.03 | 1.63E-06 | 57.9 |
| Prediction 2 - A*11 | | | |
| A*11:01P | 5463.99 | 5.30E-09 | 82.8 |
| Prediction 1 - B*27 | | | |
| B*27:05P | 64579.61 | 2.67E-18 | 175.7 |
| Prediction 2 - B*07 | | | |
| B*07:02P | 56662.08 | 6.63E-12 | 111.8 |
| Prediction 1 - C*07 | | | |
| C*07:02P | 49419.33 | 5.23E-08 | 72.8 |
| Prediction 2 - C*02 | | | |
| C*02:02P^e | 20466.00 | 6.64E-16 | 151.8 |
| C*02:21^e | 20466.00 | 6.64E-16 | 151.8 |

^a HLA protein coding alleles validated by PCR are shown in bold face.
^b The protein coding allele predictions are arranged by decreasing score, from most to least likely.
^c Most likely HLA class I allele groups and protein coding alleles (Confidence (-10 × log10(Eval)) ≥ 20, Score ≥ 1,000) for each gene.
^d The prediction rank factors in the maximum score for each predicted allele.
^e Ambiguity arises when two or more HLA allele groups or protein coding alleles have the same score (for example, C*02:02P and C*02:21).
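The confidence column and the reporting thresholds in the footnotes can be reproduced with a short sketch. The scores and expect values are copied from the table; the function names and the tuple layout are illustrative assumptions and do not reflect HLAminer's actual output format or code.

```python
import math
from collections import defaultdict


def confidence(e_value):
    """Confidence as defined in the table header: -10 * log10(Eval)."""
    return -10.0 * math.log10(e_value)


def passes_thresholds(score, e_value, min_conf=20.0, min_score=1000.0):
    """Thresholds quoted in the footnote: Confidence >= 20 and Score >= 1,000."""
    return confidence(e_value) >= min_conf and score >= min_score


def ambiguous_calls(rows):
    """Group alleles that share the same score within one allele group."""
    by_key = defaultdict(list)
    for group, allele, score, _e_value in rows:
        by_key[(group, score)].append(allele)
    return {key: alleles for key, alleles in by_key.items() if len(alleles) > 1}


# (allele group, protein coding allele, score, expect value), from the table.
rows = [
    ("A*02", "A*02:01P", 64038.03, 1.63e-06),
    ("A*11", "A*11:01P", 5463.99, 5.30e-09),
    ("B*27", "B*27:05P", 64579.61, 2.67e-18),
    ("B*07", "B*07:02P", 56662.08, 6.63e-12),
    ("C*07", "C*07:02P", 49419.33, 5.23e-08),
    ("C*02", "C*02:02P", 20466.00, 6.64e-16),
    ("C*02", "C*02:21", 20466.00, 6.64e-16),
]

for group, allele, score, e_value in rows:
    print(allele, round(confidence(e_value), 1), passes_thresholds(score, e_value))

print(ambiguous_calls(rows))
```

Every row in the table clears both thresholds, and the only tie is the C*02 pair described in the ambiguity footnote.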
Figure 2. HLAminer performance. HLA allele group and protein coding allele predictions derived from targeted read assembly (black symbols) or direct read alignment (grey symbols) of simulated 100-nucleotide RNA-Seq, WGS and exon capture (ExCap) datasets were compared to original, spiked-in, HLA sequences and performance metrics evaluated (ambiguity, sensitivity and specificity represented by circle, triangle and square symbols, respectively). HLAminer predictions were also obtained from targeted assembly of colorectal cancer (CRC; blue symbols), lymphoma (DLBCL; red, orange and yellow symbols), 1000 Genomes (1KG; green symbols) and ovarian cancer (OV; violet and magenta symbols) patient tumor (T) and/or matched normal (N) shotgun datasets and compared to PCR-based HLA types to calculate performance metrics.
Effect of read length and base error on HLAminer predictions from targeted assembly of simulated RNA-Seq data^a
| HLA allele resolution | Base error (%) | Read length (nucleotides) | Sensitivity (mean ± SD, %) | Specificity (mean ± SD, %) | Ambiguity (mean ± SD, %) |
|---|---|---|---|---|---|
| Two-digit | 1.0 | 50 | 13.62 ± 2.80 | 92.86 ± 10.10 | 19.06 ± 16.53 |
| Two-digit | 1.0 | 75 | 62.32 ± 3.62 | 90.27 ± 3.28 | 8.93 ± 2.67 |
| Two-digit | 1.0 | 100 | 95.72 ± 0.53 | 96.31 ± 0.02 | 4.46 ± 3.84 |
| Two-digit | 1.0 | 150 | 97.97 ± 2.80 | 95.73 ± 3.63 | 0.00 ± 0.00 |
| Two-digit | 0.5 | 100 | 93.04 ± 4.60 | 96.39 ± 1.44 | 2.91 ± 1.69 |
| Two-digit | 1.0 | 100 | 95.72 ± 0.53 | 96.31 ± 0.02 | 4.46 ± 3.84 |
| Two-digit | 2.0 | 100 | 64.64 ± 4.79 | 96.13 ± 1.43 | 13.40 ± 3.02 |
| Two-digit | 3.0 | 100 | 6.67 ± 2.51 | 100.00 ± 0.00 | 8.59 ± 8.34 |
| Four-digit | 1.0 | 50 | 7.78 ± 1.92 | 60.51 ± 20.02 | 27.78 ± 4.81 |
| Four-digit | 1.0 | 75 | 51.94 ± 2.93 | 77.36 ± 5.44 | 37.38 ± 11.16 |
| Four-digit | 1.0 | 100 | 86.84 ± 1.75 | 88.13 ± 1.41 | 36.32 ± 4.76 |
| Four-digit | 1.0 | 150 | 93.33 ± 3.63 | 93.07 ± 2.91 | 22.87 ± 2.40 |
| Four-digit | 0.5 | 100 | 84.72 ± 5.42 | 89.65 ± 2.25 | 24.03 ± 3.00 |
| Four-digit | 1.0 | 100 | 86.84 ± 1.75 | 88.13 ± 1.41 | 36.32 ± 4.76 |
| Four-digit | 2.0 | 100 | 56.94 ± 1.73 | 87.49 ± 4.77 | 39.14 ± 5.73 |
| Four-digit | 3.0 | 100 | 4.44 ± 2.10 | 68.69 ± 17.03 | 37.22 ± 25.62 |

^a In triplicate experiments, 5 million read pairs of 50, 75, 100 or 150 nucleotides in length (top) and 100-nucleotide read pairs having 0.5, 1, 2 or 3% errors (bottom) were randomly generated from 20 sets of transcripts, each containing 6 randomly chosen reference HLA alleles. HLAminer predictions derived from targeted read assembly were compared to each reference set and the performance of HLAminer was assessed by measuring the specificity, sensitivity and ambiguity. SD, standard deviation.
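To make the simulation parameters concrete, the sketch below draws a single read of a given length from a transcript and applies a per-base error rate. The source does not describe the simulator that was used, so everything here is an assumption for illustration: errors are modeled as substitutions only, reads are unpaired, and the names `add_errors` and `simulate_read` are hypothetical.

```python
import random

BASES = "ACGT"


def add_errors(read, error_rate, rng):
    """Substitute each base with a different base with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_read(transcript, read_length, error_rate, rng):
    """Sample one read from a random start position, then add substitution errors."""
    start = rng.randrange(len(transcript) - read_length + 1)
    return add_errors(transcript[start:start + read_length], error_rate, rng)


rng = random.Random(0)  # fixed seed so the toy example is repeatable
transcript = "".join(rng.choice(BASES) for _ in range(500))  # random stand-in, not an HLA allele
read = simulate_read(transcript, read_length=100, error_rate=0.01, rng=rng)
```

At a 1% rate a 100-nucleotide read carries about one substitution on average, and at 3% about three, which is consistent with the sharp loss of sensitivity reported in the table for higher error rates.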