Irina Chelysheva, Andrew J Pollard, Daniel O'Connor.
Abstract
RNA-sequencing (RNA-seq) is a widely used approach for accessing the transcriptome in biomedical research. Studies frequently include multiple samples taken from the same individual at various time points or under different conditions, so correct assignment of those samples to each participant is of great importance. Here, we propose typing the highly polymorphic genes of the human leukocyte antigen (HLA) complex to verify the correct allocation of RNA-seq samples to individuals. We introduce RNA2HLA, a novel quality control (QC) tool that performs study-wide HLA typing of RNA-seq data and thereby identifies samples from a common source. RNA2HLA allows precise allocation and grouping of RNA samples based on their HLA types. Strikingly, RNA2HLA revealed wrongly assigned samples in publicly available datasets, demonstrating the importance of this tool for the quality control of RNA-seq studies. In addition, our tool successfully extracts HLA alleles at four-digit resolution and can be used to perform large-scale HLA typing in RNA-seq-based studies, which will serve multiple research purposes beyond sample QC.
Keywords: HLA; RNA-sequencing; bioinformatics; quality control; system biology; transcriptomics
Year: 2021 PMID: 33758920 PMCID: PMC8425422 DOI: 10.1093/bib/bbab055
Source DB: PubMed Journal: Brief Bioinform ISSN: 1467-5463 Impact factor: 11.622
Overview of HLA-typing programs
| Program | Input: fastq | Input: compr. fastq | DNA | RNA (single-end) | RNA (paired-end) | HLA resolution | Languages | OS |
|---|---|---|---|---|---|---|---|---|
| HISAT2 | + | + | + | −(?) | + | 8 digits | C++, Python, Java | Unix/Linux |
| HLAscan | + | − | + | − | + | 4 digits | Python | Unix/Linux |
| seq2HLA | + | + | − | −(?) | + | 4 digits | Python, R | Unix/Linux, Mac OS, Windows (?) |
| HLAforest | + | − | − | − | + | 8 digits | Perl | Unix/Linux |
Abbreviations: DNA: DNA-seq; RNA: RNA-seq; compr. fastq: compressed fastq; (?): potential to be implemented.
Figure 1
RNA2HLA workflow.
Figure 3
Heat maps of HLA identity: before (A) and after (B) the correction of sample labeling.
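The pairwise HLA identity behind heat maps of this kind can be illustrated with a minimal Python sketch. It is not the RNA2HLA implementation: the identity definition used here (the percentage of allele calls shared across the loci typed in both samples), the function names, the sample names, the allele calls, and the 80% threshold are all assumptions made for illustration.

```python
from itertools import combinations


def hla_identity(type_a, type_b):
    """Percent of allele calls shared by two samples.

    Illustrative definition (an assumption, not taken from the source):
    for every locus typed in both samples, count the alleles that match,
    then divide by the number of allele slots compared.
    """
    shared = 0
    total = 0
    for locus in sorted(set(type_a) & set(type_b)):
        alleles_a = list(type_a[locus])
        alleles_b = list(type_b[locus])
        total += max(len(alleles_a), len(alleles_b))
        for allele in alleles_a:
            if allele in alleles_b:
                alleles_b.remove(allele)  # each call can match only once
                shared += 1
    return 100.0 * shared / total if total else 0.0


def pairwise_identity(samples):
    """Identity (%) for every unordered pair of samples."""
    return {
        (s1, s2): hla_identity(samples[s1], samples[s2])
        for s1, s2 in combinations(sorted(samples), 2)
    }


def flag_mismatches(samples, labels, threshold):
    """Pairs whose HLA identity disagrees with their participant labels."""
    flagged = []
    for (s1, s2), identity in pairwise_identity(samples).items():
        labelled_same = labels[s1] == labels[s2]
        typed_same = identity >= threshold
        if labelled_same != typed_same:
            flagged.append((s1, s2, identity))
    return flagged


# Hypothetical four-digit calls for three samples all labelled "P1".
samples = {
    "S1": {"A": ("01:01", "02:01"), "B": ("07:02", "08:01")},
    "S2": {"A": ("01:01", "02:01"), "B": ("07:02", "08:01")},
    "S3": {"A": ("03:01", "24:02"), "B": ("07:02", "44:02")},
}
labels = {"S1": "P1", "S2": "P1", "S3": "P1"}

for s1, s2, identity in flag_mismatches(samples, labels, threshold=80.0):
    print(f"{s1} vs {s2}: {identity:.1f}% identity, labels disagree with HLA type")
```

In this toy input, S3 shares only one of four allele calls with S1 and S2, so both pairs involving S3 are flagged as likely labeling errors.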
RNA2HLA performance on available RNA-sequencing datasets. The identity threshold was chosen as the value that maximizes the F1 score.
| Dataset | SRP144583 | SRP090552 | SRP103772 | SRP081020 |
|---|---|---|---|---|
| Length (bp) | 75 | 100 | 51 | 101 |
| Type | Paired | Paired | Single | Single |
| # of samples | 195 | 42 | 53 | 55 |
| | 0.05 | 0.05 | 0.5 | 0.5 |
| Identity threshold (%) | 81.8 | 87 | 77 | 70 |
| Precision | 1 | 1 | 0.95 | 0.99 |
| Recall | 1 | 1 | 1 | 0.85 |
| F1 score | 1 | 1 | 0.97 | 0.91 |
Abbreviations: Type: paired – paired-end; single – single-end.
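The relationship between the precision, recall, and F1 rows, and the idea of picking the identity threshold that maximizes F1, can be sketched in Python as follows. This is a hedged illustration rather than the authors' procedure: the function names, the candidate thresholds, and the toy pair data are hypothetical, and a pair is treated as "same individual" when its identity is at or above the threshold.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate_threshold(pairs, threshold):
    """Precision, recall and F1 when pairs with identity >= threshold
    are called 'same individual'.

    pairs: iterable of (identity_percent, truly_same_individual).
    """
    tp = sum(1 for identity, same in pairs if identity >= threshold and same)
    fp = sum(1 for identity, same in pairs if identity >= threshold and not same)
    fn = sum(1 for identity, same in pairs if identity < threshold and same)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def best_threshold(pairs, candidates):
    """Candidate threshold with the highest F1 (first one wins on ties)."""
    best = None
    for threshold in candidates:
        _, _, f1 = evaluate_threshold(pairs, threshold)
        if best is None or f1 > best[1]:
            best = (threshold, f1)
    return best


# The reported F1 values follow from the reported precision and recall.
print(round(f1_score(0.95, 1.0), 2))
print(round(f1_score(0.99, 0.85), 2))

# Hypothetical sample pairs: (HLA identity in %, truly the same individual).
toy_pairs = [(100, True), (90, True), (85, False), (60, False), (75, True)]
print(best_threshold(toy_pairs, candidates=[70, 80, 90]))
```

Applied to the table, the two `f1_score` calls reproduce the F1 values listed for SRP103772 and SRP081020 after rounding to two decimals.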
Figure 2
Fraction of correctly assigned samples based on HLA identity threshold for each of the tested studies.