Davis J McCarthy, Peter Humburg, Alexander Kanapin, Manuel A Rivas, Kyle Gaulton, Jean-Baptiste Cazier, Peter Donnelly.
Abstract
BACKGROUND: Variant annotation is a crucial step in the analysis of genome sequencing data. Functional annotation results can have a strong influence on the ultimate conclusions of disease studies. Incorrect or incomplete annotations can cause researchers both to overlook potentially disease-relevant DNA variants and to dilute interesting variants in a pool of false positives. Researchers are aware of these issues in general, but the extent of the dependency of final results on the choice of transcripts and software used for annotation has not been quantified in detail.
Year: 2014 PMID: 24944579 PMCID: PMC4062061 DOI: 10.1186/gm543
Source DB: PubMed Journal: Genome Med ISSN: 1756-994X Impact factor: 11.117
Figure 1. Annotation examples. These screenshots from the ENSEMBL web browser [40] show two examples of variant annotation. (A) The variant NC_000011.9:g.57983194A>G (rs7103033) is relatively straightforward to annotate. It is the final base of the final exon in both transcripts at this position (a CCDS transcript (green) and a ‘merged’ ENSEMBL/Havana (GENCODE) transcript (gold)). The final codon has changed from TGA (stop codon) to TGG (tryptophan), so this is unambiguously a stop-loss variant. Using the ENSEMBL transcript set, both ANNOVAR and VEP correctly annotate this variant as stop-loss. (B) The variant NC_000006.11:g.30558477_30558478insA (rs72545970) is more difficult to annotate. It is the penultimate base of the exon for all but one of the transcripts shown. It is a single-base insertion, so could be annotated as a frameshift variant. Then again, it is an insertion in a stop codon, so could be a stop-loss variant. In fact, the final codon, TGA (stop codon), remains TGA with this variant (insertion of a single base A), so it is actually a synonymous variant. ANNOVAR annotates it as frameshift insertion and VEP as stop-loss, when using ENSEMBL transcripts. Each browser image consists of several tracks, which provide base-resolution information about the DNA sequence. Two tracks, ‘Sequence (+)’ and ‘Sequence (-)’, show the DNA sequence on the forward and reverse strands, respectively. Above these, a track shows start and stop codons, and above that, several tracks indicate the presence and structure of different transcripts (labelled as ‘Genes’ and ‘CCDS set’; transcripts are read from left to right). The ‘hollowed-out’ parts of transcripts indicate non-coding sequences. Below the DNA sequence, the track ‘Sequence variant’ shows known sequence variants from dbSNP [17] and the 1000 Genomes Project [18]. The ‘Variation Legend’ and ‘Gene Legend’ provide more information about features shown in different colours in the browser.
CCDS, Consensus Coding Sequence; UTR, untranslated region.
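The reasoning behind the two Figure 1 examples can be sketched in a few lines of Python. This is a minimal illustration, not the logic used by ANNOVAR or VEP: the function names, the returned labels and the insertion offset within the codon are assumptions chosen so that the sequences match the caption (TGA to TGG for the SNV, and TGA remaining TGA after insertion of a single A).

```python
# Standard stop codons on the coding strand (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def insert_base(codon: str, offset: int, base: str) -> str:
    """Return the sequence obtained by inserting `base` at `offset` in `codon`."""
    return codon[:offset] + base + codon[offset:]


def stop_codon_effect(ref_codon: str, alt_sequence: str) -> str:
    """Classify a change to a stop codon by re-reading the first triplet.

    The labels returned here are illustrative, not tool output.
    """
    if ref_codon not in STOP_CODONS:
        raise ValueError(f"{ref_codon} is not a stop codon")
    new_codon = alt_sequence[:3]
    return "stop_retained" if new_codon in STOP_CODONS else "stop_lost"


# (A) SNV: the final codon changes from TGA to TGG (tryptophan).
snv_effect = stop_codon_effect("TGA", "TGG")

# (B) Insertion of a single A; the offset (2) is an assumption that
# reproduces the caption's statement that the codon is still TGA.
ins_effect = stop_codon_effect("TGA", insert_base("TGA", 2, "A"))

print(snv_effect, ins_effect)
```

Re-reading the altered codon, rather than classifying by variant type alone, is what separates example (B) from a generic frameshift or stop-loss call.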
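Same software, different transcripts: REFSEQ vs ENSEMBL by ANNOVAR annotation category
| Annotation category | REF+ENS | REF | ENS | Match | Match rate REF (%) | Match rate ENS (%) | Overall match rate (%) |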
|---|---|---|---|---|---|---|---|
| stopgain_SNV | 15,835 | 14,183 | 14,960 | 13,308 | 93.83 | 88.96 | 84.04 |
| frameshift_insertion | 6,980 | 5,298 | 6,495 | 4,813 | 90.85 | 74.10 | 68.95 |
| frameshift_deletion | 7,491 | 4,547 | 7,380 | 4,436 | 97.56 | 60.11 | 59.22 |
| stoploss_SNV | 946 | 503 | 906 | 463 | 92.05 | 51.10 | 48.94 |
| splicing | 47,878 | 14,154 | 45,839 | 12,115 | 85.59 | 26.43 | 25.30 |
| frameshift_substitution | 1,960 | 195 | 1,947 | 182 | 93.33 | 9.35 | 9.29 |
| nonsynonymous_SNV | 321,669 | 291,898 | 315,592 | 285,821 | 97.92 | 90.57 | 88.86 |
| nonframeshift_insertion | 3,506 | 2,888 | 2,844 | 2,226 | 77.08 | 78.27 | 63.49 |
| nonframeshift_deletion | 5,136 | 3,321 | 4,963 | 3,148 | 94.79 | 63.43 | 61.29 |
| nonframeshift_substitution | 933 | 226 | 843 | 136 | 60.18 | 16.13 | 14.58 |
| synonymous_SNV | 178,559 | 167,561 | 172,463 | 161,465 | 96.36 | 93.62 | 90.43 |
| UTR3 | 724,802 | 574,255 | 622,441 | 471,894 | 82.17 | 75.81 | 65.11 |
| UTR5 | 177,832 | 94,545 | 162,684 | 79,397 | 83.98 | 48.80 | 44.65 |
| UTR5_UTR3 | 2,183 | 292 | 2,092 | 201 | 68.84 | 9.61 | 9.21 |
| ncRNA_intronic | 8,992,009 | 2,113,428 | 8,244,441 | 1,365,860 | 64.63 | 16.57 | 15.19 |
| ncRNA_exonic | 654,098 | 140,303 | 597,947 | 84,152 | 59.98 | 14.07 | 12.87 |
| ncRNA_UTR3 | 53,379 | 10,712 | 47,133 | 4,466 | 41.69 | 9.48 | 8.37 |
| ncRNA_UTR5 | 10,683 | 1,989 | 9,444 | 750 | 37.71 | 7.94 | 7.02 |
| ncRNA_splicing | 13,931 | 1,051 | 13,562 | 682 | 64.89 | 5.03 | 4.90 |
| ncRNA_UTR5_ncRNA_UTR3 | 107 | 1 | 106 | 0 | 0.00 | 0.00 | 0.00 |
| intronic | 29,289,037 | 26,805,864 | 27,743,749 | 25,260,576 | 94.24 | 91.05 | 86.25 |
| intergenic | 50,305,202 | 49,797,113 | 41,307,708 | 40,799,619 | 81.93 | 98.77 | 81.10 |
| downstream | 991,811 | 474,684 | 840,376 | 323,249 | 68.10 | 38.46 | 32.59 |
| upstream | 910,818 | 440,728 | 762,664 | 292,574 | 66.38 | 38.36 | 32.12 |
| upstream_downstream | 53,608 | 15,621 | 47,293 | 9,306 | 59.57 | 19.68 | 17.36 |
| unknown | 11,205 | 6,215 | 5,703 | 713 | 11.47 | 12.50 | 6.36 |
| ALL LOF | 81,090 | 38,880 | 77,527 | 35,317 | 90.84 | 45.55 | 43.55 |
| ALL LOF and MISSENSE | 412,334 | 337,213 | 401,769 | 326,648 | 96.87 | 81.30 | 79.22 |
| ALL EXONIC | 590,893 | 504,774 | 574,232 | 488,113 | 96.70 | 85.00 | 82.61 |
| ALL | 80,981,575 | 80,981,575 | 80,981,575 | 69,181,552 | 85.43 | 85.43 | 85.43 |
This table summarises the number of annotations that match between the REFSEQ and ENSEMBL results for each category of annotation. It shows the number of variants given each type of annotation when using (i) either REFSEQ or ENSEMBL (‘REF+ENS’; union), (ii) REFSEQ (‘REF’) and (iii) ENSEMBL (‘ENS’). It also shows the number of variants that have matching annotations (i.e. the same annotation when using both transcript sets; intersection) and the match rate for each transcript set, which expresses the proportion of matching annotations for an annotation term relative to the total number of annotations in the category from the particular transcript set, as a percentage. The final column shows the ‘Overall match rate’, which is the percentage of the variants with a given annotation when using either REFSEQ or ENSEMBL (‘REF+ENS’) that have a matching annotation when using the two transcript sets. Categories are loosely ordered by the severity of effect, with LoF annotations listed before nonsynonymous, synonymous, non-exonic categories and so on. Within each loose group, categories are sorted in descending order of overall matching rate. The bottom four rows show the total degree of matching across all putative loss-of-function (LoF) categories, all LoF and missense categories, all exonic categories and, finally, all categories.
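The match rates described above follow directly from three counts per category. The sketch below is a hedged illustration (the function and key names are hypothetical); it assumes the union count equals REF + ENS minus the matching count, which is consistent with the rows of the table, and it uses the stopgain_SNV row as input.

```python
def match_rates(n_ref: int, n_ens: int, n_match: int) -> dict:
    """Compute union size and match rates (as percentages) for one category.

    n_ref   -- variants given this annotation with REFSEQ transcripts
    n_ens   -- variants given this annotation with ENSEMBL transcripts
    n_match -- variants given this annotation with both (intersection)
    """
    n_union = n_ref + n_ens - n_match  # inclusion-exclusion

    def pct(numerator: int, denominator: int):
        # Undefined when a transcript set has no annotations in the category.
        if denominator == 0:
            return None
        return round(100 * numerator / denominator, 2)

    return {
        "union": n_union,
        "ref_rate": pct(n_match, n_ref),
        "ens_rate": pct(n_match, n_ens),
        "overall_rate": pct(n_match, n_union),
    }


# stopgain_SNV row: REF = 14,183, ENS = 14,960, matching = 13,308
stopgain = match_rates(14183, 14960, 13308)
print(stopgain)
```

For this row the computed union is 15,835 and the rates round to 93.83, 88.96 and 84.04, in agreement with the table.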
Figure 2. REFSEQ-normalised heatmap of annotation comparison. This heatmap shows scaled numbers of variants (log10 transformation with offset of 1 applied) for all different combinations of ANNOVAR categories of annotations when using the ENSEMBL transcript set (columns) and REFSEQ transcript set (rows). Values are Z-scaled (mean-centred, divided by standard deviation) by row (each row is scaled separately; contrast with Figure 3). The key above the heatmap shows the values indicated by different colours. This row-normalised heatmap allows us to see which categories of annotation are over-represented (relative to the total number of variants in the column/category) in the ENSEMBL annotations for each category (i.e. row) of REFSEQ annotation. Ideally, all of the dark red squares would lie on the diagonal, with white squares on the off-diagonals, indicating complete agreement in the annotations from the two transcript sets. Compare with Additional file 1: Table S1, which provides the numbers used for this heatmap. Categories are ordered as per Table 1.
Figure 3. ENSEMBL-normalised heatmap of annotation comparisons. This heatmap shows scaled numbers of variants (log10 transformation with offset of 1 applied) for all different combinations of ANNOVAR categories of annotations when using the ENSEMBL transcript set (columns) and REFSEQ transcript set (rows). Values are Z-scaled (mean-centred, divided by standard deviation) by column (each column is scaled separately; contrast with Figure 2). The key above the heatmap shows the values indicated by different colours. The column-normalised heatmap allows us to see which categories of annotation are over-represented (relative to the total number of variants in the column/category) in the REFSEQ annotations for each category (i.e. column) of ENSEMBL annotation. Ideally, all of the dark red squares would lie on the diagonal, with white squares on the off-diagonals, indicating complete agreement in the annotations when using the two transcript sets. Compare with Additional file 1: Table S1, which provides the numbers used for this heatmap. Categories are ordered as per Table 1.
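The scaling used for the two heatmaps (log10 with an offset of 1, then Z-scaling by row or by column) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' plotting code: the counts are a made-up 3×3 toy matrix (the real cross-tabulation is in Additional file 1: Table S1), the function names are hypothetical, and the sample standard deviation is assumed because the captions do not say which estimator was used.

```python
import math
import statistics


def log_transform(counts):
    """Apply log10(x + 1) to every cell of a count matrix."""
    return [[math.log10(x + 1) for x in row] for row in counts]


def z_scale(values):
    """Mean-centre and divide by the sample standard deviation (assumed)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return [0.0] * len(values)  # constant vector: no contrast to show
    return [(v - mean) / sd for v in values]


def scale_rows(matrix):
    """Z-scale each row separately (as described for Figure 2)."""
    return [z_scale(row) for row in matrix]


def scale_columns(matrix):
    """Z-scale each column separately (as described for Figure 3)."""
    scaled_cols = [z_scale(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*scaled_cols)]


# Toy counts: rows = REFSEQ categories, columns = ENSEMBL categories.
toy_counts = [
    [900, 40, 0],
    [25, 300, 5],
    [0, 10, 80],
]

logged = log_transform(toy_counts)
by_row = scale_rows(logged)
by_col = scale_columns(logged)

for row in by_row:
    print([round(v, 2) for v in row])
```

In the toy matrix the largest scaled value in every row and every column falls on the diagonal, which corresponds to the ideal pattern of agreement described in the captions.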
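Same transcripts, different software: ANNOVAR and VEP annotations for exonic variants
| Annotation category | ANV+VEP | ANV | VEP | Exact match | Category match | Match rate ANV (%) | Match rate VEP (%) | Overall category match rate (%) | Overall exact match rate (%) |
|---|---|---|---|---|---|---|---|---|---|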
| LOF total | 104,915 | 77,527 | 96,761 | 68,284 | 69,373 | 88.08 | 70.57 | 66.12 | 65.09 |
| Frameshift | 19,021 | 15,822 | 16,685 | 13,486 | - | 85.24 | 80.83 | - | 70.90 |
| Stop gained | 16,758 | 14,960 | 16,146 | 14,348 | - | 95.91 | 88.86 | - | 85.62 |
| Stop lost | 1,113 | 906 | 1,077 | 870 | - | 96.03 | 80.78 | - | 78.17 |
| All splicing | 69,112 | 45,839 | 62,853 | 39,580 | - | 86.35 | 62.97 | - | 57.27 |
| MISSENSE total | 350,806 | 324,242 | 347,752 | 318,056 | 321,188 | 98.09 | 91.46 | 91.56 | 90.66 |
| Inframe indel | 9,455 | 8,650 | 6,600 | 5,795 | - | 66.99 | 87.80 | - | 61.29 |
| Missense | 343,284 | 315,592 | 339,953 | 312,261 | - | 98.94 | 91.85 | - | 90.96 |
| Initiator codon | 1,199 | 0 | 1,199 | 0 | - | - | 0.00 | - | 0.00 |
| SYNONYMOUS and OTHER CODING total | 182,120 | 172,463 | 175,483 | 165,643 | 165,826 | 96.05 | 94.39 | 91.05 | 90.95 |
| Synonymous | 181,873 | 172,463 | 175,053 | 165,643 | - | 96.05 | 94.62 | - | 91.08 |
| Stop retained | 203 | 0 | 203 | 0 | - | - | 0.00 | - | 0.00 |
| Other coding | 227 | 0 | 227 | 0 | - | - | 0.00 | - | 0.00 |
| ALL LOF | 104,915 | 77,527 | 96,761 | 68,284 | 69,373 | 88.08 | 70.57 | 66.12 | 65.09 |
| ALL LOF and MISSENSE | 455,721 | 401,769 | 444,513 | 386,340 | 390,561 | 96.16 | 86.91 | 85.70 | 84.78 |
| ALL EXONIC | 637,841 | 574,232 | 619,996 | 551,983 | 556,387 | 96.13 | 89.03 | 87.23 | 86.54 |
This table summarises the number of annotations that match between the ANNOVAR and VEP results (when using ENSEMBL transcripts) for each exonic category of annotation. It shows the number of variants given each type of annotation when using (i) either ANNOVAR or VEP (‘ANV+VEP’; union), (ii) ANNOVAR (‘ANV’) and (iii) VEP (‘VEP’). It also shows the number of variants that have exact matching annotations (i.e. exactly the same annotation from both tools; intersection), and category-matching annotations (i.e. annotations from the two tools in the same high-level category – LoF, missense, synonymous and other coding – even if not an exact match). Columns six and seven show the match rate for each tool, which gives the percentage of matching annotations for an annotation term from ANNOVAR and VEP, respectively, relative to the total number of annotations in the category from the particular software tool. Column eight gives the percentage of variants with annotations from ANNOVAR and VEP in the same high-level category (overall category match rate). Column nine shows the overall exact match rate, which is the percentage of variants with an annotation from either ANNOVAR or VEP (‘ANV+VEP’) that have an exactly matching annotation from the two tools. Here, the specific annotations from equivalent terms for ANNOVAR and VEP have been aggregated to enable the comparison (see Additional file 1: Table S4). The final three rows of the table show aggregate counts and match rates for all loss-of-function categories, all LoF and missense categories and all exonic categories, respectively. Note that the all splicing category for VEP comprises 5,011 splice acceptor variants, 8,544 splice donor variants and 49,298 more general splice region variants. ANNOVAR, in contrast, only has one general splicing category, and does not distinguish between acceptor, donor and other splicing variants.
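The distinction between exact and category matching can be sketched as below. This is a minimal illustration, not the authors' pipeline: the grouping into high-level categories follows the rows of the table, but the term identifiers and function name are hypothetical, and the actual mapping from tool-specific terms to aggregated terms is given in Additional file 1: Table S4, which is not reproduced here.

```python
# High-level grouping taken from the table rows; the keys are
# hypothetical identifiers for the aggregated annotation terms.
HIGH_LEVEL = {
    "frameshift": "LoF",
    "stop_gained": "LoF",
    "stop_lost": "LoF",
    "splicing": "LoF",
    "inframe_indel": "missense",
    "missense": "missense",
    "initiator_codon": "missense",
    "synonymous": "synonymous_other",
    "stop_retained": "synonymous_other",
    "other_coding": "synonymous_other",
}


def compare_annotations(anv_term: str, vep_term: str) -> str:
    """Classify a pair of aggregated annotations for one variant."""
    if anv_term == vep_term:
        return "exact"
    if HIGH_LEVEL[anv_term] == HIGH_LEVEL[vep_term]:
        return "category"
    return "mismatch"


# rs7103033 (Figure 1A): both tools report stop-loss.
print(compare_annotations("stop_lost", "stop_lost"))
# rs72545970 (Figure 1B): ANNOVAR frameshift insertion, VEP stop-loss.
print(compare_annotations("frameshift", "stop_lost"))
# A made-up pair that crosses high-level categories.
print(compare_annotations("frameshift", "synonymous"))
```

Under this scheme the Figure 1B variant counts towards the category match total but not the exact match total, because both annotations fall within the LoF group.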