Meng Wang, Keith M Callenberg, Raymond Dalgleish, Alexandre Fedtsov, Naomi K Fox, Peter J Freeman, Kevin B Jacobs, Piotr Kaleta, Andrew J McMurry, Andreas Prlić, Veena Rajaraman, Reece K Hart.
Abstract
The Human Genome Variation Society (HGVS) nomenclature guidelines encourage the accurate and standard description of DNA, RNA, and protein sequence variants in public variant databases and the scientific literature. Inconsistent application of the HGVS guidelines can lead to misinterpretation of variants in clinical settings. Reliable software tools are essential to ensure consistent application of the HGVS guidelines when reporting and interpreting variants. We present the hgvs Python package, a comprehensive tool for manipulating sequence variants according to the HGVS nomenclature guidelines. Distinguishing features of the hgvs package include: (1) parsing, formatting, validating, and normalizing variants on genome, transcript, and protein sequences; (2) projecting variants between aligned sequences, including those with gapped alignments; (3) flexible installation using remote or local data (fully local installations eliminate network dependencies); (4) extensive automated tests; and (5) open source development by a community from eight organizations worldwide. This report summarizes recent and significant updates to the hgvs package since its original release in 2014, and presents results of extensive validation using clinically relevant variants from ClinVar and HGMD.
Keywords: HGVS; clinvar; sequence variant; variant representation
Year: 2018 PMID: 30129167 PMCID: PMC6282708 DOI: 10.1002/humu.23615
Source DB: PubMed Journal: Hum Mutat ISSN: 1059-7794 Impact factor: 4.878
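To make the parsing feature named in the abstract concrete, here is a minimal sketch of splitting a simple HGVS string into its parts. It is a standard-library illustration only, not the hgvs package's parser: the names `SimpleVariant` and `parse_simple` are hypothetical, and the pattern assumes plain integer positions (no intronic offsets, UTR positions, or uncertain ranges).

```python
import re
from typing import NamedTuple


class SimpleVariant(NamedTuple):
    accession: str  # e.g. "NM_001166478.1"
    kind: str       # coordinate type: c, g, n, m or r
    start: int
    end: int
    edit: str       # e.g. "insT", "delTinsCC", "C>T"


# Hypothetical, deliberately narrow pattern: integer positions only.
_PATTERN = re.compile(
    r"^(?P<ac>[A-Z]{2}_\d+\.\d+):"
    r"(?P<kind>[cgnmr])\."
    r"(?P<start>\d+)(?:_(?P<end>\d+))?"
    r"(?P<edit>[ACGTN]>[ACGTN]|del[ACGTN]*(?:ins[ACGTN]+)?|ins[ACGTN]+|dup[ACGTN]*|inv)$"
)


def parse_simple(text: str) -> SimpleVariant:
    """Parse a simple HGVS string or raise ValueError on a syntax error."""
    m = _PATTERN.match(text)
    if m is None:
        raise ValueError(f"cannot parse {text!r}")
    start = int(m["start"])
    end = int(m["end"]) if m["end"] else start
    return SimpleVariant(m["ac"], m["kind"], start, end, m["edit"])
```

Under these assumptions, `parse_simple("NM_001166478.1:c.30_31insT")` yields start 30, end 31 and edit `insT`, while a string such as `NM_000059.3:c.410_411ins8` is rejected because an insertion must name its inserted bases in this pattern.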
Figure 1. Overview of hgvs package modules and module relationships
Comparisons between hgvs 1.1 (released in 2018) and hgvs 0.4 (released in 2014)
| Feature | v 1.1 | v 0.4 |
|---|---|---|
| parse substitution, delins, insertion, deletion, duplication and repeats | ✓ | ✓ |
| parse inversion, conversion and identity variants | ✓ | × |
| strict validation mode | ✓ | ✓ |
| relaxed validation mode | ✓ | × |
| different validation response levels | ✓ | × |
| supported sequence assemblies | any NCBI Assembly | GRCh37 only |
| validate variants before mapping | ✓ | × |
| normalize (shift and rewrite) variants | ✓ | × |
| configurable formatter | ✓ | × |
| local data source of transcript information (UTA) | ✓ | ✓ |
| local data source of sequences (SeqRepo) | ✓ | × |
Figure 2. The variant normalization process in the hgvs package. a) Normalizing NM_001166478.1:c.30_31insT by 3′ shifting to NM_001166478.1:c.35_36insT, and then rewriting as NM_001166478.1:c.35dupT. b) Normalizing NM_001166478.1:c.59delG by right shifting to or across the intron between c.59 and c.60
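The two steps in panel (a), 3′ shifting and then rewriting an insertion as a duplication, can be sketched on a toy sequence. This is an illustration under stated assumptions, not the hgvs normalizer: the function names are hypothetical, `TOY_SEQ` is an invented sequence whose only property taken from the caption is a run of T at positions 31 to 35, and the sketch ignores introns, strand, and CDS offsets.

```python
# Toy sequence (assumption): 30 x A, then T at positions 31-35, then G.
TOY_SEQ = "A" * 30 + "T" * 5 + "G" * 5


def shift_insertion_3prime(seq, left, inserted):
    """Shift an insertion as far 3' as possible.

    `left` is the 1-based position of the base immediately 5' of the
    insertion, so the insertion sits between `left` and `left + 1`.
    """
    # seq[left] (0-based) is the base at 1-based position left + 1.
    while left < len(seq) and seq[left] == inserted[0]:
        inserted = inserted[1:] + inserted[0]  # rotate the inserted bases
        left += 1
    return left, inserted


def normalize_insertion(seq, left, inserted, prefix="c."):
    """3' shift, then rewrite as dup when the inserted bases repeat
    the sequence immediately 5' of the insertion point."""
    left, inserted = shift_insertion_3prime(seq, left, inserted)
    n = len(inserted)
    if left >= n and seq[left - n:left] == inserted:
        start, end = left - n + 1, left
        span = str(end) if start == end else f"{start}_{end}"
        return f"{prefix}{span}dup"
    return f"{prefix}{left}_{left + 1}ins{inserted}"


print(normalize_insertion(TOY_SEQ, 30, "T"))
```

On the toy sequence the call prints `c.35dup`: the insertion moves from between positions 30 and 31 to between 35 and 36, and the inserted T repeats the T at position 35. The sketch omits the reference base from the result, so it prints `c.35dup` rather than the `c.35dupT` form used in the caption.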
Settings and examples of the configurable formatting in hgvs
| Setting | Setting value | Example |
|---|---|---|
| | 0 (default) | NM_001166478.1:c.31_35del |
| | 10 | NM_001166478.1:c.31_35delTTTTT |
| | 3 | NM_001166478.1:c.31_35del |
| | True (default) | NP_001628.1:p.Gly528Arg |
| | False | NP_001628.1:p.G528R |
| | False (default) | NP_001628.1:p.Gly528Ter |
| | True | NP_001628.1:p.Gly528* |
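The three behaviours in the table (a length limit on the displayed reference sequence, three-letter versus one-letter amino acids, and `Ter` versus `*` for stop codons) can be mimicked with a small formatter. This is a sketch only: the function and parameter names (`max_ref_length`, `three_letter`, `term_asterisk`) are hypothetical labels chosen for the example and are not presented as the package's configuration keys.

```python
# One-letter to three-letter amino acid codes; "*" is a stop codon.
AA3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "*": "Ter",
}


def format_deletion(accession, start, end, ref, max_ref_length=0):
    """Show the deleted bases only when they fit within max_ref_length."""
    span = str(start) if start == end else f"{start}_{end}"
    suffix = ref if len(ref) <= max_ref_length else ""
    return f"{accession}:c.{span}del{suffix}"


def format_protein_substitution(accession, ref_aa, position, alt_aa,
                                three_letter=True, term_asterisk=False):
    """Render a protein substitution given one-letter residue codes."""
    def render(aa):
        # One-letter output has no "Ter" form, so a stop is always "*" there.
        if aa == "*" and (term_asterisk or not three_letter):
            return "*"
        return AA3[aa] if three_letter else aa

    return f"{accession}:p.{render(ref_aa)}{position}{render(alt_aa)}"


print(format_deletion("NM_001166478.1", 31, 35, "TTTTT", max_ref_length=10))
print(format_protein_substitution("NP_001628.1", "G", 528, "*", term_asterisk=True))
```

With a limit of 3, the five-base reference is longer than the limit and is dropped, which reproduces the `c.31_35del` row for that setting.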
Figure 3. Timing of mapping, normalization and validation of variants with online data sources and local data sources. Each configuration was run three times independently (grey dots). Each bar shows the average of all timings
hgvs package parsing and validation results for transcript variants and genomic variants from ClinVar release 2017‐05
| | Transcript variants: number | Transcript variants: percent | Transcript variants: example | Genomic variants: number | Genomic variants: percent | Genomic variants: example |
|---|---|---|---|---|---|---|
| - Variants with uncertain positions | 696 | 0.244% | NM_000059.3:c.-227-?_425+?del | 23,628 | 7.624% | NC_000002.12:g.(?_17019)_(774946_?)del |
| - Large variants (>1 M, slow performance) | 0 | 0.000% | | 58 | 0.019% | NC_000006.11:g.255350_3189972dup |
| - Sequence data not available | 50 | 0.018% | LRG_219t1:c.3261dupC | 40 | 0.013% | NC_000002.10:g.47852014_47873687del |
| - Syntactic errors | 718 | 0.252% | NM_000059.3:c.410_411ins8 | 1,275 | 0.411% | NC_000017.11:g.43045707delTins6 |
| - Invalid coordinates | 2,724 | 0.956% | NM_000038.5:c.*2292A>T | 0 | 0.000% | |
| - Discordant del/ins length | 4 | 0.001% | NM_000760.3:c.998_1071del174 | 6 | 0.002% | NC_000015.9:g.89382103_89382159del56 |
| - Wrong reference | 1 | 0.000% | NM_139058.2:c.333_335dupGCC | 0 | 0.000% | |
| | | | | 284,892 | 91.931% | NC_000001.11:g.17028712delTinsCC |
| - Intronic variants | 48,797 | 17.122% | NM_000016.5:c.387+32C>G | | | |
| - Exonic variants | 232,003 | 81.407% | NM_000059.3:c.201_202dupGA | | | |
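The "Discordant del/ins length" row lends itself to a short worked check: the interval in `c.998_1071` spans 74 bases, yet the variant states a deletion of 174. The sketch below shows that comparison only. It is not the hgvs validator, the function name is hypothetical, and it handles just the `start_enddelN` form that appears in the two table examples.

```python
import re
from typing import Tuple

# Matches only "<type>.<start>_<end>del<length>" at the end of the string.
_DEL_WITH_LENGTH = re.compile(r"[cgn]\.(\d+)_(\d+)del(\d+)$")


def check_deletion_length(variant: str) -> Tuple[bool, int, int]:
    """Return (is_consistent, interval_length, stated_length)."""
    m = _DEL_WITH_LENGTH.search(variant)
    if m is None:
        raise ValueError(f"not a del-with-length variant: {variant!r}")
    start, end, stated = (int(g) for g in m.groups())
    interval = end - start + 1  # positions are inclusive
    return interval == stated, interval, stated


print(check_deletion_length("NM_000760.3:c.998_1071del174"))
print(check_deletion_length("NC_000015.9:g.89382103_89382159del56"))
```

Both table examples fail the check: the transcript variant covers 74 positions but states 174, and the genomic variant covers 57 positions but states 56.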
hgvs package normalization results for transcript variants and genomic variants from ClinVar release 2017‐05
| | Transcript variants: number | Transcript variants: percent | Transcript variants: example | Genomic variants: number | Genomic variants: percent | Genomic variants: example |
|---|---|---|---|---|---|---|
| | 231,127 | 99.622% | | 273,210 | 95.899% | |
| | 710 | 0.306% | NM_000041.3:c.291delG ⇒ NM_000041.3:c.292del | 11,617 | 4.078% | NC_000002.12:g.165901915dupA ⇒ NC_000002.12:g.165901920dup |
| | 213 | 0.092% | NM_000023.3:c.981_982insGC ⇒ NM_000023.3:c.981_982dup | 285 | 0.100% | NC_000023.11:g.18672015_18672016insA ⇒ NC_000023.11:g.18672015dup |
| | 47 | 0.020% | NM_000183.2:c.3_4insACT ⇒ NM_000183.2:c.5_7dup | 220 | 0.077% | NC_000016.10:g.88865_88866insGT ⇒ NC_000016.10:g.88867_88868dup |
| | 30 | 0.013% | NM_000051.3:c.822delT (RCV000477779) = NM_000051.3:c.824delT (RCV000205636) | 55 | 0.019% | NC_000008.11:g.89955482_89955483delCAinsTG (RCV000486277) = NC_000008.11:g.89955482_89955483invCA (RCV000474157) |
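The last genomic example pairs a `delCAinsTG` record with an `invCA` record at the same positions. The two describe the same change because TG is the reverse complement of CA. A minimal check of that relationship is sketched below. The helper names are hypothetical, and the sketch says nothing about how the hgvs package itself decides that two variants are equal.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def delins_is_inversion(deleted: str, inserted: str) -> bool:
    """True when a delins replaces a span with its reverse complement."""
    return len(deleted) >= 2 and inserted == reverse_complement(deleted)


print(delins_is_inversion("CA", "TG"))
```

For the table example the check holds, so the delins and the inversion denote the same resulting sequence.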
Important correctness differences between the hgvs package and Mutalyzer. The same input variant was provided to hgvs and Mutalyzer (Mutalyzer 2.0.26, released on July 19, 2017)
| Feature | Operation | Input Variant | hgvs result | Mutalyzer result | Explanation |
|---|---|---|---|---|---|
| indel-aware projections | Project transcript variant onto aligned genomic sequence | NM_033089.6:c.460C>N; NM_033089.6:c.461C>N | NC_000020.10:g.278687C>N; NC_000020.10:g.278691C>N | NC_000020.10:g.278687C>N; NC_000020.10:g.278688C>N | NM_033089.6 contains a 3-nucleotide insertion in the genome relative to the transcript between transcript sequence position 484 and 485 (c.460 and c.461), corresponding to g.278687 and g.278691 on NC_000020.10. Mutalyzer will incorrectly compute coordinates after c.484. This issue affects 428 genes and 1104 transcripts in GRCh37, and 131 genes and 336 transcripts in GRCh38. |
| validate variants before projection | Project transcript variant onto aligned genomic sequence | NM_003002.3:c.500000G>T | Exception raised (“HGVSError: The given coordinate is outside the bounds of the reference sequence.”) | NC_000011.9:g.112465214G>T | hgvs refuses to extrapolate positions beyond the bounds of the sequence alignment. Mutalyzer does not check sequence bounds. |
| replace reference sequence after projection | Project transcript variant onto aligned genomic sequence | NM_000024.5:c.46A>T | NC_000005.9:g.148206440G>T | NC_000005.9:g.148206440A>T | NM_000024.5:c.46 corresponds to NC_000005.9:g.148206440, the site of a known SNP (rs1042713). The reference nucleotides in the transcript and genomic sequence are A and G respectively. hgvs replaces the genomic reference sequence after projection, while Mutalyzer does not. |
| normalize variants after projection | Project transcript variant onto aligned genomic sequence | NM_024426.4:c.1137_1141dup | NC_000011.9:g.32417913_32417917dup | NC_000011.9:g.32417911_32417915dup | NM_024426.4 is on the - strand. The input variant is correctly normalized (3′ shifted). After projection to the genomic sequence, the variant can be normalized on the + strand by 2 nucleotides. Mutalyzer appears to not apply normalization after converting positions. |
| rewrite variants in preferred forms | Normalize/rewrite variant | NM_001166478.1:c.35_36insT | NM_001166478.1:c.35dup | website warning | hgvs rewrites NM_001166478.1:c.35_36insT as NM_001166478.1:c.35dup. Mutalyzer raises a warning but does not correct the error. |
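The first two rows of the table (indel-aware projection and bounds checking) can be illustrated with a block-based coordinate map. The sketch below is a toy under stated assumptions, not the hgvs mapper: `project_n_to_g` and `TOY_BLOCKS` are hypothetical, only the junction (transcript positions 484 and 485 aligning to g.278687 and g.278691) comes from the table, and the block start and end values are invented so that the example is self-contained. It also assumes a plus-strand transcript and ignores introns.

```python
# Each block is (tx_start, tx_end, g_start), 1-based and inclusive.
# Only the 484/485 junction reflects the table; the rest is invented.
TOY_BLOCKS = [
    (1, 484, 278204),    # ends at g.278687
    (485, 600, 278691),  # starts after a 3-nucleotide genomic insertion
]


def project_n_to_g(blocks, n_pos):
    """Map a transcript sequence position to a genomic position.

    Raises ValueError instead of extrapolating when the position is
    outside every aligned block.
    """
    for tx_start, tx_end, g_start in blocks:
        if tx_start <= n_pos <= tx_end:
            return g_start + (n_pos - tx_start)
    raise ValueError(f"position {n_pos} is outside the aligned sequence")


def project_ignoring_gaps(blocks, n_pos):
    """Naive offset from the first block, for comparison only."""
    tx_start, _, g_start = blocks[0]
    return g_start + (n_pos - tx_start)


print(project_n_to_g(TOY_BLOCKS, 484), project_n_to_g(TOY_BLOCKS, 485))
print(project_ignoring_gaps(TOY_BLOCKS, 485))
```

The gap-aware function returns 278687 and 278691 for the two positions, while the naive offset gives 278688 for position 485, which is the kind of off-by-gap result the table attributes to ignoring the alignment gap. A request for position 500000 raises an error rather than returning an extrapolated coordinate.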
Projection of variants in the vicinity of transcript–genome alignment gaps
| Position relative to gap | Variant | Transcript to genome projection |
|---|---|---|
| | NM_007121.5:n.796A>T | NC_000019.10:g.50378563_50378564insTAC |
| | NM_007121.5:n.796_797del | NC_000019.10:g.50378563_50378564insC |
| | NM_007121.5:n.796_797insT | NC_000019.10:g.50378564_50378565insTACA |
| | NM_007121.5:n.796_798del | NC_000019.10:g.50378565_50378567dup |
| | NM_007121.5:n.796_798delinsTCGG | NC_000019.10:g.50378563_50378564insTCGG |
| | NM_007121.5:n.795_796del | NC_000019.10:g.50378563_50378564insC |
| | NM_007121.5:n.795_796delinsTT | NC_000019.10:g.50378563delinsTTAC |
| | NM_007121.5:n.795_796insT | NC_000019.10:g.50378563_50378564insTAAC |
| | NM_007121.5:n.794_800del | NC_000019.10:g.50378562_50378565del |
| | NM_007121.5:n.794_800delinsTC | NC_000019.10:g.50378562_50378565delinsTC |