A SNP panel and online tool for checking genotype concordance through comparing QR codes.

Yonghong Du, Joshua S Martin, John McGee, Yuchen Yang, Eric Yi Liu, Yingrui Sun, Matthias Geihs, Xuejun Kong, Eric Lingfeng Zhou, Yun Li, Jie Huang.

Abstract

In the current precision medicine era, more and more samples are being genotyped and sequenced. Both researchers and commercial companies expend significant time and resources to reduce the error rate. However, a sample mix-up rate of between 0.1% and 1% has been reported, not to mention the possibly higher mix-up rate during downstream genetic reporting processes. Even at the low end of this estimate, this translates to a significant number of mislabeled samples, especially among the projected one billion people who will be sequenced within the next decade. Here, we first describe a method to identify a small set of single nucleotide polymorphisms (SNPs) that can uniquely identify a personal genome, which utilizes allele frequencies of five major continental populations reported in the 1000 Genomes Project and the ExAC Consortium. To make this panel more informative, we added four SNPs that are commonly used to predict ABO blood type, and another two SNPs that are capable of predicting sex. We then implemented a web interface (http://qrcme.tech), nicknamed QRC (for QR code based Concordance check), which is capable of extracting the relevant ID SNPs from raw genetic data, coding their genotypes as a quick response (QR) code, and comparing QR codes to report the concordance of the underlying genetic datasets. The resulting 80 fingerprinting SNPs represent a significant decrease in complexity and in the number of markers used for genetic data labelling and tracking. Our method and web tool are easily accessible to both researchers and the general public who consider the accuracy of complex genetic data a prerequisite for precision medicine.

Year:  2017        PMID: 28926565      PMCID: PMC5604942          DOI: 10.1371/journal.pone.0182438

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

Genomic data is being accumulated at an incredible rate. It is projected that approximately one billion people will be whole genome sequenced within the next decade[1]. With a cost easily below $100, genotyping arrays that target single nucleotide polymorphisms (SNPs) will increase this rate exponentially. Many studies, such as the UK Biobank project[2] in the United Kingdom, the VA Million Veteran Program[3] in the United States, and the China Kadoorie Study[4] in China and the United Kingdom, have taken advantage of these cost-effective arrays to genotype up to ~500,000 samples. These large cohorts are not anomalies, with the Kaiser Permanente Research Program on Genes, Environment, and Health[5] and TOPMed[6] building cohorts of similar size. Outside of the research field, direct-to-consumer genetic testing has exploded, with companies claiming to have genotyped more than a million individuals (for example, http://www.23andme.com). However, with this plethora of genetic data come errors. Hu et al. report an average sample mix-up rate between 0.1% and 1%[7], suggesting that between 500 and 5,000 samples are probably mislabeled in a study the size of the UK Biobank (~500,000 samples). A significant amount of research has been devoted to reducing these errors and improving quality control. These strategies range from detailed outlines of quality control procedures[8] to matching sets of significant markers for sample tracking. All of these methods require a significant amount of expertise and time to implement, making them a drain on limited resources. Individual identification by SNP analysis requires a panel of SNPs that together give an extremely remote probability that two individuals would have the same DNA profile. Previously, a universal panel of 92 SNPs was developed for individual identification[9]. Another panel used 75 SNPs for Eastern Asian populations[10]. 
A recent simulation study showed that only 60 optimized SNPs are required to differentiate individuals in the global population[7]. In this study, we describe a solution that is accurate, unique, and easy to use. Our proposed solution uses 80 identified SNPs that are shared across widely used genome-wide genotyping arrays. To increase accessibility and ease of use, we developed an online platform to extract the genetic data and encode it as a quick response (QR) code. QR codes have the advantage of being a robust method for encoding information and can be read with any image capture device such as a smartphone. Liu et al. previously compared 53 different types of one-dimensional and ten two-dimensional barcode symbologies and found that the QR code has the largest coding capacity and a relatively high compression rate, allowing for easy expansion if necessary[11]. Our website, nicknamed QRC (for QR code based Concordance check), provides an easy-to-use web-based interface for extracting the 80 markers from uploaded genotype data, encoding the markers as a QR code, and comparing the concordance of multiple QR codes. This methodology can easily be expanded for use by professionals in the genetic field.

Methods

Identification of ID SNPs

To generate our list of fingerprinting SNPs, we first obtained a list of bi-allelic autosomal SNPs that overlap in eight widely used genotyping arrays: three Affymetrix arrays, namely the Axiom Biobank Array, the Axiom UK Biobank Array, and the newly announced Axiom Precision Medicine Research Array (PMRA) (http://www.affymetrix.com/catalog); three Illumina arrays, namely the infinium-omniexpress-24-v1-2-a1 array, the Illumina HumanExome-12v1-2 array, and the newly announced Global Screening Array (GSA) (http://www.illumina.com/techniques/microarrays); and two direct-to-consumer (DTC) arrays (23&Me and Genes for Good). The resulting list was then filtered to ensure at least moderate frequencies across global populations. Specifically, we selected SNPs with a minor allele frequency (MAF) over 0.25 in each of the five global sub-populations presented in the 1000GP project, so that the selected SNPs are not only available on major genotyping arrays but are also common in global populations. The five sub-populations are: European (EUR), African (AFR), Admixed American (AMR), Eastern Asian (EAS), and Southern Asian (SAS). The MAF is based on data from the 1000 Genomes Project (1000GP)[12] (data freeze 20130502) and the Exome Aggregation Consortium (ExAC)[13] (release 0.3.1). The former includes whole genome sequencing data from 2,504 individuals of diverse ancestry, while the latter includes whole exome sequencing data from over 60,000 individuals. The results were further pruned by removing A/T and C/G SNPs and SNPs annotated as pathogenic or likely pathogenic in the ClinGen database[14]. The final selection step retains only SNPs that are not marginally dependent on each other, i.e., are not in linkage disequilibrium (LD). To be very conservative, we picked only one SNP from any 10 Mb region of the genome. The SNP for a given region was the one with the highest overall MAF among the remaining SNPs. 
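The filtering steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `Snp` record, the function names, and the greedy "highest overall MAF first" spacing rule are assumptions made for the example, and the ClinGen pathogenicity filter is omitted because it needs an external annotation file.

```python
from dataclasses import dataclass

POPULATIONS = ("EUR", "AFR", "AMR", "EAS", "SAS")
# Strand-ambiguous allele pairs (A/T and C/G) are excluded, as in the text.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


@dataclass
class Snp:  # hypothetical record type for this sketch
    rsid: str
    chrom: str
    pos: int
    ref: str
    alt: str
    raf: dict  # reference allele frequency per population, plus key "ALL"


def maf(frequency):
    """Minor allele frequency derived from the frequency of either allele."""
    return min(frequency, 1.0 - frequency)


def passes_filters(snp, maf_cutoff=0.25):
    """Keep non-ambiguous SNPs whose MAF exceeds the cutoff in every population."""
    if frozenset((snp.ref, snp.alt)) in AMBIGUOUS_PAIRS:
        return False
    return all(maf(snp.raf[pop]) > maf_cutoff for pop in POPULATIONS)


def select_id_snps(snps, maf_cutoff=0.25, min_distance=10_000_000):
    """Greedy selection (an assumption): visit candidates from highest to
    lowest overall MAF and keep one only if no already-kept SNP on the same
    chromosome lies within min_distance base pairs."""
    candidates = [s for s in snps if passes_filters(s, maf_cutoff)]
    candidates.sort(key=lambda s: maf(s.raf["ALL"]), reverse=True)
    chosen = []
    for snp in candidates:
        too_close = any(
            kept.chrom == snp.chrom and abs(kept.pos - snp.pos) < min_distance
            for kept in chosen
        )
        if not too_close:
            chosen.append(snp)
    # Chromosome names are strings here, so the ordering is lexicographic.
    return sorted(chosen, key=lambda s: (s.chrom, s.pos))
```

The greedy rule guarantees that any two selected SNPs on the same chromosome are at least `min_distance` apart, which is one way to read "only one SNP from any 10 Mb region"; fixed 10 Mb bins would be an alternative reading with slightly different results.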
Across the whole genome, this resulted in 74 SNPs that satisfy our filtering criteria. This number slightly exceeds the theoretical number of 60 required to uniquely distinguish individuals in the global population[7]. To make this panel verifiable on its own when there is only one genetic dataset, we added four single nucleotide variants (SNVs) that are commonly used to predict ABO blood type: (1) the exon-6 deletion rs8176719 for the O1 type; (2) rs41302905 for the O2 type; (3) rs8176746 for the B type[15, 16]; (4) rs56392308 for the A2 subtype[17]. We further added two SNPs that are capable of predicting sex: rs12743401 and rs12734338. These two SNPs align to both chromosome 1 and chromosome Y; therefore, heterozygosity in males actually reflects the detection of two regions, one on chromosome 1 and the other on chromosome Y[18, 19]. The resulting 80 SNPs were tested to confirm that they could uniquely label a large cohort. We used the UK Biobank (N ~150,000) as our test cohort. The genotypes of the fingerprinting SNPs were extracted and tested for uniqueness using PLINK[20].
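The uniqueness test itself was run with PLINK; the idea behind it can be illustrated with a short sketch that groups samples by their panel genotypes and reports any profile shared by more than one sample. The function name and the input layout (a mapping from sample ID to a list of two-letter genotypes in panel order) are assumptions made for this example.

```python
def find_shared_profiles(profiles):
    """Return groups of sample IDs whose panel genotypes are identical.

    profiles maps a sample ID to a list of two-letter genotypes in panel
    order. Allele order within a genotype is ignored ("AG" equals "GA").
    """
    groups = {}
    for sample_id, genotypes in profiles.items():
        key = tuple("".join(sorted(g)) for g in genotypes)
        groups.setdefault(key, []).append(sample_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```

An empty result means every sample in the cohort has a distinct profile on the panel. Any group that is returned would need follow-up, since it could indicate a duplicated sample or identical twins rather than a weakness of the panel.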

Comparing the concordance of ID SNPs through QR codes

We then developed a web-based application (http://qrcme.tech) that can extract the genotypes for these fingerprinting SNPs from raw genotype datasets, such as those from 23&Me, and then generate QR codes. To create a QR code, we first generate a string in the format “1AA2AC3--”, where 1, 2, 3 are the indices of the 80 SNPs and the two letters that follow each index are the genotype of the SNP at that position. Missing data is represented by “-”. This string, without indices, is then encoded into a QR code using the open source Zebra Crossing barcode image processing library (https://github.com/zxing/zxing/). The same library is used to decode a QR image back to the original text string. To compare QR codes, we first decode both images and then compare the values of the 80 SNPs from the decoded strings. A match includes five scenarios: (1) a perfect match such as “AG” vs. “AG”; (2) a permuted match such as “AG” vs. “GA”; (3) an opposite strand match such as “AG” vs. “TC”; (4) an “AC” vs. “TG” match (all permutations); (5) an “AG” vs. “TC” match (all permutations). All other conditions are considered a mismatch, with missing data reported separately. For those who are interested in deriving their own list of ID SNPs, we have also made this easy to accomplish through our QRC website. It takes a list of SNPs in CHR:POS format and compares it with a reference file that includes allele frequencies of 1,388,180 biallelic variants present in both 1000GP and ExAC. It then generates a list of independent SNPs with high allele frequencies across all major sub-populations, based on a user-specified MAF cutoff and region size threshold.
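The string handling and matching rules above can be sketched as follows. This is an illustration under stated assumptions, not the QRC implementation: the function names are hypothetical, the QR image step (done with ZXing on the website) is left out, and the strand-complement scenarios (3) to (5) are collapsed into a single "opposite_strand" category.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")
MISSING = "--"  # two missing alleles


def encode_genotypes(genotypes):
    """Concatenate two-letter genotypes in panel order, without indices."""
    return "".join(g if g else MISSING for g in genotypes)


def decode_genotypes(payload):
    """Split the decoded text back into two-letter genotypes."""
    return [payload[i:i + 2] for i in range(0, len(payload), 2)]


def classify(a, b):
    """Classify one pair of genotypes according to the matching rules."""
    if "-" in a or "-" in b:
        return "missing"
    if a == b:
        return "perfect"
    if sorted(a) == sorted(b):
        return "permuted"
    # Complement the first genotype and compare while ignoring allele order.
    if sorted(a.translate(COMPLEMENT)) == sorted(b):
        return "opposite_strand"
    return "mismatch"


def compare_payloads(first, second):
    """Count match categories across two decoded genotype strings."""
    pairs = zip(decode_genotypes(first), decode_genotypes(second))
    return Counter(classify(a, b) for a, b in pairs)
```

For example, `compare_payloads("AGAG", "AGGA")` counts one perfect and one permuted match. Because the indices are dropped before encoding, both strings must list the SNPs in the same panel order for the position-by-position comparison to be meaningful.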

Results

Through a series of selections, we have identified 74 SNPs across the whole genome that uniquely identify an individual across the global population. To make this list of SNPs more informative and unique, we further included four SNPs for predicting ABO blood type and two SNPs for predicting sex. Therefore, a total of 80 SNPs are included. Table 1 shows the overlap of SNPs across the eight major genotyping arrays. The upper diagonal numbers are the numbers of overlapping SNPs for each corresponding pair. The lower diagonal numbers (shown in italicized font with an underline) are the cumulative numbers of overlapping SNPs for each corresponding pair. For example, there are 865,720 SNPs on the Axiom PMRA array, among which 272,701 are also present on the Axiom UK Biobank array. Among these 272,701 SNPs, 172,088 are also on the Axiom Biobank array, and among the 172,088 SNPs, 39,292 are also on the Illumina GSA array. Eventually, 3,239 SNPs are shared across all eight arrays, and 74 of them are independent. The details of these 74 fingerprinting SNPs are listed in Table 2. The reference allele and reference allele frequency (RAF) are based on the human reference genome[15]. These 74 SNPs span 20 autosomes, excluding chromosomes 15 and 21. Their overall MAFs are all greater than 0.3, based on the 2,504 multi-ethnic individuals in 1000GP. At least 10 Mb separates any two SNPs, with an average distance of 37.4 Mb, reducing the possibility of linkage between SNPs. Additionally, these SNPs have no reported pathogenic or likely pathogenic association according to the ClinGen database, meaning that they reveal no information regarding disease risk. Fig 1 shows the RAF in 1000GP and ExAC for these 74 SNPs.
Table 1

Cross tabulation of bi-allelic autosomal SNPs across eight arrays.

| | Axiom PMRA | Axiom UK Biobank | Axiom Biobank | Illumina GSA | Omni Express | 23&Me | Genes for Good | Exome Array |
|---|---|---|---|---|---|---|---|---|
| Axiom PMRA | 865,720 | 272,701 | 207,468 | 128,503 | 82,373 | 70,227 | 61,240 | 21,941 |
| Axiom UK Biobank | 272,701 | 800,194 | 359,529 | 289,548 | 103,360 | 91,747 | 103,139 | 65,910 |
| Axiom Biobank | 172,088 | 172,088 | 629,487 | 105,807 | 77,132 | 65,734 | 232,406 | 185,863 |
| Illumina GSA | 39,292 | 39,292 | 39,292 | 733,348 | 185,489 | 113,481 | 192,333 | 54,913 |
| Omni Express | 15,905 | 15,905 | 15,905 | 15,905 | 693,518 | 303,948 | 253,917 | 18,683 |
| 23&Me | 10,478 | 10,478 | 10,478 | 10,478 | 10,478 | 510,550 | 128,062 | 15,684 |
| Genes for Good | 8,385 | 8,385 | 8,385 | 8,385 | 8,385 | 8,385 | 540,551 | 233,277 |
| Exome Array | 3,239 | 3,239 | 3,239 | 3,239 | 3,239 | 3,239 | 3,239 | 238,468 |

The numbers highlighted in grey along the diagonal line are for each individual SNP panel. The upper diagonal numbers are the numbers of overlapping SNPs for each corresponding pair. The lower diagonal numbers (shown in italicized font with an underline) are the cumulative numbers of overlapping SNPs for each corresponding pair. For example, for the second column, there are 865,720 SNPs in Axiom PMRA array, among which 272,701 are also present in Axiom UK Biobank array, among the 272,701, 172,088 are also in Axiom Biobank array, and among the 172,088, 39,292 are also on Illumina GSA array, etc; and eventually, 3,239 are shared across all eight arrays.
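The two kinds of numbers in Table 1 correspond to two simple set operations, sketched below. The function names and the representation of each array as a Python set of SNP identifiers are assumptions made for illustration; the toy sets in the usage note are not real array content.

```python
from itertools import accumulate


def pairwise_overlap(first, second):
    """Upper-diagonal entry: number of SNPs present on both arrays."""
    return len(first & second)


def cumulative_overlap(arrays):
    """Lower-diagonal entries: size of the running intersection.

    arrays is a list of (name, set_of_snp_ids) in table order. The first
    value is the size of the first array; each later value counts the SNPs
    shared by that array and all arrays before it.
    """
    names = [name for name, _ in arrays]
    running = accumulate(
        (snps for _, snps in arrays),
        lambda shared, snps: shared & snps,
    )
    return list(zip(names, (len(shared) for shared in running)))
```

With toy inputs such as `[("A", {1, 2, 3, 4}), ("B", {2, 3, 4, 5}), ("C", {3, 4, 6})]`, the cumulative counts shrink as each array is added, mirroring how the table narrows from 865,720 SNPs on the first array to 3,239 shared across all eight.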

Table 2

List of fingerprint SNPs.

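| # | Chr | Pos (b37) | rsID | Ref | Alt | RAF |
|---|---|---|---|---|---|---|
| 1 | 1 | 7,202,190 | rs970973 | T | C | 0.539 |
| 2 | 1 | 34,071,525 | rs1874045 | C | T | 0.571 |
| 3 | 1 | 110,998,854 | rs7514102 | G | A | 0.435 |
| 4 | 1 | 161,479,745 | rs1801274 | A | G | 0.479 |
| 5 | 1 | 183,542,387 | rs2274064 | T | C | 0.489 |
| 6 | 1 | 203,194,186 | rs2297950 | C | T | 0.303 |
| 7 | 1 | 225,534,219 | rs7527925 | T | C | 0.476 |
| 8 | 1 | 248,039,713 | rs3811445 | A | G | 0.608 |
| 9 | 2 | 26,804,247 | rs935172 | T | C | 0.547 |
| 10 | 2 | 101,638,888 | rs3739014 | A | G | 0.607 |
| 11 | 2 | 113,309,473 | rs1545133 | C | T | 0.523 |
| 12 | 2 | 138,420,996 | rs10206850 | A | G | 0.543 |
| 13 | 2 | 191,301,368 | rs9646748 | A | G | 0.485 |
| 14 | 2 | 207,041,053 | rs3732083 | T | C | 0.458 |
| 15 | 2 | 237,149,941 | rs6756597 | C | T | 0.479 |
| 16 | 3 | 14,755,572 | rs6765537 | A | G | 0.391 |
| 17 | 3 | 52,727,257 | rs2289247 | G | A | 0.429 |
| 18 | 3 | 100,963,154 | rs571391 | G | A | 0.652 |
| 19 | 3 | 122,259,606 | rs9851180 | T | C | 0.538 |
| 20 | 3 | 193,209,178 | rs6788448 | T | C | 0.427 |
| 21 | 4 | 42,639,186 | rs898500 | A | G | 0.481 |
| 22 | 4 | 79,443,850 | rs931606 | G | A | 0.519 |
| 23 | 4 | 187,120,211 | rs13146272 | C | A | 0.585 |
| 24 | 5 | 1,065,399 | rs737154 | C | T | 0.525 |
| 25 | 5 | 52,193,287 | rs1531545 | C | T | 0.554 |
| 26 | 5 | 73,339,114 | rs285599 | C | T | 0.394 |
| 27 | 5 | 96,503,523 | rs160632 | C | T | 0.586 |
| 28 | 5 | 150,943,085 | rs2304054 | G | A | 0.465 |
| 29 | 5 | 169,685,163 | rs315717 | C | T | 0.508 |
| 30 | 6 | 31,610,686 | rs1052486 | A | G | 0.499 |
| 31 | 6 | 129,807,629 | rs2229848 | C | T | 0.667 |
| 32 | 6 | 147,680,359 | rs9390459 | A | G | 0.532 |
| 33 | 6 | 167,360,389 | rs2236313 | T | C | 0.375 |
| 34 | 7 | 33,282,577 | rs7793096 | G | A | 0.502 |
| 35 | 7 | 99,757,612 | rs3823646 | G | A | 0.537 |
| 36 | 7 | 141,672,604 | rs10246939 | T | C | 0.476 |
| 37 | 7 | 156,762,248 | rs12919 | G | A | 0.515 |
| 38 | 8 | 1,514,009 | rs2301963 | C | A | 0.477 |
| 39 | 8 | 30,973,957 | rs1800392 | G | T | 0.446 |
| 40 | 8 | 121,228,679 | rs4870723 | A | C | 0.512 |
| 41 | 8 | 143,761,931 | rs2294008 | C | T | 0.306 |
| 42 | 9 | 4,576,680 | rs301430 | T | C | 0.364 |
| 43 | 9 | 15,784,631 | rs1539172 | A | G | 0.478 |
| 44 | 9 | 116,136,198 | rs1043836 | C | T | 0.615 |
| 45 | 9 | 133,927,878 | rs10901333 | A | G | 0.459 |
| 46 | 10 | 6,001,696 | rs3136618 | C | T | 0.507 |
| 47 | 10 | 30,316,208 | rs2185724 | T | C | 0.373 |
| 48 | 10 | 99,498,234 | rs3818876 | G | A | 0.53 |
| 49 | 10 | 124,610,027 | rs1891110 | G | A | 0.528 |
| 50 | 10 | 134,748,331 | rs12781609 | C | T | 0.402 |
| 51 | 11 | 14,246,296 | rs1025412 | G | A | 0.515 |
| 52 | 11 | 33,065,394 | rs1064005 | C | T | 0.38 |
| 53 | 11 | 73,785,326 | rs4453265 | T | C | 0.476 |
| 54 | 12 | 16,397,734 | rs1852450 | C | A | 0.489 |
| 55 | 12 | 58,162,739 | rs703842 | A | G | 0.385 |
| 56 | 12 | 125,467,158 | rs11558556 | C | T | 0.361 |
| 57 | 13 | 33,703,656 | rs495680 | T | C | 0.585 |
| 58 | 13 | 50,141,345 | rs4942848 | G | A | 0.616 |
| 59 | 14 | 23,299,135 | rs1135641 | G | T | 0.464 |
| 60 | 14 | 73,138,189 | rs1060570 | C | A | 0.449 |
| 61 | 14 | 101,350,298 | rs3825569 | T | C | 0.506 |
| 62 | 16 | 4,751,045 | rs863980 | C | T | 0.533 |
| 63 | 16 | 29,998,200 | rs4077410 | A | G | 0.491 |
| 64 | 16 | 56,995,236 | rs1800775 | C | A | 0.459 |
| 65 | 17 | 14,005,439 | rs2159132 | G | A | 0.522 |
| 66 | 17 | 33,749,546 | rs2586514 | A | G | 0.602 |
| 67 | 17 | 57,963,537 | rs1292053 | A | G | 0.446 |
| 68 | 17 | 71,196,809 | rs1026128 | A | G | 0.523 |
| 69 | 18 | 60,027,241 | rs1805034 | C | T | 0.537 |
| 70 | 19 | 4,288,332 | rs888930 | A | G | 0.412 |
| 71 | 19 | 17,394,124 | rs2363956 | T | G | 0.486 |
| 72 | 19 | 49,658,367 | rs3745298 | C | T | 0.459 |
| 73 | 20 | 52,786,219 | rs2296241 | G | A | 0.492 |
| 74 | 22 | 19,951,271 | rs4680 | G | A | 0.462 |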

The resulting 74 SNPs sorted by chromosome and position as reported by build 37 reference genome. The RAF is based on 1000GP.

Fig 1

Reference allele frequency of the selected 80 SNPs.

Reference allele frequency across the five major population groups (African: AFR, European: EUR, Admixed American: AMR, Eastern Asian: EAS and Southern Asian: SAS) and overall as reported by 1000GP and ExAC. Y-axis is the RAF in ExAC.

As shown in Fig 2A, our web tool allows users to do three things: (1) generate one or more QR codes from one or more raw genotype datasets and save the QR codes locally; (2) compare two QR codes to get a report on the concordance of the underlying genotype datasets; (3) generate one's own ID SNPs. The third option is primarily for savvy users, including researchers, who prefer to generate their own ID SNPs instead of using the 80 SNPs that we derived. Fig 2B shows an example report. It is based on genotype datasets for two different individuals; therefore, the concordance is low. The report includes the number of missing SNPs, the overlap of non-missing SNPs, and the type of matches.
Fig 2

The QRC website interface.

A. The interface allows a user to first upload genetic data to generate a QR code and save it to their local computer, and then compare any two QR codes for a concordance check. Researchers can also generate their own ID SNPs. B. A sample report, based on genotype datasets for two different individuals. The report includes the number of missing SNPs, the overlap of non-missing SNPs, and the type of matches.

Discussion

Short tandem repeat (STR) markers have been routinely used for genetic fingerprinting in forensic settings because of the large number of alleles within various populations[21]. However, STRs do have disadvantages, including a high mutation rate, a lack of high-throughput technologies, and the need for large amplification products, which limits their use with degraded samples[22]. In this manuscript, we have presented a method for creating a list of identifying SNPs. This method uses a series of selections, the first being the identification of overlapping SNPs across eight genotyping arrays. The results are further filtered by requiring a MAF above 0.25 in each of the five major continental groups. Additional selections result in just 80 SNPs that uniquely identify individuals across the global population. We have confirmed this uniqueness in a large publicly available genetic database, the UK Biobank. The same procedure can be implemented in other settings to create similar lists that fit a given need. Our list of 80 SNPs has the practical application of reducing the number of SNPs used for comparison when tracking genetic data through the genotyping pipeline. Genotyping vendors currently use their own lists of SNPs for tracking, with Affymetrix reportedly using over 300 markers for sample tracking. Our lower number of markers results in faster comparisons, leading to savings in time and possibly cost, especially over millions of samples as reported by 23&Me. We further implemented the QRC web server (http://qrcme.tech). Its simple and easy-to-use graphical interface allows a user to upload a genetic dataset, which is parsed for the genotypes at the 80 SNPs. The results are then encoded as a QR code that can be attached to a dataset. QR codes from different datasets can also be compared, enabling a check across commercial genotyping companies. This feature has already been implemented in addition to coding and decoding QR codes. 
This methodology can be easily expanded for use by professionals in the genetic field. It is our goal to come up with the most parsimonious list of SNPs that uniquely identifies any single person across the globe through genetic data. However, our purpose is to encode this subset of genetic data into a QR code so that a non-geneticist can use an easy interface to check the concordance of one dataset with another, not for purposes such as forensic testing or paternity testing. Therefore, some level of uncertainty is tolerated. We further added SNPs that can be used to predict ABO blood type and sex, so that a single genotype dataset alone can still provide some useful information for validating the data to some extent. It is our hope that the genetic community will work together to identify a robust method and agree upon an omnibus list of SNPs that could be used through a user-friendly interface such as the one presented in QRC.

References (10 of 22 shown)

1.  Million Veteran Program: A mega-biobank to study genetic influences on health and disease.

Authors:  John Michael Gaziano; John Concato; Mary Brophy; Louis Fiore; Saiju Pyarajan; James Breeling; Stacey Whitbourne; Jennifer Deen; Colleen Shannon; Donald Humphries; Peter Guarino; Mihaela Aslan; Daniel Anderson; Rene LaFleur; Timothy Hammond; Kendra Schaa; Jennifer Moser; Grant Huang; Sumitra Muralidhar; Ronald Przygodzki; Timothy J O'Leary
Journal:  J Clin Epidemiol       Date:  2015-10-09       Impact factor: 6.437

2.  Cohort profile: the Kadoorie Study of Chronic Disease in China (KSCDC).

Authors:  Zhengming Chen; Liming Lee; Junshi Chen; Rory Collins; Fan Wu; Yu Guo; Pamela Linksted; Richard Peto
Journal:  Int J Epidemiol       Date:  2005-08-30       Impact factor: 7.196

3.  Genotyping Informatics and Quality Control for 100,000 Subjects in the Genetic Epidemiology Research on Adult Health and Aging (GERA) Cohort.

Authors:  Mark N Kvale; Stephanie Hesselson; Thomas J Hoffmann; Yang Cao; David Chan; Sheryl Connell; Lisa A Croen; Brad P Dispensa; Jasmin Eshragh; Andrea Finn; Jeremy Gollub; Carlos Iribarren; Eric Jorgenson; Lawrence H Kushi; Richard Lao; Yontao Lu; Dana Ludwig; Gurpreet K Mathauda; William B McGuire; Gangwu Mei; Sunita Miles; Michael Mittman; Mohini Patil; Charles P Quesenberry; Dilrini Ranatunga; Sarah Rowell; Marianne Sadler; Lori C Sakoda; Michael Shapero; Ling Shen; Tanu Shenoy; David Smethurst; Carol P Somkin; Stephen K Van Den Eeden; Lawrence Walter; Eunice Wan; Teresa Webster; Rachel A Whitmer; Simon Wong; Chia Zau; Yiping Zhan; Catherine Schaefer; Pui-Yan Kwok; Neil Risch
Journal:  Genetics       Date:  2015-06-19       Impact factor: 4.562

4.  How many single nucleotide polymorphisms (SNPs) are needed to replace short tandem repeats (STRs) in forensic applications?

Authors:  Hyo-Jung Lee; Jae Won Lee; Su Jin Jeong; Mira Park
Journal:  Int J Legal Med       Date:  2017-02-27       Impact factor: 2.686

5.  Molecular genetic basis of the histo-blood group ABO system.

Authors:  F Yamamoto; H Clausen; T White; J Marken; S Hakomori
Journal:  Nature       Date:  1990-05-17       Impact factor: 49.962

6.  Quality control procedures for genome-wide association studies.

Authors:  Stephen Turner; Loren L Armstrong; Yuki Bradford; Christopher S Carlson; Dana C Crawford; Andrew T Crenshaw; Mariza de Andrade; Kimberly F Doheny; Jonathan L Haines; Geoffrey Hayes; Gail Jarvik; Lan Jiang; Iftikhar J Kullo; Rongling Li; Hua Ling; Teri A Manolio; Martha Matsumoto; Catherine A McCarty; Andrew N McDavid; Daniel B Mirel; Justin E Paschall; Elizabeth W Pugh; Luke V Rasmussen; Russell A Wilke; Rebecca L Zuvich; Marylyn D Ritchie
Journal:  Curr Protoc Hum Genet       Date:  2011-01

7.  ClinGen--the Clinical Genome Resource.

Authors:  Heidi L Rehm; Jonathan S Berg; Lisa D Brooks; Carlos D Bustamante; James P Evans; Melissa J Landrum; David H Ledbetter; Donna R Maglott; Christa Lese Martin; Robert L Nussbaum; Sharon E Plon; Erin M Ramos; Stephen T Sherry; Michael S Watson
Journal:  N Engl J Med       Date:  2015-05-27       Impact factor: 91.245

8.  Evaluating information content of SNPs for sample-tagging in re-sequencing projects.

Authors:  Hao Hu; Xiang Liu; Wenfei Jin; H Hilger Ropers; Thomas F Wienker
Journal:  Sci Rep       Date:  2015-05-15       Impact factor: 4.379

9.  A global reference for human genetic variation.

Authors:  Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal:  Nature       Date:  2015-10-01       Impact factor: 49.962

10.  NIH Precision Medicine Initiative: Implications for Diabetes Research.

Authors:  Judith E Fradkin; Mary C Hanlon; Griffin P Rodgers
Journal:  Diabetes Care       Date:  2016-07       Impact factor: 19.112
