
Assignment of chromosomal locations for unassigned SNPs/scaffolds based on pair-wise linkage disequilibrium estimates.

Mehar S Khatkar, Matthew Hobbs, Markus Neuditschko, Johann Sölkner, Frank W Nicholas, Herman W Raadsma.

Abstract

BACKGROUND: Recent developments of high-density SNP chips across a number of species require accurate genetic maps. Despite rapid advances in genome sequence assembly and availability of a number of tools for creating genetic maps, the exact genome location for a number of SNPs from these SNP chips still remains unknown. We have developed a locus ordering procedure based on linkage disequilibrium (LODE) which provides estimation of the chromosomal positions of unaligned SNPs and scaffolds. It also provides an alternative means for verification of genetic maps. We exemplified LODE in cattle.
RESULTS: The utility of the LODE procedure was demonstrated using data from 1,943 bulls genotyped for 73,569 SNPs across three different SNP chips. First, the utility of the procedure was tested by analysing the masked positions of 1,500 randomly-chosen SNPs with known locations (50 from each chromosome), representing three classes of minor allele frequencies (MAF), namely MAF > 0.05, 0.01 < MAF ≤ 0.05 and 0.001 < MAF ≤ 0.01. The efficiency (percentage of masked SNPs that could be assigned a location) was 96.7%, 30.6% and 2.0%, with an accuracy (the percentage of SNPs assigned correctly) of 99.9%, 98.9% and 33.3% in the three classes of MAF, respectively. The average precision for placement of the SNPs was 914, 3,137 and 6,853 kb, respectively. Secondly, 4,688 of 5,314 SNPs unpositioned in the Btau4.0 assembly were positioned using the LODE procedure. Based on these results, the positions of 485 unordered scaffolds were determined. The procedure was also used to validate the genome positions of 53,068 SNPs placed on the Btau4.0 bovine assembly, resulting in identification of problem areas in the assembly. Finally, the accuracy of the LODE procedure was independently validated by comparative mapping on the hg18 human assembly.
CONCLUSION: The LODE procedure described in this study is an efficient and accurate method for positioning SNPs (MAF>0.05), for validating and checking the quality of a genome assembly, and offers a means for positioning of unordered scaffolds containing SNPs. The LODE procedure will be helpful in refining genome sequence assemblies, especially those being created from next-generation sequencing where high-throughput SNP discovery and genotyping platforms are integrated components of genome analysis.

Year:  2010        PMID: 20370931      PMCID: PMC2859757          DOI: 10.1186/1471-2105-11-171

Source DB:  PubMed          Journal:  BMC Bioinformatics        ISSN: 1471-2105            Impact factor:   3.169


Background

The last decade has seen a rapid expansion in the number of genomes from a diverse range of species being sequenced [1]. Further developments of high-throughput sequencing platforms are likely to accelerate the sequencing of potentially many more genomes [2]. Furthermore, such data sets may be coupled with high-throughput SNP-analysis platforms to undertake population diversity characterization [3,4]. The relatively short sequence reads from the high-throughput systems pose challenges in the creation and ordering of contigs and scaffolds in the absence of a mature reference genome. Ordering closely linked markers is also a challenge using linkage mapping. Assembly of the bovine genome sequence has recently been reported [5]. In the course of bovine sequencing to date, more than 2 million SNPs have been discovered and more SNPs are being added with additional sequencing efforts using next generation sequencing technologies [6], resulting in several high-density SNP-genotyping platforms for population-wide screening of genome diversity. Despite several genome builds, there are still a large number of scaffolds and SNPs that are not yet assigned to any chromosomes. For example, there are 11,869 un-ordered scaffolds in Btau4.0, constituting 9.72% (263.4 Mb) of the bovine genome. In order to improve the genome assembly, it would be useful to assign un-ordered scaffolds and SNPs to chromosomes, and to locations within chromosomes [7]. A number of strategies can be adopted to place polymorphic markers on chromosomes via linkage maps [8-10], Radiation Hybrid maps [11-13], FISH and integrated maps [14]. Linkage studies require genotypic information on specific families, and it is difficult to construct accurate or high-resolution linkage maps for high-density SNP data [10]. Alternatively, physical maps of SNPs, created by screening RH panels, enable high-resolution positioning of SNPs but require high-density anchoring of the physical genome to the assembly. 
However, a SNP can be given a chromosomal position based on the linkage disequilibrium (LD) between that SNP and other SNPs with known positions in the genome. LD analysis does not rely on family information, and LD decays rapidly across [15] and within populations [16]; as such, it can provide a means to accurately position SNPs based on LD relationships with other SNPs with known map positions. Miller et al. [17] applied an LD-based approach to map a test set of SNPs with known map positions. However, the utility of this approach for unmapped SNPs, or SNPs with ambiguous positions in the context of high-density SNP data, has not been demonstrated. Recently we showed [18] that polymorphic markers can be ordered within a chromosome based on pair-wise LD only, and termed this procedure LODE (Locus Ordering by Dis-Equilibrium). A sorting algorithm (sorting points into neighbourhoods) [19] was applied. The procedure was successful in assigning a small number of unmapped SNPs to unique chromosomal locations but was found to be limited in terms of scaling up to large matrices representing dense SNP panels. Here we modify the initial LODE procedure for assigning SNPs to chromosomes and positioning SNPs within chromosomes. First, the efficiency of using genome-wide LD information is investigated by using mapped SNPs as a test set. Next, the procedure is applied to assign positions for 4,688 out of 5,314 unpositioned SNPs on Btau4.0, which were either un-assigned or assigned with ambiguity based on BLAST against Btau4.0, from a high-density SNP panel of 73,569 SNPs. We also suggest the chromosomal locations of un-ordered scaffolds. Finally, the LODE procedure is used to confirm the order of mapped SNPs across the genome as a means to check the quality of the genome assembly.

Methods

Genotypic Data

Data from three SNP genotyping arrays, namely 15 k [20], 25 k (Affymetrix; http://www.affymetrix.com) and 54 k (Illumina; http://www.illumina.com/), used for genotyping 1,536, 441 and 377 Australian Holstein-Friesian (HF) bulls, respectively, were combined into a single dataset for the current analyses. There were duplicate samples and duplicate SNPs within and between datasets. Only unique samples and SNPs with higher call rate (% genotype assignment) were selected to include in the final dataset. Any inconsistent genotype was set to unknown. The final combined dataset represented 73,569 unique SNPs and 1,943 bulls with an average of 628 bulls genotyped per SNP. The mean coefficient of coancestry among these 1,943 bulls is 0.025, with 0.0 and 0.035 for the first and third quartiles, respectively.

Position of SNPs

The location of each of the 73,569 SNPs in the bovine genome was assessed from BLAST alignment of SNP flanking sequences with the Btau4.0 assembly (ftp://ftp.hgsc.bcm.tmc.edu/pub/data/Btaurus/fasta/Btau20070913-freeze/), which includes a considerable quantity of sequence (organised as either a set of scaffolds or as a pseudo-chromosome) that is not assigned to a chromosome (referred to as 'Un'). We used the Btau4.0 assembly to demonstrate the utility of the LODE procedure since the assembly contained a number of SNPs not assigned to chromosomes. LODE positions were also compared against another bovine assembly build, UMD3.0, which has recently become available. SNP positions on Btau4.0 were categorised as follows: i) 'mapped' (single assignment to a chromosome); ii) 'ambiguous' (more than one assignment in the genome); iii) 'Un' (single assignment to 'Un' sequences only); iv) 'unassigned' (no assignments in the genome). Collectively, the last three categories (ambiguous, Un and unassigned) are here called 'unpositioned'.
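The four categories above amount to a simple rule on the number and type of alignments per SNP. The following is a minimal sketch of that rule, assuming each SNP's BLAST result has already been summarised as a list of target names (a chromosome name or 'Un'); the function names and the input format are hypothetical and are not part of the original analysis.

```python
def categorise_snp(hits):
    """Classify a SNP from the targets hit by its flanking sequence.

    `hits` is an assumed summary of BLAST output: one entry per alignment,
    e.g. ["5"], ["Un"], ["3", "17"] or [] when nothing aligned.
    """
    if len(hits) == 0:
        return "unassigned"   # no assignment in the genome
    if len(hits) > 1:
        return "ambiguous"    # more than one assignment in the genome
    # exactly one alignment: either a chromosome or the 'Un' sequence
    return "Un" if hits[0] == "Un" else "mapped"


def is_unpositioned(category):
    """The last three categories are collectively called 'unpositioned'."""
    return category in {"ambiguous", "Un", "unassigned"}
```

Only SNPs for which `is_unpositioned` is true would be passed on to the LODE procedure; 'mapped' SNPs serve as the reference set.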

LODE procedure

The location of each unpositioned SNP was estimated on the basis of its LD (estimated as r2) with mapped SNPs. The r2 estimates were obtained using GOLD [21]. The genotypes for SNPs on the X-chromosome were considered as homozygous for the purpose of computing LD estimates. Only high quality LD estimates (significant at the 0.01 level, and estimated from a minimum of 100 observations) were used. The actual procedure used in the present study is an extension of the strategy first used by Miller et al. [17] and subsequently adapted to the LODE procedure by Sölkner et al. [18]. In the present study, the LODE procedure consisted of two main steps: A) assigning a SNP to a chromosome; B) estimating the position of the SNP within the assigned chromosome. After trialling many combinations of criteria, the following strategy was used. (The relative accuracy of using different threshold combinations is shown in Additional file 1).
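The quality filter on LD estimates described above can be illustrated with a short sketch. Note that this is not the GOLD implementation used in the study: as a stand-in, r2 is computed here as the squared Pearson correlation between genotype dosages (coded 0/1/2), and significance at the 0.01 level is approximated by comparing n·r2 with the chi-square critical value for one degree of freedom. Both choices, and all names, are assumptions made for illustration.

```python
CHI2_CRIT_1DF_P01 = 6.635  # chi-square critical value, 1 d.f., p = 0.01


def genotype_r2(g1, g2, min_obs=100):
    """Return (r2, n) for two genotype vectors coded 0/1/2 (None = missing).

    Returns None if fewer than `min_obs` animals have both genotypes,
    if either SNP is monomorphic in those animals, or if the estimate
    is not significant under the approximate test n * r2 >= 6.635.
    """
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    if n < min_obs:
        return None
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)
    if var_a == 0 or var_b == 0:
        return None
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    r2 = cov * cov / (var_a * var_b)
    if n * r2 < CHI2_CRIT_1DF_P01:
        return None
    return r2, n
```

Pairs that return None are simply left out of the later chromosome ranking, which mirrors the statement that only high-quality LD estimates were used.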

A) Assigning a SNP to a chromosome

For each unpositioned SNP with MAF >0.01:

1. r2 was estimated with all mapped SNPs.
2. From these estimates of r2, two parameters were computed with respect to each chromosome, namely:
   a. maximum r2 (r2max, as an indicator of the strength of LD)
   b. number of mapped SNPs with r2 > 0.1 (n0.1, as an indicator of the number of mapped SNPs in LD with the unpositioned SNP)
3. Chromosomes were then ranked according to r2max and n0.1, in the latter case after excluding chromosomes for which n0.1 < 3. A chromosome with top ranking for both parameters was identified as the candidate chromosome for that unpositioned SNP.

After trialling the above threshold combinations, SNPs with MAF ≤ 0.05 required an additional check to improve the accuracy of placement. In addition to the above strategy (steps 1-3), the chromosome with the next highest r2max was identified. If the r2max of the second chromosome exceeded 2/3 of the r2max of the candidate chromosome, the SNP was not assigned to any chromosome. This improved the accuracy of assignment from 92.1% to 98.9% (Additional file 1). SNPs which did not meet these criteria were left unpositioned.
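The chromosome-assignment rule in step A can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the input is assumed to be a dictionary mapping each chromosome to the list of (already quality-filtered) r2 values between the unpositioned SNP and mapped SNPs on that chromosome, and ties between chromosomes are resolved by taking the first maximum, which the text does not specify. The function and parameter names are hypothetical.

```python
def assign_chromosome(r2_by_chrom, maf, r2_cut=0.1, min_n=3, ratio=2 / 3):
    """Return the candidate chromosome for one unpositioned SNP, or None."""
    if maf <= 0.01:
        return None  # the procedure is applied only to SNPs with MAF > 0.01
    # strongest LD per chromosome
    r2max = {c: max(v) for c, v in r2_by_chrom.items() if v}
    # number of mapped SNPs in LD (r2 > 0.1) per chromosome
    n_ld = {c: sum(1 for x in v if x > r2_cut) for c, v in r2_by_chrom.items()}
    # chromosomes with fewer than 3 such SNPs are excluded from the n ranking
    eligible = {c: n for c, n in n_ld.items() if n >= min_n}
    if not r2max or not eligible:
        return None
    top_r2 = max(r2max, key=r2max.get)
    top_n = max(eligible, key=eligible.get)
    if top_r2 != top_n:
        return None  # no chromosome ranks first on both parameters
    if maf <= 0.05:
        # additional check for low-MAF SNPs: runner-up must not exceed 2/3
        others = [v for c, v in r2max.items() if c != top_r2]
        if others and max(others) > ratio * r2max[top_r2]:
            return None
    return top_r2
```

For example, a SNP with MAF 0.03 whose best chromosome has a maximum r2 of 0.6 and whose runner-up has 0.5 would be left unpositioned, because 0.5 exceeds 2/3 × 0.6 = 0.4.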

B) Estimating position within an assigned chromosome

For each unpositioned SNP that could be assigned to a chromosome, its location on that chromosome was allocated the same position as that of the mapped SNP with which the unpositioned SNP had the maximum r2.

The above LODE procedure was first tested for its ability to determine the location of SNPs whose location was actually known. Three test sets involved determining the location of a total of 1,500 "masked" SNPs (50 from each of the 29 autosomes and the X chromosome, randomly selected from SNPs with known positions). Each set comprised SNPs of a different MAF class, namely 0.001 < MAF ≤ 0.01 (300 SNPs, 10 from each chromosome), 0.01 < MAF ≤ 0.05 (300 SNPs, 10 from each chromosome) and MAF > 0.05 (900 SNPs, 30 from each chromosome). The extent to which the procedure was successful was assessed in terms of "efficiency" (the percentage of "masked" SNPs that were assigned a location), "accuracy" (the percentage of "masked" SNPs that were assigned to the correct chromosome), and "precision" (the difference in physical distance between the known position and the assigned position). After testing the LODE procedure with the above test sets, the same procedure was applied to unpositioned bovine SNPs.
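Step B and the three evaluation measures can be sketched as below. The data layout and function names are hypothetical. Two assumptions are made that the text does not state explicitly: accuracy is expressed as a percentage of the SNPs that were placed, and precision is averaged only over SNPs assigned to the correct chromosome, since a physical distance is not defined across chromosomes.

```python
def place_on_chromosome(ld_hits):
    """Return the position of the mapped SNP in strongest LD.

    `ld_hits` is a list of (position_bp, r2) for mapped SNPs on the
    chromosome already chosen in step A.
    """
    position, _ = max(ld_hits, key=lambda hit: hit[1])
    return position


def summarise(results):
    """Return (efficiency %, accuracy %, mean precision in kb).

    Each result is a dict with keys true_chrom, true_pos, est_chrom and
    est_pos; est_chrom and est_pos are None for SNPs left unpositioned.
    """
    placed = [r for r in results if r["est_chrom"] is not None]
    correct = [r for r in placed if r["est_chrom"] == r["true_chrom"]]
    efficiency = 100 * len(placed) / len(results)
    accuracy = 100 * len(correct) / len(placed) if placed else float("nan")
    if correct:
        total_bp = sum(abs(r["est_pos"] - r["true_pos"]) for r in correct)
        precision_kb = total_bp / len(correct) / 1000
    else:
        precision_kb = float("nan")
    return efficiency, accuracy, precision_kb
```

Under these definitions, 870 placed SNPs out of 900 give an efficiency of 96.7%, and 869 correct placements out of 870 give an accuracy of 99.9%.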

Comparative position on human genome

To provide further evidence of the utility of the LODE procedure, we used a comparative mapping approach to confirm the genome location of unpositioned bovine SNPs against the human genome assembly hg18, since this represents the most complete mammalian genome to date. This approach was considered helpful since the location of the unpositioned SNPs could not be validated on Btau4.0 directly. The comparative position of bovine SNPs was estimated in the human genome using two approaches. Firstly, BLAST was used to align the flanking sequences of unpositioned SNPs with the hg18 assembly ftp://hgdownload.cse.ucsc.edu//goldenPath/hg18/. Secondly, the 'LiftOver' tool http://genome.ucsc.edu/cgi-bin/hgLiftOver was used with default settings to convert LODE positions from the bovine Btau4.0 assembly to the human hg18 assembly.

LODE as a means for checking genome assembly

The LODE procedure was used to recompute the positions of all SNPs mapped to the genome which were genotyped and met minimum criteria for inclusion as detailed above and MAF >0.05. The procedure was performed in batches, where the positions of 10% (every 10th) of SNPs of a chromosome were masked. The positions of the masked SNPs were recomputed based on the LD information of the remaining SNPs in the genome. The chromosomal assignments and positions estimated by LODE were compared with original positions on Btau4.0 and also with UMD3.0.
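The batch-wise masking can be sketched as follows. The sketch assumes that the batches use successive offsets (every 10th SNP starting at the 1st, then at the 2nd, and so on), so that each SNP is masked exactly once; the text states only that every 10th SNP of a chromosome was masked per batch. The function name is hypothetical.

```python
def masking_batches(snp_ids, step=10):
    """Yield (masked, reference) lists for one chromosome.

    `snp_ids` is the list of SNPs on the chromosome in map order.
    In each batch every `step`-th SNP is masked and the rest are kept
    as mapped reference SNPs for recomputing the masked positions.
    """
    for offset in range(step):
        masked = snp_ids[offset::step]
        if not masked:
            continue  # chromosome has fewer SNPs than the step size
        masked_set = set(masked)
        reference = [s for s in snp_ids if s not in masked_set]
        yield masked, reference
```

In a full run, the reference set for a batch would also include all SNPs on the other chromosomes, since the masked positions were recomputed from the LD information of the remaining SNPs in the genome.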

Results

Validation of LODE procedure by test runs

A total of 870 (96.7%) of the 900 test SNPs with MAF>0.05 were allocated a chromosomal position by LODE. All but one (i.e. 869 = 99.9%) of the positions were the same as the Btau4.0 accepted assembly position. The comparison of estimated and known SNP positions (Additional file 2) shows strong agreement (mean Pearson's correlation = 0.98 across all chromosomes). The mean precision of localisation was 914 ± 130 kb (Table 1). The results from alternate criteria that were tested during the development of the preferred strategy are shown in Additional file 1.
Table 1

Efficiency (proportion of SNPs placed), accuracy (proportion of SNPs placed correctly) and precision (kb location from draft assembly location) of the LODE procedure for placing SNPs with known location in three test runs with varying thresholds of MAF of SNPs to be placed.

Test Run | Number of SNPs | Efficiency (%) | Accuracy (%) | Precision (kb)
Run1 (SNPs with MAF > 0.05) | 900 | 96.7 | 99.9 | 914 ± 130
Run2 (SNPs with 0.01 < MAF ≤ 0.05) | 300 | 30.6 | 98.9 | 3137 ± 381
Run3 (SNPs with 0.001 < MAF ≤ 0.01) | 300 | 2.0 | 33.3 | 6853 ± NA
92 SNPs (30.6%) from the second test set (0.01 < MAF ≤ 0.05) were assigned a position [...] with MAF >0.01 with high accuracy.

Application of LODE to unpositioned SNPs

In the Btau4.0 assembly, there are 6,470 'unpositioned' SNPs. Of these, 5,314 SNPs have MAF >0.01, making them suitable for LODE mapping (Additional file 4). Table 2 shows the number of SNPs positioned by LODE. Of the 5,314 'unpositioned' SNPs with MAF >0.01, 2,291 had ambiguous positions, 1,770 were aligned to 'Un' sequences, and 1,253 were unaligned. Using the LODE strategy, 4,688 of the 'unpositioned' SNPs were positioned. Of the 626 SNPs which did not meet the thresholds of the LODE procedure, 231 had ambiguous positions, 271 were aligned to 'Un' sequences and 124 were unaligned. As expected from the test-set results, a higher proportion of the SNPs with MAF >0.05 (94.2%) than with 0.01 < MAF ≤ 0.05 (27.6%) were positioned.
Table 2

Number of 'unpositioned' SNPs assigned to specific chromosomes by LODE.

SNP set | ambiguous (multi-hits against Btau4.0) | 'Un' (assignment to 'Un' sequence) | unaligned (no hit against Btau4.0) | Total 'unpositioned'
SNPs (MAF > 0.05) positioned by LODE | 2006/2082 (96.3) | 1466/1625 (90.2) | 1085/1132 (95.8) | 4557/4839 (94.2)
SNPs (0.01 < MAF ≤ 0.05) positioned by LODE | 54/209 (25.8) | 33/145 (22.8) | 44/121 (36.4) | 131/475 (27.6)
Total SNPs positioned by LODE | 2060/2291 (89.9) | 1499/1770 (84.7) | 1129/1253 (90.1) | 4688/5314 (88.2)
All 'unpositioned' SNPs | 2291 | 1770 | 1253 | 5314

Figures in parenthesis are per cent.

Of 2,291 SNPs in the ambiguous category, 2,060 were positioned by LODE. The SNPs in this category had multiple hits when the SNP flanking sequence was BLASTed against Btau4.0. Although it is possible that some of the sequence alignment positions in this category may be the result of errors in the Btau4.0 assembly, it is more likely that they are genuine genomic positions reflecting structural polymorphisms or segmental duplications. The SNP positions estimated by LODE are approximations; hence, for the SNPs in this category, it may be preferable to use LODE positions to discriminate between the multiple sequence-alignment results, and to use the sequence alignment consistent with LODE for final positioning.

Of 1,770 SNPs belonging to 'Un' sequences, 1,499 were positioned by LODE. These SNPs belong to 494 unique "Un" unordered Btau4.0 scaffolds. Assignment of these SNPs to definite chromosomes suggests the assignment and positions of the respective "Un" scaffolds to the same chromosomes as well. Table 3 presents the number and length of these "Un" scaffolds assigned to different candidate chromosomes. These assigned scaffolds comprise 87.7 Mb of genome sequence in total. There were multiple SNPs on some of the "Un" scaffolds. Of these, 210 "Un" scaffolds had two or more SNPs (mean = 5.04) with all the SNPs aligned to one chromosome (Additional file 5). These 210 "Un" scaffolds with multiple SNPs could be assigned, and some of them could be oriented on the chromosome, based on the SNP position estimates. This approach may therefore be very useful for improving the bovine assembly, since it provides for a higher-resolution assignment of SNPs and scaffolds.
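The step from SNP positions to scaffold placement can be illustrated with a small sketch. It assumes that, for each "Un" scaffold, the offset of every SNP within the scaffold and its LODE chromosome and position are available. The orientation rule used here (LODE positions must be monotonic along the scaffold) is an assumption made for illustration; the text states only that some scaffolds could be oriented from the SNP position estimates. All names are hypothetical.

```python
def place_scaffold(snps):
    """Return (chromosome, orientation) for one scaffold.

    `snps` is a list of (offset_on_scaffold, lode_chrom, lode_pos).
    The chromosome is None when the SNPs disagree (or the list is empty);
    the orientation is "+", "-" or None when it cannot be determined.
    """
    chroms = {chrom for _, chrom, _ in snps}
    if len(chroms) != 1:
        return None, None  # SNPs point to different chromosomes
    chrom = chroms.pop()
    ordered = sorted(snps)  # order SNPs along the scaffold
    diffs = [b[2] - a[2] for a, b in zip(ordered, ordered[1:])]
    if diffs and all(d >= 0 for d in diffs) and any(d > 0 for d in diffs):
        orientation = "+"   # LODE positions increase along the scaffold
    elif diffs and all(d <= 0 for d in diffs) and any(d < 0 for d in diffs):
        orientation = "-"   # LODE positions decrease along the scaffold
    else:
        orientation = None  # single SNP, tied or inconsistent positions
    return chrom, orientation
```

A scaffold whose SNPs point to two different chromosomes returns no assignment here; such cases are candidates for closer inspection of the scaffold itself.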
Table 3

Number of SNPs and unassigned scaffolds ("Un") assigned to different chromosomes by the LODE procedure.

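Chromosome | No. of SNPs assigned | No. of "Un" scaffolds assigned | Length of "Un" scaffolds in bp
1 | 290 | 31 | 3632458
2 | 213 | 12 | 1399643
3 | 219 | 13 | 2375137
4 | 176 | 15 | 1177590
5 | 198 | 9 | 2050871
6 | 235 | 18 | 3492179
7 | 212 | 17 | 2416408
8 | 197 | 22 | 3299957
9 | 183 | 18 | 2257082
10 | 149 | 16 | 2017247
11 | 152 | 17 | 1655602
12 | 215 | 16 | 4401081
13 | 152 | 13 | 2066119
14 | 228 | 12 | 4702182
15 | 139 | 18 | 3049502
16 | 212 | 24 | 5697862
17 | 117 | 11 | 845416
18 | 95 | 11 | 1530906
19 | 97 | 11 | 1769762
20 | 75 | 7 | 559988
21 | 115 | 20 | 2921283
22 | 87 | 10 | 1083985
23 | 84 | 5 | 829315
24 | 86 | 6 | 607772
25 | 62 | 5 | 398973
26 | 121 | 21 | 2117068
27 | 66 | 6 | 392536
28 | 94 | 11 | 1571243
29 | 126 | 12 | 1981261
X | 293 | 78 | 25445183
Total | 4688 | 485 | 87745611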
There were 9 scaffolds with multiple SNPs that were given positions on two chromosomes by LODE. This may indicate problems in the assembly of these scaffolds themselves, and the segments carrying the separate SNPs may need to be placed separately to improve the accuracy of the genome assembly.

Of 1,253 SNPs in the unaligned category, 1,129 were positioned by LODE. These sequences are missing from the Btau4.0 assembly, possibly because of the nature of whole-genome shotgun sequencing, or because they are within polymorphic regions not present in the two individuals which contributed to Btau4.0 but present within the population with which we have worked. In summary, the LODE procedure has positioned 4,688 of 5,314 SNPs that are unpositioned in the Btau4.0 assembly.

Validation of LODE positions by comparative mapping

Unique (single location) positions on the human hg18 assembly were obtained from the BLAST and LiftOver procedures for 284 SNPs from the panel of 4,688 SNPs positioned by LODE. The chromosomal assignments for 230 (81%) of these SNPs were identical between BLAST and LiftOver. 54 (19%) of the 284 SNPs had different chromosomal assignments on hg18 by the two above procedures, which may be due to the LODE positions being outside of conserved syntenic blocks between bovine and human chromosomes. Such blocks are normally very small and quite variable in length. Comparison of the chromosomal positions of the 230 SNPs, with same chromosomal assignments, shows very strong agreement (cor = 0.95) (Figure 1) between the positions obtained through BLAST and LiftOver. These results support the accuracy and utility of the LODE procedure for positioning SNPs with MAF>0.01.
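The comparison between the two sets of human positions reduces to counting matching chromosomal assignments and correlating the positions of the matching SNPs. A minimal sketch follows, assuming each SNP is represented by a pair of (chromosome, position) tuples, one from BLAST and one from LiftOver; the function names and data layout are hypothetical, and positions are pooled across chromosomes as in Figure 1.

```python
from math import sqrt


def compare_assignments(pairs):
    """Return (positions of SNPs with matching chromosome, % matching).

    `pairs` is a list of ((chrom_blast, pos_blast), (chrom_lift, pos_lift)).
    """
    same = [(b[1], l[1]) for b, l in pairs if b[0] == l[0]]
    pct_same = 100 * len(same) / len(pairs)
    return same, pct_same


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

With 230 matching assignments out of 284 SNPs, `compare_assignments` would report 81%, and `pearson` applied to the two position lists of the matching SNPs gives the kind of correlation quoted above.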
Figure 1

Comparison of SNP positions on human assembly obtained with BLAST of flanking sequence of SNPs vs the positions obtained by 'LiftOver' of bovine LODE positions. The comparisons for positions are shown in a single figure for the 230 SNPs combined across all chromosomes.


Checking genome assembly with LODE

Table 4 shows a comparison of the chromosomal assignments of SNPs repositioned by LODE. Out of 54,062 SNPs tested, 53,068 (98.16%) could be given a chromosomal assignment by LODE, confirming the high efficiency seen in the pilot test batch of 1,500 SNPs. Most of the SNPs (99.9%) were given the same chromosomal assignments, which indicates in general a high level of integrity of Btau4.0. A total of 81 SNPs were found to have different chromosomal assignments by LODE compared with Btau4.0.
Table 4

Chromosome-wise summary of repositioning done by LODE.

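Chromosome | No. of SNPs tested | No. of SNPs assigned to same chromosome | No. of SNPs assigned a different chromosome | No. of SNPs not assigned
1 | 3635 | 3585 | 2 | 48
2 | 2774 | 2732 | 3 | 39
3 | 2612 | 2551 | 12 | 49
4 | 2517 | 2470 | 4 | 43
5 | 2260 | 2221 | 4 | 35
6 | 2597 | 2570 | 1 | 26
7 | 2244 | 2204 | 4 | 36
8 | 2378 | 2344 | 2 | 32
9 | 1993 | 1958 | 4 | 31
10 | 2281 | 2241 | 10 | 30
11 | 2389 | 2337 | 2 | 50
12 | 1677 | 1646 | 3 | 28
13 | 1909 | 1880 | 3 | 26
14 | 1771 | 1747 | 1 | 23
15 | 1808 | 1781 | 2 | 25
16 | 1654 | 1627 | 0 | 27
17 | 1661 | 1626 | 0 | 35
18 | 1474 | 1436 | 2 | 36
19 | 1577 | 1538 | 2 | 37
20 | 1616 | 1603 | 0 | 13
21 | 1377 | 1343 | 2 | 32
22 | 1313 | 1279 | 2 | 32
23 | 1258 | 1223 | 2 | 33
24 | 1339 | 1319 | 1 | 19
25 | 1087 | 1065 | 1 | 21
26 | 1062 | 1042 | 2 | 18
27 | 1014 | 969 | 1 | 44
28 | 1004 | 980 | 1 | 23
29 | 1070 | 1040 | 0 | 30
X | 711 | 630 | 8 | 73
Total | 54062 | 52987 | 81 | 994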
Table 4 shows the distribution of these 81 SNPs mapped to different chromosomes. Among these SNPs, 5 blocks can be noted, as shown in Table 5. All the SNPs of these blocks were assigned to a different chromosome by LODE. The positions of these SNPs were compared with another recently released assembly of the bovine genome (UMD3.0), which agrees with the LODE assignments for the SNPs in the blocks (Table 5). These blocks suggest problem areas within the Btau4.0 assembly. The overall agreement between LODE positions and SNP positions on Btau4.0 is shown in Figure 2 by way of an Oxford grid. The detailed alignment of LODE positions and Btau4.0 for each chromosome is shown in Additional file 6. This identifies chromosomal regions which may be potential problem areas in the Btau4.0 assembly. In particular, two regions (10-11 Mb and 90-120 Mb) on BTA5 suggest problem areas in the assembly of this chromosome (Figure 3). Similarly, the X chromosome shows several regions where a relatively high number of SNPs differ between the original Btau4.0 positions and the LODE positions, which may suggest a general problem in the assembly of the X chromosome (Additional file 6).
Table 5

List of SNPs assigned a different chromosome by LODE as compared to Btau4.0 and 5 problem regions in Btau4.0 identified by LODE.

SNP Assay ID | Btau4.0 chromosome | Btau4.0 position (bp) | LODE chromosome | LODE position (bp) | UMD3.0 chromosome:position (bp) | Problem region

ARS-BFGL-NGS-2676611140100382690896782:71023597

1861431114422509515472621641:142952903

BTA-45006-no-rs2107603814340493954:33987224

187617125676277110430367812:54560749

1873289294496897115928309011:61135811

ARS-BFGL-NGS-3234433032914511365361453:27963904

ARS-BFGL-NGS-11579737748873121262969142:1182779891

34348737771994821251136982:1180489611

35166837775782721206535542:1180111421

ARS-BFGL-NGS-8918337778683421251532972:1179821341

ARS-BFGL-NGS-31737782271221233663772:1179462101

187142437785910521218593312:1179098171

ARS-BFGL-NGS-705413127719186226052830222:607854912

18701283127750012225773927822:607552752

ARS-BFGL-NGS-586763127769201226129819622:607360892

ARS-BFGL-NGS-28073127874807225433504722:606344002

ARS-BFGL-NGS-191243127908628226135878122:606004902

ARS-BFGL-NGS-88074260777711409761634:2676685

ARS-BFGL-NGS-1137984444519510302528574:4340312

ARS-BFGL-NGS-15424483768798147861257914:81878266

344491411002820023280373744:106890553//8:91314567

18675665533106211209059475:5049476

186800159459814714115254895:88617706

BTA-110801-no-rs5967422471164082201:16094228

BTA-74608-no-rs510291413910664346995:96344375

4649726368201929495348916:37434546

Hapmap38285-BTA-11007772555730819511600267:27891275

18676477608367761869439707:63367359

BTB-00316147762064599171336203317:12137554

BTA-98858-no-rs777340598X69412698X:121600055

18731468740026616383147518:7357495

464622870460363182600093611:9122777//18:24282247//8:67661725

BTA-97336-no-rs9382390711251089469:4581947

186549395493596611227280509:53194372

BTB-018068569813761211619836361:62607089

1866405910332873913684550779:100810058

ARS-BFGL-NGS-92893103905779211096766210:4340696

ARS-BFGL-NGS-7314910155137221352901381:302599433

BTA-97531-no-rs10155781911329514331:303333603

BTB-0149027010156591831313429471:304148713

18749461017857141111815387310:17528963

BTA-88378-no-rs10624667434713725754:714130104

Hapmap28362-BTA-4808610624871854682386974:714334524

ARS-BFGL-NGS-5316710627125214695907504:716035304

BTA-109199-no-rs10627341494714576384:716251584

187551110627410804726269614:716320894

ARS-BFGL-BAC-66421173615194377154954:371090344

1861989119891972329029474611:95497623

ARS-BFGL-BAC-5532121446010225175928412:15518882//25:1416859

3493001241051742116422696011:65590436

1874001126950692984784125912:72414147

BTB-01588251133055670184533404318:45474850

344445136518378848558129213:65259657

464345138252035275023170613:82162818

ARS-BFGL-NGS-31537146499527414131550514:69103642

18758551529419479204651572915:31389253

BTA-113745-no-rs15817267952129016702:15780878

352984185722202246613906818:57677863

345202186449227679329406518:64408080

ARS-BFGL-NGS-772781934916761264191732026:46541332

ARS-BFGL-NGS-673961955603678201001694319:54645063

18594672160422341207159699621:61747881

ARS-BFGL-NGS-96122164787627104039651221:66220013

4612402254344094173131989722:53552948

4637482260828419118494287022:59688782

4659542342051287174253522523:41166590

1860554234442802127438682323:43645744

BTA-58638-no-rs2455379446186445585924:53790841

ARS-BFGL-NGS-5537425287951599434579079:42792078

1860594261098439863195770426:10567934

ARS-BFGL-NGS-224092632419852194165529819:43295532

ARS-BFGL-NGS-102734276298951233151136923:25899622//23:26148979

ARS-BFGL-NGS-4017028514942615899040428:6888276

1864508X489189438457694438:429697395

ARS-BFGL-NGS-109695X489527608462950848:430035295

1862184X489658348466793908:430163965

BTB-01044512X489885568453673008:430391135

BTB-01631465X67394831263068990726:32336734

1862196X870598582257016127X:146942067

1863291X87885269837568243X:138709703

ARS-BFGL-NGS-10360X883388741117475323X:137429436
Figure 2

Comparison of repositioned locations of SNP by LODE with original location on Btau4.0. X-chromosome is labelled as 30.

Figure 3

The comparison of chromosomal assignments of 1776 SNPs repositioned by LODE procedure with original positions on Btau4.0 on chromosome 5. Two potential problem regions (10-11 Mb and 90-120 Mb) on Btau4.0 on this chromosome can be noted.


Discussion

In this study we reported and validated a procedure to accurately and efficiently map SNPs based on LD information. The LODE procedure offers particular advantages in the positioning of problem SNPs for which no unambiguous assignment on a draft genome assembly could be made, as well as a means for positioning unordered scaffolds containing SNPs. Miller et al. [17] used a genetic-algorithm-based approach and linkage disequilibrium to position a test set of bovine SNPs with known location, and applied a minimum threshold of r2 >0.4 between SNPs in their method. Application of such a threshold would have resulted in lower efficiency (71% for SNPs in test Run1, MAF >0.05) and slightly lower accuracy (2 mis-assignments) when compared to the thresholds adopted in our study (Table 1). The LODE procedure showed greater utility than the method described by Miller et al. [17], who did not demonstrate the placement of SNPs with MAF <0.05, SNPs with ambiguous assignments or unpositioned SNPs. The original LODE procedure of Sölkner et al. [18] was of similar accuracy and efficiency in small test runs, but has severe limitations in terms of computing time (Sölkner et al. in preparation) imposed by matrix dimensions that grow with marker density, thus limiting application to full genome analyses. The MAF of the SNP to be placed has a significant effect on the efficiency of the LODE procedure, as shown in detail in the Results section by running the three test sets of varying MAF (Table 1). However, despite the lower efficiency, the accuracy of the LODE procedure for SNPs with 0.01 < MAF ≤ 0.05 [...] 10]. Finally, the LODE procedure can be used for checking the integrity of an assembly by sampling and reassigning the positions of SNPs, as shown in the Results section.
The LODE procedure described here is complementary to other commonly used methods to assemble maps, including linkage maps and physical maps such as fluorescence in situ hybridization (FISH) and Radiation Hybrid mapping [13], but offers significant advantages over these methods since they are very laborious, may have limited resolution, and often require highly specialized resources [22-25]. The comparative advantages and limitations of using LODE mapping are discussed in detail below. The building of linkage maps for genome assemblies has the advantage that de novo ordering of markers can result in robust framework maps, but such maps require information from often large and specific resource populations. Indeed, linkage maps using a broad range of markers have been assembled for many species (cattle [8], pig [26], sheep [27], mouse [28], chicken [29] and human [30]). In the case of mouse, SNP markers could be placed at a resolution and accuracy of 0.3 Mb by linkage mapping [31]. Most resource populations do not have sufficient power to treat each marker in a high-density map as a framework reference point (anchor marker), as described by Ott et al. [32] in their guidelines for developing linkage maps. Recently, Arias et al. [10] reported on the construction of a bovine hybrid linkage map by combining linkage and physical map (Btau4.0) information. However, of the 9,713 SNPs genotyped, 2,946 (30.3%) could not be assigned to the linkage map for quality control reasons. Furthermore, 743 (9.4%) of the 7,822 markers assembled for mapping could not be positioned. In contrast, the LODE procedure was able to place 4,688 out of 5,314 SNPs in a data set of 73,569 SNPs, which is the largest panel of bovine SNPs that can currently be assembled from commercially available SNP arrays. Integrated maps and comparative maps are frequently used to build interim maps for a species in the absence of a completed genome assembly [33,14,34].
The BLAST procedure is commonly used to align sequences and, when combined with LiftOver, can be used to make inferences about marker position and order. However, this procedure is highly inefficient when compared to direct mapping such as LODE. For example, out of 4,688 SNPs successfully mapped by LODE, only 230 would have been mapped successfully using BLAST and LiftOver from the human assembly to the bovine assembly. Lewin et al. [7] highlighted the limitations and conundrum of using comparative mapping information for building maps and emphasised the importance of developing independent species-specific maps for the discovery of conserved chromosome segments and evolutionary breakpoint regions. Despite the array of tools available for constructing genetic and physical maps, a large number of SNPs and scaffolds remain unpositioned, which is likely to be common for most species in which genome assembly is being undertaken (chicken [35], dog [36], cat [37], pig [38] and many other species [1]). As such, the LODE mapping procedure offers a significant additional tool for completing genome maps and assemblies. The LODE procedure relies on linkage disequilibrium information from unrelated samples from the population and does not require a specific resource population. A reliable estimate of r2 can be obtained from a minimum sample size of 75 unrelated individuals [16], which can be found in many diversity and association studies. However, despite the high degree of accuracy of placement, the LODE procedure still only provides an approximation of the exact localisation (precision) of SNPs within a chromosome, since it is dependent on the accuracy of the prior genome assembly as a reference framework and on the density of known SNPs to allow positioning of unknown SNPs. Hence, the precision of positioning SNPs with the LODE procedure will increase with increasing SNP density and accuracy of the sequence map.
However, the quality of the assembly can be assessed by using the LODE procedure to confirm the location of SNP markers with assigned positions, which provides an independent cross-check, as shown in the Results section. The initial marker density required for the LODE procedure to be effective will depend upon the extent of LD in the population, which is often population specific. Where no reference positions are available, as in de novo genome sequencing and mapping, using D' as a measure of LD will be useful for LODE mapping (Sölkner et al., in preparation). It is recommended always to test the LODE strategy on a panel of mapped SNPs with known positions before applying the procedure to unmapped SNPs and, if necessary, to alter some of the threshold criteria. The LODE procedure can also be very helpful in refining sequence and genetic maps for species where comparative genome assemblies are used to build a virtual assembly for the species of interest, as has recently been done for sheep [33]. Population-wide (across or within breeds) LD information from high-density SNP data (see the ISGC website http://www.sheephapmap.org/) can be used to place and validate SNP locations, and to order unplaced scaffolds where they contain SNPs with appropriate genotype information. The LODE procedure is likely to be of significance in the future as developments in next-generation sequencing technologies are providing deep sequencing coverage at an affordable price [39-41]. These platforms generally provide enormous information on new SNPs from short sequence reads [4,6], but at present these short reads can only be assembled into short scaffolds. With the advent of ultra-high genotyping platforms [42], genotyping these SNPs will allow LODE to integrate these short sequence scaffolds into the existing map information.
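The recommended test on a panel of mapped SNPs can be summarised with the two measures used in this study: efficiency (the percentage of masked SNPs that could be assigned a location) and accuracy (the percentage of assigned SNPs that were assigned correctly). The sketch below assumes, for illustration only, that "correct" means agreement at the chromosome level; the function name and the dictionary layout are hypothetical.

```python
def evaluate_masked(truth, predicted):
    """Return (efficiency %, accuracy %) for a masked-SNP test.

    truth:     {snp_id: known chromosome}
    predicted: {snp_id: assigned chromosome, or None if not assigned}
    Assumption: an assignment counts as correct when the chromosome matches.
    """
    assigned = {
        snp: chrom
        for snp, chrom in predicted.items()
        if snp in truth and chrom is not None
    }
    correct = sum(1 for snp, chrom in assigned.items() if truth[snp] == chrom)
    efficiency = 100.0 * len(assigned) / len(truth) if truth else 0.0
    accuracy = 100.0 * correct / len(assigned) if assigned else 0.0
    return efficiency, accuracy


# Toy panel of four masked SNPs: three are assigned, two of them correctly.
truth = {"s1": "1", "s2": "1", "s3": "5", "s4": "X"}
predicted = {"s1": "1", "s2": "3", "s3": "5", "s4": None}
efficiency, accuracy = evaluate_masked(truth, predicted)
```

Running such a check for several threshold combinations, and for each MAF class separately, is one way to decide whether the threshold criteria need to be altered before unmapped SNPs are processed.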

Conclusion

The LODE procedure described in this study is an efficient and accurate procedure for positioning SNPs, and offers a means of positioning unordered scaffolds containing SNPs. The LODE procedure will be helpful in refining genome sequences and checking assemblies, especially those being created from next-generation sequencing, where high-throughput SNP discovery and genotyping of populations are components of genome analysis.

Authors' contributions

MSK conceived the method and the study, contributed to its design, data collection and analysis, and was the primary author in assembling the manuscript. MH provided bioinformatics support in SNP positioning and comparative mapping. MN and JS participated in the method development and manuscript preparation. FWN contributed to the method development, interpretation and manuscript preparation. HWR is project leader and contributed to the project concept, design, interpretation and manuscript preparation. All the authors have read and approved the final manuscript.

Additional file 1

The comparison of different threshold combinations for chromosomal assignments. This file presents the results from alternate criteria that were tested during the development of the preferred strategy for the LODE procedure.

Additional file 2

Chromosome-wise comparison of estimated (LODE) and known positions (Btau4.0) of 869 SNPs allocated a chromosomal position by LODE out of a test set of 900 SNPs (MAF>0.05). This file contains 30 scatter plots, one for each bovine autosome (1-29) and the X chromosome.

Additional file 3

The comparison of estimated (LODE) and known positions (Btau4.0) of 91 SNPs allocated a chromosomal position by LODE out of a test set of 300 SNPs (0.01<MAF≤0.05).

Additional file 4

List of 5,314 SNPs (MAF>0.01) from three SNP chips which were unpositioned on bovine assembly Btau4.0. This table presents the chromosomal positions for 4,688 of the 5,314 SNPs estimated by LODE. The BLAST results against another bovine assembly, UMD3.0, are also given.

Additional file 5

Assignment of 'Un' scaffolds to chromosomes by aligning all SNPs within an 'Un' scaffold to one chromosome by the LODE procedure. This file contains a list of 494 'Un' scaffolds, their chromosomal assignments by LODE, the number of SNPs in each scaffold and the length of each scaffold.

Additional file 6

Detailed chromosome-wise comparison of chromosomal assignments of 52,987 SNPs repositioned by the LODE procedure with their original positions on Btau4.0. This file contains 30 scatter plots, one for each bovine autosome (1-29) and the X chromosome.

1.  A consensus linkage map of the chicken genome.

Authors:  M A Groenen; H H Cheng; N Bumstead; B F Benkel; W E Briles; T Burke; D W Burt; L B Crittenden; J Dodgson; J Hillel; S Lamont; A P de Leon; M Soller; H Takahashi; A Vignal
Journal:  Genome Res       Date:  2000-01       Impact factor: 9.043

2.  An enhanced linkage map of the sheep genome comprising more than 1000 loci.

Authors:  J F Maddox; K P Davies; A M Crawford; D J Hulme; D Vaiman; E P Cribiu; B A Freking; K J Beh; N E Cockett; N Kang; C D Riffkin; R Drinkwater; S S Moore; K G Dodds; J M Lumsden; T C van Stijn; S H Phua; D L Adelson; H R Burkin; J E Broom; J Buitkamp; L Cambridge; W T Cushwa; E Gerard; S M Galloway; B Harrison; R J Hawken; S Hiendleder; H M Henry; J F Medrano; K A Paterson; L Schibler; R T Stone; B van Hest
Journal:  Genome Res       Date:  2001-07       Impact factor: 9.043

3.  Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species.

Authors: 
Journal:  J Hered       Date:  2009-11-05       Impact factor: 2.645

Review 4.  [Extending the capabilities of human chromosome analysis: from high-resolution banding to chromatin fiber-FISH].

Authors:  T Ikeuchi
Journal:  Hum Cell       Date:  1997-06       Impact factor: 4.174

5.  Sorting points into neighborhoods (SPIN): data analysis and visualization by ordering distance matrices.

Authors:  D Tsafrir; I Tsafrir; L Ein-Dor; O Zuk; D A Notterman; E Domany
Journal:  Bioinformatics       Date:  2005-02-18       Impact factor: 6.937

6.  A comprehensive genetic map of the human genome based on 5,264 microsatellites.

Authors:  C Dib; S Fauré; C Fizames; D Samson; N Drouot; A Vignal; P Millasseau; S Marc; J Hazan; E Seboun; M Lathrop; G Gyapay; J Morissette; J Weissenbach
Journal:  Nature       Date:  1996-03-14       Impact factor: 49.962

7.  A 12,000 rad whole genome radiation hybrid panel for high resolution mapping in cattle: characterization of the centromeric end of chromosome 1.

Authors:  C E Rexroad; E K Owens; J S Johnson; J E Womack
Journal:  Anim Genet       Date:  2000-08       Impact factor: 3.169

8.  A comprehensive genetic map of the cattle genome based on 3802 microsatellites.

Authors:  Naoya Ihara; Akiko Takasuga; Kazunori Mizoshita; Haruko Takeda; Mayumi Sugimoto; Yasushi Mizoguchi; Takashi Hirano; Tomohito Itoh; Toshio Watanabe; Kent M Reed; Warren M Snelling; Steven M Kappes; Craig W Beattie; Gary L Bennett; Yoshikazu Sugimoto
Journal:  Genome Res       Date:  2004-10       Impact factor: 9.043

9.  Towards high resolution maps of the mouse and human genomes--a facility for ordering markers to 0.1 cM resolution. European Backcross Collaborative Group.

Authors: 
Journal:  Hum Mol Genet       Date:  1994-04       Impact factor: 6.150

10.  Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution.

Authors: 
Journal:  Nature       Date:  2004-12-09       Impact factor: 49.962
