Whole-genome sequences of DA and F344 rats with different susceptibilities to arthritis, autoimmunity, inflammation and cancer.

Xiaosen Guo, Max Brenner, Xuemei Zhang, Teresina Laragione, Shuaishuai Tai, Yanhong Li, Junjie Bu, Ye Yin, Anish A Shah, Kevin Kwan, Yingrui Li, Wang Jun, Pércio S Gulko.

Abstract

DA (D-blood group of Palm and Agouti, also known as Dark Agouti) and F344 (Fischer) are two inbred rat strains with differences in several phenotypes, including susceptibility to autoimmune disease models and inflammatory responses. While these strains have been extensively studied, little information is available about the DA and F344 genomes, as only the Brown Norway (BN) and spontaneously hypertensive rat strains have been sequenced to date. Here we report the sequencing of the DA and F344 genomes using next-generation Illumina paired-end read technology and the first de novo assembly of a rat genome. DA and F344 were sequenced with an average depth of 32-fold, covered 98.9% of the BN reference genome, and included 97.97% of known rat ESTs. New sequences could be assigned to 59 million positions with previously unknown data in the BN reference genome. Differences between DA, F344, and BN included 19 million positions in novel scaffolds, 4.09 million single nucleotide polymorphisms (SNPs) (including 1.37 million new SNPs), 458,224 short insertions and deletions, and 58,174 structural variants. Genetic differences between DA, F344, and BN, including high-impact SNPs and short insertions and deletions affecting >2500 genes, are likely to account for most of the phenotypic variation between these strains. The new DA and F344 genome sequencing data should facilitate gene discovery efforts in rat models of human disease.

Keywords: BN; DA; F344; Rattus norvegicus; next-generation whole-genome sequencing (NGS); whole-genome sequencing

Year: 2013 PMID: 23695301 PMCID: PMC3730908 DOI: 10.1534/genetics.113.153049

Source DB: PubMed Journal: Genetics ISSN: 0016-6731 Impact factor: 4.562

The laboratory rat (Rattus norvegicus) has been a model organism for the study of human biology and diseases for nearly 200 years (Jacob 1999). Rats differing in susceptibility to disease models and other traits have been extensively studied to better understand human physiology, pharmacology, toxicology, nutrition, behavior, immunology, and diseases such as diabetes, autoimmunity, arthritis, and cancer. These traits have a strong genetic component, making rat models of human disease highly useful for the identification and validation of causative genes and pathways, as well as for testing new therapeutic approaches. The sequencing of the Brown Norway (BN/SsNHsdMcwi) rat genome was a milestone for the identification, positional cloning, and study of disease model and trait regulatory genes. The BN rat genome was first drafted using a strategy that combined bacterial artificial chromosome (BAC) end sequencing, whole-genome shotgun sequencing, and BAC fingerprinting mapping (Gibbs ). The BN rat genome was later expanded and reassembled, leading to the draft assembly RGSC v3.4 (Worley ). The BN strain was chosen because it has been commonly used in many different fields and studies and was also a founder strain for panels of consomic and recombinant inbred rat strains (Worthey ). The DA (D-blood group of Palm and Agouti, also known as Dark Agouti) and the F344 (Fischer) strains have been extensively studied due to their phenotypic differences in complex traits as diverse as nociception and behavior (Brodkin ; Terner ), resistance to infections and parasites (Ishih 1994; Suzuki ; Zhang ), severity of autoimmune and inflammatory diseases such as arthritis (Dahlman ; Sun ; Wilder ), oxygen-induced retinopathy (van Wijngaarden ), muscular strength (Biesiadecki ), bone mineral density (Turner ), taste preference (Tordoff ), cellular phenotypes (Brenner ; Laragione , 2008; Zhang ), metabolic traits (van Den Brandt ), and corticosterone levels (Potenza ). 
These and other complex traits have been mapped in linkage studies, and the Rat Genome Database (http://rgd.mcw.edu) presently curates 257 quantitative trait loci (QTL) in crosses involving DA and 362 QTL in crosses involving F344 rats, including congenics. Yet detailed genomic information for DA and F344 is lacking and would be instrumental for the identification of the genes accounting for each QTL and for the understanding of the genetic regulation of several complex traits. Next-generation whole-genome sequencing (NGS) technology enables ultrahigh depth and high-resolution sequencing projects at a cost significantly lower than the traditional dideoxynucleotide-based capillary method. NGS has been successfully used to resequence the human (Bentley ; Wang ; Wheeler ; Ahn ; Kim ; G. Li ; Fujimoto ; Tong ), mouse (Keane ; Yalcin ), and the spontaneously hypertensive rat (SHR) (Atanur ) genomes. Here we report the high-depth sequencing of the DA and F344 strains using NGS to generate the first two de novo assemblies of the rat genome and the identification of >2 million new variants likely to account for many of the phenotypic differences between DA, F344, and BN.

Materials and Methods

Rats and DNA

DA (DA/BklArbNsi) rats were originally purchased from Bantin and Kingman, transferred to the Arthritis and Rheumatism Branch, National Institute of Arthritis and Musculoskeletal and Skin Diseases, National Institutes of Health, and maintained since 2002 at the Laboratory of Experimental Rheumatology at the Feinstein Institute for Medical Research (formerly North Shore-Long Island Jewish Research Institute) via brother–sister mating. F344 (F344/NHsd) rats were purchased from Harlan Laboratories. Genomic DNA was extracted from the liver of one male DA and one male F344 rat using the phenol–chloroform–isoamyl alcohol method (Strauss 2001). The quantity of DNA was determined using a NanoDrop spectrophotometer (Thermo Scientific), and the integrity was evaluated using electrophoresis.

Construction and sequencing of DNA libraries

Illumina paired-end index libraries were constructed according to the manufacturer’s protocol. Briefly, ∼3 μg of DNA was randomly fragmented by nebulization with compressed nitrogen gas. Overhangs (5′ or 3′) of double-stranded DNA fragments were converted to blunt ends using T4 DNA polymerase and Klenow polymerase. An “A” base was added to the end of double-stranded DNA fragments using exo- Klenow polymerase, followed by ligation to adaptors with a “T” base overhang. After electrophoresis, DNA fragments of 500 bp on average were gel-purified. To minimize bias in library preparation, two DNA libraries were built for each sample. The adaptor-modified DNA fragments were loaded on an Illumina Cluster Station and underwent 10 cycles of bridge amplification PCR to generate sequencing template clusters on flow cells. Samples were processed on the HiSeq 2000 platform (Illumina) according to the manufacturer’s instructions for template hybridization, isothermal amplification, linearization, blocking and denaturing, and hybridization of the sequencing primers. Base-calling was done using Illumina’s pipeline, HiSeq Control Software (HCS) + OLB + GAPipeline-1.6 (Illumina), and the sequences of each lane were generated as 90-bp reads. Data were processed and analyzed according to a pipeline summarized in Supporting Information, Figure S9 and described in detail below.

Reference genome

The rat (R. norvegicus) reference genome (RGSC v3.4) was downloaded from the University of California at Santa Cruz database (http://genome.ucsc.edu/) along with data on gene annotation, ESTs, gaps, repeats, and position of the centromeres. Single nucleotide polymorphisms (SNPs) were downloaded from dbSNP build 136.

Read filtering and mapping

The raw data were refined using two filtering steps: (1) Contaminant filtering: adapter sequences may be introduced into raw reads during the library construction process. Therefore reads containing sequences similar to the adapter (mismatch ≤3) were considered contaminated and discarded, as were reads <30 bp in length. (2) Quality value filtering: to obtain high-quality data, reads with 40% or more low-confidence bases (quality value = 2) were discarded. All cleaned reads were mapped onto the BN rat reference genome using SOAP2.21 (Li ), allowing a maximum of five mismatches for each read. The alignment parameters were the following: -a -b -D -o -2 -u -m -x (-g) -l 32 -s 30 -v 3. Duplicated reads caused by PCR were removed using an in-house C++ script.
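The two filtering steps can be sketched as a single predicate over a read and its per-base quality values. This is only an illustration of the stated thresholds, not the authors' in-house pipeline: the adapter string, the function names, and the choice to treat any quality value at or below 2 as low confidence are assumptions made for the example.

```python
# Hypothetical adapter sequence, used for illustration only.
ADAPTER = "AGATCGGAAGAGC"
MIN_LENGTH = 30               # reads shorter than 30 bp are discarded
MAX_ADAPTER_MISMATCH = 3      # adapter-like windows with <= 3 mismatches count as contamination
LOW_QUAL = 2                  # quality value treated as low confidence (assumed: <= 2)
MAX_LOW_QUAL_FRACTION = 0.40  # reads with 40% or more low-confidence bases are discarded


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def has_adapter(seq: str, adapter: str = ADAPTER) -> bool:
    """True if any window of the read matches the adapter with <= 3 mismatches."""
    k = len(adapter)
    return any(
        hamming(seq[i:i + k], adapter) <= MAX_ADAPTER_MISMATCH
        for i in range(len(seq) - k + 1)
    )


def keep_read(seq: str, quals: list[int]) -> bool:
    """Apply contaminant filtering first, then quality value filtering."""
    if len(seq) < MIN_LENGTH or has_adapter(seq):
        return False
    low = sum(q <= LOW_QUAL for q in quals)
    return low / len(quals) < MAX_LOW_QUAL_FRACTION
```

A read passes only if it is at least 30 bp long, contains no adapter-like window, and has fewer than 40% low-confidence bases.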

Detection of SNPs

To identify SNPs against the reference genome, the genotype probability of each site in DA and F344 was calculated using SOAPsnp (Li ), which is based on the Bayesian statistical model. A consensus sequence (CNS) was generated to contain the genotype with the highest probability for each position. SNPs between the reference sequence and the CNS were considered high-quality SNPs when they fulfilled all of the following criteria: (1) quality value >20 (indicating an inferred base call accuracy >99%); (2) estimated copy number of flanking sequences <2; (3) minimum distance between adjacent SNPs of 5 bp; (4) at least six uniquely mapped reads supporting homozygous SNPs or three for each allele of heterozygous SNPs; and (5) a maximum depth of each site of 75 (depth value was limited to twice the mean depth to avoid incorrect SNP calls supported by reads in repeats). DA and F344 genomic DNA were extracted from male rats, so we considered all SNP sites in chromosome X to be hemizygous and required them to be covered by only two reads.
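The five criteria above translate directly into a filter over candidate calls. The sketch below is a hedged illustration: the `SnpCall` record and its field names are hypothetical, and the chromosome X rule is read here as "at least two supporting reads", which is an interpretation of the text rather than a confirmed detail. The helper `phred_accuracy` shows why a quality value of 20 corresponds to 99% base-call accuracy.

```python
from dataclasses import dataclass


@dataclass
class SnpCall:
    """Hypothetical record for one candidate SNP (field names are illustrative)."""
    chrom: str
    quality: int              # consensus quality value
    copy_number: float        # estimated copy number of flanking sequence
    dist_to_nearest_snp: int  # distance in bp to the closest other SNP
    depth: int                # total read depth at the site
    is_het: bool
    ref_reads: int            # uniquely mapped reads supporting the reference allele
    alt_reads: int            # uniquely mapped reads supporting the variant allele


def phred_accuracy(q: float) -> float:
    """Base-call accuracy implied by a Phred-scaled quality value."""
    return 1 - 10 ** (-q / 10)


def is_high_quality(s: SnpCall) -> bool:
    if s.quality <= 20:              # (1) quality value > 20
        return False
    if s.copy_number >= 2:           # (2) flanking copy number < 2
        return False
    if s.dist_to_nearest_snp < 5:    # (3) adjacent SNPs at least 5 bp apart
        return False
    if s.depth > 75:                 # (5) maximum depth of 75
        return False
    if s.chrom == "X":               # male samples: X treated as hemizygous (assumed >= 2 reads)
        return s.alt_reads >= 2
    if s.is_het:                     # (4) three reads for each allele
        return s.ref_reads >= 3 and s.alt_reads >= 3
    return s.alt_reads >= 6          # (4) six reads for homozygous SNPs
```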

Detection of short insertions/deletions

The clean reads were realigned to the BN genome with SOAP2 set to tolerate gaps of up to 10 bp. Then we clustered mapped read pairs containing gaps in only one end to detect insertions/deletions (indels) of up to 5 bp. Candidate indels overlapping SNP sites were filtered out. The remaining candidate events were considered high-quality indels when supported by 15–55 reads.

Experimental validation of SNPs and indels

Primers were designed to cover 1045 variants (SNPs and indels) on the chromosome 4 locus Cia3d (Brenner ) and on the chromosome 10 loci Cia5a and Cia5d (Brenner ). PCR products were generated using AmpliTaq Gold (Life Technologies) and 10 ng of genomic DNA. Excess primers and dNTPs were removed from the PCR reaction by treatment with Exosap-IT (USB) according to the manufacturer’s instructions. Samples were then diluted to 20–40 ng/μl and sequenced at Genewiz, Inc. (South Plainfield, NJ) using BigDye Terminator v.3.1 on a 3730xl capillary analyzer (Life Technologies). Base calls were manually determined using LaserGene v.8 (Dnastar, Madison, WI).

Detection of structural variation and copy-number variation candidates

We identified structural variation using the paired-end method (Wang ). The accuracy of this method depends on the distribution of the insert size of the DNA library. A Perl script was written to compile the mean and the standard deviation of the insert sizes used for the paired-end mapping. Paired-end reads that could both be aligned but did not meet the insert size and/or orientation inferred from the reference genome were classified as abnormal paired-end reads. Regions supported by at least three abnormal paired-end reads and differing from the inferred insert size by at least 3 standard deviations were considered to contain structural variation. Abnormal paired-end reads were analyzed by clustering, and structural variants were categorized as insertions, tandem or dispersed duplications, deletions, and combinations of inversions and deletions. Segmental duplication or deletion events are also evident as regions of increased or decreased copy number (Yoon ). To locate copy-number variation (CNV) candidates using the alignment results, we first obtained the depth of each base along the reference genome using SOAPcoverage (http://soap.genomics.org.cn/). We then used CNVDetector (Chen ), a program developed by BGI, to calculate the mean depth of 100-bp sliding windows along each chromosome and to select the candidate regions of CNV based on the difference of depth between each consecutive window and the overall mean. Events with a high absolute difference in depth (i.e., outside the 0.75- to 1.25-fold range) and >10 kb were considered an effective CNV candidate. Some candidate regions had to be subdivided because of gaps (N-region) in the BN genome.
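The paired-end criterion can be illustrated with a small sketch: estimate the insert-size mean and standard deviation, flag pairs whose orientation is wrong or whose insert size deviates by at least 3 standard deviations, and keep regions supported by at least three such pairs. The clustering rule (grouping abnormal pairs whose positions lie within `max_gap` of each other) and all names are assumptions for the example; the original Perl script and clustering procedure are not described in that detail.

```python
from statistics import mean, stdev


def insert_size_stats(sizes: list[int]) -> tuple[float, float]:
    """Mean and sample standard deviation of library insert sizes."""
    return mean(sizes), stdev(sizes)


def is_abnormal(insert_size: int, proper_orientation: bool,
                mu: float, sd: float, n_sd: float = 3.0) -> bool:
    """A pair is abnormal if misoriented or >= n_sd standard deviations off."""
    return (not proper_orientation) or abs(insert_size - mu) >= n_sd * sd


def candidate_regions(pairs, mu, sd, min_support=3, max_gap=500):
    """pairs: iterable of (position, insert_size, proper_orientation).

    Returns (start, end, support) for clusters of abnormal pairs with at
    least min_support members. max_gap is a hypothetical clustering distance.
    """
    positions = sorted(p for p, size, ok in pairs if is_abnormal(size, ok, mu, sd))
    regions, cluster = [], []
    for pos in positions:
        if cluster and pos - cluster[-1] > max_gap:
            if len(cluster) >= min_support:
                regions.append((cluster[0], cluster[-1], len(cluster)))
            cluster = []
        cluster.append(pos)
    if len(cluster) >= min_support:
        regions.append((cluster[0], cluster[-1], len(cluster)))
    return regions
```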

Simulation

To evaluate our optimal sequencing depth and the accuracy of our methods, we simulated short reads of different lengths using the BN genome. We also simulated mismatch sequencing errors using a sampling of the quality scores from the DA and F344 sequencing data, as well as SNPs, indels, and structural variants (separately and with occurrence rates of 1 × 10^−3, 1 × 10^−4, and 1 × 10^−5, respectively). The length of indels ranged from 1 to 10 bp and the length of structural variants ranged from 100 bp to 100 kb. The simulated reads were then realigned back to the whole BN genome. We used the rate of misplacement to calculate the sensitivity and specificity to detect SNPs, indels, and structural variation and how their detection rates were affected by coverage and quality scores.
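As a rough illustration of the variant-planting step, the sketch below introduces SNPs into a sequence at a fixed per-base rate and records their positions, so that calls made on simulated reads could later be compared against a known truth set. The function name and the uniform choice of substitute base are assumptions; the simulator actually used is not described at this level of detail.

```python
import random


def plant_snps(seq: str, rate: float, rng: random.Random) -> tuple[str, list[int]]:
    """Substitute bases at the given per-base rate; return the mutated
    sequence and the 0-based positions that were changed."""
    bases = "ACGT"
    out = list(seq)
    positions = []
    for i, b in enumerate(out):
        if b in bases and rng.random() < rate:
            # Always pick a different base so every planted site is a true SNP.
            out[i] = rng.choice([x for x in bases if x != b])
            positions.append(i)
    return "".join(out), positions


# Example: plant SNPs at the simulated SNP rate of 1e-3 with a fixed seed.
rng = random.Random(0)
mutated, truth = plant_snps("ACGT" * 2500, 1e-3, rng)
```

Passing a seeded `random.Random` keeps the simulation reproducible, which matters when sensitivity and specificity are compared across depths.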

GapCloser Tool

The GapCloser tool (http://soap.genomics.org.cn/) (Li ) adopts a greedy algorithm to fill gaps. It extends contig ends iteratively by using reads overlapping with the contig end. Contig-end extension terminates when (1) the extended sequence overlaps with the other contig end at the other side of a gap, (2) an extended sequence with no overlap with the contig end at the other side of gap is 1000 bp longer than the size of the gap in the reference genome, or (3) no reads can be found to make a new round of extension. If extension of one strand fails to close a given gap, GapCloser will perform another extension on the complementary strand.
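The three stopping rules can be made concrete with a toy greedy extender. This is a simplified sketch, not GapCloser itself: exact suffix-prefix matching, the `min_overlap` parameter, the rule of taking the read that extends furthest, and the omission of the complementary-strand retry are all assumptions for illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def close_gap(left: str, right: str, reads: list[str], gap_size: int,
              min_overlap: int = 5, slack: int = 1000):
    """Greedily extend the left contig end across a gap.

    Returns the joined sequence, or None if the gap cannot be closed.
    """
    seq = left
    limit = len(left) + gap_size + slack
    while True:
        n = overlap(seq, right, min_overlap)
        if n:                      # rule 1: reached the contig on the other side
            return seq + right[n:]
        if len(seq) > limit:       # rule 2: overshot the reference gap size by > slack
            return None
        best = ""
        for r in reads:            # greedy choice: read giving the longest extension
            m = overlap(seq, r, min_overlap)
            if m and len(r) - m > len(best):
                best = r[m:]
        if not best:               # rule 3: no read supports another round
            return None
        seq += best


left, gap, right = "ATGCCGTA", "GGATCCTTAG", "CATTGACG"
reads = ["CCGTAGGATCCT", "ATCCTTAGCATTGA"]
print(close_gap(left, right, reads, gap_size=len(gap)) == left + gap + right)
```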

Construction of the DA and F344 genome drafts

The DA and F344 genomes were assembled using the reference-aided assembly method (RAM), a novel strategy for genome assembly based on resequencing data. RAM contains three main steps: (1) construction of semifinished genome, (2) independent de novo assembly to generate contigs and scaffolds, and (3) generation of the genome draft by anchoring scaffolds onto the semifinished genome. Cleaned sequencing reads were aligned onto the BN reference genome using SOAPaligner (Li ) to construct DA and F344 CNS equal in length to the BN genome but tolerating SNPs at a rate of 10^−3. Then gaps in each chromosome’s CNS were closed with each line’s own clean sequencing reads using GapCloser (Li ). Each gap-closed CNS constituted a coordinated, semifinished genome. To obtain the de novo genome assembly of each line, SOAPdenovo (Li ) was used to reassemble clean reads and to generate contigs and scaffolds for DA and F344. Gaps between scaffolds were closed using GapCloser. The final step to obtain the genome drafts of DA and F344 was anchoring the scaffolds onto the semifinished genome. To avoid scaffold contamination, only qualified scaffolds (>200 bp and containing <50% Ns) were selected for anchoring. Tag sequences with a length of 100 bp and containing no Ns were extracted from each end of the qualified scaffolds, with additional tags extracted for scaffolds >5000 bp. Tag sequences were mapped to the semifinished genomes using BLAST (Altschul ), and the aligned tag sequences were filtered according to the following criteria: (a) e-value <1 × 10^−40, (b) identity value >95, (c) alignment length >95, and (d) number of mismatches fewer than five. We used these qualified tag sequences to anchor the high-confidence scaffolds onto the semifinished genome and thus obtain the genome assembly for each strain. 
To evaluate the accuracy of the DA and F344 genome drafts, we retrieved all 194,363 ESTs available in the rat genome and aligned them to the assembled drafts using BLAST, set to cover at least 95% of each EST. We also estimated the single-base error for each genome draft by comparing their sequence to the corresponding positions containing homozygous SNPs at the same strain’s semifinished genome.
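The scaffold and tag thresholds used for anchoring can be written as two small predicates. This is a minimal sketch under stated assumptions: the `TagHit` record and its field names are hypothetical stand-ins for whichever BLAST output fields were parsed, and the identity value is assumed to be a percentage.

```python
from dataclasses import dataclass


@dataclass
class TagHit:
    """Hypothetical summary of one BLAST alignment of a 100-bp tag."""
    evalue: float
    identity: float   # percent identity (assumed scale 0-100)
    aln_length: int   # alignment length in bp
    mismatches: int


def scaffold_qualifies(seq: str) -> bool:
    """Qualified scaffolds are >200 bp and contain <50% Ns."""
    return len(seq) > 200 and seq.upper().count("N") / len(seq) < 0.5


def tag_passes(hit: TagHit) -> bool:
    """Criteria (a)-(d): e-value, identity, alignment length, mismatches."""
    return (
        hit.evalue < 1e-40
        and hit.identity > 95
        and hit.aln_length > 95
        and hit.mismatches < 5
    )
```

With 100-bp tags, requiring an alignment length above 95 and fewer than five mismatches means that only near-full-length, near-identical placements are used as anchors.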

Data access

All reads have been deposited in the European Bioinformatics Institute (EBI)/NCBI Short Read Archive (accession no. SRA046343). All DA and F344 data have been released for public use and can be freely accessed at NCBI’s Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) or at http://dx.doi.org/10.5524/100042. The data set includes all reads, semifinished genome sequences, genome drafts, annotation of variants including SNPs, short indels (1–5 bp), structural variations, and the bioinformatics tools used.

Results

Sequencing

Genomic DNA was extracted from the liver of male DA (DA/BklArbNsi, The Feinstein Institute for Medical Research) and F344 (F344/NHsd, Harlan Laboratories) rats using the phenol–chloroform–isoamyl alcohol method (Strauss 2001). Massively parallel whole-genome sequencing was performed using the Illumina HiSeq2000 sequencing platform. To minimize systematic bias in library preparation, for each genome we prepared two paired-end DNA libraries with a read length of 90 bp and insert sizes of 475–504 bp (Table S1). A total of 1.02 billion reads from the DA and 1.07 billion reads from the F344 genomes were generated, corresponding to 92.03 and 96.80 Gb of sequencing data, respectively. The proportion of high-quality data (Q-score ≥20) obtained for DA was 96.67% and for F344 was 97.63%. The current assembly of the BN reference genome has an effective size of 2.57 Gb. Using the Short Oligonucleotide Alignment Program (SOAP) (Li ), 83.87 Gb of DA and 88.18 Gb of F344 sequence—91.3% of each strain’s reads—aligned with the BN genome. These reads covered 98.9% of the BN reference genome with at least one read and 98.0% with a sequencing depth of three or more reads and resulted in genome-wide average sequencing depths of 32.68-fold for DA and 34.36-fold for F344 reads (Table S2). The sequencing depth did not vary significantly between autosomes, indicating euploidy (Figure S1). Sequencing depth followed a Poisson distribution, and regions of lower depth correlated with extremes of GC content (Figure S2). Regions of sequence ambiguity and breaks between contigs in the BN genome form 876,652 gaps that limit alignment with DA and F344 reads. These gaps contain 267.83 million positions of undetermined sequence (Ns). Using the GapCloser tool (Li ) to bridge gaps with aligned reads, we were able to assign sequences to 59.31 million positions in DA and to 59.70 million positions in F344 and to effectively close 359,392 gaps in DA and 361,412 in F344 (Figure 1, Table S2).
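The reported average depths follow from dividing the aligned sequence by the effective size of the BN reference. The short check below uses the effective (N-free) BN genome length listed in Table 3; the function name is only illustrative.

```python
# Effective (N-free) size of the BN reference genome, from Table 3.
EFFECTIVE_BN_SIZE = 2_566_294_765


def mean_depth(aligned_bases: float, genome_size: int) -> float:
    """Genome-wide average depth = aligned bases / effective genome size."""
    return aligned_bases / genome_size


da_depth = round(mean_depth(83.87e9, EFFECTIVE_BN_SIZE), 2)    # DA: 83.87 Gb aligned
f344_depth = round(mean_depth(88.18e9, EFFECTIVE_BN_SIZE), 2)  # F344: 88.18 Gb aligned
print(da_depth, f344_depth)  # 32.68 34.36
```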

Figure 1

Genetic variation in the DA and F344 genomes. Distribution and frequency of (A) SNPs, (B) short insertion/deletions (InDel), (C) structural variants (SV), (D) copy-number variant (CNV) candidates, and (E) filled gaps along the rat genome (numbers outside the circle represent each chromosome), using the BN genome as reference, are shown. The F344 genome is in dark blue, and DA is in light blue.

SNPs

We used SOAPsnp (Li ) to identify the SNP sets for each inbred line based on the alignment results of all sequencing data with the BN genome sequence. Unreliable sites were excluded from the analysis by filtering the SNPs for quality, copy number, distance between SNPs, number of supporting reads for each allele, and total depth. After filtering, we identified 2,964,158 high-quality nuclear DNA SNPs in DA and 2,973,513 in F344, compared with the BN genome (Figure 2A). We also identified 156 mitochondrial SNPs in DA and 163 in F344. Mitochondrial SNPs had a frequency of 9.8 × 10^−3.

Figure 2

Variation between DA, F344, and BN. (A) DA (blue, light blue) and F344 (red, pink) each had 1.03 million unique homozygous SNPs and shared alleles for 1,786,600 homozygous SNPs (purple, light purple). Forty-eight percent of the strain-specific SNPs and 20% of the shared SNPs were novel and were not present in dbSNP v.136, including 502,994 SNPs with alleles unique to DA (blue), 496,368 SNPs with alleles unique to F344 (red), and 370,879 SNPs for which DA and F344 had the same alleles (purple). (B) DA and F344 had the same alleles for 146,502 homozygous indels; 143,058 were unique to DA, and 149,621 were unique to F344. Most homozygous indels were 1 bp long, and distribution of short insertions and deletions according to size was similar in DA and F344. (C) DA and F344 had the same alleles covering 80% of 30,978 structural variants; 15,151 were unique to DA, and 17,575 were unique to F344. The most frequent structural variants were deletions and insertions, followed by tandem and dispersed duplications. (D) There were 2594 CNV candidates unique to DA, 3611 unique to F344, and 994 identical between DA and F344. Most CNV candidates were in the 10- to 20-kb range (blue: DA; red: F344; purple or gray: shared).

We detected a total of 5,632,694 homozygous SNPs: 2,816,017 in the DA set and 2,816,677 in the F344 set. A total of 2,059,492 homozygous SNPs were polymorphic between DA and F344, and 1,786,600 homozygous SNPs were identical (Table 1, Figure 2A). The frequency of homozygous SNPs was 1.1 × 10^−3 for both strains. More than 1.37 million homozygous SNPs were new and not represented in the dbSNP database (build 136). These novel SNPs included 502,994 SNPs with alleles unique to DA (F344 and BN carried the same allele), 496,368 SNPs with alleles unique to F344 (DA and BN carried the same allele), and 370,879 SNPs with alleles unique to BN (DA and F344 carried the same allele). 
Of the DA and F344 homozygous SNPs, 39.38% mapped to repeat regions, in agreement with the 40% interspersed repetitive DNA described in the rat genome (Gibbs ). To estimate the accuracy of our SNP set, we sequenced three gene regions of chromosomes 4 and 10 using the Sanger method and confirmed 99.68% of 933 homozygous SNPs (Table S3).

Table 1

SNPs and indels in the DA and F344 consensus assemblies

	SNPs				Short indels
Sample	Homozygous	Heterozygous	Known^a^b	Novel^b	Homozygous	Heterozygous
DA^c	1,029,417	91,848	526,423	502,994	143,058	9,461
F344^d	1,030,075	100,545	533,707	496,368	149,621	9,071
Shared^e	1,786,600	56,293	1,415,721	370,879	146,502	511

Total	3,846,092	248,686	2,475,851	1,370,241	439,181	19,043

^a Homozygous SNPs mapping to SNP positions in dbSNP 136.

^b Include only homozygous SNPs.

^c Allele is unique to DA (F344 allele = BN allele).

^d Allele is unique to F344 (DA allele = BN allele).

^e DA and F344 have the same allele, which is different from BN.

Of the SNPs, 5.14% were detected in the heterozygous state, with a genomic distribution rate of 5.9 × 10^−5. Heterozygous SNPs were predominantly detected in regions with high alignment rates (median sequencing depth: homozygous SNPs = 28-fold for DA and 27-fold for F344, heterozygous SNPs = 40-fold for both strains; Figure S3) and mapped to repeat regions at a rate significantly higher than homozygous SNPs (60.02% vs. 39.38%, respectively; P < 0.001, chi-square test), suggesting that improperly aligned reads might account for some of the heterozygous SNP calls. To confirm heterozygous SNPs in these highly inbred strains, 45 heterozygous SNPs were sequenced with Sanger methodology, and 6 (13.33%) were in fact homozygous SNPs, while 39 (86.67%) were false SNP calls (Table S3). Therefore, heterozygous SNPs were not included in subsequent analyses.

Many of the homozygous SNPs detected had the potential to impact gene function, including 422 SNPs predicted to cause loss or gain of start codons, 231 SNPs impacting splicing sites, and 140 SNPs causing loss or gain of stop codons. Additionally, 15,477 SNPs were nonsynonymous, mapping to 3174 Refseq genes and 4724 Ensembl genes in the DA genome and to 3074 Refseq genes and 4632 Ensembl genes in the F344 genome (Table S4 and Table S5). A total of 4.3 million SNPs (88.4% of all homozygous SNPs) were intergenic, intronic, or synonymous.

Indels

We detected indels using SOAP2 (Li ). Indels were defined as alignment gaps of up to 5 bp, supported by three or more nonredundant pairs of reads and present in at least one-third of reads for autosomal indels or in all of the reads for X-chromosome indels. In total, we identified 299,532 indels in DA and 305,705 in F344 (Figure 2B, Table 1). Of the indels, 96.8% were homozygous and had a genomic distribution rate of 1.1 × 10^−4, and 3.2% were heterozygous and had a genomic distribution rate of 3.8 × 10^−6. Insertions or deletions of 1 bp accounted for 67.76% of the indels (Figure S4). Sanger sequencing confirmed 100% of 77 homozygous indels tested (Table S6). While DA and F344 shared 146,502 homozygous indels, 292,679 were polymorphic between these two strains, with 143,058 indels unique to DA and 149,621 unique to F344 (Figure 2B). Most indels were intronic or intergenic (Figure S5), but 605 homozygous indels were predicted to cause codon insertions/deletions or frameshift in coding genes. Of these, 204 transcript-affecting indels were unique to DA, 155 were unique to F344, and 246 were found in both DA and F344 (Table 2).

Table 2

Genetic variation annotation of DA and F344

	SNPs^a			Indels			Structural variant
	DA	F344	Shared	DA	F344	Shared	DA	F344
Intergenic region	658,646	660,019	1,150,467	85,753	91,326	97,035	—	—
Intron	485,087	489,738	837,376	70,958	73,231	75,716	17,516	19,607
Downstream (up to 5 kb)	71,596	71,646	123,030	10,565	10,620	10,722	—	—
Upstream (up to 5 kb)	70,831	71,107	119,219	10,109	10,413	9,853	—	—
3′ UTR	4,094	3,766	6,955	685	723	811	291	305
5′ UTR	752	647	1,061	48	37	47	226	239
Start gained in 5′ UTR	128	122	157	—	—	—	—	—

Coding sequences and splice sites							2572	2829
Synonymous coding	7,146	6,981	12,104	—	—	—	—	—
Nonsynonymous coding	4,230	4,060	7,187	—	—	—	—	—
Frameshift	—	—	—	156	135	217	—	—
Start lost	1	6	8	—	—	—	—	—
Stop gained	35	35	59	—	—	—	—	—
Stop lost	4	1	6	—	—	—	—	—
Splice-site acceptor	30	31	59	35	21	58	—	—
Splice-site donor	26	27	58	32	16	62	—	—
Synonymous stop	3	5	10	—	—	—	—	—
Nonsynonymous start	1	0	2	—	—	—	—	—
Codon deletion	—	—	—	19	10	14	—	—
Codon change + codon deletion	—	—	—	10	2	6	—	—
Codon insertion	—	—	—	12	4	5	—	—
Codon change plus codon insertion	—	—	—	7	4	4	—	—

Within noncoding gene
Nonsynonymous coding	434	428	866	—	—	—	—	—
Synonymous coding	232	228	395	—	—	—	—	—
Stop gained	24	14	27	—	—	—	—	—
Stop lost	6	15	18	—	—	—	—	—
Synonymous stop	3	0	6	—	—	—	—	—
Start lost	3	1	1	—	—	—	—	—
Synonymous start	1	0	0	—	—	—	—	—
Codon change + codon deletion	—	—	—	3	0	2	—	—
Codon deletion	—	—	—	2	1	2	—	—
Codon insertion	—	—	—	1	0	2	—	—
Codon change + codon insertion	—	—	—	0	1	0	—	—

Calculated using SNPEff v.1.9.5 (Cingolani ) and Ensembl’s R. norvegicus build 3.4.64.

^a Homozygous SNPs.

The frequencies of homozygous SNPs along the DA and F344 genomes varied from 0 to 3 × 10^−3 and strongly correlated with that of homozygous indels (Figure 1, Figure S6), suggesting a progressive increase in variation density from shared haplotypes.

Structural variation

We used paired-end alignment to identify structural variation. Regions containing structural variants were detected when read pairs aligned to the reference genome abnormally, differing in orientation and/or inferred insert size with the support of at least three read pairs. We identified a total of 58,174 structural variants: 12,151 unique to DA, 17,575 unique to F344, and 30,978 present in both DA and F344 (Figure 2C, Figure S7, and Figure S8). Deletions and insertions >5 bp were the most frequently detected class of structural variants, followed by tandem duplication, dispersed duplication, and combined insertion–deletion. Structural variants overlapping coding sequences have a high potential to disrupt the function of those genes. In total, 2572 structural variants in the DA and 2829 in the F344 genomes overlapped coding sequences of Ensembl genes (Table 2). In addition, 1398 structural variants in the DA and 1502 in the F344 genomes overlapped coding sequences of RefSeq genes (Table S7). Based on the mean depth of 100-bp sliding windows along each chromosome, we detected 7199 candidate regions of copy-number variation: 2594 unique to DA, 3611 unique to F344, and 994 in both DA and F344 (Figure 1D, Table S8). Seventy-seven percent of copy-number variant candidates were in the 10- to 20-kb range (Figure 2D).
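The depth-based CNV screen can be sketched as follows, using the thresholds given in Materials and Methods (window means outside the 0.75- to 1.25-fold range of the overall mean, events >10 kb). This is not CNVDetector: the use of non-overlapping windows, the merging of consecutive flagged windows, and all names are assumptions made for the illustration.

```python
def cnv_candidates(depth, window=100, low=0.75, high=1.25, min_len=10_000):
    """depth: per-base read depth along one chromosome.

    Returns (start, end, kind) for runs of consecutive windows whose mean
    depth is above `high` or below `low` times the overall mean and that
    span more than min_len bp. Windows are non-overlapping (assumed).
    """
    if not depth:
        return []
    overall = sum(depth) / len(depth)
    calls = []
    run_start, run_state = None, 0
    n_windows = len(depth) // window
    for w in range(n_windows + 1):
        if w < n_windows:
            m = sum(depth[w * window:(w + 1) * window]) / window
            ratio = m / overall
            state = 1 if ratio > high else -1 if ratio < low else 0
        else:
            state = 0  # sentinel window to flush a run that reaches the end
        if state != run_state:
            if run_state != 0:
                start, end = run_start * window, w * window
                if end - start > min_len:
                    calls.append((start, end, "gain" if run_state > 0 else "loss"))
            run_start, run_state = w, state
    return calls
```

For example, a 20-kb stretch at twice the chromosome-wide depth would be reported as a single "gain", while a 5-kb stretch would fall below the length cutoff.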

Sensitivity and specificity

To evaluate the accuracy of read mapping, we generated a variation of the BN genome containing SNPs, indels, and structural variants with frequencies similar to those observed in the DA and F344 sequencing data. We also simulated short reads of different lengths containing mismatch sequencing errors and quality scores similar to those in the DA and F344 sequences. We then aligned the simulated reads back to the BN reference assembly to quantify the precision of alignment for the detection of variants. For an average 35-fold coverage with simulated reads, sensitivity for SNP detection was inversely proportional to read-quality threshold and varied from slightly over 96% for reads with Q = 22 to 96.6% for reads with Q = 15. Specificity for SNP detection was more dependent on sequencing coverage, and it increased from 99.78% with a depth of 1-fold to 99.82% with a depth of 5-fold to 99.94% with a depth of 10-fold (Figure 3A). Sensitivity and specificity for indel detection were similar to those of the simulated SNPs (data not shown).

Figure 3

Accuracy of variant detection. We produced a copy of the BN rat genome with a read coverage of 35-fold and aligned the simulated reads back onto the RGSC3.4 genome scaffold to measure the rate of misplacement. Simulated reads contained simulated mismatch sequencing errors, SNPs, indels, and structural variants to the RGSC3.4 reference at rates similar to those detected in the DA and F344 genomes. (A) SNP detection sensitivity (open circles) was inversely proportional to the read quality threshold. The SNP detection specificity (solid circles) was more dependent on the number of supporting reads. (B) The detection sensitivity for structural variants (open circles) was inversely proportional to the number of supporting reads. The detection specificity for structural variants (solid circles) increased with the number of supporting reads and remained >99% with six or more reads.

Specificity for the detection of structural variants increased sharply with the number of supporting reads from 47% (1–2 reads) to 91% (3 reads) and continued increasing at a lower rate to plateau at 99.68% with seven or more reads (Figure 3B). Sensitivity for the detection of structural variants was inversely correlated with the number of supporting reads, sharply declining from 62.1% (1 read) to 49.92% (3 reads) and then to 47.34% (10 reads).

To generate the DA and F344 genome drafts, we created a new strategy for de novo genome assembly using NGS data: the reference-aided assembly method (Figure 4). Briefly, semifinished genomes were generated for each strain by aligning their reads to the BN genome using SOAPaligner (Li ) to form a consensus sequence, followed by assembly of reads to bridge gaps in the BN genome using GapCloser (Li ). In parallel, contigs and scaffolds were independently assembled for each strain using SOAPdenovo (Li ), followed by closure of gaps between scaffolds using GapCloser. 
Finally, sequences from both ends of each scaffold were mapped onto each coordinated semifinished genome using BLAST to anchor the scaffolds and obtain the DA and F344 genome drafts.

Figure 4

Construction of the DA and F344 genome drafts using the Reference-Aided Assembly Method. The strategy to construct the DA and F344 genome drafts from NGS data consisted of (1) generating a coordinated, semifinished genome, (2) producing a de novo assembly, and (3) anchoring the de novo assembly onto the semifinished genome. Each semifinished genome was created by alignment of reads onto the BN reference genome, inference of a consensus sequence, and closure of gaps (left arm). In parallel, reads were assembled independently into scaffolds, followed by closure of gaps and extraction of tag sequences (right arm). Tag sequences were then mapped onto the semifinished genome using BLAST, anchoring the affiliated scaffolds to finalize each genome draft (bottom of diagram).

The DA and F344 genome drafts include 2,616,053,766 and 2,615,410,193 effective bases and are 1.94% and 1.91% larger than the BN genome, respectively. The DA and F344 genome drafts also contain 49.76 and 49.11 million novel base pairs bridging 391,057 and 401,069 gaps of the BN genome, respectively. Of the novel base pairs, 20.47 million (41.13%) and 19.35 million (39.41%) are in novel scaffolds. In addition, 2.55% and 2.42% more reads could be mapped to the DA and F344 coordinated drafts, respectively, than to the consensus sequences (Table 3).
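The anchoring step, in which tag sequences from both ends of a scaffold are mapped onto the semifinished genome, might look roughly like the Python sketch below. The source does not state the acceptance criteria. The rule used here (both tags hit the same chromosome within a maximum span) and the names `TagHit`, `Placement`, `anchor_scaffold`, and `max_span` are illustrative assumptions, and the hit positions stand in for parsed BLAST results.

```python
from typing import NamedTuple, Optional


class TagHit(NamedTuple):
    """Best hit of one scaffold-end tag on the semifinished genome."""
    chrom: str
    pos: int  # 0-based start coordinate of the hit


class Placement(NamedTuple):
    """Where a scaffold is anchored (hypothetical record)."""
    scaffold: str
    chrom: str
    start: int
    end: int
    reversed_orientation: bool


def anchor_scaffold(name: str,
                    left: Optional[TagHit],
                    right: Optional[TagHit],
                    max_span: int = 1_000_000) -> Optional[Placement]:
    """Anchor a scaffold only if both end tags agree on a location.

    Assumed rule: both tags must hit the same chromosome and lie no
    more than max_span bases apart; otherwise the scaffold stays
    unanchored (None).
    """
    if left is None or right is None:
        return None
    if left.chrom != right.chrom:
        return None
    start, end = sorted((left.pos, right.pos))
    if end - start > max_span:
        return None
    # If the left tag maps downstream of the right tag, the scaffold
    # is placed in reverse orientation.
    return Placement(name, left.chrom, start, end, left.pos > right.pos)


forward = anchor_scaffold("scaf1", TagHit("chr1", 100), TagHit("chr1", 5_100))
reverse = anchor_scaffold("scaf2", TagHit("chr2", 9_000), TagHit("chr2", 4_000))
split = anchor_scaffold("scaf3", TagHit("chr1", 100), TagHit("chr3", 200))
print(forward, reverse, split, sep="\n")
```

Requiring agreement between both ends is one simple way to avoid placing a scaffold on the basis of a single repetitive tag. A real pipeline would also need to weigh alignment scores and multiple hits.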

Table 3

Construction of the DA and F344 genome drafts

	Genome size (bp)			Novel scaffolds (bp)		Reads mapped
	Total	Effective^a	Gaps^b	Total	Effective^a	DA (%)	F344 (%)
BN	2,834,127,293	2,566,294,765	876,652	—	—	91.49	91.84
DA	2,798,712,224	2,616,053,766	485,595	20,558,331	20,465,987	94.03	—
F344	2,793,938,348	2,615,410,193	475,583	19,441,505	19,355,322	—	94.27

^a Genome length without Ns.

^b Number of gaps in each genome draft.
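The derived figures quoted in the text (size increase over BN and number of BN gaps bridged) follow directly from the values in Table 3. The short Python check below recomputes them. The dictionary layout and function names are illustrative, and only numbers printed in the table are used.

```python
# Values copied from Table 3 (effective genome length in bp, gap counts).
TABLE3 = {
    "BN":   {"effective_bp": 2_566_294_765, "gaps": 876_652},
    "DA":   {"effective_bp": 2_616_053_766, "gaps": 485_595},
    "F344": {"effective_bp": 2_615_410_193, "gaps": 475_583},
}


def percent_larger(strain: str, reference: str = "BN") -> float:
    """Percentage by which a draft's effective length exceeds the reference."""
    ref = TABLE3[reference]["effective_bp"]
    return (TABLE3[strain]["effective_bp"] - ref) / ref * 100


def gaps_bridged(strain: str, reference: str = "BN") -> int:
    """Reference gaps no longer present in the strain's draft."""
    return TABLE3[reference]["gaps"] - TABLE3[strain]["gaps"]


for strain in ("DA", "F344"):
    print(f"{strain}: {percent_larger(strain):.2f}% larger than BN, "
          f"{gaps_bridged(strain):,} BN gaps bridged")
```

The results agree with the 1.94% and 1.91% size increases and the 391,057 and 401,069 bridged gaps reported for DA and F344.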

We evaluated the quality of the DA and F344 genome drafts using two methods. First, we retrieved all 194,363 ESTs available for the rat genome and aligned them to the assembled drafts using BLAST, requiring alignments to cover at least 95% of each EST. Of the ESTs, 97.97% aligned to each de novo assembly, and 836 (0.43%) and 1088 (0.56%) ESTs aligned exclusively to novel scaffolds in DA and F344, respectively (Table S9). Second, we estimated the single-base error rates for de novo assemblies by comparing the draft genome sequences to corresponding positions containing homozygous SNPs in the semifinished genome of each strain. The estimated single-base error rate for these two newly assembled drafts was 3.06 × 10^−5 for DA and 2.99 × 10^−5 for F344 (Table S10).
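The EST-based completeness check can be sketched in Python as follows. The source states only that alignments had to cover at least 95% of each EST. Computing coverage as the union of aligned query intervals is an assumption, and `query_coverage` and `passes_est_filter` are hypothetical helper names. The final lines recompute the reported percentages of ESTs found only in novel scaffolds.

```python
def query_coverage(hsps, query_len):
    """Fraction of a query covered by the union of aligned intervals.

    hsps: iterable of (qstart, qend) pairs, 1-based and inclusive,
    such as the query coordinates in BLAST tabular output.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(hsps):
        start = max(start, last_end + 1)  # skip bases already counted
        if end >= start:
            covered += end - start + 1
            last_end = end
    return covered / query_len


def passes_est_filter(hsps, query_len, min_coverage=0.95):
    """True if the EST meets the assumed 95% coverage requirement."""
    return query_coverage(hsps, query_len) >= min_coverage


# Two overlapping alignments that together span a 100-base EST.
print(passes_est_filter([(1, 50), (40, 100)], 100))
# A single alignment covering only 90 of 100 bases.
print(passes_est_filter([(1, 90)], 100))

# Share of the 194,363 ESTs aligned exclusively to novel scaffolds.
TOTAL_ESTS = 194_363
for strain, count in (("DA", 836), ("F344", 1088)):
    print(f"{strain}: {count / TOTAL_ESTS * 100:.2f}%")
```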

Discussion

DA and F344 rats have unique dichotomous phenotypes that have been used to better understand development, human physiology, and disease. DA rats are highly susceptible to autoimmunity, including models of rheumatoid arthritis, multiple sclerosis, and uveitis (Dahlman ; Sun ; Wilder ). DA rats are also susceptible to bladder and tongue carcinomas (Kitano ), have reduced variation in circadian corticosteroid production (Brodkin ), and are more easily addicted to morphine (Brodkin ). F344 rats, on the other hand, are typically resistant to the above conditions, but are susceptible to chemically induced hepatocarcinoma and lymphoma (Lu ; De Miglio ) and have decreased bone mineral density (Turner ). Genetic variation between these strains likely accounts for most of these strain-specific phenotypes. Therefore, sequencing the DA and F344 genomes constitutes a major step toward identifying the genetic causes and pathogenic processes underlying these traits and models of human diseases such as rheumatoid arthritis, multiple sclerosis, and cancer (Table S11) and should facilitate the development of novel disease treatments and biomarkers, as well as new pathways to be tested for disease prevention.

The sequencing of the DA and F344 genomes identified a large number of variants between each of these two strains and BN. The 5.6 million SNPs identified in the DA and F344 genomes increased the total number of known SNPs between these two strains and BN by 150-fold from 19,326 (Saar ) to 2.2 million SNPs between DA and F344 and 2.9 million SNPs between BN and each of the other two strains. Furthermore, 1.37 million SNPs and 0.44 million indels were novel. The addition of these novel variants significantly expands known variation in the rat genome.

A large number of variants were predicted to significantly disrupt gene structure. High-impact variants included deletion of coding sequences, loss of start codons, premature stops, frameshifts, codon insertions/deletions, nonsynonymous SNPs, and changes at splicing sites. In addition to these effects on gene structure, other variants can potentially alter gene expression. Upstream and 5′-UTR variants can disrupt epigenetic regulation and transcription factor-binding sites, 3′-UTR variants can modify messenger RNA stability (Boffa ), intronic SNPs can influence expression breadth (Park ), and synonymous SNPs can affect translation efficiency (Plotkin and Kudla 2011). The DA and F344 genome sequencing provides a detailed framework for future studies aimed at characterizing how these variants alter gene function.

At 32- and 34-fold redundancy, the DA and F344 genomes were assembled at a sequencing depth almost five times that of the BN genome (Gibbs ) and three times that of the SHR genome (Atanur ). The DA and F344 genome assemblies also used stringent quality criteria to define variants. The combination of high quality and high sequencing depth resulted in increased accuracy to detect SNPs and indels, as was confirmed with Sanger sequencing and in silico simulations. The frequency of SNPs was 10-fold higher than that of indels, revealing a SNP/indel ratio similar to that of other resequencing projects (Ahn ; Atanur ). SNPs in mitochondrial DNA were 8.9-fold more frequent than in nuclear DNA, in agreement with its 9–25 times higher mutation rate (Lynch ). A small percentage of the SNPs (5%) and indels (3%) were detected as heterozygous, and Sanger-based resequencing showed that a fraction were in fact homozygous SNPs, while the majority were false calls. Misalignment of reads mapping to repeats or highly homologous segmental duplications and sequencing errors may have contributed to false detection of heterozygous SNPs in the DA and F344 genomes. Residual heterozygosity cannot be entirely excluded; it might result from selection against recessive alleles that are embryonically lethal or are associated with infertility or unproductive breeding behavior (Bailey 1977; Saar ).

The importance of copy-number and copy-neutral structural variants in the genome has only recently begun to be understood (Korbel ). Structural variation accounts for an even higher proportion of the genetic diversity between individuals than SNPs (Li ) and has been associated with disease in both rats and humans (Aitman ). Copy number variants can also correlate with levels of gene expression in rats (Guryev ; Charchar ) and have been estimated to account for 20% of expression differences in humans (Stranger ). We identified variants in the DA and F344 genomes that caused duplications, deletions, or potential disruptions of the structure of >2500 genes. This frequency of potentially gene-disrupting structural/copy-number variants has also been seen in other interstrain comparisons such as that described between DBA and B6 mice (Quinlan ). Insert size of libraries can be a limiting factor for the identification of insertion events in NGS (Pang ). Indeed, using the simulated reads, we estimated that our method of detecting structural variants had a sensitivity of 45–50%. Therefore, DA and F344 structural variants are most likely underrepresented.

DA and F344 rats shared alleles for 60% of the SNPs, 50% of the indels, and 70% of the structural variants, an indication of the phylogenetic proximity between these two strains. The high levels of allele sharing between DA and F344 are in agreement with a previous observation that BN was the most divergent of 167 commonly used laboratory inbred strains, including DA and F344 (Saar ).

We devised and employed a new strategy to generate the first de novo assembly of a rat genome using NGS technology data. As a result, the DA and F344 genome drafts are more extensive and more complete than the BN genome and should facilitate the study of discrepancies with genetic maps (Saar ) and areas of sequencing collapse (Guryev ). The new DA and F344 genome drafts contain 49 million base pairs of novel sequence each, nearly half the number of gaps present in the BN genome, and ∼1000 ESTs uniquely mapped to novel scaffolds of each strain.

The BN and SHR are the only rat nuclear genomes drafted to date. As additional rat genomes become available, investigators will be able to construct detailed haplotype maps, a key resource for both targeted and genome-wide studies in the rat. Sequencing additional genomes will help resolve regions of poor coverage in the BN and other rat genomes, as well as alignment and sequencing errors and undetected duplications.

Over 615 inbred rat strains and substrains are presently registered at the Rat Genome Database. These strains are an important resource for gene identification and studies of gene function and are currently being used by several laboratories worldwide. The SNPs, indels, and structural variants reported here constitute a large collection of new informative markers that can be used to increase the precision of genetic mapping and genotype-guided breeding, as well as for studies in advanced intercross lines and for genome-wide association studies using heterogeneous stocks. Indeed, with an average density of one SNP per 0.86 kb, SNPs identified in this study will facilitate mapping at a resolution 100-fold higher than with previously available SNPs (Saar ).

60 in total

1. SOAP2: an improved ultrafast tool for short read alignment.

Authors: Ruiqiang Li; Chang Yu; Yingrui Li; Tak-Wah Lam; Siu-Ming Yiu; Karsten Kristiansen; Jun Wang
Journal: Bioinformatics Date: 2009-06-03 Impact factor: 6.937

2. The arthritis severity quantitative trait loci Cia4 and Cia6 regulate neutrophil migration into inflammatory sites and levels of TNF-alpha and nitric oxide.

Authors: Teresina Laragione; Nuriza C Yarlett; Max Brenner; Adriana Mello; Barbara Sherry; Edmund J Miller; Christine N Metz; Pércio S Gulko
Journal: J Immunol Date: 2007-02-15 Impact factor: 5.422

Review 3. Synonymous but not the same: the causes and consequences of codon bias.

Authors: Joshua B Plotkin; Grzegorz Kudla
Journal: Nat Rev Genet Date: 2010-11-23 Impact factor: 53.242

4. The non-MHC quantitative trait locus Cia5 contains three major arthritis genes that differentially regulate disease severity, pannus formation, and joint damage in collagen- and pristane-induced arthritis.

Authors: Max Brenner; Hsiang-Chi Meng; Nuriza C Yarlett; Bina Joe; Marie M Griffiths; Elaine F Remmers; Ronald L Wilder; Pércio S Gulko
Journal: J Immunol Date: 2005-06-15 Impact factor: 5.422

5. Towards a comprehensive structural variation map of an individual human genome.

Authors: Andy W Pang; Jeffrey R MacDonald; Dalila Pinto; John Wei; Muhammad A Rafiq; Donald F Conrad; Hansoo Park; Matthew E Hurles; Charles Lee; J Craig Venter; Ewen F Kirkness; Samuel Levy; Lars Feuk; Stephen W Scherer
Journal: Genome Biol Date: 2010-05-19 Impact factor: 13.583

6. Effect of single nucleotide polymorphisms on expression of the gene encoding thrombin-activatable fibrinolysis inhibitor: a functional analysis.

Authors: Michael B Boffa; Deborah Maret; Jeffrey D Hamill; Nazareth Bastajian; Paul Crainich; Nancy S Jenny; Zhonghua Tang; Elizabeth M Macy; Russell P Tracy; Rendrik F Franco; Michael E Nesheim; Marlys L Koschinsky
Journal: Blood Date: 2007-09-12 Impact factor: 22.113

7. Genomic regions controlling corticosterone levels in rats.

Authors: Marc N Potenza; Edward S Brodkin; Bina Joe; Xingguang Luo; Elaine F Remmers; Ronald L Wilder; Eric J Nestler; Joel Gelernter
Journal: Biol Psychiatry Date: 2004-03-15 Impact factor: 13.382

8. A highly annotated whole-genome sequence of a Korean individual.

Authors: Jong-Il Kim; Young Seok Ju; Hansoo Park; Sheehyun Kim; Seonwook Lee; Jae-Hyuk Yi; Joann Mudge; Neil A Miller; Dongwan Hong; Callum J Bell; Hye-Sun Kim; In-Soon Chung; Woo-Chung Lee; Ji-Sun Lee; Seung-Hyun Seo; Ji-Young Yun; Hyun Nyun Woo; Heewook Lee; Dongwhan Suh; Seungbok Lee; Hyun-Jin Kim; Maryam Yavartanoo; Minhye Kwak; Ying Zheng; Mi Kyeong Lee; Hyunjun Park; Jeong Yeon Kim; Omer Gokcumen; Ryan E Mills; Alexander Wait Zaranek; Joseph Thakuria; Xiaodi Wu; Ryan W Kim; Jim J Huntley; Shujun Luo; Gary P Schroth; Thomas D Wu; HyeRan Kim; Kap-Seok Yang; Woong-Yang Park; Hyungtae Kim; George M Church; Charles Lee; Stephen F Kingsmore; Jeong-Sun Seo
Journal: Nature Date: 2009-07-08 Impact factor: 49.962

9. Propylnitrosourea-induced T-lymphomas in LEXF RI strains of rats: genetic analysis.

Authors: L M Lu; H Shisa; J Tanuma; H Hiai
Journal: Br J Cancer Date: 1999-05 Impact factor: 7.640

10. Copy number polymorphism in Fcgr3 predisposes to glomerulonephritis in rats and humans.

Authors: Timothy J Aitman; Rong Dong; Timothy J Vyse; Penny J Norsworthy; Michelle D Johnson; Jennifer Smith; Jonathan Mangion; Cheri Roberton-Lowe; Amy J Marshall; Enrico Petretto; Matthew D Hodges; Gurjeet Bhangal; Sheetal G Patel; Kelly Sheehan-Rooney; Mark Duda; Paul R Cook; David J Evans; Jan Domin; Jonathan Flint; Joseph J Boyle; Charles D Pusey; H Terence Cook
Journal: Nature Date: 2006-02-16 Impact factor: 49.962
