Literature DB >> 23544073

SeqEntropy: genome-wide assessment of repeats for short read sequencing.

Hsueh-Ting Chu1, William W L Hsiao, Theresa T H Tsao, D Frank Hsu, Chaur-Chin Chen, Sheng-An Lee, Cheng-Yan Kao.   

Abstract

BACKGROUND: Recent studies on genome assembly from short-read sequencing data reported the limitation of this technology to reconstruct the entire genome even at very high depth coverage. We investigated the limitation from the perspective of information theory to evaluate the effect of repeats on short-read genome assembly using idealized (error-free) reads at different lengths. METHODOLOGY/PRINCIPAL
FINDINGS: We define a metric H(k) to be the entropy of sequencing reads at a read length k and use the relative loss of entropy ΔH(k) to measure the impact of repeats for the reconstruction of whole-genome from sequences of length k. In our experiments, we found that entropy loss correlates well with de-novo assembly coverage of a genome, and a score of ΔH(k)>1% indicates a severe loss in genome reconstruction fidelity. The minimal read lengths to achieve ΔH(k)<1% are different for various organisms and are independent of the genome size. For example, in order to meet the threshold of ΔH(k)<1%, a read length of 60 bp is needed for the sequencing of human genome (3.2 10(9) bp) and 320 bp for the sequencing of fruit fly (1.8×10(8) bp). We also calculated the ΔH(k) scores for 2725 prokaryotic chromosomes and plasmids at several read lengths. Our results indicate that the levels of repeats in different genomes are diverse and the entropy of sequencing reads provides a measurement for the repeat structures.
CONCLUSIONS/SIGNIFICANCE: The proposed entropy-based measurement, which can be calculated in seconds to minutes in most cases, provides a rapid quantitative evaluation on the limitation of idealized short-read genome sequencing. Moreover, the calculation can be parallelized to scale up to large euakryotic genomes. This approach may be useful to tune the sequencing parameters to achieve better genome assemblies when a closely related genome is already available.

Entities:  

Mesh:

Year:  2013        PMID: 23544073      PMCID: PMC3609794          DOI: 10.1371/journal.pone.0059484

Source DB:  PubMed          Journal:  PLoS One        ISSN: 1932-6203            Impact factor:   3.240


Introduction

The development of the next generation sequencing technologies (NGS) raised the hope to conduct true haplotype analysis of human genome [1] and for rapid full genome sequencing and typing of various organisms. The 1000 Genomes Project, launched in 2008, began to sequence one thousand human genomes with SGS platforms [2]. In the first phase of the project, the goal was to generate low coverage whole genome shotgun sequencing of 185 individuals. These data were produced in order to validate millions of published genetic variations including single nucleotide polymorphisms (SNPs), insertions and deletions (indels), and other structural variants. Soon after the announcement of the project, another group of scientists started the Genome 10 K project in 2009 which aims to “assemble a genomic zoo” by sequencing the genomes of vertebrate animals [3]. These studies help us understand the correlation between genotypes and phenotypes if large-scale genome shotgun sequencing could be unambiguously and accurately assembled. Recently, Alkan et al. published their analysis of the short-read sequencing data generated from the whole genomes of a Han Chinese individual and a Yoruban individual [4]. In contrary to the initial optimistic view of using NGS technologies to reconstitute the whole genome, it showed a severe deficiency in disambiguating certain genomic regions with short reads. Compared to reference human genome, more than 400 mega-base-pairs (Mbps) of common repeats are missing. As a consequence, it is still a challenge to perform accurate haplotype analysis even though a massive amount of genome sequencing data from multiple individuals is currently available. In another study by Kingsford et al., they explored the effect of repeats in prokaryotic genome assembly using de Bruijn graphs and derived an upper bound of contig sizes for a large number of prokaryotic genomes based on simulated short-reads of different lengths [5]. They concluded that while most genes (>98%) can be recovered in contigs derived from reads as short as 100 bps, even reads as long as 1000 bps are not sufficient to produce a complete prokaryotic genome in most cases. In this paper, we investigated the impact of read length with a different quantitative analysis. We define the entropy of nucleotide fragments (H) and use the loss in entropy to measure the influence of repeats on genome assembly. The repeat problem has plagued the assembly process since the first generation of sequencers [6], [7]. Regardless of the sequence assembler used, the de novo assembly of sequencing data will collapse identical repeats if the length of repeated segments is greater than the read length, resulting in incomplete genome reconstruction. As a result, with the read length limitations imposed by sequencing platforms, repetitive regions will not be reconstructed. In information theory, the entropy score is an index to measure the disorder in a system. Thus we use the definition of k-substring entropy to represent the expected value of the quantified information contained in the reads of length k produced by the sequencing procedure. We apply this measurement to both prokaryotic and eukaryotic genomes, including the human genome. We demonstrated the usefulness of the score as a measurement of the repeat structures of the genomes and proposed how it can be used to aid genome sequencing efforts from the perspective of read lengths and repetitiveness of target genomes.

Materials and Methods

An Idealized Model of Short Read Sequencing

In Figure 1, we illustrated a simplified model of the process to generate high coverage sequence data using a modern sequencing platform such as the Illumina/Solexa system. We considered fixed read length systems for this study since such systems currently provide the cost-effectiveness required for large-scale sequencing and have wider market adoptions. For our current purpose, we simplified the sequencing into two major steps. In the first step, a target DNA sequence is broken into smaller fragments. The fragments are filtered by size and then form a sequencing library. The second step is the parallel sequencing of the ends of these fragments. Current parallelized sequencing technologies are based on various sequencing-by-synthesis methods which can produce a massive number of reads with high redundancy. In this model, we assume that both of the steps are random. That is, the produced sequencing reads can be from any position in the DNA sequence with equal probability although various factors can contribute to sequence sampling bias in real life - resulting in uneven coverage and gaps in sequence assembly [8]. In addition, the reads can come from both the forward strand TF and the reverse strand TR of the target DNA sequence T. For convenience, we just use T to present both strands TF and TR in the following sections.
Figure 1

Model of typical short read sequencing.

(a) The target sequence is randomly broken into fragments and filtered by their lengths to form a sequencing library. (b) The end or ends of the DNA fragments are sequenced in parallel to generate a massive set of short reads. We assumed the sequencing is random so that each position is more or less covered by equal numbers of fixed-length reads.

Model of typical short read sequencing.

(a) The target sequence is randomly broken into fragments and filtered by their lengths to form a sequencing library. (b) The end or ends of the DNA fragments are sequenced in parallel to generate a massive set of short reads. We assumed the sequencing is random so that each position is more or less covered by equal numbers of fixed-length reads.

The Computation of k-substring Entropy Loss as a Quantitative Measurement of Repeats

If we can filter out all sequencing errors, we can see in Figure 1 that the reads generated are substrings of the target DNA sequence. Let x be a substring of length k from DNA sequence T = t1t2t3…tm where each tj, 1≤ j ≤ m, is one of the nucleotides {A, C, G, T}. In other words, for short read sequencing with reads of length k, the reads could be denoted as Let S be the collection of all possible substrings x = t …t with length k where each tj∈{A, C, G, T}. Definewhere c(x) is the number of occurrences of x in the sequence T. The Shannon entropy H of S is defined aswhere k is the length of substrings [9]. In particular, if any substring xi = t …t of length k is unique. Then, If any substring of length k is unique, there is no repeat whose length is greater than or equal to k. In this case, it achieves a maximum of entropy In almost all of the applications, we have We define the relative entropy loss of length k as If there are a large number of substrings {x}, we can divide the substrings into independent sets T, T, ..., T according their prefixes. Then we can rewrite (3) as Equation 8 represents an approach to divide up the input into smaller sets and to process them in parallel. This tactic allows one to process a large genome in a timely manner using modest computing resources.

Results

Deficiency in Sequence Coverage Caused by Sequence Repeats is Strongly Correlated to Loss in k-substring Entropy

We investigated the impact of repeat structures on genome assembly and the correlation of sequence coverage and k-substring entropy using the SeqEntropy program that we developed by comparing the sequence coverage results obtained from the SHARCGS de-novo assembler paper [10] and our entropy measurements. Table 1 lists the BAC insert sequences derived from Arabidopsis thaliana and Drosophila melanogaster (fruit fly). The reconstructed sequence coverage of assembly was quoted from the original SHARCGS paper. The result showed a strong correlation between the ratio of reconstructed coverage and entropy loss. Most BAC sequences have small deficiency from the reconstructed sequence coverage except the following three sequences: AC009243, AC092242 and AC007329 (the bolded rows in Table 1). These three sequences have significantly incomplete coverage such that the sequence coverage by SHARCGS assembled contigs is less than 90%. All of these three sequences lose more than 1% entropy at read length of 30 bp. Consequently, we propose that >1% entropy loss results in poor assembly results using de novo assemblers. For comparison, we also calculated the percentage of simple repeats in these BACs and showed the results in the last column of Table 1. The proposed relative entropy loss scores correlate well with the percentages of repeats. When both are high, we observed poor coverage of assembly, as shown in the bolded rows.
Table 1

Comparison of coverage and entropy loss of BAC sequencesa.

Clone (BAC no.)SequenceLength(bp)SequenceCoverage (%)CoverageDeficiency (%)H30 I30 ΔH30 % of repeat
AC011809108,76799.980.025.3358625.3374110.03%0.46%
AC002328109,17199.610.395.3359475.3390220.06%0.99%
AC064879109,18099.680.325.3359715.3390580.06%0.97%
AC023673109,36799.910.095.3386165.3398010.02%0.38%
AC011713109,69498.851.155.3320095.3410980.17%2.91%
AC009243110,565100.000.005.3443735.3445340.00%0.02%
AC022520110,61199.630.375.3418125.3447140.05%0.92%
AC018460 110,619 66.48 33.52 4.99782 5.344746 6.49% 41.15%
AC007764111,22299.840.165.3466645.3471070.01%0.13%
AC000348111,56698.781.225.3399375.3484490.16%2.81%
AC0921918091999.520.485.2058965.2089250.06%1.01%
AC1855339580898.461.545.2716175.28230.20%3.41%
AC018485 99441 82.86 17.14 5.094117 5.298469 3.86% 19.86%
AC01847810380999.900.105.3150525.3171440.04%0.56%
AC092242111023100.000.005.3462765.3463290.00%0.01%
AC018482 113821 87.42 12.48 5.261927 5.357142 1.78% 19.27%
AC18553411946199.420.585.372665.3781510.10%1.75%
AC09239912201399.920.085.3868155.3873330.01%0.17%
AC00783712364799.900.105.3919585.3931120.02%0.28%
AC00732912614099.990.015.4016385.4017830.00%0.05%

Sequence coverage percentages as listed in Dohm et al [10].

The programs for the computation are available at: http://sourceforge.net/projects/seqentropy/files/SeqEntropy-demo-20130203.zip.

The columns H30, I30,ΔH30 are computed by our program “SeqReadEntropy” using read length of 30 bp.

The column “% of repeat” is computed by our program “SeqReadRepeat” using read length of 30 bp.

Sequence coverage percentages as listed in Dohm et al [10]. The programs for the computation are available at: http://sourceforge.net/projects/seqentropy/files/SeqEntropy-demo-20130203.zip. The columns H30, I30,ΔH30 are computed by our program “SeqReadEntropy” using read length of 30 bp. The column “% of repeat” is computed by our program “SeqReadRepeat” using read length of 30 bp.

Evaluation of the Limitation of Short-read Sequencing for Animal Genomes

We applied the entropy measurement to analyze the limitation of short-read sequencing for different organisms. We selected five model animals (Table 2) with genome sizes ranging from 107∼109 bp. In order to calculate the entropy scores for large eukaryotic genomes on a desktop computer (Intel i7-3820 CPU and 8G RAM), we applied the principle behind Equation 8 above and divided up the input sequences into 256 (44) or 1024 (45) subsets based on the prefix (4mer or 5mer, respectively) of the sequences. The subsets from each genome are than run sequentially on the desktop computer. We had to run the processes sequentially due to memory constraint (only 8 Gb of RAM was available on the desktop computer) The total run-time for each organism is reported in Table 2. While large genomes such as that of human took a long time (295 hours) to complete, with some modifications to the program, we can run each subset in parallel on a computing cluster to reduce the run time significantly.
Table 2

Five animal genomes for entropy measurement.

OrganismGenome sizeVersiona Computation timeb
Yeast (S. cerevisiae)1.2×107 sacCer31.3 minutes
Nematode (C. elegans)1.0×108 ce1033 minutes
Fruit fly (D. melanogaster)1.3×108 dmel_r5.4242 minutes
Zebrafish (D. rerio)1.4×109 danRer766 hours
Human (H. sapiens)3.2×109 hg19, GRCh37.p5295 hours

The whole genome sequences were downloaded from http://hgdownload.cse.ucsc.edu/for the organisms: S. cerevisiae, C. elegans, D. rerio, and H. sapiens and ftp://ftp.flybase.net/for D. melanogaster.

The computation time of entropy measurement was recorded for read length 100 bp on a PC with Intel i7-3820 CPU and 8G RAM.

The whole genome sequences were downloaded from http://hgdownload.cse.ucsc.edu/for the organisms: S. cerevisiae, C. elegans, D. rerio, and H. sapiens and ftp://ftp.flybase.net/for D. melanogaster. The computation time of entropy measurement was recorded for read length 100 bp on a PC with Intel i7-3820 CPU and 8G RAM. Figure 2 depicts the relative entropy losses at different read lengths if idealized sequences are used for the organisms. Human requires read length of 60 bp and zebra fish requires read length of 100 bp to overcome the 1% entropy loss threshold. The genome size of zebra fish is less than half of human genome. It indicates that zebra fish genome is more repetitive than human genome. Moreover, the nematode, C. elegans, requires very short read (less than 30 bp) to avoid 1% entropy loss whereas fruit fly (D. melanogaster) requires more than 320 bp. Our analysis shows that genome assembly of many other organisms using short reads may be more challenging than human genome assembly.
Figure 2

Entropy losses at different read lengths for different

organisms. In the five organisms, the genomes of zebra fish (D. rerio) and fruit fly (D. melanogaster) will lose more entropy regardless of any read length used for sequencing. In particular, the fruit fly loses >2% of entropy loss even with read length of 120 bp. It will be <1% of entropy loss at read length of 230 bp. On the other hand, the genomes of Yeast (S. cerevisiae) and Nematode (C. elegans) have minor entropy loss even with very short reads. The detail results of entropy measurements are listed in .

Entropy losses at different read lengths for different

organisms. In the five organisms, the genomes of zebra fish (D. rerio) and fruit fly (D. melanogaster) will lose more entropy regardless of any read length used for sequencing. In particular, the fruit fly loses >2% of entropy loss even with read length of 120 bp. It will be <1% of entropy loss at read length of 230 bp. On the other hand, the genomes of Yeast (S. cerevisiae) and Nematode (C. elegans) have minor entropy loss even with very short reads. The detail results of entropy measurements are listed in .
Table 5

The relative entropy losses of five animal genomes at different read lengths.

ReadLen kΔHk of YeastΔHk of NematodeΔHk of Fruit flyΔHk of ZebrafishΔHk of Human
20 0.831217% 1.397796%5.627856%7.44469%5.406936%
300.736196% 0.844718% 4.852753%4.461654%2.858026%
400.683882%0.642178%4.337883%3.11601%1.684438%
500.644228%0.522634%3.932514%2.363446%1.072922%
600.611417%0.440682%3.599801%1.886473% 0.655126%
700.582789%0.380003%3.319438%1.556083%0.482622%
800.557695%0.332944%3.078671%1.315029%0.379953%
900.536264%0.295774%2.868949%1.134327%0.313174%
1000.517069%0.265546%2.684404% 0.996147% 0.2661318%
1100.499602%0.240705%2.520669%0.887384%0.2306789%
1200.483514%0.219947%2.374214%0.799316%0.2028979%
320__1.007846%__
330__ 0.973252% __

The bold numbers show the relative entropy loss values and the corresponding minimal read lengths at which the relative entropy losses are below 1% for different animal genomes.

The relationship between entropy loss and read length explains the limitation of short-read sequencing technology illustrated by Alkan et al. [4]. In the early experiments of the 1000 Genomes Project (like SRA ID: ERX000020), the read length of sequencing data is 36 bp and the curve of relative entropy loss for human genome in Figure 2 indicates more than 2% entropy loss at the read length of 36 bp. As a result, it is almost impossible to retrieve a perfect whole genome assembly from those WGS experiments. On the other hand, Sundquist et al. showed the sequencing of D. melanogaster still achieved worse genome sequence coverage than that of human chromosomes at read lengths of 200 bp [11]. Our proposed quantitative model illustrates the deficiency of sequence coverage for D. melanogaster comes from the richness of repeats in its genome.

Evaluation of Bacterial Whole Genome Sequencing at Different Read Lengths

The Escherichia coli strain MG1655 whole genome shotgun sequencing datasets SRX000429 and SRX000430 generated using Illumina Genome Analyzer are commonly used as performance benchmark of short read sequencing [12]. The complete genome sequence of the same strain (NCBI REFSEQ ID: NC_000913) had been well characterized since 1997 [13]. Therefore, we can compare the result of the de novo assembly using the Illumina reads to the completed reference genome by mapping the contigs to the reference genome sequence. To explore the reliability of the de novo assembly result, we computed the entropy at read length of 36 bp for the E. coli genome sequence NC_000913. We listed the entropy loss of the E. coli genome sequence along with some other prokaryotic whole genome sequences in Table 3. It shows that the entropy loss for the E. coli genome sequenced at read length of 36 bp is 0.22%. Compared this with the results obtained in Table 1, a relative entropy loss of 0.22% corresponds to about 2% genome coverage loss and suggests the difficulty in achieving a perfect genome coverage. Most of the assembly results without pair information by publicly available de novo assemblers can only achieve a genome-wide coverage of around 98% for the E. coli short reads dataset SRX000429 (https://wiki.nbic.nl/index.php/Raw_results_of_NGS_de_novo_assembly). With the help of longer reads or paired-end reads information available since the two 36 bp E. coli datasets were generated, the de novo assembly can achieve a better genome coverage than 98%. To approximate the effects of longer reads and paired-end reads, we calculate the relative entropy losses at k-substring length of 500 bp and 1000 bp (Table 3). However, based on the entropy losses, we think it is still very difficult to develop an automated de novo assembler to reach lossless assembly using data generated from current sequencing platforms. Manual inspections and the use of long range mapping information are necessary in most of the genome assembly. In general, our analysis indicates that smaller prokaryotic genomes with fewer repeats have less entropy loss compared to larger genomes. However, it’s the repeat structure and not the genome size that plays a determining role in entropy loss. Bacteroides thetaiotaomicron VPI-5482 has a larger genome but much fewer long repeats than E. coli. [14]. As a consequence, B. thetaiotaomicron’s entropy loss at read length 36 bps is more similar to that of a much smaller prokaryotic genome. Using entropy losses calculated at various read lengths, we can capture the repeat structure of a genome.
Table 3

Entropy loss of selected prokaryotic whole genomes with reads of lengths 36, 500 and 1000 bps.

Seq. noOrganismSequenceLength(bp)ΔH36 ΔH500 ΔH1000
NC_000913E. coli K-124,639,6750.22%0.09%0.04%
NC_004663B. thet VPI-54826,260,3610.15%0.09%0.05%
NC_008525P.pent ATCC257451,832,3870.16%0.11%0.08%
NC_000908M.geni G37580,0760.11%0.00%0.00%
The third generation sequencing platforms use single-molecule technologies and other nanotechnologies [15]. The new methods claim to produce longer reads (>3000 bp). For instance, the reads for sequencing the E. coli genome could reach an average length of more than 3000 bps [16]. We showed the entropy loss at read length of 500 and 1000 bps for a few prokaryotic whole genome sequences. At the read length of 500 bps, Mycoplasma genitalium, an obligate parasitic bacterium with a highly reduced genome, has an insignificant entropy loss (Table 3). In light of the new and emerging sequencing technologies, we applied the entropy calculation to 2725 prokaryotic chromosomes and plasmids which were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/Bacteria/- downloaded June 2012). We computed the relative entropy losses at read lengths of 125 bp, 500 bp and 3000 bp – roughly correspond to the read lengths of Illumina, Roche 454 pyrosequencing and PacBio SMRT platforms, respectively. We showed the distribution of the relative entropy loss scores in Figure 3. While there are some replicons, especially certain highly repetitive plasmids, with entropy loss scores >1.0% (see Table S1), the vast majority (>99%) of the prokaryotic replicons would lose less than 1.0% entropy with current sequencing technologies. These results are confirmed by the fact that the vast majority of the prokaryotic genomes can be reasonably reconstructed into long contigs with current shotgun sequencing technologies. In figure 4, we zoomed in to show only the results from replicons with entropy loss <1% in order to further differentiate the effect of read length. The entropy loss scores decrease as the read length increases. At read length of 125 bp, even if the sequencing reads cover the genomes evenly and completely, de novo assemblers without utilizing pair-end information would still not be able to fully reconstruct the genome due to the presence of repeats in ∼80% of the cases (2140/2725 with ΔH125>0). On the other hand, with read length of 3000 bp, 58% of the replicons surveyed could be reconstructed completely (i.e. no loss of entropy due to repeats). We list the top 10 genomes (excluding plasmids) with the largest entropy losses at read length of 125 bp, 500 bp and 3000 bp in and the complete results of the entropy loss scores are listed in . There are 41 chromosomes with ΔH125>1% and 6 chromosomes with ΔH500>1%. Many of these outliers have been noted in the literature to have highly repetitive genomes due to various evolutionary forces [17], [18], [19], [20].
Figure 3

Histograms and quartile box plot of relative entropy losses in 2725 prokaryotic replicons.

The x-axis shows the number of replicons in each bin while the y-axis shows the % entropy loss (ΔH). The quartile box plot displays the mean (diamond shape), the medium (50%) the first (25%) and the third (75%) quartiles (the boxes), and the entire range (the whiskers). The vast majority of the replicons lost <1% entropy regardless of the read length.

Figure 4

Histograms and quartile box plot of entropy losses in 2725 prokaryotic replicons truncated at 1% entropy loss in order to see the finer breakdown.

The x-axis shows the number of replicons in each bin while the y-axis shows the % entropy loss (ΔH). The quartile box plot displays the mean (diamond shape), the medium (50%) the first (25%) and the third (75%) quartiles (the boxes), and the entire range (the whiskers). It is clear that as read length increases, the entropy loss decreases. As a result, a higher number of replicons have ΔH <1.0%.

Table 4

Prokaryotic chromosomes with largest entropy losses at read lengths of 125, 500 and 3000 bp.

Genome IDΔH125 Genome IDΔH500 Genome IDΔH3000
Bordetella pertussis Tohama I1.79136%Bordetella pertussis Tohama I0.898868%Mycoplasma agalactiae0.367427%
Xanthomonas oryzae pv. oryzae PXO99A1.82280%Mycoplasma fermentans M64 chromosome0.934389%Dehalococcoides ethenogenes1950.394745%
Wolbachia sp. wRi1.98404%Acinetobacter baumannii SDF0.934389%Mycoplasma fermentans M640.415779%
Aliivibrio salmonicida LFI1238chrom 12.04042%Mycoplasma mycoides subsp. mycoides SC str. PG10.940714%Orientia tsutsugamushi Boryong0.416469%
Shigella boydii CDC 3083-942.09366%Wolbachia endosymbiont ofCulex quinquefasciatus Pel1.028366%Methylobacillus flagellatus KT0.42687%
Shigella dysenteriae Sd1972.40508%Aliivibrio salmonicida LFI1238chrom 11.153796%Wolbachia sp. wRi0.43966%
Acinetobacter baumannii SDF2.64465%Wolbachia sp. wRi1.285340%Alteromonas macleodi0.465815%
Orientia tsutsugamushi str. Iked2.67584%Aliivibrio salmonicida LFI1238chrom 21.334486%Bartonella tribocorum CIP1054760.484289%
Mycoplasma mycoides subsp.mycoides SC str. PG12.75548%Orientia tsutsugamushi str. Ikeda1.570541%Streptococcus agalactiaeNEM3160.49117%
Orientia tsutsugamushi Boryong4.62902%Orientia tsutsugamushi Boryong2.753110%Candidatus Phytoplasma mali0.693043%

The complete entropy computations of 2725 prokaryotic replicons are listed in .

Histograms and quartile box plot of relative entropy losses in 2725 prokaryotic replicons.

The x-axis shows the number of replicons in each bin while the y-axis shows the % entropy loss (ΔH). The quartile box plot displays the mean (diamond shape), the medium (50%) the first (25%) and the third (75%) quartiles (the boxes), and the entire range (the whiskers). The vast majority of the replicons lost <1% entropy regardless of the read length.

Histograms and quartile box plot of entropy losses in 2725 prokaryotic replicons truncated at 1% entropy loss in order to see the finer breakdown.

The x-axis shows the number of replicons in each bin while the y-axis shows the % entropy loss (ΔH). The quartile box plot displays the mean (diamond shape), the medium (50%) the first (25%) and the third (75%) quartiles (the boxes), and the entire range (the whiskers). It is clear that as read length increases, the entropy loss decreases. As a result, a higher number of replicons have ΔH <1.0%. The complete entropy computations of 2725 prokaryotic replicons are listed in . The bold numbers show the relative entropy loss values and the corresponding minimal read lengths at which the relative entropy losses are below 1% for different animal genomes. Taken together, these observations mean the third generation sequencing technologies can theoretically provide complete genome sequences for most prokaryotic organisms if the error rate is controlled or corrected by other short read sequencing technologies. On the other hand, bigger eukaryotic genomes, which are likely to have more complex repeat structure, would benefit less from the longer reads produced by third generation sequencing technologies.

Discussion

The evaluation of k-substring entropy shows that the genomes of different organisms may have distinct repeat structures that impose limitation on sequencing at a certain read length regardless of their genome sizes. Using the entropy measurement, we can estimate an ideal read length for a given genome sequencing project by trying to minimize the entropy loss. Kingsford et al. previously introduced a de Bruijn graph method to evaluate the influence of repeat structure on sequence assembly [21]. However, their analysis was only applied to prokaryotic genomes which are relatively small (most <10 megabases). For the analysis of large genomes such as the mammalian or plant genomes, which can be a gigabases long, a program requires huge amount of memory to record all distinct substrings in the genome. Our algorithm is designed to handle genomes at any scale. With eq. (8), we can separate the substrings into different subsets by their prefix of length m (m = 5 for human genome and m = 4 for zebra fish genome) and calculate the entropy measurement of each set individually. As a result, the proposed program can be run on a desktop computer (8 Gb of RAM) for genomes of an arbitrary size at the cost of time. If completely parallelized, the calculation can be done on a large cluster to reduce the run time significantly. For a typical prokaryotic genome, it takes a few seconds to a few minutes to calculate the entropy loss at a given read length. The processing time is read length dependent and for read length <1000 bp, the calculation takes <1 min using a single CPU core. This length should be sufficient for most of the sequencing technologies in the near future with a few exceptions (e.g. Pacific Biosciences SMRT platform). For read length of 3000 bp, it takes about 20 min to calculate the entropy for one bacterial genome. This task of entropy measurement can be performed on a preliminary assembly or on a reference genome from a closely related organism. Since read length depends on the sequencing platforms and the protocols used, it is often not possible to alter. Experimentally, it is easier to construct paired-end libraries with different insert sizes. We can use pair-end libraries of different insert sizes to mimic the effect of longer reads. Wetzel et al. recently demonstrated that the assembly outcome can be improved drastically by tuning mate-pair sizes (i.e. adjusting the average insert size of the pair-end library) to match the dominant repeat types [22]. Furthermore, they showed that “short” inserts that are between 4 to 6 times the actual read lengths perform better than long inserts that are a few kilobases long. This is because short inserts that barely span the repeats are more effective at resolving local ambiguities than long inserts. Their work is based on idealized de Bruijn graph reconstruction of a genome (see Kingsford et al 2010 for the method). This process is computationally and memory intensive. As a result, it is not easily scaled up to handle large eukaryotic genomes. On the contrary, our method can estimate the ideal read length and is fast and highly scalable as we have demonstrated in the previous sections. We propose that SeqEntropy can be run with different read length parameters to detect the minimum entropy loss. Based on the work of Wetzel et al, we propose that the mate-pair sizes of the sequence library be slightly longer than the theoretical ideal read length detected by SeqEntropy. While the insert size is tuned, the actual read length is still based on available funding and sequencing platforms, allowing minimal interruption to existing and on-going sequencing projects. As more genomes are being sequenced in an automated fashion, the ability to tune the sequencing parameters to achieve better assemblies is highly desirable. We propose that entropy loss can be used to provide an accurate and objective estimate for the optimal sequence length.

Availability

The source programs of this work are available from the Web site http://sourceforge.net/projects/seqentropy/. List of 2725 Prokaryotic chromosomes. The detail computations of entropy losses at read lengths of 125, 500 and 3000 bp for all 2725 Prokaryotic chromosomes. (XLS) Click here for additional data file.
  19 in total

1.  Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions, DNPs.

Authors:  Martti T Tammi; Erik Arner; Tom Britton; Björn Andersson
Journal:  Bioinformatics       Date:  2002-03       Impact factor: 6.937

2.  A haplotype map of the human genome.

Authors: 
Journal:  Nature       Date:  2005-10-27       Impact factor: 49.962

Review 3.  Why repetitive DNA is essential to genome function.

Authors:  James A Shapiro; Richard von Sternberg
Journal:  Biol Rev Camb Philos Soc       Date:  2005-05

4.  SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing.

Authors:  Juliane C Dohm; Claudio Lottaz; Tatiana Borodina; Heinz Himmelbauer
Journal:  Genome Res       Date:  2007-10-01       Impact factor: 9.043

5.  The complete genome sequence of Escherichia coli K-12.

Authors:  F R Blattner; G Plunkett; C A Bloch; N T Perna; V Burland; M Riley; J Collado-Vides; J D Glasner; C K Rode; G F Mayhew; J Gregor; N W Davis; H A Kirkpatrick; M A Goeden; D J Rose; B Mau; Y Shao
Journal:  Science       Date:  1997-09-05       Impact factor: 47.728

6.  The Orientia tsutsugamushi genome reveals massive proliferation of conjugative type IV secretion system and host-cell interaction genes.

Authors:  Nam-Hyuk Cho; Hang-Rae Kim; Jung-Hee Lee; Se-Yoon Kim; Jaejong Kim; Sunho Cha; Sang-Yoon Kim; Alistair C Darby; Hans-Henrik Fuxelius; Jun Yin; Ju Han Kim; Jihun Kim; Sang Joo Lee; Young-Sang Koh; Won-Jong Jang; Kyung-Hee Park; Siv G E Andersson; Myung-Sik Choi; Ik-Sang Kim
Journal:  Proc Natl Acad Sci U S A       Date:  2007-05-02       Impact factor: 11.205

7.  Short- and long-term evolutionary dynamics of bacterial insertion sequences: insights from Wolbachia endosymbionts.

Authors:  Nicolas Cerveau; Sébastien Leclercq; Elodie Leroy; Didier Bouchon; Richard Cordaux
Journal:  Genome Biol Evol       Date:  2011-09-22       Impact factor: 3.416

8.  Genomic differences between type strain PG1 and field strains of Mycoplasma mycoides subsp. mycoides small-colony type.

Authors:  Daniela F Bischof; Edy M Vilei; Joachim Frey
Journal:  Genomics       Date:  2006-08-17       Impact factor: 5.736

9.  The Whole-genome sequencing of the obligate intracellular bacterium Orientia tsutsugamushi revealed massive gene amplification during reductive genome evolution.

Authors:  Keisuke Nakayama; Atsushi Yamashita; Ken Kurokawa; Takuya Morimoto; Michihiro Ogawa; Masahiro Fukuhara; Hiroshi Urakami; Makoto Ohnishi; Ikuo Uchiyama; Yoshitoshi Ogura; Tadasuke Ooka; Kenshiro Oshima; Akira Tamura; Masahira Hattori; Tetsuya Hayashi
Journal:  DNA Res       Date:  2008-05-28       Impact factor: 4.458

10.  Whole-genome sequencing and assembly with high-throughput, short-read technologies.

Authors:  Andreas Sundquist; Mostafa Ronaghi; Haixu Tang; Pavel Pevzner; Serafim Batzoglou
Journal:  PLoS One       Date:  2007-05-30       Impact factor: 3.240

View more
  1 in total

1.  Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome.

Authors:  Wentian Li; Jan Freudenberg; Pedro Miramontes
Journal:  BMC Bioinformatics       Date:  2014-01-03       Impact factor: 3.169

  1 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.