Literature DB >> 25708162

A pipeline for completing bacterial genomes using in silico and wet lab approaches.

Rutika Puranik, Guangri Quan, Jacob Werner, Rong Zhou, Zhaohui Xu.   

Abstract

BACKGROUND: Despite the large volume of genome sequencing data produced by next-generation sequencing technologies and the highly sophisticated software dedicated to handling these types of data, gaps are commonly found in draft genome assemblies. The existence of gaps compromises our ability to take full advantage of the genome data. This study aims to identify a practical approach for biologists to complete their own genome assemblies using commonly available tools and resources.
RESULTS: A pipeline was developed to assemble complete genomes primarily from the next generation sequencing (NGS) data. The input of the pipeline is paired-end Illumina sequence reads, and the output is a high quality complete genome sequence. The pipeline alternates the employment of computational and biological methods in seven steps. It combines the strengths of de novo assembly, reference-based assembly, customized programming, public databases utilization, and wet lab experimentation. The application of the pipeline is demonstrated by the completion of a bacterial genome, Thermotoga sp. strain RQ7, a hydrogen-producing strain.
CONCLUSIONS: The developed pipeline provides an example of effective integration of computational and biological principles. It highlights the complementary roles that in silico and wet lab methodologies play in bioinformatical studies. The constituting principles and methods are applicable to similar studies on both prokaryotic and eukaryotic genomes.

Entities:  

Mesh:

Year:  2015        PMID: 25708162      PMCID: PMC4331810          DOI: 10.1186/1471-2164-16-S3-S7

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

Next-generation sequencing technologies produce massive amount of data at greatly reduced costs, making it possible to routinely sequence the genomes of various organisms. This is especially true for bacteria, whose genomes are typically less than 10 million base pairs (Mb). A standard Illumina sequencing operation can easily generate enough data to cover the genome of a bacterium more than 100 times, which often results in a near-complete genome assembly in a single attempt. In recent years, encouraging progress has been made in de novo sequencing for both small (for example, bacteria [1,2]) and large (for example, mammalian [3,4]) genomes. Methods for alignment and assembly [5-7] and evaluations [8,9] have also been developed. Nevertheless, no method is all-purpose, and the effectiveness of a method is often subject to constraints, such as genome size as well as the quality, length, and abundance of the reads. In addition, software and hardware environment can also play a role. As a consequence, despite the sheer volume of sequencing data and the highly sophisticated software dedicated to handling these types of data, gaps are commonly found in draft assemblies. Besides the limitations of assembling software, two other factors can lead to gaps: the nature of DNA templates and sequencing errors. Between them, the nature of DNA is more critical. For example, some regions of the genome are inherently prone to physical degradation while some others are resistant to amplification due to secondary structures. Both of these scenarios result in underrepresentation of the affected sequences in the data set, and therefore, leave gaps. The presence of gaps often leads to errors in gene finding, annotation, and functional studies. A complete genome is thus preferred or even required in a study. One straightforward way of closing gaps is conducting wet lab experiments, that is, primer walking and Sanger sequencing. However, this approach can be prohibitive, in terms of costs. Here we report a pipeline aimed to assembling complete genomes with a combination of in silico and wet lab approaches. The input of the pipeline is paired-end sequence reads generated by the Illumina technology, and the output is a high quality complete genome sequence. The genome being used as an example belongs to the hyperthermophilic bacterium Thermotoga sp. strain RQ7, which has a circular genome about 1.8 ~ 1.9 Mb, as estimated based on its close relatives. Thermotoga are potential producers of biohydrogen gas [10], a type of clean, renewable fuel.

Methods

Polymerase chain reaction (PCR)

The genomic DNA of T. sp. strain RQ7 was prepared by three rounds of phenol extraction [11]. Primers used in all PCR reactions were designed with the assistance of Primer3 [12,13] and Clone Manger[14]. The sequences of PCR products were determined by Sanger sequencing. All wet lab experiments were performed according to standard procedures.

Pipeline of scaffold assembling and gap closure

A modular pipeline consisting of seven components (Figure 1) was developed to meet our assembling needs:
Figure 1

The pipeline of genome assembling and gap closure. Clean data were subject to de novo assembling, initial gap filling, and single base corrections with SOAPdenovo and SOAPaligner; the resulting assembly was used for comparative genomics studies and also provided guidance for wet lab validations. Meanwhile, the clean data were assembled separately based on a reference genome using CLC Genomics Workbench. This second assembly was integrated into the first one to yield a hybrid assembly, which was then updated with public data, GapFish results, and Sanger sequencing data until the genome sequence was complete.

The pipeline of genome assembling and gap closure. Clean data were subject to de novo assembling, initial gap filling, and single base corrections with SOAPdenovo and SOAPaligner; the resulting assembly was used for comparative genomics studies and also provided guidance for wet lab validations. Meanwhile, the clean data were assembled separately based on a reference genome using CLC Genomics Workbench. This second assembly was integrated into the first one to yield a hybrid assembly, which was then updated with public data, GapFish results, and Sanger sequencing data until the genome sequence was complete. Step 1. De novo assembling with the SOAP package. Three sequencing libraries, with inserts in sizes of 500 bp, 2000 bp, and 5000 bp, were prepared from the genomic DNA of T. sp. strain RQ7. Paired-end sequencing was performed with each library, which generated reads of 90 bp for the 500 bp library and of 49 bp for the other two libraries. A total of 400 Mb clean data were collected and were assembled by SOAPdenovo [3] with K = 33. K value determines the quality of assembly, as measured by the size of the assembly and the number of "N"s it has. Larger K values generate longer contigs, but require deeper sequencing depth and longer reads [3,5,6]. When assembling microbial genomes from Illumina data, K is often set between 25 and 40. In this work the Kmer was empirically set at 33 for best results. After the assembling, an initial gap filling and a single base correction were undertaken with SOAPaligner [15]. This part of the work was done by BGI Americas (Cambridge, MA) under a service agreement. Step 2. Comparative genomics and wet lab validation. A comparative genomics study was performed to evaluate the above assembly and to identify the closest relatives of the T. sp. strain RQ7. PCR and Sanger sequencing were conducted to validate the analyses. Step 3. Integration of the assembly based on a reference genome. An independent assembling effort was taken using the commercial software package CLC Genomics Workbench [16], based on the complete genome sequence of a close relative, which had been identified in Step 2. This assembly was combined with the one obtained in Step 1 to give rise to a hybrid assembly, which was further updated in later steps. Step 4. Integration of public data. Public databases were searched for sequences belonging to T. sp. strain RQ7. The retrieved sequences were used to validate the assembly of Step 3 and to fill in the gaps in the hybrid assembly. Step 5. "Bait-and-fish". This step intends to close the remaining gaps with GapFish (Figure 2), an in-house program written in Python. GapFish adopts the "bait-and-fish" (or "seed-and-extend") scheme. The input of the program is a sequence found upstream of the gap, which is used as the "bait". The output is a sorted list of all sequences (the "fish") located in the same reads with the "bait" but immediately downstream of them. The list is evaluated for consensus sequences, and new "bait" is formulated, which is usually the second or third longest "fish". The process is iterated until the gap is filled (Figure 2). A gap is considered filled when the new sequence is mapped to the other side of the gap and the total length of the new sequences is about the size of the gap. When there are competing consensus sequences, the search process will be branched out until a fished out sequence matches to the downstream of the original gap. This process requires human intervention after each search cycle. Depending on the context of each gap, the user can define the length and the location of the "bait" and specify the source data set to search within. By this stage, only the most problematic gaps remain, which demands constant monitoring and intervening from the user. In this work, we typically started from the immediate upstream of a gap. The baits were ~45 bp in length, and the source data were the two files containing 90 bp reads. The expected size of the fished out sequences were ~45 bp or less, depending on the position of the bait in the matched read. The further the "bait" toward the 5' of the read, the longer the "fish" will be. GapFish can be downloaded from http://personal.bgsu.edu/~zxu/.
Figure 2

Schematic overview of the GapFish algorithm. (a) A segment upstream of a gap will be used as the "bait" to search against all Illumina reads that are 90 nt long. If the "bait" is found in a read, GapFish will excise the fragment adjacent to the "bait" at the 3' direction and return the result (the "fish") to the console. At the end of each search, all identified fragments will be sorted and save into a text file. (b) An example output of GapFish when searching with "bait" = 'GAGGCTCCTCAGGCGGTTGTGGAGGGCAATCCCAGAAACTCCG' (total 43 nt). Sequencing errors are apparent in the results, such as the 3position (G -> C) in the second line and the 8position (T -> G) in the fifth line (both are underlined). This type of errors could have led to the collapse of the assembling effort of SOAPdenovo, leaving a gap behind. For solving this type of complications, GapFish-assisted human interventions have proven to be necessary. The sequence second to the last one (also underlined) will be used as the "bait" for the next round of search.

Schematic overview of the GapFish algorithm. (a) A segment upstream of a gap will be used as the "bait" to search against all Illumina reads that are 90 nt long. If the "bait" is found in a read, GapFish will excise the fragment adjacent to the "bait" at the 3' direction and return the result (the "fish") to the console. At the end of each search, all identified fragments will be sorted and save into a text file. (b) An example output of GapFish when searching with "bait" = 'GAGGCTCCTCAGGCGGTTGTGGAGGGCAATCCCAGAAACTCCG' (total 43 nt). Sequencing errors are apparent in the results, such as the 3position (G -> C) in the second line and the 8position (T -> G) in the fifth line (both are underlined). This type of errors could have led to the collapse of the assembling effort of SOAPdenovo, leaving a gap behind. For solving this type of complications, GapFish-assisted human interventions have proven to be necessary. The sequence second to the last one (also underlined) will be used as the "bait" for the next round of search. Step 6. Closing the remaining gaps with primer walking and Sanger sequencing. This was to fill in the remaining gaps and to validate the results of the previous steps. Optimization of DNA preparation and PCR conditions were frequently required. As expected, this was the most costly, time-consuming, and labor-intensive step. Step 7. Final review with GapFish. Although GapFish was originally written to fill in the difficult gaps, we found it was also handy in the final review stage. It was used to double check the overall assembly, especially the sites surrounding the original gaps. Errors, introduced when replacing the "N"s with specified sequences, were corrected. GapFish was also useful in circularizing the genome. It was used to remove overlapping region at the ends of the linear assembly. In case the ends of the linear assembly do not meet, GapFish can be used to fill in the gap between the ends.

Results

Completing the genome

Illumina sequencing generated 400 Mb of clean data, which was expected to cover the genome of T. sp. strain RQ7 more than 200 times. After assembling with SOAPdenovo and gap-filling with SOAPaligner (Step 1), a scaffold of 1,822,593 bp (including 14,240 "N"s) was available (Table 1), which corresponds to ~98% of the complete genome. The scaffold had 27 gaps in the range of 1 bp to 3.2 kb, represented by strings of "N"s. Each of these gaps seemed to be small enough to be closed by a single PCR reaction; thus these gaps were referred to as minigaps.
Table 1

Comparison of the assemblies generated by different methods.

MethodsScaffold size (including 'N's)# of 'N'sTotal nt assembledCoverage*# of gapsMax gap
SOAP package1,822,59314,2401,808,35397.7%28~36 kb

CLC package1,884,513201,8501,682,66390.9%380~21 kb

This pipeline1,851,61801,851,618100%00

*Coverage was calculated by comparing the total number of nucleotides (nt) assembled (without "N"s) to the size of the complete genome of T. sp. strain RQ7.

Comparison of the assemblies generated by different methods. *Coverage was calculated by comparing the total number of nucleotides (nt) assembled (without "N"s) to the size of the complete genome of T. sp. strain RQ7. A comparative genomics study (Step 2) was conducted by examining the assembly against all publicly available Thermotoga genomes. It was found that the genome of T. sp. strain RQ7 was most similar to that of T. neapolitana, indicating that the two strains are more closely related to each other than to any other Thermotoga strains. This finding is in agreement with the results based on 16s rRNA analysis [17]. Therefore, the genome of T. neapolitana was used as the reference genome in later part of the study. Meanwhile, it did not escape our attention that although the genomes of T. sp. strain RQ7 and T. neapolitana shared a high level of synteny, small scales of insertions, deletions, and rearrangements were common. Most strikingly, a region of ~36 kb found in T. neapolitana was missing in the assembly of T. sp. strain RQ7. Further investigation indicated that this region was conserved among other Thermotoga genomes (Table 2). This unusually large deletion, observed only in the current assembly of T. sp. strain RQ7, prompted us to test the authenticity of the deletion by wet lab experiments. Five genes of the region were arbitrarily selected for amplification. PCR primers were designed based on the conserved parts of these genes, and genomic DNA from four species was tested, namely, T. neapolitana, T. maritima, T. sp. strain RQ2, and T. sp. strain RQ7. The PCR profiles and the following Sanger sequencing results revealed that all of the 5 tested genes were present in T. sp. strain RQ7, which means, in addition to the 27 minigaps, the current assembly contained a hidden big gap of ~36 kb.
Table 2

Comparison of the big gap region among different Thermotoga genomes.

T. maritima T. sp. strain RQ2 T. neapolitana T. sp. strain RQ7Annotation
TM0968TRQ2_1822CTN_1608Presenthypothetical protein

TM0969TRQ2_1821CTN_1607Presenthypothetical protein

TM0970AbsentCTN_1606Disruptedhypothetical protein

TM0971TRQ2_1821Present*Presenthypothetical protein

TM0972TRQ2_1820CTN_1605Disruptedconserved hypothetical protein, GGDEF domain

TM0973TRQ2_1819CTN_1604Presenthypothetical protein

TM0974TRQ2_1818CTN_1603Presenthypothetical protein

TM0975AbsentCTN_1602Disruptedhypothetical protein

TM0976AbsentPresent*Presenthypothetical protein

TM0977AbsentCTN_1601Presenthypothetical protein

TM0978TRQ2_1817CTN_1600Presenthypothetical protein

TM0979TRQ2_1816CTN_1599Presenthypothetical protein

TM0980TRQ2_1815CTN_1598Presenthypothetical protein

TM0981TRQ2_1814CTN_1597Disruptedhypothetical protein

TM0982TRQ2_1813CTN_1596Presenthypothetical protein

TM0983TRQ2_1812CTN_1595Disruptedhypothetical protein

TM0984TRQ2_1811CTN_1594Disruptedhypothetical protein

TM0985TRQ2_1810CTN_1593Presenthypothetical protein

TM0986TRQ2_1809CTN_1592 &CTN_1591Presenthypothetical protein

TM0987TRQ2_1808CTN_1590Disruptedhypothetical protein

TM0988TRQ2_1807CTN_1589Disruptedhypothetical protein

TM0989TRQ2_1806CTN_1588Presenthypothetical protein

TM0990TRQ2_1805CTN_1587Disruptedhypothetical protein

TM0991TRQ2_1804CTN_1586Disruptedhypothetical protein

TM0992AbsentCTN_1585Disruptedhypothetical protein

TM0993AbsentCTN_1584Presenthypothetical protein

TM0994AbsentCTN_1583Presenthypothetical protein

TM0995AbsentCTN_1582Presenthypothetical protein

TM0996TRQ2_1803CTN_1581Presenthypothetical protein

TM0997TRQ2_1802CTN_1580Disruptedhypothetical protein

TM0998TRQ2_1801CTN_1579Presenttranscriptional regulator, ArsR family

TM0999DisruptedPresent*Presenthypothetical protein

TM1000AbsentCTN_1578Presenthypothetical protein

TM1001AbsentCTN_1577Presenthypothetical protein

TM1002TRQ2_1800CTN_1576Disruptedhypothetical protein

TM1003AbsentCTN_1575Absenthypothetical protein

TM1004TRQ2_1800CTN_1573Absenthypothetical protein

Comparative analysis of the ~36 kb big gap region among four Thermotoga genomes, showing the synteny and conservation. This region was missing from the initial assembly of the T. sp. strain RQ7 genome but was included after combining the data from the CLC assembly and the primer walking effort. The asterisks (*) indicate the presence of a homolog that is not annotated as a gene in a particular genome. Absent indicates the complete deletion of the ORF in a particular genome.

Comparison of the big gap region among different Thermotoga genomes. Comparative analysis of the ~36 kb big gap region among four Thermotoga genomes, showing the synteny and conservation. This region was missing from the initial assembly of the T. sp. strain RQ7 genome but was included after combining the data from the CLC assembly and the primer walking effort. The asterisks (*) indicate the presence of a homolog that is not annotated as a gene in a particular genome. Absent indicates the complete deletion of the ORF in a particular genome. In order to reduce the gap number and/or their sizes, a commercial software package CLC Genomics Workbench was employed to re-assemble the entire set of Illumina reads of T. sp. strain RQ7, using T. neapolitana as the reference genome (Step 3). An assembly of 1,884,513 bp (including 201,850 "N"s) was generated, which had 380 gaps, ranging from 1 bp to 21 kb (Table 1). The scaffolds generated by the CLC package filled 12 of the 27 minigaps found in the SOAP assembly. After patching in the sequences of the 12 minigaps, we obtained a hybrid assembly of 1,823,180 bp (including 7,544 "N"s). Furthermore, 7 fragments totalling up to 12,511 bp were assigned to the big gap, which brought the total assembled nucleotides to 1,828,147 bp (Table 3).
Table 3

Statistics of the assembling process

Step 1Step 2Step 3Step 4Step 5Step 6Step 7
# Minigaps27271513100

Assembled nt in the big gap0012,51112,51112,51135,74635,746

Total nt assembled*1,808,3531,808,3531,828,1471,828,3631,832,5881,851,7161,851,618

*: "N"s are not counted. Assemblies in Steps 1-6 have overlapping end sequences (terminal redundancy). As a result, the assembly in Step 6 appeared to be slightly bigger than the final assembly.

Statistics of the assembling process *: "N"s are not counted. Assemblies in Steps 1-6 have overlapping end sequences (terminal redundancy). As a result, the assembly in Step 6 appeared to be slightly bigger than the final assembly. Searching of GenBank (Step 4) identified 13 entries belonging to T. sp. strain RQ7, which were Sanger reads previously deposited by other researchers. Among them, 11 fell into the assembled regions and 2 into the unfilled minigaps. After that, GapFish (Step 5) successfully closed 12 more minigaps. The last remaining minigap, located in a highly repetitive region, could only be reliably determined by wet lab experiment. In addition, we were able to use GapFish to correct 4 locations that had been previously misassembled by either the SOAP package or the CLC package. However, the attempt to fill in the big gap with GapFish revealed that there were no additional reads in the data set that belonged to the big gap. This surprising finding suggested that sequences of this region were severely underrepresented in the original data set. Under such circumstances, we had to appeal to the wet lab approach for the remaining tasks. Primer walking and Sanger sequencing (Step 6) were sought after to solve the last mini-gap, to validate the filling of other minigaps containing repetitive sequences, and to close the big gap. The closure of the big gap was achieved by using freshly prepared DNA template (less than 5 days) and optimizing PCR conditions. The requirement of fresh DNA templates indicated the physical instability of the region, which could have led to the underrepresentation of the corresponding sequence reads. Eventually, the total size of the big gap was determined to be 35,746 bp. This region of the T. sp. strain RQ7 genome was compared to the counterparts of T. neapolitana, T. maritima, and T. sp. strain RQ2. Most of the predicted gene products are hypothetical proteins with unknown functions (Table 2). After a final review (Step 7), the complete genome was determined to be 1,851,618 bp, with a GC content of 46.13%. The statistics of the assembling process is summarized in Table 3. To highlight the effectiveness of the pipeline, the final assembly was compared to the assemblies obtained just from the SOAP or CLC package (Table 1). Because our pipeline integrates results from both packages, it is clearly superior to either of them employed alone. By further integration of the data from other sources, including the public data, the GapFish analyses, and the wet lab experimental data, the pipeline was able to deliver a complete genome in the end. In our pipeline, the results of each step were evaluated and validated multiple times in the following steps, which ensures high quality of the final assembly. The genome of T. sp. strain RQ7 has been deposited to GenBank with the accession number CP007633.

Differentially annotated genes

Because gaps in draft genomes may contain start and stop codons, closing the gaps is likely to alter the annotation results of the affected regions. Compared to the initial draft assembly of the SOAP package, the complete genome of T. sp. strain RQ7 recovered 42 genes that were previously missing, changed the sizes of 12 genes, and dismissed 9 false open reading frames (ORFs) (Table 4). The affected genes were predicted to take part in many essential cellular processes, such as sugar transportation and utilization, chemotaxis, RNA/DNA processing, sulfur metabolism, and transcription regulations. Missing or false annotation of these genes could lead to erroneous interpretation of the biology of the organism, which could result in great loss if follow-up work was pursued with these genes.
Table 4

ORFs differentially annotated in the complete genome

# of affected ORFsPutative functions
Size variation12ABC-type sugar transport and utilization machinery; chemotaxis protein; RNA/DNA processing

Recovered ORFs42ABC-type sugar transport and utilization machinery, transcriptional regulators, sulfur metabolism system, DNA/RNA helicases, DNA methylases

Dismissed ORFs9-
ORFs differentially annotated in the complete genome

Discussion

This work distinguishes itself from similar studies [18,19] from the aspect of multi-phase interactions between computational and biological approaches (Figure 1). The pipeline started from a de novo assembly of paired-end reads, followed by a comparative genomics analysis and wet lab verification, which led to the identification of a reference genome as well as the discovery of a hidden big gap. The genome was then re-assembled based on the reference genome, providing additional contigs. Public databases were then utilized to reduce the number of gaps, prior to another round of customized gap closure effort, using GapFish. When in silico options were exhausted, wet lab methods were introduced again; by then, the number of required lab work had been reduced to the minimum. The pipeline ended with an in silico final review. The robustness of the pipeline relies upon the intimate interaction of in silico and wet lab approaches from an early stage. Rich biological information deduced from the previous step of computation can be used to guide the next stage of work, either in silico or in a wet lab setting. Because in silico methods are more efficient and less expensive, we should take full advantage of their benefits. In this work, software products from various sources were utilized, including the free ware SOAPdenovo and SOAPaligner, the commercial package CLC Genomics Workbench, and the in-house program GapFish. They accomplished complementary tasks, and each played a critical role at different stages. On the other hand, technologies have their shortcomings. Computational results should always be carefully evaluated in a biological context and be validated with wet lab experiments. The pipeline is intrinsically flexible to allow customization of the modules. For instance, one can simply skip Step 4 if there are no public data available for the target genome. In addition, it is worthwhile to point out that the reference genome used in Step 3 refers to the most related genome sequence currently available. The reference genome would be most helpful if it is from the same species as the target genome, but it could still provide useful regional assemblies if they just share the genus (as T. sp. strain RQ7 and T. neapolitana do). Bypassing this step is also acceptable, although it will shift the workload to the later steps.

Conclusions

In this study, we developed a genome assembling pipeline using commonly available tools and resources. It stresses the intimate, multi-phase interactions between in silico and wet lab approaches. The application of the pipeline was established via the delivery of the complete genome sequence of T. sp. strain RQ7. The constituting principles and methods are applicable to a variety of similar studies with both prokaryotes and eukaryotes.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

RP conducted all PCR experiments and performed most part of the data analysis. JW, GQ, RZ, and ZX participated in data analysis. RP, GQ, and RZ also contributed to drafting the manuscript. ZX conceived and coordinated the study, wrote the GapFish software, and drafted the manuscript. All authors read and approved the final manuscript.
  14 in total

1.  Enhancements and modifications of primer design program Primer3.

Authors:  Triinu Koressaar; Maido Remm
Journal:  Bioinformatics       Date:  2007-03-22       Impact factor: 6.937

2.  Short read fragment assembly of bacterial genomes.

Authors:  Mark J Chaisson; Pavel A Pevzner
Journal:  Genome Res       Date:  2007-12-14       Impact factor: 9.043

3.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs.

Authors:  Daniel R Zerbino; Ewan Birney
Journal:  Genome Res       Date:  2008-03-18       Impact factor: 9.043

4.  De novo assembly of human genomes with massively parallel short read sequencing.

Authors:  Ruiqiang Li; Hongmei Zhu; Jue Ruan; Wubin Qian; Xiaodong Fang; Zhongbin Shi; Yingrui Li; Shengting Li; Gao Shan; Karsten Kristiansen; Songgang Li; Huanming Yang; Jian Wang; Jun Wang
Journal:  Genome Res       Date:  2009-12-17       Impact factor: 9.043

5.  Sensitive and accurate detection of copy number variants using read depth of coverage.

Authors:  Seungtai Yoon; Zhenyu Xuan; Vladimir Makarov; Kenny Ye; Jonathan Sebat
Journal:  Genome Res       Date:  2009-08-05       Impact factor: 9.043

Review 6.  Sense from sequence reads: methods for alignment and assembly.

Authors:  Paul Flicek; Ewan Birney
Journal:  Nat Methods       Date:  2009-11       Impact factor: 28.547

7.  The sequence and de novo assembly of the giant panda genome.

Authors:  Ruiqiang Li; Wei Fan; Geng Tian; Hongmei Zhu; Lin He; Jing Cai; Quanfei Huang; Qingle Cai; Bo Li; Yinqi Bai; Zhihe Zhang; Yaping Zhang; Wen Wang; Jun Li; Fuwen Wei; Heng Li; Min Jian; Jianwen Li; Zhaolei Zhang; Rasmus Nielsen; Dawei Li; Wanjun Gu; Zhentao Yang; Zhaoling Xuan; Oliver A Ryder; Frederick Chi-Ching Leung; Yan Zhou; Jianjun Cao; Xiao Sun; Yonggui Fu; Xiaodong Fang; Xiaosen Guo; Bo Wang; Rong Hou; Fujun Shen; Bo Mu; Peixiang Ni; Runmao Lin; Wubin Qian; Guodong Wang; Chang Yu; Wenhui Nie; Jinhuan Wang; Zhigang Wu; Huiqing Liang; Jiumeng Min; Qi Wu; Shifeng Cheng; Jue Ruan; Mingwei Wang; Zhongbin Shi; Ming Wen; Binghang Liu; Xiaoli Ren; Huisong Zheng; Dong Dong; Kathleen Cook; Gao Shan; Hao Zhang; Carolin Kosiol; Xueying Xie; Zuhong Lu; Hancheng Zheng; Yingrui Li; Cynthia C Steiner; Tommy Tsan-Yuk Lam; Siyuan Lin; Qinghui Zhang; Guoqing Li; Jing Tian; Timing Gong; Hongde Liu; Dejin Zhang; Lin Fang; Chen Ye; Juanbin Zhang; Wenbo Hu; Anlong Xu; Yuanyuan Ren; Guojie Zhang; Michael W Bruford; Qibin Li; Lijia Ma; Yiran Guo; Na An; Yujie Hu; Yang Zheng; Yongyong Shi; Zhiqiang Li; Qing Liu; Yanling Chen; Jing Zhao; Ning Qu; Shancen Zhao; Feng Tian; Xiaoling Wang; Haiyin Wang; Lizhi Xu; Xiao Liu; Tomas Vinar; Yajun Wang; Tak-Wah Lam; Siu-Ming Yiu; Shiping Liu; Hemin Zhang; Desheng Li; Yan Huang; Xia Wang; Guohua Yang; Zhi Jiang; Junyi Wang; Nan Qin; Li Li; Jingxiang Li; Lars Bolund; Karsten Kristiansen; Gane Ka-Shu Wong; Maynard Olson; Xiuqing Zhang; Songgang Li; Huanming Yang; Jian Wang; Jun Wang
Journal:  Nature       Date:  2009-12-13       Impact factor: 49.962

8.  ALLPATHS: de novo assembly of whole-genome shotgun microreads.

Authors:  Jonathan Butler; Iain MacCallum; Michael Kleber; Ilya A Shlyakhter; Matthew K Belmonte; Eric S Lander; Chad Nusbaum; David B Jaffe
Journal:  Genome Res       Date:  2008-03-13       Impact factor: 9.043

9.  De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer.

Authors:  David Hernandez; Patrice François; Laurent Farinelli; Magne Osterås; Jacques Schrenzel
Journal:  Genome Res       Date:  2008-03-10       Impact factor: 9.043

10.  GapFiller: a de novo assembly approach to fill the gap within paired reads.

Authors:  Francesca Nadalin; Francesco Vezzi; Alberto Policriti
Journal:  BMC Bioinformatics       Date:  2012-09-07       Impact factor: 3.169

View more
  3 in total

1.  TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads.

Authors:  Mengyang Xu; Lidong Guo; Shengqiang Gu; Ou Wang; Rui Zhang; Brock A Peters; Guangyi Fan; Xin Liu; Xun Xu; Li Deng; Yongwei Zhang
Journal:  Gigascience       Date:  2020-09-01       Impact factor: 6.524

2.  Robust high-throughput prokaryote de novo assembly and improvement pipeline for Illumina data.

Authors:  Andrew J Page; Nishadi De Silva; Martin Hunt; Michael A Quail; Julian Parkhill; Simon R Harris; Thomas D Otto; Jacqueline A Keane
Journal:  Microb Genom       Date:  2016-08-25

3.  Complete genome sequence of Thermotoga sp. strain RQ7.

Authors:  Zhaohui Xu; Rutika Puranik; Junxi Hu; Hui Xu; Dongmei Han
Journal:  Stand Genomic Sci       Date:  2017-10-11
  3 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.