Rutika Puranik, Guangri Quan, Jacob Werner, Rong Zhou, Zhaohui Xu.
Abstract
BACKGROUND: Despite the large volume of genome sequencing data produced by next-generation sequencing technologies and the highly sophisticated software dedicated to handling these types of data, gaps are commonly found in draft genome assemblies. The existence of gaps compromises our ability to take full advantage of the genome data. This study aims to identify a practical approach for biologists to complete their own genome assemblies using commonly available tools and resources.
Year: 2015 PMID: 25708162 PMCID: PMC4331810 DOI: 10.1186/1471-2164-16-S3-S7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. The pipeline of genome assembly and gap closure. Clean data were subjected to de novo assembly, initial gap filling, and single-base correction with SOAPdenovo and SOAPaligner; the resulting assembly was used for comparative genomics studies and also guided wet-lab validation. In parallel, the clean data were assembled separately against a reference genome using CLC Genomics Workbench. This second assembly was integrated into the first to yield a hybrid assembly, which was then updated with public data, GapFish results, and Sanger sequencing data until the genome sequence was complete.
Figure 2. Schematic overview of the GapFish algorithm. (a) A segment upstream of a gap is used as the "bait" to search against all Illumina reads that are 90 nt long. If the "bait" is found in a read, GapFish excises the fragment adjacent to the "bait" in the 3' direction and returns the result (the "fish") to the console. At the end of each search, all identified fragments are sorted and saved into a text file. (b) An example output of GapFish when searching with "bait" = 'GAGGCTCCTCAGGCGGTTGTGGAGGGCAATCCCAGAAACTCCG' (total 43 nt). Sequencing errors are apparent in the results, such as at the 3rd position (G -> C) in the second line and at the 8th position (T -> G) in the fifth line (both are underlined). Errors of this type could have caused the SOAPdenovo assembly to collapse, leaving a gap behind. Resolving such complications has required GapFish-assisted human intervention. The second-to-last sequence (also underlined) is used as the "bait" for the next round of search.
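The bait-and-fish search described in panel (a) can be illustrated with a minimal Python sketch. This is not the GapFish source code: the function name `gapfish`, the `fish_len` parameter, and the toy reads are hypothetical, and the sketch assumes exact matching on the forward strand only, which the caption does not specify.

```python
def gapfish(bait, reads, fish_len=None):
    """Return the sorted fragments found immediately 3' of `bait` in `reads`.

    Assumptions (not taken from the original tool): exact, forward-strand
    matching, and only the first occurrence of the bait in each read is used.
    """
    catches = []
    for read in reads:
        pos = read.find(bait)           # -1 when the bait is absent
        if pos == -1:
            continue
        fish = read[pos + len(bait):]   # everything downstream of the bait
        if fish_len is not None:
            fish = fish[:fish_len]      # optionally cap the fragment length
        if fish:                        # skip reads that end at the bait
            catches.append(fish)
    return sorted(catches)


# Toy reads (made up for illustration); the third lacks the bait entirely.
toy_reads = [
    "TTACGTACGGATCC",
    "ACGTACGGATCA",
    "GGGGGGGG",
    "CCACGTACGCATCC",
]
fish = gapfish("ACGTAC", toy_reads)
print("\n".join(fish))
```

Sorting places near-identical fragments on adjacent lines, so a single-base disagreement stands out when a person inspects the output and picks the next bait by hand.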
Comparison of the assemblies generated by different methods.
| Methods | Scaffold size (including 'N's) | # of 'N's | Total nt assembled | Coverage* | # of gaps | Max gap |
|---|---|---|---|---|---|---|
| SOAP package | 1,822,593 | 14,240 | 1,808,353 | 97.7% | 28 | ~36 kb |
| CLC package | 1,884,513 | 201,850 | 1,682,663 | 90.9% | 380 | ~21 kb |
| This pipeline | 1,851,618 | 0 | 1,851,618 | 100% | 0 | 0 |
*Coverage was calculated by comparing the total number of nucleotides (nt) assembled (without "N"s) to the size of the complete genome of T. sp. strain RQ7.
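The coverage figures in the table can be reproduced from its other columns. The short Python check below takes the scaffold sizes and 'N' counts from the table and uses the size of the final assembly (1,851,618 nt) as the complete genome size; the variable and function names are illustrative only.

```python
GENOME_SIZE = 1_851_618  # complete genome of T. sp. strain RQ7, from the table

# method -> (scaffold size including 'N's, number of 'N's), copied from the table
assemblies = {
    "SOAP package": (1_822_593, 14_240),
    "CLC package": (1_884_513, 201_850),
    "This pipeline": (1_851_618, 0),
}


def coverage(scaffold_size, n_count, genome_size=GENOME_SIZE):
    """Return (nt assembled without 'N's, coverage in percent)."""
    assembled = scaffold_size - n_count
    return assembled, 100.0 * assembled / genome_size


for method, (size, n_count) in assemblies.items():
    assembled, pct = coverage(size, n_count)
    print(f"{method}: {assembled:,} nt assembled, {pct:.1f}% coverage")
```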
Comparison of the big gap region among different Thermotoga genomes.
| T. maritima (TM) | T. sp. strain RQ2 (TRQ2) | T. neapolitana (CTN) | T. sp. strain RQ7 | Annotation |
|---|---|---|---|---|
| TM0968 | TRQ2_1822 | CTN_1608 | Present | hypothetical protein |
| TM0969 | TRQ2_1821 | CTN_1607 | Present | hypothetical protein |
| TM0970 | Absent | CTN_1606 | Disrupted | hypothetical protein |
| TM0971 | TRQ2_1821 | Present* | Present | hypothetical protein |
| TM0972 | TRQ2_1820 | CTN_1605 | Disrupted | conserved hypothetical protein, GGDEF domain |
| TM0973 | TRQ2_1819 | CTN_1604 | Present | hypothetical protein |
| TM0974 | TRQ2_1818 | CTN_1603 | Present | hypothetical protein |
| TM0975 | Absent | CTN_1602 | Disrupted | hypothetical protein |
| TM0976 | Absent | Present* | Present | hypothetical protein |
| TM0977 | Absent | CTN_1601 | Present | hypothetical protein |
| TM0978 | TRQ2_1817 | CTN_1600 | Present | hypothetical protein |
| TM0979 | TRQ2_1816 | CTN_1599 | Present | hypothetical protein |
| TM0980 | TRQ2_1815 | CTN_1598 | Present | hypothetical protein |
| TM0981 | TRQ2_1814 | CTN_1597 | Disrupted | hypothetical protein |
| TM0982 | TRQ2_1813 | CTN_1596 | Present | hypothetical protein |
| TM0983 | TRQ2_1812 | CTN_1595 | Disrupted | hypothetical protein |
| TM0984 | TRQ2_1811 | CTN_1594 | Disrupted | hypothetical protein |
| TM0985 | TRQ2_1810 | CTN_1593 | Present | hypothetical protein |
| TM0986 | TRQ2_1809 | CTN_1592 & | Present | hypothetical protein |
| TM0987 | TRQ2_1808 | CTN_1590 | Disrupted | hypothetical protein |
| TM0988 | TRQ2_1807 | CTN_1589 | Disrupted | hypothetical protein |
| TM0989 | TRQ2_1806 | CTN_1588 | Present | hypothetical protein |
| TM0990 | TRQ2_1805 | CTN_1587 | Disrupted | hypothetical protein |
| TM0991 | TRQ2_1804 | CTN_1586 | Disrupted | hypothetical protein |
| TM0992 | Absent | CTN_1585 | Disrupted | hypothetical protein |
| TM0993 | Absent | CTN_1584 | Present | hypothetical protein |
| TM0994 | Absent | CTN_1583 | Present | hypothetical protein |
| TM0995 | Absent | CTN_1582 | Present | hypothetical protein |
| TM0996 | TRQ2_1803 | CTN_1581 | Present | hypothetical protein |
| TM0997 | TRQ2_1802 | CTN_1580 | Disrupted | hypothetical protein |
| TM0998 | TRQ2_1801 | CTN_1579 | Present | transcriptional regulator, ArsR family |
| TM0999 | Disrupted | Present* | Present | hypothetical protein |
| TM1000 | Absent | CTN_1578 | Present | hypothetical protein |
| TM1001 | Absent | CTN_1577 | Present | hypothetical protein |
| TM1002 | TRQ2_1800 | CTN_1576 | Disrupted | hypothetical protein |
| TM1003 | Absent | CTN_1575 | Absent | hypothetical protein |
| TM1004 | TRQ2_1800 | CTN_1573 | Absent | hypothetical protein |
Comparative analysis of the ~36 kb big gap region among four Thermotoga genomes, showing the synteny and conservation. This region was missing from the initial assembly of the T. sp. strain RQ7 genome but was included after combining the data from the CLC assembly and the primer walking effort. The asterisks (*) indicate the presence of a homolog that is not annotated as a gene in a particular genome. Absent indicates the complete deletion of the ORF in a particular genome.
Statistics of the assembling process
| | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Step 6 | Step 7 |
|---|---|---|---|---|---|---|---|
| | 27 | 27 | 15 | 13 | 1 | 0 | 0 |
| | 0 | 0 | 12,511 | 12,511 | 12,511 | 35,746 | 35,746 |
| Total nt assembled* | 1,808,353 | 1,808,353 | 1,828,147 | 1,828,363 | 1,832,588 | 1,851,716 | 1,851,618 |
*: "N"s are not counted. Assemblies in Steps 1-6 have overlapping end sequences (terminal redundancy). As a result, the assembly in Step 6 appeared to be slightly bigger than the final assembly.
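The Step 6 and final totals (1,851,716 vs. 1,851,618) differ by 98 nt, which is consistent with the terminal redundancy mentioned in the footnote. How the authors removed the overlap is not stated here; the Python sketch below shows one simple way to detect and trim a repeated end in a linearized circular sequence. The function name and the `min_overlap`/`max_overlap` parameters are hypothetical, and only exact overlaps are considered.

```python
def trim_terminal_redundancy(seq, min_overlap=10, max_overlap=1000):
    """Drop the tail of `seq` if it exactly repeats the head.

    Returns (trimmed sequence, overlap length). The longest exact overlap
    between `min_overlap` and `max_overlap` nt is used; 0 means no overlap
    was found and the sequence is returned unchanged.
    """
    longest = min(max_overlap, len(seq) // 2)
    for k in range(longest, min_overlap - 1, -1):
        if seq[:k] == seq[-k:]:         # the last k nt repeat the first k nt
            return seq[:-k], k
    return seq, 0


# Toy example: the first 12 nt reappear at the end of the sequence.
head = "ACGTACGGTTCA"
toy = head + "GGGCCCAAATTT" + head
trimmed, overlap = trim_terminal_redundancy(toy)
print(overlap, len(toy), len(trimmed))
```

Capping the search with `max_overlap` keeps the scan cheap on megabase-scale sequences; real data with sequencing errors in the overlap would need an alignment-based check instead of exact string comparison.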
ORFs differentially annotated in the complete genome
| # of affected ORFs | Putative functions | |
|---|---|---|
| 12 | ABC-type sugar transport and utilization machinery; chemotaxis protein; RNA/DNA processing | |
| 42 | ABC-type sugar transport and utilization machinery, transcriptional regulators, sulfur metabolism system, DNA/RNA helicases, DNA methylases | |
| 9 | - | |