Feifei Xu, Aaron Jex, Staffan G Svärd.
Abstract
Giardia intestinalis is a protist that causes diarrhea in humans. The first G. intestinalis genome, from the WB isolate, was published more than ten years ago and has been widely used as the reference genome for Giardia research. However, that genome is fragmented, which hinders research at the chromosomal level. We re-sequenced the Giardia genome with PacBio long-read sequencing technology and obtained a new reference genome, which was assembled into near-complete chromosomes with only four internal gaps at long repeats. This new genome is not only more complete but also better annotated at both structural and functional levels, providing more detail about gene families, gene organization and chromosomal structure. This near-complete reference genome will be a valuable resource for the Giardia community and protist research. It also showcases how a fragmented genome can be improved with long-read sequencing technology complemented with optical maps.
Year: 2020 PMID: 32019935 PMCID: PMC7000408 DOI: 10.1038/s41597-020-0377-y
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Comparison of the old and the new G. intestinalis WB genome.
| | Old | New |
|---|---|---|
| Sequencing instrument | LI-COR, ABI 3700 | PacBio RS II |
| # Reads | 224,000* | 411,835 |
| # Bases | NA | 3.6 billion |
| Coverage | 11× | 200× |
| Assembler | ARACHNE 2.0 | HGAP3 |
| Optical mapping | − | + |
| # Chromosomes | 5 | 5 |
| Genome size (Mbp) | 11.7 | 12.6 |
| # Contigs | 306 | 38 |
| # Scaffolds | 92 | 35 |
| # Gaps | 137 | 4 |
| Gap size (Mbp) | 1.6 | 0.9 |
| G + C % | 49.0 | 46.3 |
| ASH % | 0.01 | 0.03 |
| # Genes | 5,901 | 4,963 |
| Mean gene length (aa) | 530 | 635 |
| Coding density % | 81.6 | 81.5 |
| Mean intergenic region (bp) | 481 | 477 |
| Number of introns** | 4 | 8 cis, 5 trans |
| tRNAs | 63 | 65 |
| 5S rRNAs | 8 | 10 |
| 5.8S rRNAs | 1 | 10 |
| 18S rRNAs | 4 | 9 (2 partial) |
| 28S rRNAs | 4 | 4 (5 partial, 12 ψ) |
*200,000 from end sequences with small insert, 2,400 end sequences from 200 kbp long insert.
**The first published draft genome contained 4 identified intron-containing genes. Not all of the intron-containing genes discovered since then were properly integrated in the newest GiardiaDB either. This version integrates all the discovered introns, which is consistent with the results of our de novo intron search.
Fig. 1 Near-complete five chromosomes. (a) Restriction enzyme (MluI) maps of the five chromosomes aligned with the genomic sequences digested with MluI in silico. Each vertical line inside a box represents a restriction enzyme cutting site. Gaps in the genomic sequences are represented with a horizontal line outside the boxes. (b) Circular plot comparing the old five chromosomes (left) with the new five chromosomes (right). Chromosomal sequences are represented in grey in the outermost circle, with gaps as white bands and telomeric repeats in red. BLASTN matches between the two genomes are shown as blue ribbons in the middle. The R package circlize (v0.4.8) was used to draw the circular plot[39].
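The in silico digestion mentioned in panel (a) can be illustrated with a minimal Python sketch. It assumes a complete digest of a linear sequence and uses the MluI recognition site A^CGCGT; the function names and the toy sequence are hypothetical and are not taken from the authors' pipeline.

```python
import re

MLUI_SITE = "ACGCGT"  # MluI recognition sequence (palindromic)
CUT_OFFSET = 1        # MluI cuts after the first base: A^CGCGT


def mlui_cut_positions(seq: str) -> list[int]:
    """Return 0-based top-strand cut positions for every MluI site."""
    seq = seq.upper()
    return [m.start() + CUT_OFFSET for m in re.finditer(MLUI_SITE, seq)]


def fragment_lengths(seq: str) -> list[int]:
    """Fragment lengths from a complete digest of a linear molecule."""
    bounds = [0] + mlui_cut_positions(seq) + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Hypothetical toy sequence with two MluI sites (not Giardia data).
toy = "GGGG" + "ACGCGT" + "TTTTTTTT" + "ACGCGT" + "CC"
print(mlui_cut_positions(toy))
print(fragment_lengths(toy))
```

The ordered fragment lengths are what would be compared against the fragment pattern of an optical map; a real comparison would also have to account for the sizing error and resolution limits of the optical data.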
Fig. 2 Circular plot of the five chromosomes. Chromosomal sequences are represented in grey in the outermost circle, with gaps as white bands and telomeres in red. The inner tracks are arranged as follows: GC%, ARPs/NEKs, (ψ)reverse transcriptases/rRNAs, (ψ)VSPs/HCMPs, coding density, SNP density, and regions with similarity. Regions with similarity are BLASTN matches of the genome against itself with >95% sequence identity and >2000 bp in length. The circular plot was drawn with the R package circlize (v0.4.8)[39].
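The similarity thresholds in this caption (>95% identity, >2000 bp) can be applied to BLASTN results with a short filter. The sketch below assumes the standard 12-column tabular output (`-outfmt 6`) and additionally drops the trivial match of a region against itself; that last step, the function name, and the toy rows are assumptions, not details given in the source.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def similar_regions(handle, min_identity=95.0, min_length=2000):
    """Keep hits above both thresholds, skipping trivial self-matches."""
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        hit = dict(zip(COLUMNS, row))
        pident, length = float(hit["pident"]), int(hit["length"])
        query = (int(hit["qstart"]), int(hit["qend"]))
        subject = (int(hit["sstart"]), int(hit["send"]))
        # Assumption: a region aligned to its own coordinates is not informative.
        trivial = hit["qseqid"] == hit["sseqid"] and query == subject
        if pident > min_identity and length > min_length and not trivial:
            hits.append((hit["qseqid"], query, hit["sseqid"], subject, pident, length))
    return hits


# Hypothetical rows: trivial self-hit, kept hit, too short, identity too low.
toy = (
    "chr1\tchr1\t100.0\t5000\t0\t0\t1\t5000\t1\t5000\t0.0\t9000\n"
    "chr1\tchr4\t97.5\t2600\t65\t0\t100\t2699\t800\t3399\t0.0\t4300\n"
    "chr2\tchr3\t99.0\t1500\t15\t0\t10\t1509\t40\t1539\t0.0\t2600\n"
    "chr5\tchr2\t93.0\t4000\t280\t0\t1\t4000\t1\t4000\t0.0\t5000\n"
)
result = similar_regions(io.StringIO(toy))
print(result)
```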
Arrangement of VSP pairs.
| Chr | Gene1 | Gene2 | Arrangement | Gene size (aa) | Distance (bp) | Genes between |
|---|---|---|---|---|---|---|
| 5 | 14586 | d14586 | → ← | 719 | 2689 | |
| 5 | 137722 | 137723 | ← → | 661 | 2513 | |
| 5 | 137714 | 11470 | → ← | 627 | 2669 | |
| 5 | 137708 | 137707 | → ← | 593 | 2777 | |
| 5 | d14331 | 14331 | → ← | 419 | 2637 | |
| 4 | 16501 | d16501 | → ← | 692 | 2526 | |
| 4 | d103992 | 103992 | → ← | 627 | 2758 | |
| 4 | 50229 | d50229 | ← → | 740 | 3242 | ψRT |
| 4 | d112801 | 112801 | ← → | 747 | 6264 | ψRT |
| 4 | 111933 | 111936 | ← → | 741 | 3846 | ψRT |
| 3 | 119706 | 119707 | ← → | 673 | 3301 | ψRT |
| 3 | 115830 | 115831 | → ← | 633 | 2813 | |
| 3 | 136003 | 136004 | → ← | 551 | 2686 | |
| 2 | d11521 | 11521 | → ← | 628 | 2747 | |
| 2 | d117204 | 117204 | → ← | 255 | 2720 | |
| 2 | 117472 | 117473 | ← → | 200 | 3101 | NEK |
| 2 | 50359 | 134710 | ← → | 636 | 12440 | |
| 1 | d115797 | 115797 | → ← | 682 | 2649 | |
| 1 | d112208 | 112208 | → ← | 596 | 2648 | |
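As a reading aid for the table above, the following Python sketch tallies the two arrangements and the median distance for each. The chromosome, arrangement and distance values are copied from the table; the interpretation of "→ ←" as convergent (tail-to-tail) and "← →" as divergent (head-to-head) assumes the arrows denote transcription direction, which the table itself does not state.

```python
from collections import Counter, defaultdict
from statistics import median

# (chromosome, arrangement, distance in bp), copied from the table above.
PAIRS = [
    (5, "→ ←", 2689), (5, "← →", 2513), (5, "→ ←", 2669),
    (5, "→ ←", 2777), (5, "→ ←", 2637),
    (4, "→ ←", 2526), (4, "→ ←", 2758), (4, "← →", 3242),
    (4, "← →", 6264), (4, "← →", 3846),
    (3, "← →", 3301), (3, "→ ←", 2813), (3, "→ ←", 2686),
    (2, "→ ←", 2747), (2, "→ ←", 2720), (2, "← →", 3101), (2, "← →", 12440),
    (1, "→ ←", 2649), (1, "→ ←", 2648),
]

# Assumed meaning of the arrows (transcription direction of Gene1 and Gene2).
LABELS = {"→ ←": "convergent (tail-to-tail)", "← →": "divergent (head-to-head)"}


def summarize(pairs):
    """Count pairs per arrangement and compute the median distance for each."""
    counts = Counter(arrangement for _, arrangement, _ in pairs)
    distances = defaultdict(list)
    for _, arrangement, distance in pairs:
        distances[arrangement].append(distance)
    medians = {arr: median(values) for arr, values in distances.items()}
    return counts, medians


counts, medians = summarize(PAIRS)
for arr in counts:
    print(f"{arr}  {LABELS[arr]}: n={counts[arr]}, median distance={medians[arr]} bp")
```

Grouping the rows this way makes it easier to see that the pairs with intervening ψRT or NEK genes are all in the "← →" arrangement, which is also where the larger distances occur.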
Fig. 3 An 8 kbp genomic region on chromosome 5 with improved annotation, and start codon sequence logos. (a) An 8 kbp genomic region located on chromosome 5. The top part shows the RNA-Seq reads mapped to the region. A coverage cutoff of 30 was used for better display. The new genome is drawn directly below the RNA-Seq reads, with the old genome aligned at the bottom. Orthologs between the two genomes are linked with light pink bands. Genes are shown as arrowed boxes filled with different colors. White indicates genes without modification, purple (60288) indicates the BolA-like gene that is new in the new genome, and grey represents genes unique to the old genome (deleted from the new genome). Blue represents genes with adjusted start codons; these three genes were shortened and given updated descriptions. (b) Sequence logo at the start codon of the 626 new genes with updated start codons. (c) Sequence logo at the start codon of the corresponding 626 genes from the old genome. (d) Sequence logo at the start codon of all 4,963 protein-coding genes. Sequence logos were drawn with the R package seqLogo (v1.50)[40].
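A sequence logo such as those in panels (b) to (d) is built from per-position base frequencies and their information content. The sketch below is a minimal Python illustration of that calculation, not the seqLogo implementation: it assumes equal-length windows centred on the start codon, ignores small-sample correction, and uses hypothetical toy sequences.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_frequencies(windows):
    """Per-position base frequencies for equal-length sequence windows."""
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    matrix = []
    for i in range(length):
        counts = Counter(w[i].upper() for w in windows)
        total = sum(counts[b] for b in BASES)  # assumes at least one A/C/G/T
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix


def information_content(column):
    """Information content in bits: 2 minus the Shannon entropy of a column."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy


# Hypothetical windows: 3 bases upstream, the ATG start codon, 3 bases downstream.
windows = ["AAAATGGCT", "TAAATGTCA", "GAAATGAAG", "CAAATGCCC"]
matrix = position_frequencies(windows)
bits = [information_content(col) for col in matrix]
for position, value in enumerate(bits, start=-3):
    print(position, round(value, 2))
```

In a logo, the height of each stack is the information content at that position, and each letter's share of the stack is its frequency, so a fully conserved ATG appears as three full-height letters.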
| Measurement(s) | DNA • sequence_assembly • sequence feature annotation |
| Technology Type(s) | DNA sequencing • sequence assembly process • sequence annotation |
| Sample Characteristic - Organism | Giardia intestinalis |