Subhash J Jakhesara, Viral B Ahir, Ketan B Padiya, Prakash G Koringa, Dharamshibhai N Rank, Chaitanya G Joshi.
Abstract
Whole genome sequencing of buffalo is yet to be completed, and in the near future it may not be possible to identify an exome (coding region of the genome) through bioinformatics for designing probes to capture it. In the present study, we employed in-solution hybridization to sequence tissue-specific temporal exomes (TST exomes) in buffalo. We used cDNA prepared from buffalo muscle tissue as a probe to capture TST exomes from the buffalo genome. This resulted in a prominent reduction of repeat sequences (up to 40%) and an enrichment of coding sequences (up to 60%). Enriched targets were sequenced on a 454 pyrosequencing platform, generating 101,244 reads containing 24,127,779 high-quality bases. The data revealed 40,100 variations, of which 403 were indels and 39,218 were SNPs, including 195 nonsynonymous candidate SNPs in protein-coding regions. The study indicated that 80% of the total genes identified from the capture data were expressed in muscle tissue. The present study is the first of its kind to sequence TST exomes captured using cDNA molecules, identifying SNPs in the coding region without any prior sequence information on the targeted molecules.
Year: 2012 PMID: 22768984 PMCID: PMC5054198 DOI: 10.1016/j.gpb.2012.05.005
Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 7.691
Summary of sequencing data obtained from Bubalus bubalis muscle TST exome capture
| gsMapper assembly (with | | |
|---|---|---|
| No. of reads | 101,244 | |
| No. of bases | 24,127,779 | |
| No. of mapped reads | 59,211 | 58.48% |
| No. of mapped bases | 6,389,585 | 26.48% |
| No. of unmapped reads | 41,530 | 41.02% |
| No. of too short reads | 503 | 0.50% |
| No. of contigs | 750 | |
| No. of bases | 151,663 | |
| No. of contigs | 230 | |
| No. of bases | 102,080 | |
| No. of singleton reads | 29,273 | |
| No. of singleton bases | 6,503,535 | |
Note: Reads obtained from sequencing were mapped against the Bos taurus reference mRNA database. Unmapped reads were merged with the too short reads and subjected to de novo assembly to form contigs. The assembled contigs were used for analysis.
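The read-level percentages in this summary follow directly from the counts, so they can be re-derived as a quick consistency check. The sketch below is illustrative only: the constant and function names are assumptions introduced here, and rounding to two decimals is assumed rather than stated in the source.

```python
# Counts copied from the summary table above.
TOTAL_READS = 101_244
TOTAL_BASES = 24_127_779
MAPPED_READS = 59_211
MAPPED_BASES = 6_389_585
UNMAPPED_READS = 41_530
TOO_SHORT_READS = 503


def percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


# Every sequenced read should fall into exactly one of the three classes.
reads_accounted = MAPPED_READS + UNMAPPED_READS + TOO_SHORT_READS

read_shares = {
    "mapped": percent(MAPPED_READS, TOTAL_READS),
    "unmapped": percent(UNMAPPED_READS, TOTAL_READS),
    "too short": percent(TOO_SHORT_READS, TOTAL_READS),
}
mapped_base_share = percent(MAPPED_BASES, TOTAL_BASES)

if __name__ == "__main__":
    print("all reads accounted for:", reads_accounted == TOTAL_READS)
    for label, share in read_shares.items():
        print(f"{label} reads: {share:.2f}%")
    print(f"mapped bases: {mapped_base_share:.2f}%")
```

Because the three read classes partition the total, their shares should sum to 100% up to rounding.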
Pathway classification of genes based on KEGG analysis
| Category | Mapped contigs | Unmapped and too short contigs | Singletons | Total | % |
|---|---|---|---|---|---|
| Metabolism | 114 | 7 | 109 | 230 | 28.01 |
| Organismal systems | 90 | 3 | 89 | 182 | 22.16 |
| Human diseases | 68 | 0 | 75 | 143 | 17.41 |
| Environmental information processing | 53 | 4 | 52 | 109 | 13.27 |
| Cellular processes | 49 | 1 | 39 | 89 | 10.84 |
| Genetic information processing | 37 | 0 | 31 | 68 | 8.28 |
| Total | 411 | 15 | 395 | 821 | |
Note: The percentage of genes in each pathway category in buffalo muscle tissue was calculated from the KEGG pathway analysis. Notably, genes related to metabolic pathways were the most enriched in muscle tissue.
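The row totals and the percentage column of the KEGG table can be recomputed from the three count columns. The following is a minimal sketch, not the authors' procedure: the names are hypothetical, and truncating (rather than rounding) to two decimals is an assumption inferred from the published values.

```python
# (mapped contigs, unmapped and too short contigs, singletons) per category,
# copied from the KEGG table above.
KEGG_COUNTS = {
    "Metabolism": (114, 7, 109),
    "Organismal systems": (90, 3, 89),
    "Human diseases": (68, 0, 75),
    "Environmental information processing": (53, 4, 52),
    "Cellular processes": (49, 1, 39),
    "Genetic information processing": (37, 0, 31),
}


def truncated_percent(part: int, whole: int) -> float:
    """Percentage of `whole`, truncated (not rounded) to two decimals.

    Truncation is an assumption; it is what reproduces the table's values.
    """
    return (part * 10_000 // whole) / 100


row_totals = {name: sum(counts) for name, counts in KEGG_COUNTS.items()}
column_totals = tuple(sum(column) for column in zip(*KEGG_COUNTS.values()))
grand_total = sum(row_totals.values())
shares = {
    name: truncated_percent(total, grand_total)
    for name, total in row_totals.items()
}

if __name__ == "__main__":
    for name, share in shares.items():
        print(f"{name}: {row_totals[name]} genes, {share:.2f}%")
    print("column totals:", column_totals, "grand total:", grand_total)
```

Integer arithmetic is used for the truncation step so that the result does not depend on floating-point rounding.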
Top 10 GO terms enriched in muscle TST exome
| GO category | Proportion in muscle TST exome (%) |
|---|---|
| Molecular function | 24 |
| Metabolism | 17 |
| Intracellular | 16 |
| Binding | 13 |
| Catalytic activity | 7 |
| Cytoplasm | 5 |
| Biosynthesis | 5 |
| Development | 5 |
| Protein binding | 4 |
| Nucleic acid metabolism | 4 |
Figure 1. Number of uniquely mapped reads and SNPs found in buffalo against each bovine chromosome
Percentage of repeat sequences present in Bubalus bubalis muscle TST exome and in whole genome
| Repeat | In captured genomic sequence (a) (%) | In whole genome sequence (b) (%) |
|---|---|---|
| Simple repeat | 0.09 | 4.00 |
| Satellite | 15.38 | 0.63 |
| Low complexity | 0.30 | 2.78 |
| LINE | 13.89 | 26.64 |
| SINE | 2.44 | 35.37 |
| LTR | 4.07 | 7.25 |
| DNA elements | 1.78 | 3.91 |
| RNA | 0.08 | 0.1 |
| Unknown | 0.02 | 0.06 |
| Total | 38.05 | 80.74 |
Note: Captured data were analyzed for repeats with RepeatMasker software. Capture resulted in an almost 40% reduction in repeat sequences. Percentages of repeats in captured genomic sequences (a) and in the whole genome sequence assembly of sequencing data (b) are shown. LINE, long interspersed nuclear element; SINE, short interspersed nuclear element; LTR, long terminal repeat.
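The totals in the repeat table, and the size of the reduction, can be checked from the per-class values. This is a sketch under stated assumptions: the variable names are introduced here for illustration, and reading the reported reduction as a percentage-point difference (rather than a relative change) is an interpretation, not something the source states.

```python
from decimal import Decimal

# (captured genomic sequence %, whole genome sequence %) per repeat class,
# copied from the table above. Strings keep the decimal values exact.
REPEATS = {
    "Simple repeat": ("0.09", "4.00"),
    "Satellite": ("15.38", "0.63"),
    "Low complexity": ("0.30", "2.78"),
    "LINE": ("13.89", "26.64"),
    "SINE": ("2.44", "35.37"),
    "LTR": ("4.07", "7.25"),
    "DNA elements": ("1.78", "3.91"),
    "RNA": ("0.08", "0.1"),
    "Unknown": ("0.02", "0.06"),
}

captured_total = sum(Decimal(captured) for captured, _ in REPEATS.values())
whole_total = sum(Decimal(whole) for _, whole in REPEATS.values())

# Absolute drop in repeat content, in percentage points.
point_difference = whole_total - captured_total

# Drop relative to the whole-genome repeat content, in percent.
relative_reduction = (point_difference / whole_total * 100).quantize(
    Decimal("0.01")
)

if __name__ == "__main__":
    print("captured total:", captured_total)
    print("whole genome total:", whole_total)
    print("percentage-point difference:", point_difference)
    print("relative reduction (%):", relative_reduction)
```

Keeping the two measures separate makes it explicit which kind of reduction a given figure refers to.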