Ma Liang1, Castle Raley2, Xin Zheng2, Geetha Kutty1, Emile Gogineni1, Brad T Sherman2, Qiang Sun2, Xiongfong Chen2, Thomas Skelly2, Kristine Jones2, Robert Stephens2, Bin Zhou3, William Lau3, Calvin Johnson3, Tomozumi Imamichi2, Minkang Jiang2, Robin Dewar2, Richard A Lempicki2, Bao Tran2, Joseph A Kovacs1, Da Wei Huang2,4.
Abstract
BACKGROUND: Gene isoforms are commonly found in both prokaryotes and eukaryotes. Since each isoform may perform a specific function in response to changing environmental conditions, studying the dynamics of gene isoforms is important for understanding biological processes and disease conditions. However, genome-wide identification of gene isoforms is technically challenging because of the high degree of sequence identity among isoforms. The traditional targeted sequencing approach, which involves Sanger sequencing of plasmid-cloned PCR products, has low throughput and is tedious and time-consuming. Next-generation sequencing technologies such as Illumina and 454 achieve high throughput, but their short read lengths are a critical barrier to accurate assembly of highly similar gene isoforms and may result in ambiguities and false joining during sequence assembly. More recently, third-generation sequencing, represented by the PacBio platform, has offered sufficient throughput and long reads covering the full length of typical genes, making it possible in principle to profile gene isoforms reliably. However, PacBio long reads are error-prone and cannot be analyzed effectively by traditional assembly programs.
Keywords: Bioinformatics analysis; Gene isoforms; Major surface glycoprotein; NGS; PacBio; Pneumocystis; Repetitive sequences; Uclust
Year: 2016 PMID: 27051465 PMCID: PMC4820869 DOI: 10.1186/s13040-016-0090-8
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Fig 1. Comparison of three methods for gene isoform identification. Lines with different colors represent different sequences
Fig 2. Schema of the clustering-based data analysis procedure using PacBio long reads. The different colors represent the reads belonging to different isoforms. The black stars represent sequencing errors. In this example, contigs 3 and 4 are merged because they are the plus and minus strands of the same isoform
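The strand-merging step described in the Fig 2 caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`reverse_complement`, `canonical`, `merge_strands`) and the toy sequences are hypothetical, and the sketch only merges contigs that are exact reverse complements of each other. Real contigs would likely need an alignment-based comparison that tolerates residual mismatches.

```python
from typing import Dict, List

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Pick one strand-independent representative of a sequence.

    A sequence and its reverse complement map to the same string:
    whichever of the two sorts first.
    """
    seq = seq.upper()
    return min(seq, reverse_complement(seq))


def merge_strands(contigs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group contig names whose sequences are the same isoform on opposite strands.

    Assumption: only exact reverse-complement matches are merged.
    """
    groups: Dict[str, List[str]] = {}
    for name, seq in contigs.items():
        groups.setdefault(canonical(seq), []).append(name)
    return groups


# Toy example (made-up sequences): contig4 is the minus strand of contig3.
toy = {"contig1": "ATGAAA", "contig3": "ATGGCC", "contig4": "GGCCAT"}
merged = merge_strands(toy)
```

Here `merged` places `contig3` and `contig4` in one group and leaves `contig1` on its own, mirroring the merge shown in the figure.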
Fig 3. High sequence identity of 10 msg isoforms of P. jirovecii previously identified by Sanger sequencing of plasmid-cloned PCR products [22]. Text labels on the left side and the top represent plasmid ID. Colors indicate different degrees of identity from low (blue) to high (red). Plasmids containing these 10 isoforms were mixed together to form a benchmark admixture, which was amplified by PCR and sequenced in parallel by PacBio and 454 sequencing
Reconstruction of an ~1.5 kb segment of 10 known msg isoforms of P. jirovecii from the benchmark isoform admixture by PacBio sequencing and clustering-based analysis (see Additional file 2: Table S2 and Additional file 3 for more details)
| Contig no. | Length^a (bp) | Matched msg isoform | Identity (%) |
|---|---|---|---|
| Contig0001 | 1587 | AR-Cl72 | 100 |
| Contig0002 | 1581 | AR-Cl59 | 99.94 |
| Contig0003 | 1584 | AR-Cl46 | 99.94 |
| Contig0004 | 1584 | AR-Cl14 | 99.87 |
| Contig0005 | 1581 | AR-Cl6 | 99.75 |
| Contig0006 | 1581 | AR-Cl24 | 99.87 |
| Contig0007 | 1593 | AR-Cl20 | 99.94 |
| Contig0008 | 1584 | AR-Cl44 | 99.81 |
| Contig0011 | 1584 | AR-Cl45 | 99.94 |
| Contig0014 | 1583 | AR-Cl58 | 99.87 |
^a Identical between the contigs and known msg isoforms
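To give a sense of scale for the identity values in the table above, the following Python sketch computes percent identity between two pre-aligned sequences. It is an illustration under stated assumptions, not the method used in the study: the function name `percent_identity` is hypothetical, identity is taken as matching columns divided by alignment length, and gap columns count as mismatches. The source does not state which definition was used.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity of two equal-length, pre-aligned sequences.

    Assumption: identity = matching columns / alignment length, and any
    column containing a gap character ('-') counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# Synthetic example: a single mismatch over 1581 aligned positions.
reference = "A" * 1581
contig = "A" * 1580 + "C"
identity = round(percent_identity(contig, reference), 2)
```

Under this definition, a single mismatch over 1581 positions gives 99.94 %, so values of this size in the table would correspond to roughly one or two differing positions per ~1.5 kb contig.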
Fig 4. Sensitivity of the new approach, based on PacBio sequencing and clustering analysis, in detecting minor gene isoforms in a mixture containing multiple isoforms. Twenty-two plasmids, representing 22 different P. jirovecii msg isoforms previously cloned [22], were mixed at various concentrations and amplified by PCR for the full-length msg coding region (~3 kb), followed by PacBio sequencing and clustering analysis. The read frequency and the concentrations of plasmids (indicated by diamonds) in the mixture are positively correlated. The concentration is the percentage of each plasmid DNA relative to the total amount of plasmid DNA in the mixture. The 4 msg isoforms that were not identified have the lowest concentrations (0.8–1.3 %) in the mixture
A partial list of P. jirovecii full-length msg isoforms (~3 kb) identified in a clinical sample by PacBio sequencing and clustering-based analysis
| Contig no. | Length (bp) | Matched known isoform (GenBank accession no.) | Status | Identity to the known isoform (%) | PCR verification |
|---|---|---|---|---|---|
| Contig0020 | 3086 | EF371022 | Known | 98.92 | ND |
| Contig0007 | 3086 | EF371023 | Known | 99.87 | ND |
| Contig0021 | 3008 | EF371024 | Known | 99.83 | ND |
| Contig0025 | 3011 | EF371025 | Known | 99.60 | ND |
| Contig0133 | 3068 | EF371026 | Known | 99.71 | ND |
| Contig0022 | 3104 | EF371028 | Known | 99.58 | ND |
| Contig0006 | 3062 | EF371029 | Known | 99.97 | ND |
| Contig0008 | 3041 | EF371030 | Known | 99.87 | ND |
| Contig0026 | 3032 | EF371031 | Known | 98.76 | ND |
| Contig0012 | 3005 | EF371032 | Known | 99.70 | ND |
| Contig0013 | 2996 | EF371033 | Known | 99.87 | ND |
| Contig0011 | 3092 | EF371035 | Known | 99.88 | ND |
| Contig0003 | 3038 | EF371036 | Known | 99.97 | ND |
| Contig0005 | 3002 | EF371038 | Known | 99.90 | ND |
| Contig0010 | 2996 | EF371040 | Known | 99.80 | ND |
| Contig0015 | 3129 | EF371041 | Known | 99.78 | ND |
| Contig0001 | 3002 | EF371042 | Known | 99.37 | ND |
| Contig0027 | 3065 | EF371045 | Known | 99.64 | ND |
| Contig0014 | 3077 | EF371050 | Known | 99.71 | ND |
| Contig0018 | 3044 | EF371051 | Known | 99.70 | ND |
| Contig0009 | 3029 | EF371052 | Known | 99.87 | ND |
| Contig0017 | 3023 | EF371053 | Known | 99.50 | ND |
| Contig0016 | 3050 | EF371055 | Known | 99.84 | ND |
| Contig0004 | 3026 | EF371056 | Known | 99.97 | ND |
| Contig0010b | 3060 | No | Novel | NA | 99.53 |
| Contig0004b | 3086 | No | Novel | NA | 99.45 |
| Contig0015b | 3039 | No | Novel | NA | 99.20 |
| Contig0053b | 3077 | No | Novel | NA | 98.91 |
| Contig0006b | 3062 | No | Novel | NA | 98.50 |
| Contig0054b | 3074 | No | Novel | NA | 98.32 |
| Contig0138b | 3041 | No | Novel | NA | ND |
A total of 72 unique msg isoforms were identified in this study; only 31 of them are shown in this table. The first 24 contigs matched in full length the 24 msg genes previously identified from the same clinical sample [22], as shown in the third column by GenBank accession no. NA, not applicable; ND, not determined by PCR. Additional file 4 contains a complete list of sequences for the 72 Msg isoforms