Marcel Martínez-Porchas, Francisco Vargas-Albores.
Abstract
The use of k-mers has been a successful strategy for improving metagenomics studies, including taxonomic classification and de novo assembly, and k-mers can also be used to retrieve sequences of interest from the available databases. The aim of this manuscript was to propose a simple but efficient strategy to generate k-mers and to use them to obtain and analyse 16S rRNA sequence fragments in silico. A total of 513,309 bacterial sequences contained in the SILVA database were considered for the study, and homemade PHP scripts were used to search for specific nucleotide chains, recover fragments of bacterial sequences, make calculations and organize information. Consensus sequences matching conserved regions were constructed by aligning most of the primers used in the literature. Sequences of k nucleotides (9- to 15-mers) were extracted from the generated primer contigs. Frequency analysis revealed that k-mer size was inversely related to the occurrence of k-mers in the different conserved regions, suggesting a stringency relationship: high numbers of duplicate reactions were observed with short k-mers, and a lower proportion of sequences was obtained with long ones, with the best results obtained using 12-mers. Using 12-mers with the proposed method to obtain and study sequences was found to be a reliable approach for the analysis of 16S rRNA sequences, and this strategy may be extendable to other biomarkers. Furthermore, additional applications, such as evaluating the degree of conservation, designing primers and performing other calculations, are proposed as examples.
Keywords: Bioinformatics; Biological sciences; Microbiology
Year: 2017 PMID: 28795166 PMCID: PMC5537200 DOI: 10.1016/j.heliyon.2017.e00370
Source DB: PubMed Journal: Heliyon ISSN: 2405-8440
Primer contigs generated by the assembly of all of the primers reported for each conserved region of the 16S rRNA gene. Locations are based on E. coli sequence.
| Name | Sequence | Location | References |
|---|---|---|---|
| 1 | AGAGTTTGATYMTGGCTCAG | 8-27 | [ |
| 2 | ASYGGCGNACGGGTGAGTAA | 100-119 | [ |
| 3 | ACTGAGAYACGGYCCARACTCCTACGGRNGGCNGCAGTRRGGAA | 320-363 | [ |
| 4 | GGCTAACTHCGTGNCVGCNGCYGCGGTAANAC | 504-535 | [ |
| 5a | GTGTAGMGGTGAAATKCGTAGAT | 682-704 | [ |
| 5b | CAAACRGGATTAGAWACCCNNGTAGTCCACGC | 778-809 | [ |
| 6a | AAANTYAAANRAATWGRCGGGGRCCCGCACAAG | 906-938 | [ |
| 6b | ATGTGGTTTAATTCGA | 948-963 | [ |
| 6c | CAACGCGARGAACCTTACC | 966-984 | [ |
| 7a | AGGTGNTGCATGGYYGYCGTCAGCTCGTGYCGTGAG | 1045-1080 | [ |
| 7b | TGTTGGGTTAAGTCCCRYAACGAGCGCAACCCT | 1082-1114 | [ |
| 8a | GGAAGGYGGGGAYGACG | 1176-1192 | [ |
| 8b | GGGCKACACACGYGCTAC | 1219-1236 | [ |
| 9 | GCCTTGYACWCWCCGCCCGTC | 1386-1406 | [ |
| 10 | GGGTGAAGTCRTAACAAGGTANCC | 1486-1509 | [ |
Fig. 1. Workflow established for obtaining primer contigs and the subsequent generation of k-mers.
Descriptive information of contigs generated after assembly of the reported primers for each conserved region of the 16S rRNA gene. The size of each contig, number of ambiguities detected and the number of iso-k-mers are shown. The number of generated k-mers is dependent on the primer contig size and is easily calculated (k-mers = primer contig size − k + 1), while the number of isomers is related to the number of degeneracies in each k-mer.
| Primer contig | Length | Ambiguities | Iso-9-mers | Iso-10-mers | Iso-11-mers | Iso-12-mers | Iso-13-mers | Iso-14-mers | Iso-15-mers |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 20 | 2 | 38 | 39 | 38 | 36 | 32 | 28 | 24 |
| 2 | 20 | 3 | 64 | 63 | 62 | 61 | 60 | 56 | 52 |
| 3 | 44 | 8 | 288 | 333 | 390 | 488 | 612 | 734 | 856 |
| 4 | 32 | 6 | 455 | 574 | 765 | 970 | 1,175 | 1,476 | 1,792 |
| 5a | 23 | 2 | 30 | 30 | 30 | 30 | 30 | 30 | 30 |
| 5b | 32 | 4 | 213 | 244 | 275 | 306 | 334 | 350 | 372 |
| 6a | 33 | 7 | 346 | 437 | 504 | 602 | 704 | 926 | 1,148 |
| 6b | 16 | 0 | 8 | 7 | 6 | 5 | 4 | 3 | 2 |
| 6c | 19 | 1 | 20 | 19 | 18 | 16 | 14 | 12 | 10 |
| 7a | 36 | 5 | 110 | 125 | 140 | 167 | 194 | 222 | 246 |
| 7b | 33 | 2 | 51 | 53 | 55 | 57 | 59 | 61 | 63 |
| 8a | 17 | 2 | 24 | 24 | 24 | 22 | 20 | 16 | 12 |
| 8b | 18 | 2 | 22 | 22 | 22 | 22 | 22 | 20 | 16 |
| 9 | 21 | 3 | 59 | 64 | 66 | 68 | 64 | 60 | 56 |
| 10 | 24 | 2 | 34 | 34 | 34 | 36 | 38 | 40 | 38 |
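The relationship described in the caption above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' PHP scripts: the function names (`kmers`, `expand`, `iso_kmer_count`) are hypothetical, the ambiguity table is the standard IUPAC nucleotide code, and it assumes that the iso-k-mer values are totals over all windows (without deduplication) and that the seven iso-k-mer columns correspond to k = 9 to 15.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes (DNA alphabet).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def kmers(contig: str, k: int) -> list[str]:
    """Return every window of length k; there are len(contig) - k + 1 of them."""
    return [contig[i:i + k] for i in range(len(contig) - k + 1)]


def expand(kmer: str) -> list[str]:
    """Expand a degenerate k-mer into all unambiguous sequences it encodes."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in kmer))]


def iso_kmer_count(contig: str, k: int) -> int:
    """Total number of unambiguous k-mers over all windows (assumption: not deduplicated)."""
    return sum(len(expand(window)) for window in kmers(contig, k))


contig_1 = "AGAGTTTGATYMTGGCTCAG"   # primer contig 1 (2 ambiguities)
contig_6b = "ATGTGGTTTAATTCGA"      # primer contig 6b (no ambiguities)

print(len(kmers(contig_1, 9)))        # 12 windows: 20 - 9 + 1
print(iso_kmer_count(contig_1, 9))    # 38
print(iso_kmer_count(contig_6b, 9))   # 8
```

Under these assumptions the sketch reproduces the first iso-k-mer value listed for contig 1 (38) and for contig 6b (8), where the absence of ambiguities makes the iso-k-mer count equal to the plain k-mer count.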
Fig. 2. Frequency of k-mers of 9 to 15 nucleotides detected in different conserved regions of 16S rRNA sequences contained in the SILVA database.
Fig. 3. Duplicate reactions detected within sequences obtained from the SILVA database when using 9- to 15-mers constructed from the primer contigs matching all conserved regions of the 16S rRNA.
Primer contigs constructed for the different conserved regions. 12-mers registering the highest frequency in each primer contig are underlined and in bold. The number and sequence of each primer contig, as well as position (following the numbering of E. coli rRNA) and frequency for each 12-mer are indicated. Minor primer contigs are italicized.
| Primer contig | Sequence | 12-mer position | 12-mer frequency |
|---|---|---|---|
| 1 | AGAGTTT**GATYMTGGCTCA**G | 15 | 195,901 (38.2%) |
| 2 | **ASYGGCGNACGG**GTGAGTAA | 100 | 405,570 (79.0%) |
| 3 | ACTGAGAYACGGYCCARACTCCTA**CGGRNGGCNGCA**GTRRGGAA | 344 | 500,253 (97.5%) |
| 4 | GGCTAACTHCGTG**NCVGCNGCYGCG**GTAANAC | 517 | 496,412 (96.7%) |
| 5b | CAAACRGGA**TTAGAWACCCNN**GTAGTCCACGC | 787 | 493,348 (96.1%) |
| 6a | AAANTYAAA**NRAATWGRCGGG**GRCCCGCACAAG | 915 | 501,792 (97.8%) |
| 7a | AGGTGNTGCAT**GGYYGYCGTCAG**CTCGTGYCGTGAG | 1056 | 499,976 (97.4%) |
| 8a | **GGAAGGYGGGGA**YGACG | 1176 | 457,537 (89.1%) |
| 9 | GCCT**TGYACWCWCCGC**CCGTC | 1390 | 388,911 (75.8%) |
| 10 | GGGTG**AAGTCRTAACAA**GGTANCC | 1491 | 172,918 (33.7%) |
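How a degenerate 12-mer might be located in database sequences, and its frequency computed, can be sketched as follows. This is a hedged illustration rather than the published method: `kmer_pattern` and `frequency` are hypothetical helper names, the three toy sequences are invented for the example and are not SILVA records, and the conversion of U to T assumes the database sequences may be in the RNA alphabet. The 12-mer is taken from primer contig 3 (location 320-363) at the reported position 344, that is, at offset 344 − 320 = 24 within the contig.

```python
import re

# IUPAC ambiguity codes written as regular-expression character classes.
IUPAC_CLASS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def kmer_pattern(kmer: str) -> re.Pattern:
    """Compile a degenerate k-mer into a regex matching all of its isomers."""
    return re.compile("".join(IUPAC_CLASS[base] for base in kmer))


def frequency(kmer: str, sequences: list[str]) -> tuple[int, float]:
    """Count sequences containing the k-mer and return (hits, percentage)."""
    pattern = kmer_pattern(kmer)
    # Assumption: sequences may use the RNA alphabet, so U is mapped to T.
    hits = sum(1 for s in sequences if pattern.search(s.upper().replace("U", "T")))
    return hits, 100.0 * hits / len(sequences)


contig_3 = "ACTGAGAYACGGYCCARACTCCTACGGRNGGCNGCAGTRRGGAA"  # location 320-363
offset = 344 - 320                    # reported 12-mer position minus contig start
kmer_12 = contig_3[offset:offset + 12]

# Invented toy sequences for illustration only (not SILVA data).
toy_sequences = [
    "ACTCCTACGGGAGGCAGCAGT",
    "ACUCCUACGGGUGGCUGCAGU",
    "ACTCCTACGGTAGGCAGCAGT",  # T at the R position, so no match
]

hits, percent = frequency(kmer_12, toy_sequences)
print(kmer_12, hits, round(percent, 1))   # CGGRNGGCNGCA 2 66.7
```

Compiling the degenerate k-mer into a single pattern avoids enumerating every isomer, which keeps the search cheap even for windows containing several N positions.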
Fig. 4. Proportion of C1 sequences obtained when using 12-mers matching C3 located at different nucleotide positions. For example, C1 was not detected in sequences when C3 was located at position 295 or lower, whereas when C3 was located at position 340 or higher, 80% or more of the sequences contained C1. The cumulative percentage of C3-positive sequences is indicated by the step line.
Reactions of primers used for DGGE and for the most frequent 12-mers of regions C3 and C6. The vast majority of the SILVA database sequences reacted at both ends, indicating a possible amplification. The reaction occurred only at one end in ∼8% and 4% of sequences when primers and 12-mers were used, respectively. Less than 0.5% did not react with any primer or 12-mer.
| Reaction | DGGE Primers | 12-mers | Primers & 12-mers |
|---|---|---|---|
| Both ends | 468,079 (91.19%) | 489,734 (95.41%) | 466,499 (90.88%) |
| Forward | 20,472 (3.99%) | 10,517 (2.05%) | 8,493 (1.65%) |
| Reverse | 22,376 (4.36%) | 12,058 (2.35%) | 11,386 (2.22%) |
| None | 2,382 (0.46%) | 1,000 (0.19%) | 888 (0.17%) |
| Total | 513,309 | 513,309 | 487,266 (94.93%) |
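The four reaction categories in the table above can be tallied with a short sketch like the one below. It is an assumption-laden illustration: `classify`, `summarize` and `percentages` are hypothetical names, and each sequence is assumed to have already been reduced to a pair of booleans saying whether the forward and the reverse probe matched. The last lines recompute the percentages of the DGGE primers column from its counts.

```python
from collections import Counter

LABELS = ("Both ends", "Forward", "Reverse", "None")


def classify(forward_hit: bool, reverse_hit: bool) -> str:
    """Map a (forward, reverse) match pair onto one of the four table rows."""
    if forward_hit and reverse_hit:
        return "Both ends"
    if forward_hit:
        return "Forward"
    if reverse_hit:
        return "Reverse"
    return "None"


def percentages(counts: dict[str, int]) -> dict[str, float]:
    """Express each count as a percentage of the column total (2 decimals)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


def summarize(hit_pairs: list[tuple[bool, bool]]) -> dict[str, tuple[int, float]]:
    """Tally match pairs into (count, percentage) per reaction category."""
    tally = Counter(classify(f, r) for f, r in hit_pairs)
    counts = {label: tally[label] for label in LABELS}
    shares = percentages(counts)
    return {label: (counts[label], shares[label]) for label in LABELS}


# Counts from the "DGGE Primers" column of the table.
dgge = {"Both ends": 468079, "Forward": 20472, "Reverse": 22376, "None": 2382}
print(sum(dgge.values()))   # 513309
print(percentages(dgge))    # {'Both ends': 91.19, 'Forward': 3.99, 'Reverse': 4.36, 'None': 0.46}
```

The recomputed shares agree with the tabulated ones, and the one-end reactions (3.99% + 4.36%) account for the roughly 8% quoted in the caption.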
Fig. 5. Alignment of primers for DGGE and the primer contig. The most frequent 12-mer is underlined, while the difference G8 of the primer, which corresponds to R4 of the 12-mer, is shaded.