
MONO, DI and TRI SSRs data extraction & storage from 1403 virus genomes with next generation retrieval mechanism.

K V S S R Murthy, K V V Satyanarayana.

Abstract

Simple sequence repeats (SSRs) now play a dominant role in several areas of bioinformatics, such as new virus identification, DNA fingerprinting, paternity and maternity identification, disease identification, and the prediction of future disease risk. Because of their wide range of applications and their significance, SSRs have attracted the interest of many researchers. SSR extraction relies on retrieval algorithms, so improving the quality of the retrieval algorithm directly improves the relevance of the extracted results. This paper proposes a new retrieval mechanism that extracts MONO, DI and TRI patterns. To extract these patterns, the DNA sequences of 1403 virus genome data sets are considered and the different MONO, DI and TRI patterns are searched for in each genome sequence file. The proposed Next Generation Sequencing (NGS) retrieval mechanism extracted the MONO, DI and TRI patterns without missing any occurrence. It is observed that the retrieval mechanism reduces unnecessary comparisons. Finally, the extracted SSRs provide a useful, single-view resource for researchers.


Year:  2017        PMID: 28653026      PMCID: PMC5476967          DOI: 10.1016/j.dib.2017.06.008

Source DB:  PubMed          Journal:  Data Brief        ISSN: 2352-3409


Value of the data

Data sets obtained from virus genomes with the NGS retrieval process show high specificity.

These data suggest that SSR extraction is a useful method for providing information for various applications related to virus studies.

Access to the raw sequencing data for viruses allows researchers to perform further bioinformatics analysis based on their own computational algorithms.

Data

The database was developed using MySQL. The stored information includes virus names, genome ids, A, C, G and T percentages, tract length, category, motif types (MONO, DI and TRI), the motif sequences and their frequencies of occurrence in the entire genome. Actual data from the database are shown in Fig. 1.
Fig. 1

virus_category table actual data.


Structure of the database

In this paper, we consider three tables from the database and changed their structure to our own format so that additional analysis can be done easily. They are virus_category, virus_acgt_count and virus_ssrs.

Virus category

This table holds the virus categories taken from the virus files. Its structure is shown in Table 1 and the actual data in Fig. 1.
Table 1

virus_category.

Column        Type
virus_name    varchar(100)
genome_id     varchar(20)
category1     varchar(20)
category2     varchar(20)

Virus ACGT count

This table holds the A, C, G and T counts of each virus with their percentages, and the tract length. Its structure is shown in Table 2 and the actual data in Fig. 2.
Table 2

virus_acgt_count.

Column             Type
virus_name         varchar(100)
genome_id          varchar(20)
A_count_and_per    varchar(20)
C_count_and_per    varchar(20)
G_count_and_per    varchar(20)
T_count_and_per    varchar(20)
tract_length       int(15)
category           varchar(20)
Fig. 2

virus_acgt_count table actual data.
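
The values stored in virus_acgt_count can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `acgt_summary` is hypothetical, and it assumes that tract_length is the full sequence length and that percentages are taken relative to that length.

```python
from collections import Counter

def acgt_summary(sequence):
    """Return the tract length and a (count, percentage) pair for A, C, G and T."""
    seq = sequence.upper()
    counts = Counter(seq)
    length = len(seq)
    summary = {"tract_length": length}
    for base in "ACGT":
        n = counts.get(base, 0)
        # Assumption: percentage is relative to the whole sequence length.
        pct = 100.0 * n / length if length else 0.0
        summary[base] = (n, round(pct, 2))
    return summary

example = acgt_summary("AACGTTTA")
print(example)
```

How the count and percentage are packed into the single varchar column of Table 2 is not stated in the text, so the sketch keeps them as a tuple.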


Virus SSRs

This table holds the virus_name, genome_id, motif, frequency and position of each SSR. Its structure is shown in Table 3 and the actual data in Fig. 3.
Table 3

virus_ssrs.

Column        Type
virus_name    varchar(100)
genome_id     varchar(20)
motif         varchar(20)
frequency     int(10)
position      int(10)
Fig. 3

virus_ssrs table actual data.
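
The three table structures can be tried out with a short Python sketch. It is an illustration under assumptions, not the authors' setup: an in-memory SQLite database stands in for MySQL, the column names and types follow Tables 1 to 3, and the inserted row is made-up placeholder data.

```python
import sqlite3

# Column names and types follow Tables 1-3; SQLite is used here only as a
# stand-in for the MySQL database described in the text.
SCHEMA = """
CREATE TABLE virus_category (
    virus_name VARCHAR(100), genome_id VARCHAR(20),
    category1 VARCHAR(20), category2 VARCHAR(20));
CREATE TABLE virus_acgt_count (
    virus_name VARCHAR(100), genome_id VARCHAR(20),
    A_count_and_per VARCHAR(20), C_count_and_per VARCHAR(20),
    G_count_and_per VARCHAR(20), T_count_and_per VARCHAR(20),
    tract_length INT(15), category VARCHAR(20));
CREATE TABLE virus_ssrs (
    virus_name VARCHAR(100), genome_id VARCHAR(20),
    motif VARCHAR(20), frequency INT(10), position INT(10));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder row for illustration only; it is not taken from the data set.
conn.execute(
    "INSERT INTO virus_ssrs VALUES (?, ?, ?, ?, ?)",
    ("example_virus", "NC_000000", "AT", 3, 120),
)
rows = conn.execute(
    "SELECT motif, frequency, position FROM virus_ssrs WHERE genome_id = ?",
    ("NC_000000",),
).fetchall()
print(rows)
```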


Description

This section gives a detailed description of the 1403 virus genomes.

Category wise description

We used a total of 1403 virus genome sequences and categorized them as shown in Table A1 (presented in Appendix A). According to Table A1, the virus genomes are further sub-grouped into 49 categories, such as Amalgaviridae, Ampullaviridae and Anelloviridae. Among the 1403 genomes, 566 belong to the category "ssRNA positive-strand viruses, no DNA", 151 to ssRNA negative-strand viruses and 141 to Geminiviridae. Fig. 4 shows that these three categories account for the largest shares of the data set.
Table A1

Category wise virus genome sequences.

CATEGORY                                  COUNT
Amalgaviridae                             4
Ampullaviridae                            1
Anelloviridae                             6
Aumaivirus.                               1
Bacilladnavirus                           4
Baculoviridae                             1
Bicaudaviridae                            1
Birnaviridae                              4
Botybirnavirus.                           1
Caudovirales                              14
Caulimoviridae                            34
Chrysoviridae                             2
Circoviridae                              35
Corticoviridae                            1
Endornaviridae                            8
Fuselloviridae                            4
Geminiviridae                             141
Hepadnaviridae                            10
Herpesvirales                             2
Hypoviridae                               3
Inoviridae                                7
Lavidaviridae                             1
Ligamenvirales                            6
Microviridae                              5
Mimiviridae                               2
Nanoviridae                               5
Papanivirus.                              1
Papillomaviridae                          85
Partitiviridae                            21
Parvoviridae                              40
Polyomaviridae                            39
Poxviridae                                1
Reoviridae                                3
Retroviridae                              42
Salterprovirus                            2
Satellite Nucleic Acids                   75
Satellites                                4
ssRNA negative-strand viruses             151
ssRNA positive-strand viruses, no DNA     566
Totiviridae                               26
Turriviridae                              1
unassigned ssRNA viruses                  1
unclassified dsDNA phages.                1
unclassified dsDNA viruses.               2
unclassified Gemycircularvirus.           7
unclassified ssDNA viruses.               30
unclassified ssRNA viruses.               2
Total                                     1403
Fig. 4

category wise virus count.


Frequency description

We extracted the overall, MONO, DI and TRI frequencies from the virus_ssrs table; they are shown in Table 4. Among the extracted values, MONO motifs show the highest maximum frequency (99), which is also the overall maximum.
Table 4

Virus genome overall frequency, MONO, DI and TRI frequencies.

FREQUENCY    MIN    AVG                   MAX
OVERALL      1      1.2482250811894526    99
MONO         10     2.4448562907955393    99
DI           1      1.0749041913092998    9
TRI          1      1.0247784693226274    9
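
A summary of this kind can be sketched in Python as follows. This is an assumption-laden illustration rather than the query the authors ran: the function name `frequency_summary` is hypothetical, motifs are grouped by their length (1 = MONO, 2 = DI, 3 = TRI), and the sample rows are made up.

```python
from statistics import mean

def frequency_summary(rows):
    """rows: iterable of (motif, frequency). Returns {group: (min, avg, max)}."""
    labels = {1: "MONO", 2: "DI", 3: "TRI"}
    groups = {"OVERALL": []}
    for motif, freq in rows:
        groups["OVERALL"].append(freq)
        # Assumption: the motif length decides the group.
        groups.setdefault(labels[len(motif)], []).append(freq)
    return {name: (min(v), mean(v), max(v)) for name, v in groups.items()}

# Made-up (motif, frequency) rows, for illustration only.
sample = [("A", 4), ("G", 2), ("AT", 1), ("CA", 3), ("ACC", 1)]
summary = frequency_summary(sample)
for name, stats in summary.items():
    print(name, stats)
```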

Virus size description

In this section, we describe the genomes by size, based on SQL queries executed on virus_category for category-wise counts; the results are shown in Table A2 (presented in Appendix A). Table A2 summarizes the number of genomes in each size range for the various virus categories. The two Mimiviridae genomes are very large (greater than 1 Mb), while 81 genomes of ssRNA negative-strand viruses and 89 genomes of "ssRNA positive-strand viruses, no DNA" fall between 10 Kb and 50 Kb. Thirty-one virus genomes are smaller than 1 Kb.
Table A2

Virus genome sizes and their classification based on different size ranges.

Genome size range         Category                                  No. of genomes
Size < 1 Kb               Circoviridae                              1
                          Nanoviridae                               3
                          Papanivirus.                              1
                          Partitiviridae                            2
                          Satellite Nucleic Acids                   20
                          ssRNA negative-strand viruses             2
                          ssRNA positive-strand viruses, no DNA     2
>= 1 Kb and < 2 Kb        Aumaivirus.                               1
                          Circoviridae                              27
                          Nanoviridae                               2
                          Partitiviridae                            12
                          Reoviridae                                1
                          Satellite Nucleic Acids                   55
                          Satellites                                4
                          ssRNA negative-strand viruses             15
                          ssRNA positive-strand viruses, no DNA     12
                          unclassified ssDNA viruses.               6
>= 10 Kb and < 50 Kb      Ampullaviridae                            1
                          Caudovirales                              8
                          Endornaviridae                            7
                          Fuselloviridae                            4
                          Hypoviridae                               1
                          Lavidaviridae                             1
                          Ligamenvirales                            6
                          Retroviridae                              8
                          Salterprovirus                            2
                          ssRNA negative-strand viruses             81
                          ssRNA positive-strand viruses, no DNA     89
                          Totiviridae                               1
                          Turriviridae                              1
                          unclassified dsDNA viruses.               1
                          unclassified ssDNA viruses.               1
>= 50 Kb and < 100 Kb     Bicaudaviridae                            1
                          Caudovirales                              2
>= 100 Kb and < 500 Kb    Baculoviridae                             1
                          Caudovirales                              3
                          Herpesvirales                             2
                          Poxviridae                                1
Size > 1 Mb               Mimiviridae                               2
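
The size classification can be sketched in Python as below. This is a hypothetical helper, not the SQL used by the authors: it assumes 1 Kb = 1000 bp, and the ranges ">= 2 Kb and < 10 Kb" and ">= 500 Kb and < 1 Mb" are assumed fill-ins that do not appear among the ranges listed above.

```python
from collections import Counter

KB = 1000  # assumption: 1 Kb = 1000 bp

# (exclusive upper bound in bp, label); two of the ranges are assumed fill-ins.
SIZE_BINS = [
    (1 * KB, "< 1 Kb"),
    (2 * KB, ">= 1 Kb and < 2 Kb"),
    (10 * KB, ">= 2 Kb and < 10 Kb"),      # assumed range
    (50 * KB, ">= 10 Kb and < 50 Kb"),
    (100 * KB, ">= 50 Kb and < 100 Kb"),
    (500 * KB, ">= 100 Kb and < 500 Kb"),
    (1000 * KB, ">= 500 Kb and < 1 Mb"),   # assumed range
]

def size_range(length_bp):
    """Return the label of the size range that contains length_bp."""
    for upper, label in SIZE_BINS:
        if length_bp < upper:
            return label
    return ">= 1 Mb"

# Genome lengths (bp) quoted elsewhere in the article.
lengths = [216, 23471, 152844, 1241026]
counts = Counter(size_range(n) for n in lengths)
print(counts)
```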

MIN, MAX and AVG tract length description

We did a preliminary study on the genome sizes of all viruses, as shown in Table A3 (presented in Appendix A). Table A3 shows that the smallest virus genome belongs to Satellite Nucleic Acids, with a length of 216 bp, whereas the largest belongs to Mimiviridae, with a length of 1,241,026 bp. When the average genome sizes are compared by category, the average length of the Mimiviridae genomes is much higher than those of Herpesvirales and Baculoviridae (see Fig. 5). On average, Mimiviridae genomes are about 6.6 times larger than Herpesvirales genomes and about 7.4 times larger than Baculoviridae genomes.
Table A3

Virus genome sizes (bp), category wise.

Category                                  Smallest    Largest     Average
Amalgaviridae                             3110        3387        3314.0000
Ampullaviridae                            23471       23471       23,471.0000
Anelloviridae                             2109        3720        2782.8333
Aumaivirus.                               1151        1151        1151.0000
Bacilladnavirus                           5472        5914        5668.2500
Baculoviridae                             152844      152844      152,844.0000
Bicaudaviridae                            61833       61833       61,833.0000
Birnaviridae                              2744        3380        3203.5000
Botybirnavirus.                           6126        6126        6126.0000
Caudovirales                              7203        165318      58,854.2857
Caulimoviridae                            6845        9073        7683.9706
Chrysoviridae                             2860        3203        3031.5000
Circoviridae                              846         2883        1920.8286
Corticoviridae                            9935        9935        9935.0000
Endornaviridae                            9620        17236       13,734.1250
Fuselloviridae                            14634       23840       17,159.0000
Geminiviridae                             2456        3588        2664.9504
Hepadnaviridae                            2974        3328        3115.7000
Herpesvirales                             131808      208496      170,152.0000
Hypoviridae                               9406        12552       10,526.0000
Inoviridae                                5721        8339        6957.4286
Lavidaviridae                             17029       17029       17,029.0000
Ligamenvirales                            24302       40582       36,293.8333
Microviridae                              4070        6360        5200.4000
Mimiviridae                               1006757     1241026     1,123,891.5000
Nanoviridae                               965         1083        1010.2000
null                                      928         9877        4157.1429
Papanivirus.                              814         814         814.0000
Papillomaviridae                          6919        8484        7556.4353
Partitiviridae                            303         2315        1730.7143
Parvoviridae                              3726        6243        5048.6000
Polyomaviridae                            4629        6130        5056.7692
Poxviridae                                142509      142509      142,509.0000
Reoviridae                                1646        2752        2333.0000
Retroviridae                              3120        13056       8384.5238
Salterprovirus                            14255       15837       15,046.0000
Satellite Nucleic Acids                   216         1457        1127.6133
Satellites                                1326        1342        1335.2500
ssRNA negative-strand viruses             800         18688       8945.1523
ssRNA positive-strand viruses, no DNA     944         19901       7476.2845
Totiviridae                               2066        11394       5663.6538
Turriviridae                              16382       16382       16,382.0000
unassigned ssRNA viruses                  4312        4312        4312.0000
unclassified dsDNA phages.                8059        8059        8059.0000
unclassified dsDNA viruses.               7966        14914       11,440.0000
unclassified Gemycircularvirus.           2059        2218        2139.1429
unclassified ssDNA viruses.               1788        10503       3369.4333
unclassified ssRNA viruses.               5916        6195        6055.5000
Fig. 5

average tract length analysis.
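
The ratios between average genome sizes can be recomputed from the category averages in Table A3. The following Python lines are only a worked check of that arithmetic; the variable names are illustrative.

```python
# Average genome sizes in bp, copied from Table A3.
average_bp = {
    "Mimiviridae": 1123891.5,
    "Herpesvirales": 170152.0,
    "Baculoviridae": 152844.0,
}

# Ratio of the Mimiviridae average to each of the other two averages.
ratios = {
    name: average_bp["Mimiviridae"] / size
    for name, size in average_bp.items()
    if name != "Mimiviridae"
}
for name, ratio in ratios.items():
    print(f"Mimiviridae / {name}: {ratio:.2f}")
```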


MONO MOTIF description

A total of 4,692,149 continuous MONO SSRs were extracted from the 1403 genomes. Table A4 (presented in Appendix A) shows the maximum frequency of the MONO motifs.
Table A4

MONO SSRs.

VIRUS_NAME                                            genome_id  MOTIF  MAX FREQUENCY  Number of times occurred
Feline_astrovirus_2_uid218014                         NC_022249  G      99             1
                                                                 A      9              313
Abalone_herpesvirus_Victoria_AUS_2009_uid177933       NC_018874  A      9              3
Eupatorium_yellow_vein_virus_satellite_DNA_beta_ui…   NC_004515  A      9              3
Hedyotis_uncinella_yellow_mosaic_betasatellite_uid…   NC_023015  A      9              2
Honeysuckle_yellow_vein_mosaic_disease_associated_…   NC_009571  A      9              2
Malvastrum_yellow_mosaic_virus_satellite_DNA_beta_…   NC_008560  A      9              2
Mamestra_configurata_NPV_A_uid14168                   NC_003529  A      9              4
Megavirus_chiliensis_uid74349                         NC_016072  A      9              118
Moumouvirus_uid186430                                 NC_020104  A      9              71
                                                                 C      9              57
Abalone_herpesvirus_Victoria_AUS_2009_uid177933       NC_018874  C      9              2
Canine_papillomavirus___4_uid28243                    NC_010226  C      9              2
Feline_leukemia_virus_uid14686                        NC_001940  C      9              7
Potato_mop_top_virus_uid14789                         NC_003723  C      9              3
Tolypocladium_cylindrosporum_virus_1_uid61451         NC_014823  C      9              2
Trichechus_manatus_latirostris_papillomavirus_2_ui…   NC_016898  C      9              2
                                                                 T      9              268
Trematomus_polyomavirus_1_uid282773                   NC_026944  T      9              2
Canine_oral_papillomavirus_uid14326                   NC_001619  T      9              2
Chaetoceros_lorenzianus_DNA_Virus_uid63565            NC_015211  T      9              2
Citrus_chlorotic_dwarf_associated_virus_uid170854     NC_018151  T      9              2
Ferret_papillomavirus_uid218024                       NC_022253  T      9              2
Megavirus_chiliensis_uid74349                         NC_016072  T      9              115
Mamestra_configurata_NPV_A_uid14168                   NC_003529  T      9              4
Moumouvirus_uid186430                                 NC_020104  T      9              78
Abalone_herpesvirus_Victoria_AUS_2009_uid177933       NC_018874  T      9              2

DI MOTIF description

A total of 12,853,740 continuous DI SSRs were extracted from the 1403 genomes. Table A5 (presented in Appendix A) shows the maximum frequency of the DI motifs.
Table A5

DI SSRs.

VIRUS_NAME                                            genome_id  MOTIF  MAX FREQUENCY  Number of times occurred
                                                                 AC     9              1
Sauropus_leaf_curl_disease_associated_DNA_beta_uid…   NC_018671  AC     9              1
                                                                 AG     7              1
Vanilla_distortion_mosaic_virus_uid263828             NC_025250  AG     7              1
                                                                 AT     9              2
Moumouvirus_uid186430                                 NC_020104  AT     9              2
Zalophus_californianus_papillomavirus_1_uid65277      NC_015325  CG     7              1
                                                                 CT     7              3
Baboon_endogenous_virus_M7_uid222253                  NC_022517  CT     7              2
Cowpea_mosaic_virus_uid15283                          NC_003549  CT     7              1
                                                                 CA     9              1
Sauropus_leaf_curl_disease_associated_DNA_beta_uid…   NC_018671  CA     9              1
                                                                 GT     8              3
Spleen_focus_forming_virus_uid14641                   NC_001500  GT     8              1
Norway_rat_hepacivirus_1_uid267736                    NC_025672  GT     8              1
Human_papillomavirus_type_26_uid15507                 NC_001583  GT     8              1
                                                                 GA     6              2
Vanilla_distortion_mosaic_virus_uid263828             NC_025250  GA     6              1
Oat_golden_stripe_virus_uid15093                      NC_002358  GA     6              1
                                                                 GC     6              1
Zalophus_californianus_papillomavirus_1_uid65277      NC_015325  GC     6              1
                                                                 TA     9              1
Moumouvirus_uid186430                                 NC_020104  TA     9              1
                                                                 TC     7              1
Cowpea_mosaic_virus_uid15283                          NC_003549  TC     7              1
                                                                 TG     NULL           NULL

TRI MOTIF description

A total of 14,469,215 continuous TRI SSRs were extracted from the 1403 genomes. Table A6 (presented in Appendix A) shows the maximum frequency of the TRI motifs.
Table A6

TRI SSRs.

VIRUS_NAME                                            genome_id  MOTIF  MAX FREQUENCY  Number of times occurred
                                                                 AAC    7              1
Penicillium_chrysogenum_virus_uid16141                NC_007540  AAC    7              1
Santeuil_nodavirus_uid62547                           NC_015069  AAG    6              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  AAT    7              1
Penicillium_chrysogenum_virus_uid16141                NC_007540  ACA    7              1
                                                                 ACC    4              16
Zamilon_virophage_uid230580                           NC_022990  ACC    4              1
Human_papillomavirus_type_49_uid15455                 NC_001591  ACC    4              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  ACG    5              1
Microviridae_phi_CA82_uid70009                        NC_015785  ACT    6              1
Santeuil_nodavirus_uid62547                           NC_015069  AGA    7              1
Ursus_maritimus_papillomavirus_1_uid29915             NC_010739  AGC    6              1
                                                                 AGG    6              4
Procyon_lotor_papillomavirus_1_uid15468               NC_007150  AGG    6              1
Epsilonpapillomavirus_1_uid14220                      NC_004195  AGG    6              1
                                                                 AGT    6              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  AGT    6              1
Abalone_herpesvirus_Victoria_AUS_2009_uid177933       NC_018874  AGT    6              1
                                                                 AGT    6              2
Mamestra_configurata_NPV_A_uid14168                   NC_003529  ATA    6              1
Himetobi_P_virus_uid14801                             NC_003782  ATA    6              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  ATC    9              1
                                                                 ATG    5              2
Potato_yellow_dwarf_virus_uid74995                    NC_016136  ATG    5              1
Puumala_virus_uid14930                                NC_005225  ATG    5              1
                                                                 ATT    4              11
Mamestra_configurata_NPV_A_uid14168                   NC_003529  ATT    4              3
                                                                 CAA    6              2
Penicillium_chrysogenum_virus_uid16141                NC_007540  CAA    6              1
Cucumber_green_mottle_mosaic_virus_uid14681           NC_001801  CAA    6              1
                                                                 CAC    4              9
Zamilon_virophage_uid230580                           NC_022990  CAC    4              1
Magnaporthe_oryzae_chrysovirus_1_uid51685             NC_014465  CAC    4              1
                                                                 CAG    6              3
Ursus_maritimus_papillomavirus_1_uid29915             NC_010739  CAG    6              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  CAG    6              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  CAT    8              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  CAT    8              1
                                                                 CCA    4              13
Zamilon_virophage_uid230580                           NC_022990  CCA    4              1
Abalone_herpesvirus_Victoria_AUS_2009_uid177933       NC_018874  CCA    4              1
                                                                 CCG    4              5
Phlebiopsis_gigantea_mycovirus_dsRNA_1_uid46855       NC_013999  CCG    4              1
Halastavi_arva_RNA_virus_uid77939                     NC_016418  CCG    4              1
                                                                 CCT    6              3
Curionopolis_virus_uid264939                          NC_025354  CCT    6              1
Abalone_herpesvirus_Victoria_AUS_2009_uid177933       NC_018874  CCT    6              1
                                                                 CGA    5              2
Mamestra_configurata_NPV_A_uid14168                   NC_003529  CGA    5              1
Human_papillomavirus_109_uid36519                     NC_012485  CGA    5              1
                                                                 CGC    4              9
Phlebiopsis_gigantea_mycovirus_dsRNA_1_uid46855       NC_013999  CGC    4              1
Horseshoe_bat_hepatitis_B_virus_uid253463             NC_024444  CGC    4              1
                                                                 CGG    4              6
Woolly_monkey_sarcoma_virus_uid19547                  NC_009424  CGG    4              1
Abalone_herpesvirus_Victoria_AUS_2009_uid177933       NC_018874  CGG    4              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  CGT    6              1
Microviridae_phi_CA82_uid70009                        NC_015785  CTA    6              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  CTC    7              1
                                                                 CTG    4              9
Saguaro_cactus_virus_uid14981                         NC_001780  CTG    4              1
Abalone_herpesvirus_Victoria_AUS_2009_uid177933       NC_018874  CTG    4              1
Abalone_herpesvirus_Victoria_AUS_2009_uid177933       NC_018874  CTT    7              1
Santeuil_nodavirus_uid62547                           NC_015069  GAA    6              1
                                                                 GAC    5              2
Mamestra_configurata_NPV_A_uid14168                   NC_003529  GAC    5              2
                                                                 GAG    7              3
Procyon_lotor_papillomavirus_1_uid15468               NC_007150  GAG    7              1
Crocuta_papillomavirus_1_uid174774                    NC_018575  GAG    7              1
                                                                 GAT    5              2
Puumala_virus_uid14930                                NC_005225  GAT    5
Acidianus_bottle_shaped_virus_uid19605                NC_009452  GAT    5
Ursus_maritimus_papillomavirus_1_uid29915             NC_010739  GCA    7              1
                                                                 GCC    4              7
Raphanus_sativus_cryptic_virus_1_uid17127             NC_008190  GCC    4              1
Mycobacteriophage_Velveteen_uid215123                 NC_022060  GCC    4              1
Halorubrum_pleomorphic_virus_3_uid157259              NC_017088  GCG    5              1
                                                                 GCT    5              3
Saguaro_cactus_virus_uid14981                         NC_001780  GCT    5              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  GCT    5              1
                                                                 GGA    6              5
Procyon_lotor_papillomavirus_1_uid15468               NC_007150  GGA    6              1
Human_papillomavirus_type_103_uid17119                NC_008188  GGA    6              1
Halorubrum_pleomorphic_virus_3_uid157259              NC_017088  GGC    4              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  GGT    5              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  GTC    6              1
                                                                 GTG    4              7
Periplaneta_fuliginosa_densovirus_uid14091            NC_000936  GTG    4              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  GTG    4              1
                                                                 GTT    5              3
Cherry_rasp_leaf_virus_uid15131                       NC_006271  GTT    5              1
Ovine_enzootic_nasal_tumour_virus_uid15410            NC_007015  GTT    5              1
                                                                 TAA    6              2
Mamestra_configurata_NPV_A_uid14168                   NC_003529  TAA    6              1
Himetobi_P_virus_uid14801                             NC_003782  TAA    6              1
Microviridae_phi_CA82_uid70009                        NC_015785  TAC    6              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  TAG    5              1
                                                                 TAT    4              9
Yaba_like_disease_virus_uid14595                      NC_002642  TAT    4              1
Human_papillomavirus_54_uid15466                      NC_001676  TAT    4              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  TCA    8              1
                                                                 TCC    6              4
Curionopolis_virus_uid264939                          NC_025354  TCC    6              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  TCC    6              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  TCG    6              1
                                                                 TCT    5              3
Mamestra_configurata_NPV_A_uid14168                   NC_003529  TCT    5              1
Nyamanini_virus_uid38109                              NC_012703  TCT    5              1
                                                                 TGA    5              2
Puumala_virus_uid14930                                NC_005225  TGA    5              1
Cycas_necrotic_stunt_virus_uid15397                   NC_003791  TGA    5              1
                                                                 TGC    5              2
Chicken_gallivirus_1_uid259980                        NC_024770  TGC    5              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  TGC    5              1
                                                                 TGG    4              12
Peanut_clump_virus_uid14776                           NC_003668  TGG    4              1
Acinetobacter_bacteriophage_AP22_uid167576            NC_017984  TGG    4              1
                                                                 TGT    5              2
Cherry_rasp_leaf_virus_uid15131                       NC_006271  TGT    5              1
Ovine_enzootic_nasal_tumour_virus_uid15410            NC_007015  TGT    5              1
                                                                 TTA    5              2
Walleye_dermal_sarcoma_virus_uid14718                 NC_001867  TTA    5              1
Mamestra_configurata_NPV_A_uid14168                   NC_003529  TTA    5              1
                                                                 TTC    5              4
Squash_leaf_curl_China_virus____B__uid15591           NC_007339  TTC    5              1
Nyamanini_virus_uid38109                              NC_012703  TTC    5              1
                                                                 TTG    NULL

Experimental design, materials and methods

SSR extraction

The availability of next-generation sequencing techniques has made many genome sequences accessible, including those of viruses, fungi and bacteria. Studying the hyper-mutating SSRs [1], [2], [3], [4], [5], [6] in virus genomes with a bioinformatics approach is interesting and informative, because SSR mining not only helps in understanding and addressing biological questions but also helps in making the best use of these repeats in diverse applications. A few earlier studies have attempted to analyze the distribution of SSRs in virus genomes, but they were confined to a single genome or a small set of genomes. So far, there are no comprehensive reports in the literature that show the distribution of microsatellite repeats in all sequenced virus genomes. In this study, we analyzed SSRs in 1403 virus genomes and present a brief note on the distribution and frequency of these repeats. The approach scans the input virus genome sequence file against the pattern files for MONO, DI and TRI patterns and finds all occurrences of these patterns within the file using next generation retrieval mechanisms [7], [8], [9]. When a pattern occurs, the successive logic is applied, where successive logic means checking for continuous occurrences of the same pattern. If the successive pattern size is greater than 1, the information on the successive occurrence of the pattern is stored in the database. The process is shown in Fig. 6. The database is constructed in MySQL using Java.
Fig. 6

MONO,DI & TRI extraction process.

The SSR NGS retrieval algorithm listing gives a detailed description of the Next Generation Sequencing (NGS) retrieval algorithm. It consists of five segments: I/O, Main, Search, Tandem repeat checking and Database insertion. In the input segment, the virus and pattern files are taken as input. In the output segment, the extraction mechanism provides the number of occurrences and the positions of the MONO, DI and TRI patterns. In the Main segment, the lengths of the file and the pattern are read; then, for each pattern, the ngs_search, check_for_tandem_repeat and ngs_database_insertion segments are called over the entire length of the input file. In the Search segment, the pattern is searched for in the input file, and the occurrence count is incremented whenever a match occurs. In the Tandem repeat checking segment, the difference between successive occurrence positions is measured; if it equals the length of the pattern, the occurrences are counted as one tandem repeat. In the Database insertion segment, the virus name, genome id, pattern, count and position are stored in the database.
Specifications Table

Subject area: Bio-informatics
More specific subject area: Genomes of viruses
Type of data: Tables, figures
How data was acquired: Virus SSR marker extraction with NGS string matching
Data format: Analyzed
Experimental factors: MONO, DI and TRI SSRs (A, C, G, T, AC, AG, …, ACC, …) were targeted. The NGS retrieval process is applied to the virus genomes. MONO, DI and TRI SSR markers to be used for various detection purposes are extracted with this approach.
Experimental features: Each of the MONO, DI and TRI markers is extracted from the virus genomes. All the SSRs showed allele sizes of 1, 2 or 3 bp. These differences showed that there are some polymorphisms among the genomes in the number of SSR repeats.
Data source location: Bhimavaram, India
Data accessibility: The data is provided with this article
SSR NGS RETRIEVAL ALGORITHM
Input: Virus files and MONO, DI and TRI pattern files
Output: The number of occurrences and the positions of the MONO, DI and TRI patterns

/* Main */
n ← T.length, m ← P.length
for each MONO, DI & TRI pattern P do
    for i ← 0 to n-m do
        count ← ngs_search(T, P, i, count);
        tandem_repeat_count ← check_for_tandem_repeat(T, P, i, count);
        ngs_database_insertion(P, i, tandem_repeat_count)
    end for
end for

/* Search: compare P with T at offset i, scanning P from right to left */
int ngs_search(Char[] T, Char[] P, int i, int count)
begin
    j1 ← P.length - 1;
    while (j1 >= 0 && T[i + j1] == P[j1])
    do
        j1 ← j1 - 1;
    done;
    if (j1 == -1)
        count++;
    end if
    return count;
end ngs_search;

/* Tandem repeat checking */
int check_for_tandem_repeat(Char[] T, Char[] P, int i, int count)
begin
    if (diff_of_two_repeats == -P.length)
        tandem_repeat_count++;
    else
        tandem_repeat_count = tandem_repeat_count;
    end if
    return tandem_repeat_count;
end check_for_tandem_repeat;

/* Database insertion */
ngs_database_insertion(Char[] P, int i, int tandem_repeat_count)
begin
    insert into virus_ssrs(virus_name, genome_id, P, tandem_repeat_count, i);
end ngs_database_insertion;
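
The search and tandem repeat checking segments can be illustrated with a minimal Python sketch. It is not the authors' Java implementation: the function names are hypothetical, positions are 0-based, DI and TRI motifs made of a single repeated base (such as AA or CCC) are skipped on the assumption that they are already covered by MONO, and the threshold `min_copies=2` is one reading of "successive pattern size > 1".

```python
from itertools import product

def motifs(k):
    """Return all k-letter motifs over ACGT; for k > 1, skip single-base motifs such as AA."""
    result = []
    for letters in product("ACGT", repeat=k):
        if k == 1 or len(set(letters)) > 1:
            result.append("".join(letters))
    return result

def find_tandem_repeats(seq, pattern, min_copies=2):
    """Return (position, copies) for each run of back-to-back copies of pattern."""
    m = len(pattern)
    runs = []
    i = 0
    while i <= len(seq) - m:
        copies = 0
        # Count copies whose start positions are exactly one pattern length apart.
        while seq.startswith(pattern, i + copies * m):
            copies += 1
        if copies >= min_copies:
            runs.append((i, copies))
            i += copies * m  # continue scanning after the run
        else:
            i += 1
    return runs

def extract_ssrs(seq):
    """Yield (motif, frequency, position) rows for MONO, DI and TRI motifs."""
    for k in (1, 2, 3):
        for motif in motifs(k):
            for position, copies in find_tandem_repeats(seq, motif):
                yield motif, copies, position

rows = list(extract_ssrs("GGATATATCCCAAC"))
for row in rows:
    print(row)
```

Each yielded row has the shape of a virus_ssrs record without the virus_name and genome_id columns. Scanning every pattern separately means that shifted motifs such as AT and TA are both reported for the same stretch, as in Table A5.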

1.  A genome-wide analysis of simple sequence repeats in Apis cerana and its development as polymorphism markers.

Authors:  Lu Liu; Mingzhu Qin; Lin Yang; Zhenzhen Song; Li Luo; Hongyin Bao; Zhenggang Ma; Zeyang Zhou; Jinshan Xu
Journal:  Gene       Date:  2016-11-09       Impact factor: 3.688

2.  Next generation sequencing (NGS) database for tandem repeats with multiple pattern 2°-shaft multicore string matching.

Authors:  Chinta Someswara Rao; S Viswanadha Raju
Journal:  Genom Data       Date:  2016-01-29

3.  Similarity analysis between chromosomes of Homo sapiens and monkeys with correlation coefficient, rank correlation coefficient and cosine similarity measures.

Authors:  Chinta Someswara Rao; S Viswanadha Raju
Journal:  Genom Data       Date:  2016-01-07
