
Next generation sequencing (NGS) database for tandem repeats with multiple pattern 2⁰-shaft multicore string matching.

Chinta Someswara Rao¹, S Viswanadha Raju².

Abstract

Next generation sequencing (NGS) technologies have been rapidly applied in biomedical and biological research in recent years. To provide a comprehensive NGS resource for such research, we considered 10 loci (also called codi or repeats in this paper): TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. We then developed the NGS Tandem Repeat Database (TandemRepeatDB), which covers these loci for all the chromosomes of the Homo sapiens, Callithrix jacchus, Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelii genome data sets. We find the successive occurrence frequency of these 10 simple sequence repeats (SSRs) in each genome data set on a chromosome-by-chromosome basis using multiple pattern 2⁰-shaft multicore string matching.


Keywords: Genome; NGS; SSR; String matching; TandemRepeatDB; chromosomes

Year: 2016 PMID: 26981434 PMCID: PMC4778683 DOI: 10.1016/j.gdata.2016.01.015

Source DB: PubMed Journal: Genom Data ISSN: 2213-5960

Introduction

Since the completion of the first human genome sequence, demand for cheaper and faster sequencing methods has increased greatly. This demand has driven the development of second-generation sequencing methods, also called next-generation sequencing (NGS). In this paper we developed the NGS TandemRepeatDB, which stores the successive occurrence frequency of SSRs in the considered genomes. Simple sequence repeats (SSRs) are tandemly repeated DNA sequences found in varying abundance in most genomes [1], [2]. These repeats have been extensively used for genetic mapping and population studies [3]. SSRs also provide molecular tools to understand spatial relationships between chromosome segments, which in turn aid in analyzing temporal relationships between species and genera [4]. In humans, about 3% of the genome is occupied by SSRs [5]. The study of repeat frequency and its distribution pattern in the genome is expected to help in understanding their significance. There is accumulating evidence to suggest that SSRs function to regulate gene expression [6], [7]. The availability of complete genome sequences for many organisms has made it possible to carry out genome-wide analyses. In the present study we screened all the chromosomes of Homo sapiens, Callithrix jacchus, Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelii [8] and studied the distribution and successive occurrence frequency of the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA loci [9], [10]. A few earlier studies [11], [12], [13], [14] have attempted to analyze the distribution of tandem repeats in human-like genomes, but they are confined to a single genome or a small set of genomes. Tandem repeat mining helps in understanding and addressing biological questions. It is used in diverse applications such as DNA fingerprinting, maternity identification, paternity identification, theft identification, suspect identification, and disease identification [15], [16], [9], [10].

Methodology

In the present paper, we constructed the TandemRepeatDB with the multiple pattern 2⁰-shaft multicore string matching algorithm. String matching [17], [18], [19], [20] is the process of identifying a pattern (P) in a given text (T). Here the chromosomes of the genomes are the text (T) and the loci are the patterns. The algorithm searches for multiple patterns concurrently on multicore processors, treating the text as a single part (2⁰ = 1 shaft). During TandemRepeatDB construction, the text file and the patterns are read, and the patterns are searched for in the text file (T) with this algorithm. If a perfect tandem repeat occurs, the successive logic is applied. Successive logic means looking for continuous, perfect, back-to-back occurrences of the same tandem repeat. If the number of successive occurrences is greater than 1, the information about the run is stored in the database. The database is constructed in MySQL using Java. The TandemRepeatDB construction process comprises four stages, and its complete architecture is shown in Fig. 1.
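The successive logic described above can be illustrated with a short sketch. This is not the authors' Java implementation: it is a naive left-to-right scan in Python, and the function name `successive_runs`, the 0-based positions and the toy sequence are all assumptions made for illustration.

```python
def successive_runs(text, pattern):
    """Return (position, count) for each run of back-to-back copies of
    `pattern` in `text` whose count is greater than 1 (0-based positions)."""
    runs = []
    m = len(pattern)
    i = 0
    while i <= len(text) - m:
        if text.startswith(pattern, i):
            # Count how many copies of the pattern follow each other directly.
            count = 1
            j = i + m
            while text.startswith(pattern, j):
                count += 1
                j += m
            if count > 1:
                runs.append((i, count))
                i = j      # continue scanning after the stored run
            else:
                i += 1     # a single occurrence is not stored
        else:
            i += 1
    return runs


# Toy sequence (not real genome data): three TAGA units, then a lone one.
toy_runs = successive_runs("CCTAGATAGATAGAGGTAGACC", "TAGA")
print(toy_runs)  # prints [(2, 3)]
```

Only the run of three units is reported; the isolated TAGA near the end has a successive size of 1 and is skipped, which mirrors the "size > 1" storage rule.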

Fig. 1

Architecture of TandemRepeatDB.

Reading: The text file is read. The pattern set is read.

Searching: All the patterns in the given set are read and categorized based on their rightmost characters. One pattern from a category is selected. The shift_left_to_right (Pm-1) function is applied to obtain the shift position. The multiple pattern 2⁰-shaft multicore string matching algorithm is then used for searching.

Search results: If a perfect repeat occurs, its successive occurrences are searched for.

Storing: If the number of perfect successive repeats is greater than 1, the run is stored in the TandemRepeatDB with the following information: sample_name, sample_chromosome_name, position, noofoccurences, and codi (the repeat name).

The multiple pattern 2⁰-shaft multicore string matching algorithm consists of five parts: input and output, initialization, a main function, a search function and a shift_left_to_right function. For input and output, the genome sets and patterns are taken as input, and the sample_id, sample_name, sample_chromosome_name, position, noofoccurences, and codi are returned as output. In the initialization, multi_pattern (all patterns in the set), n (text length), m (pattern length) and all other variables are initialized. In the main function the genome set is read chromosome by chromosome, and each chromosome is passed to the shift_left_to_right function. Once the shift value is received, the search function is called for all the patterns. The shift_left_to_right function applies the shift operation using the pattern's rightmost character and returns the shift position to the main function. The search process compares character by character from both directions until a complete match or a mismatch occurs. If a match occurs, successive occurrences of the pattern are searched for. If the number of successive occurrences is greater than 1, the data is stored in the TandemRepeatDB. After a mismatch or a complete match, the same procedure is repeated until the end of T.
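The categorization by rightmost character and the two-directional comparison can be sketched as follows. The text does not specify the shift_left_to_right function in enough detail to reproduce it, so this Python sketch advances one position at a time and leaves out the multicore distribution; the helper names are hypothetical, and all patterns are assumed to have the same length (true for the ten 4-base repeats used here).

```python
from collections import defaultdict


def group_by_last_char(patterns):
    """Categorize patterns by their rightmost character."""
    groups = defaultdict(list)
    for p in patterns:
        groups[p[-1]].append(p)
    return dict(groups)


def matches_two_ended(text, start, pattern):
    """Compare pattern and text window inward from both ends."""
    left, right = 0, len(pattern) - 1
    while left <= right:
        if (text[start + left] != pattern[left]
                or text[start + right] != pattern[right]):
            return False
        left += 1
        right -= 1
    return True


def search_all(text, patterns):
    """Return (position, pattern) for every occurrence of any pattern."""
    m = len(patterns[0])  # assumption: all patterns share one length
    groups = group_by_last_char(patterns)
    hits = []
    for start in range(len(text) - m + 1):
        # Only patterns whose rightmost character matches the last
        # character of the current window are candidates.
        for p in groups.get(text[start + m - 1], ()):
            if matches_two_ended(text, start, p):
                hits.append((start, p))
    return hits


# Toy text, not real genome data.
hits = search_all("GATAGATC", ["GATA", "TAGA", "TATC"])
print(hits)  # prints [(0, 'GATA'), (2, 'TAGA')]
```

Filtering on the window's last character means most windows are rejected after a single dictionary lookup, which is the practical benefit of grouping the patterns this way.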

Structure of the database

For each sample, a table named after the sample is created. The table contains sample_id, sample_name, sample_chromosome_name, position (the position at which the codi occurs), noofoccurences (the number of successive occurrences of the repeat) and codi (the name of the repeat). The table structure is shown in Table 1.

Table 1

Table Structure.

Field	Type
sample_id	text
sample_name	text
sample_chromosome_name	text
position	int(10)
noofoccurences	int(10)
codi	text
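As a rough illustration of Table 1, the sketch below creates a table with the same six fields and stores one record. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so the column types are SQLite equivalents rather than the exact `text` and `int(10)` definitions, and the inserted values are made-up placeholders, not data from the study.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the MySQL database.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE homo_sapiens (
        sample_id              TEXT,
        sample_name            TEXT,
        sample_chromosome_name TEXT,
        position               INTEGER,
        noofoccurences         INTEGER,
        codi                   TEXT
    )
    """
)

# Placeholder record: a run of 3 successive TAGA units (values are invented).
conn.execute(
    "INSERT INTO homo_sapiens VALUES (?, ?, ?, ?, ?, ?)",
    ("S1", "homo_sapiens", "1", 1024, 3, "TAGA"),
)

row = conn.execute(
    "SELECT codi, noofoccurences FROM homo_sapiens"
).fetchone()
print(row)
```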

The availability of NGS techniques has made genome sequences accessible. Studying the perfect successive occurrences of tandem repeats with a bioinformatics approach is therefore interesting and informative. In the remainder of the study, the perfect tandem repeats of all chromosomes in the H. sapiens, C. jacchus, C. sabaeus, G. gorilla, M. fascicularis, M. mulatta, N. leucogenys, P. troglodytes, P. anubis and P. abelii genomes are analyzed, and a brief note on the successive occurrence frequency of the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats is presented. In total, 259 chromosomes from these ten genomes are used, as shown in Table 2.

Table 2

Genome sequences used in the study.

Genome sequence name	Name & number of chromosomes	Total number of tandem repeats extracted (> 1)
Homo sapiens	1 to 22, MT, X, Y and Un (26)	11,99,985
Callithrix jacchus	1 to 22, X, Y and Un (25)	11,40,529
Chlorocebus sabaeus	1 to 29, MT, X, Y and Un (33)	11,13,445
Gorilla gorilla	1, 2A, 2B, 3 to 22, MT, X and Un (26)	11,63,843
Macaca fascicularis	1 to 20, MT, X and Un (23)	12,31,029
Macaca mulatta	1 to 20, MT, X and Un (23)	12,74,556
Nomascus leucogenys	1a, 2 to 6, 7b, 8 to 21, 22a, 23 to 25, X and Un (27)	11,71,594
Pan troglodytes	1, 2A, 2B, 3 to 22, MT, X, Y and Un (27)	12,76,766
Papio anubis	1 to 20, MT, X and Un (23)	13,51,393
Pongo abelii	1, 2A, 2B, 3 to 22, MT, X and Un (26)	13,81,887
Total (10 genomes)	259	1,23,05,027

Tandem repeat size analysis

In this section, the perfect successive tandem repeats are extracted and analyzed by executing SQL queries on the TandemRepeatDB for all the chromosomes of the considered genomes and for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. In the tables that follow, the first column gives the codi/repeat name, the second column gives the MAX number of codi in a run of successive occurrences, and the third column gives the number of times that MAX number appeared.
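The two summary columns can be obtained with two simple aggregate queries. The text of the paper's own Query1 and Query2 is not reproduced here, so the following is only a guess at their general shape, written in Python with sqlite3 as a stand-in for MySQL; the function name `max_run_summary`, the table name `toy_sample` and the rows are hypothetical.

```python
import sqlite3


def max_run_summary(conn, table, codi):
    """Return (MAX successive count, number of runs reaching that MAX).

    `table` must be a trusted identifier, since it is placed directly
    into the SQL text; `codi` is passed as a bound parameter.
    """
    max_query = f"SELECT MAX(noofoccurences) FROM {table} WHERE codi = ?"
    (max_count,) = conn.execute(max_query, (codi,)).fetchone()
    count_query = (
        f"SELECT COUNT(*) FROM {table} "
        "WHERE codi = ? AND noofoccurences = ?"
    )
    (times,) = conn.execute(count_query, (codi, max_count)).fetchone()
    return max_count, times


# Toy table with invented rows, only to show the queries running.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_sample (position INTEGER, noofoccurences INTEGER, codi TEXT)"
)
conn.executemany(
    "INSERT INTO toy_sample VALUES (?, ?, ?)",
    [(10, 5, "TAGA"), (90, 5, "TAGA"), (200, 3, "TAGA"), (300, 9, "CTTT")],
)

taga_summary = max_run_summary(conn, "toy_sample", "TAGA")
print(taga_summary)  # prints (5, 2)
```

Here the result would be read as "MAX number 5, appeared twice", in the same way as the rows of the summary tables.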

H. sapiens

H. sapiens is the binomial name of the human species. Homo is the human genus, which also includes Neanderthals and many other extinct species of hominid. In this paper, the multiple pattern 2⁰-shaft multicore string matching algorithm is used to retrieve the perfect successive tandem repeats from the H. sapiens genome, which consists of chromosomes 1 to 22, MT, X, Y and Un, as shown in Table 2. A total of 11,99,985 perfect successive repeats are extracted from these chromosomes and stored in the homo_sapiens table. Table 3 summarizes the MAX number of successive occurrences extracted from the homo_sapiens table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.

Table 3

Tandem repeat successive occurrences for all chromosomes of H.sapiens.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	21	Once
AGAA	42	Twice
GATA	22	Once
TCTA	25	Once
TCAT	12	Twice
GAAT	12	Once
AGAT	21	Once
CTTT	78	Once
TATC	25	Once
TCTG	12	Four Times

Query1 and Query2 are executed for the remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the homo_sapiens table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 2.

Fig. 2

Max number of successive occurrences of all repeats for all chromosomes of H.sapiens.

From Fig. 2, the following observations can be made. The CTTT tandem repeat has a maximum of 78 successive occurrences. The AGAA tandem repeat reaches its maximum of 42 successive occurrences twice. The TCTG tandem repeat reaches its maximum of 12 successive occurrences four times. The remaining tandem repeats have maxima between 12 and 25 successive occurrences. These observations may be useful in bioinformatics studies.

C. jacchus

The common marmoset (C. jacchus) is a New World monkey. It originally lived on the northeastern coast of Brazil. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the C. jacchus genome, which consists of chromosomes 1 to 22, X, Y and Un, as shown in Table 2. A total of 11,40,529 perfect successive repeats are extracted from these chromosomes and stored in the callithrix_jacchus table. Table 4 summarizes the MAX number of successive occurrences extracted from the callithrix_jacchus table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.

Table 4

Tandem repeat successive occurrences for all chromosomes of C.jacchus.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	21	Once
AGAA	57	Once
GATA	20	Once
TCTA	18	Twice
TCAT	14	Once
GAAT	13	Once
AGAT	21	Once
CTTT	51	Once
TATC	18	Thrice
TCTG	14	Once

Query1 and Query2 are executed for the remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the callithrix_jacchus table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 3.

Fig. 3

Max number of successive occurrences of all repeats for all chromosomes of C.jacchus.

From Fig. 3, the following observations can be made. The AGAA tandem repeat has a maximum of 57 successive occurrences. The CTTT tandem repeat has a maximum of 51 successive occurrences. The TATC tandem repeat reaches its maximum of 18 successive occurrences thrice. The remaining tandem repeats have maxima between 13 and 21 successive occurrences. These observations may be useful in bioinformatics studies.

C. sabaeus

The green monkey (C. sabaeus), also known as the sabaeus monkey or the Callithrix monkey, is an Old World monkey with golden-green fur and pale hands and feet. The tip of the tail is golden yellow, as are the backs of the thighs and the cheek whiskers. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the C. sabaeus genome, which consists of chromosomes 1 to 29, MT, X, Y and Un, as shown in Table 2. A total of 11,13,445 perfect successive repeats are extracted from these chromosomes and stored in the chlorocebus_sabaeus table. Table 5 summarizes the MAX number of successive occurrences extracted from the chlorocebus_sabaeus table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.

Table 5

Tandem repeat successive occurrences for all chromosomes of C.sabaeus.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	19	Twice
AGAA	54	Once
GATA	20	Once
TCTA	20	Twice
TCAT	14	Once
GAAT	14	Once
AGAT	20	Once
CTTT	42	Once
TATC	20	Once
TCTG	14	Once

Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the chlorocebus_sabaeus table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats is graphically shown in Fig. 4.

Fig. 4

Max number of successive occurrences of all repeats for all chromosomes of C.sabaeus.

From Fig. 4, the following observations can be made. The AGAA tandem repeat has a maximum of 54 successive occurrences. The CTTT tandem repeat has a maximum of 42 successive occurrences. The TCTA tandem repeat reaches its maximum of 20 successive occurrences twice. The remaining tandem repeats have maxima between 14 and 20 successive occurrences. These observations may be useful in bioinformatics studies.

G. gorilla

The western gorilla (G. gorilla) is a great ape, the type species as well as the most populous species of the genus Gorilla. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the G. gorilla genome, which consists of chromosomes 1, 2A, 2B, 3 to 22, MT, X and Un, as shown in Table 2. A total of 11,63,843 perfect successive repeats are extracted from these chromosomes and stored in the gorilla_gorilla table. Table 6 summarizes the MAX number of successive occurrences extracted from the gorilla_gorilla table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.

Table 6

Tandem repeat successive occurrences for all chromosomes of G.gorilla.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	19	Thrice
AGAA	41	Once
GATA	20	Once
TCTA	26	Once
TCAT	14	Once
GAAT	12	Ten times
AGAT	20	Twice
CTTT	66	Once
TATC	26	Once
TCTG	16	Once

Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the gorilla_gorilla table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 5.

Fig. 5

Max number of successive occurrences of all repeats for all chromosomes of G.gorilla.

From Fig. 5, the following observations can be made. The CTTT tandem repeat has a maximum of 66 successive occurrences. The AGAA tandem repeat has a maximum of 41 successive occurrences. The GAAT tandem repeat reaches its maximum of 12 successive occurrences ten times. The remaining tandem repeats have maxima between 14 and 26 successive occurrences. These observations may be useful in bioinformatics studies.

M. fascicularis

The crab-eating macaque (M. fascicularis), also known as the long-tailed macaque, is a cercopithecine primate native to Southeast Asia. It is referred to as the cynomolgus monkey in laboratories. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the M. fascicularis genome, which consists of chromosomes 1 to 20, MT, X and Un, as shown in Table 2. A total of 12,31,029 perfect successive repeats are extracted from these chromosomes and stored in the macaca_fascicularis table. Table 7 summarizes the MAX number of successive occurrences extracted from the macaca_fascicularis table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.

Table 7

Tandem repeat successive occurrences for all chromosomes of M.fascicularis.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	29	Once
AGAA	218	Once
GATA	29	Once
TCTA	33	Once
TCAT	19	Once
GAAT	14	Thrice
AGAT	28	Once
CTTT	221	Once
TATC	33	Once
TCTG	16	Once

Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the macaca_fascicularis table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 6.

Fig. 6

Max number of successive occurrences of all repeats for all chromosomes of M.fascicularis.

From Fig. 6, the following observations can be made. The CTTT tandem repeat has a maximum of 221 successive occurrences. The AGAA tandem repeat has a maximum of 218 successive occurrences. The GAAT tandem repeat reaches its maximum of 14 successive occurrences thrice. The remaining tandem repeats have maxima between 16 and 33 successive occurrences. These observations may be useful in bioinformatics studies.

M. mulatta

The rhesus macaque (M. mulatta) is one of the best-known species of Old World monkeys. It is listed as Least Concern in the IUCN Red List of Threatened Species in view of its wide distribution, presumed large population, and tolerance of a broad range of habitats. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the M. mulatta genome, which consists of chromosomes 1 to 20, MT, X and Un, as shown in Table 2. A total of 12,74,556 perfect successive repeats are extracted from these chromosomes and stored in the macaca_mulatta table. Table 8 summarizes the MAX number of successive occurrences extracted from the macaca_mulatta table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats.

Table 8

Tandem repeat successive occurrences for all chromosomes of M.mulatta.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	31	Once
AGAA	84	Once
GATA	31	Once
TCTA	21	Twice
TCAT	19	Once
GAAT	15	Once
AGAT	31	Once
CTTT	79	Once
TATC	21	Once
TCTG	12	Twice

The TAGA results are extracted from the table by executing Query1 and Query2. Query1 and Query2 are also executed for the remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the macaca_mulatta table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 7.

Fig. 7

Max number of successive occurrences of all repeats for all chromosomes of M.mulatta.

From Fig. 7, the following observations can be made. The AGAA tandem repeat has a maximum of 84 successive occurrences. The CTTT tandem repeat has a maximum of 79 successive occurrences. The TCTA tandem repeat reaches its maximum of 21 successive occurrences twice. The remaining tandem repeats have maxima between 12 and 31 successive occurrences. These observations may be useful in bioinformatics studies.

N. leucogenys

The northern white-cheeked gibbon (N. leucogenys) is a species of gibbon native to Southeast Asia. It is closely related to the southern white-cheeked gibbon, with which it was previously considered conspecific. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the N. leucogenys genome, which consists of chromosomes 1a, 2 to 6, 7b, 8 to 21, 22a, 23 to 25, X and Un, as shown in Table 2. A total of 11,71,594 perfect successive repeats are extracted from these chromosomes and stored in the nomascus_leucogenys table. Table 9 summarizes the MAX number of successive occurrences extracted from the nomascus_leucogenys table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats.

Table 9

Tandem repeat successive occurrences for all chromosomes of N.leucogenys.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	17	Thrice
AGAA	52	Once
GATA	17	Once
TCTA	22	Once
TCAT	12	Once
GAAT	11	Once
AGAT	17	Once
CTTT	33	Once
TATC	23	Once
TCTG	13	Once

The TAGA results are extracted from the table by executing Query1 and Query2. Query1 and Query2 are also executed for the remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the nomascus_leucogenys table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 8.

Fig. 8

Max number of successive occurrences of all repeats for all chromosomes of N.leucogenys.

From Fig. 8, the following observations can be made. The AGAA tandem repeat has a maximum of 52 successive occurrences. The CTTT tandem repeat has a maximum of 33 successive occurrences. The TAGA tandem repeat reaches its maximum of 17 successive occurrences thrice. The remaining tandem repeats have maxima between 11 and 23 successive occurrences. These observations may be useful in bioinformatics studies.

P. troglodytes

The common chimpanzee (P. troglodytes), also known as the robust chimpanzee, is a species of great ape. Colloquially, the common chimpanzee is often called the chimpanzee. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the P. troglodytes genome, which consists of chromosomes 1, 2A, 2B, 3 to 22, MT, X, Y and Un, as shown in Table 2. A total of 12,76,766 perfect successive repeats are extracted from these chromosomes and stored in the pan_troglodytes table. Table 10 summarizes the MAX number of successive occurrences extracted from the pan_troglodytes table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.

Table 10

Tandem repeat successive occurrences for all chromosomes of P.troglodytes.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	17	Four Times
AGAA	43	Once
GATA	18	Twice
TCTA	18	Once
TCAT	10	Five Times
GAAT	11	Once
AGAT	18	Once
CTTT	30	Once
TATC	19	Once
TCTG	13	Once

Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the pan_troglodytes table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 9.

Fig. 9

Max number of successive occurrences of all repeats for all chromosomes of P.troglodytes.

From Fig. 9, the following observations can be made. The AGAA tandem repeat has a maximum of 43 successive occurrences. The CTTT tandem repeat has a maximum of 30 successive occurrences. The TCAT tandem repeat reaches its maximum of 10 successive occurrences five times. The remaining tandem repeats have maxima between 11 and 19 successive occurrences. These observations may be useful in bioinformatics studies.

P. anubis

The olive baboon (P. anubis), also called the Anubis baboon, is a member of the family Cercopithecidae. The species is the most widely ranging of all baboons. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the P. anubis genome, which consists of chromosomes 1 to 20, MT, X and Un, as shown in Table 2. A total of 13,51,393 perfect successive repeats are extracted from these chromosomes and stored in the papio_anubis table. Table 11 summarizes the MAX number of successive occurrences extracted from the papio_anubis table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.

Table 11

Tandem repeat successive occurrences for all chromosomes of P.anubis.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	31	Once
AGAA	54	Once
GATA	31	Once
TCTA	22	Once
TCAT	15	Once
GAAT	14	Once
AGAT	32	Once
CTTT	47	Twice
TATC	21	Twice
TCTG	15	Once

Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the papio_anubis table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 10.

Fig. 10

Max number of successive occurrences of all repeats for all chromosomes of P.anubis.

From Fig. 10, the following observations can be made. The AGAA tandem repeat has a maximum of 54 successive occurrences. The CTTT tandem repeat reaches its maximum of 47 successive occurrences twice. The TATC tandem repeat reaches its maximum of 21 successive occurrences twice. The remaining tandem repeats have maxima between 14 and 32 successive occurrences. These observations may be useful in bioinformatics studies.

P. abelii

The Sumatran orangutan (P. abelii) is one of the two species of orangutans. Found only on the island of Sumatra in Indonesia, it is rarer than the Bornean orangutan. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the P. abelii genome, which consists of chromosomes 1, 2A, 2B, 3 to 22, MT, X and Un, as shown in Table 2. A total of 13,81,887 perfect successive repeats are extracted from these chromosomes and stored in the pongo_abelii table. Table 12 summarizes the MAX number of successive occurrences extracted from the pongo_abelii table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.

Table 12

Tandem repeat successive occurrences for all chromosomes of P. abelii.

codi/Repeat name	MAX number of codi in the successive occurrences	Number of times the MAX number appeared
TAGA	20	Once
AGAA	37	Once
GATA	19	Once
TCTA	18	Twice
TCAT	12	Once
GAAT	13	Once
AGAT	20	Once
CTTT	63	Once
TATC	19	Once
TCTG	11	Twice

Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the pongo_abelii table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 11.

Fig. 11

Max number of successive occurrences of all repeats for all chromosomes of P.abelii.

From Fig. 11, the following observations can be made. The CTTT tandem repeat has a maximum of 63 successive occurrences. The AGAA tandem repeat has a maximum of 37 successive occurrences. The TCTA tandem repeat reaches its maximum of 18 successive occurrences twice. The remaining tandem repeats have maxima between 11 and 20 successive occurrences. These observations may be useful in bioinformatics studies.

Conclusions

In this paper we developed the TandemRepeatDB, which provides single-portal access to perfect successive repeats in the genomes of H. sapiens, C. jacchus, C. sabaeus, G. gorilla, M. fascicularis, M. mulatta, N. leucogenys, P. troglodytes, P. anubis and P. abelii. To our knowledge, the database is the first of its kind to host perfect successive tandem repeats for the considered genomes. From the analysis of the records in the TandemRepeatDB, the CTTT and AGAA tandem repeats stand out: in every genome studied, these two repeats show the longest runs of successive occurrences. The TandemRepeatDB should be a valuable resource for researchers studying repeats in the above-mentioned genomes.

Conflict of interests

The authors declare no conflict of interest.


1. Neurodegenerative diseases. Origins of instability.

2. TRbase: a database relating tandem repeats to disease genes for the human genome.

3. Encoded evidence: DNA in forensic analysis.

4. DNA microsatellites: agents of evolution?

5. Simple sequence repeats as a source of quantitative genetic variation.

6. Simple sequence repeats in Escherichia coli: abundance, distribution, composition, and polymorphism.

7. A comprehensive genetic map of the human genome based on 5,264 microsatellites.

8. Microsatellites in different eukaryotic genomes: survey and analysis.

9. Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions.

10. SSRD: simple sequence repeats database of the human genome.
