
Next generation sequencing (NGS) database for tandem repeats with multiple pattern 2°-shaft multicore string matching.

Chinta Someswara Rao, S Viswanadha Raju.

Abstract

Next generation sequencing (NGS) technologies have been rapidly applied in biomedical and biological research in recent years. To provide a comprehensive NGS resource for this research, we considered 10 loci/codi/repeats: TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. We then developed the NGS Tandem Repeat Database (TandemRepeatDB) for all the chromosomes of the Homo sapiens, Callithrix jacchus, Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelii genome data sets for all of these loci. We find the successive occurrence frequency of all 10 SSRs (simple sequence repeats) in these genome data sets on a chromosome-by-chromosome basis with multiple pattern 2° shaft multicore string matching.


Keywords:  Genome; NGS; SSR; String matching; TandemRepeatDB; chromosomes

Year:  2016        PMID: 26981434      PMCID: PMC4778683          DOI: 10.1016/j.gdata.2016.01.015

Source DB:  PubMed          Journal:  Genom Data        ISSN: 2213-5960


Introduction

Since the completion of the first human genome sequence, demand for cheaper and faster sequencing methods has increased greatly. This demand has driven the development of second-generation sequencing methods, or next-generation sequencing (NGS). In this paper we developed the NGS TandemRepeatDB, which stores the successive occurrence frequency of SSRs from the considered genomes. Simple sequence repeats (SSRs) are tandemly repeated DNA sequences found in varying abundance in most genomes [1], [2]. These repeats have been extensively used for genetic mapping and population studies [3]. SSRs also provide molecular tools to understand spatial relationships between chromosome segments, which in turn aid in analyzing temporal relationships between species and genera [4]. In humans about 3% of the genome is occupied by SSRs [5]. The study of repeat frequency and its distribution pattern in the genome is expected to help in understanding their significance. There is accumulating evidence to suggest that SSRs function to regulate gene expression [6], [7]. The availability of complete genome sequences for many organisms has made it possible to carry out genome-wide analyses. In the present study we have screened all the chromosomes of Homo sapiens, Callithrix jacchus, Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelii [8] and studied the distribution and successive occurrence frequency of the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA loci [9], [10]. A few earlier studies [11], [12], [13], [14] have attempted to analyze the distribution of tandem repeats in human-like genomes, but they are confined to a single genome or a small set of genomes. Tandem repeat mining helps in understanding and addressing biological questions. It is used in diverse applications such as DNA fingerprinting, maternity identification, paternity identification, theft identification, suspect findings, and disease identification [15], [16], [9], [10].

Methodology

In the present paper, we constructed the TandemRepeatDB with the multiple pattern 2° shaft multicore string matching algorithm. String matching [17], [18], [19], [20] is the process of identifying a pattern (P) in a given text (T). Here the chromosomes of the genomes are considered as the text (T) and the loci are considered as the patterns. The multiple pattern 2° shaft multicore string matching algorithm searches for the multiple patterns concurrently in a single part (2° = 1 shaft) with multicore processors. In the TandemRepeatDB construction, the text file and the patterns are read, and the patterns are searched for in the text file (T) with this algorithm. If a perfect tandem repeat occurs, the successive logic is applied. Successive logic means the continuous perfect occurrence of the same tandem repeat. If the successive tandem repeat size is > 1, the successive occurrence information for the tandem repeat is stored in the database. The database is constructed in MySQL using Java. The TandemRepeatDB construction process comprises four stages, and its complete architecture is shown in Fig. 1.
Fig. 1

Architecture of TandemRepeatDB.

The four stages are as follows.

Reading: The text file is read. The pattern set is read.

Searching: All the patterns from the given set are read and categorized based on their rightmost characters. One of the patterns from one category is selected. The shift_left_to_right (Pm-1) function is applied to obtain the shift position. The multiple pattern 2° shaft multicore string matching algorithm is selected for searching.

Search results: If a perfect repeat occurs, successive occurrences are searched for.

Storing: If the perfect successive repeat size is > 1, it is stored in the TandemRepeatDB with the following information: sample_name, sample_chromosome_name, position, noofoccurences and codi/repeat name.

The multiple pattern 2° shaft multicore string matching algorithm consists of input and output, initialization, a main function, a search function and a shift_left_to_right function. In the input and output step, the genome sets and patterns are taken as input, and the sample_id, sample_name, sample_chromosome_name, position, noofoccurences and codi are returned as output. In the initialization, multi_pattern (all patterns in the set), n (text length), m (pattern length) and all other variables are initialized. In the main function, the genome set is read on a chromosome-by-chromosome basis, and each individual chromosome is passed to the shift_left_to_right function. Once the shift value is received, the search function is called for all the patterns. The shift_left_to_right function applies the shift operation using the pattern's rightmost character and returns the shift position to the main function. The search process compares character by character from both directions until a complete match or a mismatch occurs. If a match occurs, the successive occurrences of the pattern are searched for. If the successive occurrence size is greater than 1, the data is stored in the TandemRepeatDB. After a mismatch or a complete match, the same procedure is repeated until the end of T.
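The successive logic described above can be illustrated with a short Python sketch. This is not the authors' 2° shaft multicore algorithm: it is a plain single-threaded left-to-right scan, and the function name find_successive_repeats, the 0-based positions and the example sequence are assumptions made for illustration only. It records a run only when the same pattern occurs more than once back to back, which mirrors the "size > 1" storing rule.

```python
from typing import List, Tuple


def find_successive_repeats(text: str, patterns: List[str]) -> List[Tuple[int, int, str]]:
    """Return (position, noofoccurences, codi) for every run of more than one
    back-to-back perfect copy of a pattern. Positions are 0-based (an assumption)."""
    hits = []
    for pattern in patterns:
        m = len(pattern)
        if m == 0:
            continue  # an empty pattern cannot form a tandem repeat
        i = 0
        while i + m <= len(text):
            if text.startswith(pattern, i):
                # Count how many perfect copies follow each other without a gap.
                count = 1
                while text.startswith(pattern, i + count * m):
                    count += 1
                if count > 1:
                    hits.append((i, count, pattern))
                    i += count * m  # skip past the whole run
                    continue
            i += 1
    hits.sort()
    return hits


# Hypothetical toy sequence, not taken from any genome in the study.
example = find_successive_repeats("CCTAGATAGATAGACC", ["TAGA", "GATA"])
print(example)
```

In a real run each chromosome would be one text, and each returned tuple would become one database row together with the sample and chromosome names.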

Structure of the database

A table named after each sample is created. The table contains sample_id, sample_name, sample_chromosome_name, position (the position at which the codi occurs), noofoccurences (the number of successive occurrences of the repeat) and codi (the name of the repeat). The table structure is shown in Table 1.
Table 1

Table Structure.

Column | Type
sample_id | text
sample_name | text
sample_chromosome_name | text
position | int(10)
noofoccurences | int(10)
codi | text
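As a rough illustration of the structure in Table 1, the sketch below creates one per-sample table and stores a single row. The paper builds the database in MySQL from Java; Python's built-in sqlite3 module is used here only as a stand-in so that the example is self-contained, and the inserted row values are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL database in the paper.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE homo_sapiens ("
    "sample_id TEXT, "
    "sample_name TEXT, "
    "sample_chromosome_name TEXT, "
    "position INTEGER, "
    "noofoccurences INTEGER, "
    "codi TEXT)"
)

# Illustrative row: TAGA repeated 3 times in a row at position 2 of chromosome 1.
row = ("1", "homo_sapiens", "1", 2, 3, "TAGA")
conn.execute("INSERT INTO homo_sapiens VALUES (?, ?, ?, ?, ?, ?)", row)
conn.commit()

stored = conn.execute("SELECT * FROM homo_sapiens").fetchall()
print(stored)
```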
The availability of NGS techniques has made genome sequences accessible. Studying the perfect successive occurrences of tandem repeats with a bioinformatics approach is both interesting and informative. In the remaining part of the study, perfect tandem repeats of all chromosomes in the H. sapiens, C. jacchus, C. sabaeus, G. gorilla, M. fascicularis, M. mulatta, N. leucogenys, P. troglodytes, P. anubis and P. abelii genomes are analyzed, and a brief note on the successive occurrence frequency of the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats is presented. In this study, 259 chromosomes of the H. sapiens, C. jacchus, C. sabaeus, G. gorilla, M. fascicularis, M. mulatta, N. leucogenys, P. troglodytes, P. anubis and P. abelii genomes have been used, as shown in Table 2.
Table 2

Genome sequences used in the study.

Genome sequence name | Name & number of chromosomes | Total number of tandem repeats extracted (> 1)
Homo sapiens | 1 to 22, MT, X, Y and Un (26) | 11,99,985
Callithrix jacchus | 1 to 22, X, Y and Un (25) | 11,40,529
Chlorocebus sabaeus | 1 to 29, MT, X, Y and Un (33) | 11,13,445
Gorilla gorilla | 1, 2A, 2B, 3 to 22, MT, X and Un (26) | 11,63,843
Macaca fascicularis | 1 to 20, MT, X and Un (23) | 12,31,029
Macaca mulatta | 1 to 20, MT, X and Un (23) | 12,74,556
Nomascus leucogenys | 1a, 2 to 6, 7b, 8 to 21, 22a, 23 to 25, X and Un (27) | 11,71,594
Pan troglodytes | 1, 2A, 2B, 3 to 22, MT, X, Y and Un (27) | 12,76,766
Papio anubis | 1 to 20, MT, X and Un (23) | 13,51,393
Pongo abelii | 1, 2A, 2B, 3 to 22, MT, X and Un (26) | 13,81,887
Total: 10 genomes | 259 | 1,23,05,027
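The totals in the last row of Table 2 can be checked with a few lines of Python. The counts use lakh-style digit grouping (for example 11,99,985 is 1,199,985), so the sketch simply strips the commas before summing. The helper name parse_grouped and the dictionary layout are assumptions made for this check; the numbers are copied from Table 2.

```python
def parse_grouped(number: str) -> int:
    """Convert a comma-grouped count such as '11,99,985' to an integer."""
    return int(number.replace(",", ""))


# (number of chromosomes, repeats extracted) per genome, copied from Table 2.
table2 = {
    "Homo sapiens": (26, "11,99,985"),
    "Callithrix jacchus": (25, "11,40,529"),
    "Chlorocebus sabaeus": (33, "11,13,445"),
    "Gorilla gorilla": (26, "11,63,843"),
    "Macaca fascicularis": (23, "12,31,029"),
    "Macaca mulatta": (23, "12,74,556"),
    "Nomascus leucogenys": (27, "11,71,594"),
    "Pan troglodytes": (27, "12,76,766"),
    "Papio anubis": (23, "13,51,393"),
    "Pongo abelii": (26, "13,81,887"),
}

total_chromosomes = sum(n for n, _ in table2.values())
total_repeats = sum(parse_grouped(r) for _, r in table2.values())
print(len(table2), total_chromosomes, total_repeats)
```

Both sums agree with the totals row: 259 chromosomes and 1,23,05,027 (12,305,027) repeats across the 10 genomes.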

Tandem repeat size analysis

In this section, the perfect successive tandem repeats are extracted and analyzed by executing SQL queries on the TandemRepeatDB for all the chromosomes of the considered genomes corresponding to TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. Notations used in tables: First column refers to codi/repeat name. Second column refers to MAX number of codi in the successive occurrences. Third column refers to number of times the MAX number appeared.
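The second and third columns described above could be obtained with two simple SQL statements per repeat. The text of the queries used in the paper is not reproduced in this copy, so the two statements below are hypothetical reconstructions, run against an in-memory SQLite table (a stand-in for the MySQL table) filled with invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE homo_sapiens ("
    "sample_id TEXT, sample_name TEXT, sample_chromosome_name TEXT, "
    "position INTEGER, noofoccurences INTEGER, codi TEXT)"
)

# Invented rows for illustration only; they are not results from the paper.
rows = [
    ("1", "homo_sapiens", "1", 100, 3, "TAGA"),
    ("2", "homo_sapiens", "2", 250, 7, "TAGA"),
    ("3", "homo_sapiens", "X", 900, 7, "TAGA"),
    ("4", "homo_sapiens", "4", 40, 9, "CTTT"),
]
conn.executemany("INSERT INTO homo_sapiens VALUES (?, ?, ?, ?, ?, ?)", rows)

# Hypothetical first query: MAX number of codi in the successive occurrences.
max_run = conn.execute(
    "SELECT MAX(noofoccurences) FROM homo_sapiens WHERE codi = ?",
    ("TAGA",),
).fetchone()[0]

# Hypothetical second query: number of times that MAX value appears.
times_seen = conn.execute(
    "SELECT COUNT(*) FROM homo_sapiens WHERE codi = ? AND noofoccurences = ?",
    ("TAGA", max_run),
).fetchone()[0]

print(max_run, times_seen)
```

Repeating the same pair of statements with a different codi value and table name would fill one row of each per-genome summary table.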

H. sapiens

H. sapiens is the binomial name for the human species. Homo is the human genus, which also includes Neanderthals and many other extinct species of hominid. In this paper, the multiple pattern 2° shaft multicore string matching algorithm is used to retrieve the perfect successive tandem repeats from the H. sapiens genome, which consists of chromosomes 1 to 22, MT, X, Y and Un, as shown in Table 2. A total of 11,99,985 perfect successive repeats are extracted from these chromosomes and stored in the homo_sapiens table. Table 3 summarizes the MAX number of successive occurrences extracted from the homo_sapiens table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.
Table 3

Tandem repeat successive occurrences for all chromosomes of H.sapiens.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 21 | Once
AGAA | 42 | Twice
GATA | 22 | Once
TCTA | 25 | Once
TCAT | 12 | Twice
GAAT | 12 | Once
AGAT | 21 | Once
CTTT | 78 | Once
TATC | 25 | Once
TCTG | 12 | Four times
Query1 and Query2 are executed for the remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the homo_sapiens table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 2.
Fig. 2

Max number of successive occurrences of all repeats for all chromosomes of H.sapiens.

From Fig. 2, the following observations can be made: the CTTT tandem repeat has a maximum of 78 successive occurrences; the AGAA tandem repeat has a maximum of 42 successive occurrences, reached twice; the TCTG tandem repeat has a maximum of 12 successive occurrences, reached four times; the remaining tandem repeats have maxima ranging from 12 to 25 successive occurrences. These observations are relevant to bioinformatics studies.

C. jacchus

The common marmoset (C. jacchus) is a New World monkey. It originally lived on the northeastern coast of Brazil. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the C. jacchus genome, which consists of chromosomes 1 to 22, X, Y and Un, as shown in Table 2. A total of 11,40,529 perfect successive repeats are extracted from these chromosomes and stored in the callithrix_jacchus table. Table 4 summarizes the MAX number of successive occurrences extracted from the callithrix_jacchus table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.
Table 4

Tandem repeat successive occurrences for all chromosomes of C.jacchus.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 21 | Once
AGAA | 57 | Once
GATA | 20 | Once
TCTA | 18 | Twice
TCAT | 14 | Once
GAAT | 13 | Once
AGAT | 21 | Once
CTTT | 51 | Once
TATC | 18 | Thrice
TCTG | 14 | Once
Query1 and Query2 are executed for the remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the callithrix_jacchus table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats is graphically shown in Fig. 3.
Fig. 3

Max number of successive occurrences of all repeats for all chromosomes of C.jacchus.

From Fig. 3, the following observations can be made: the AGAA tandem repeat has a maximum of 57 successive occurrences; the CTTT tandem repeat has a maximum of 51 successive occurrences; the TATC tandem repeat has a maximum of 18 successive occurrences, reached thrice; the remaining tandem repeats have maxima ranging from 13 to 21 successive occurrences. These observations are relevant to bioinformatics studies.

C. sabaeus

The green monkey (C. sabaeus), also known as the sabaeus monkey or the Callithrix monkey, is an Old World monkey with golden-green fur and pale hands and feet. The tip of the tail is golden yellow, as are the backs of the thighs and the cheek whiskers. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the C. sabaeus genome, which consists of chromosomes 1 to 29, MT, X, Y and Un, as shown in Table 2. A total of 11,13,445 perfect successive repeats are extracted from these chromosomes and stored in the chlorocebus_sabaeus table. Table 5 summarizes the MAX number of successive occurrences extracted from the chlorocebus_sabaeus table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.
Table 5

Tandem repeat successive occurrences for all chromosomes of C.sabaeus.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 19 | Twice
AGAA | 54 | Once
GATA | 20 | Once
TCTA | 20 | Twice
TCAT | 14 | Once
GAAT | 14 | Once
AGAT | 20 | Once
CTTT | 42 | Once
TATC | 20 | Once
TCTG | 14 | Once
Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the chlorocebus_sabaeus table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats is graphically shown in Fig. 4.
Fig. 4

Max number of successive occurrences of all repeats for all chromosomes of C.sabaeus.

From Fig. 4, the following observations can be made: the AGAA tandem repeat has a maximum of 54 successive occurrences; the CTTT tandem repeat has a maximum of 42 successive occurrences; the TCTA tandem repeat has a maximum of 20 successive occurrences, reached twice; the remaining tandem repeats have maxima ranging from 14 to 20 successive occurrences. These observations are relevant to bioinformatics studies.

G. gorilla

The western gorilla (G. gorilla) is a great ape, the type species as well as the most populous species of the genus Gorilla. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the G. gorilla genome, which consists of chromosomes 1, 2A, 2B, 3 to 22, MT, X and Un, as shown in Table 2. A total of 11,63,843 perfect successive repeats are extracted from these chromosomes and stored in the gorilla_gorilla table. Table 6 summarizes the MAX number of successive occurrences extracted from the gorilla_gorilla table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.
Table 6

Tandem repeat successive occurrences for all chromosomes of G.gorilla.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 19 | Thrice
AGAA | 41 | Once
GATA | 20 | Once
TCTA | 26 | Once
TCAT | 14 | Once
GAAT | 12 | Ten times
AGAT | 20 | Twice
CTTT | 66 | Once
TATC | 26 | Once
TCTG | 16 | Once
Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the gorilla_gorilla table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 5.
Fig. 5

Max number of successive occurrences of all repeats for all chromosomes of G.gorilla.

From Fig. 5, the following observations can be made: the CTTT tandem repeat has a maximum of 66 successive occurrences; the AGAA tandem repeat has a maximum of 41 successive occurrences; the GAAT tandem repeat has a maximum of 12 successive occurrences, reached ten times; the remaining tandem repeats have maxima ranging from 14 to 26 successive occurrences. These observations are relevant to bioinformatics studies.

M. fascicularis

The crab-eating macaque (M. fascicularis), also known as the long-tailed macaque, is a cercopithecine primate native to Southeast Asia. It is referred to as the cynomolgus monkey in laboratories. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the M. fascicularis genome, which consists of chromosomes 1 to 20, MT, X and Un, as shown in Table 2. A total of 12,31,029 perfect successive repeats are extracted from these chromosomes and stored in the macaca_fascicularis table. Table 7 summarizes the MAX number of successive occurrences extracted from the macaca_fascicularis table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.
Table 7

Tandem repeat successive occurrences for all chromosomes of M.fascicularis.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 29 | Once
AGAA | 218 | Once
GATA | 29 | Once
TCTA | 33 | Once
TCAT | 19 | Once
GAAT | 14 | Thrice
AGAT | 28 | Once
CTTT | 221 | Once
TATC | 33 | Once
TCTG | 16 | Once
Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the macaca_fascicularis table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 6.
Fig. 6

Max number of successive occurrences of all repeats for all chromosomes of M.fascicularis.

From Fig. 6, the following observations can be made: the CTTT tandem repeat has a maximum of 221 successive occurrences; the AGAA tandem repeat has a maximum of 218 successive occurrences; the GAAT tandem repeat has a maximum of 14 successive occurrences, reached thrice; the remaining tandem repeats have maxima ranging from 16 to 33 successive occurrences. These observations are relevant to bioinformatics studies.

M. mulatta

The rhesus macaque (M. mulatta) is one of the best-known species of Old World monkeys. It is listed as Least Concern in the IUCN Red List of Threatened Species in view of its wide distribution, presumed large population, and tolerance of a broad range of habitats. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the M. mulatta genome, which consists of chromosomes 1 to 20, MT, X and Un, as shown in Table 2. A total of 12,74,556 perfect successive repeats are extracted from these chromosomes and stored in the macaca_mulatta table. Table 8 summarizes the MAX number of successive occurrences extracted from the macaca_mulatta table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats.
Table 8

Tandem repeat successive occurrences for all chromosomes of M.mulatta.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 31 | Once
AGAA | 84 | Once
GATA | 31 | Once
TCTA | 21 | Twice
TCAT | 19 | Once
GAAT | 15 | Once
AGAT | 31 | Once
CTTT | 79 | Once
TATC | 21 | Once
TCTG | 12 | Twice
The TAGA results are extracted from the table by executing Query1 and Query2. Query1 and Query2 are then executed for the remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the macaca_mulatta table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats is graphically shown in Fig. 7.
Fig. 7

Max number of successive occurrences of all repeats for all chromosomes of M.mulatta.

From Fig. 7, the following observations can be made: the AGAA tandem repeat has a maximum of 84 successive occurrences; the CTTT tandem repeat has a maximum of 79 successive occurrences; the TCTA tandem repeat has a maximum of 21 successive occurrences, reached twice; the remaining tandem repeats have maxima ranging from 12 to 31 successive occurrences. These observations are relevant to bioinformatics studies.

N. leucogenys

The northern white-cheeked gibbon (N. leucogenys) is a species of gibbon native to Southeast Asia. It is closely related to the southern white-cheeked gibbon, with which it was previously considered conspecific. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the N. leucogenys genome, which consists of chromosomes 1a, 2 to 6, 7b, 8 to 21, 22a, 23 to 25, X and Un, as shown in Table 2. A total of 11,71,594 perfect successive repeats are extracted from these chromosomes and stored in the nomascus_leucogenys table. Table 9 summarizes the MAX number of successive occurrences extracted from the nomascus_leucogenys table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats.
Table 9

Tandem repeat successive occurrences for all chromosomes of N.leucogenys.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 17 | Thrice
AGAA | 52 | Once
GATA | 17 | Once
TCTA | 22 | Once
TCAT | 12 | Once
GAAT | 11 | Once
AGAT | 17 | Once
CTTT | 33 | Once
TATC | 23 | Once
TCTG | 13 | Once
The TAGA results are extracted from the table by executing Query1 and Query2. Query1 and Query2 are then executed for the remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the nomascus_leucogenys table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats is graphically shown in Fig. 8.
Fig. 8

Max number of successive occurrences of all repeats for all chromosomes of N.leucogenys.

From Fig. 8, the following observations can be made: the AGAA tandem repeat has a maximum of 52 successive occurrences; the CTTT tandem repeat has a maximum of 33 successive occurrences; the TAGA tandem repeat has a maximum of 17 successive occurrences, reached thrice; the remaining tandem repeats have maxima ranging from 11 to 23 successive occurrences. These observations are relevant to bioinformatics studies.

P. troglodytes

The common chimpanzee (P. troglodytes), also known as the robust chimpanzee, is a species of great ape. Colloquially, the common chimpanzee is often simply called the chimpanzee. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the P. troglodytes genome, which consists of chromosomes 1, 2A, 2B, 3 to 22, MT, X, Y and Un, as shown in Table 2. A total of 12,76,766 perfect successive repeats are extracted from these chromosomes and stored in the pan_troglodytes table. Table 10 summarizes the MAX number of successive occurrences extracted from the pan_troglodytes table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.
Table 10

Tandem repeat successive occurrences for all chromosomes of P.troglodytes.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 17 | Four times
AGAA | 43 | Once
GATA | 18 | Twice
TCTA | 18 | Once
TCAT | 10 | Five times
GAAT | 11 | Once
AGAT | 18 | Once
CTTT | 30 | Once
TATC | 19 | Once
TCTG | 13 | Once
Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the pan_troglodytes table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 9.
Fig. 9

Max number of successive occurrences of all repeats for all chromosomes of P.troglodytes.

From Fig. 9, the following observations can be made: the AGAA tandem repeat has a maximum of 43 successive occurrences; the CTTT tandem repeat has a maximum of 30 successive occurrences; the TCAT tandem repeat has a maximum of 10 successive occurrences, reached five times; the remaining tandem repeats have maxima ranging from 11 to 19 successive occurrences. These observations are relevant to bioinformatics studies.

P. anubis

The olive baboon (P. anubis), also called the Anubis baboon, is a member of the family Cercopithecidae. The species is the most widely ranging of all baboons. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the P. anubis genome, which consists of chromosomes 1 to 20, MT, X and Un, as shown in Table 2. A total of 13,51,393 perfect successive repeats are extracted from these chromosomes and stored in the papio_anubis table. Table 11 summarizes the MAX number of successive occurrences extracted from the papio_anubis table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.
Table 11

Tandem repeat successive occurrences for all chromosomes of P.anubis.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 31 | Once
AGAA | 54 | Once
GATA | 31 | Once
TCTA | 22 | Once
TCAT | 15 | Once
GAAT | 14 | Once
AGAT | 32 | Once
CTTT | 47 | Twice
TATC | 21 | Twice
TCTG | 15 | Once
Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the papio_anubis table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 10.
Fig. 10

Max number of successive occurrences of all repeats for all chromosomes of P.anubis.

From Fig. 10, the following observations can be made: the AGAA tandem repeat has a maximum of 54 successive occurrences; the CTTT tandem repeat has a maximum of 47 successive occurrences, reached twice; the TATC tandem repeat has a maximum of 21 successive occurrences, reached twice; the remaining tandem repeats have maxima ranging from 14 to 32 successive occurrences. These observations are relevant to bioinformatics studies.

P. abelii

The Sumatran orangutan (P. abelii) is one of the two species of orangutans. Found only on the island of Sumatra, in Indonesia, it is rarer than the Bornean orangutan. In this paper, the proposed string matching algorithm is used to retrieve the perfect successive tandem repeats from the P. abelii genome, which consists of chromosomes 1, 2A, 2B, 3 to 22, MT, X and Un, as shown in Table 2. A total of 13,81,887 perfect successive repeats are extracted from these chromosomes and stored in the pongo_abelii table. Table 12 summarizes the MAX number of successive occurrences extracted from the pongo_abelii table for the TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA repeats. The TAGA results are extracted from the table by executing Query1 and Query2.
Table 12

Tandem repeat successive occurrences for all chromosomes of P. abelii.

codi/Repeat name | MAX number of codi in the successive occurrences | Number of times the MAX number appeared
TAGA | 20 | Once
AGAA | 37 | Once
GATA | 19 | Once
TCTA | 18 | Twice
TCAT | 12 | Once
GAAT | 13 | Once
AGAT | 20 | Once
CTTT | 63 | Once
TATC | 19 | Once
TCTG | 11 | Twice
Query1 and Query2 are executed for remaining repeats TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA. The extracted MAX number of successive occurrences from the pongo_abelii table for TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG, and TCTA repeats is graphically shown in Fig. 11.
Fig. 11

Max number of successive occurrences of all repeats for all chromosomes of P.abelii.

From Fig. 11, the following observations can be made: the CTTT tandem repeat has a maximum of 63 successive occurrences; the AGAA tandem repeat has a maximum of 37 successive occurrences; the TCTA tandem repeat has a maximum of 18 successive occurrences, reached twice; the remaining tandem repeats have maxima ranging from 11 to 20 successive occurrences. These observations are relevant to bioinformatics studies.

Conclusions

In this paper we developed the TandemRepeatDB, which provides single-portal access to perfect successive repeats in the genomes of H. sapiens, C. jacchus, C. sabaeus, G. gorilla, M. fascicularis, M. mulatta, N. leucogenys, P. troglodytes, P. anubis and P. abelii. To our knowledge, the database is the first of its kind to host perfect successive tandem repeats for the considered genomes. From the analysis of all the records in the TandemRepeatDB, it is observed that the CTTT and AGAA tandem repeats play the major role: they show the longest successive runs in every genome considered. The TandemRepeatDB will be a valuable resource for researchers studying repeats in these genomes.

Conflict of interests

The authors declare no conflict of interest.
References (10 of 14 listed)

1. R R Sinden. Neurodegenerative diseases. Origins of instability. Nature. 2001-06-14.

2. T Boby; A-M Patch; S J Aves. TRbase: a database relating tandem repeats to disease genes for the human genome. Bioinformatics. 2004-10-12.

3. Mark A Jobling; Peter Gill. Encoded evidence: DNA in forensic analysis (review). Nat Rev Genet. 2004-10.

4. E R Moxon; C Wills. DNA microsatellites: agents of evolution? Sci Am. 1999-01.

5. Y Kashi; D King; M Soller. Simple sequence repeats as a source of quantitative genetic variation (review). Trends Genet. 1997-02.

6. R Gur-Arie; C J Cohen; Y Eitan; L Shelef; E M Hallerman; Y Kashi. Simple sequence repeats in Escherichia coli: abundance, distribution, composition, and polymorphism. Genome Res. 2000-01.

7. C Dib; S Fauré; C Fizames; D Samson; N Drouot; A Vignal; P Millasseau; S Marc; J Hazan; E Seboun; M Lathrop; G Gyapay; J Morissette; J Weissenbach. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature. 1996-03-14.

8. G Tóth; Z Gáspári; J Jurka. Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res. 2000-07.

9. Subbaya Subramanian; Rakesh K Mishra; Lalji Singh. Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions. Genome Biol. 2003-01-23.

10. Subbaya Subramanian; Vamsi M Madgula; Ranjan George; Satish Kumar; Madhusudhan W Pandit; Lalji Singh. SSRD: simple sequence repeats database of the human genome. Comp Funct Genomics. 2003.
