
Method for identification of 10 SSR markers from monkey genomes and its statistical inference with One & Two-way ANOVA.

Chinta Someswara Rao¹, G N V G Sirisha¹, K Butchi Raju², N V Ganapathi Raju³.

Abstract

DNA tracts that contain simple sequence repeats (SSRs), sometimes called genetic "stutters", are composed of a few to many tandem repetitions of a short base-pair motif. These sequences mutate frequently, changing the number of repetitions. Because SSRs are often found in promoters, untranslated regions, and even coding sequences, these alterations can affect practically every aspect of gene activity. SSR alleles can also contribute to normal diversity in brain and behavioural features, and mutational expansion of certain triplet repeats is the cause of a number of inherited neurodegenerative diseases. Because of this importance to genetic research, in this paper we explore ten SSR markers (TAGA, TCAT, GAAT, AGAT, AGAA, GATA, TATC, CTTT, TCTG and TCTA) identified in the genomes of eleven distinct monkeys (A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina) using a pattern matching mechanism. We mainly identify 4 bp SSRs in the Unchr chromosome of the eleven monkey datasets. The proposed approach finds the exact location of each SSR and the number of times it appears in the given genome sequence. The identified patterns are analyzed with One-way and Two-way ANOVA, which is useful for genomic studies. The data for these ten 4 bp SSR markers is also valuable for illustrating genetic variation in genomic studies.

• The great specificity of data sets produced from monkey genomes with pattern matching has been demonstrated.
• These findings show that SSR identification could be a useful tool for determining genome similarity and comparability.
• Researchers can use the raw sequencing data to conduct additional bioinformatics analysis.


Keywords: ANOVA, Analysis of Variance; Genomes; Monkey; Pattern matching; SSR, Simple sequence repeats; SSRs

Year: 2022 PMID: 36117677 PMCID: PMC9474309 DOI: 10.1016/j.mex.2022.101833

Source DB: PubMed Journal: MethodsX ISSN: 2215-0161

Specifications Table

Method details

Simple sequence repeats (SSRs), tandem repeats of short DNA sequences, are present in varying quantities in the majority of genomes. They have been widely employed in genetic mapping and population research. SSRs also provide molecular tools for understanding spatial links between chromosome segments, which in turn helps in analysing temporal linkages between species and genera. Studying the frequency of repeats and their distribution pattern in the genome is expected to help in understanding their significance. There is accumulating evidence that SSRs influence gene expression [1], [2], [3]. Complete genome sequences have become available for several species, and genome-wide analyses have been carried out. In this study, we analysed the Unchr chromosome of eleven different monkeys (A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina), and ten SSR loci were investigated for their spread and frequency of occurrence. A few previous studies have tried to evaluate tandem repeat distributions in monkey genomes [4], but they are restricted to a single genome or a small number of genomes. Mining multiple genomes and applying Analysis of Variance (ANOVA) helps to understand and resolve biological issues. The proposed structure of the method is shown in Fig. 1; it comprises dataset collection and reading, SSR identification and the search process, and analysis of variance (ANOVA).

Fig. 1

Structure of the model.

FASTA-format datasets of A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina are collected and read, and ten patterns are considered.

SSR Identification

In this paper, the Unchr chromosome of A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina and ten SSRs (TAGA, AGAA, GATA, TCTA, TCAT, GAAT, AGAT, CTTT, TATC, TCTG) are considered. The SSRs are retrieved from the monkey genomes using a string matching method, that is, a search method that looks for occurrences of each repeat motif in a given chromosome file.
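As a minimal illustration of what string matching returns for an SSR motif, the sketch below reports every start position and the total count of each motif in a sequence. It uses Python's built-in `str.find` rather than the heuristic search described in this paper, and the function names and the toy sequence are hypothetical.

```python
# Minimal sketch (not the paper's implementation): locate 4 bp SSR motifs
# in a sequence with Python's built-in substring search.

MOTIFS = ["TAGA", "AGAA", "GATA", "TCTA", "TCAT",
          "GAAT", "AGAT", "CTTT", "TATC", "TCTG"]


def find_motif_positions(sequence, motif):
    """Return the 0-based start of every (possibly overlapping) occurrence."""
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Resume one base further on so overlapping hits are not skipped.
        start = sequence.find(motif, start + 1)
    return positions


def count_motifs(sequence, motifs=MOTIFS):
    """Return {motif: number of occurrences} for an upper-cased sequence."""
    seq = sequence.upper()
    return {motif: len(find_motif_positions(seq, motif)) for motif in motifs}


if __name__ == "__main__":
    toy = "TAGATAGAGATATCTG"  # hypothetical toy sequence, not real genome data
    print(find_motif_positions(toy, "TAGA"))
    print(count_motifs(toy))
```

On a real chromosome file the sequence lines of the FASTA record would first be joined into one string; that step is omitted here.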

Search process

Algorithm 1 describes the complete SSR identification heuristic procedure. In this heuristic procedure, the chromosomes and SSRs are given as input, which then invokes the first_occurance_position_heuristic, bad_character_heuristic, and good_suffix_heuristic procedures. Finally, this algorithm displays the pattern and its position and continues the search process until the end of the given chromosome sequence.

Algorithm 1

Complete_Heuristic Process.

Complete_Heuristic (Text, Patt, ∑)

n ← Text.len    # length of genome sequence
m ← Patt.len    # length of genome pattern
α ← First_Occurance_Position_Heuristic(Patt, m, ∑)
β ← Bad_Character_Heuristic(Patt, m, ∑)
γ ← Good_Suffix_Heuristic(Patt, m)
position ← α
while position ≤ n − m
    do j ← m
       while j > 0 and Patt[j] = Text[position + j]
           do j ← j − 1
       if j = 0 then
           print("Pattern occurs at shift", position)
           position ← position + γ[0]
       else
           position ← position + max(γ[j], j − β[Text[position + j]])

Algorithm 2 describes the first occurrence position heuristic procedure. In this heuristic procedure, the pattern's rightmost character, Patt[m−1], is compared with the corresponding character in the genome sequence; if a match is found, the match position is returned; otherwise, the comparison continues until the end of the genome sequence.

Algorithm 2

First Occurrence Position Heuristic Process.

First_Occurance_Position_Heuristic (Patt, m, ∑)

for each pattern patt ∈ patterns
    α ← 0
    for i ← 1 to m
        if Patt[m − 1] == Text[i + m − 1]
            α ← i
            break
        else
            continue
return α

Algorithm 3 describes the Bad Character Heuristic procedure. This heuristic procedure is invoked when a mismatch occurs and returns the shifted position of the pattern. If the mismatched genome character does not occur anywhere in the pattern, the entire pattern is shifted past it; otherwise, the number of characters matched is used to determine the shift.

Algorithm 3

Bad Character Heuristic Process.

Bad_Character_Heuristic (Patt, m, ∑)

for each character a ∈ ∑
    do β[a] ← 0
for i ← 1 to m
    do β[Patt[i]] ← i
return β
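The search loop of Algorithm 1 combined with the bad character table of Algorithm 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: it uses 0-based indexing, it applies only the bad character shift (after a full match it advances by one position instead of using γ[0]), and the names `bad_character_table` and `find_all` are hypothetical.

```python
# Minimal sketch: right-to-left matching with the bad character shift only.

def bad_character_table(pattern):
    """Map each pattern character to the index of its rightmost occurrence."""
    return {ch: i for i, ch in enumerate(pattern)}


def find_all(text, pattern):
    """Return the 0-based start of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = bad_character_table(pattern)
    hits = []
    position = 0
    while position <= n - m:
        j = m - 1
        # Compare from the rightmost pattern character towards the left.
        while j >= 0 and pattern[j] == text[position + j]:
            j -= 1
        if j < 0:
            hits.append(position)
            position += 1  # conservative shift; Algorithm 1 uses γ[0] here
        else:
            # Align the mismatched text character with its rightmost
            # occurrence in the pattern, always moving at least one step.
            position += max(1, j - last.get(text[position + j], -1))
    return hits


if __name__ == "__main__":
    print(find_all("TAGATAGAGATA", "TAGA"))
```

Clamping the shift to at least 1 plays the role that the good suffix shift plays in Algorithm 1, where it prevents a negative bad character shift from moving the pattern backwards.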

Algorithm 4 describes the Good Suffix Heuristic procedure. This heuristic procedure is invoked when a complete pattern match occurs and returns the next search position. It is also used after a mismatch: if a suffix of the pattern (a good suffix) has matched before the bad character is reached, and the bad character heuristic would produce a negative shift, we take a step forward equal to the length of the suffix found.

Algorithm 4

Good_Suffix_Heuristic Process.

Good_Suffix_Heuristic (Patt, m)

τ ← identify_prefix(Patt)
Patt′ ← reverse(Patt)
τ′ ← identify_prefix(Patt′)
for j ← 0 to m
    do γ[j] ← m − τ[m]
for k ← 1 to m
    do j ← m − τ′[k]
       if γ[j] > k − τ′[k] then
           γ[j] ← k − τ′[k]
return γ
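The good suffix table can be sketched in Python as follows. The text does not define identify_prefix; this sketch assumes it denotes the standard prefix (failure) function known from Knuth-Morris-Pratt matching, which is an assumption rather than something stated in the paper. The function names are hypothetical, Python lists are 0-based, and `gamma[j]` keeps the pseudocode's meaning for j = 0..m.

```python
# Minimal sketch of the good suffix table, assuming identify_prefix is the
# standard prefix (failure) function.

def prefix_function(pattern):
    """pi[i] = length of the longest proper border of pattern[:i + 1]."""
    pi = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = pi[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        pi[i] = k
    return pi


def good_suffix_table(pattern):
    """Return gamma[0..m]; gamma[j] is the shift when the mismatch is at j
    (1-based, as in the pseudocode) and gamma[0] the shift after a match."""
    m = len(pattern)
    if m == 0:
        return [1]
    pi = prefix_function(pattern)
    pi_rev = prefix_function(pattern[::-1])
    gamma = [m - pi[m - 1]] * (m + 1)
    for k in range(1, m + 1):
        j = m - pi_rev[k - 1]
        shift = k - pi_rev[k - 1]
        if gamma[j] > shift:
            gamma[j] = shift
    return gamma


if __name__ == "__main__":
    print(good_suffix_table("TAGA"))
```

For TAGA, a mismatch after matching only the final A gives a shift of 2, which aligns that A with the A at the second position of the pattern.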

This procedure is repeated for all SSRs as well as the whole data in the chromosomes.

Analysis of variance (ANOVA)

The analysis of variance (ANOVA) [5], [6], [7] is a set of statistical models and estimation procedures for analyzing differences between group means in a sample. It is useful for testing the statistical significance of differences among three or more group means. For this, we calculate the between-group and within-group sums of squares, degrees of freedom, and mean squares. Between-group variability is obtained by squaring the deviation of each group mean from the overall mean, multiplying each squared deviation by the corresponding sample size, and summing the results; this is known as the between-group sum of squares, as shown in Eq. (1). The between-group degrees of freedom are calculated with Eq. (2) and the between-group mean square with Eq. (3). Within-group variability is measured by how far each sample's value deviates from its sample mean. The within-group sum of squares is calculated with Eq. (4), the within-group degrees of freedom with Eq. (5), and the within-group mean square with Eq. (6).
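For reference, the standard textbook definitions of these one-way ANOVA quantities are sketched below. The notation is an assumption and may differ from the paper's Eqs. (1) to (6): k is the number of groups, n_i the size of group i, N the total number of observations, x̄_i the mean of group i, and x̄ the overall mean.

```latex
\begin{align}
SS_{\text{between}} &= \sum_{i=1}^{k} n_i \,(\bar{x}_i - \bar{x})^2, &
df_{\text{between}} &= k - 1, &
MS_{\text{between}} &= \frac{SS_{\text{between}}}{df_{\text{between}}}, \\
SS_{\text{within}} &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2, &
df_{\text{within}} &= N - k, &
MS_{\text{within}} &= \frac{SS_{\text{within}}}{df_{\text{within}}}.
\end{align}
```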

F-Statistic

The F-statistic assesses whether the means of two or more samples differ significantly. When its value is small, the sample means are close to each other, and in that case the null hypothesis cannot be rejected. It is calculated with Eq. (7).
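A minimal pure-Python sketch of the F-statistic is given below, assuming the standard form F = MS_between / MS_within (the usual textbook definition, taken here as an assumption about Eq. (7)). The function name and the toy groups are hypothetical. Computing the p value additionally requires the F distribution; library routines such as `scipy.stats.f_oneway` return both values.

```python
# Minimal sketch: one-way ANOVA F-statistic from raw group observations.

def one_way_f_statistic(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    total_n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / total_n
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: squared deviation of each group mean
    # from the grand mean, weighted by the group size.
    ss_between = sum(len(g) * (mean - grand_mean) ** 2
                     for g, mean in zip(groups, group_means))
    # Within-group sum of squares: squared deviation of each observation
    # from its own group mean.
    ss_within = sum((x - mean) ** 2
                    for g, mean in zip(groups, group_means) for x in g)

    df_between = k - 1
    df_within = total_n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return ms_between / ms_within, df_between, df_within


if __name__ == "__main__":
    toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical values
    print(one_way_f_statistic(toy_groups))
```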

Data description

Table 1 shows data from ten SSR markers taken from the genomes of eleven monkeys. NCBI (https://www.ncbi.nlm.nih.gov) provides the genome dataset. The results suggest that SSR identification with pattern matching was quite beneficial in revealing variation in chosen genome libraries. These SSR markers may be used to compare and quantify genomic similarities.

Table 1

Ten SSRs were identified from genome sequences.

Data Set	Chromosome name	No. of patterns identified
A.Nancymaae	chrUn	110,966
C.C.Imitator	chrUn	108,223
C.Atys	chrUn	145,189
M.Leucophaeus	chrUn	115,378
P.Paniscus	chrUn	127,150
R.Bieti	chrUn	119,993
R.Roxellana	chrUn	110,616
S.Boliviensis	chrUn	134,976
T.Syrichta	chrUn	131,908
C.A.Palliatus	chrUn	107,098
M.Nemestrina	chrUn	157,578
Total		1,369,075

The Unchr chromosome of each monkey (A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina) is taken into account for the identification of the 10 SSR markers listed in Table 1. The numbers of patterns identified in the different chromosomes are depicted as a bar plot in Fig. 2.

Fig. 2

Patterns identified in different chromosomes.

Fig. A1(a) to (j), in Appendix A (figures part), shows the max size of each pattern for the considered genome datasets. Fig. A2(a) to (k), in Appendix A (figures part), shows the ten patterns for each of the 11 datasets.

Fig. A1

(a) TAGA pattern max size for 11 datasets. (b) AGAA pattern max size for 11 datasets. (c) GATA pattern max size for 11 datasets. (d) TCTA pattern max size for 11 datasets. (e) TCAT pattern max size for 11 datasets. (f) GAAT pattern max size for 11 datasets. (g) AGAT pattern max size for 11 datasets. (h) CTTT pattern max size for 11 datasets. (i) TATC pattern max size for 11 datasets. (j) TCTG pattern max size for 11 datasets.

Fig. A2

(a) All patterns data of A.Nancymaae dataset. (b) All patterns data of C.C.Imitator dataset. (c) All patterns data of C.Atys dataset. (d) All patterns data of M.Leucophaeus dataset. (e) All patterns data of P.Paniscus dataset. (f) All patterns data of R.Bieti dataset. (g) All patterns data of R.Roxellana dataset. (h) All patterns data of S.Boliviensis dataset. (i) All patterns data of T.Syrichta dataset. (j) All patterns data of C.A.Palliatus dataset. (k) All patterns data of M.Nemestrina dataset.

Statistical inference with ANOVA

One-way analysis of variance

This method is employed to compare the means of two or more samples (using the F distribution). It applies to a numerical response variable (the "Y"), generally a single variable, and a numerical or (mostly) categorical input variable (the "X"), always a single variable, hence "One-way". One-way analysis of variance is performed among A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina for the ten patterns. The actual results are shown in Tables 2 and 3.

Table 2

One-way ANOVA statistic and p value among 11 datasets.

Data Set	statistic	P value
A.Nancymaae	867.752255	0
C.C.Imitator	4411.909573	0
C.Atys	19.92680516	9.56E-34
M.Leucophaeus	10783.28905	0
P. Paniscus	4237.052533	0
R.Bieti	4237.052533	0
R.Roxellana	13315.16044	0
S.Boliviensis	21076.54652	0
T.Syrichta	1671.161835	0
C.A.Palliatus	15423.71891	0
M.Nemestrina	11200.74466	0

Table 3

One-way ANOVA statistic and p value among 10 patterns among 11 datasets.

Pattern	statistic	P value
TAGA	815.3285535	0
AGAA	475.3548131	0
GATA	817.3908386	0
TCTA	423.9534899	0
TCAT	79.96858887	1.91E-148
GAAT	83.37284918	5.24E-155
AGAT	740.8585252	0
CTTT	149.8891403	4.32E-284
TATC	425.7145896	0
TCTG	66.91818823	5.45E-122

From Table 2, the p value is reported as 0 for every monkey dataset except C.Atys, which has a nonzero p value (9.56E-34) and by far the smallest statistic (19.93). On this basis, we conclude that there is a similarity of the C.Atys monkey with the others. The One-way ANOVA statistic for the different chromosomes is depicted as a bar plot in Fig. 3.

Fig. 3

The statistic of One-way ANOVA of different chromosomes.

From Table 3, the p value is likewise reported as 0 for every pattern except four, TCAT, GAAT, CTTT and TCTG, which have nonzero p values and the four smallest statistics. On this basis, we conclude that there is a similarity of TCAT, GAAT, CTTT and TCTG with the others. The One-way ANOVA statistic for the different patterns is depicted as a bar plot in Fig. 4.

Fig. 4

The statistic of One-way ANOVA of different patterns.

Two-way analysis of variance

Two-way ANOVA looks at the impact of two categorical independent variables on a continuous dependent variable. It is used to determine not only the main effect of each independent variable, but also whether the two interact. It is performed for each of the 11 datasets (A.Nancymaae, C.C.Imitator, C.Atys, M.Leucophaeus, P.Paniscus, R.Bieti, R.Roxellana, S.Boliviensis, T.Syrichta, C.A.Palliatus and M.Nemestrina) between the ten patterns. The actual results are uploaded to Mendeley Data as Appendix A and Appendix B.

Tables A.1 to A.11 show the Two-way ANOVA statistic and p value of the 11 datasets for the ten patterns. These results show, for the monkey datasets, which relations support the null hypothesis and which support the alternative hypothesis. For example, from the statistics and p values, the relation between TAGA and AGAA supports the alternative hypothesis, whereas that between TCTA and GAAT supports the null hypothesis. Tables A.12 to A.21 show the Two-way ANOVA hypothesis rejection (TRUE/FALSE) for the 10 patterns of the 11 datasets, respectively. These results show the relations among patterns. For example, the relation between AGAA and CTTT [meandiff: -0.1085, Lower: 0.135, Upper: 0.3786] gives FALSE, meaning the null hypothesis is not rejected, while that between AGA and AGAT [meandiff: -6.5503, Lower: -6.7938, Upper: -6.3068] gives TRUE, meaning the null hypothesis is rejected.

Tables B.1 to B.11 show the Two-way ANOVA statistic and p value of each of the ten patterns for the 11 datasets. From Table B.1, for the TAGA pattern, the relation between A.Nancymaae and C.C.Imitator supports the alternative hypothesis based on its statistic and p value, whereas that between A.Nancymaae and C.Atys supports the null hypothesis. Tables B.12 to B.21 show the Two-way ANOVA hypothesis rejection (TRUE/FALSE) among the 11 datasets for the ten patterns, respectively. From Table B.12, for the TCAT pattern, the relation between A.Nancymaae and C.C.Imitator [meandiff: 0.7039, Lower: 0.4021, Upper: 1.0057] gives TRUE, meaning the null hypothesis is rejected, while that between C.C.Imitator and S.Boliviensis [meandiff: -0.2713, Lower: -0.573, Upper: 0.0305] gives FALSE, meaning the null hypothesis is not rejected.

Fig. 5(a) to (k) shows the multiple comparisons between all pairs (Tukey) for the 11 datasets and all 10 patterns. From Fig. 5(a) to (k), it is observed that, for the 11 monkey datasets and 10 patterns, the results in these graphs match the results discussed in the previous paragraphs.
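The TRUE/FALSE rejection flags quoted above follow the usual reading of a pairwise confidence interval: the null hypothesis of equal means is rejected when the interval for the mean difference excludes zero. The sketch below applies that rule to the TCAT intervals quoted from Table B.12. It is only an illustration of the decision rule, with a hypothetical function name; it does not compute the Tukey intervals themselves.

```python
# Minimal sketch: reading a reject flag off a confidence interval for the
# difference between two group means.

def rejects_null(lower, upper):
    """True when the interval [lower, upper] does not contain zero."""
    return lower > 0 or upper < 0


# (label, meandiff, lower, upper) as quoted in the text for the TCAT pattern.
comparisons = [
    ("A.Nancymaae vs C.C.Imitator", 0.7039, 0.4021, 1.0057),
    ("C.C.Imitator vs S.Boliviensis", -0.2713, -0.573, 0.0305),
]

if __name__ == "__main__":
    for label, meandiff, lower, upper in comparisons:
        print(label, meandiff, rejects_null(lower, upper))
```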

Fig. 5

(a) For A.Nancymaae, multiple comparisons among all pairs (Tukey). (b) For C.C.Imitator, multiple comparisons among all pairs (Tukey). (c) For C.Atys, multiple comparisons among all pairs (Tukey). (d) For M.Leucophaeus, multiple comparisons among all pairs (Tukey). (e) For P.Paniscus, multiple comparisons among all pairs (Tukey). (f) For R.Bieti, multiple comparisons among all pairs (Tukey). (g) For R.Roxellana, multiple comparisons among all pairs (Tukey). (h) For S.Boliviensis, multiple comparisons among all pairs (Tukey). (i) For T.Syrichta, multiple comparisons among all pairs (Tukey). (j) For C.A.Palliatus, multiple comparisons among all pairs (Tukey). (k) For M.Nemestrina, multiple comparisons among all pairs (Tukey).

Fig. 6

(a) Multiple comparisons between all datasets for TAGA pattern. (b) Multiple comparisons between all datasets for AGAA pattern. (c) Multiple comparisons between all datasets for GATA pattern. (d) Multiple comparisons between all datasets for TCTA pattern. (e) Multiple comparisons between all datasets for TCAT pattern. (f) Multiple comparisons between all datasets for GAAT pattern. (g) Multiple comparisons between all datasets for AGAT pattern. (h) Multiple comparisons between all datasets for CTTT pattern. (i) Multiple comparisons between all datasets for TATC pattern. (j) Multiple comparisons between all datasets for TCTG pattern.

Ethics statements

This work has never been published or submitted to another journal. This information and analysis will not hurt humans or animals.

CRediT authorship contribution statement

Chinta Someswara Rao: Conceptualization, Methodology, Software, Writing – review & editing. G.N.V.G. Sirisha: Data curation, Writing – original draft. K. Butchi Raju: Visualization, Investigation. N V Ganapathi Raju: Supervision, Validation.

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Subject area:	Bio-informatics
More specific subject area:	Genomes of monkeys
Name of your method:	SSR identification
Name and reference of original method:	NA
Resource availability:	Repository name: Two way ANOVA statistic and p value. Data identification numbers: 10.17632/w42hmpwvby.2 and 10.17632/t3msbvj89t.2. Direct URLs to data: https://data.mendeley.com/datasets/w42hmpwvby/2 and https://data.mendeley.com/datasets/t3msbvj89t/2


Review 1. Analysis of variance.

Authors: Martin G Larson
Journal: Circulation Date: 2008-01-01 Impact factor: 29.690

2. Genome-wide simple sequence repeats (SSR) markers discovered from whole-genome sequence comparisons of multiple spinach accessions.

Authors: Gehendra Bhattarai; Ainong Shi; Devi R Kandel; Nora Solís-Gracia; Jorge Alberto da Silva; Carlos A Avila
Journal: Sci Rep Date: 2021-05-11 Impact factor: 4.996

3. Characterizing Repeats in Two Whole-Genome Amplification Methods in the Reniform Nematode Genome.

Authors: S T Nyaku; V R Sripathi; K Lawrence; G Sharma
Journal: Int J Genomics Date: 2021-03-06 Impact factor: 2.326

4. Development of SSR Markers and Assessment of Genetic Diversity in Medicinal Chrysanthemum morifolium Cultivars.

Authors: Shangguo Feng; Renfeng He; Jiangjie Lu; Mengying Jiang; Xiaoxia Shen; Yan Jiang; Zhi'an Wang; Huizhong Wang
Journal: Front Genet Date: 2016-06-15 Impact factor: 4.599

5. The influence of breeding history, origin and growth type on population structure of barley as revealed by SSR markers.

Authors: Seyyed Abolghasem Mohammadi; Nayyer Abdollahi Sisi; Behzad Sadeghzadeh
Journal: Sci Rep Date: 2020-11-05 Impact factor: 4.379
