Similarity analysis between chromosomes of Homo sapiens and monkeys with correlation coefficient, rank correlation coefficient and cosine similarity measures.

Chinta Someswara Rao, S Viswanadha Raju.

Abstract

In this paper, we consider the correlation coefficient, rank correlation coefficient and cosine similarity measures for evaluating the similarity between Homo sapiens and monkeys. We use the chromosomal DNA of genome-wide genes to determine the correlation between chromosomal content and evolutionary relationship. The similarity between H. sapiens and monkeys is measured for a total of 210 chromosomes belonging to 10 species. The similarity measures of these species show the relationship between H. sapiens and monkeys. This similarity will be helpful in theft identification, maternity identification, disease identification, etc.

Keywords:  Chromosomes; Correlation coefficient; Cosine similarity; DNA; Rank correlation coefficient

Year:  2016        PMID: 26981409      PMCID: PMC4778639          DOI: 10.1016/j.gdata.2016.01.001

Source DB:  PubMed          Journal:  Genom Data        ISSN: 2213-5960


Introduction

Similarity measures are among the most important operations used in analyzing genomic data. One of the most widely used analysis paradigms is guilt-by-association, which requires measuring the similarity between pairs of genes. Guilt-by-association is important for the analysis of genome interactions because the relation between two neighboring genes is often easier to interpret than direct interactions between genes [1], [2], [3]. A genome interaction is a measure of how surprisingly similar a feature of one genome is to the corresponding feature of another genome [4], [5], [6], [7]. In this study we consider the chromosomes of Homo sapiens and of different kinds of monkeys: Callithrix jacchus, Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli.

We also develop a 20 shaft string matching algorithm that consists of input and output, initialization, a main function, a search function and a shift_left_to_right function. The genome sets and the different patterns (TAGA, AGAA, GATA, TCTA, TCAT, GAAT, AGAT, CTTT, TATC, TCTG) are taken as input. The fields sample_id, sample_name, sample_chromosome_name, lineno, position, noofoccurences and codi are returned as output. The variables multiple_pattern (all patterns in the set), n (text length) and m (pattern length), together with all the remaining variables required in the process, are initialized. In the main function the genome set is read chromosome by chromosome, and each individual chromosome is passed to the shift_left_to_right function. The shift_left_to_right function takes the rightmost character of the pattern and compares it with the characters in the text. If a match occurs, the position (shift value) in the text is returned to the main function. Once the main function receives the shift value, the search function is called. In the search process, characters are compared one by one from both directions until a complete match or a mismatch occurs. If a match occurs, the number of successive occurrences of the pattern is computed. If the number of successive occurrences is greater than 2, the data is stored in the database (TandemRepeatDB). If a mismatch occurs, the same procedure is repeated until the end of the text T.

The relations created and stored in the TandemRepeatDB database are named homo_sapiens, callithrix_jacchus, chlorocebus_sabaeus, gorilla_gorilla, macaca_fascicularis, macaca_mulatta, nomascus_leucogenys, pan_troglodytes, papio_anubis and pongo_abelli.
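
The tandem repeat search described above can be illustrated with a minimal Python sketch. This is not the authors' shift_left_to_right and search procedure: it uses Python's built-in substring search as a stand-in, and the names `TandemRepeat`, `find_tandem_repeats` and `min_copies` are hypothetical. It only shows the idea of counting back-to-back copies of a pattern and keeping the runs with more than 2 successive occurrences.

```python
from typing import Iterator, NamedTuple

# The ten patterns listed in the text.
PATTERNS = ("TAGA", "AGAA", "GATA", "TCTA", "TCAT",
            "GAAT", "AGAT", "CTTT", "TATC", "TCTG")


class TandemRepeat(NamedTuple):
    pattern: str
    position: int      # 0-based start of the run in the text
    occurrences: int   # number of back-to-back copies in the run


def find_tandem_repeats(text: str, pattern: str,
                        min_copies: int = 3) -> Iterator[TandemRepeat]:
    """Yield runs of at least `min_copies` consecutive copies of `pattern`.

    Assumption: "successive occurrence size greater than 2" is read as
    three or more adjacent, non-overlapping copies.
    """
    m = len(pattern)
    i = text.find(pattern)
    while i != -1:
        copies = 1
        # Count copies that follow immediately after the current one.
        while text.startswith(pattern, i + copies * m):
            copies += 1
        if copies >= min_copies:
            yield TandemRepeat(pattern, i, copies)
            next_start = i + copies * m   # skip past the stored run
        else:
            next_start = i + 1            # retry from the next character
        i = text.find(pattern, next_start)


if __name__ == "__main__":
    sample = "CCTAGATAGATAGACC"  # hypothetical toy sequence
    for p in PATTERNS:
        for hit in find_tandem_repeats(sample, p):
            print(hit)
```

In a fuller pipeline each yielded record would be written to the relation of the corresponding species; the database layer is left out here because the text does not specify it.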

Materials and methods

In this study, three benchmarked similarity measures are considered and applied to the values of the genome datasets of H. sapiens, C. jacchus, C. sabaeus, G. gorilla, M. fascicularis, M. mulatta, N. leucogenys, P. troglodytes, P. anubis and P. abelli [8]. The similarity measures studied in the paper are the correlation coefficient [9], [10], the rank correlation coefficient [11], [12] and cosine similarity [13], [14].

Correlation coefficient

A correlation coefficient [9], [10] is a quantitative measure of correlation and dependence. It shows the statistical relationship between two or more random variables or observed data values. Different correlation coefficients are available in the literature, but in this paper Karl Pearson's correlation coefficient is considered and denoted by r(X,Y) or simply r. It can be measured by the formula

\[ r(X,Y) = \frac{\operatorname{cov}(X,Y)}{\sigma_X \, \sigma_Y} \]

where cov(X,Y) is the covariance between the X and Y variables and is defined as

\[ \operatorname{cov}(X,Y) = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}). \]

However, it can also be written as

\[ \operatorname{cov}(X,Y) = \frac{1}{n} \sum_{i=1}^{n} X_i Y_i - \bar{X}\,\bar{Y}. \]

Further, n is the number of observations used to fit the model, Σ is the summation symbol, \(X_i\) is the X value for observation i, \(\bar{X}\) is the mean X value, \(Y_i\) is the Y value for observation i, \(\bar{Y}\) is the mean Y value, and \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of the X and Y variables, with

\[ \sigma_X = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{and} \quad \sigma_Y = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}. \]

By executing the SQL query πmax( (σ ({homo_sapiens, callithrix_jacchus, chlorocebus_sabaeus, gorilla_gorilla, macaca_fascicularis, macaca_mulatta, nomascus_leucogenys, pan_troglodytes, papio_anubis and pongo_abelli})) on the TandemRepeatDB tables, the MAXIMUM Tandem Repeats of each repeat in all genome tables are extracted. The queried data is given as input to the correlation coefficient measure; the measures are shown in Table 1.
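
As a minimal sketch of the Pearson correlation coefficient described above, the following Python function computes r as the covariance divided by the product of the standard deviations. The function name `pearson_r`, the choice of dividing by n (the population form) and the toy input values are assumptions made for illustration; they are not taken from the paper's data.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation r = cov(X, Y) / (sigma_X * sigma_Y).

    Assumption: covariance and standard deviations both divide by n;
    the factor cancels in the ratio, so r is the same with n - 1.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Hypothetical per-chromosome maximum repeat counts for two species.
    species_a = [4, 7, 5, 9, 6]
    species_b = [3, 8, 5, 7, 6]
    print(round(pearson_r(species_a, species_b), 4))
```

Here each input sequence would hold one value per chromosome for a given Tandem Repeat, so one call yields one cell of a species-versus-species table.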
Table 1

Correlation Coefficient measures of Homo sapiens genomes versus Callithrix jacchus, Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli.

| Homo sapiens (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Callithrix jacchus | 0.200446 | 0.102062 | 0.171365 | 0.123596 | 0.176777 | 0.103889 | 0.127986 | 0.14017 | 0.406061 | 0.032686 |
| Chlorocebus sabaeus | 0.019488 | 0.072169 | 0.162614 | 0.192739 | 0.081111 | 0.029802 | 0.001087 | 0.147246 | 0.285724 | 0.278168 |
| Gorilla gorilla | 0.013941 | 0.369925 | 0.202242 | 0.749865 | 0.179746 | 0.01365 | 0.199109 | 0.213699 | 0.037247 | 0.152294 |
| Macaca fascicularis | 0.107922 | 0.131794 | 0.286145 | 0.194849 | 0.136482 | 0.238217 | 0.13257 | 0.211702 | 0.249029 | 0.266628 |
| Macaca mulatta | 0.018966 | 0.139963 | 0.084173 | 0.250192 | 0.139573 | 0.042875 | 0.043906 | 0.137929 | 0.23994 | 0.004386 |
| Nomascus leucogenys | 0.108512 | 0.290926 | 0.232048 | 0.278772 | 0.312555 | 0.331841 | 0.314733 | 0.089229 | 0.040664 | 0.14093 |
| Pan troglodytes | 0.131857 | 0.185799 | 0.143149 | 0.184133 | 0.272337 | 0.095368 | 0.052729 | 0.124725 | 0.273724 | 0.097109 |
| Papio anubis | 0.321465 | 0.154335 | 0.029247 | 0.092762 | 0.010851 | 0.46405 | 0.158686 | 0.115516 | 0.157341 | 0.117418 |
| Pongo abelli | 0.537383 | 0.241432 | 0.013134 | 0.47516 | 0.230636 | 0.140526 | 0.070296 | 0.263892 | 0.212457 | 0.268534 |

In all the tables, rows represent genome data sets and columns represent Tandem Repeats. The data in the tables show the similarity measures of the corresponding genome data. Table 1 shows the correlation coefficient measures of H. sapiens genomes versus C. jacchus, C. sabaeus, G. gorilla, M. fascicularis, M. mulatta, N. leucogenys, P. troglodytes, P. anubis and P. abelli genomes. From Table 1, it is observed that every Tandem Repeat shows a positive correlation. The following correlations are also observed. Between H. sapiens and C. jacchus, TATC shows the highest positive correlation (0.406), whereas TCTG shows the lowest (0.032). Between H. sapiens and C. sabaeus, TATC shows the highest positive correlation (0.285), whereas AGAT shows the lowest (0.001087). Between H. sapiens and G. gorilla, TCTA shows the highest positive correlation (0.749), whereas GAAT shows the lowest (0.01365). Between H. sapiens and M. fascicularis, GATA shows the highest positive correlation (0.286), whereas TAGA shows the lowest (0.1079). Between H. sapiens and M. mulatta, TCTA shows the highest positive correlation (0.25), whereas TCTG shows the lowest (0.004386). Between H. sapiens and N. leucogenys, GAAT shows the highest positive correlation (0.3318), whereas TATC shows the lowest (0.040664). Between H. sapiens and P. troglodytes, TATC shows the highest positive correlation (0.2737), whereas AGAT shows the lowest (0.052729). Between H. sapiens and P. anubis, GAAT shows the highest positive correlation (0.464), whereas TCAT shows the lowest (0.010851). Between H. sapiens and P. abelli, TAGA shows the highest positive correlation (0.537), whereas GATA shows the lowest (0.013134).

The overall highest value in Table 1, 0.749, occurs at the TCTA Tandem Repeat of G. gorilla and shows a positive correlation between the sets of H. sapiens and G. gorilla. Tables 2 to 9 show the correlation coefficient measures among the other genome data sets, and observations very similar to those from Table 1 can be made from them. Some of the observations are:
Table 2

Correlation coefficient measures of Callithrix jacchus genomes versus Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Callithrix jacchus (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chlorocebus sabaeus | 0.328125 | 0.176777 | 0.218182 | 0.066319 | 0.043015 | 0.200805 | 0.032511 | 0.045434 | 0.356406 | 0.389536 |
| Gorilla gorilla | 0.019061 | 0.080128 | 0.055201 | 0.012361 | 0.234333 | 0.33094 | 0.282006 | 0.060398 | 0.144457 | 0.097763 |
| Macaca fascicularis | 0.053087 | 0.145896 | 0 | 0.151402 | 0.095871 | 0.147296 | 0.117555 | 0.102869 | 0.130212 | 0.127185 |
| Macaca mulatta | 0.098304 | 0.184999 | 0.157378 | 0.08269 | 0.157026 | 0.196533 | 0.257248 | 0.185386 | 0.219581 | 0.112572 |
| Nomascus leucogenys | 0.076547 | 0.076547 | 0.076547 | 0.076547 | 0.076547 | 0.076547 | 0.076547 | 0.076547 | 0.076547 | 0.076547 |
| Pan troglodytes | 0.681621 | 0.531428 | 0.324655 | 0.369584 | 0.116514 | 0.171631 | 0.326814 | 0.454315 | 0.333299 | 0.8307 |
| Papio anubis | 0.170735 | 0.180679 | 0.099669 | 0.036943 | 0.094883 | 0.034143 | 0.329083 | 0.202311 | 0.107175 | 0.384421 |
| Pongo abelli | 0.166473 | 0.181577 | 0.10488 | 0.257085 | 0.089173 | 0.21143 | 0.267057 | 0.015841 | 0.026602 | 0.169649 |

Table 3

Correlation coefficient measures of Chlorocebus sabaeus genomes versus Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Chlorocebus sabaeus (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gorilla gorilla | 0.640857 | 0.88212 | 0.405542 | 0.634391 | 0.662396 | 0.435253 | 0.106101 | 0.791399 | 0.688471 | 0.599238 |
| Macaca fascicularis | 0.349842 | 0.598501 | 0.499025 | 0.727016 | 0.768771 | 0.353851 | 0.223126 | 0.621981 | 0.478913 | 0.619823 |
| Macaca mulatta | 0.770189 | 0.49128 | 0.839825 | 0.381185 | 0.44722 | 0.714002 | 0.277369 | 0.531669 | 0.939179 | 0.620586 |
| Nomascus leucogenys | 0.567134 | 0.542375 | 0.437535 | 0.586107 | 0.449786 | 0.487557 | 0.495324 | 0.445657 | 0.715455 | 0.250417 |
| Pan troglodytes | 0.349779 | 0.413092 | 0.575629 | 0.472225 | 0.381874 | 0.574563 | 0.452958 | 0.503647 | 0.520591 | 0.411884 |
| Papio anubis | 0.585966 | 0.384426 | 0.151903 | 0.378309 | 0.470933 | 0.086993 | 0.562842 | 0.341685 | 0.195816 | 0.452875 |
| Pongo abelli | 0.332388 | 0.400708 | 0.175277 | 0.393403 | 0.278554 | 0.346856 | 0.30668 | 0.223723 | 0.337736 | 0.252836 |

Table 4

Correlation coefficient measures of Gorilla gorilla genomes versus Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Gorilla gorilla (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Macaca fascicularis | 0.491304 | 0.264215 | 0.109273 | 0.161771 | 0.260462 | 0.250062 | 0.276295 | 0.288584 | 0.294069 | 0.460102 |
| Macaca mulatta | 0.422999 | 0.677541 | 0.550816 | 0.303802 | 0.48935 | 0.293201 | 0.085579 | 0.500019 | 0.447214 | 0.3397 |
| Nomascus leucogenys | 0.317612 | 0.271708 | 0.684641 | 0.319758 | 0.374766 | 0.295869 | 0.358296 | 0.14395 | 0.448556 | 0.009342 |
| Pan troglodytes | 0.107299 | 0.147788 | 0.220763 | 0.530794 | 0.033686 | 0.305351 | 0.004551 | 0.146801 | 0.265016 | 0.0804 |
| Papio anubis | 0.400278 | 0.052511 | 0.088327 | 0.175621 | 0.037176 | 0.09547 | 0.124944 | 7.20E-16 | 0.029802 | 0.015433 |
| Pongo abelli | 0.408863 | 0.139876 | 0.09249 | 0.173788 | 0.189805 | 0.080632 | 0.128492 | 0.155725 | 0.213405 | 0.021678 |

Table 5

Correlation coefficient measures of Macaca fascicularis genomes versus Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

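| Macaca fascicularis (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Macaca mulatta | 0.676184 | 0.091145 | 0.329204 | 0.136178 | 0.565936 | 0.291937 | 0.524909 | 0.282913 | 0.456832 | 0.112097 |
| Nomascus leucogenys | 0.299786 | 0.018588 | 0.349482 | 0.089532 | 0.631784 | 0.720937 | 0.192511 | 0.070829 | 0.423114 | 0.200311 |
| Pan troglodytes | 0.182887 | 0.008755 | 0.339561 | 0.047286 | 0.247775 | 0.016399 | 0.224782 | 0.286707 | 0.083361 | 0.238254 |
| Papio anubis | 0.114517 | 0.597851 | 0.216025 | 0.048224 | 0.187767 | 0.254894 | 0.403582 | 0.118345 | 0.354019 | 0.326006 |
| Pongo abelli | 0.021341 | 0.107491 | 0.009068 | 5.08E-16 | 0.26998 | 0.114633 | 0.409001 | 0.2095 | 4.96E-16 | 0.070376 |
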
Table 6

Correlation coefficient measure of Macaca mulatta genomes versus Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

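| Macaca mulatta (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nomascus leucogenys | 0 | 0.084895 | 0.5 | 0.307095 | 0.330701 | 0.39615 | 0.180568 | 0.205161 | 0.385654 | 0.228129 |
| Pan troglodytes | 0.91663 | 0.285989 | 0.311647 | 0.03032 | 0.202775 | 0.337225 | 0.072509 | 0.349482 | 0.226221 | 0.313863 |
| Papio anubis | 0.01194 | 0.447368 | 0.046127 | 0.107604 | 0.041417 | 0.237945 | 0.199931 | 0.4 | 0.030682 | 0.236297 |
| Pongo abelli | 0 | 0.132221 | 0.114401 | 0.273635 | 0.243222 | 0.114708 | 0.05698 | 0.236743 | 0.188163 | 0.395974 |
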
Table 7

Correlation coefficient measures of Nomascus leucogenys versus Pan troglodytes, Papio anubis and Pongo abelli genomes.

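| Nomascus leucogenys (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pan troglodytes | 0.443135 | 0.41762 | 0.360597 | 0.586238 | 0.65982 | 0.50545 | 0.223462 | 0.699197 | 0.541433 | 0.496358 |
| Papio anubis | 0.441707 | 0.688247 | 0.707424 | 0.352673 | 0.637482 | 0.35767 | 0.617901 | 0.519875 | 0.459358 | 0.486299 |
| Pongo abelli | 0.185251 | 0.707947 | 0.472408 | 0.35465 | 0.325077 | 0.542945 | 0.425792 | 0.609597 | 0.84058 | 0.611654 |
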
Table 8

Correlation coefficient measures of Pan troglodytes genomes versus Papio anubis and Pongo abelli.

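| Pan troglodytes (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Papio anubis | 0.68698 | 0.62301 | 0.471927 | 0.510857 | 0.216995 | 0.038472 | 0.186636 | 0.323029 | 0.441541 | 0.494709 |
| Pongo abelli | 0.143491 | 0.137813 | 0.01815 | 0.071753 | 0.081519 | 0.047088 | 0.151872 | 0.120074 | 0.129268 | 0.224163 |
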
Table 9

Correlation coefficient measures of Papio anubis genomes versus Pongo abelli genomes.

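| Papio anubis (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pongo abelli | 0.560137 | 0.023509 | 0.264867 | 0.180269 | 0.136078 | 0.48437 | 0.208681 | 0.023444 | 0.090434 | 0.018164 |
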
In Table 2, the highest value, 0.8307, corresponds to the TCTG Tandem Repeat of P. troglodytes and shows a positive correlation between the sets of C. jacchus and P. troglodytes. In Table 3, the highest value, 0.93, corresponds to the TATC Tandem Repeat of M. mulatta and shows a positive correlation between the sets of C. sabaeus and M. mulatta. In Table 4, the highest value, 0.68, corresponds to the GATA Tandem Repeat of N. leucogenys and shows a positive correlation between the sets of G. gorilla and N. leucogenys. In Table 5, the highest value, 0.72, corresponds to the GAAT Tandem Repeat of N. leucogenys and shows a positive correlation between the sets of M. fascicularis and N. leucogenys. In Table 6, the highest value, 0.916, corresponds to the TAGA Tandem Repeat of P. troglodytes and shows a positive correlation between the sets of M. mulatta and P. troglodytes. In Table 7, the highest value, 0.840, corresponds to the TATC Tandem Repeat of P. abelli and shows a positive correlation between the sets of N. leucogenys and P. abelli. In Table 8, the highest value, 0.686, corresponds to the TAGA Tandem Repeat of P. anubis and shows a positive correlation between the sets of P. troglodytes and P. anubis. In Table 9, the highest value, 0.56, corresponds to the TAGA Tandem Repeat of P. abelli and shows a positive correlation between the sets of P. anubis and P. abelli.

Rank correlation coefficient

A rank correlation coefficient [11], [12] measures the degree of similarity between two sets of data and can be used to assess the significance of the relation between them. Different rank correlation coefficients are available in the literature. In this paper, Spearman's rank correlation coefficient is considered and denoted by r. It can be measured by the formula

\[ r = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} \]

where \(d_i = R(X_i) - R(Y_i)\) is the difference between the ranks of \(X_i\) and \(Y_i\) for each i, and n is the number of pairs of observations.

By executing the SQL query πmax( (σ {homo_sapiens, callithrix_jacchus, chlorocebus_sabaeus, gorilla_gorilla, macaca_fascicularis, macaca_mulatta, nomascus_leucogenys, pan_troglodytes, papio_anubis and pongo_abelli})) on the TandemRepeatDB tables, the MAXIMUM Tandem Repeats of each repeat in all genome tables are extracted. The queried data is then arranged in the form of ranks. The ranks are given as input to the rank correlation coefficient measure; the measures are shown in Table 10.
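
The rank correlation step can be sketched in Python as follows. The helper names `average_ranks` and `spearman_r` are hypothetical, and giving tied values the average of their ranks is an assumption of this sketch; the formula based on squared rank differences is exact only when there are no ties.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 (smallest) to n; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over all entries equal to the current value.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's r = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rank_x = average_ranks(x)
    rank_y = average_ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Hypothetical maximum repeat counts for two species.
    print(spearman_r([4, 7, 5, 9, 6], [3, 8, 5, 7, 6]))
```

Because only the ordering of the values matters, this measure can be close to 1 even when the raw values of two genomes differ considerably.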
Table 10

Rank correlation coefficient measures of Homo sapiens genomes versus Callithrix jacchus, Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Homo sapiens (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Callithrix jacchus | 0.963462 | 0.997692 | 0.986154 | 0.981538 | 0.979231 | 0.983462 | 0.976154 | 0.994231 | 0.986923 | 0.903846 |
| Chlorocebus sabaeus | 0.973462 | 0.989615 | 0.990385 | 0.970385 | 0.982692 | 0.949231 | 0.953462 | 0.993462 | 0.979231 | 0.911923 |
| Gorilla gorilla | 0.979249 | 0.988142 | 0.990119 | 0.98419 | 0.979743 | 0.974802 | 0.980731 | 0.977767 | 0.94417 | 0.980731 |
| Macaca fascicularis | 0.976849 | 0.986448 | 0.961039 | 0.988142 | 0.976285 | 0.980802 | 0.961604 | 0.987578 | 0.993224 | 0.980802 |
| Macaca mulatta | 0.983766 | 0.998052 | 0.975325 | 0.975325 | 0.933766 | 0.883117 | 0.964935 | 0.986364 | 0.986364 | 0.977273 |
| Nomascus leucogenys | 0.971429 | 0.998701 | 0.977922 | 0.974026 | 0.964286 | 0.969481 | 0.985065 | 0.99026 | 0.934416 | 0.976623 |
| Pan troglodytes | 0.985217 | 0.965217 | 0.984348 | 0.973043 | 0.943913 | 0.98087 | 0.951304 | 0.996087 | 0.96 | 0.95087 |
| Papio anubis | 0.978543 | 0.998306 | 0.957651 | 0.981366 | 0.981366 | 0.981366 | 0.980802 | 0.98419 | 0.907962 | 0.994918 |
| Pongo abelli | 0.982213 | 0.998518 | 0.977767 | 0.979743 | 0.975296 | 0.987648 | 0.985178 | 0.987154 | 0.972826 | 0.899209 |

Table 10 shows the rank correlation coefficient measures of H. sapiens genomes versus C. jacchus, C. sabaeus, G. gorilla, M. fascicularis, M. mulatta, N. leucogenys, P. troglodytes, P. anubis and P. abelli genomes. From Table 10, it is observed that every Tandem Repeat shows a positive rank correlation. The following correlations are also observed. Between H. sapiens and C. jacchus, AGAA shows the highest positive correlation (0.997), whereas TCTG shows the lowest (0.903). Between H. sapiens and C. sabaeus, CTTT shows the highest positive correlation (0.993), whereas TCTG shows the lowest (0.911). Between H. sapiens and G. gorilla, GATA shows the highest positive correlation (0.990), whereas TATC shows the lowest (0.944). Between H. sapiens and M. fascicularis, TATC shows the highest positive correlation (0.993), whereas GATA shows the lowest (0.9610). Between H. sapiens and M. mulatta, AGAA shows the highest positive correlation (0.998), whereas GAAT shows the lowest (0.883). Between H. sapiens and N. leucogenys, AGAA shows the highest positive correlation (0.998), whereas TATC shows the lowest (0.934). Between H. sapiens and P. troglodytes, CTTT shows the highest positive correlation (0.996), whereas TCAT shows the lowest (0.943). Between H. sapiens and P. anubis, AGAA shows the highest positive correlation (0.998), whereas TATC shows the lowest (0.907). Between H. sapiens and P. abelli, AGAA shows the highest positive correlation (0.998), whereas TCTG shows the lowest (0.899).

The overall highest value, 0.998, occurs at the AGAA Tandem Repeat of P. abelli, P. anubis, N. leucogenys and M. mulatta and shows a positive correlation between the sets of H. sapiens and P. abelli, P. anubis, N. leucogenys and M. mulatta. Tables 11 to 18 show the rank correlation coefficient measures among the other genome data sets, and observations very similar to those from Table 10 can be made from them. Some of the observations are:
Table 11

Rank correlation coefficient measures of Callithrix jacchus genomes versus Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Callithrix jacchus (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chlorocebus sabaeus | 0.937692 | 0.986538 | 0.986538 | 0.943462 | 0.990385 | 0.921154 | 0.967308 | 0.995385 | 0.991538 | 0.987308 |
| Gorilla gorilla | 0.986166 | 0.985178 | 0.987154 | 0.972826 | 0.981225 | 0.975296 | 0.969862 | 0.981719 | 0.969368 | 0.971838 |
| Macaca fascicularis | 0.950875 | 0.981366 | 0.939018 | 0.985319 | 0.961604 | 0.981366 | 0.970638 | 0.988142 | 0.979108 | 0.966685 |
| Macaca mulatta | 0.980519 | 0.996104 | 0.988312 | 0.971429 | 0.873377 | 0.887662 | 0.921429 | 0.987013 | 0.97013 | 0.967532 |
| Nomascus leucogenys | 0.952597 | 0.995455 | 0.977273 | 0.958442 | 0.980519 | 0.977922 | 0.95974 | 0.994156 | 0.963636 | 0.991558 |
| Pan troglodytes | 0.955652 | 0.962174 | 0.982609 | 0.94913 | 0.88087 | 0.976522 | 0.973913 | 0.994348 | 0.984348 | 0.973478 |
| Papio anubis | 0.985319 | 0.996612 | 0.953698 | 0.971767 | 0.965556 | 0.955957 | 0.961604 | 0.983625 | 0.945793 | 0.944664 |
| Pongo abelli | 0.91996 | 0.99753 | 0.973814 | 0.976285 | 0.969862 | 0.990119 | 0.950593 | 0.988142 | 0.982213 | 0.974308 |

Table 12

Rank correlation coefficient measures of Chlorocebus sabaeus genomes versus Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Chlorocebus sabaeus (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gorilla gorilla | 0.952569 | 0.994565 | 0.990613 | 0.972332 | 0.988142 | 0.920949 | 0.932806 | 0.98419 | 0.978261 | 0.974308 |
| Macaca fascicularis | 0.963298 | 0.992095 | 0.964992 | 0.952005 | 0.981366 | 0.968944 | 0.945793 | 0.987013 | 0.976285 | 0.968944 |
| Macaca mulatta | 0.95 | 0.997403 | 0.974675 | 0.980519 | 0.888961 | 0.946753 | 0.856494 | 0.985714 | 0.966883 | 0.971429 |
| Nomascus leucogenys | 0.98 | 0.995652 | 0.989565 | 0.99087 | 0.983913 | 0.962174 | 0.948261 | 0.99087 | 0.972609 | 0.983043 |
| Pan troglodytes | 0.971739 | 0.984348 | 0.987391 | 0.973043 | 0.903043 | 0.944783 | 0.961304 | 0.994783 | 0.97913 | 0.974783 |
| Papio anubis | 0.957086 | 0.984754 | 0.960474 | 0.987013 | 0.972897 | 0.966121 | 0.937888 | 0.980237 | 0.946358 | 0.952569 |
| Pongo abelli | 0.967391 | 0.967391 | 0.967391 | 0.967391 | 0.967391 | 0.967391 | 0.967391 | 0.967391 | 0.967391 | 0.967391 |

Table 13

Rank correlation coefficient measures of Gorilla gorilla genomes versus Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Gorilla gorilla (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Macaca fascicularis | 0.963636 | 0.983117 | 0.950649 | 0.947403 | 0.961039 | 0.971429 | 0.963636 | 0.996753 | 0.936364 | 0.984416 |
| Macaca mulatta | 0.984962 | 0.996992 | 0.984211 | 0.972932 | 0.869925 | 0.909774 | 0.966165 | 0.996241 | 0.915038 | 0.981203 |
| Nomascus leucogenys | 0.966165 | 0.996992 | 0.977444 | 0.972932 | 0.980451 | 0.969925 | 0.972932 | 0.988722 | 0.95188 | 0.987218 |
| Pan troglodytes | 0.978077 | 0.985 | 0.97 | 0.988462 | 0.849231 | 0.986923 | 0.973077 | 0.989615 | 0.965385 | 0.979615 |
| Papio anubis | 0.99026 | 0.997403 | 0.953247 | 0.970779 | 0.96039 | 0.964286 | 0.974675 | 0.994156 | 0.951948 | 0.983117 |
| Pongo abelli | 0.963846 | 0.989615 | 0.977692 | 0.989615 | 0.976538 | 0.985385 | 0.980385 | 0.967308 | 0.972692 | 0.952308 |

Table 14

Rank correlation coefficient measures of Macaca fascicularis genomes versus Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

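| Macaca fascicularis (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Macaca mulatta | 0.978571 | 0.985714 | 0.933766 | 0.959091 | 0.93961 | 0.944156 | 0.924026 | 0.987013 | 0.987662 | 0.987013 |
| Nomascus leucogenys | 0.980451 | 0.978947 | 0.954887 | 0.950376 | 0.97218 | 0.975188 | 0.948872 | 0.986466 | 0.933835 | 0.985714 |
| Pan troglodytes | 0.981169 | 0.977922 | 0.957143 | 0.918182 | 0.924675 | 0.972078 | 0.961688 | 0.985065 | 0.949351 | 0.965584 |
| Papio anubis | 0.977979 | 0.987013 | 0.971767 | 0.966121 | 0.984754 | 0.965556 | 0.971767 | 0.996612 | 0.916996 | 0.988142 |
| Pongo abelli | 0.971429 | 0.983117 | 0.951299 | 0.956494 | 0.973377 | 0.976623 | 0.954545 | 0.998052 | 0.972078 | 0.932468 |
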
Table 15

Rank correlation coefficient measures of Macaca mulatta genomes versus Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

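| Macaca mulatta (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nomascus leucogenys | 0.969298 | 0.996491 | 0.968421 | 0.971053 | 0.854386 | 0.913158 | 0.959649 | 0.985088 | 0.874561 | 0.994737 |
| Pan troglodytes | 0.946617 | 0.942857 | 0.969925 | 0.95188 | 0.932331 | 0.907519 | 0.957895 | 0.983459 | 0.940602 | 0.966165 |
| Papio anubis | 0.992208 | 0.998701 | 0.93961 | 0.979221 | 0.937013 | 0.90974 | 0.967532 | 0.997403 | 0.895455 | 0.985714 |
| Pongo abelli | 0.904511 | 0.996992 | 0.944361 | 0.971429 | 0.903008 | 0.890226 | 0.968421 | 0.997744 | 0.966917 | 0.934586 |
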
Table 16

Rank correlation coefficient measures of Nomascus leucogenys genomes versus Pan troglodytes, Papio anubis and Pongo abelli genomes.

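| Nomascus leucogenys (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pan troglodytes | 0.978947 | 0.941353 | 0.97218 | 0.981955 | 0.871429 | 0.972932 | 0.950376 | 0.986466 | 0.966917 | 0.970677 |
| Papio anubis | 0.97218 | 0.996241 | 0.965414 | 0.972932 | 0.957895 | 0.960902 | 0.984962 | 0.981955 | 0.953383 | 0.981955 |
| Pongo abelli | 0.973684 | 0.996241 | 0.96015 | 0.978195 | 0.961654 | 0.973684 | 0.971429 | 0.983459 | 0.978195 | 0.945113 |
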
Table 17

Rank correlation coefficient measures of Pan troglodytes genomes versus Papio anubis and Pongo abelli genomes.

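| Pan troglodytes (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Papio anubis | 0.979221 | 0.951948 | 0.974026 | 0.967532 | 0.929221 | 0.966234 | 0.951948 | 0.977273 | 0.951948 | 0.947403 |
| Pongo abelli | 0.978846 | 0.965385 | 0.974615 | 0.978846 | 0.896538 | 0.986154 | 0.954231 | 0.963077 | 0.981154 | 0.975 |
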
Table 18

Rank correlation coefficient measures of Papio anubis genomes versus Pongo abelli genomes.

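| Papio anubis (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pongo abelli | 0.952597 | 0.997403 | 0.953896 | 0.974675 | 0.987013 | 0.966883 | 0.968182 | 0.995455 | 0.937013 | 0.9 |
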
In Table 11, the highest value, 0.997, corresponds to the AGAA Tandem Repeat of P. abelli and shows a positive correlation between the sets of C. jacchus and P. abelli. In Table 12, the highest value, 0.997, corresponds to the AGAA Tandem Repeat of M. mulatta and shows a positive correlation between the sets of C. sabaeus and M. mulatta. In Table 13, the highest value, 0.997, corresponds to the AGAA Tandem Repeat of P. anubis and shows a positive correlation between the sets of G. gorilla and P. anubis. In Table 14, the highest value, 0.998, corresponds to the CTTT Tandem Repeat of P. abelli and shows a positive correlation between the sets of M. fascicularis and P. abelli. In Table 15, the highest value, 0.998, corresponds to the AGAA Tandem Repeat of P. anubis and shows a positive correlation between the sets of M. mulatta and P. anubis. In Table 16, the highest value, 0.996, corresponds to the AGAA Tandem Repeat of P. abelli and shows a positive correlation between the sets of N. leucogenys and P. abelli. In Table 17, the highest value, 0.986, corresponds to the GAAT Tandem Repeat of P. abelli and shows a positive correlation between the sets of P. troglodytes and P. abelli. In Table 18, the highest value, 0.997, corresponds to the AGAA Tandem Repeat of P. abelli and shows a positive correlation between the sets of P. anubis and P. abelli.

Cosine similarity

Cosine similarity [13], [14] is a measure of similarity between two data sets. The cosine of two sets can be derived from the Euclidean dot product formula as

\[ \cos(X,Y) = \frac{\sum_{i=1}^{n} X_i Y_i}{\sqrt{\sum_{i=1}^{n} X_i^2}\,\sqrt{\sum_{i=1}^{n} Y_i^2}} \]

where n is the number of observations, Σ is the summation symbol, \(X_i\) is the X value for observation i, and \(Y_i\) is the Y value for observation i.

By executing the SQL query πmax( (σ ({homo_sapiens, callithrix_jacchus, chlorocebus_sabaeus, gorilla_gorilla, macaca_fascicularis, macaca_mulatta, nomascus_leucogenys, pan_troglodytes, papio_anubis and pongo_abelli})) on the TandemRepeatDB tables, the MAXIMUM Tandem Repeats of each repeat in all genome tables are extracted. The queried data is given as input to the cosine similarity measure; the measures are shown in Table 19.
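
A minimal Python sketch of the cosine similarity measure follows. The function name `cosine_similarity` and the sample values are illustrative assumptions, not taken from the paper's data; the computation is the dot product of the two vectors divided by the product of their Euclidean norms.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """cos(X, Y) = sum(X_i * Y_i) / (sqrt(sum(X_i^2)) * sqrt(sum(Y_i^2)))."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    # Hypothetical maximum repeat counts for two species.
    print(round(cosine_similarity([4, 7, 5, 9, 6], [3, 8, 5, 7, 6]), 4))
```

For non-negative inputs such as repeat counts the result lies between 0 and 1, which is consistent with the range of values reported in the cosine similarity tables.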
Table 19

Cosine similarity measures of Homo sapiens genomes versus Callithrix jacchus, Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Homo sapiens (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Callithrix jacchus | 0.790569 | 0.926302 | 0.608522 | 0.822881 | 0.77594 | 0.820254 | 0.668472 | 0.796715 | 0.850781 | 0.768615 |
| Chlorocebus sabaeus | 0.73095 | 0.883194 | 0.723515 | 0.680545 | 0.714435 | 0.698323 | 0.662651 | 0.771503 | 0.753182 | 0.668943 |
| Gorilla gorilla | 0.866169 | 0.857403 | 0.805146 | 0.702474 | 0.818765 | 0.696725 | 0.567734 | 0.704714 | 0.674307 | 0.81657 |
| Macaca fascicularis | 0.650203 | 0.905204 | 0.606275 | 0.873273 | 0.60911 | 0.776316 | 0.56296 | 0.87519 | 0.858116 | 0.859676 |
| Macaca mulatta | 0.867461 | 0.942809 | 0.68973 | 0.790756 | 0.594661 | 0.744622 | 0.768633 | 0.872797 | 0.801784 | 0.860689 |
| Nomascus leucogenys | 0.586952 | 0.968963 | 0.690543 | 0.692308 | 0.643172 | 0.684579 | 0.784157 | 0.84591 | 0.696867 | 0.801784 |
| Pan troglodytes | 0.726658 | 0.806255 | 0.666597 | 0.72501 | 0.666067 | 0.639094 | 0.701358 | 0.847566 | 0.785118 | 0.808019 |
| Papio anubis | 0.763381 | 0.944911 | 0.55688 | 0.810122 | 0.645497 | 0.863294 | 0.727518 | 0.887262 | 0.745054 | 0.906493 |
| Pongo abelli | 0.808176 | 0.946864 | 0.75371 | 0.498557 | 0.559773 | 0.88632 | 0.828177 | 0.85769 | 0.702439 | 0.7534 |

Table 19 shows the cosine similarity measures of H. sapiens genomes versus C. jacchus, C. sabaeus, G. gorilla, M. fascicularis, M. mulatta, N. leucogenys, P. troglodytes, P. anubis and P. abelli genomes. From Table 19, it is observed that every Tandem Repeat shows a good relation. The following relations are also observed. Between H. sapiens and C. jacchus, AGAA shows a good relation (0.926), whereas GATA shows a weak relation (0.608). Between H. sapiens and C. sabaeus, AGAA shows a good relation (0.883), whereas AGAT shows a weak relation (0.662). Between H. sapiens and G. gorilla, TAGA shows a good relation (0.866), whereas AGAT shows a weak relation (0.567). Between H. sapiens and M. fascicularis, AGAA shows a good relation (0.905), whereas AGAT shows a weak relation (0.562). Between H. sapiens and M. mulatta, AGAA shows a good relation (0.942), whereas TCAT shows a weak relation (0.594). Between H. sapiens and N. leucogenys, AGAA shows a good relation (0.968), whereas TAGA shows a weak relation (0.586). Between H. sapiens and P. troglodytes, CTTT shows a good relation (0.847), whereas GAAT shows a weak relation (0.639). Between H. sapiens and P. anubis, AGAA shows a good relation (0.944), whereas GATA shows a weak relation (0.556). Between H. sapiens and P. abelli, AGAA shows a good relation (0.946), whereas TCTA shows a weak relation (0.498).

The overall highest value, 0.968, occurs at the AGAA Tandem Repeat of N. leucogenys and shows a good relation between the sets of H. sapiens and N. leucogenys. Tables 20 to 27 show the cosine similarity measures among the other genome data sets, and observations very similar to those from Table 19 can be made from them. Some of the observations are:
Table 20

Cosine similarity measures of Callithrix jacchus genomes versus Chlorocebus sabaeus, Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Callithrix jacchus (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Chlorocebus sabaeus | 0.851041 | 0.881953 | 0.646811 | 0.708813 | 0.796599 | 0.767146 | 0.674679 | 0.736242 | 0.828775 | 0.769682 |
| Gorilla gorilla | 0.837957 | 0.826153 | 0.635438 | 0.763559 | 0.730437 | 0.739975 | 0.703337 | 0.652438 | 0.646129 | 0.717975 |
| Macaca fascicularis | 0.776493 | 0.865768 | 0.666252 | 0.829515 | 0.700071 | 0.850842 | 0.652789 | 0.8141 | 0.741059 | 0.712274 |
| Macaca mulatta | 0.847174 | 0.904534 | 0.737931 | 0.742611 | 0.662729 | 0.684532 | 0.748598 | 0.811107 | 0.736571 | 0.722718 |
| Nomascus leucogenys | 0.878945 | 0.889898 | 0.722544 | 0.677354 | 0.7542 | 0.862116 | 0.723434 | 0.881662 | 0.666541 | 0.790981 |
| Pan troglodytes | 0.792183 | 0.830336 | 0.675053 | 0.728881 | 0.608675 | 0.65467 | 0.69786 | 0.67868 | 0.876223 | 0.711305 |
| Papio anubis | 0.832495 | 0.907485 | 0.765345 | 0.683599 | 0.775203 | 0.74784 | 0.697669 | 0.841879 | 0.771757 | 0.749613 |
| Pongo abelli | 0.737014 | 0.919145 | 0.696358 | 0.627494 | 0.731126 | 0.807957 | 0.672312 | 0.818881 | 0.68206 | 0.650523 |

Table 21

Cosine similarity measures of Chlorocebus sabaeus genomes versus Gorilla gorilla, Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Chlorocebus sabaeus (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
|---|---|---|---|---|---|---|---|---|---|---|
| Gorilla gorilla | 0.791563 | 0.814174 | 0.724469 | 0.673909 | 0.771869 | 0.605705 | 0.472456 | 0.73605 | 0.81478 | 0.713223 |
| Macaca fascicularis | 0.620089 | 0.815591 | 0.69455 | 0.78072 | 0.909416 | 0.735456 | 0.532952 | 0.853992 | 0.748534 | 0.690281 |
| Macaca mulatta | 0.730286 | 0.854242 | 0.730159 | 0.721117 | 0.639382 | 0.629416 | 0.657571 | 0.85042 | 0.732143 | 0.720577 |
| Nomascus leucogenys | 0.764996 | 0.783604 | 0.795472 | 0.838235 | 0.809963 | 0.677192 | 0.743919 | 0.910877 | 0.67465 | 0.7396 |
| Pan troglodytes | 0.692144 | 0.777018 | 0.785646 | 0.662503 | 0.681598 | 0.643622 | 0.661892 | 0.723627 | 0.759369 | 0.630437 |
| Papio anubis | 0.780443 | 0.855908 | 0.607551 | 0.810191 | 0.784063 | 0.605473 | 0.765513 | 0.80226 | 0.734117 | 0.767217 |
| Pongo abelli | 0.691808 | 0.859602 | 0.674541 | 0.629524 | 0.778604 | 0.68156 | 0.628542 | 0.839839 | 0.739375 | 0.683537 |
Table 22

Cosine similarity measures of Gorilla gorilla genomes versus Macaca fascicularis, Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Gorilla gorilla (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
|---|---|---|---|---|---|---|---|---|---|---|
| Macaca fascicularis | 0.682863 | 0.884538 | 0.540875 | 0.776701 | 0.71808 | 0.728652 | 0.572434 | 0.92387 | 0.657018 | 0.790569 |
| Macaca mulatta | 0.860858 | 0.923077 | 0.792747 | 0.670078 | 0.69739 | 0.744352 | 0.568574 | 0.921512 | 0.578399 | 0.752512 |
| Nomascus leucogenys | 0.689134 | 0.929284 | 0.66519 | 0.750306 | 0.780671 | 0.737417 | 0.509615 | 0.783349 | 0.480125 | 0.781408 |
| Pan troglodytes | 0.825765 | 0.702731 | 0.608845 | 0.869048 | 0.701742 | 0.793107 | 0.568802 | 0.761979 | 0.531234 | 0.806872 |
| Papio anubis | 0.842651 | 0.925926 | 0.536175 | 0.602911 | 0.726844 | 0.702112 | 0.525888 | 0.870388 | 0.568787 | 0.852803 |
| Pongo abelli | 0.847671 | 0.873885 | 0.697136 | 0.611887 | 0.703211 | 0.795949 | 0.568473 | 0.793364 | 0.548877 | 0.755865 |
Table 23

Cosine similarity measures of Macaca fascicularis genomes versus Macaca mulatta, Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

| Macaca fascicularis (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
|---|---|---|---|---|---|---|---|---|---|---|
| Macaca mulatta | 0.757033 | 0.952579 | 0.719847 | 0.861858 | 0.71202 | 0.86069 | 0.780721 | 0.979958 | 0.781947 | 0.823532 |
| Nomascus leucogenys | 0.666717 | 0.867149 | 0.749785 | 0.882523 | 0.902698 | 0.775528 | 0.628906 | 0.875936 | 0.720838 | 0.79758 |
| Pan troglodytes | 0.686803 | 0.729204 | 0.722716 | 0.84678 | 0.686352 | 0.629253 | 0.572883 | 0.830868 | 0.734358 | 0.759072 |
| Papio anubis | 0.744529 | 0.95403 | 0.737822 | 0.815374 | 0.786357 | 0.652063 | 0.8 | 0.920634 | 0.834415 | 0.900284 |
| Pongo abelli | 0.655447 | 0.884538 | 0.649184 | 0.598764 | 0.677208 | 0.773021 | 0.745995 | 0.942809 | 0.699395 | 0.770675 |
Table 24

Cosine similarity measures of Macaca mulatta genomes versus Nomascus leucogenys, Pan troglodytes, Papio anubis and Pongo abelli genomes.

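| Macaca mulatta (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
|---|---|---|---|---|---|---|---|---|---|---|
| Nomascus leucogenys | 0.666667 | 0.92 | 0.759737 | 0.736024 | 0.698072 | 0.727324 | 0.775404 | 0.874383 | 0.682732 | 0.924785 |
| Pan troglodytes | 0.901296 | 0.765958 | 0.745356 | 0.679873 | 0.599265 | 0.60455 | 0.762713 | 0.826752 | 0.790277 | 0.776736 |
| Papio anubis | 0.81776 | 0.962963 | 0.718648 | 0.754247 | 0.64515 | 0.703101 | 0.816345 | 0.952579 | 0.728912 | 0.893188 |
| Pongo abelli | 0.779773 | 0.923077 | 0.697374 | 0.504772 | 0.513744 | 0.726345 | 0.803739 | 0.94054 | 0.78009 | 0.825029 |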
Table 25

Cosine similarity measures of Nomascus leucogenys genomes versus Pan troglodytes, Papio anubis and Pongo abelli genomes.

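| Nomascus leucogenys (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
|---|---|---|---|---|---|---|---|---|---|---|
| Pan troglodytes | 0.655186 | 0.754298 | 0.748455 | 0.824958 | 0.796243 | 0.659346 | 0.673909 | 0.707396 | 0.717218 | 0.708064 |
| Papio anubis | 0.71294 | 0.910446 | 0.80111 | 0.70548 | 0.743937 | 0.673451 | 0.857931 | 0.875755 | 0.634733 | 0.850923 |
| Pongo abelli | 0.602534 | 0.910446 | 0.717765 | 0.593848 | 0.701561 | 0.777245 | 0.731263 | 0.849208 | 0.828775 | 0.781918 |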
Table 26

Cosine similarity measures of Pan troglodytes genomes versus Papio anubis and Pongo abelli genomes.

| Pan troglodytes (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
|---|---|---|---|---|---|---|---|---|---|---|
| Papio anubis | 0.884779 | 0.866025 | 0.855337 | 0.74885 | 0.652041 | 0.597355 | 0.617213 | 0.778904 | 0.75963 | 0.837436 |
| Pongo abelli | 0.689658 | 0.79339 | 0.715203 | 0.583562 | 0.598149 | 0.733333 | 0.659456 | 0.818392 | 0.740033 | 0.672977 |
Table 27

Cosine similarity measures of Papio anubis genomes versus Pongo abelli genomes.

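| Papio anubis (VS) | TAGA | AGAA | GATA | TCTA | TCAT | GAAT | AGAT | CTTT | TATC | TCTG |
|---|---|---|---|---|---|---|---|---|---|---|
| Pongo abelli | 0.70274 | 0.925926 | 0.669746 | 0.501648 | 0.850758 | 0.84046 | 0.712389 | 0.8981 | 0.661989 | 0.835053 |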
In Table 20, the highest value, 0.919, corresponds to the AGAA tandem repeat of P. abelli and shows a good relation between the sets of C. jacchus and P. abelli. In Table 21, the highest value, 0.910, corresponds to the CTTT tandem repeat of N. leucogenys and shows a good relation between the sets of C. sabaeus and N. leucogenys. In Table 22, the highest value, 0.929, corresponds to the AGAA tandem repeat of N. leucogenys and shows a good relation between the sets of G. gorilla and N. leucogenys. In Table 23, the highest value, 0.979, corresponds to the CTTT tandem repeat of M. mulatta and shows a good relation between the sets of M. fascicularis and M. mulatta. In Table 24, the highest value, 0.962, corresponds to the AGAA tandem repeat of P. anubis and shows a good relation between the sets of M. mulatta and P. anubis. In Table 25, the highest value, 0.910, corresponds to the AGAA tandem repeat of both P. anubis and P. abelli and shows a good relation between the set of N. leucogenys and the sets of P. anubis and P. abelli. In Table 26, the highest value, 0.884, corresponds to the TAGA tandem repeat of P. anubis and shows a good relation between the sets of P. troglodytes and P. anubis. In Table 27, the highest value, 0.925, corresponds to the AGAA tandem repeat of P. abelli and shows a good relation between the sets of P. anubis and P. abelli.
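The per-table maxima can be read off mechanically. The sketch below does this for Table 25, with values copied from that table; the dictionary layout and the function name are illustrative choices, not part of the paper.

```python
REPEATS = ["TAGA", "AGAA", "GATA", "TCTA", "TCAT",
           "GAAT", "AGAT", "CTTT", "TATC", "TCTG"]

# Table 25: Nomascus leucogenys versus three other species.
TABLE_25 = {
    "Pan troglodytes": [0.655186, 0.754298, 0.748455, 0.824958, 0.796243,
                        0.659346, 0.673909, 0.707396, 0.717218, 0.708064],
    "Papio anubis": [0.71294, 0.910446, 0.80111, 0.70548, 0.743937,
                     0.673451, 0.857931, 0.875755, 0.634733, 0.850923],
    "Pongo abelli": [0.602534, 0.910446, 0.717765, 0.593848, 0.701561,
                     0.777245, 0.731263, 0.849208, 0.828775, 0.781918],
}

def highest_entries(table, repeats):
    """Return the maximum value and every (species, repeat) cell that attains it."""
    best = max(value for row in table.values() for value in row)
    cells = [(species, repeats[i])
             for species, row in table.items()
             for i, value in enumerate(row) if value == best]
    return best, cells

best, cells = highest_entries(TABLE_25, REPEATS)
print(best, cells)
```

Returning every cell that attains the maximum, rather than only the first, keeps ties visible, as with the two AGAA entries in this table.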

Purpose of the research

To perform a DNA analysis, DNA is first extracted from a sample. Just one nanogram of DNA is usually sufficient to provide good data. To match two DNA sequences, for example theft evidence against a suspect, a string matching algorithm searches for the alleles of the 10 STRs [15] in both the evidence sample and the suspect's sample, and a database is prepared from the results. If Suspect A is the source of the theft sample and Suspect B is the other party, the similarity between the evidence and the suspect is measured from the extracted data against the database. This value indicates how similar the two samples are, and the decision is taken on the basis of the resulting values.
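The comparison described here can be sketched as follows. This is only an illustration under stated assumptions: the sequences are invented, each sample is summarized by how often each of the 10 patterns occurs, a naive scan stands in for the paper's string matching algorithm, and cosine similarity is used as the similarity measure. All function names are hypothetical.

```python
import math

PATTERNS = ["TAGA", "AGAA", "GATA", "TCTA", "TCAT",
            "GAAT", "AGAT", "CTTT", "TATC", "TCTG"]

def count_occurrences(sequence, pattern):
    """Count occurrences of pattern in sequence, overlapping ones included."""
    width = len(pattern)
    return sum(1 for i in range(len(sequence) - width + 1)
               if sequence[i:i + width] == pattern)

def profile(sequence):
    """Vector of occurrence counts, one entry per pattern in PATTERNS."""
    return [count_occurrences(sequence, p) for p in PATTERNS]

def cosine_similarity(x, y):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms if norms else 0.0

# Invented toy sequences; real samples would be far longer.
evidence = "TAGATAGATAGAAGAAGAATCTATCTAGATAGATA"
suspect_a = "TAGATAGATAGAAGAAGAATCTATCTAGATAGATA"  # identical to the evidence here
suspect_b = "CTTTCTTTCTTTTCTGTCTGTATCTATCGAATGAAT"

for name, seq in [("A", suspect_a), ("B", suspect_b)]:
    score = cosine_similarity(profile(evidence), profile(seq))
    print(name, round(score, 3))
```

The higher score points to the more similar sample. The source does not state what threshold would justify a decision, so none is assumed here.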

Conclusions

This study measures the similarity between Homo sapiens and monkeys using the correlation coefficient, the rank correlation coefficient and cosine similarity. Table 1, Table 10 and Table 19 show a linear increasing relationship for all of the similarity measures considered. It is also observed that the monkeys correlate closely with H. sapiens.