Computación y Sistemas
ISSN: 1405-5546
computacion-y-sistemas@cic.ipn.mx
Instituto Politécnico Nacional, México

Batyrshin, Ildar Z.; Kubysheva, Nailya; Solovyev, Valery; Villa-Vargas, Luis A. Visualization of Similarity Measures for Binary Data and 2 x 2 Tables. Computación y Sistemas, vol. 20, no. 3, 2016, pp. 345-353. Instituto Politécnico Nacional, Distrito Federal, México.

Available in: http://www.redalyc.org/articulo.oa?id=61547469004
Visualization of Similarity Measures for Binary Data and 2 x 2 Tables

Ildar Z. Batyrshin¹, Nailya Kubysheva¹, Valery Solovyev², Luis A. Villa-Vargas¹

¹ Instituto Politécnico Nacional, Centro de Investigación en Computación (CIC), Mexico
² Kazan Federal University, Kazan, Russia

batyr1@gmail.com, maki.solovyev@mail.ru

Abstract. We propose methods for 3D visualization of the main similarity measures for binary data and 2 x 2 tables. We present the shapes of the Jaccard, Dice, Sokal & Sneath, Rogers & Tanimoto, and other similarity measures. Such visualization gives a direct, visual method of comparing these measures and helps to understand the similarities and differences between them. Based on the visualization of two known parametric families of similarity measures, the paper proposes a new parametric family of measures that generalizes these two families and makes it possible to construct similarity measures occupying an intermediate position between them.

Keywords. Similarity measure, binary variable, contingency table, visualization.

1 Introduction

Similarity measures have numerous applications in computational linguistics, ecology, medicine, biology, social sciences, etc. They play an important role in pattern recognition, machine learning, classification, and statistics [1, 5-7, 10, 12-14, 16, 18, 19]. Dozens of similarity (or dissimilarity) measures for binary data have been proposed, and the problem of comparing them and selecting one for a specific application has been studied in many works [2-10, 15, 17-20]. In different papers, such measures are referred to as association coefficients, similarity coefficients, resemblance measures, etc. Different approaches to comparing similarity measures are based on the similarity of the properties of these measures, the similarity of their formulas, the possibility of transforming one measure into another, the ordering of the measures, the distance between them, etc. [2, 3, 6-12, 18-20]. To the best of our knowledge, there are no works on 3D visualization of binary similarity measures.
Such visualization can be useful for comparing the shapes of similarity measures and selecting a measure more suitable for a specific application. The paper proposes methods for 3D visualization of the most popular similarity measures used for binary data and 2 x 2 tables. Such visualization gives a direct, visual method of comparing these measures and can help to understand the similarities and differences between them.

Several authors have proposed different parametric families of similarity and dissimilarity measures [3, 9, 19, 20]. Based on the visualization of two known parametric families of similarity measures, the paper proposes a new parametric family of similarity measures that generalizes these two families and makes it possible to construct similarity measures occupying an intermediate position between them.

The paper has the following structure. Section 2 gives some basic definitions related to similarity measures for binary data and describes the most popular similarity measures. Section 3 proposes methods for 3D visualization of similarity measures for binary data and visualizes the most popular measures. Section 4 proposes the new parametric family of similarity measures. The last section contains the discussion and conclusion.
2 Basic Definitions

Consider objects described by n binary attributes, descriptors, or properties. An object x is coded by the vector $x = (x_1, \ldots, x_n)$ of n attribute values such that $x_k = 1$ if the object possesses property k and $x_k = 0$ otherwise. Such data are also called presence/absence data [8, 11]. For any two objects $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ the following four numbers are calculated:

a is the number of attributes such that $x_k = 1$, $y_k = 1$;
b is the number of attributes such that $x_k = 1$, $y_k = 0$;
c is the number of attributes such that $x_k = 0$, $y_k = 1$;
d is the number of attributes such that $x_k = 0$, $y_k = 0$.

The numbers a and d are also referred to as the numbers of positive and negative matches, respectively [9, 17]. Note that these four numbers satisfy

$$a + b + c + d = n, \tag{1}$$

where n is the number of binary attributes. These four numbers are represented in Table 1, also known as a 2 x 2 contingency table [1].

Table 1. 2 x 2 contingency table

          y = 1   y = 0
  x = 1     a       b
  x = 0     c       d

Some popular similarity measures defined for such tables are presented below [4, 5, 10].

Jaccard (1908):
$$S_J(x, y) = \frac{a}{a+b+c}. \tag{2}$$

Dice (1945), Czekanowski (1913), Sorensen (1948):
$$S_{CDS}(x, y) = \frac{2a}{2a+b+c}. \tag{3}$$

Sokal & Sneath (1963):
$$S_{SS\text{-}I}(x, y) = \frac{a}{a+2b+2c}. \tag{4}$$

Sokal & Michener (1958), or simple matching:
$$S_{SM}(x, y) = \frac{a+d}{a+b+c+d}. \tag{5}$$

Rogers & Tanimoto (1960):
$$S_{RT}(x, y) = \frac{a+d}{a+2b+2c+d}. \tag{6}$$

Sokal & Sneath (1963):
$$S_{SS\text{-}II}(x, y) = \frac{2a+2d}{2a+b+c+2d}. \tag{7}$$

Russel & Rao (1940):
$$S_{RR}(x, y) = \frac{a}{a+b+c+d}. \tag{8}$$

Faith (1983):
$$S_F(x, y) = \frac{a+0.5d}{a+b+c+d}. \tag{9}$$

3 Visualization of Similarity Measures

Let us consider parametric families of similarity measures that include the known similarity measures as particular cases [9, 20]. The similarity measures (2)-(4) can be generalized as follows:

$$T_\theta = \frac{a}{a+\theta(b+c)}, \tag{10}$$

where θ is a positive real number. The similarity measures (5)-(7) can be considered particular cases of the following parametric family of functions:

$$S_\theta = \frac{a+d}{a+d+\theta(b+c)}. \tag{11}$$
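As a minimal sketch (not the authors' code), the counts of Table 1 and the measures (2)-(9) can be computed from two binary vectors as follows. The function names `contingency` and `similarities` and the dictionary keys are illustrative assumptions; exact fractions are used so that values can be compared without rounding, and the sketch assumes every denominator is non-zero.

```python
from fractions import Fraction


def contingency(x, y):
    """Return the counts (a, b, c, d) of Table 1 for two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    pairs = list(zip(x, y))
    a = sum(1 for xk, yk in pairs if xk == 1 and yk == 1)  # positive matches
    b = sum(1 for xk, yk in pairs if xk == 1 and yk == 0)
    c = sum(1 for xk, yk in pairs if xk == 0 and yk == 1)
    d = sum(1 for xk, yk in pairs if xk == 0 and yk == 0)  # negative matches
    return a, b, c, d


def similarities(a, b, c, d):
    """Measures (2)-(9) as exact fractions (assumes non-zero denominators)."""
    n = a + b + c + d  # relation (1)
    return {
        "jaccard": Fraction(a, a + b + c),                           # (2)
        "dice": Fraction(2 * a, 2 * a + b + c),                      # (3)
        "sokal_sneath_1": Fraction(a, a + 2 * b + 2 * c),            # (4)
        "sokal_michener": Fraction(a + d, n),                        # (5)
        "rogers_tanimoto": Fraction(a + d, a + 2 * b + 2 * c + d),   # (6)
        "sokal_sneath_2": Fraction(2 * a + 2 * d, 2 * a + b + c + 2 * d),  # (7)
        "russel_rao": Fraction(a, n),                                # (8)
        "faith": Fraction(2 * a + d, 2 * n),                         # (9): (a + 0.5d)/n
    }


x = [1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 1]
counts = contingency(x, y)
measures = similarities(*counts)
```

For x = (1, 1, 0, 0, 1) and y = (1, 0, 1, 0, 1) this gives a = 2 and b = c = d = 1, so the Jaccard value is 2/4 and the Sokal & Michener value is 3/5.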
Fig. 1(a). Jaccard similarity measure (view 1)
Fig. 1(b). Jaccard similarity measure (view 2)
Fig. 2(a). Rogers & Tanimoto similarity measure (view 1)
Fig. 2(b). Rogers & Tanimoto similarity measure (view 2)

It will be more convenient for us to use the following notation for these parametric families of similarity measures:

$$S^{(a)}(x, y) = \frac{a}{a+t(b+c)}, \tag{12}$$

$$S^{(a+d)}(x, y) = \frac{a+d}{a+d+t(b+c)}, \tag{13}$$

where t is a positive real number. The similarity measures (2)-(4) are obtained from (12) for the parameter values t = 1, 0.5, 2, respectively. The similarity measures (5)-(7) are obtained from (13)
for the parameter values t = 1, 2, 0.5, respectively. Taking into account that (1) implies

$$b + c = n - (a + d), \tag{14}$$

formulas (12) and (13) can be written in the form

$$S^{(a)}(x, y) = \frac{a}{a+t(n-a-d)}, \tag{15}$$

$$S^{(a+d)}(x, y) = \frac{a+d}{a+d+t(n-a-d)}. \tag{16}$$

The parametric families of similarity measures (15) and (16) have been considered in [20] in the following forms:

$$S^{(A)}(x, y) = \frac{A}{A+\theta(1-A-D)}, \tag{17}$$

$$S^{(A+D)}(x, y) = \frac{A+D}{A+D+\theta(1-A-D)}, \tag{18}$$

where $A = a/n$ and $D = d/n$. In what follows we use formulas (15) and (16) for the considered parametric families, which will be referred to as the (a)-family and the (a+d)-family of similarity measures, respectively.

We propose to use relationship (14) to represent other similarity measures as well. The similarity measures (8) and (9) do not belong to the considered families, but, using relation (1), they can also be written as functions of a and d:

$$S_{RR}(x, y) = \frac{a}{n}, \tag{19}$$

$$S_F(x, y) = \frac{a+0.5d}{n}. \tag{20}$$

As is clear from formulas (15), (16), (19), and (20), for fixed n and t all of these formulas can be plotted in 3D space as functions of the two variables a and d. (Formula (19) actually depends only on a.) From (1) and (14) we obtain:

$$0 \le a + d \le n. \tag{21}$$

This condition restricts the domain of the considered functions. In all figures below we use n = 100 and plot all functions for values of a and d changing from 0 to 100 with step 1, subject to the domain restriction (21).

Figures 1(a) and 1(b) show, in two different projections, the Jaccard similarity measure obtained from the parametric formulas (12) and (15) for the parameter value t = 1:

$$S_J(x, y) = \frac{a}{n-d}. \tag{22}$$

The domain (21) is shown on the plane S = 0 as a triangle with bold sides. Two black lines show profiles of the surface of the similarity measure: 1) for a = 50 and all values of d; 2) for d = 50 and all values of a. The value S = 0.5 marks the value of the measure S for a = 50 and d = 0.
When d = 0, formula (22) gives S = a/n, which corresponds in Figure 1(a) to the line increasing from 0 to 1 as a increases from 0 to 100 with d = 0. Figure 1(b) is obtained from Figure 1(a) by rotating the axes to show the profile of the surface for small values of a and large values of d. This situation corresponds to a large number of negative matches d and hence to small values of the numerator and denominator in (2). Similar comments apply to the figures of the other similarity measures shown later.

Figures 2(a) and 2(b) show two projections of the Rogers & Tanimoto similarity measure. From (6), (13), and (16) we obtain for t = 2:

$$S_{RT}(x, y) = \frac{a+d}{2n-a-d}. \tag{23}$$

Figure 3 shows the surfaces of the following similarity measures belonging to the parametric (a)-family (from left to right): 1) Dice-Czekanowski-Sorensen, 2) Jaccard, 3) Sokal-Sneath-I. Figure 4 shows the surfaces of the following similarity measures belonging to the parametric (a+d)-family (from left to right): 1) Sokal-Sneath-II, 2) Sokal & Michener, 3) Rogers & Tanimoto. For all of these similarity measures, formulas like (22) and (23) are easily obtained from their original definitions by replacing b + c with n - a - d; see (1) and (14).
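The construction described above (fix n and t, then evaluate a measure at every integer point of the triangular domain (21)) can be sketched as follows. This is an illustration under stated assumptions, not the software used for the paper's figures, which the text does not name; the function names `a_family`, `ad_family`, and `surface_points` are hypothetical. Note that the (a)-family expression is 0/0 at the single point a = 0, d = n, which the sketch marks as NaN.

```python
import math


def a_family(a, d, n, t):
    """(a)-family, formula (15); NaN where the expression is 0/0 (a = 0, d = n)."""
    denom = a + t * (n - a - d)
    return a / denom if denom != 0 else math.nan


def ad_family(a, d, n, t):
    """(a+d)-family, formula (16); the denominator is positive for n > 0, t > 0."""
    return (a + d) / (a + d + t * (n - a - d))


def surface_points(measure, n=100, t=1.0, step=1):
    """Return (a, d, S) triples on the domain (21): a >= 0, d >= 0, a + d <= n."""
    return [
        (a, d, measure(a, d, n, t))
        for a in range(0, n + 1, step)
        for d in range(0, n - a + 1, step)
    ]


jaccard = surface_points(a_family, t=1.0)           # same values as formula (22)
rogers_tanimoto = surface_points(ad_family, t=2.0)  # same values as formula (23)

# The two profiles drawn as black lines in Figure 1:
profile_a50 = [(d, s) for a, d, s in jaccard if a == 50]  # a = 50, all d
profile_d50 = [(a, s) for a, d, s in jaccard if d == 50]  # d = 50, all a
```

The resulting triples can be passed to any 3D plotting tool; only the grid and the domain restriction come from the text.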
Fig. 3. (a)-family of similarity measures in 2 views
Fig. 4. (a+d)-family of similarity measures in 2 views

Figure 5 depicts the surfaces of the Russel & Rao and Faith measures in the same projection as the similarity measures shown in Figures 1(a) and 2(a). The Russel & Rao and Faith measures belong neither to the (a)-family nor to the (a+d)-family of similarity measures, and one can see that they
have shapes quite different from those of the similarity measures from these families shown in Figures 3 and 4. The main problem with these two measures is that they do not satisfy the reflexivity property S(x, x) = 1, which requires a reflexive similarity measure to have the value 1 on the border of the domain where a + d = n and b = c = 0. One can see that the similarity measures from both the (a)-family and the (a+d)-family are reflexive.

Fig. 5. Russel & Rao (on the left side) and Faith (on the right side) similarity measures
Fig. 6. (a+pd)-family of similarity measures: Jaccard (on the left side) and Sokal & Michener (on the right side)
Fig. 7. (a+pd)-family of similarity measures: Dice & Czekanowski & Sorensen (on the left side) and Sokal & Sneath II (on the right side)
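The reflexivity property discussed above can be checked numerically on the border a + d = n. The following is a small sketch with hypothetical function names, using the forms (15), (16), (19), and (20); the point a = 0, d = n, where the (a)-family expression is 0/0, is skipped as undefined, which is an assumption of this sketch rather than a convention stated in the paper.

```python
from fractions import Fraction


def a_family(a, d, n, t=1):
    """(a)-family, formula (15)."""
    return Fraction(a) / (a + Fraction(t) * (n - a - d))


def ad_family(a, d, n, t=1):
    """(a+d)-family, formula (16)."""
    return Fraction(a + d) / (a + d + Fraction(t) * (n - a - d))


def russel_rao(a, d, n):
    """Formula (19): a / n."""
    return Fraction(a, n)


def faith(a, d, n):
    """Formula (20): (a + 0.5 d) / n, written with integers as (2a + d) / (2n)."""
    return Fraction(2 * a + d, 2 * n)


def is_reflexive(measure, n):
    """True if the measure equals 1 at every defined point with a + d = n."""
    for a in range(n + 1):
        d = n - a  # border of the domain: b = c = 0
        try:
            value = measure(a, d, n)
        except ZeroDivisionError:
            continue  # 0/0 at a = 0, d = n for the (a)-family; skipped
        if value != 1:
            return False
    return True


results = {
    "a_family": is_reflexive(a_family, 100),
    "ad_family": is_reflexive(ad_family, 100),
    "russel_rao": is_reflexive(russel_rao, 100),
    "faith": is_reflexive(faith, 100),
}
```

On the border, formula (19) gives a/n and formula (20) gives (a + 0.5(n - a))/n, so both equal 1 only at a = n, which is why the last two checks fail.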
Fig. 8. (a+pd)-family of similarity measures: Sokal & Sneath I (on the left side) and Rogers & Tanimoto (on the right side)

4 New Parametric Family of Similarity Measures

As one can see from Figures 3 and 4, the shapes of the similarity measures from the (a)-family and the (a+d)-family are quite different. The similarity measures S(x, y) from the (a)-family are based on the positive matches of binary attributes in x and y. The similarity measures from the (a+d)-family are based on both positive and negative matches. Arguments for and against these two types of similarity measures can be found, for example, in [5, 10, 17, 19]. We propose a new parametric family of binary similarity measures that formally generalizes both families and makes it possible to build similarity measures intermediate between them. Below are two equivalent forms of the new parametric family of measures, called the (a+pd)-family:

$$S^{(a+pd)}(x, y) = \frac{a+pd}{a+pd+t(b+c)}, \tag{24}$$

$$S^{(a+pd)}(x, y) = \frac{a+pd}{a+pd+t(n-a-d)}, \tag{25}$$

where t is a positive real number and p is a number from the interval [0, 1]. When p = 0 we obtain the (a)-family of similarity measures, and when p = 1 we obtain the (a+d)-family. By changing the parameter p between 0 and 1 one can move a similarity measure from the (a)-family to the (a+d)-family. Generally, the parameters p and t can be tuned in some procedure for selecting a suitable similarity measure for a specific application. The selected value of p can reflect the trade-off between, or the relative importance of, positive and negative matches in the constructed similarity measure.

Figures 6, 7, and 8 show the shapes of binary similarity measures from the (a+pd)-family as the parameter p changes from 0 (on the left) to 1 (on the right), so that on the left we have similarity measures from the (a)-family and on the right measures from the (a+d)-family. The parameter t has the values 1, 0.5, and 2 in Figures 6, 7, and 8, respectively. In Figure 6
the similarity measures change from Jaccard (on the left) to Sokal & Michener (on the right). In Figure 7 the similarity measures change from Dice & Czekanowski & Sorensen (on the left) to Sokal & Sneath II (on the right). In Figure 8 the similarity measures change from Sokal & Sneath I (on the left) to Rogers & Tanimoto (on the right).

5 Discussion and Conclusion

The paper proposes methods for visualizing popular similarity measures for binary data and 2 x 2 contingency tables. Such visualization
helps to understand the relationships between these measures and can explain why these similarity measures are joined in clusters of similar measures obtained in different works where clustering of these measures is applied [6, 12]. A new parametric family of similarity measures is proposed. This family generalizes the two known parametric families of similarity measures and makes it possible to construct similarity measures intermediate between these two families. Such an intermediate position can reflect the trade-off between, or the relative importance of, positive and negative matches in the construction of similarity measures from the new parametric class. The proposed methodology of visualization of binary similarity measures can be extended to other binary similarity and association measures considered in the literature.

Acknowledgements

The work is partially supported by the projects SIP 20162204 of IPN, 240844 of CONACYT, Mexico, by RFBR project 15-29-01173, and by the Russian Government Program of competitive growth of Kazan Federal University.

References

1. Agresti, A. (2002). Categorical data analysis. Wiley and Sons.
2. Batagelj, V. & Bren, M. (1995). Comparing resemblance measures. Journal of Classification, Vol. 12, No. 1, pp. 73-90. DOI: 10.1007/BF01202268.
3. Baulieu, F.B. (1989). A classification of presence/absence based dissimilarity coefficients. Journal of Classification, Vol. 6, No. 1, pp. 233-246. DOI: 10.1007/BF01908601.
4. Choi, S.S., Cha, S.H., & Charles, C.T. (2010). A survey of binary similarity and distance measures. Journal of Systemics, Cybernetics and Informatics, pp. 43-48.
5. Clifford, H.T. & Stephenson, W. (1975). An introduction to numerical classification (Vol. 229). New York: Academic Press.
6. Duarte, J.M., Santos, J.B.D., & Melo, L.C. (1999). Comparison of similarity coefficients based on RAPD markers in the common bean. Genetics and Molecular Biology, Vol. 22, No. 3, pp. 427-432. DOI: 10.1590/S1415-47571999000300024.
7. Goodman, L.A. & Kruskal, W.H. (1954).
Measures of association for cross classifications. Journal of the American Statistical Association, Vol. 49, pp. 732-764. DOI: 10.1007/978-1-4612-9995-0_1.
8. Gower, J.C. (1971). A general coefficient of similarity and some of its properties. Biometrics, pp. 857-871.
9. Gower, J.C. & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, Vol. 3, No. 1, pp. 5-48. DOI: 10.1007/BF01896809.
10. Legendre, P. & Legendre, L.F. (1998). Numerical ecology, 2nd English Ed., Elsevier.
11. Lesot, M-J., Rifqi, M., & Benhadda, H. (2009). Similarity measures for binary and numerical data: a survey. Int. J. Knowledge Engineering and Soft Data Paradigms, Vol. 1, No. 1, pp. 63-84. DOI: 10.1504/IJKESDP.2009.021985.
12. Meyer, A.D.S., Garcia, A.A.F., Souza, A.P.D., & Souza Jr, C.L.D. (2004). Comparison of similarity coefficients used for cluster analysis with dominant markers in maize (Zea mays L). Genetics and Molecular Biology, Vol. 27, No. 1, pp. 83-91. DOI: 10.1590/S1415-47572004000100014.
13. Poria, S., Cambria, E., & Gelbukh, A. (2015). Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. Proceedings of EMNLP, pp. 2539-2544.
14. Poria, S., Gelbukh, A., Hussain, A., Howard, N., Das, D., & Bandyopadhyay, S. (2013). Enhanced SenticNet with affective labels for concept-based opinion mining. IEEE Intelligent Systems, Vol. 28, No. 2, pp. 31-38.
15. Rodríguez-Salazar, M.E., Álvarez-Hernández, S., & Bravo-Núñez, E. (2001). Coeficientes de asociación. Plaza y Valdés Editores, México.
16. Sidorov, G., Gelbukh, A., Gómez-Adorno, H., & Pinto, D. (2014). Soft similarity and soft cosine measure: Similarity of features in vector space model. Computación y Sistemas, Vol. 18, No. 3, pp. 491-504. DOI: 10.13053/CyS-18-3-2043.
17. Sokal, R.R. & Sneath, P.H.A. (1963). Principles of Numerical Taxonomy. WH Freeman.
18. Tan, P.N., Kumar, V., & Srivastava, J. (2002). Selecting the right interestingness measure for association patterns.
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 32-41. DOI: 10.1145/775047.775053.
19. Tversky, A. (1977). Features of similarity. Psychological Review, Vol. 84, No. 4, pp. 327-352. DOI: 10.1037/0033-295X.84.4.327.
20. Warrens, M.J. (2013). A comparison of multi-way similarity coefficients for binary sequences. International Journal of Research and Reviews in Applied Sciences, Vol. 16, No. 1, p. 12.

Ildar Z. Batyrshin graduated from the Moscow Physical-Technical Institute. He received his PhD from the Moscow Power Engineering Institute and his Dr. Sci. (habilitation) degree from the Highest Attestation Committee of Russia. He is currently Titular Professor C of CIC IPN. He has served as Co-Chair of 10 International Conferences on Soft Computing, Artificial Intelligence and Computational Intelligence. He is an author and editor of 20 books and special volumes of journals and an author of more than 200 papers in journals and conference proceedings.

Nailya Kubysheva received her PhD from Lobachevsky State University of Nizhny Novgorod, Russia, and her Dr. Sci. (habilitation) degree from the Highest Attestation Committee of Russia. She is currently with IPN, Mexico, as a Postdoctoral Researcher.

Valery Solovyev did his research with the Higher School of Information Technologies and Information Systems, University of Kazan, Russia. He graduated with a Doctor of Science in Computer Science degree at the Russian Academy of Sciences in 1995. He is currently working as Senior Researcher and head of the Laboratory of Medical Informatics. His research interests include data mining, computational linguistics, and cognitive science. He is the President of The Association of Cognitive Science (Russia).

Luis A. Villa-Vargas received his Ph.D. in computer science from the Polytechnic University of Catalonia in 1999. From December 1999 to February 2001 he was with the Laboratory for Computer Science as a Postdoctoral Fellow at the Massachusetts Institute of Technology. From October 2001 to January 2007 he was with the Mexican Petroleum Institute.
Since January 2007 he has been with the Center for Computer Research at The National Polytechnic Institute in Mexico, where he was director from 2010 to 2016.

Article received on 15/07/2016; accepted 10/09/2016. Corresponding author is Ildar Z. Batyrshin.