Advanced Image Tracking Approach for Augmented Reality Applications

Ievgen M. Gorovyi and Dmytro S. Sharapov
It-Jim
1 Konstitucii Sq., Kharkiv 6145, Ukraine
ceo@it-jim.com, sharapov@it-jim.com

Abstract: Augmented reality is a popular and rapidly growing field. It is successfully used in medicine, education, engineering and entertainment. In the paper, basic principles of a typical augmented reality system are described. An efficient hybrid visual tracking algorithm is proposed. The approach is based on combining the optical flow technique with direct tracking methods. It is demonstrated that the developed technique makes it possible to achieve stable and precise results. Comparative experimental results are included.

Keywords: augmented reality, marker, visual tracking, local features, optical flow, direct tracking.

I. INTRODUCTION

Augmented reality (AR) is an extremely popular and fast growing field [1]-[4]. Its basic idea is the coexistence of real and synthetic (computer generated) objects. Its practical usefulness is that AR makes some things more intuitive for the user [5]. As a result, AR is applied in such fields as medicine [6], education [5], entertainment and many more. Nowadays people use mobile phones and tablets daily. This has created a strong interest in AR applications running directly on the device [3]-[4]. As a result, new software and hardware for AR applications are constantly being developed [3]-[4].

It is known that the efficiency of AR is strictly related to the robustness and precision of the applied computer vision algorithms [1]-[2]. From an algorithmic point of view, AR requires precise camera pose estimation (6 degrees of freedom). Availability of this information allows additional content, such as a 3D model, image, video, text data or audio, to be augmented. Camera pose estimation is accomplished by means of computer vision algorithms applied for object recognition. A typical AR object is called a marker. It can be considered as a predefined image with known properties. Its localization within the camera frame gives full information about the camera position and orientation.
There are two general types of AR systems: marker-based and markerless [1]-[2]. The former relies on binary markers placed in the scene, which can be easily tracked [7]. However, markerless systems are more popular since natural image features can be used for detection and tracking [1]-[4]. Typically, planar objects are used for this purpose. Since real scenes can be challenging due to occlusions, varying geometry and illumination changes, there is active research in AR tracking algorithms. In addition, the goal is real-time performance, which can be difficult to reach, especially on mobile devices.

An initial step before image tracking is detection. The output of this procedure is the marker location within the camera frame (marker corners). For binary markers, contour analysis is typically applied [1]-[2], [7], while for image markers a common option is to use keypoint descriptors [8]-[9]. A good method for real-time performance is oriented FAST and rotated BRIEF (ORB) [10].

As for image tracking, a common way is to use the optical flow (OF) algorithm [11]-[12]. The approach estimates the displacement of pixels of interest (keypoints) between adjacent camera frames. In order to make the image tracking more robust, so-called direct tracking methods are often applied [13]-[15]. The key idea here is iterative estimation of the transformation between the template and test images using the whole image patch. The efficient second-order minimization (ESM) algorithm demonstrated good performance and faster convergence with respect to the Gauss-Newton scheme [13]. However, the drawback of ESM is that it requires a large number of iterations in the case of fast camera motion.

In the paper, we describe a hybrid algorithm based on the OF and ESM methods. It is demonstrated that such a combination makes the tracking more robust and fast. As a result, the camera pose is estimated precisely, giving a realistic effect of augmented content. In Section II, the main principles of an AR system are described, in particular the geometry, marker types and principles of data augmentation. Section III describes the developed hybrid algorithm and discusses the experimental results.
978-1-5090-6754-1/17/$31.00 ©2017 IEEE

II. AUGMENTED REALITY PRINCIPLES

A. Projective Geometry

The key of every AR application is estimation of the camera pose. Let us consider the mathematical background of the problem. It is known that the 3D and 2D worlds can be related using the projection equation [16]

$$ s\, m_i = P\, M_i, \tag{1} $$
where $s$ is a scale factor, $m_i$ and $M_i$ are the projected and world point coordinates respectively, and $P$ is the projection matrix

$$ P = K\,[R \mid t] = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}. \tag{2} $$

Here $K$ is the intrinsic camera matrix and $[R \mid t]$ is the extrinsic matrix describing the orientation and translation of the camera. The former does not depend on the scene and determines the camera parameters, namely, $(c_x, c_y)$ is the principal point and $f_x$, $f_y$ are the focal lengths expressed in pixel units. The latter represents a Euclidean transformation from the world coordinate system to the camera coordinate system [16]. Fig. 1 illustrates the principle of point projection onto the image plane.

Fig. 1. Projective geometry.

One can see the planar object. An arbitrary point $M_i(X, Y, Z)$ is projected onto the focal plane, resulting in the projection $m_i(u, v)$. Thus, in the case of precise estimation of the matrices $K$ and $[R \mid t]$, additional content can be augmented on the camera frame in a realistic way. One can show that it is enough to know exact 2D-3D correspondences for four pixels for a full camera pose reconstruction [1]-[2], [16].

B. Marker Detection and Data Augmentation

Let us consider how an AR algorithm can be built. Firstly, one should emphasize that the camera matrix $K$ is estimated once during the camera calibration procedure [1], [12], [16]. The situation is more challenging for the estimation of the extrinsic camera parameters. It is known that the relation between two arbitrary projections is described via the homography matrix [1], [16]

$$ M'(x', y', z') = H\, M(x, y, z), \qquad H = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{bmatrix}, \tag{3} $$

so that, for a point given in pixel coordinates ($z = 1$), the projected pixel coordinates are

$$ x_p = \frac{x'}{z'} = \frac{h_{11} x + h_{12} y + h_{13}}{h_{31} x + h_{32} y + h_{33}}, \qquad y_p = \frac{y'}{z'} = \frac{h_{21} x + h_{22} y + h_{23}}{h_{31} x + h_{32} y + h_{33}}. $$

Typically, $h_{33} = 1$ and, hence, the homography is estimated up to a scale factor. As a result, 8 unknown matrix elements should be found from a set of point correspondences (at least 4 points). To perform a consistent estimation, a large number of pixel pairs can be analyzed simultaneously using the random sample consensus (RANSAC) method [1], [12].

Let us consider how different markers can be localized in the camera frame. Fig. 2 illustrates how a typical binary (fiducial) marker can be recognized.

Fig. 2. Fiducial marker recognition.
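As an illustration of relation (3), the following minimal Python sketch estimates a homography with $h_{33} = 1$ from exactly four point correspondences by solving the resulting 8x8 linear system. It is only a sketch and not the implementation of the described system: the function names (`solve_linear`, `homography_from_points`, `apply_homography`) are hypothetical, and a practical pipeline would use many correspondences together with RANSAC rather than a minimal set.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_points(src, dst):
    """Estimate H (with h33 fixed to 1) from four (x, y) -> (u, v) pairs."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence gives two linear equations in the 8 unknowns
        # h11, h12, h13, h21, h22, h23, h31, h32.
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(H, point):
    """Map a pixel (x, y) through H and divide by the third coordinate."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return u, v
```

Fixing $h_{33} = 1$ leaves exactly the 8 unknowns mentioned in the text, which is why four correspondences (two equations each) are the minimum.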
For a binary marker (Fig. 2), contours are first extracted from the input camera frame after thresholding [1]. Secondly, quadrilateral shaped contours are found. Finally, binary code matching is accomplished for the marker candidates, finalizing the marker recognition procedure.

Apparently, putting binary markers into the scene is not practical. Therefore, modern AR systems use much more convenient image markers. In this case, marker detection in the camera frame is performed using local feature descriptors. We used the ORB descriptor in our AR system. Fig. 3 illustrates the principle of planar image marker detection.

Fig. 3. Image marker detection example.

Firstly, a predefined number of keypoints is detected [1], [10]. Secondly, ORB descriptors are calculated for each keypoint. Finally, matching using the Hamming distance metric is accomplished [1]. One should emphasize that the descriptor matching procedure is very important for the homography estimation [1]. The determined point-to-point correspondences are used for updating the camera pose. Fig. 4 illustrates an example of data augmentation using the estimated camera pose.
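The descriptor matching step can be illustrated with a small Python sketch. It assumes that each binary descriptor is stored as a Python integer (a real ORB descriptor has 256 bits; the toy values used with it may be shorter) and uses brute-force nearest-neighbour search with a cross-check and a distance threshold. The function names and the `max_distance` parameter are hypothetical and chosen for the example only.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def nearest(descriptor, pool):
    """Index of the descriptor in `pool` closest to `descriptor`."""
    return min(range(len(pool)), key=lambda i: hamming(descriptor, pool[i]))


def match_descriptors(query, train, max_distance=64):
    """Brute-force matching with cross-check.

    Returns a list of (query_index, train_index, distance) tuples. A pair
    is kept only if each descriptor is the nearest neighbour of the other
    and their Hamming distance does not exceed `max_distance`.
    """
    matches = []
    for qi, qd in enumerate(query):
        ti = nearest(qd, train)
        distance = hamming(qd, train[ti])
        if nearest(train[ti], query) == qi and distance <= max_distance:
            matches.append((qi, ti, distance))
    return matches
```

The cross-check discards ambiguous pairs, which matters because wrong correspondences directly degrade the homography estimated from them.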
Fig. 4. Example of augmentation (left: binary marker with 3D model, right: image marker with augmented image).

One can see an example of a binary marker and an augmented 3D model of a butterfly (left). An example of image augmentation on the image marker is illustrated on the right side.

In general, marker detection algorithms can be applied consecutively to camera frames after initialization. However, the capabilities of local feature descriptors are limited due to varying scale and viewing angles [1], [10]. In addition, faster and more efficient tracking techniques can be used. The next section describes the developed tracking algorithm, which is applied on a frame-by-frame basis.

III. EXPERIMENTAL ANALYSIS

This section contains a description of the key steps of the tracking algorithm and illustrates the experimental results.

A. Hybrid Tracking Algorithm Description

In order to keep the augmentation stable and realistic, local image features should be properly tracked and kept within the camera frames. We propose to combine the OF algorithm with the ESM procedure. The OF estimates the flow vector for a given pixel of interest using the local spatial window around the analyzed pixel

$$ \begin{bmatrix} u \\ v \end{bmatrix} = \begin{bmatrix} \sum I_x^2 & \sum I_x I_y \\ \sum I_x I_y & \sum I_y^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum I_x I_t \\ -\sum I_y I_t \end{bmatrix}, \tag{4} $$

where $I_x$, $I_y$ are the horizontal and vertical image gradients, $I_t$ is the time gradient, i.e. the difference between the frames, and $(u, v)$ are the OF vector components. Summation in (4) is performed over a local window around the pixel of interest. The challenge is that OF often fails when the number of keypoints is insufficient or the camera movement is too fast.

In order to make image tracking more robust, we propose to refine the estimation using the ESM algorithm. The key difference between local feature based and direct tracking is that for ESM the transformation of the whole image patch is estimated. The transformation parameters are evaluated by minimizing a particular metric, for instance the sum of squared differences [13]

$$ SSD = \sum_{x, y} \left( F^{\mathrm{warped}}(x, y) - R(x, y) \right)^2. \tag{5} $$

Here $R$ is the reference image (marker) and $F^{\mathrm{warped}}$ is the camera frame $F$ warped according to the homography $H_{F-1}$ of the previous frame

$$ F^{\mathrm{warped}} = F \cdot H_{F-1}^{-1}. \tag{6} $$

Thus, the perspective of the previous camera frame can be used as the initial estimate. Let us consider the Taylor expansion

$$ R \approx F^{\mathrm{warped}} + J\,\Delta H + \tfrac{1}{2}\,\mathrm{Hes}\,\Delta H, \tag{7} $$

where $J$ is the Jacobian, $\mathrm{Hes}$ is the Hessian term (taken here as already contracted with $\Delta H$, so that it has the same shape as $J$), and $\Delta H$ is an unknown warping matrix (homography multiplier). The idea of ESM is based on an approximation of the Hessian component [13], which provides faster convergence than the conventional Gauss-Newton scheme

$$ \Delta H = -2\,(J + J_I)^{+}\left( F^{\mathrm{warped}} - R \right), \tag{8a} $$

$$ J_I - J \approx \mathrm{Hes}, \tag{8b} $$

where $(J + J_I)^{+}$ is the pseudoinverse of the extended Jacobian matrix and $J_I$ is the Jacobian of the identity warping [13]-[14]. Substituting (8b) into (7) gives $R - F^{\mathrm{warped}} \approx \tfrac{1}{2}(J + J_I)\,\Delta H$, from which (8a) follows. As a result, this scheme allows the warping parameter $\Delta H$ to be updated iteratively.

We have integrated the proposed two-step tracking algorithm into a mobile AR system. Fig. 5 illustrates a high-level scheme of the developed method. The initialization (marker detection) is accomplished using the ORB descriptor. For marker tracking, the OF algorithm is first applied to consecutive camera frames. The estimated pixel correspondences are used to calculate the projective transformation. Secondly, the warped camera frame is used as input for the iterative ESM algorithm, which provides additional refinement. Finally, the camera pose is reconstructed [1]-[2]. The next subsection contains an experimental analysis of the developed AR tracking algorithm.

B. Experimental Results

Let us analyze the performance of the method quantitatively. For this purpose, we have created a benchmark image sequence with known ground truth data (precise camera pose information). The first important aspect is the analysis of tracking capability and ESM convergence speed. Fig. 6 illustrates the number of required iterations for each frame. One can observe that OF substantially reduces the number of ESM iterations. Also, with ESM only, the tracking algorithm fails in the middle of the image sequence. In contrast, the consecutive application of OF and ESM makes it possible to track the image marker and keep a moderate number of ESM iterations for full convergence.
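The effect of a good initial estimate on the number of ESM iterations can be reproduced on a toy problem. The Python sketch below is an assumption-laden simplification, not the benchmark or the implementation used in the experiments: the "image" is a synthetic 1-D Gaussian pattern, the warp is a pure translation `p`, and the coarse estimate that OF would provide is simply passed in as `p0`. The update uses the average of the Jacobians of the warped frame and of the reference, in the spirit of (8a). All names and constants are hypothetical.

```python
import math

SIGMA = 8.0         # width of the synthetic pattern (assumed)
CENTER = 80.0       # pattern position in the reference signal (assumed)
TRUE_SHIFT = 30.0   # ground-truth displacement of the frame (assumed)
XS = range(200)     # sample positions


def g(x):
    """Gaussian bump used as a synthetic texture."""
    return math.exp(-x * x / (2.0 * SIGMA ** 2))


def dg(x):
    """Derivative of the Gaussian bump."""
    return -x / SIGMA ** 2 * g(x)


def reference(x):
    return g(x - CENTER)


def d_reference(x):
    return dg(x - CENTER)


def frame(x):
    """Camera 'frame': the reference pattern shifted by TRUE_SHIFT."""
    return g(x - CENTER - TRUE_SHIFT)


def d_frame(x):
    return dg(x - CENTER - TRUE_SHIFT)


def esm_align(p0, tol=1e-6, max_iter=50):
    """Estimate the shift p such that frame(x + p) matches reference(x).

    Returns (estimated_shift, iterations_used).
    """
    p = p0
    for iteration in range(1, max_iter + 1):
        num = 0.0
        den = 0.0
        for x in XS:
            residual = frame(x + p) - reference(x)
            # ESM Jacobian: mean of the current and the reference Jacobians.
            j = 0.5 * (d_frame(x + p) + d_reference(x))
            num += j * residual
            den += j * j
        step = -num / den
        p += step
        if abs(step) < tol:
            return p, iteration
    return p, max_iter


# ESM alone starts from zero shift; the "hybrid" run starts from a coarse
# estimate close to the true shift, as OF would provide in the full method.
shift_esm_only, iters_esm_only = esm_align(0.0)
shift_hybrid, iters_hybrid = esm_align(29.0)
```

In this toy setting both runs reach the same shift, but the run that starts from the coarse estimate needs fewer iterations, which mirrors the behaviour reported for Fig. 6.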
Fig. 5. Block scheme of the developed AR system.

Fig. 6. ESM iterations.

Another practical example is camera pose reconstruction. Fig. 7 illustrates the true camera pose and two reconstructed trajectories. The start and destination points are marked by arrows (Fig. 7). One can observe that the combined marker tracking algorithm provides almost total coincidence with the ground truth. The maximum deviation in this case was not higher than several millimeters (green circles). When only OF is used for image tracking, the accuracy is worse. The camera pose deviation accumulated through the frame sequence, resulting in a total error of around 1 centimeters (red circles). The test image marker was located at the origin (0, 0, 0).

Fig. 7. Estimated camera pose.

The developed tracking algorithm was integrated into a software development kit (SDK) intended for mobile devices (Android and iOS). Currently we are working on algorithm improvements and optimization.

IV. CONCLUSION

In the paper, a robust tracking algorithm for AR applications was proposed. Comparative analysis of the OFLK and ESM approaches indicated the potential of combining them. As a result, a hybrid tracking method was developed. We have demonstrated that the consecutive application of the two methods makes it possible to achieve good results in terms of accuracy and speed. This is crucial for mobile devices and tablets with limited hardware capabilities. In the near future we are planning to fully adapt the developed algorithms for mobile devices.

REFERENCES

[1] D. Baggio, S. Emami, D. Escriva, Mastering OpenCV with Practical Computer Vision Projects. Packt Publishing, 2012.
[2] S. Siltanen, Theory and applications of marker based augmented reality. VTT Science 3, Espoo, 2012.
[3] https://www.wikitude.com/products/wikitude-sdk/
[4] https://www.vuforia.com/
[5] P. Chen, X. Liu, W. Cheng, R. Huang. A review of using Augmented Reality in Education from 2011 to 2016. Innovations in Smart Learning. Part of the series Lecture Notes in Educational Technology, 2016, pp. 13-18.
[6] http://medicalfuturist.com/
[7] S. Garrido-Jurado, R. Muñoz-Salinas, F.J. Madrid-Cuevas, M.J. Marín-Jiménez. Automatic generation and detection of highly reliable fiducial markers under occlusion. Pattern Recognition, Vol. 47, Issue 6, June 2014, pp. 2280-2292.
[8] H. Bay, A. Ess, T. Tuytelaars and L. Van Gool. SURF: Speeded Up Robust Features. Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, 2008, pp. 346-359.
[9] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, Vol. 60, No. 2, 2004, pp. 91-110.
[10] E. Rublee, V. Rabaud, K. Konolige and G. R. Bradski. ORB: An efficient alternative to SIFT or SURF. ICCV, 2011, pp. 2564-2571.
[11] J. Shi and C. Tomasi. Good features to track. In: IEEE Conf. on Computer Vision and Pattern Recognition, 1994, pp. 593-600.
[12] G. Bradski and A. Kaehler, Learning OpenCV. O'Reilly Media Inc., 2008.
[13] S. Benhimane and E. Malis. Real-time image-based tracking of planes using efficient second-order minimization. IEEE/RSJ International Conference on Intelligent Robots and Systems, Vol. 1, 2004, pp. 943-948.
[14] S. Benhimane and E. Malis. Homography-based 2D visual tracking and servoing. Int. J. Robot. Res., No. 26, 2007, pp. 661-676.
[15] S. Lieberknecht, S. Benhimane, P. Meier, N. Navab. Benchmarking template-based tracking algorithms. Virtual Reality, 15(2-3):99-108, June 2011.
[16] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, second edition, 2004.