25 Faster Nearest Neighbors: Voronoi Diagrams and k-d Trees
Jonathan Richard Shewchuk

SPEEDING UP NEAREST NEIGHBOR CLASSIFIERS

Can we preprocess training pts to obtain sublinear query time?

2 to 5 dimensions: Voronoi diagrams
Medium dim (up to 30): k-d trees
Large dim: locality sensitive hashing [still researchy, not widely adopted]
Largest dim: exhaustive k-NN, but can use PCA or random projection [or another dimensionality reduction method]

Voronoi Diagrams

Let P be a point set. The Voronoi cell of $w \in P$ is
$\mathrm{Vor}\, w = \{ p \in \mathbb{R}^d : |pw| \le |pv| \;\; \forall v \in P \}$
[A Voronoi cell is always a convex polyhedron or polytope.]
The Voronoi diagram of P is the set of P's Voronoi cells.

[Figures: voro.pdf, vormcdonalds.jpg, voronoigregoreichinger.jpg, saltflat-1.jpg]
[Voronoi diagrams sometimes arise in nature (salt flats, giraffe, crystallography).]
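The definition of a Voronoi cell above translates directly into a membership test. The following is a minimal sketch, assuming Euclidean distance and a small point set held in a Python list; the function names `in_voronoi_cell` and `cells_containing` are hypothetical and chosen only for illustration. It checks the definition by brute force, so it takes linear time per cell and says nothing about how a diagram is actually constructed.

```python
import math

def in_voronoi_cell(p, w, points):
    """True if p lies in Vor w, i.e. |pw| <= |pv| for every v in points."""
    d_w = math.dist(p, w)
    return all(d_w <= math.dist(p, v) for v in points)

def cells_containing(p, points):
    """Indices of every site whose (closed) Voronoi cell contains p."""
    return [i for i, w in enumerate(points) if in_voronoi_cell(p, w, points)]

# Illustrative data (an assumption, not from the lecture).
P = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(cells_containing((1.0, 1.0), P))  # interior of one cell
print(cells_containing((2.0, 0.0), P))  # on the boundary shared by two cells
```

Because the cells are closed sets, a point on a shared boundary belongs to more than one cell, which is why the second query reports two indices.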
[Figures: giraffe-1.jpg, perovskite.jpg, vortex.pdf]

[Believe it or not, the first published Voronoi diagram dates back to 1644, in the book "Principia Philosophiae" by the famous mathematician and philosopher René Descartes. He claimed that the solar system consists of vortices. In each region, matter is revolving around one of the fixed stars (vortex.pdf). His physics was wrong, but his idea of dividing space into polyhedral regions has survived.]

Size (e.g. # of vertices) $\in O(n^{\lceil d/2 \rceil})$
[This upper bound is tight when d is a small constant. As d grows, the tightest asymptotic upper bound is somewhat smaller than this, but the complexity still grows exponentially with d.]
... but often in practice it is O(n).
[Here I'm leaving out a constant that may grow exponentially with d.]

Point location: Given query point q, find the point $w \in P$ for which $q \in \mathrm{Vor}\, w$.

2D: O(n log n) time to compute V.d. and a trapezoidal map for pt location
    O(log n) query time [because of the trapezoidal map]
    [That's a pretty great running time compared to the linear query time of exhaustive search.]
dD: Use binary space partition tree (BSP tree) for pt location
    [Unfortunately, it's difficult to characterize the running time of this strategy, although it is likely to be reasonably fast in 3 to 5 dimensions.]

1-NN only!
[A standard Voronoi diagram supports only 1-nearest neighbor queries. If you want the k nearest neighbors, there is something called an order-k Voronoi diagram that has a cell for each possible k nearest neighbors. But nobody uses those, for two reasons. First, the size of an order-k Voronoi diagram is $O(k^2 n)$ in 2D, and worse in higher dimensions. Second, there's no software available to compute one.]
[There are also Voronoi diagrams for other distance metrics, like the $L_1$ and $L_\infty$ norms.]
[Voronoi diagrams are good for 1-nearest neighbor in 2 or 3 dimensions, maybe 4 or 5, but for anything beyond that, k-d trees are much simpler and probably faster.]
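To make the "preprocess once, then answer point-location queries in O(log n) time" idea concrete, here is a minimal sketch for the one-dimensional case. The 1D setting is an assumption chosen purely for illustration (the notes discuss 2D and higher): in 1D each Voronoi cell is an interval whose endpoints are the midpoints between consecutive sorted sites, so sorting stands in for building the diagram and binary search stands in for the trapezoidal map. The names `build_1d_voronoi` and `locate` are hypothetical.

```python
import bisect

def build_1d_voronoi(points):
    """O(n log n) preprocessing: sort the sites and compute cell boundaries."""
    sites = sorted(set(points))
    # The boundary between two adjacent cells is the midpoint of their sites.
    boundaries = [(a + b) / 2 for a, b in zip(sites, sites[1:])]
    return sites, boundaries

def locate(sites, boundaries, q):
    """O(log n) point location: return the site whose cell contains q."""
    return sites[bisect.bisect_left(boundaries, q)]

# Illustrative data (an assumption, not from the lecture).
sites, boundaries = build_1d_voronoi([10, 1, 4])
for q in (2, 3, 8):
    print(q, "->", locate(sites, boundaries, q))
```

A query that falls exactly on a boundary is equidistant from two sites; this sketch returns the smaller one, and either answer is a valid nearest neighbor.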
k-d Trees

"Decision trees" for NN search. Differences: [compared to decision trees]

- Choose splitting feature w/greatest width: feature i in $\max_{i,j,k} (X_{ji} - X_{ki})$.
  [With nearest neighbor search, we don't care about the entropy. Instead, what we want is that if we draw a sphere around the query point, it won't intersect very many boxes of the decision tree. So it helps if the boxes are nearly cubical, rather than long and thin.]
  Cheap alternative: rotate through the features. [We split on the first feature at depth 1, the second feature at depth 2, and so on. This builds the tree faster, by a factor of O(d).]
- Choose splitting value: median point for feature i, or $(X_{ji} + X_{ki})/2$.
  Median guarantees $\lfloor \log_2 n \rfloor$ tree depth; O(nd log n) tree-building time.
  [... or just O(n log n) time if you rotate through the features. By contrast, splitting near the center does more to improve the aspect ratios of the boxes, but it could unbalance your tree. You can alternate between medians at odd depths and centers at even depths, which also guarantees an O(log n) depth.]
- Each internal node stores a sample point. [... that lies in the node's box. Usually the splitting point.]
  [Some k-d tree implementations have points only at the leaves, but it's usually better to have points in internal nodes too, so when we search the tree, we might stop searching earlier.]

[Draw this by hand. kdtreestructure.pdf] [Figure labels: root represents $\mathbb{R}^2$; right halfplane; lower right quarter plane.]

Goal: given query pt q, find a sample pt w such that $|qw| \le (1 + \epsilon)\,|qs|$, where s is the closest sample pt.
$\epsilon = 0 \Rightarrow$ exact NN; $\epsilon > 0 \Rightarrow$ approximate NN.

The alg. maintains:
- Nearest neighbor found so far (or k nearest). goes down ↓
- Binary heap of unexplored subtrees, keyed by distance from q. goes up ↑

[Draw this by hand. kdtreequery.pdf] [A query in progress. Figure labels: q; nearest so far.]

[Each subtree represents an axis-aligned box. The query tries to avoid searching most of the subtrees by searching the boxes close to q first. We measure the distance from q to a box and use it as a key for the subtree in the heap. The search stops when the distance from q to the kth-nearest neighbor found so far is at most $(1 + \epsilon)$ times the distance from q to the nearest unexplored box. For example, in the figure above, the query never visits the box at far upper left or the box at far lower right, because those boxes don't intersect the circle.]
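The construction rules for k-d trees described above (split on the feature of greatest width, at the median point, and store that point in the node) can be sketched as follows. This is a minimal illustration rather than the lecture's reference implementation: the names `Node`, `build`, and `depth` are hypothetical, and the median is found by sorting for brevity, which costs more per level than the linear-time selection needed to match the O(nd log n) bound.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    point: Point                      # sample point stored here (the splitting point)
    feature: int                      # index of the splitting feature
    left: Optional["Node"] = None     # points with value <= point[feature]
    right: Optional["Node"] = None    # points with value >= point[feature]

def build(points: List[Point]) -> Optional[Node]:
    """Build a k-d tree: widest feature, median split, one point per node."""
    if not points:
        return None

    def width(i: int) -> float:
        values = [p[i] for p in points]
        return max(values) - min(values)

    feature = max(range(len(points[0])), key=width)   # greatest width
    pts = sorted(points, key=lambda p: p[feature])    # sort to find the median
    mid = len(pts) // 2
    return Node(pts[mid], feature, build(pts[:mid]), build(pts[mid + 1:]))

def depth(node: Optional[Node]) -> int:
    """Edges on the longest root-to-leaf path (-1 for an empty tree)."""
    return -1 if node is None else 1 + max(depth(node.left), depth(node.right))

# Illustrative data (an assumption, not from the lecture).
root = build([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])
print(root.point, root.feature, depth(root))
```

Because each node keeps the median point and passes at most half of the remaining points to either child, the depth of the tree built from these six points is 2, matching $\lfloor \log_2 6 \rfloor$.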
    Q ← heap containing root node with key zero
    r ← ∞
    while Q not empty and minkey(Q) < r / (1 + ε)
        B ← removemin(Q)
        w ← B's sample point
        r ← min{r, |qw|}        [Optimization: store square of r instead.]
        B′, B″ ← child boxes of B
        if dist(q, B′) < r / (1 + ε) then insert(Q, B′, dist(q, B′))        [The key for B′ is dist(q, B′)]
        if dist(q, B″) < r / (1 + ε) then insert(Q, B″, dist(q, B″))
    return point that determined r

For k-NN, replace "r" with a max-heap holding the k nearest neighbors [... just like in the exhaustive search algorithm I discussed last lecture.]

Works with any $L_p$ norm for $p \in [1, \infty]$. [k-d trees are not limited to the Euclidean ($L_2$) norm.]

Why $\epsilon$-approximate NN?

[Draw this by hand. kdtreeproblem.pdf] [A worst-case exact NN query. Figure label: q.]

[In the worst case, we may have to visit every node in the k-d tree to find the nearest neighbor. In that case, the k-d tree is slower than simple exhaustive search. This is an example where an approximate nearest neighbor search can be much faster. In practice, settling for an approximate nearest neighbor sometimes improves the speed by a factor of 10 or even 100, because you don't need to look at most of the tree to do a query. This is especially true in high dimensions. Remember that in high-dimensional space, the nearest point often isn't much closer than a lot of other points.]

Software:
ANN (David Mount & Sunil Arya, U. Maryland)
FLANN (Marius Muja & David Lowe, U. British Columbia)
GeRaF (Georgios Samaras, U. Athens) [random forests!]
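As a rough companion to the query pseudocode above, here is a minimal Python sketch of the heap-driven $(1 + \epsilon)$-approximate 1-NN search. Several things here are assumptions made for illustration: the names `Node`, `build`, `box_dist`, and `nearest` are hypothetical, the tree is built by rotating through the features with median splits, the Euclidean norm is used, and each heap entry carries its box as explicit lower and upper corner tuples. It is a sketch of the technique, not the code used by ANN, FLANN, or GeRaF.

```python
import heapq
import itertools
import math
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Node:
    point: Tuple[float, ...]          # sample point stored at this node
    feature: int                      # splitting feature
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build(points, depth=0):
    """Median split, rotating through the features."""
    if not points:
        return None
    f = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[f])
    mid = len(pts) // 2
    return Node(tuple(pts[mid]), f,
                build(pts[:mid], depth + 1), build(pts[mid + 1:], depth + 1))

def box_dist(q, lo, hi):
    """Euclidean distance from q to the axis-aligned box [lo, hi]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, lo, hi)))

def nearest(root, q, eps=0.0):
    """Return (point, distance) with distance <= (1 + eps) * true NN distance."""
    if root is None:
        return None, math.inf
    d = len(q)
    best, r = None, math.inf
    tie = itertools.count()           # tie-breaker so nodes are never compared
    heap = [(0.0, next(tie), root, (-math.inf,) * d, (math.inf,) * d)]
    while heap and heap[0][0] < r / (1 + eps):
        _, _, node, lo, hi = heapq.heappop(heap)
        dist = math.dist(q, node.point)
        if dist < r:                  # r <- min{r, |qw|}
            best, r = node.point, dist
        f, s = node.feature, node.point[node.feature]
        left_hi = hi[:f] + (s,) + hi[f + 1:]     # left child box: x_f <= s
        right_lo = lo[:f] + (s,) + lo[f + 1:]    # right child box: x_f >= s
        for child, clo, chi in ((node.left, lo, left_hi),
                                (node.right, right_lo, hi)):
            if child is not None:
                key = box_dist(q, clo, chi)
                if key < r / (1 + eps):
                    heapq.heappush(heap, (key, next(tie), child, clo, chi))
    return best, r

# Illustrative data (an assumption, not from the lecture).
sample_points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(sample_points)
print(nearest(tree, (9, 2)))            # exact query, eps = 0
print(nearest(tree, (9, 2), eps=0.5))   # approximate query
```

With `eps=0.0` the loop condition reduces to the exact search; a larger `eps` lets the loop stop while unexplored boxes remain, provided none of them can contain a point closer than `r / (1 + eps)`. Extending the sketch to k-NN would mean replacing `r` with a max-heap of the k best distances, as the notes describe.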
Example: im2gps

[I want to emphasize the fact that exhaustive nearest neighbor search really is one of the first classifiers you should try in practice, even if it seems too simple. So here's an example of a modern research paper that uses 1-NN and 120-NN search to solve a problem.]

Paper by James Hays and [our own] Prof. Alexei Efros.

[Goal: given a query photograph, determine where on the planet the photo was taken. Called geolocalization. They evaluated both 1-NN and 120-NN with a complex set of features. What they did not do, however, is treat each photograph as one long vector. That's okay for tiny digits, but too expensive for millions of travel photographs. Instead, they reduced each photo to a small descriptor made up of a variety of features that extract the essence of each photo.]

[Show slides (im2gps.pdf). Sorry, images not included here. http://graphics.cs.cmu.edu/projects/im2gps/]

[Features, in rough order from most effective to least:
1. GIST: A compact "image descriptor" based on oriented edge detection (Gabor filters) + histograms.
2. Textons: A histogram of textures, created after assembling a dictionary of common textures.
3. A shrunk 16 × 16 image.
4. A color histogram.
5. Another histogram of edges, this one based on the Canny edge detector, invented by our own Prof. John Canny.
6. A geometric descriptor that's particularly good for identifying ground, sky, and vertical lines.]

[Bottom line: With 120-NN, their most sophisticated implementation came within 64 km of the correct location about 50% of the time.]

RELATED CLASSES

[If you like machine learning and you'll still be here next year, here are some courses you might want to take.]

CS C281A (spring): Statistical Learning Theory [C281A is the most direct continuation of CS 189/289A.]
EE 127 (spring), EE 227BT (fall): Numerical Optimization [a core part of ML]
[It's hard to overemphasize the importance of numerical optimization to machine learning, as well as other CS fields like graphics, theory, and scientific computing.]
EE 126 (both): Random Processes [Markov chains, expectation maximization, PageRank]
EE C106A/B (fall/spring): Intro to Robotics [dynamics, control, sensing]
Math 110: Linear Algebra [but the real gold is in Math 221]
Math 221: Matrix Computations [how to solve linear systems, compute SVDs, eigenvectors, etc.]
CS 194-26 (fall): Computational Photography (Efros)
CS 294-43 (fall): Visual Object and Activity Recognition (Efros/Darrell)
CS 294-112 (fall): Deep Reinforcement Learning (Levine)
CS 298-115 (fall): Algorithmic Human-Robot Interaction (Dragan)
CS 298-131 (fall): Special Topics in Deep Learning (Song/Darrell)
VS 265 (?): Neural Computation
CS C280 (?): Computer Vision
CS C267 (?): Scientific Computing [parallelization, practical matrix algebra, some graph partitioning]