Yes, because Cover Trees are sometimes too slow. In fact, I asked myself this question not for the build time, but for the search time when the data has structure. Imagine what would happen if your data were more or less a regular grid. When I tried that, starting with a point at (0,0), then (1,0)…, the first node (0,0) ended up with references to all the last points (9,9), (9,8)…, and I figured it would be slower than a tree search. So I decided to give kd-trees a shot for this kind of search on a regular grid.
Implementation
The implementation is not optimized. The only noticeable thing is that I don't create subnodes for all points: I only split a node when it has more than n points, and then along the best dimension (best being defined as the biggest distance on one axis inside the node, the root distance being set by hand). The search is of course really close to the cover tree implementation.
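To make the splitting rule concrete, here is a minimal sketch of what such a node could look like. The class name, the max_points threshold and the median split are my own assumptions for illustration, not the actual code from the repository.

```python
import numpy as np

class KDTreeNode(object):
    """Hypothetical node: keeps its points until there are more than
    max_points of them, then splits along the axis with the biggest extent."""
    def __init__(self, points, max_points=10):
        self.points = np.asarray(points, dtype=float)
        self.max_points = max_points
        self.left = self.right = None
        if len(self.points) > self.max_points:
            self._split()

    def _split(self):
        # "Best" dimension: the one with the largest extent inside this node
        extents = self.points.max(axis=0) - self.points.min(axis=0)
        self.axis = int(np.argmax(extents))
        # Split at the median along that axis
        order = np.argsort(self.points[:, self.axis])
        mid = len(order) // 2
        self.threshold = float(self.points[order[mid], self.axis])
        self.left = KDTreeNode(self.points[order[:mid]], self.max_points)
        self.right = KDTreeNode(self.points[order[mid:]], self.max_points)
```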
Also, when doing the search, I can override the original distance function with a new one. This is because, at search time, some dimensions may be of greater interest than others; at least this is the case for my application. When creating the tree, I don't have access to the actual distance, and it could even change during the process.
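As a rough illustration of that query-time override, building on the node sketch above, the search could simply take the distance as a parameter. The nearest function, the default L2 lambda and the plane-based pruning below are my own assumptions, and the pruning as written is only valid for Lp-style distances.

```python
def nearest(node, query, dist=None, best=None):
    """Hypothetical search: the distance is supplied at query time, so it
    can differ from anything known (or unknown) when the tree was built."""
    if dist is None:
        dist = lambda a, b: np.sqrt(((a - b) ** 2).sum())  # default to L2
    if best is None:
        best = [None, np.inf]                               # [point, distance]
    query = np.asarray(query, dtype=float)
    if node.left is None:                                    # leaf: scan its points
        for p in node.points:
            d = dist(p, query)
            if d < best[1]:
                best[0], best[1] = p, d
    else:
        # Visit the side containing the query first; visit the other side
        # only if the splitting plane is closer than the current best.
        near, far = ((node.left, node.right)
                     if query[node.axis] <= node.threshold
                     else (node.right, node.left))
        nearest(near, query, dist, best)
        plane = query.copy()
        plane[node.axis] = node.threshold
        if dist(plane, query) < best[1]:
            nearest(far, query, dist, best)
    return best
```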
Results
OK, some comparisons now. I’ve changed the number of nodes to 1 million. I’ve relaunched the cover tree test, and this is the result:
Build time 12:20:49.484375
Out time (linear) 00:00:04.500000
Out time (cover_tree) 00:00:00.062500
The first figure is correct: it is more than 12 hours. You have to search a lot for a cover tree to be worth it.
The kd-tree:
Build time 00:00:01.890625
Out time (linear) 00:00:02.921875
Out time (kd-tree) 00:00:00.875000
Strangely, the timing for the linear search is a bit different, but I guess this is due to the long time the program had to run. The kd-tree search time is not as good as the cover tree's, but it is only one test. I would need to test far more points, but due to the high cover tree build time, I haven't tried yet.
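For what it's worth, here is a sketch of how such a comparison could be timed, reusing the hypothetical KDTreeNode and nearest sketches above; the sizes and the brute-force baseline below are my own and much smaller than the actual 1 million-point test.

```python
import datetime
import time

rng = np.random.RandomState(0)
points = rng.rand(10000, 3)          # hypothetical size, not the 1 million of the real test

start = time.time()
root = KDTreeNode(points)
print("Build time", datetime.timedelta(seconds=time.time() - start))

queries = rng.rand(100, 3)

start = time.time()
for q in queries:
    min(points, key=lambda p: np.sqrt(((p - q) ** 2).sum()))   # brute-force baseline
print("Out time (linear)", datetime.timedelta(seconds=time.time() - start))

start = time.time()
for q in queries:
    nearest(root, q)
print("Out time (kd-tree)", datetime.timedelta(seconds=time.time() - start))
```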
Now, if we use another distance (the infinity norm instead of L2):
Out time (linear) 00:00:04.718750
Out time (kd-tree) 00:00:00.890625
The linear time is a bit higher, close to the one from the cover tree test. What is interesting is that the kd-tree time is comparable to the L2 search time, and this is what I wanted to check.
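In code, and still relying on the sketches above (the tree root and generator from the timing sketch, plus lambdas of my own), swapping the metric is nothing more than passing a different callable at query time:

```python
l2   = lambda a, b: np.sqrt(((a - b) ** 2).sum())
linf = lambda a, b: np.abs(a - b).max()          # infinity norm

q = rng.rand(3)
point_l2,  d_l2  = nearest(root, q, dist=l2)     # same tree, L2 metric
point_inf, d_inf = nearest(root, q, dist=linf)   # same tree, infinity metric
```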
Conclusion
Kd-trees are nothing earth-shaking, but they are sometimes still very relevant. Of course, I didn't optimize either the cover tree or the kd-tree implementation, so there is room for improvement. What I really like about the kd-tree is that if you change the distance function, it still works, whereas a cover tree, being built on a specific distance, cannot adapt easily.
The kd-tree code is on Github here.
Nice. Fwiw, I find that the metric used in the search makes quite a difference in runtime; see the numbers and plot in
http://cstheory.stackexchange.com/questions/5607/does-kd-tree-search-look-at-more-leaves-in-l1-than-in-lmax
I still don’t understand this.
(What are good combinations of prescale and search metric, e.g. prescale all x /= |x| ?
Dunno that either.)
Also, a cutoff (quit after searching __ leaves) gets a good speedup,
although it loses any guarantee of returning the true nearest; FLANN does that in 128d.
Hi Denis,
I'd say that the prescale question is all a matter of data distribution. If your data is not isotropic, it may benefit from prescaling if the tree build does not favor any particular axis (meaning it cuts through the largest remaining extent, as I do here).
As for the distance, the L1 approximation is bigger than the Lmax one, so when you are searching for the nodes to traverse, L1 will give you a wider choice than L2 or Lmax. I guess that is the correct reason.
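As a quick sanity check of that ordering, with some made-up per-axis gaps to a node:

```python
import numpy as np

gaps = np.array([0.3, 0.1, 0.2])                 # hypothetical per-axis gaps to a node
print(gaps.sum(), np.sqrt((gaps ** 2).sum()), gaps.max())
# 0.6 > 0.374... > 0.3 : L1 >= L2 >= Lmax for the same gaps
```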
And yes, the cutoff can be interesting. In my application, I need the correct neighbors (well, in fact, I won't use it for nearest-neighbor search but for Parzen window search, but the acceleration structure is the same!).
Thanks for your comment!