A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality


Proc Int Jt Conf Neural Netw. Author manuscript; available in PMC 2012 Feb 8.


PMCID: PMC3255306

NIHMSID: NIHMS330625

PMID: 22247818

Xueyi Wang



Abstract

The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses k-means clustering and the triangle inequality to accelerate the search for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-trees, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10^6 records and 10^4 dimensions, kMkNN shows a 2- to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significantly better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces.

I. INTRODUCTION

The k-nearest neighbor (k-NN) algorithm is widely used in many areas such as pattern recognition, machine learning, and data mining. The traditional k-NN algorithm is called a lazy learner, as the buildup stage is cheap but the searching stage is expensive: the distances from a query object to all the training objects need to be calculated in order to find nearest neighbors for the query object [24]. One advantage of the traditional k-NN algorithm is that the running time is sublinear in k. For example, given a training set with n objects and m dimensions, if we maintain a max-heap structure for the set of k current nearest training objects, then the time complexity of the traditional k-NN algorithm is O(mn lg k) for a single query.
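For illustration only, the following minimal sketch (our own code, not taken from the paper; all names are ours) shows such a brute-force query with a max-heap of size k built on std::priority_queue: every training object is examined once, and each heap update costs O(lg k), giving O(mn lg k) per query.

#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Squared Euclidean distance between two m-dimensional points (ordering is the
// same as for the true distance, so no sqrt is needed here).
double squaredDistance(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        double diff = a[d] - b[d];
        s += diff * diff;
    }
    return s;
}

// Traditional (brute-force) k-NN: scan all n training objects and keep the k
// smallest distances in a max-heap, so each heap update costs O(lg k).
std::vector<std::size_t> bruteForceKnn(const std::vector<std::vector<double>>& train,
                                       const std::vector<double>& query,
                                       std::size_t k) {
    if (k == 0) return {};
    // Max-heap ordered by distance; the root holds the current k-th nearest distance (dmax).
    std::priority_queue<std::pair<double, std::size_t>> heap;
    for (std::size_t i = 0; i < train.size(); ++i) {
        double d = squaredDistance(train[i], query);
        if (heap.size() < k) {
            heap.push({d, i});
        } else if (d < heap.top().first) {
            heap.pop();          // drop the current farthest of the k
            heap.push({d, i});   // insert the closer object
        }
    }
    std::vector<std::size_t> result;
    while (!heap.empty()) { result.push_back(heap.top().second); heap.pop(); }
    return result;  // indices of the k (or fewer) nearest objects, farthest first
}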

In 2D or 3D space, graph-based searching methods such as the Voronoi diagram [17] and the proximity graph [21] are efficient for searching nearest neighbors, but it is very hard to extend these methods to higher dimensions. A number of spatial searching methods such as ball-trees [18], kd-trees [4], metric trees [5] [22], quadtrees [16], and R-trees [11] have been proposed to efficiently reduce the distance calculations and find exact nearest neighbors in higher dimensions. These methods iteratively divide the training objects and build tree structures using criteria such as absolute coordinates and relative distances, so that a query object needs to check distances with only a limited number of training objects instead of the whole dataset. One problem with these methods is that when the dimensionality of a dataset is high, most of the training objects in the data structures end up being evaluated, and the searching efficiency is no better than or even worse than that of the traditional k-NN algorithm [10] [18], especially for large k values.

Due to the difficulty of accelerating the k-NN algorithm in high dimensional space, some methods have focused on finding approximate answers. For example, the hashing method from [9] and the priority-queue based method from [3] achieved severalfold speedups over the traditional k-NN by outputting k neighbors within (1+ε) of the true nearest neighbor distances. Hart [13] and Wilson [23] used techniques called condensing and editing to remove objects from the dataset and accelerate the search for nearest neighbors.

In this paper, we present a new algorithm called kMkNN (k-Means for k-Nearest Neighbors) that efficiently searches for the exact k nearest training objects of a query object. The kMkNN algorithm incorporates two simple methods, k-means clustering and the triangle inequality, into the nearest neighbor search and achieves good performance compared to other algorithms. The basic idea is that we first classify the training objects into different clusters regardless of the classes of the training objects, and then, for a given query object q, we use the triangle inequality to avoid distance calculations for training objects in the clusters that are far from q.

The kMkNN algorithm has two stages. In the buildup stage, we separate the dataset into clusters and record the distance from each training object to its closest cluster center. In the searching stage, we first calculate the distances from a query object q to all cluster centers. Then we visit each training object starting from the nearest cluster to q and maintain a set of k current nearest objects together with the largest distance dmax among those k objects. For a training object p with its closest cluster center c, the triangle inequality for the distances of p, q, and c gives ∥q − c∥ ≤ ∥p − c∥ + ∥p − q∥, so if dmax < abs(∥q − c∥ − ∥p − c∥), then dmax < ∥p − q∥ and we do not need to explicitly calculate the distance ∥p − q∥.
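Spelled out, the pruning bound is just two applications of the triangle inequality (a restatement of the argument above; nothing new is assumed):

\[
\|q-c\| \le \|p-c\| + \|p-q\|
\quad\text{and}\quad
\|p-c\| \le \|q-c\| + \|p-q\|
\;\Longrightarrow\;
\bigl|\,\|q-c\| - \|p-c\|\,\bigr| \le \|p-q\|,
\]

so whenever dmax < abs(∥q − c∥ − ∥p − c∥), we also have dmax < ∥p − q∥ and p cannot displace any of the k current nearest objects.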

Figure 1 shows a scenario of the kMkNN algorithm with a two-class dataset. One class is shown as triangles, the other class is shown as pluses, and the query object is shown as a black circle. The dataset is separated into 6 clusters, and most of the query object's nearest neighbors are in the bottom-right cluster. In the searching stage, kMkNN first calculates the distances from the query object to all cluster centers. Then it searches for the k nearest training objects starting from the bottom-right cluster and uses the triangle inequality to avoid distance calculations for some training objects in some clusters, such as the top-left one.


Fig. 1. A scenario of the kMkNN algorithm. The plus and triangle signs denote two classes of objects and the black dot denotes the query object. The kMkNN algorithm first classifies objects into clusters and then finds the k-nearest neighbors for a query object starting from the nearest cluster.

If the training objects can be well separated into clusters, then the kMkNN algorithm should be able to significantly reduce the distance calculations and accelerate the search for nearest training objects by using the triangle inequality. If the training objects are uniformly distributed or condensed together and there is no good way to build well-separated clusters (for example, a uniformly distributed random dataset), then the triangle inequality may not apply and kMkNN may end up calculating the distances to all the training objects. It should be noted, though, that since the number of clusters is much smaller than the total number of training objects, the cost of the extra distance calculations from the query object to the cluster centers is small compared to the total number of distance calculations. So the worst-case performance of kMkNN will be comparable to the traditional k-NN.

We tested the kMkNN algorithm on a collection of 20 datasets. The datasets, collected from the UCI Machine Learning Repository [8] and the National Cancer Institute, have up to 10^6 objects and 10^4 dimensions. We conducted three experiments on the datasets, all of which used a small k value of 9 and a large k value of 101. In the first experiment, we compared kMkNN to the traditional k-NN algorithm on the reduction of distance calculations and the speedup of the running time. The results show a 2- to 80-fold reduction in distance calculations and a 2- to 60-fold speedup in running time for 16 datasets. On the 2 worst-performing datasets, kMkNN is comparable to the traditional k-NN algorithm (less than a 5% slowdown in running time). In the second experiment, we compared kMkNN to the kd-tree based algorithm on the speedup over the traditional k-NN for all datasets. The results show that kMkNN performs much better. In the last experiment, we compared kMkNN to the ball-tree based algorithm [18] on the overall running time (the breakdown of the buildup and searching time is not available for the ball-tree based program we obtained [18]), and kMkNN shows better performance for most datasets. All three experiments show that the kMkNN algorithm can effectively reduce the distance calculations and accelerate the search for nearest neighbors in high dimensional spaces. It can be considered an alternative to existing tree-based exact k-NN algorithms.

II. METHODS

Assume there are n objects pi for (1 ≤ i ≤ n) in a training dataset and each object pi has m dimensions. We present a new algorithm kMkNN (k-Means for k-Nearest Neighbors) that searches for the exact k nearest neighbors of a query object q, as follows:

Algorithm 1

The kMkNN algorithm

The BUILDUP stage

Input: n training objects pi for (1 ≤ i ≤ n); each with m dimensions.

Output: kc cluster centers cj for (1 ≤ j ≤ kc) with the assignment of each object pi to its nearest cluster.

  1. Set kc = s√n, where s > 0.

  2. Classify the n objects into kc clusters using a k-means clustering algorithm.

  3. Calculate and record the distance dij from each object pi to its nearest cluster center cj.

  4. For each cluster center cj, sort the distances dij in descending order for all the training objects associated to cj.

The SEARCHING stage

Input: kc cluster centers cj for (1 ≤ j ≤ kc) with the assignments of each object pi to its nearest center cj; a query object q; the number of nearest neighbors k.

Output: A set of k nearest training objects to q.

  1. Initialize the set of k current nearest objects of q with k entries at a maximum distance, and let dmax denote the largest distance in this set.

  2. Calculate the distances ∥q − cj∥ for all cluster centers cj (1 ≤ j ≤ kc) and sort the distances in ascending order.

  3. For each cluster cj from the nearest to the farthest to q, perform step 4.

  4. For each object pi in the cluster cj from the farthest to the nearest to the cluster center cj:

if dmax ≤ ∥q − cj∥ − ∥pi − cj∥, go to step 3 to visit the next cluster;

otherwise, calculate the distance ∥q − pi∥ from q to pi and if dmax > ∥q − pi∥, remove the object with distance dmax from the set of k current nearest objects and insert pi with the distance ∥q − pi∥; update dmax.

The kMkNN algorithm runs in two stages. In the buildup stage, we separate the n training objects into kc clusters using a k-means clustering algorithm. After the clustering, we record the distance from each object to its nearest cluster center and sort these distances for each cluster in descending order. In the searching stage, we first initialize the distances of the k nearest neighbors to a maximum value and calculate and sort the distances ∥q − cj∥ from q to each cluster center cj for (1 ≤ j ≤ kc). Then, starting from the nearest cluster to q, we use the triangle inequality to check each object pi in a cluster cj. If dmax ≤ ∥q − cj∥ − ∥pi − cj∥, where dmax is the maximum distance in the set of k current nearest objects to q, then we skip the rest of the cluster cj. Otherwise, we calculate ∥q − pi∥, and if ∥q − pi∥ < dmax, we remove the object with the distance dmax from the set of k current nearest objects, insert pi, and update dmax. After we check all the clusters cj for (1 ≤ j ≤ kc), the set of k current nearest objects contains the k nearest neighbors of q among all training objects pi for (1 ≤ i ≤ n).
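As a concrete illustration of the searching stage, the sketch below is our own reconstruction (not the author's released C++ implementation; the Cluster structure and all identifiers are ours). It assumes the buildup stage has already produced, for each cluster, its center and its member objects sorted by distance to the center in descending order.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// One cluster produced by the buildup stage (hypothetical layout used in this sketch).
struct Cluster {
    std::vector<double> center;          // cluster center c_j
    std::vector<std::size_t> members;    // indices of objects assigned to c_j,
                                         // sorted by distToCenter in DESCENDING order
    std::vector<double> distToCenter;    // ||p_i - c_j|| in the same order
};

// True Euclidean distance; the triangle inequality needs the metric itself,
// not squared distances.
double euclidean(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) { double t = a[d] - b[d]; s += t * t; }
    return std::sqrt(s);
}

// Searching stage of kMkNN (sketch): visit clusters from nearest to farthest and
// use the triangle inequality, dmax <= ||q - c_j|| - ||p_i - c_j||, to prune.
std::vector<std::size_t> kmknnSearch(const std::vector<std::vector<double>>& train,
                                     const std::vector<Cluster>& clusters,
                                     const std::vector<double>& q,
                                     std::size_t k) {
    if (k == 0) return {};
    std::priority_queue<std::pair<double, std::size_t>> best;  // max-heap, root = dmax
    double dmax = std::numeric_limits<double>::max();

    // Step 2: distances from q to all cluster centers, sorted in ascending order.
    std::vector<std::pair<double, std::size_t>> centerDist;
    for (std::size_t j = 0; j < clusters.size(); ++j)
        centerDist.push_back({euclidean(q, clusters[j].center), j});
    std::sort(centerDist.begin(), centerDist.end());

    // Steps 3-4: visit each cluster from the nearest to the farthest to q.
    for (const auto& cd : centerDist) {
        const Cluster& c = clusters[cd.second];
        const double qc = cd.first;                  // ||q - c_j||
        for (std::size_t t = 0; t < c.members.size(); ++t) {
            // Members are ordered from farthest to nearest to c_j, so once one
            // object is pruned every remaining object is pruned as well.
            if (dmax <= qc - c.distToCenter[t]) break;
            double d = euclidean(q, train[c.members[t]]);
            if (best.size() < k || d < dmax) {
                best.push({d, c.members[t]});
                if (best.size() > k) best.pop();     // drop the current farthest
                if (best.size() == k) dmax = best.top().first;
            }
        }
    }
    std::vector<std::size_t> result;
    while (!best.empty()) { result.push_back(best.top().second); best.pop(); }
    return result;  // indices of the k nearest training objects, farthest first
}

Note that the pruning test uses true Euclidean distances rather than squared distances, since the triangle inequality holds for the metric itself; a tuned implementation could still cache norms or terminate a distance computation early once it exceeds dmax.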

In the searching stage, given a query object q, we have two assumptions on how to visit the clusters and the objects in the clusters: (a) it is most likely that we can find most of the k nearest training objects to q in only a few nearest clusters to q; (b) for each object in a cluster other than the cluster closest to q, it is more likely that the distance from the object to the cluster center is shorter than the distance from q to the cluster center.

Assumption (a) is obvious, so in the algorithm we visit each cluster from the nearest to the farthest to q after we calculate the distance ∥q − cj∥ from q to each cluster center cj for (1 ≤ j ≤ kc) and sort the distances.

Assumption (b) does not hold for the cluster nearest to q (for example, the lower right cluster in Figure 1), but it does hold for other clusters (for example, the clusters other than the lower right one in Figure 1). The objects in those clusters will have shorter distances to their corresponding centers than the distances from q to those cluster centers, that is, q can be considered as an outlier for the majority of clusters.

Based on assumption (b), we can use the triangle inequality to avoid unnecessary distance calculations from some training objects pi to q. For the cluster nearest to q, we most likely will check all objects and obtain a set of k current nearest objects with a small dmax. For any other cluster cj, most likely we have ∥q − cj∥ > ∥pi − cj∥, that is, q is farther from cj than a training object pi associated with cj. Also, by the triangle inequality, we have ∥q − cj∥ − ∥pi − cj∥ ≤ ∥q − pi∥. If dmax ≤ ∥q − cj∥ − ∥pi − cj∥, then dmax ≤ ∥pi − q∥ and we do not need to calculate the distance ∥pi − q∥. Note that we do not use dmax ≤ abs(∥q − cj∥ − ∥pi − cj∥), since it is rare to have ∥pi − cj∥ > ∥q − cj∥ (in which case q would be closer to cj than pi for a non-nearest cluster to q), so we check only dmax ≤ ∥q − cj∥ − ∥pi − cj∥ in the implementation.

To further reduce distance calculations, for each cluster we check from the farthest training object associated with the cluster center to the nearest one. The reason is that, for any cluster cj other than the nearest one, generally ∥q − cj∥ > ∥pi − cj∥ for any object pi associated with cj. Given two objects pi1 and pi2 associated with cj, if ∥pi1 − cj∥ > ∥pi2 − cj∥ (i.e., pi1 is farther from cj than pi2), then ∥q − cj∥ − ∥pi1 − cj∥ < ∥q − cj∥ − ∥pi2 − cj∥ and pi1 is more likely to be a candidate nearest neighbor of q than pi2, so in the buildup stage we sort the distances from all training objects to their cluster centers in descending order and in the searching stage we check the objects in that order.

By checking training objects from the farthest to the nearest in a cluster, we can even avoid checking the triangle inequality for some objects. For any cluster cj other than the one nearest to q, if we have dmax ≤ ∥q − cj∥ − ∥pi − cj∥ for a training object pi, then we can skip the remaining training objects in cj and move to the next cluster: since ∥pi − cj∥ ≥ ∥pix − cj∥ for any of the remaining objects pix in the cluster cj, we have ∥q − cj∥ − ∥pi − cj∥ ≤ ∥q − cj∥ − ∥pix − cj∥.

To efficiently obtain the largest distance dmax from the set of k current nearest objects to q, we can build a max-heap for the k distances in which the root object has the largest distance (i.e., dmax) to q. Every time we find a new training object with a distance smaller than dmax, we can remove the root object and insert the new object in O(lg k) time. During the initialization, we set all k distances to the same maximum value, so the initial heap is valid automatically.

One more issue is how to choose the number of clusters kc. If kc is not chosen carefully, then the k-means clustering may yield poor results. For kMkNN, the purpose of the clustering process is to separate the dataset into groups so that we can reduce the number of distance calculations in the searching stage. If we use a small kc, then each cluster will have many training objects, and if the clusters are not well separated, the query object may be "close" to most of the clusters, so kMkNN may perform a large number of distance calculations and the running time will increase. On the other hand, if we use a large kc, then although each cluster will have only a few training objects and we may have fewer distance calculations within clusters, the running time may also increase, since a large kc has high extra costs.

Since it is very likely that we need to calculate distances from q to all objects in the cluster nearest to q and it is unlikely that we need to calculate distances from q to objects in other clusters, if we generate kc = O(√n) clusters with O(√n) objects in each cluster and the number of nearest neighbors k < √n, then in the searching stage kMkNN calculates distances to the training objects in only a few of the nearest clusters and finds the k nearest neighbors. Calculating the distances from a query object q to all cluster centers takes O(m√n), sorting all the clusters from the nearest to the farthest to q takes O(√n lg √n), checking the triangle inequality for all training objects takes O(n) in the worst case, and calculating distances from q to all training objects in a few nearest clusters takes O(m√n) in the worst case (note that kMkNN may calculate the distances from q to most but not all training objects in those nearest clusters; the actual number of calculations is determined by the triangle inequality), so if m << n, then we may achieve near optimal performance by choosing kc = √n. In practice the dataset is never ideal and many factors will influence performance.
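Collecting the terms above (our paraphrase of this worst-case accounting, assuming kc = Θ(√n) clusters of O(√n) objects each and that only a few of the nearest clusters are scanned exhaustively), the per-query cost is at most

\[
\underbrace{O(m\sqrt{n})}_{\text{distances to centers}}
+ \underbrace{O(\sqrt{n}\,\lg\sqrt{n})}_{\text{sorting the centers}}
+ \underbrace{O(n)}_{\text{triangle-inequality checks}}
+ \underbrace{O(m\sqrt{n})}_{\text{distances in the nearest clusters}},
\]

which is well below the O(mn) cost of the traditional k-NN when m << n.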

The ideal complexity analysis above does not consider the constant factor associated with each time complexity term, so in the implementation we set kc = O(√n) = s√n by introducing a constant s > 0. Our experiments in Section III show that kMkNN achieves good running times for most datasets when s = 2.0, but the optimal value of s may vary depending on the specific dataset and the number k.

We note that the searching stage of the kMkNN algorithm adds only a small overhead compared to the traditional k-NN algorithm in the worst case. Given O(√n) clusters, calculating and sorting the distances from q to all cluster centers takes O(m√n lg m√n) time, and comparing dmax with ∥q − cj∥ − ∥pi − cj∥ for all training objects using the triangle inequality takes O(n) time in the worst case, so the total extra cost is O(m√n lg m√n + n), which is less than the O(mn) running time of the traditional k-NN. The small overhead means that even in the worst-case scenario, when the triangle inequality does not help and we end up calculating all the distances, the running time of kMkNN is still close to the traditional k-NN.

In the buildup stage, the goal is to split the dataset into clusters, so any clustering algorithm will work. We use Lloyd’s algorithm [19] in the implementation, which is a simple heuristic algorithm that iteratively converges to a local minimum. There are faster k-means clustering algorithms available [6] [7] and we can use these algorithms to speed up the buildup stage of kMkNN.

There is no guarantee that Lloyd's algorithm [19] will converge to the global minimum, that is, minimize the sum of all the distances from all the objects to their nearest centers, which is an NP-hard problem in general [1] and solvable in O(n^(mk+1) lg n) time given m dimensions and k clusters [14]. Ideally, if the clustering algorithm converges to a local minimum closer to the global minimum, then each cluster is relatively denser and we may reduce the number of distance calculations in the searching stage. Some k-means clustering algorithms [2] [15] have been shown to produce better results than Lloyd's algorithm, but for this paper we already achieve excellent performance using the naive Lloyd's algorithm. It will be interesting to see how the performance can be further improved by integrating these algorithms.
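For completeness, here is a sketch of the buildup stage (again our own illustration rather than the released implementation; it reuses the Cluster structure and euclidean() from the searching-stage sketch above). It runs a plain Lloyd-style k-means with kc = s√n, then records the distance of every member to its center and sorts the members in descending order of that distance, as Algorithm 1 requires.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Buildup stage of kMkNN (sketch): cluster the n training objects into
// kc = s * sqrt(n) clusters with Lloyd's algorithm, then record and sort the
// member distances to each center in descending order.
std::vector<Cluster> kmknnBuildup(const std::vector<std::vector<double>>& train,
                                  double s = 2.0, int iterations = 25) {
    if (train.empty()) return {};
    const std::size_t n = train.size(), m = train[0].size();
    const std::size_t kc =
        std::max<std::size_t>(1, static_cast<std::size_t>(s * std::sqrt(static_cast<double>(n))));

    // Initialize the kc centers with randomly chosen training objects.
    std::mt19937 gen(0);
    std::uniform_int_distribution<std::size_t> pick(0, n - 1);
    std::vector<Cluster> clusters(kc);
    for (auto& c : clusters) c.center = train[pick(gen)];

    std::vector<std::size_t> assign(n, 0);
    for (int it = 0; it < iterations; ++it) {
        // Assignment step: each object goes to its nearest center.
        for (std::size_t i = 0; i < n; ++i) {
            double bestD = euclidean(train[i], clusters[0].center);
            std::size_t bestJ = 0;
            for (std::size_t j = 1; j < kc; ++j) {
                double d = euclidean(train[i], clusters[j].center);
                if (d < bestD) { bestD = d; bestJ = j; }
            }
            assign[i] = bestJ;
        }
        // Update step: move each center to the mean of its members.
        std::vector<std::vector<double>> sum(kc, std::vector<double>(m, 0.0));
        std::vector<std::size_t> count(kc, 0);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t d = 0; d < m; ++d) sum[assign[i]][d] += train[i][d];
            ++count[assign[i]];
        }
        for (std::size_t j = 0; j < kc; ++j)
            if (count[j] > 0)
                for (std::size_t d = 0; d < m; ++d) clusters[j].center[d] = sum[j][d] / count[j];
    }

    // Record members and their distances to the center, then sort in descending order.
    for (std::size_t i = 0; i < n; ++i) {
        Cluster& c = clusters[assign[i]];
        c.members.push_back(i);
        c.distToCenter.push_back(euclidean(train[i], c.center));
    }
    for (auto& c : clusters) {
        std::vector<std::size_t> order(c.members.size());
        for (std::size_t t = 0; t < order.size(); ++t) order[t] = t;
        std::sort(order.begin(), order.end(),
                  [&c](std::size_t a, std::size_t b) { return c.distToCenter[a] > c.distToCenter[b]; });
        std::vector<std::size_t> sortedMembers;
        std::vector<double> sortedDist;
        for (std::size_t t : order) {
            sortedMembers.push_back(c.members[t]);
            sortedDist.push_back(c.distToCenter[t]);
        }
        c.members = std::move(sortedMembers);
        c.distToCenter = std::move(sortedDist);
    }
    return clusters;
}

Any clustering routine could be substituted for the Lloyd loop above; only the recorded center, member list, and sorted member-to-center distances are needed by the searching stage.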

III. EXPERIMENTS AND DISCUSSION

A. Datasets

We test the kMkNN algorithm on a collection of 20 datasets, as listed in Table 1. The C++ code of kMkNN and all the processed datasets are available upon request.

TABLE I

The 20 Datasets

Dataset            #Objects    #Dimensions
abalone            4177        8
arcene             900         10000
car                1728        6
chess              28056       6
dorothea           1950        770
ds1.10pca          26733       10
ds1.100pca         26733       100
gisette            13500       5000
image              2310        19
ipums              70187       60
isolet             7797        617
kddcup99           494021      41
letter             20000       16
multiple feat.     2000        659
musk clean1        476         166
musk clean2        6598        166
poker              1000123     10
satalog            6435        36
semeion            1593        256
spambase           4601        57


18 datasets (abalone, arcene [12], car, chess, dorothea [12], gisette [12], image, ipums [20], isolet, kddcup99, letter, multiple, musk1, musk2, poker, satalog, semeion, spambase) are from the UCI Machine Learning Repository. For chess, we use the King-Rook vs. King dataset. For dorothea, we use the first 654 features, as the objects have different numbers of features. For ipums, we use the ipums.la.97 dataset and predict farm status. For kddcup99, we use the 10% subset. For musk1 and musk2, we use clean1 and clean2 from Musk (Version 2).

The 2 other datasets (ds1.10pca and ds1.100pca) are derived from the dataset ds1 in the Open Compound Database provided by the National Cancer Institute (NCI); they are linear projections onto the top 10 and 100 dimensions by principal component analysis (PCA).

B. Experiments

We compared the performance of kMkNN to the traditional k-NN, kd-tree based, and ball-tree based algorithms in three experiments. All experiments used 10-fold cross validation. We tested the performance of kMkNN with a small k value of 9 and a large k value of 101. Furthermore, we used s = 2.0, which generates kc = 2.0√n clusters; we tested s = 1.0, 2.0, and 3.0, and s = 2.0 showed the best overall performance.
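To make the protocol concrete, a minimal driver along the following lines (our own illustration, built on the hypothetical kmknnBuildup and kmknnSearch sketches above; the paper's actual test harness is not shown) runs the 10-fold cross validation with s = 2.0 and both k values.

#include <cstddef>
#include <vector>

// Hypothetical 10-fold cross-validation driver for the sketches above: for each
// fold, the other nine folds form the training set, the buildup stage runs once
// with s = 2.0, and every held-out object is queried with k = 9 and k = 101.
void runCrossValidation(const std::vector<std::vector<double>>& data) {
    const std::size_t folds = 10;
    const std::size_t ks[2] = {9, 101};
    for (std::size_t f = 0; f < folds; ++f) {
        std::vector<std::vector<double>> train, test;
        for (std::size_t i = 0; i < data.size(); ++i)
            (i % folds == f ? test : train).push_back(data[i]);

        std::vector<Cluster> clusters = kmknnBuildup(train, /*s=*/2.0);  // buildup stage
        for (std::size_t k : ks)
            for (const auto& q : test)
                kmknnSearch(train, clusters, q, k);                      // searching stage
    }
}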

In the first experiment, we compared kMkNN to the traditional k-NN algorithm on the reduction of distance calculations and the speedup of running time. The results are shown in Tables II and III. For the distance calculations and the running time, we recorded the total number of distance calculations and the total running time over all cross-validation folds. For the running time of the kMkNN algorithm, we ignored the buildup time and recorded the searching time. For the traditional k-NN algorithm, the total number of distance calculations is 0.9n^2 for a 10-fold cross validation on a dataset with n objects, so the number of distance calculations is the same for any k value.

TABLE II

The Number of Distance Calculations of the kMkNN and the Traditional k-NN Algorithms

Dataset            k-NN* distance calculations    kMkNN reduction (k = 9)    kMkNN reduction (k = 101)
abalone            1.6 × 10^7                     16.3                       11.0
arcene             7.3 × 10^5                     2.6                        2.3
car                2.7 × 10^6                     5.4                        2.0
chess              7.1 × 10^8                     25.8                       13.7
dorothea           3.4 × 10^6                     6.7                        4.5
ds1.10pca          6.4 × 10^8                     19.1                       7.7
ds1.100pca         6.4 × 10^8                     2.7                        1.4
gisette            2.3 × 10^7                     1.0                        1.0
image              4.8 × 10^6                     13.2                       6.2
ipums              4.4 × 10^9                     85.1                       63.9
isolet             5.5 × 10^7                     1.4                        1.2
kddcup99           2.2 × 10^11                    80.4                       72.2
letter             3.6 × 10^8                     14.8                       6.0
multiple feat.     3.6 × 10^6                     7.0                        4.2
musk clean1        2.0 × 10^5                     1.8                        1.3
musk clean2        3.9 × 10^7                     5.3                        2.9
poker              9.0 × 10^11                    53.5                       28.3
satalog            3.7 × 10^7                     8.0                        5.5
semeion            2.3 × 10^6                     1.0                        1.0
spambase           1.9 × 10^7                     15.2                       9.6


*For the traditional k-NN algorithm, the number of distance calculations is the same for different k.

TABLE III

The Running Time of the kMkNN and the Traditional k-NN Algorithms

Dataset            Traditional k-NN (s)           kMkNN speedup
                   k = 9        k = 101           k = 9       k = 101
abalone            1.8          1.8               22.6        12.2
arcene             10.7         10.7              2.7         2.3
car                0.2          0.3               4.4         2.6
chess              77.3         76.0              30.9        22.5
dorothea           3.4          3.4               7.3         4.9
ds1.10pca          91.3         91.3              33.3        20.0
ds1.100pca         182.4        175.4             4.5         2.2
gisette            158.2        157.9             1.0         1.0
image              0.6          0.6               17.7        6.8
ipums              977.9        981.6             61.3        53.5
isolet             52.8         52.9              1.5         1.3
kddcup99           38200.0      38239.1           59.1        55.0
letter             50.2         50.2              24.8        14.0
multiple feat.     3.5          3.5               7.9         4.6
musk clean1        0.1          0.1               1.7         1.3
musk clean2        13.2         13.2              7.3         3.9
poker              130220.3     130357.9          60.1        49.9
satalog            5.9          6.0               13.7        9.6
semeion            0.9          0.9               1.2         1.1
spambase           3.5          3.5               18.6        12.2


In the second experiment, we compared kMkNN to the kd-tree based algorithm on the speedup over the traditional k-NN. The results are shown in Table IV. We used the kd-tree based algorithm implemented in MATLAB and the speedup was calculated by comparing the running time to the traditional k-NN algorithm implemented in MATLAB (labeled “exhaustive”). For both the kd-tree based and the traditional k-NN algorithms in MATLAB, we ignored the classifier buildup time and compared the searching/classifying time only, as the classifier buildup time in MATLAB may have some extra costs.

TABLE IV

The Speedup of the Running Time over the Traditional k-NN Algorithm for the kMkNN and the kd-Tree Algorithms

Dataset            k = 9                        k = 101
                   kMkNN       kd-Tree          kMkNN       kd-Tree
abalone            18.5        3.53             13.5        2.15
arcene             2.7         0.23             2.6         0.24
car                4.6         1.51             3.4         1.20
chess              27.9        7.74             20.6        4.01
dorothea           7.0         0.30             4.7         0.26
ds1.10pca          28.6        2.93             14.8        1.17
ds1.100pca         3.6         0.30             1.9         0.23
gisette            1.0         0.23             1.0         0.23
image              14.6        0.90             7.3         0.60
ipums              42.8        11.63            39.0        6.42
isolet             1.4         0.22             1.2         0.22
kddcup99 10%       44.8        1.07             46.3        1.05
letter             22.5        0.95             10.8        0.48
multiple feat.     7.4         0.22             4.4         0.22
musk v2 clean1     2.3         0.23             1.3         0.26
musk v2 clean2     6.1         0.22             3.3         0.21
poker              37.6        18.48            33.7        7.36
satalog            12.0        0.48             7.9         0.40
semeion            1.0         0.21             1.0         0.22
spambase           17.9        0.78             11.5        0.70


In the last experiment, we compared to the ball-tree based algorithm [18] on the overall running time (buildup + searching time). The results are shown in Table V. For the ball-tree based algorithm, since only an executable version is available from [18], we could not break down the buildup and searching time. So we compared the total running time.

TABLE V

The Overall Running Time of the kMkNN and the Ball-Tree Algorithms

Dataset            k = 9 (s)                    k = 101 (s)
                   kMkNN       Ball-Tree        kMkNN       Ball-Tree
abalone            6.3         3.2              4.6         4.8
arcene             183.2       1865.3           202.4       2320.6
car                0.9         1.3              0.8         2.5
chess              166.1       40.7             180.0       78.5
dorothea           41.3        136.4            49.1        197.7
ds1.10pca          131.9       121.6            150.8       236.0
ds1.100pca         927.0       5913.6           912.9       8225.9
gisette            1965.5      47435.9          1992.5      48432.2
image              2.3         4.0              2.3         7.0
ipums              2790.1      823.1            2887.6      1375.7
isolet             694.6       8548.3           646.2       8665.7
kddcup99 10%       2435.1      >86400           6162.4      >86400
letter             125.4       129.1            120.5       249.6
multiple feat.     54.5        139.4            53.0        215.9
musk v2 clean1     0.8         6.7              0.8         8.7
musk v2 clean2     92.7        286.7            82.9        465.2
poker              124089.9    44437.1          127338.6    83055.0
satalog            31.0        41.9             30.0        62.6
semeion            11.4        121.1            11.3        129.3
spambase           8.5         22.3             7.3         31.6


C. Results and Discussion

In the first experiment, the comparison to the traditional k-NN algorithm in Tables II and III shows that the kMkNN algorithm can effectively reduce the distance calculations as well as accelerate the k-nearest neighbor search, while the worst-case performance remains close to the traditional k-NN algorithm. Both tables show that kMkNN reduces the distance calculations by 2- to 80-fold and the running time by 2- to 60-fold on 16 datasets, whereas kMkNN shows a slight performance degradation (less than 5%) on two datasets (gisette and semeion). Comparing k = 9 and k = 101, both the reduction of distance calculations and the speedup of the running time decrease slowly (k = 101 is less than 2-fold worse than k = 9), which shows that the performance of kMkNN is robust as k increases.

When the number of objects increases and the number of clusters kc increases, the clusters generally tend to be more condensed and we can make better use of the triangle inequality, so we see a slight increase of the speedup in both criteria. For example, the datasets kddcup99 and poker, which have 5 × 10^5 and 10^6 objects respectively, show 70-fold and 30-fold speedups in the running time. When the dimensionality increases, the clusters generally tend to be more spread out, so we see a slight decrease of the speedup in both criteria. But kMkNN still performs well for datasets with high dimensions. For example, the dataset arcene, which has 10^4 dimensions, shows over a 2-fold speedup in the running time.

For the datasets gisette and semeion, the performance of kMkNN is not as good as the traditional k-NN algorithm. The main reason is that the objects in both datasets form single condensed clusters instead of spreading out. For example, the dataset gisette is for classifying two highly confusable digits, '4' and '9', and the objects in the dataset are similar to each other. Although it has 5000 dimensions, it forms a single condensed cluster due to the similarity of the objects. Although kMkNN separates the objects into many clusters in the buildup stage, for most query objects the distances from the query object to the cluster centers and the distances from the training objects to the cluster centers are close, so the triangle inequality does not work well and we end up calculating most of the distances. We also see similar worst-case performance on simulated datasets with uniformly distributed random objects. Still, as shown in Tables II and III, even in the worst-case scenario the performance of kMkNN is close to the traditional k-NN algorithm due to the small overhead.

In the second experiment, Table IV shows that kMkNN is on average 3 to 40 times faster than the kd-tree based algorithm for all 20 datasets. The kd-tree based algorithm is usually used for searching low dimensional datasets with small k values. Even when compared to the traditional k-NN algorithm, for the 20 datasets we used, the kd-tree based algorithm shows better performance on only a few datasets but 4 to 5 times worse performance on about half of the datasets. This suggests that the kd-tree based algorithm is not suitable for searching nearest neighbors in high dimensional space and that kMkNN or other spatial searching methods should be used instead.

In the third experiment, Table V shows that kMkNN performs better than the ball-tree based algorithm for most datasets. The ball-tree algorithm [18] is one of the spatial searching methods that improves on the kd-tree based algorithm and is efficient for searching nearest neighbors in high dimensional space. Although we could not obtain the source code for the ball-tree algorithm and can compare only the total running time, we can still see from Table V that kMkNN has better overall performance when combining the buildup and searching time. It shows that kMkNN can be considered an alternative to these spatial searching methods for nearest neighbor searching.

There are two advantages of the kMkNN algorithm over the tree-based spatial searching algorithms. The first advantage is that kMkNN has a small overhead in searching for nearest neighbors. Given a dataset of n training objects, a tree-based algorithm generally has O(lg n) levels and O(n) nodes, where each node has some extra cost for boundary checking. If a search for nearest neighbors involves only a few nodes at each level, then a total of O(lg n) nodes will be visited and the algorithm can be much better than kMkNN. But if a search involves most of the nodes at each level, then a total of O(n) nodes will be visited and the algorithm can be much worse than kMkNN. Even when the k nearest objects eventually come from a few nodes, too much extra time may be wasted on walking through the internal and leaf nodes, so the algorithm can perform badly due to the extra cost of each node. kMkNN is like a tree with two levels: one root node and kc leaf nodes. There are more operations inside each leaf node, but the whole data structure is flat, so there is no extra cost for walking through internal nodes.

Furthermore, the implementation of the kMkNN algorithm uses arrays instead of the pointers used in the tree-based algorithms, so the constant associated with the time complexity of kMkNN is small. When both kMkNN and a spatial searching algorithm require the same number of steps for a search, kMkNN will run faster in practice because of the small constant associated with each step.

The second advantage is that kMkNN has a better way of organizing objects. The tree-based algorithms use various criteria, such as individual dimensions, to separate the objects. There are two potential problems. One is that these criteria may not match the way the objects are distributed or queried, so the objects may not be separated or queried well. For example, one reason why the kd-tree method is worse than some other tree-based methods in high dimensions is that the kd-tree method uses individual dimensions to separate objects, but the objects are usually better organized by their relative distances than by their dimensions. The other problem is that these algorithms always perform a binary separation at each step, but the objects may be better separated into three or more groups directly. kMkNN overcomes the first problem by using object distances as the separation criterion and overcomes the second problem by separating the objects into multiple clusters directly. It may be possible to build a new tree-based algorithm that uses distance criteria and allows multiple children per parent node. If a dataset contains many small clusters and we can organize the clusters in a hierarchical manner, then a multi-level k-means clustering algorithm may be able to further improve the performance of kMkNN.

IV. CONCLUSION

We present a new algorithm, kMkNN, for the nearest neighbor search problem. The algorithm incorporates two simple methods, k-means clustering and the triangle inequality, into the nearest neighbor search, and the experiments show that the algorithm performs well and can be considered an alternative choice for searching nearest neighbors in high dimensional spaces.

One interesting piece of future work is to integrate into kMkNN better k-means clustering algorithms that generate better clusters than Lloyd's algorithm and see whether the performance of kMkNN can be improved further. Another is to see whether we can determine an optimal number of clusters based on the number of training objects, the number of dimensions, and the distribution of the training objects. A further direction is that, since kMkNN and the spatial searching methods may work well on different datasets, it will be interesting to analyze these datasets and try to find correlations between the distribution of the objects in the datasets and the performance of the various algorithms.

Acknowledgment

We thank Dr. Andrew Moore at Carnegie Mellon University for providing the ball-tree program. We thank Erik Lawrence for collecting some datasets and running some experiments.

This work was supported by NIH Grant #P20 RR0116454.

References

[1] Aloise D, Deshpande A, Hansen P, Popat P. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning. 2009;vol. 75(no. 2):245–248. [Google Scholar]

[2] Arthur D, Vassilvitskii S. k-means++: the advantages of careful seeding. Proc. 18th annual ACM-SIAM symposium on Discrete algorithms.2007. pp. 1027–1035. [Google Scholar]

[3] Arya S, Mount DM, Netanyahu NS, Silverman R, Wu AY. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM. 1998;vol. 45(no. 6):891–923. [Google Scholar]

[4] de Berg M, Cheong O, van Kreveld M, Overmars M. Computational geometry: algorithms and applications. 3rd ed Springer-Verlag; New York: 2008. [Google Scholar]

[5] Ciaccia P, Patella M, Zezula P. M-tree: an efficient access method for similarity search in metric spaces; Proc. 23rd International Conference on VLDB.1997. pp. 426–435. [Google Scholar]

[6] Elkan C. Using the triangle inequality to accelerate k-means. Proc. 12th International Conference on Machine Learning.2003. pp. 147–153. [Google Scholar]

[7] Frahling G, Sohler C. A fast k-means implementation using coresets. Proc. 22nd annual symposium on Computational geometry.2006. pp. 135–143. [Google Scholar]

[8] Frank A, Asuncion A. UCI Machine Learning Repository. University of California, School of Information and Computer Science; Irvine, CA: 2010. [ http://archive.ics.uci.edu/ml] [Google Scholar]

[9] Gionis A, Indyk P, Motwani R. Similarity search in high dimensions via hashing. Proc. 25th International Conference on VLDB.1999. pp. 518–529. [Google Scholar]

[10] Goodman JE, O’Rourke J. Handbook of Discrete and Computational Geometry. 2nd ed CRC Press; 2004. ch. 39. [Google Scholar]

[11] Guttman A. R-trees: a dynamic index structure for spatial searching. Proc. 3rd ACM SIGMOD International Conference on Management of Data.1984. pp. 47–57. [Google Scholar]

[12] Guyon IM, Gunn SR, Ben-Hur A, Dror G. Result analysis of the NIPS 2003 feature selection challenge. Advances in Neural Information Processing Systems. 2004 [Google Scholar]

[13] Hart P. The condensed nearest neighbor rule. IEEE Trans. on Inform. Th. 1968;vol. 14:515–516. [Google Scholar]

[14] Inaba M, Katoh N, Imai H. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. Proc. 10th ACM Symposium on Computational Geometry.1994. pp. 332–339. [Google Scholar]

[15] Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY. A local search approximation algorithm for k-means clustering. Computational Geometry: Theory and Applications. 2004;vol. 28:89–112. [Google Scholar]

[16] Kim YJ, Patel JM. Performance comparison of the R-tree and the quadtree for kNN and distance join queries. IEEE Trans. on Knowledge and Data Engineering. 2010;vol. 22(no. 7):1014–1027. [Google Scholar]

[17] Kolahdouzan M, Shahabi C. Voronoi-based k nearest neighbor search for spatial network databases. Proc. 30th International Conference on VLDB.2004. pp. 840–851. [Google Scholar]

[18] Liu T, Moore AW, Gray A. Efficient exact k-NN and nonparametric classification in high dimensions. Proc. of Neural Information Processing Systems. 2003 [Google Scholar]

[19] Lloyd SP. Least squares quantization in PCM. IEEE Trans. on Inform. Th. 1982;vol. 28(no. 2):129–137. [Google Scholar]

[20] Ruggles S, Sobek M. Integrated public use microdata series. version 2.0 Historical Census Projects, University of Minnesota; Minneapolis: 1997. [Google Scholar]

[21] Toussaint GT. Proximity graphs for nearest neighbor decision rules: recent progress. Proc. 34th symposium on computing and statistics.2002. pp. 17–20. [Google Scholar]

[22] Uhlmann JK. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters. 1991;vol. 40:175–179. [Google Scholar]

[23] Wilson DL. Asymptotic properties of nearest neighbor rules using edited data. IEEE Trans. Syst. Man Cybern. 1972;vol. 2:408–420. [Google Scholar]

[24] Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z, Steinbach M, Hand DJ, Steinberg D. Top 10 algorithms in data mining. Knowl. Inf. Syst. 2008;vol. 14:1–37. [Google Scholar]
