CSE 251A Project 1: K-Means Clustering based
Prototype Selection for Nearest Neighbor
Yi Rong
Computer Science & Engineering
University of California, San Diego
rongyi@ucsd.edu
hi@rongyi.ai
Abstract
In this report, we discuss our attempt to select better prototypes for nearest neighbor classification than those produced by random selection. We present a K-Means based prototype selection method that clearly outperforms the naive random baseline in all our experiments.
Keywords: K-NN, K-Means
1 Introduction
The k-Nearest Neighbors method (kNN) [2] is one of the most well-known approaches for classification tasks in Machine Learning. kNN is very simple to implement and is a lazy learning classifier, requiring no training process. However, kNN suffers from several drawbacks, such as low efficiency during inference and high storage requirements. Even in the base case where $k = 1$, we need to store and traverse the entire dataset at least once. That is why several approaches have been suggested to optimize the speed and efficiency of nearest neighbor, among which is prototype selection.
Prototype selection is an approximation for nearest neighbor: the idea is to find a subset of the training data such that the classification accuracy of nearest neighbor on that subset is close to the accuracy on the entire dataset. The accuracy therefore depends on whether the selected samples can represent the entire dataset in terms of their relative distances in the feature space.
The simplest way to do prototype selection is to select $M$ points uniformly at random from the $N > M$ training samples. Our approach is instead to perform K-Means Clustering [1] on each of the 10 labels to divide that label's samples into $M/10$ clusters, and to choose the $M/10$ points in the dataset closest to those cluster centers. The selected $M$ points are our prototypes.
We rst formulate the Nearest Neighbor problem and
Prototype Selection method in Section 2, and then present
our approach using K-Means Clustering in Section 3. Then
we evaluate our results in Section 4. Finally, we conclude the report in Section 5.
2 Problem Formulation
Nearest Neighbor can be described as follows: the training dataset consists of the features $X = \{x_1, x_2, \dots, x_n\}$ and the corresponding labels $Y = \{y_1, y_2, \dots, y_n\}$, with a mapping $f(x_i) = y_i$. Given a new input $x$, we want to find the nearest element to $x$ in the set $X$, denoted $\hat{x}$, and return its label $\hat{y} = f(\hat{x})$. The 1-nearest neighbor (1NN) method can be formulated as follows:

$$\hat{y} = f\Big(\arg\min_{x_i \in X} \|x_i - x\|\Big)$$
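To make the formulation concrete, a brute-force 1NN prediction can be sketched in a few lines of NumPy; this is only an illustrative sketch, and the function and variable names are ours, not part of the method above.

import numpy as np

def predict_1nn(X, Y, x):
    # Euclidean distance from the query x to every training point in X
    dists = np.linalg.norm(X - x, axis=1)
    # index of the nearest training point, i.e. the arg min in the formula above
    nearest = np.argmin(dists)
    return Y[nearest]  # return the label of the nearest neighbor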
Prototype Selection is the process of selecting a subset $\hat{X}$ from $X$ and performing the same 1NN process on that selected set:

$$\hat{y} = f\Big(\arg\min_{x_i \in \hat{X}} \|x_i - x\|\Big), \quad \hat{X} = S(X, Y, M) \subseteq X$$
Here $S(\cdot)$ is the filtering function that performs prototype selection based on the number of target prototypes $M \leq |X|$, the features of the dataset $X$, and the labels of the dataset $Y$; $\|\cdot\|$ is the distance metric of interest in the target scenario.
In this project, we want to find a better $S$ than the naive baseline $S = \mathrm{random\_sample}$, where the resulting prototype subset is $\hat{X} = \mathrm{random\_sample}(X, M)$.
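For reference, this random baseline can be sketched as follows, assuming the features are stored row-wise in a NumPy array; the function name and signature are ours.

import numpy as np

def random_sample(X, M, seed=None):
    # uniformly sample M distinct row indices out of the N training points
    rng = np.random.default_rng(seed)
    return rng.choice(X.shape[0], size=M, replace=False)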
3 K-Means Clustering based Prototype
Selection
The idea behind K-Means Clustering based Prototype Selection is that for each class $Y_i$, we want to pick $M_i$ points from its feature space $X_i$ such that those points, when compared against a new sample in terms of distance, better represent the entire feature space. Here $M = \sum_i M_i$, and in practice we simply take $M_i = M / |Y|$, where $|Y|$ denotes the number of distinct classes. K-Means Clustering is used to partition the feature space of each class into $M_i$ clusters, and we pick the point in $X_i$ closest to each cluster center. Those points together form the final prototypes.
We describe the K-Means Clustering based Prototype Selection process as follows:
Table 1. Error rate (%) with 95% confidence intervals for the two approaches at different prototype counts M

M                      500              1000             2000             5000             10000
Random Selection       15.05 ± 0.411    11.77 ± 0.297    8.82 ± 0.088     6.56 ± 0.172     5.33 ± 0.204
K-Means Clustering     8.28 ± 0.233     7.35 ± 0.267     6.13 ± 0.270     4.97 ± 0.118     4.45 ± 0.218
Algorithm 1 K-Means Clustering based Selection
Input: X, Y, M
Output: X̂
1: X: input dataset features
2: Y: input dataset labels
3: M: number of prototypes to select
4: X̂: output subset of X produced by this procedure
5: C ← Len(Unique(Y))   ▷ count the unique classes of Y
6: N_c ← M / C   ▷ number of clusters for each class
7: for c ∈ Unique(Y) do
8:     X_c ← all features with label c
9:     Centers_c ← KMeans(N_c).fit(X_c).cluster_centers()
10:    p ← pairwise_distances_argmin_min(Centers_c, X)
11:    append p to X̂
12: end for
13: return X̂
In the above algorithm, the KMeans function corresponds to the scikit-learn implementation (built on NumPy [3]): it constructs a K-Means clustering object, its fit method computes the clustering on the given dataset, and the resulting cluster centers can then be retrieved. The pairwise_distances_argmin_min function finds, for each cluster center, the closest point in the dataset; scikit-learn provides it under the same name.
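As a small illustration of these two calls for a single class, the selection step of lines 9-10 looks roughly like the sketch below. The toy data, variable names, and the value of k are ours; only the two scikit-learn calls reflect the actual implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# toy stand-ins: 100 samples of one class inside a 500-sample dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 784))   # full training features
X_c = X[:100]                     # feature vectors of a single class
k = 10                            # clusters (prototypes) for this class

# cluster the class, then map each cluster center back to the index of
# its closest point in the full dataset X (lines 9-10 of Algorithm 1)
centers = KMeans(n_clusters=k, n_init=10).fit(X_c).cluster_centers_
proto_idx, _ = pairwise_distances_argmin_min(centers, X)
print(proto_idx)  # indices of the 10 selected prototypes for this class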
4 Evaluation
4.1 Experiment Results
During our experiments, we ran the 1NN accuracy test with random selection and with K-Means based selection 6 times each, and recorded the error percentage of our classifiers. We ran on the MNIST dataset [4], with 60000 samples in the training set and 10000 in the test set.

We use the error rate $E$ (the percentage of wrong classifications) as the metric to evaluate the two approaches, in which case the accuracy of the classifier is simply $1 - E$:
$$E = \frac{\sum_{(x, y) \in X_{\mathrm{test}}} \mathbb{1}\big[f\big(\arg\min_{x_i \in S(X, Y, M)} \|x_i - x\|\big) \neq y\big]}{|X_{\mathrm{test}}|}$$
We further calculate the 95% confidence interval from the mean and standard deviation of the 6 rounds of results (raw data available in the Appendix) with the following formula:

$$B = \mu \pm t \times \frac{\sigma}{\sqrt{n}}$$

In our case $t \approx 2.776$ for 95% confidence and $n = 5$. Both approaches have a positive standard deviation across multiple runs, since both involve random procedures, so we report confidence intervals alongside our performance results.
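As a quick check of these intervals, the sketch below recomputes the Random Selection, M = 500 interval from the appendix's mean and standard deviation using the constants above; the helper function is ours and was not part of the experiment script.

import numpy as np

def confidence_interval(mean, std, t=2.776, n=5):
    # B = mu +/- t * sigma / sqrt(n), the formula used above
    half_width = t * std / np.sqrt(n)
    return mean - half_width, mean + half_width

lo, hi = confidence_interval(0.150517, 0.003311)
print(lo, hi)  # about 0.1464 to 0.1546, i.e. 15.05% +/- 0.41% as in Table 1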
Figure 1. Error rate (%) of Random Selection vs. K-Means Clustering based selection as the number of prototypes M varies from 500 to 10000.
From Table 1, we can see that the K-Means approach (with $K = M / |\mathrm{classes}|$, i.e., $K = M/10$ on MNIST) outperforms Uniform Random Selection for every $M$. We can further observe in Figure 1 that for small $M$ the improvement of K-Means over random selection is substantial, but as $M$ increases (to 10000) the difference between the two shrinks.
4.2 Future Work
In the above attempt, we use the same number of clusters / prototypes for every label class. For MNIST, where there are only 10 classes, each class is assigned $M/10$ points to represent it. This works well enough to outperform uniformly random selection, but we could further improve the approach by assigning a weighted number of prototypes to each label class. For example, we can count the number of samples for each label and assign prototypes according to how often that label appears in the dataset, as sketched below:

$$M_i = \frac{\sum_{y_k \in Y} \mathbb{1}[y_k = y_i]}{|Y|} \times M$$
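A minimal sketch of this proportional allocation, assuming integer labels in a NumPy array; the helper name and the rounding scheme for leftover prototypes are ours and not specified by the formula.

import numpy as np

def prototypes_per_class(Y, M):
    # count how often each label appears, then allocate M proportionally
    counts = np.bincount(Y)
    alloc = np.floor(counts / len(Y) * M).astype(int)
    # hand the few leftover prototypes to the most frequent classes
    # so that the allocations sum exactly to M
    leftover = M - alloc.sum()
    alloc[np.argsort(counts)[::-1][:leftover]] += 1
    return alloc

For MNIST the classes are nearly balanced, so this reduces to roughly $M/10$ prototypes per class, matching the scheme used above.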
5 Conclusion
In conclusion, we presented an alternative approach to prototype selection for nearest neighbor beyond choosing $M$ points uniformly at random. Utilizing K-Means Clustering and
label-based partitioning, we showed that our approach out-
performs the random one in every experiment.
References
[1]
Khaled Alsabti, Sanjay Ranka, and Vineet Singh. 1997. An efficient k-means clustering algorithm. (1997).
[2]
Thomas Cover and Peter Hart. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 1 (1967), 21-27.
[3]
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf
Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian
Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus,
Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Hal-
dane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre
Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser,
Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array
programming with NumPy. Nature 585, 7825 (Sept. 2020), 357–362.
hps://doi.org/10.1038/s41586-020-2649-2
[4]
Yann LeCun, Corinna Cortes, and CJ Burges. 2010. MNIST
handwritten digit database. ATT Labs [Online]. Available:
http://yann.lecun.com/exdb/mnist 2 (2010).
A Appendix
A.1 Raw Experiment Data
Raw results of all experiments (6 runs for each combination); values are error rates given as fractions:
Method M Average Error Rate Standard Deviation
Random 500 0.150517 0.003311
Random 1000 0.117667 0.002391
Random 2000 0.088217 0.000711
Random 5000 0.065617 0.001389
Random 10000 0.053300 0.001646
K-Means 500 0.082783 0.001878
K-Means 1000 0.073517 0.002154
K-Means 2000 0.061267 0.002176
K-Means 5000 0.049750 0.000954
K-Means 10000 0.044500 0.001752
A.2 K-Means Clustering Selection Implementation
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def kmeans_pts(X, Y, M):
    P = []
    Cls = len(np.unique(Y))           # number of distinct classes
    Nc = int(M / Cls)                 # clusters (prototypes) per class
    x_idx = [[] for _ in range(Cls)]
    for i in range(len(Y)):           # group sample indices by label
        x_idx[Y[i]].append(i)
    for x in x_idx:
        # cluster one class's samples and keep the cluster centers
        cc = KMeans(n_clusters=Nc).fit(X[x]).cluster_centers_
        # for each center, the index of the closest point in the full set X
        px, _ = pairwise_distances_argmin_min(cc, X)
        P.append(px)
    P = np.concatenate(P).ravel()
    return P  # indices of the selected prototypes
A.3 Full Source Code for Experiments
import numpy as np
from keras.datasets import mnist
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
import random
import time

def kmeans_pts(X, Y, M):
    P = []
    Cls = len(np.unique(Y))
    Nc = int(M / Cls)
    x_idx = [[] for _ in range(Cls)]
    for i in range(len(Y)):
        x_idx[Y[i]].append(i)
    for x in x_idx:
        cc = KMeans(n_clusters=Nc).fit(X[x]).cluster_centers_
        px, _ = pairwise_distances_argmin_min(cc, X)
        P.append(px)
    P = np.concatenate(P).ravel()
    return P  # prototype result

def rand_pts(X, _Y, M):
    P = random.sample(range(0, X.shape[0]), M)
    return P

# Dataset and pre-processing
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255  # normalize to [0, 1]
x_test /= 255

# Run 6 rounds of experiments with different M for rand and kmeans
runs = 6
Ms = [500, 1000, 2000, 5000, 10000]
methods = ["rand", "kmeans"]

for run in range(runs):
    print("run", run + 1)
    for M in Ms:
        for method in methods:
            print("\t", method, M)
            model = KNeighborsClassifier(n_neighbors=1)
            start = time.time()
            if method == "rand":
                prototypes = rand_pts(x_train, y_train, M)
            else:
                prototypes = kmeans_pts(x_train, y_train, M)
            end = time.time()
            print("\t\tduration", end - start)
            model.fit(x_train[prototypes], y_train[prototypes])
            accuracy = model.score(x_test, y_test)
            print("\t\taccuracy", accuracy)
            print("\t\terror", 1 - accuracy)