CSE 251A Project 1: K-Means Clustering based
Prototype Selection for Nearest Neighbor
Yi Rong (Student ID: A14113185)
Computer Science & Engineering
University of California, San Diego
rongyi@ucsd.edu
hi@rongyi.ai
Abstract
In this report, we discuss our attempt to choose better prototypes for nearest neighbor classification than those obtained by random selection. We present a K-Means based prototype selection method that clearly outperforms naive random selection in all our experiments.
Keywords: K-NN, K-Means
1 Introduction
The k-Nearest Neighbors method (kNN) [2] is one of the most well-known approaches for classification tasks in Machine Learning. kNN is very simple to implement and is a lazy learning classifier, requiring no training process. However, kNN suffers from several drawbacks, such as low efficiency during inference and high storage requirements. Even in the base case where $k = 1$, we need to store the entire dataset and traverse it once per query. That is why several approaches have been suggested to improve the speed and efficiency of nearest neighbor, among which is prototype selection.
Prototype selection is an approximation for nearest neighbor. The idea is to find a subset of the training data such that the classification accuracy of nearest neighbor on that subset is close to the accuracy on the entire dataset. The accuracy therefore depends on whether the selected samples can represent the entire dataset in terms of their relative distances in the feature space.
The simplest way to do prototype selection is to uniformly randomly select $M$ points out of the $N > M$ training samples. Our approach is instead to perform K-Means Clustering [1] on each of the 10 label classes, dividing each class into $M/10$ clusters, and to choose the point in the dataset closest to each cluster center. The selected $M$ points are our prototypes.
We first formulate the Nearest Neighbor problem and the Prototype Selection method in Section 2, and present our approach using K-Means Clustering in Section 3. Then
we evaluate our results in Section 4. Finally, we conclude the report in Section 5.
2 Problem Formulation
Nearest Neighbor can be described as follows: the training dataset consists of features $X = \{x_1, x_2, \dots, x_n\}$ and corresponding labels $Y = \{y_1, y_2, \dots, y_n\}$, with a mapping $f(x_i) = y_i$. Given a new input $x$, we want to find the nearest element to $x$ in the set $X$, denoted $x^*$, and return its label $y^* = f(x^*)$. The 1-nearest neighbor (1NN) method can be formulated as follows:

$$ y^* = f\Big(\arg\min_{x_i \in X} \lVert x_i - x \rVert\Big) $$
Prototype Selection is the process of selecting a subset $X'$ from $X$ and performing the same 1NN process on that selected set:

$$ y^* = f\Big(\arg\min_{x_i \in X'} \lVert x_i - x \rVert\Big), \qquad X' = S(X, Y, M) \subseteq X $$
Here $S(\cdot)$ is the filtering function that performs prototype selection based on the number of target prototypes $M \le |X|$, the features of the dataset $X$, and the labels of the dataset $Y$, and $\lVert \cdot \rVert$ is the distance metric of interest in the target scenario.
In this project, we want to find a better selection function $S$ than the naive baseline $S' = \mathrm{random\_sample}$, for which the resulting prototype subset is $X' = \mathrm{random\_sample}(X, M)$.
3 K-Means Clustering based Prototype
Selection
The idea behind K-Means Clustering based Prototype Selection is that for each class $Y_i$, we want to pick $M_i$ points from its feature space $X_i$ such that those points, when compared against a new sample in terms of distance, better represent the entire feature space. Here $M = \sum_i M_i$, and in practice we simply take $M_i = M / |Y|$, where $|Y|$ denotes the number of distinct classes. K-Means Clustering is used to partition the feature space of each class into $M_i$ clusters, and for each cluster we pick the point in $X_i$ closest to its cluster center. For example, on MNIST with $M = 500$, each of the 10 digit classes is partitioned into 50 clusters and contributes 50 prototypes. Those points together form the final prototypes.
We describe the K-Means Clustering based Prototype Selection process as follows:
Table 1. Error rate (%) for different approaches, with 95% confidence intervals

M                     500             1000            2000           5000           10000
Random Selection      15.05 ± 0.4111  11.77 ± 0.297   8.82 ± 0.088   6.56 ± 0.172   5.33 ± 0.204
K-Means Clustering     8.28 ± 0.233    7.35 ± 0.267   6.13 ± 0.27    4.97 ± 0.118   4.45 ± 0.218
Algorithm 1 K-Means Clustering based Selection
Input: X, Y, M
Output: X̂
 1: X: input dataset features
 2: Y: input dataset labels
 3: M: number of prototypes to sample
 4: X̂: output subset of X produced by this procedure
 5: C ← Len(Unique(Y))                          ⊲ count the unique classes of Y
 6: N_C ← M / C                                 ⊲ number of clusters for each class
 7: for each class c in Unique(Y) do
 8:     X_c ← all features with label c
 9:     Centers_c ← KMeans(N_C).fit(X_c).cluster_centers()
10:     p ← pairwise_distances_argmin_min(Centers_c, X)
11:     append p to X̂
12: end for
13: return X̂
In the above algorithm, the KMeans function corresponds to the scikit-learn implementation (our code also relies on NumPy [3] for array manipulation): it constructs a K-Means clustering object, its fit function computes the clustering on the given dataset, and the resulting cluster centers can then be read back. The pairwise_distances_argmin_min function, also provided by scikit-learn under the same name, is used to find the point closest to each cluster center.
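As a minimal, self-contained usage sketch of these two scikit-learn calls (the arrays below are random placeholders, not MNIST data):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

X = np.random.rand(1000, 784).astype("float32")  # stand-in for the full training set
Xc = X[:100]                                     # stand-in for the samples of one class
Nc = 10                                          # number of clusters for this class

# fit K-Means on the class samples and read back the cluster centers
centers = KMeans(n_clusters=Nc).fit(Xc).cluster_centers_
# for each center: index of, and distance to, its closest point in X
idx, dist = pairwise_distances_argmin_min(centers, X)
print(idx.shape, dist.shape)  # (10,) (10,)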
4 Evaluation
4.1 Experiment Results
During our experiments, we ran the 1NN accuracy test with random selection and with K-Means based selection 6 times each, and recorded the error percentage of the resulting classifiers. We ran on the MNIST dataset [4], with 60000 samples in the training set and 10000 in the test set.
We use the error rate $E$ (the percentage of wrong classifications) as our metric to evaluate the two approaches; the accuracy of a classifier is simply $1 - E$:

$$ E = \frac{\sum_{(x, y) \in X_{\mathrm{test}}} \mathbf{1}\!\left[ f\Big(\arg\min_{x_i \in S(X, Y, M)} \lVert x_i - x \rVert\Big) \neq y \right]}{|X_{\mathrm{test}}|} $$
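In code, this metric is just the fraction of mismatched test labels; a minimal sketch, assuming y_pred and y_test are NumPy label arrays (our experiment script computes the same quantity as 1 - model.score(x_test, y_test), see Appendix A.3):

import numpy as np

def error_rate(y_pred, y_test):
    # fraction of test samples whose predicted label differs from the true label
    return np.mean(y_pred != y_test)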
We further calculate 95% confidence intervals from the mean and standard deviation of the 6 rounds of results (raw data available in the Appendix) with the following formula:

$$ B = \mu \pm t \times \frac{\sigma}{\sqrt{n}} $$

In our case $t \approx 2.776$ for 95% confidence and $n = 5$. Both approaches have a positive standard deviation across multiple runs, meaning their selection procedures involve randomness, so we report confidence intervals alongside our performance results.
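As a quick worked check, plugging the Random Selection (M = 500) mean and standard deviation from Appendix A.1 into this formula reproduces the corresponding interval in Table 1; a minimal sketch using the t and n values above:

import numpy as np

mu, sigma = 0.150517, 0.003311   # Random Selection, M = 500 (Appendix A.1)
t, n = 2.776, 5                  # values used in the text
half_width = t * sigma / np.sqrt(n)
print(f"{100 * mu:.2f}% +/- {100 * half_width:.2f}%")  # approximately 15.05% +/- 0.41%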
[Figure 1 chart: "Error Rate: Random vs K-Means", error rate (%) vs. M for the rand and kmeans selectors, M = 500 to 10000]
Figure 1. Performance comparison between Random Selection and K-Means Clustering
From Table 1, we can see that the K-Means approach (with $K = M / |\mathrm{classes}|$, i.e., $K = M/10$ on MNIST) outperforms Uniform Random Selection for every $M$. We can further observe in Figure 1 that for small $M$ the improvement of K-Means over random selection is pronounced, but as $M$ increases (to 10000) the difference between the two shrinks.
4.2 Future Work
In the above attempt, we used the same number of clusters / prototypes for every label class. For MNIST, where there are only 10 classes, each class is assigned $M/10$ points to represent it. This works well enough to outperform uniformly random selection, but we could further improve the approach by assigning a weighted number of prototypes to each label class. For example, we can count the number of samples for each label and assign prototypes according to how often that label appears in the dataset:

$$ M_i = \frac{\sum_{y_k \in Y} \mathbf{1}[y_k = y_i]}{|Y|} \times M $$
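A minimal sketch of this class-proportional allocation (we have not evaluated it; the remainder handling below is one possible choice), assuming Y is an integer label array:

import numpy as np

def proportional_allocation(Y, M):
    # M_i proportional to how often class y_i appears in Y
    classes, counts = np.unique(Y, return_counts=True)
    M_i = np.floor(M * counts / counts.sum()).astype(int)
    # hand out any remainder to the most frequent classes so that sum(M_i) == M
    remainder = M - M_i.sum()
    M_i[np.argsort(-counts)[:remainder]] += 1
    return dict(zip(classes, M_i))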
5 Conclusion
In conclusion, we presented an alternative approach to prototype selection for nearest neighbor, as opposed to choosing $M$ points uniformly at random. Utilizing K-Means Clustering and
label-based partitioning, we showed that our approach outperforms the random one in every experiment.
References
[1]
Khaled Alsabti, Sanjay Ranka, and Vineet Singh. 1997. An efficient k-means clustering algorithm. (1997).
[2]
Thomas Cover and Peter Hart. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 1 (1967), 21–27.
[3]
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357–362. https://doi.org/10.1038/s41586-020-2649-2
[4]
Yann LeCun, Corinna Cortes, and CJ Burges. 2010. MNIST
handwritten digit database. ATT Labs [Online]. Available:
http://yann.lecun.com/exdb/mnist 2 (2010).
A Appendix
A.1 Raw Experiment Data
Raw results of all experiments (6 runs for each combination):
Method    M      Average    Standard Deviation
Random    500    0.150517   0.003311
Random    1000   0.117667   0.002391
Random    2000   0.088217   0.000711
Random    5000   0.065617   0.001389
Random    10000  0.053300   0.001646
K-Means   500    0.082783   0.001878
K-Means   1000   0.073517   0.002154
K-Means   2000   0.061267   0.002176
K-Means   5000   0.049750   0.000954
K-Means   10000  0.044500   0.001752
A.2 K-Means Clustering Selection Implementation
def kmeans_pts(X, Y, M):
    P = []
    Cls = len(np.unique(Y))           # number of classes
    Nc = int(M / Cls)                 # clusters (prototypes) per class
    x_idx = [[] for _ in range(Cls)]
    for i in range(len(Y)):           # group sample indices by label
        x_idx[Y[i]].append(i)
    for x in x_idx:
        # cluster one class and keep its cluster centers
        cc = KMeans(n_clusters=Nc).fit(X[x]).cluster_centers_
        # index of the training point closest to each center
        px, _ = pairwise_distances_argmin_min(cc, X)
        P.append(px)
    P = np.concatenate(P).ravel()
    return P  # indices of the selected prototypes
A.3 Full Source Code for Experiments
import numpy as np
from keras.datasets import mnist
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
import random
import time


def kmeans_pts(X, Y, M):
    P = []
    Cls = len(np.unique(Y))           # number of classes
    Nc = int(M / Cls)                 # clusters (prototypes) per class
    x_idx = [[] for _ in range(Cls)]
    for i in range(len(Y)):           # group sample indices by label
        x_idx[Y[i]].append(i)
    for x in x_idx:
        # cluster one class and keep its cluster centers
        cc = KMeans(n_clusters=Nc).fit(X[x]).cluster_centers_
        # index of the training point closest to each center
        px, _ = pairwise_distances_argmin_min(cc, X)
        P.append(px)
    P = np.concatenate(P).ravel()
    return P  # indices of the selected prototypes


def rand_pts(X, _Y, M):
    # uniformly sample M distinct indices from the training set
    P = random.sample(range(0, X.shape[0]), M)
    return P


# Dataset and pre-processing
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255  # normalize to [0, 1]
x_test /= 255

# Run 6 rounds of experiments with different M for rand and kmeans
runs = 6
Ms = np.array([500, 1000, 2000, 5000, 10000], dtype=int)
types = np.array(["rand", "kmeans"], dtype=object)

for run in range(runs):
    print("run", run + 1)
    for idx, ss in enumerate(Ms):
        for type in types:
            print("\t", type, ss)
            model = KNeighborsClassifier(n_neighbors=1)
            start = time.time()
            if type == "rand":
                prototypes = rand_pts(x_train, y_train, Ms[idx])
            else:
                prototypes = kmeans_pts(x_train, y_train, Ms[idx])
            end = time.time()
            print("\t\tduration", end - start)
            model.fit(x_train[prototypes], y_train[prototypes])
            print("\t\taccuracy", model.score(x_test, y_test))
            print("\t\terror", 1 - model.score(x_test, y_test))