CSE 251A Project 1: K-Means Clustering based
Prototype Selection for Nearest Neighbor
Yi Rong
Computer Science & Engineering
University of California, San Diego
rongyi@ucsd.edu
hi@rongyi.ai
Abstract
In this report, we discuss our attempt to select better prototypes for nearest neighbor classification than those produced by random selection. We present a K-Means based prototype selection method that clearly outperforms the naive random baseline in all our experiments.
Keywords: K-NN, K-Means
1 Introduction
The k-Nearest Neighbors method (kNN) [2] is one of the most well-known approaches for classification tasks in Machine Learning. kNN is very simple to implement and is a lazy learning classifier, requiring no training process. However, kNN suffers from several drawbacks, such as low efficiency during inference and high storage requirements. Even in the base case where $k = 1$, we need to store and traverse the entire dataset at least once. That is why several approaches have been suggested to optimize the speed and efficiency of nearest neighbor, among which is prototype selection.
Prototype selection is an approximation for nearest neighbor: the idea is to find a subset of the training data such that the classification accuracy of nearest neighbor on that subset is close to the accuracy on the entire dataset. The accuracy therefore depends on whether the selected samples can represent the entire dataset in terms of their relative distances in the feature space.
The simplest way to do prototype selection is to select $M$ points uniformly at random from the $N > M$ training samples. Our approach is instead to perform K-Means Clustering [1] on each of the 10 labels to divide that label's samples into $M/10$ clusters, and to choose the $M/10$ points in the dataset closest to those cluster centers. The selected $M$ points are our prototypes.
We rst formulate the Nearest Neighbor problem and
Prototype Selection method in Section 2, and then present
our approach using K-Means Clustering in Section 3. Then
we evaluate our results in Section 4. Finally, we conclude the report in Section 5.
2 Problem Formulation
Nearest Neighbor can be described as follows: the training dataset consists of the features $X = \{x_1, x_2, \dots, x_n\}$ and the corresponding labels $Y = \{y_1, y_2, \dots, y_n\}$, with a mapping $f(x_i) = y_i$. Given a new input $x$, we want to find the nearest element to $x$ in the set $X$, denoted $\hat{x}$, and return its label $\hat{y} = f(\hat{x})$. The 1-nearest neighbor (1NN) method can be formulated as follows:

$$\hat{y} = f\Big(\arg\min_{x_i \in X} \|x_i - x\|\Big)$$
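To make the formulation concrete, a brute-force 1NN prediction can be sketched in a few lines of NumPy; this is only an illustrative sketch, and the function and variable names are ours, not part of the method above.

import numpy as np

def predict_1nn(X, Y, x):
    # Euclidean distance from the query x to every training point in X
    dists = np.linalg.norm(X - x, axis=1)
    # index of the nearest training point, i.e. the arg min in the formula above
    nearest = np.argmin(dists)
    return Y[nearest]  # return the label of the nearest neighbor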
Prototype Selection is the process of selecting a subset $\hat{X}$ from $X$ and performing the same 1NN process on that selected set:

$$\hat{y} = f\Big(\arg\min_{x_i \in \hat{X}} \|x_i - x\|\Big), \quad \hat{X} = S(X, Y, M) \subseteq X$$
Here $S(\cdot)$ is the filtering function that performs prototype selection based on the number of target prototypes $M \leq |X|$, the features of the dataset $X$, and the labels of the dataset $Y$; $\|\cdot\|$ is the distance metric of interest in the target scenario.
In this project, we want to find a better $S$ than the naive baseline $S = \mathrm{random\_sample}$, where the resulting prototype subset is $\hat{X} = \mathrm{random\_sample}(X, M)$.
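For reference, this random baseline can be sketched as follows, assuming the features are stored row-wise in a NumPy array; the function name and signature are ours.

import numpy as np

def random_sample(X, M, seed=None):
    # uniformly sample M distinct row indices out of the N training points
    rng = np.random.default_rng(seed)
    return rng.choice(X.shape[0], size=M, replace=False)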
3 K-Means Clustering based Prototype
Selection
The idea behind K-Means Clustering based Prototype Selection is that for each class $Y_i$, we want to pick $M_i$ points from its feature space $X_i$ such that those points, when compared against a new sample in terms of distance, better represent the entire feature space. Here $M = \sum_i M_i$, and in practice we simply take $M_i = M / |Y|$, where $|Y|$ denotes the number of distinct classes. K-Means Clustering is used to partition the feature space of each class into $M_i$ clusters, and we pick the point in $X_i$ closest to each cluster center. Those points together form the final prototypes.
We describe the K-Means Clustering based Prototype Selection process as follows:
Table 1. Error rate (%) with 95% confidence intervals for the two approaches at different prototype counts M

M                      500              1000             2000             5000             10000
Random Selection       15.05 ± 0.411    11.77 ± 0.297    8.82 ± 0.088     6.56 ± 0.172     5.33 ± 0.204
K-Means Clustering     8.28 ± 0.233     7.35 ± 0.267     6.13 ± 0.270     4.97 ± 0.118     4.45 ± 0.218
Algorithm 1 K-Means Clustering based Selection
Input: X, Y, M
Output: X̂
1: X: input dataset features
2: Y: input dataset labels
3: M: number of prototypes to select
4: X̂: output subset of X produced by this procedure
5: C ← Len(Unique(Y))   ▷ count the unique classes of Y
6: N_c ← M / C   ▷ number of clusters for each class
7: for c ∈ Unique(Y) do
8:     X_c ← all features with label c
9:     Centers_c ← KMeans(N_c).fit(X_c).cluster_centers()
10:    p ← pairwise_distances_argmin_min(Centers_c, X)
11:    append p to X̂
12: end for
13: return X̂
In the above algorithm, the KMeans function corresponds to the scikit-learn implementation (built on NumPy [3]): it constructs a K-Means clustering object, its fit method computes the clustering on the given dataset, and the resulting cluster centers can then be retrieved. The pairwise_distances_argmin_min function finds, for each cluster center, the closest point in the dataset; scikit-learn provides it under the same name.
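As a small illustration of these two calls for a single class, the selection step of lines 9-10 looks roughly like the sketch below. The toy data, variable names, and the value of k are ours; only the two scikit-learn calls reflect the actual implementation.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# toy stand-ins: 100 samples of one class inside a 500-sample dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 784))   # full training features
X_c = X[:100]                     # feature vectors of a single class
k = 10                            # clusters (prototypes) for this class

# cluster the class, then map each cluster center back to the index of
# its closest point in the full dataset X (lines 9-10 of Algorithm 1)
centers = KMeans(n_clusters=k, n_init=10).fit(X_c).cluster_centers_
proto_idx, _ = pairwise_distances_argmin_min(centers, X)
print(proto_idx)  # indices of the 10 selected prototypes for this class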
4 Evaluation
4.1 Experiment Results
During our experiments, we ran the 1NN accuracy test with random selection and with K-Means based selection 6 times each, and recorded the error percentage of our classifiers. We ran on the MNIST dataset [4], with 60000 samples in the training set and 10000 in the test set.

We use the error rate $E$ (the percentage of wrong classifications) as the metric to evaluate the two approaches, in which case the accuracy of the classifier is simply $1 - E$:
$$E = \frac{\sum_{(x, y) \in X_{\mathrm{test}}} \mathbb{1}\big[f\big(\arg\min_{x_i \in S(X, Y, M)} \|x_i - x\|\big) \neq y\big]}{|X_{\mathrm{test}}|}$$
We further calculate the 95% confidence interval from the mean and standard deviation of the 6 rounds of results (raw data available in the Appendix) with the following formula:

$$B = \mu \pm t \times \frac{\sigma}{\sqrt{n}}$$

In our case $t \approx 2.776$ for 95% confidence and $n = 5$. Both approaches have a positive standard deviation across multiple runs, since both involve random procedures, so we report confidence intervals alongside our performance results.
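As a quick check of these intervals, the sketch below recomputes the Random Selection, M = 500 interval from the appendix's mean and standard deviation using the constants above; the helper function is ours and was not part of the experiment script.

import numpy as np

def confidence_interval(mean, std, t=2.776, n=5):
    # B = mu +/- t * sigma / sqrt(n), the formula used above
    half_width = t * std / np.sqrt(n)
    return mean - half_width, mean + half_width

lo, hi = confidence_interval(0.150517, 0.003311)
print(lo, hi)  # about 0.1464 to 0.1546, i.e. 15.05% +/- 0.41% as in Table 1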
Figure 1. Error rate (%) of Random Selection vs. K-Means Clustering based selection as the number of prototypes M varies from 500 to 10000.
From Table 1, we can see that the K-Means approach (with $K = M / |\mathrm{classes}|$, i.e., $K = M/10$ on MNIST) outperforms Uniform Random Selection for every $M$. We can further observe in Figure 1 that for small $M$ the improvement of K-Means over random selection is substantial, but as $M$ increases (to 10000) the difference between the two shrinks.
4.2 Future Work
In the above attempt, we use the same number of clusters / prototypes for every label class. For MNIST, where there are only 10 classes, each class is assigned $M/10$ points to represent it. This works well enough to outperform uniformly random selection, but we could further improve the approach by assigning a weighted number of prototypes to each label class. For example, we can count the number of samples for each label and assign prototypes according to how often that label appears in the dataset, as sketched below:

$$M_i = \frac{\sum_{y_k \in Y} \mathbb{1}[y_k = y_i]}{|Y|} \times M$$
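A minimal sketch of this proportional allocation, assuming integer labels in a NumPy array; the helper name and the rounding scheme for leftover prototypes are ours and not specified by the formula.

import numpy as np

def prototypes_per_class(Y, M):
    # count how often each label appears, then allocate M proportionally
    counts = np.bincount(Y)
    alloc = np.floor(counts / len(Y) * M).astype(int)
    # hand the few leftover prototypes to the most frequent classes
    # so that the allocations sum exactly to M
    leftover = M - alloc.sum()
    alloc[np.argsort(counts)[::-1][:leftover]] += 1
    return alloc

For MNIST the classes are nearly balanced, so this reduces to roughly $M/10$ prototypes per class, matching the scheme used above.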
5 Conclusion
In conclusion, we presented an alternative approach to prototype selection for nearest neighbor beyond choosing $M$ points uniformly at random. Utilizing K-Means Clustering and
label-based partitioning, we showed that our approach out-
performs the random one in every experiment.
References
[1]
Khaled Alsabti, Sanjay Ranka, and Vineet Singh. 1997. An efficient k-means clustering algorithm. (1997).
[2]
Thomas Cover and Peter Hart. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 1 (1967), 21-27.
[3]
Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf
Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian
Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus,
Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Hal-
dane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre
Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser,
Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array
programming with NumPy. Nature 585, 7825 (Sept. 2020), 357–362.
hps://doi.org/10.1038/s41586-020-2649-2
[4]
Yann LeCun, Corinna Cortes, and CJ Burges. 2010. MNIST
handwritten digit database. ATT Labs [Online]. Available:
http://yann.lecun.com/exdb/mnist 2 (2010).
A Appendix
A.1 Raw Experiment Data
Raw results of all experiments (6 runs for each combination); values are error rates given as fractions:
Method M Average Error Rate Standard Deviation
Random 500 0.150517 0.003311
Random 1000 0.117667 0.002391
Random 2000 0.088217 0.000711
Random 5000 0.065617 0.001389
Random 10000 0.053300 0.001646
K-Means 500 0.082783 0.001878
K-Means 1000 0.073517 0.002154
K-Means 2000 0.061267 0.002176
K-Means 5000 0.049750 0.000954
K-Means 10000 0.044500 0.001752
A.2 K-Means Clustering Selection Implementation
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

def kmeans_pts(X, Y, M):
    P = []
    Cls = len(np.unique(Y))           # number of distinct classes
    Nc = int(M / Cls)                 # clusters (prototypes) per class
    x_idx = [[] for _ in range(Cls)]
    for i in range(len(Y)):           # group sample indices by label
        x_idx[Y[i]].append(i)
    for x in x_idx:
        # cluster one class's samples and keep the cluster centers
        cc = KMeans(n_clusters=Nc).fit(X[x]).cluster_centers_
        # for each center, the index of the closest point in the full set X
        px, _ = pairwise_distances_argmin_min(cc, X)
        P.append(px)
    P = np.concatenate(P).ravel()
    return P  # indices of the selected prototypes
A.3 Full Source Code for Experiments
import numpy as np
from keras.datasets import mnist
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
import random
import time

def kmeans_pts(X, Y, M):
    P = []
    Cls = len(np.unique(Y))
    Nc = int(M / Cls)
    x_idx = [[] for _ in range(Cls)]
    for i in range(len(Y)):
        x_idx[Y[i]].append(i)
    for x in x_idx:
        cc = KMeans(n_clusters=Nc).fit(X[x]).cluster_centers_
        px, _ = pairwise_distances_argmin_min(cc, X)
        P.append(px)
    P = np.concatenate(P).ravel()
    return P  # prototype result

def rand_pts(X, _Y, M):
    P = random.sample(range(0, X.shape[0]), M)
    return P

# Dataset and pre-processing
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255  # normalize to [0, 1]
x_test /= 255

# Run 6 rounds of experiments with different M for rand and kmeans
runs = 6
Ms = [500, 1000, 2000, 5000, 10000]
methods = ["rand", "kmeans"]

for run in range(runs):
    print("run", run + 1)
    for M in Ms:
        for method in methods:
            print("\t", method, M)
            model = KNeighborsClassifier(n_neighbors=1)
            start = time.time()
            if method == "rand":
                prototypes = rand_pts(x_train, y_train, M)
            else:
                prototypes = kmeans_pts(x_train, y_train, M)
            end = time.time()
            print("\t\tduration", end - start)
            model.fit(x_train[prototypes], y_train[prototypes])
            accuracy = model.score(x_test, y_test)
            print("\t\taccuracy", accuracy)
            print("\t\terror", 1 - accuracy)