CSE 251A Project 1: K-Means Clustering based Prototype Selection for Nearest Neighbor

Yi Rong†

Computer Science & Engineering

University of California, San Diego

rongyi@ucsd.edu

hi@rongyi.ai

Abstract

In this report, we discuss our attempt to choose better prototypes for nearest neighbor classification than those obtained by random selection. We present a K-Means based prototype selection method that clearly outperforms naive random selection in all our experiments.

Keywords: K-NN, K-Means

1 Introduction

The k-Nearest Neighbors method (kNN) [2] is one of the most well-known approaches for classification tasks in Machine Learning. kNN is very simple to implement and is a lazy-learning classifier, requiring no training process to work. However, kNN suffers from several drawbacks, such as low efficiency during inference and high storage requirements. Even in the base case where k = 1, we need to store and traverse the entire dataset at least once. That is why several approaches have been suggested to improve the speed and efficiency of nearest neighbor, among which is prototype selection.

Prototype selection is an approximation to nearest neighbor; the idea is to find a subset of the training data such that the classification accuracy of nearest neighbor on that subset is close to that on the entire dataset. The accuracy therefore depends on whether the selected samples can represent the entire dataset in terms of their relative distances in the feature space.

The simplest way to do prototype selection is to uniformly sample M points at random from the N > M training samples. Our approach is instead to perform K-Means Clustering [1] separately on each of the 10 label classes, dividing each class into M/10 clusters, and to choose the M/10 points in the dataset closest to those cluster centers. The selected M points are our prototypes.

We first formulate the Nearest Neighbor problem and the Prototype Selection method in Section 2, and then present our approach using K-Means Clustering in Section 3. We evaluate our results in Section 4 and conclude the report in Section 5.

† Student ID: A14113185. Author's address: Yi Rong, Computer Science & Engineering, University of California, San Diego, rongyi@ucsd.edu, hi@rongyi.ai.

2 Problem Formulation

Nearest Neighbor can be described as follows. The training dataset consists of features X = {x_1, x_2, ..., x_N} and corresponding labels Y = {y_1, y_2, ..., y_N}, with a mapping f(x_i) = y_i. Given a new input x, we want to find the nearest element to x in the set X, denoted x*, and return its label y* = f(x*). The 1-nearest neighbor (1NN) method can be formulated as follows:

\[ y^* = f\big(\arg\min_{x_i \in X} \lVert x_i - x \rVert\big) \]
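For illustration, the 1NN rule above can be sketched in a few lines of NumPy (a minimal sketch with hypothetical names, not part of the implementation in the Appendix):

import numpy as np

def predict_1nn(x, X, Y):
    # Euclidean distance from the query x to every stored sample in X.
    dists = np.linalg.norm(X - x, axis=1)
    # Return the label of the nearest stored sample.
    return Y[np.argmin(dists)]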

Prototype Selection is the process of selecting a subset X' from X and performing the same 1NN process on that selected set:

\[ y^* = f\big(\arg\min_{x_i \in X'} \lVert x_i - x \rVert\big), \qquad X' = S(M, X, Y) \subseteq X \]

Here S(·) is the filtering function that performs prototype selection based on the number of target prototypes M ≤ |X|, the features X of the dataset, and the labels Y of the dataset, and ||·|| is the distance metric of interest in the target scenario.

In this project, we want to find a better S than the naive baseline S = random_sample, where the resulting prototype subset is X' = random_sample(X, M).

3 K-Means Clustering based Prototype Selection

The idea behind K-Means Clustering based Prototype Selection is that for each class c, we want to pick N_c points from its feature space X_c such that those points, when compared against a new sample in terms of distance, represent the entire feature space well. Here M = Σ_c N_c, and in practice we simply take N_c = M/|C|, where |C| is the number of classes. K-Means Clustering is used to partition each class's feature space into N_c clusters, and for each cluster we pick the point in X_c closest to the cluster center. Those points together form the final prototypes.

We describe the K-Means Clustering based Prototype Selection process in Algorithm 1.


Table 1. Error rate (%) for different approaches, reported as mean ± 95% confidence interval

M                     500              1000             2000            5000            10000
Random Selection      15.05 ± 0.4111   11.77 ± 0.297    8.82 ± 0.088    6.56 ± 0.172    5.33 ± 0.204
K-Means Clustering     8.28 ± 0.233     7.35 ± 0.267    6.13 ± 0.27     4.97 ± 0.118    4.45 ± 0.218

Algorithm 1 K-Means Clustering based Selection
Input: X, Y, M
Output: X̂
1: X: input dataset features
2: Y: input dataset labels
3: M: number of prototypes to sample
4: X̂: output subset of X produced by this procedure
5: C ← Len(Unique(Y))                    ▷ count the unique classes of Y
6: N_C ← M / C                           ▷ number of clusters for each class
7: for each class c in Unique(Y) do
8:     X_c ← all features with label c
9:     Centers_c ← KMeans(N_C).fit(X_c).cluster_centers()
10:    P ← pairwise_distances_argmin_min(Centers_c, X)
11:    append P to X̂
12: end for
13: return X̂

In the above algorithm, the KMeans function corresponds to the scikit-learn implementation used in our code (Appendix A.2): it constructs a K-Means clustering object, its fit method computes the clustering on the given data, and the resulting cluster centers can then be read off. The pairwise_distances_argmin_min function, also provided by scikit-learn under the same name, is used to find the point closest to each cluster center. NumPy [3] is used for the supporting array operations.
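For concreteness, the per-class selection step boils down to two scikit-learn calls (a condensed sketch of Appendix A.2 using synthetic stand-in data; the array names and sizes here are ours):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 784))     # stand-in for the full training features
X_c = X[:100]                       # stand-in for the features of one class
Nc = 5                              # clusters (= prototypes) for this class

# Cluster the class features and keep the Nc cluster centers.
centers = KMeans(n_clusters=Nc, n_init=10).fit(X_c).cluster_centers_
# Index of the training point closest to each center; these become prototypes.
proto_idx, _ = pairwise_distances_argmin_min(centers, X)
print(proto_idx)                    # Nc indices into X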

4 Evaluation

4.1 Experiment Results

During our experiments, we ran the 1NN accuracy test with random selection and with K-Means based selection 6 times each, and recorded the error percentage of the resulting classifiers. We ran on the MNIST dataset [4], with 60000 samples in the training set and 10000 in the test set.

We use the error rate E (the percentage of wrong classifications) as our metric to evaluate the two approaches, in which case the accuracy of the classifier is simply 1 − E:

\[ E = \frac{\sum_{(x, y) \in X_{\mathrm{test}}} \mathbb{1}\big[\, f\big(\arg\min_{x_i \in S(M, X, Y)} \lVert x_i - x \rVert\big) \neq y \,\big]}{|X_{\mathrm{test}}|} \]
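Equivalently, given arrays of predicted and true test labels, E is just the fraction of mismatches (a tiny NumPy sketch; the example arrays are hypothetical):

import numpy as np

y_pred = np.array([7, 2, 1, 0, 4])   # hypothetical 1NN predictions
y_true = np.array([7, 2, 1, 0, 9])   # corresponding test labels
E = np.mean(y_pred != y_true)        # error rate, here 0.2 (i.e. 20%)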

We further calculate the 95% confidence interval from the mean and standard deviation of the 6 rounds of results (raw data available in the Appendix) with the following formula:

\[ B = \bar{x} \pm t \times \frac{s}{\sqrt{n}} \]

In our case t ≈ 2.776 for 95% confidence and n = 5. Both approaches have a positive standard deviation across multiple runs, meaning they involve some random procedures; we therefore provide confidence intervals with our performance results.
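As a sanity check, the interval half-width can be computed directly from the raw numbers in Appendix A.1 (a minimal sketch using the t and n values stated above; shown here for Random selection at M = 500):

import numpy as np

mean, std = 0.150517, 0.003311       # Random, M = 500, from Appendix A.1
t, n = 2.776, 5                      # values used for the 95% interval
half_width = t * std / np.sqrt(n)
print(f"{100 * mean:.2f}% +/- {100 * half_width:.3f}%")   # ~15.05% +/- 0.411%, as in Table 1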

Figure 1. Error Rate: Random vs K-Means. Performance comparison between Random Selection and K-Means Clustering (error rate in % against M, for both methods).

From Table 1, we can see that the K-Means approach (with K equal to M/|classes|, i.e. K = M/10 on MNIST) outperforms Uniform Random Selection for every M. We can further observe in Figure 1 that for small M the improvement of K-Means over random selection is pronounced, but as M increases (towards 10000) the difference between the two shrinks.

4.2 Future Work

In our above attempt, we use a same number of clusters /

prototypes for all label classes. For the MNIST cases where

thereβs only 10 classes, each class is assigned

π

10

points to

represent that class. This works well enough to outperform

the uniformly random selection, but we can further improve

the approach by sampling a weighted number of points for

each label class. For example, we can count the number of

samples for each labels, and assign number of prototypes

according to how often they appear in the dataset:

π

π

=

Γ

π¦

π

βπ

π¦

π

== π¦

π

|π |

Γ π,
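A minimal sketch of this proportional allocation (the helper name is ours and is not part of the implementation in the Appendix):

import numpy as np

def prototypes_per_class(Y, M):
    # Allocate prototypes proportionally to each class's frequency in Y.
    classes, counts = np.unique(Y, return_counts=True)
    alloc = np.round(counts / len(Y) * M).astype(int)
    # Note: rounding may make the total differ slightly from M.
    return dict(zip(classes.tolist(), alloc.tolist()))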

5 Conclusion

In conclusion, we presented an alternative approach to prototype selection for nearest neighbor beyond choosing M points uniformly at random. Utilizing K-Means Clustering and label-based partitioning, we showed that our approach outperforms the random baseline in every experiment.

References

[1] Khaled Alsabti, Sanjay Ranka, and Vineet Singh. 1997. An efficient k-means clustering algorithm. (1997).
[2] Thomas Cover and Peter Hart. 1967. Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, 1 (1967), 21-27.
[3] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature 585, 7825 (Sept. 2020), 357-362. https://doi.org/10.1038/s41586-020-2649-2
[4] Yann LeCun, Corinna Cortes, and CJ Burges. 2010. MNIST handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist 2 (2010).

A Appendix

A.1 Experiment Raw Data

Raw results of all experiments (6 runs for each combination):

Method M Average Standard Deviation

Random 500 0.150517 0.003311

Random 1000 0.117667 0.002391

Random 2000 0.088217 0.000711

Random 5000 0.065617 0.001389

Random 10000 0.053300 0.001646

K-Means 500 0.082783 0.001878

K-Means 1000 0.073517 0.002154

K-Means 2000 0.061267 0.002176

K-Means 5000 0.049750 0.000954

K-Means 10000 0.044500 0.001752

A.2 K-Means Clustering Selection Implementation

def kmeans_pts(X, Y, M):
    P = []
    Cls = len(np.unique(Y))
    Nc = int(M / Cls)
    x_idx = [[] for _ in range(Cls)]
    for i in range(len(Y)):
        x_idx[Y[i]].append(i)
    for x in x_idx:
        cc = KMeans(n_clusters=Nc).fit(X[x]).cluster_centers_
        px, _ = pairwise_distances_argmin_min(cc, X)
        P.append(px)
    P = np.concatenate(P).ravel()
    return P  # prototype result

A.3 Full Source Code for Experiments

import numpy as np
from keras.datasets import mnist
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min
import random
import time

def kmeans_pts(X, Y, M):
    P = []
    Cls = len(np.unique(Y))
    Nc = int(M / Cls)
    x_idx = [[] for _ in range(Cls)]
    for i in range(len(Y)):
        x_idx[Y[i]].append(i)
    for x in x_idx:
        cc = KMeans(n_clusters=Nc).fit(X[x]).cluster_centers_
        px, _ = pairwise_distances_argmin_min(cc, X)
        P.append(px)
    P = np.concatenate(P).ravel()
    return P  # prototype result

def rand_pts(X, _Y, M):
    P = random.sample(range(0, X.shape[0]), M)
    return P

# Dataset and Pre-Processing
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255  # normalize [0, 1]
x_test /= 255

# Run 6 rounds of experiments with different M for rand and kmeans
runs = 6
Ms = np.array([500, 1000, 2000, 5000, 10000], dtype=int)
types = np.array(["rand", "kmeans"], dtype=object)

for run in range(runs):
    print("run", run + 1)


    for idx, ss in enumerate(Ms):
        for type in types:
            print("\t", type, ss)
            model = KNeighborsClassifier(n_neighbors=1)
            start = time.time()
            if type == "rand":
                prototypes = rand_pts(x_train, y_train, Ms[idx])
            else:
                prototypes = kmeans_pts(x_train, y_train, Ms[idx])
            end = time.time()
            print("\t\tduration", end - start)
            model.fit(x_train[prototypes], y_train[prototypes])
            acc = model.score(x_test, y_test)
            print("\t\taccuracy", acc)
            print("\t\terror", 1 - acc)
