0% found this document useful (0 votes)
15 views62 pages

Fast Greedy K-Means Algorithm

This document provides an introduction and overview of a master's thesis that presents a fast greedy k-means clustering algorithm. The algorithm aims to address two main drawbacks of existing k-means algorithms: that they are slow and do not scale well to large datasets, and that the results depend on the initial center points chosen. The algorithm uses a greedy approach and kd-trees to efficiently compute the global clustering for k centers based on the assumption of a global clustering for k-1 centers. Experimental results on both synthetic and real datasets are also presented to evaluate the performance of the algorithm.

Uploaded by

Tabitha Howard
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOC, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
15 views62 pages

Fast Greedy K-Means Algorithm

This document provides an introduction and overview of a master's thesis that presents a fast greedy k-means clustering algorithm. The algorithm aims to address two main drawbacks of existing k-means algorithms: that they are slow and do not scale well to large datasets, and that the results depend on the initial center points chosen. The algorithm uses a greedy approach and kd-trees to efficiently compute the global clustering for k centers based on the assumption of a global clustering for k-1 centers. Experimental results on both synthetic and real datasets are also presented to evaluate the performance of the algorithm.

Uploaded by

Tabitha Howard
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOC, PDF, TXT or read online on Scribd

A Fast Greedy K-Means Algorithm

N. Hussein

Page 1 of 62

A Fast Greedy K-Means Algorithm


N. Hussein
Masters Thesis
Nr:9668098
University of Amsterdam Faculty of Mathematics, Com uter !ciences, "hysics and Astronomy #uclides $uilding "lantage muidergracht %& '(') *+ Amsterdam the Netherlands Novem,er, %((%

Page 2 of 62

First of all I would like to thank Nikos Vlassis for his guidance and support, both during the stage of developing ideas as well as during the writing of this thesis !"aak Verbeek has been #ore than helpfull to push #e in the beginning of the graduation pro"ect I thank also $ike for correcting #% english Further#ore, I would like to thank #% fa#il% for ever%thing the% had to cope during writing #% thesis and for their support and love throughout these %ears &he% #ake #% life #eaningfull I a# also grateful to the perfecthousing tea# for their support and understanding

Page 3 of 62

*a,le of Contents
A,stract..........................................................................................' .ntroduction..................................................................................../ % 0efinitions and 1elated 2or3..............................................................'( 2.1 Nave k-means algorithm..............................................................11 2.2 Kd-trees.................................................................................14 2.3 The reed! K-means "lgorithm......................................................1# 2.4 "$$elerating the nave K-Means "lgorithm ........................................1% 2.4.1 The &la$klisting "lgorithm ....................................................1% 2.4.2 'ariants of k-means (sing kd-tree............................................21 4 Accelerating the Glo,al 3-means........................................................%4 3.1 The overall algorithm..................................................................2# 3.2 )estri$ted filtering "lgorithm........................................................2% 3.2.1 *om+(ting ,istortion for the $(rrent set of $enters-.....................3. 3.2.2 *om+(ting the distortion for the (+dated list of $enters-................3. 3.3 ,istortion $om+(tation for the set of $andidate $enter lo$ations..............32 & .m lementation...............................................................................44.1 *om+(ting the ne/ *entroids........................................................3% 4.2 *om+(ting the distortion.............................................................30 4.3 *om+(ting distortion for added $enter.............................................31 4.4 Maintaining the Tree...................................................................41 5 #6 erimental results........................................................................&% #.1 ,ataset generated (sing M"T2"&....................................................43 #.1.1 'ariation of n(m3er of insertion +oints.....................................43 #.1.2 'ariation of threshold1.........................................................44 #.1.3 'ariation of threshold2.........................................................46 #.1.4 4ffe$t of n(m3er of +oints....................................................4% #.1.# 4ffe$t of dimensionalit!.......................................................40

Page 4 of 62

#.1.6 4ffe$t of n(m3er of $enters...................................................41 #.2 ,ata generated (sing ,asg(+ta 51#6.................................................#1 #.2.1 'ariation in dimensionalit!....................................................#1 #.2.2 'ariation in n(m3er of +oints.................................................#3 #.2.3 'ariation in *l(ster 7e+aration...............................................#4 #.3 ,ataset from 8*9 M2 )e+ositor!.....................................................#6 - Conclusion.....................................................................................5) / Further 1esearch............................................................................-( ) 1eferences....................................................................................-'

Page # of 62

Abstract
9n $l(stering: /e are given a set of N +oints in d-dimension s+a$e 'd and /e have to arrange them into a n(m3er of gro(+s ;$alled $l(sters<. 9n k(#eans $l(stering: the gro(+s are identified 3! a set of +oints that are $alled the $l(ster $enters. The data +oints 3elong to the $l(ster /hose $enter is $losest. 4=isting algorithms for k-means $l(stering s(ffer from t/o main dra/3a$ks: ;i< The algorithms are slo/ and do not s$ale to large n(m3er of data +oints and ;ii< the! $onverge to different lo$al minima 3ased on the initiali>ations. ?e +resent a fast greed! k-means algorithm that atta$ks 3oth these dra/3a$ks 3ased on a greed! a++roa$h. The algorithm is fast: deterministi$ and is an a++ro=imation of a glo3al strateg! to o3tain the $l(sters. The method is also sim+le to im+lement re@(iring onl! t/o variations of kd-trees as the maAor data str($t(re. ?ith the ass(m+tion of a glo3al $l(stering for k-1 $enters: /e introd($e an effi$ient method to $om+(te the glo3al $l(stering for k $l(sters. "s the glo3al $l(stering for k B 1 is o3vio(s ;$l(ster $enter at the mean of all +oints<: /e $an find the glo3al sol(tion for k $l(sters ind($tivel!. "s a res(lt: the algorithm /e +resent $an also 3e (sed for finding the minim(m n(m3er of $enters that satisf! a given $riteria. ?e also +resent a n(m3er of em+iri$al st(dies 3oth on s!ntheti$ and real data.

Page 6 of 62

1 Introduction
*l(stering is a /ell-kno/n +ro3lem in statisti$s and engineering: namel!: ho/ to arrange a set of ve$tors ;meas(rements< into a n(m3er of gro(+s ;$l(sters<. *l(stering is an im+ortant area of a++li$ation for a variet! of fields in$l(ding data mining: statisti$al data anal!sis and ve$tor @(anti>ation. The +ro3lem has 3een form(lated in vario(s /a!s in the ma$hine learning: +attern re$ognition o+timi>ation and statisti$s literat(re. The f(ndamental $l(stering +ro3lem is that of gro(+ing together ;$l(stering< data items that are similar to ea$h other. The most general a++roa$h to $l(stering is to vie/ it as a densit! estimation +ro3lem. &e$a(se of its /ide a++li$ation: several algorithms have 3een devised to solve the +ro3lem. Nota3le among these are the 4M algorithm: ne(ral nets: 7'M and k-means. *l(stering the data a$ts as a /a! to +arameteri>e the data so that one does not have to deal /ith the entire data in later anal!sis: 3(t onl! /ith these +arameters that des$ri3e the data. 7ometimes $l(stering is also (sed to red($e the dimensionalit! of the data so as to make the anal!sis of the data sim+ler. 9n one of its forms: $l(stering +ro3lems $an 3e defined as- given a dataset of N re$ords: ea$h having dimensionalit! d: to +artition the data into s(3sets s($h that a s+e$ifi$ $riterion is o+timi>ed. The most /idel! (sed $riterion for o+timi>ation is the distortion $riterion. 4a$h re$ord is assigned to a single $l(ster and distortion is the average s@(ared 4($lidean distan$e 3et/een a re$ord and the $orres+onding $l(ster $enter. Th(s this $riterion minimi>es the s(m of the s@(ared distan$es of ea$h re$ord from its $orres+onding $enter. K-means $l(stering is (sed to minimi>e the a3ovementioned term 3! +artitioning the data into k non-overla++ing regions identified 3! their $enters. "lso k-means is defined onl! over n(meri$ or $ontin(o(s data sin$e it re@(ires the a3ilit! to $om+(te 4($lidean distan$es as /ell as the mean of a set of re$ords. *l(stering 3ased on k-means is $losel! related to a n(m3er of other $l(stering and lo$ation +ro3lems. These in$l(de the 4($lidean k-medians: in /hi$h the o3Ae$tive is to minimi>e the s(m of distan$es to the nearest $enter: and the geometri$ k-$enter +ro3lem: in /hi$h the o3Ae$tive is to minimi>e the ma=im(m distan$e from ever! +oint to its $losest $enter. There are no effi$ient methods kno/n to an! of these +ro3lems and some form(lations are NP-hard. 9n 5116 is +resented an as!m+toti$all! effi$ient a++ro=imation for the k-means $l(stering +ro3lem: 3(t the large $onstant fa$tors s(ggest that it is not a good $andidate for +ra$ti$al im+lementation.

Page % of 62

The k-means algorithm is 3ased on the sim+le o3servation that the o+timal +la$ement of a $enter is at the $entroid of the asso$iated $l(ster. Th(s given an! set of k-$enters *: for ea$h $enter c

C: let ';$< denote the voronoi region of the +oint

$: that is: the region of s+a$e that is $losest to the $enter $. "t ever! stage of the kmeans: the algorithm moves the $enters in * to the $entroid of the +oints in ';$< and then (+dates ';$< 3! re$om+(ting the distan$e from ea$h +oint to a $enter $. These ste+s are re+eated (ntil some $onvergen$e $ondition is met. Cor +oints in general +ositions ;in +arti$(lar: if no +oint is e@(idistant from t/o $enters<: the algorithm /ill $onverge to a +oint /here f(rther stages of the algorithm /ill not $hange the +osition of an! $enter. This /o(ld 3e an ideal $onvergen$e $ondition. Do/ever: $onverging to the +oint /here no $enter is f(rther (+dated might take ver! long. Th(s f(rther stages of the algorithm are sto++ed /hen the $hange in distortion: is less than a threshold. This saves a lot of time as in the last stages: the $enters move ver! little in ever! stage. 4ven /itho(t the a++ro=imation of $onvergen$e 3ased on distortion: the res(lts o3tained are not ne$essaril! a glo3al minim(m. The res(lts o3tained var! greatl! 3ased on the initial set of $enters $hosen. The algorithm is deterministi$ after the initial $enters are determined. The attra$tiveness of the k-means lies in its sim+li$it! and fle=i3ilit!. 9n s+ite of other algorithms 3eing availa3le: k-means $ontin(es to 3e an attra$tive method 3e$a(se of its $onvergen$e +ro+erties. Do/ever: it s(ffers from maAor short$omings that have 3een a $a(se for it not 3eing im+lemented on large datasets. The most im+ortant among these are ;i< K-means is slo/ and s$ales +oorl! /ith res+e$t to the time it takes for large n(m3er of +ointsE ;ii< The algorithm might $onverge to a sol(tion that is a lo$al minim(m of the o3Ae$tive f(n$tion. M($h of the related /ork does not attem+t to $onfront 3oth the 3efore mentioned iss(es dire$tl!. &e$a(se of these short$omings: K-means is at times (sed as a hill $lim3ing method rather than a $om+lete $l(stering algorithm: /here the $entroids are initiali>ed to the $enters o3tained from some other methods. The $l(sters o3tained de+end heavil! on the initial $enters. 9n 516 is given a dis$(ssion on ho/ to $hoose the initial $enters. To treat the +ro3lem of $hoosing the initial $enters: several other te$hni@(es (sing sto$hasti$ glo3al o+timi>ation methods ;e.g. 7im(lated annealing: geneti$ algorithms<: have 3een develo+ed. ,ifferent methods of s(3-sam+ling and a++ro=imation have also 3een +ro+osed to $o(nter the a3ove. Do/ever: none of these methods have gained /ide a$$e+tan$e and in man! +ra$ti$al a++li$ations: the $l(stering method that is (sed is 3ased on r(nning the kmeans algorithm /ith m(lti+le restarts.

Page 0 of 62

There has 3een some related /ork /here one of the a3ove t/o iss(es mentioned has 3een addressed. 9n 52:36 and 546 methods have tried to address the +ro3lem of effi$ient im+lementation of the k-means algorithm. 9n 516 a method has addressed the +ro3lem of the lo$al $onvergen$e of the k-means algorithm and +ro+osed an iterative algorithm that $omes $loser to the glo3al sol(tion 3! r(nning a set of deterministi$ k-means algorithms one 3! one. &(t the +ro+osed strateg! is ver! slo/ and has little +ra$ti$al a++li$ation even in $ase of medi(m si>ed datasets. &efore +ro$eeding: /e e=+li$itl! define the +ro3lem that /e are looking at. 9n a traditional k-means algorithm ;for /hi$h /e develo+ an im+roved sol(tion< ea$h of the +oints is asso$iated /ith onl! one +artition also $alled the $l(ster. The n(m3er of +artitions is +re-s+e$ified. 4a$h of the +artitions is re$ogni>ed 3! a $l(ster $enter: /hi$h is the mean of all the +oints in the +artition. "ll the +oints in a +artition are $loser to its $enter than to an! other $l(ster $enter. The a$$(ra$! of a $l(stering a++roa$h is defined 3! the distortion $riterion: /hi$h is the mean s@(ared distan$e of a +oint from its $l(ster $enter. Th(s the o3Ae$tive is to get the $l(stering /ith minim(m distortion and the s+e$ified n(m3er of $l(sters. 9n this thesis /e address the iss(es of s$aling the algorithm for large n(m3er of +oints and $onvergen$e to lo$al o+tima. ?e +resent a fast greed! k-means algorithm. The +ro+osed algorithm is entirel! deterministi$ and so avoids the +ro3lem of having to initiali>e the $enters. 9n addition to $om3ining ideas in [Link] 546: 516 and the standard k-means algorithm: the +ro+osed algorithm also introd($es some ne/ te$hni@(es to make k-means more tra$ta3le. The im+lementation is also sim+le and onl! re@(ires variants of a kd-tree. The rest of the thesis is organi>ed as follo/s. 9n the ne=t $ha+ter: /e introd($e the vario(s $on$e+ts that are related to this algorithm. 9n $ha+ter 3 /e o(tline o(r algorithm: its variants and the ne/ ideas im+lemented. *ha+ter 4 e=+lains the im+lementation iss(es so that similar res(lts $an 3e o3tained in inde+endent st(dies later. 4=+erimental res(lts on 3oth s!ntheti$ and real data are revie/ed in $ha+ter #. *ha+ter 6 gives the $on$l(sions of the resear$h and $ha+ter % +oints to $ertain dire$tions in /hi$h f(rther resear$h $o(ld 3e (ndertaken. The thesis $loses /ith relevant referen$es in $ha+ters 0.

Page 1 of 62

2 Definitions and Related Work


9n this $ha+ter /e look at relevant /ork and definitions that are im+ortant for the (nderstanding of the thesis. The ideas $overed in this $ha+ter are the standard kmeans algorithm: also referred to as the 2lo!ds algorithm in 546: a 3rief introd($tion to kd-trees: the lo3al K-means algorithm as in 516 in$l(ding the greed! a++ro=imations and other methods s(ggested in the same +a+er for im+roving the +erforman$e of the method and the &la$klistingFfiltering algorithm as in 546 and 526. 7everal definitions: given in these +a+ers: /hi$h /e /o(ld (se in this thesis: are also in$l(ded. ?e: ho/ever: do not give an! +roofs of the theorems or lemma that are alread! +roved in these +a+ers.

Page 1. of 62

2.1

Nave k-means algorithm

Gne of the most +o+(lar he(risti$s for solving the k-means +ro3lem is 3ased on a sim+le iterative s$heme for finding a lo$all! o+timal sol(tion. This algorithm is often $alled the k-means algorithm. There are a n(m3er of variants to this algorithm: so to $larif! /hi$h version /e are (sing: /e /ill refer to it as the nave k-means algorithm as it is m($h sim+ler $om+ared to the other algorithms des$ri3ed here. This algorithm is also referred to as the 2lo!ds algorithm in 546. The naive k-means algorithm +artitions the dataset into Hk s(3sets s($h that all re$ords: from no/ on referred to as +oints: in a given s(3set I3elongI to the same $enter. "lso the +oints in a given s(3set are $loser to that $enter than to an! other $enter. The +artitioning of the s+a$e $an 3e $om+ared to that of 'oronoi +artitioning e=$e+t that in 'oronoi +artitioning one +artitions the space 3ased on distan$e and here /e +artition the points 3ased on distan$e. The algorithm kee+s tra$k of the $entroids of the s(3sets: and +ro$eeds in sim+le iterations. The initial +artitioning is randoml! generated: that is: /e randoml! initiali>e the $entroids to some +oints in the region of the s+a$e. 9n ea$h iteration ste+: a ne/ set of $entroids is generated (sing the e=isting set of $entroids follo/ing t/o ver! sim+le ste+s. 2et (s denote the set of $entroids after the i th iteration 3! C(i). The follo/ing o+erations are +erformed in the ste+s;i< Partition the +oints 3ased on the $entroids *;i<: that is: find the $entroids to /hi$h ea$h of the +oints in the dataset 3elongs. The +oints are +artitioned 3ased on the 4($lidean distan$e from the $entroids. ;ii< 7et a ne/ $entroid $;iJ1< $losest to $;i<

* ;iJ1< to 3e the mean of all the +oints that are

* ;i< The ne/ lo$ation of the $entroid in a +arti$(lar +artition is

referred to as the ne/ lo$ation of the old $entroid. The algorithm is said to have $onverged /hen re$om+(ting the +artitions does not res(lt in a $hange in the +artitioning. 9n the terminolog! that /e are (sing: the algorithm has $onverged $om+letel! /hen C(i) and C(i
1)

are identi$al. Cor

$onfig(rations /here no +oint is e@(idistant to more than one $enter: the a3ove $onvergen$e $ondition $an al/a!s 3e rea$hed. This $onvergen$e +ro+ert! along /ith its sim+li$it! adds to the attra$tiveness of the k-means algorithm. The nave k-means needs to +erform a large n(m3er of Inearest-neigh3orI @(eries for the +oints in the dataset. 9f the data is Hd dimensional and there are HN +oints in the dataset: the $ost of a single iteration is G;kdN<. "s one /o(ld have to r(n several iterations: it is generall! not feasi3le to r(n the nave k-means algorithm for large n(m3er of +oints.

Page 11 of 62

7ometimes the $onvergen$e of the $entroids ;i.e. C(i) and C(i+1) 3eing identi$al< takes several iterations. "lso in the last several iterations: the $entroids move ver! little. "s r(nning the e=+ensive iterations so man! more times might not 3e effi$ient: /e need a meas(re of $onvergen$e of the $entroids so that /e sto+ the iterations /hen the $onvergen$e $riteria is met. ,istortion is the most /idel! a$$e+ted meas(re. *l(stering error meas(res the same $riterion and is sometimes (sed instead of distortion. 9n fa$t k-means algorithm is designed to o+timi>e distortion. Pla$ing the $l(ster $enter at the mean of all the +oints minimi>es the distortion for the +oints in the $l(ster. "lso /hen another $l(ster $enter is $loser to a +oint than its $(rrent $l(ster $enter: moving the $l(ster from its $(rrent $l(ster to the other $an red($e the distortion f(rther. The a3ove t/o ste+s are +re$isel! the ste+s done 3! the kmeans $l(ster. Th(s k-means red($es distortion in ever! ste+ lo$all!. The k-Means algorithm terminates at a sol(tion that is lo$all! o+timal for the distortion f(n$tion. Den$e: a nat(ral $hoi$e as a $onvergen$e $riterion is distortion. "mong other meas(res of $onvergen$e (sed 3! other resear$hers: /e $an meas(re the s(m of 4($lidean distan$e of the ne/ $entroids from the old $entroids as in 516. 9n this thesis /e al/a!s (se $l(stering errorFdistortion as the $onvergen$e $riterion for all variants of k-means algorithm. 0efinition '- )lustering error is the s(m of the s@(ared 4($lidean distan$es from +oints to the $enters of the +artitions to /hi$h the! 3elong. Mathemati$all!: given a $l(stering

: /e denote 3! ( x ) the $entroid this


( x ) is sim+l! the

$l(stering asso$iates /ith an ar3itrar! +oint * ;so for k-means: $enter $losest to *<. ?e then define a meas(re of @(alit! for

distortion =

1 N

x ( x)
x

;1<

?here KaK is (sed to denote the norm of a ve$tor Ha. The lesser the differen$e in distortion over s($$essive iterations: the more the $entroids have $onverged. ,istortion is therefore (sed as a meas(re of goodness of the +artitioning. 9n s+ite of its sim+li$it!: k-means often $onverges to lo$al o+tima. The @(alit! of the sol(tion o3tained de+ends heavil! on the initial set of $entroids: /hi$h is the onl! non-deterministi$ ste+ in the algorithm. Note that altho(gh the starting $enters $an 3e sele$ted ar3itraril!: k-means is f(ll! deterministi$: given the starting $enters. " 3ad $hoi$e of initial $enters $an have a great im+a$t on 3oth +erforman$e and distortion. "lso a good $hoi$e of initial $entroids /o(ld red($e the n(m3er of iterations that are re@(ired for the sol(tion to $onverge. Man! algorithms have tried to im+rove the @(alit! of the k-means sol(tion 3! s(ggesting different /a!s of sam+ling the initial $enters: 3(t none has 3een a3le to avoid the +ro3lem of the sol(tion $onverging to a lo$al o+tim(m. Cor e=am+le: 516 gives a dis$(ssion on ho/ to

Page 12 of 62

$hoose the initial $enters: other te$hni@(es (sing sto$hasti$ glo3al o+timi>ations methods ;e.g. 7im(lated annealing: geneti$ algorithms<: have also 3een develo+ed. None of these algorithms is /idel! a$$e+ted.

Page 13 of 62

2.2

d-trees

" ver! im+ortant data str($t(re that is (sed in o(r algorithm is a kd-tree. " kd-tree is a data str($t(re for storing a set of finite +oints from a d-dimensional s+a$e. 9t /as introd($ed and e=amined it in detail in 512: 136. Kd-trees are sim+le data str($t(res /ith the follo/ing +ro+erties;i< ;ii< ;iii< The! are 3inar! treesE The root node $ontains all the +ointsE " node is s+lit along a s+lit-+lane s($h that +oints to the left are +art of the left s(3-tree: +oints to the right are +art of the right s(3-treeE ;iv< The left and right s(3-trees are re$(rsivel! s+lit (ntil there is onl! one +oint in the leaf or a $ertain $ondition is satisfied. This is the 3asi$ kd-tree str($t(re. There e=ist several variants of the kd-tree 3ased on the /a! in /hi$h the! $hoose the s+litting +lane: the termination $riteria: et$. Griginall! designed to de$rease the time in nearest neigh3or @(eries: kd-trees have fo(nd other a++li$ations as /ell. Gmoh(mdro has re$ommended it in a s(rve! of +ossi3le te$hni@(es to in$rease s+eed of ne(ral net/ork learning in 5146. 9n the $onte=t of k-means: kd-trees are (sed as a data str($t(re to save the +oints in the dataset 3! methods des$ri3ed in 51.6: 546 and 52: 36. " variant of the kd-tree is also (sed in 516. Tho(gh kd-trees give s(3stantial advantage for lo/er dimensions: the +erforman$e of kd-trees de$reasesFdro+s in higher dimensions. Gther data str($t(res like ", trees 5#6 have 3een s(ggested for higher dimensions 526 3(t these have never 3een (sed for k-means. "fter this 3rief introd($tion to the kd-trees ;/hi$h is the +rimar! data str($t(re (sed in o(r algorithm<: /e dis$(ss the t/o main a++roa$hes that tr! to $o(nter the short$omings of the k-means algorithm.

Page 14 of 62

2.!

"he #reed$ -means Algorithm

9n 516 is +resented a variant of the k-means algorithm that is $alled the glo3al kmeans algorithm. The lo$al $onvergen$e +ro+erties of k-means have 3een im+roved in this algorithm. "lso it does not re@(ire the initial set of $entroids to 3e de$ided. The idea is that the glo3al minima $an 3e rea$hed thro(gh a series of lo$al sear$hes 3ased on the glo3al $l(stering /ith one $l(ster less. Assum tion7 The ass(m+tion (sed in the algorithm is that the glo3al o+tima $an 3e rea$hed 3! r(nning k-means /ith the ;k-1< $l(sters 3eing +la$ed at the o+timal +ositions for the ;k-1< $l(stering +ro3lem and the k th $l(ster 3eing +la$ed at an a++ro+riate +osition that is !et to 3e dis$overed. 2et (s ass(me that the +ro3lem is to find K $l(sters and K L K. ?e 8se the a3ove ass(m+tion: the glo3al o+tima for k B K $l(sters is $om+(ted as a series of lo$al sear$hes. "ss(ming that /e have solved the k-means $l(stering +ro3lem for K M 1 $l(sters: /e have to +la$e a ne/ $l(ster at an a++ro+riate lo$ation. To dis$over the a++ro+riate insertion lo$ation: /hi$h is not kno/n: /e r(n k-means algorithm (ntil $onvergen$e /ith ea$h of the +oints in the entire set of the +oints in the dataset 3eing added as the $andidate ne/ $l(ster: one at a time: to the K M 1 $l(sters. The $onverged K $l(sters that have the minim(m distortion after the $onvergen$e of kmeans in the a3ove lo$al sear$hes are the $l(sters of the glo3al k-means. ?e kno/ that for k B 1: the o+timal $l(stering sol(tion is the mean of all the +oints in the dataset. 8sing the a3ove method /e $an $om+(te the o+timal +ositions for the k B 2: 3: 4: ... K: $l(sters. Th(s the +ro$ess involves $om+(ting the o+timal k-means $enters for ea$h of the K B 1: 2: 3N K $l(sters. The algorithm is entirel! deterministi$. Tho(gh the attra$tiveness of the glo3al k-means lies in it finding the glo3al sol(tion: the method involves a heav! $ost. K-means is r(n N times: /here N is the n(m3er of +oints in the dataset: for ever! $l(ster to 3e inserted. The $om+le=it! $an 3e red($ed $onsidera3l! 3! not r(nning the K-means /ith the ne/ $l(ster 3eing inserted at ea$h of the dataset +oints 3(t 3! finding another set of +oints that $o(ld a$t as an a++ro+riate set for insertion lo$ation of the ne/ $l(ster. 9n 516 is s(ggested the (se of $enters of regions: formed 3! a variant of the kd-tree: as the insertion +oints for ne/ $entroids. The variant of the kd-tree s+lits the +oints in a node (sing the +lane that +asses thro(gh the mean of the +oints in the node and is +er+endi$(lar to the +rin$i+al $om+onent of the +oints in the node. " node is not s+lit if it has less than a +re-s+e$ified n(m3er of +oints or an (++er 3o(nd to the n(m3er of leaf nodes is rea$hed. The idea is that even if the kd-tree /ere not (sed for nearest neigh3or

Page 1# of 62

@(eries: merel! the $onstr($tion of the kd-tree 3ased on this strateg! /o(ld give a ver! good +reliminar! $l(stering of the data. ?e $an th(s (se the kd-tree nodes $enters as the $andidateFinitial insertion +ositions for the ne/ $l(sters. The time $om+le=it! of the algorithm $an also 3e im+roved 3! taking a greed% approach. 9n this a++roa$h: r(nning k-means for ea$h +ossi3le insertion +osition is avoided. 9nstead red($tion in the distortion /hen the ne/ $l(ster is added is taken into a$$o(nt /itho(t a$t(all! r(nning k-means. The +oint that gives the ma=im(m de$rease in the distortion /hen added as a $l(ster $enter is taken to 3e the ne/ insertion +osition. K-means is r(n (ntil $onvergen$e on the ne/ list of $l(sters /ith this added +oint as the ne/ $l(ster. The ass(m+tion is that the +oint that gives the ma=im(m de$rease in distortion is also the +oint for /hi$h the $onverged $l(sters /o(ld have the least distortion. This res(lts in a s(3stantial im+rovement in the r(nning time of the algorithm: as it is (nne$essar! to r(n k-means for all the +ossi3le insertion +ositions. Do/ever: the sol(tion ma! not 3e glo3all! o+timal 3(t an a++ro=imate glo3al sol(tion.

Page 16 of 62

2.%

Accelerating the nave

-&eans Algorithm

The high time $om+le=it! of the k-means algorithm makes it im+ra$ti$al for (se in the $ase of large n(m3er of +oints in the dataset. )ed($ing the large n(m3er of nearest neigh3or @(eries in the algorithm $an a$$elerate it. The kd-tree data str($t(re hel+s in a$$elerating in the nearest neigh3or @(eries and is a nat(ral $hoi$e to red($e the $om+le=it!. There are at least three variants of the k-means algorithm (sing kd-trees +resented in 526: 546 and 51.6. 9n all of these the kd-tree is formed (sing the +oints in the dataset. 7($h (sage of kd-trees in k-means /as first done in a method detailed in 51.6.

2.%.1

"he 'lacklisting Algorithm

9n 526 is +resented a method to s+eed (+ the e=e$(tion of single k-means iteration th(s making k-means for large datasets tra$ta3le. This algorithm is $alled the 3la$klisting algorithm. 9n-s+ite of its sim+li$it!: the nave k-means algorithms involves a ver! large n(m3er of nearest neigh3or @(eries. The 3la$klisting algorithm +ro$eeds 3! red($ing the n(m3er of nearest neigh3or @(eries that the k-means algorithm e=e$(tes. Th(s s+eed is enhan$ed greatl!. The time taken for a single iteration is greatl! im+roved. The algorithm remains e=a$t and it /o(ld (ntil take the same n(m3er of iterations for $onvergen$e as the nave k-means. The 3la$klisting algorithm (ses a variation of the kd-tree: $alled the mrkd tree: des$ri3ed in 5#6. 9n this variation of the kd-tree: the region of a node is a h!+erre$tangle identified 3! t/o ve$tors: h ma= and hmin. The node is s+lit at the mid +oint of the h!+er-re$tangle: +er+endi$(lar to the dimension that is longest. The root node re+resents the h!+er-re$tangle that en$loses all the +oints in the dataset. The algorithm e=+loits the fa$t that instead of (+dating a +arti$(lar $l(ster from the +oints that 3elong to it +oint 3! +oint: a more effi$ient a++roa$h /ill 3e to (+date in 3(lk or gro(+s. Nat(rall! these gro(+s /ill $orres+ond to h!+er-re$tangles in the kdtree. Points $an 3e assigned in 3(lk (sing the kno/n $enters of mass and si>e of gro(+s of +oints +rovided all the +oints are 3eing assigned to the same $l(ster. To ens(re $orre$tness /e m(st first make s(re that all of the +oints in a given re$tangle indeed O3elongP to a s+e$ifi$ $enter 3efore adding their statisti$s to it. &efore +ro$eeding to give a des$ri+tion of the algorithm: /e give some definitions that are ne$essar! for an (nderstanding of the algorithm. Cor f(rther details /e refer to 526 and 546.

Page 1% of 62

0efinition %-

iven a set of $enters ) and a h!+er-re$tangle h: /e define 3!

owner)+h, a $enter c

) s($h that an! +oint in h is $loser to c than to an! other

$enter in ): if s($h a $enter e=ists. *heorem '- 2et ) 3e a set of $enters: and h a h!+er-re$tangle. 2et c

) 3e ownerc

+h,. 9f d(c, h) re+resents the distan$e of a h!+er-re$tangle from the $enter $ ;minim(m distan$e from an! +oint of the h!+er-re$tangle<: then- ;9t is not A(st the h!+er-re$tangle 3o(ndar!. 9n $ase the $enter is inside the distan$e is al/a!s >ero. ?hen the $enter is o(tside the distan$e from the h!+er-re$tangle is minim(m from one of the 3o(ndaries<

d (c, h) = min d (c ' , h)


c 'C

The a3ove theorem states that if the h!+er-re$tangle is dominated 3! a $entroid: then it is the $entroid that is the $losest to the h!+er-re$tangle.

Page 10 of 62

Figure '7 *he dotted lines are the lines of decision ,et8een C ', C% and C', C4. C' is also closest to any of the hy er-rectangles. C ' dominates the hy er-rectangles U', U% and U4. 0efinition 4iven a h!+er-re$tangle h: and t/o $enters c- and c.: /e sa! that c-

do#inates c. /ith res+e$t to h if ever! +oint in h is $loser to c- than it is to c.. Th(s the o/ner*;h< dominates all the $enters in *: e=$e+t itself. That is: /e sa! that a h!+er-re$tangle is dominated 3! a $entroid if no other $entroid is $loser to an! +oint of the h!+er-re$tangle than this $entroid. 9emma '7 iven t/o $enters c-: c.: and a h!+er-re$tangle h s($h that d+c-,h, Q

d+c.,h,: the de$ision +ro3lem Idoes c- dominate c. /ith res+e$t to hRI $an 3e ans/ered in /+d, time /here d is the dimensionalit! of the dataset.

Page 11 of 62

The task mentioned in 2emma 1 $an 3e done 3! taking the e=treme +oint + of the h!+er-re$tangle h in the dire$tion v = c 2 c1 and verif!ing if p is $loser to c- or c.. 9f it is $loser to c- then c- dominates c.. C(rthermore: finding the e=treme +oint in the dire$tion v = c 2 c1 : that is: the linear +rogram Oma=imi>e
v, p

s($h that p h P $an 3e solved in time /+d,.

"gain /e noti$e the e=treme +oint is a $orner of h. Cor ea$h $oordinate i: /e $hoose pi to 3e hi#a* if c.i 0 c-i : and hi#in other/ise. 8sing theorem 1 and lemma 1: /e $an find in at most /+k1d,: /here k1 is the n(m3er of $l(ster $enters +assed to the node and d the dimensionalit! of the data: if an! $l(ster $enter dominates the h!+er-re$tangle. 9f so: all the +oints in this h!+erre$tangle $an 3e assigned to this $l(ster.

Figure %7 *his icture sho8s the vectors C % : C' and C4 : C'. #', #%, #4 and #& are the e6treme oints in the direction C % : C'. #4, #&, #5 and #- are the e6treme oints in the direction C4 : C'.

Page 2. of 62

"ll the $l(ster $enters are initiall! +assed to the root node. The node verifies if a single $l(ster $enter dominates its region. 8sing theorem 1 and lemma 1 des$ri3ed 3efore does this. That is: /e find the $entroid $losest to the h!+er-re$tangle ;in the $ase of the 3la$klisting algorithm< or the $entroid $losest to the $enter of the $enter of the h!+er-re$tangle ;filtering algorithm<. 8sing lemma 1: /e $an then verif! if this $entroid dominates all the other $entroids. 9f so: all the +oints in the node are assigned to that $enter. 9f the $losest $enter does not dominate all the other $entroids: then the $losest $entroid and the $entroids not dominated 3! this $entroid are +assed to the $hild nodes. The same +ro$ess of filtering is re+eated at the $hild nodes. ?hen a node is $om+letel! assigned to a single $entroid: the $entroid details are (+dated 3ased on the s(m of the +oints in the node: their norm s@(ared and the n(m3er of +oints assigned so that the distortion and the mean of the +oints assigned to a $entroid $an 3e $om+(ted.

2.%.2

(ariants of k-means using kd-tree

"t least t/o other variants of k-means (sing the kd-tree are kno/n. These are des$ri3ed in 51.6 and 546. 9n 51.6 is also +ro3a3l! the first (se of kd-tree for k-means in a manner similar to the 3la$klisting algorithm. " kd-tree is (sed as a data str($t(re to store the +oints and +oints are assigned to the $enter in 3(lk 3ased on the regions of the kd-tree. Do/ever: in 51.6 the $on$e+t of domination of a node 3! a $enter as in 3la$klisting is not (sed. 9nstead: a node /as $onsidered $loser to a node onl! if its ma=im(m distan$e from the h!+er-re$tangle /as more than the minim(m distan$e of an! other $enter. 9n 546 is s(ggested a method ver! similar to the 3la$klisting algorithm: $alled the filtering algorithm: for s+eeding (+ the e=e$(tion of the k-means $l(stering. This algorithm differs from the 3la$klisting algorithm onl! at t/o +la$es. Gne differen$e from the 3la$klisting algorithm is that in order to find a +otential o/ner instead of distan$e from the $ell as in theorem 1: the distan$e from the $enter of the h!+erre$tangle is taken. This has the advantage that /hen there are more than one $l(ster $enters /hi$h are inside the h!+er-re$tangle: the $l(ster $enter that is $loser to the $enter of the h!+er-re$tangle is $hosen. "nother differen$e 3et/een the filtering algorithm and the 3la$klisting algorithm is that d(ring the kd-tree $onstr($tion: the filtering algorithm (ses a sliding mid+oint +lane for s+litting a node. 9n this: if one of the $hildren nodes has no +oints in its range the s+litting +lane is shifted so that at least one +oint is $ontained in ever! node. 7ee 566 for f(rther dis$(ssion on the sliding mid+oint s+litting in kd-trees.

Page 21 of 62

"ll the three algorithms: des$ri3ed in 51.6: the 3la$klisting and the filtering algorithm +ro$eed in ver! similar fashion. The algorithms are re$(rsive in nat(re and $all the $hildren nodes if an o/ner does not e=ist. "ll the $l(ster $enters are initiall! +assed to the root node. The node verifies if a single $l(ster $enter is $loser to the region than an! other: (sing their res+e$tive $riteria. 9f so: all the +oints in the node are assigned to that $enter. 9f the $losest $enter does not dominate all the other $entroids: then the $losest $entroid and the $entroids not filtered ;again 3ased on their res+e$tive $riteria< 3! this $entroid are +assed to the $hildren nodes. The same +ro$ess of filtering o(t $entroids is re+eated at the $hild nodes. ?hen a node is $om+letel! assigned to a single $entroid: the $entroid details are (+dated 3ased on the s(m of the +oints in the node: their s@(ared norm: and other $a$hed statisti$s so that the ne/ lo$ation of the $entroid and the distortion $an 3e $om+(ted. 9n the filtering algorithm: 3ased on ho/ the tree is are $lassified as e=+anded if the $hildren of that node /ere visited. 0efinition &7 " node is e*panded if its $hildren are visited /hen the tree is traversed. Tho(gh these /ork /ell for lo/er dimensions: the +erforman$e is not as good for higher dimensions. This is so 3e$a(se the kd-trees are not ver! good for higher dimensions. Do/ever the time $om+le=it! of these algorithms does not in$rease linearl! /ith the n(m3er of +oints or the n(m3er of $l(sters: /hi$h is the $ase /ith the nave k-means algorithm. "s a res(lt of these: these algorithms are 3etter s(ited for large n(m3er of +oints and $enters. 9n 546 it is +roven that /hen the $l(sters are $lose to the a$t(al $enters: the n(m3er of nodes e=+anded in a k-means iteration of the filtering algorithm is of the order of O(nk2/ 2). ?here

is the $l(ster

se+aration. Th(s the time $om+le=it! of the k-means red($es as the $l(ster se+aration in$reases. This is in $ontrast to the O(nk) $ost of single k-means iteration in the nave k-means iteration for all degrees of se+aration. The filtering algorithm that $om+(tes the $losest $entroid 3ased on its distan$e from the $enter of the h!+er-re$tangle +erforms 3etter than 3la$klisting algorithm. Gne advantage of (sing the sliding mid+oint +lane for s+litting the +lane over the method in the 3la$klisting algorithm is that ever! node has at least one +oint. Th(s an! node /ith no +oints in its region is eliminated. The idea is alread! more or less e=+ressed in sa!ing that a region /ith no +oints is identified. Therefore: this has 3een dis$arded. 9n this $ha+ter /e have $overed all the 3asi$ $on$e+ts that are needed for (nderstanding the algorithms that are dis$(ssed in the follo/ing $ha+ter. ?e (se all feat(res dis$(ssed in this $ha+ter in o(r algorithms and also introd($e some ne/ $on$e+ts that add an advantage to the algorithm.

Page 22 of 62

! Accelerating the #lobal kmeans


9n this $ha+ter /e $om3ine the three algorithms des$ri3ed 3! (s in the +revio(s se$tion. Th(s /e $om3ine the glo3al k-means algorithm s(ggested in 516: the fast kmeans algorithm s(ggested in 526 and 546 /ith the nave k-means algorithm: there3! making the glo3al k-means tra$ta3le for large data set. The kd-tree as in 516 is (sed for the insertion +ositions of the $entroids. The kd-tree as in 526: as in the 3la$klisting algorithm: /ith minor modifi$ations: is (sed for a$$elerating the k-means iteration. *ertain other modifi$ations are +ro+osed that s+eed (+ the algorithm $onsidera3l!. "s for the $on$e+ts des$ri3ed in 516: the same ass(m+tion s also (sed here. ?e restate this ass(m+tion on$e more 3elo/Assum tion7 The ass(m+tion (sed in the algorithm is that the glo3al o+tima $an 3e rea$hed 3! r(nning k-means /ith the ;k-1< $l(sters 3eing +la$ed at the o+timal +ositions for the ;k-1< $l(stering +ro3lem and the k th $l(ster 3eing +la$ed at an a++ro+riate +osition that is !et to 3e dis$overed. This $ha+ter is organi>ed in three se$tions. 9n the first se$tion /e give an overvie/ of the fast greed! k-means "lgorithm. This gives a 3rief des$ri+tion of the vario(s methods that are +art of the algorithm. 9n the ne=t se$tion /e dis$(ss an algorithm that enhan$es the fast k-means algorithms: that is: an algorithm 3ased on the 3la$klisting algorithm and the filtering algorithm. This is the algorithm that /e (se in the fast greed! k-means algorithm for $onvergen$e of $l(sters. The ne/ algorithm is a mi=t(re of the nave a++roa$h and the a++roa$h (sing kd-trees and is faster than either of these. This fast algorithm is named the restricted filtering algorith# : as filtering is not done /hen $ertain $riteria are satisfied. The restri$ted filtering algorithm also (ses some ne/ $on$e+ts not (sed in either nave k-means or the filtering algorithm. The ne=t se$tion gives an algorithm that a$ts as the $r($ial link in $om3ining the restri$ted filtering algorithm and the greed! glo3al k-means algorithm. 4ffi$ient im+lementation of this link is th(s ver! im+ortant for the fast greed! k-means algorithm. ?e then $om3ine the restri$ted filtering algorithm and the greed! glo3al k-means algorithm to get a fast greed! k-means algorithm. The fast greed! k-means algorithm over$omes the maAor short$omings of the k-means algorithm dis$(ssed in $ha+ter 1: vi>. +ossi3le $onvergen$e to lo$al minima and large time $om+le=it! /ith res+e$t to the n(m3er of +oints in the dataset. Do/ever: there

Page 23 of 62

is a limitation in that tho(gh the algorithm $an 3e (sed for large n(m3er of data +oints: it might (ntil not 3e feasi3le to (se the fast greed! k-means algorithm for large n(m3er of $l(sters and higher dimensions.

Page 24 of 62

!.1

"he overall algorithm

The fast greed! k-means algorithm (ses the $on$e+ts of the greed! glo3al k-means algorithm for a glo3al sol(tion. The intermediate $onvergen$e is done (sing the restri$ted k-means algorithm instead of the nave k-means algorithm. The follo/ing 4 ste+s that have 3een ill(strated in the diagram 3elo/ o(tline the algorithm.

1. *onstr($t an a++ro+riate set of +ositionsFlo$ations /hi$h $an a$t as good $andidates for insertion of ne/ $l(stersE 2. 9nitiali>e the first $l(ster as the mean of all the +oints in the datasetE 3. 9n the Kth iteration: ass(ming K-1 $l(sters after $onvergen$e find an a++ro+riate +osition for insertion of a ne/ $l(ster from the set of +oints $reated in ste+ 1 that gives minim(m distortionE 4. )(n k-means /ith K $l(sters till $onvergen$e. re@(ired n(m3er of $l(sters is not !et rea$hed. o 3a$k to ste+ 3 if the

!te

'7 The first ste+ of the algorithm is to find an a++ro+riate set of +oints that

$o(ld a$t as the insertion +ositions for the ne/ $l(sters in the $l(ster insertion ste+s of the algorithm. This is done (sing the kd-tree as in 516. Th(s: the fast greed! kmeans algorithm starts 3! first initiali>ing the kd-tree so that the $entroids of the +oints $ontained in the leaf nodes $an 3e (sed for the insertion of ne/ $l(ster +ositions. This variant of kd-tree s+lits the +oints in a node (sing the +lane that +asses thro(gh the mean of the +oints and is +er+endi$(lar to the +rin$i+al $om+onent of the +oints in the node. The n(m3er of 3($kets ;leaf nodes< is +res+e$ified: so one does not s+lit a node if the ma=im(m n(m3er of 3($kets has 3een rea$hed. Normall! the n(m3er of 3($kets is taken to 3e the n(m3er of $l(sters that one /ants to in the final sol(tion. Cor f(rther dis$(ssion on the +ro+erties of the kdtree: see 516. !te %7 Gn$e the $l(ster insertion lo$ations are determined: the k-means initiali>es the first $l(ster to 3e the mean of the +oints in the dataset. This is the o+tim(m $l(ster $enter for K B 1. This is the first ste+ in o(r iterative +ro$ess. The algorithm +ro$eeds 3ased on the follo/ing $on$e+t. 2et (s ass(me that /e have the o+tim(m $l(stering for K M 1 $l(sters. 8sing the ass(m+tion in 516: also des$ri3ed in $ha+ter 2: /e have to find an a++ro+riate $l(ster lo$ation for insertion so that the

Page 2# of 62

$l(ster lo$ations for the ;K M 1<-means and the ne/ $l(ster lo$ation $onverge to the o+tim(m $l(ster lo$ations for k-means. !te 47 Gne /a! to find the ne/ insertion lo$ation is to r(n k-means (ntil K M 1 means o+tim(m sol(tion. Do/ever: this is ver! time $ons(ming.

$onvergen$e /ith ea$h of the $andidate $l(ster lo$ations $om+(ted in ste+ 1 3eing added to the "s a res(lt /e (se a greed! a++roa$h that is an a++ro=imation. ?e onl! $om+(te the ne/ distortion /hen the ne/ $l(ster is added at a $andidate insertion +osition. The +oint that gives the minim(m distortion if added to the e=isting K M 1 meansF$enters in there o/n +ositions is taken to 3e the $l(ster $enter that is most a++ro+riate for getting the o+timal k-means sol(tion. !te &7 The final ste+ in the algorithm is to r(n k-means (ntil $onvergen$e on the

$onfig(ration for K M 1 $l(sters +l(s the ne/ $l(ster 3eing added at the lo$ation as determined in 7te+ 3. 9f /e have the re@(ired n(m3er of $l(sters: the algorithm is terminated. 4lse /e go 3a$k to ste+ 3: find another insertion lo$ation and r(n the algorithm for the ne/ set of $l(sters.

Page 26 of 62

!.2

Restricted filtering Algorithm

The restri$ted filtering algorithm $on$entrates on a$$elerating ste+ 4 of the fast greed! k-means algorithm. "s this is e=e$(ted several times: in a for loo+: a$$elerating the large n(m3er of nearest-neigh3or @(eries that the algorithm e=e$(tes in ste+ 4 $an signifi$antl! red($e the time $om+le=it! of the algorithm. Cor ea$h iteration of k-means: one has to assign vario(s data +oints to their nearest $l(ster $enters and then (+date the $l(ster $enters to 3e the mean of the +oints assigned to them. The restri$ted filtering algorithm is ver! similar to the filteringF3la$klisting algorithm e=+lained in $ha+ter 2. Modifi$ationsFe=tensions have 3een done so that effi$ien$! $an 3e im+roved in the $onstr($tion of nodes: s+a$e and n(m3er of nodes s+anned. The modifi$ationsFe=tensions are1. $oundaries of the node7 " modifi$ation is done at the stage /hen the kd-tree is $reated. 9n methods in 3oth 526 and 546: one of the 3o(ndaries of the region of the node /as the +lane +er+endi$(lar to the dimension along /hi$h the +arent h!+er-re$tangle is s+lit. The other e=treme along this dimension /as t!+i$all! the other e=treme val(e along that dimension in the +arent. "s a res(lt: the (nion of the region of the $hildren e@(als the region of the +arent. Do/ever: in this s$heme: large regions of the node have no +oint: also these regions are $onsidered d(ring the dominan$e $he$k. "n im+rovement in the a3ove s$heme $an 3e made. The 3o(ndaries of the node are redefined 3ased on the +oints in the region. The 3o(nds of the region are the e=treme val(es that the +oint $an take. Th(s if there is onl! one data +oint in the h!+er-re$tangle: its $oordinates /o(ld 3e 3oth the higher and lo/er e=tremes of the region that /o(ld 3e looked at 3! the algorithm. This gives marginal im+rovement d(ring filtering and also hel+s /hen the res(ltant region is not s+lit 3! an! of the 'oronoi 3o(ndaries of the $entroids. 9t $an also 3e noted that if the 3o(ndaries /ere redefined this /a!: ea$h of the $hildren nodes: formed 3! s+litting the node along the longest dimension: /o(ld have at least one +oint ea$h: th(s eliminating the need for a sliding mid+oint s+litting as s(ggested in 546. 2. *hreshold for direct com utation of distances - "nother $hange is in the $ondition /hen /e des$end the tree in a node. The $ost overheads of the 3la$klisting and filtering algorithm are high. The overhead is high not A(st /hen /e have to $reate the tree 3(t also /hen /e have to traverse the tree: as /e have to maintainF$om+(te several statisti$s for ea$h node individ(all!. C(rthermore: the $ost of the traditional k-means is signifi$antl! less for small n(m3er of +oints and $enters. "s a res(lt it /o(ld 3e more int(itive to (se a $om3ination of 3oth the a$$elerated and the nave algorithm. ?hen the $ost of the nave algorithm is less: i.e. +rod($t of n(m3er of data +oints and the n(m3er of $entroids to 3e $onsidered is less than a given val(e in the h!+er-re$tangle: instead of des$ending the tree: /e dire$tl! $om+(te the distan$es and assign the +oints in the h!+er-re$tangle to the $enters. The

Page 2% of 62

threshold val(e is de$ided 3ased on e=+erimental res(lts on vario(s random datasets of vario(s dimensions and si>es. This $hange res(lts in a s+eed (+ of at least t/o to three times /hen $om+ared to the im+lementation /itho(t the threshold val(e. 9t sho(ld 3e noti$ed that even if the average height of the kd-tree red($es 3! A(st one: the n(m3er of nodes to 3e e=+anded is red($ed to almost half. This res(lts in red($tion of large overheads like re$(rsion and maintenan$e of statisti$s for ea$h node. 9n $ertain lang(ages like M"T2"&: having to r(n a large n(m3er of statements as against the fe/ statements as re@(ired for dire$t $om+(tation is also a 3ig overhead. 3. Kd-tree construction at runtime- 9t might 3e noted that as a res(lt of the modifi$ation in ;2< a3ove: not all nodes /ith more than one data +oint have $hildren. "s a res(lt the $hildren of a node are determined onl! at r(n time: i.e. onl! if this node is e=+anded in some iteration. Do/ever: on$e the $hildren of a node are defined: these are stored so that one does not have do redefine the $hildren. This might give $onsidera3le advantage as the de+th of the tree is red($ed $onsidera3l!. "lso the s+a$e $om+le=it! is drasti$all! red($ed as even if the average de+th red($es 3! A(st one: /e have red($ed the s+a$e $om+le=it! of the tree to almost half as /ith a f(ll tree.

Figure 47 *his figure sho8s a ty ical case 8hen using the threshold 8ould ,e advantageous. *he decision line asses through almost the middle of one of the children hy er-rectangles that has only / oints. .nstead of s litting this node, 8e 8ould directly com ute the distortion.

Page 20 of 62

The restri$ted filtering algorithm $an is s(mmari>ed in the follo/ing8+date ;Node 8: *enters *< S ,o an! initiali>ations to 8 if first $all 22boundaries set to e*tre#e points

9f the threshold $riteria for dire$t $om+(tation is satisfied S ,ire$tl! $om+(te distan$esE (+date statisti$s of $enters )et(rn T

Cind the $entroid $U $losest to the $enter of the h!+er-re$tangle 8

9f $U dominates all other $entroids S "ll +oints in 8 3elong to $UE (+date statisti$s )et(rnE T else S Cind the $enters in *: *(nfiltered: not dominated 3! $U

,o initiali>ations to the $hild nodes if first $all: 22points split in node a#ong children 8+date ;[Link]: *(nfiltered< 8+date ;[Link]: *(nfiltered<E )et(rnE

The main o+erations in k-means are to find the mean of +oints in a $l(ster region and to $om+(te the distortion. ?hen the distan$es are 3eing $om+(ted +oint 3! +oint: the $om+(tation of these statisti$s is trivial. Do/ever /hen (sing the restri$ted filtering algorithm: +oints are assigned in 3(lk: so the $om+(tation is not so straightfor/ard. ?hile sim+l! initiall! $om+(ting and storing the s(m of all the +oints in a node $an o3tain the mean: there are t/o /a!s in /hi$h the distortion $an 3e $om+(ted. The! are des$ri3ed 3elo/.

Page 21 of 62

!.2.1

)om*uting Distortion for the current set of centers+

Cor all +oints V that 3elong to a +arti$(lar $enter ;=< the $ontri3(tion to the total distortion is-

distortion =

1 N

x ( x)
x

1 | x | 2 2 T ( x) x + N | ( x ) |2 N x x

;2<

?here N is the n(m3er of +oints. The a3ove e@(ation $an 3e dire$tl! (sed to $om+(te the distortion for the $(rrent set of +oints if one has stored the statisti$s of the s(m of s@(ared norm of the +oints: the s(m of the +oints that 3elong to a single $l(ster and the n(m3er of +oints in ea$h $l(ster.

!.2.2

)om*uting the distortion for the u*dated list of centers+

The same statisti$s $an 3e (sed to $om+(te the distortion /ith the (+dated $entroids. This is a ne/ method of $om+(ting the distortion devised 3! (s. 2et ;=< 3e the ne/ $entroids. ;=< is $om+(ted as the mean of the s(m of +oints in ea$h region. Th(s

' ( x) =

1 N

x
x

;3<

7(3stit(ting the same in the e@(ation ;2<: /e get the follo/ing

x ' ( x)
x 2 x

= x 2T ' ( x )( N' ( x )) + N | ' ( x ) |2


2 x 2

= x N ' ( x )

;4<

The statisti$s that need to 3e stored in either of these is the same vi>. the s(m of s@(ared norms of all the +oints in a node. Do/ever: 3ased on the form(la that is (sed: /e might either o3tain the distortion for the old set of $enters or the one-ste+ (+dated $enters. Do/ever /hen /e (se the se$ond form(la for o3taining the distortion for the ne/ $l(ster $enters: the n(m3er of iterations re@(ired for $onvergen$e /o(ld red($e 3! one ;3ased on the $riteria that /e (se for $onvergen$e<. This res(lts in marginal red($tion of the time taken for $onvergen$e. The e=a$t fig(res a3o(t the red($tion in time taken $an 3e seen in the e=+erimental res(lts: 7e$tion #.1- 4ffe$t of n(m3er of +oints: 4ffe$t of ,imensionalit! and 4ffe$t of n(m3er of $enters. 7e$tion #.2- 'ariation of dimension: 'ariation of n(m3er of +oints.

Page 3. of 62

The restri$ted filtering algorithm is highl! effi$ient and faster than either the nave k-means algorithm or the filtering algorithm for all t!+es of +ro3lems e=$e+t /hen the dimensionalit! is ver! large. 9n fa$t: the restri$ted k-means algorithm $an 3e vie/ed as a $om3ination of the nave k-means and the filtering algorithm. ?hen the threshold val(e is set ver! high: the algorithm red($es to a nave k-means algorithm. ?hen the threshold is set to >ero: the algorithm red($es to the filtering algorithm. Th(s in this $ase: the restri$ted k-means algorithm /o(ld 3e at least as fast as an! of these t/o algorithms 3(t the threshold val(e for dire$t $om+(tation of distortion might 3e different. Cor lo/er dimensions the threshold val(e is e=+e$ted to 3e lo/er and for higher dimensions: the threshold val(e is e=+e$ted to 3e higher.

Page 31 of 62

!.! Distortion com*utation for the set of candidate center locations


,istortion $om+(tation for the set of $andidate $enters is the ke! ste+ in the fast greed! k-means algorithm. The distortion /ith ea$h of the $andidate $enters 3eing inserted in the $(rrent set glo3al $l(sters needs to 3e $om+(ted. Th(s for ea$h $l(ster that is inserted: the distortion for ea$h of the $andidate $entroids is $om+(ted. "s the n(m3er of $andidate $entroids is large: this ste+ $an 3e$ome a 3ottlene$k in the +rogram. The n(m3er of s($h distortion $om+(tations might easil! e=$eed the total n(m3er of distortion $om+(tations done 3! the restri$ted filtering algorithm. "n effi$ient algorithm of this ste+ is th(s ver! im+ortant for the fast greed! algorithm. The $om+(tation of this distortion +arameter is also ver! similar to the restri$ted filtering algorithm. ?e (se the same kd-tree that /as defined in the +revio(s se$tion for the restri$ted filtering algorithm. The o(tline of the algorithm that $om+(tes distortion /ith a $entroid added is as follo/s. *al,istortion ;Node 8: Point +< S ,o an! initiali>ations to ( that are ne$essar!

9f the threshold $riteria for dire$t $om+(tation is met S ,ire$tl! $om+(te distan$es )et(rn T

9f the $entroid $losest to the $enter of 8 dominates + S ,istortion B same as $om+(ted in last k-means iterationE )et(rnE T

9f + is $loser to the $enter than the $losest $entroid and the $losest $entroid is not $ontained in the h!+er-re$tangle S 9f + dominates all the (nfiltered $entroids for node 8 $onsidered 3! the last k-means iteration S

Page 32 of 62

22all points in the region belong to p *om+(te distortion 22 all point assigned to the point p )et(rn T T

,o the ne$essar! initiali>ations to the $hild nodes: ,istortion B *aldistortion ;[Link]: +< J *aldistortion ;[Link]: +<E

)et(rnE

Page 33 of 62

Figure &7 *his figure sho8s 4 ossi,le insertion ositions. C ' and C% are the centroids that are not filtered. ;i< " ' is closer than the closest centroid and also dominates the unfiltered list of centroids= all oints assigned to " '. ;ii< "% is dominated ,y the closest centroid= 0istortion is the same as last com uted in 3means iteration. ;iii< "4 is not dominated ,y the closest centroid= cal0istortion is called recursively for oints of this ty e and those not classified as cases ;i< and ;ii<. "s mentioned +revio(sl!: it is im+ortant that the a3ove 3e im+lemented in an effi$ient /a!. "s a res(lt: /e save the statisti$s that have 3een alread! $om+(ted in the k-means iteration. These in$l(de the $losest $entroid in the iteration: the list of (nfiltered $entroids and the distortion as $om+(ted +revio(sl!. "lso d(ring im+lementation: /e (se a version of $al,istortion that does all the a3ove $om+(tations for all the +ossi3le insertion +oints sim(ltaneo(sl! so that the advantages of the lang(age and some $ommon $om+(tations like the distan$es of the $enters from all the +oints in $ase the threshold is satisfied $an 3e (sed to the 3est of the advantage.

Page 34 of 62

The $ondition for dire$t $om+(tation is slightl! different than in the restri$ted kmeans algorithm. 9n $al,istortion: /e not onl! have the old k-means $enters from /hi$h distan$es need to 3e $om+(ted 3(t also the insertion +ositions /hi$h are often more in n(m3er than the n(m3er of $enters. "s a res(lt: the threshold $ondition is also modified. ?e not onl! $he$k that the +rod($t of the n(m3er of +oints and the n(m3er of +oint satisfies the $riteria in the restri$ted k-means algorithm: 3(t also introd($e a ne/ threshold /hi$h validates the +rod($t of the n(m3er of insertion +oints and the n(m3er of +oints. ?e $all these thresholds threshold- and threshold. res+e$tivel!. 9t sho(ld 3e noted that the task a$$om+lished in this ste+ is ver! similar to the distortion $om+(tation in the restri$ted filtering algorithm e=$e+t for three things- 19nstead of kee+ing tra$k of the mean and the s(m of s@(ared norms of the +oints assigned to ea$h $l(ster: /e maintain the distortion for ea$h $l(ster from the +revio(s iterationE 2- ?e alread! kno/ /hi$h $l(ster ea$h of the +oints 3elongs toE 3- The list of $entroids that /ere (nfiltered in the last k-means iterations and the final distortion as /as o3tained is alread! $a$hed in the node. "s in the restri$ted filtering algorithms: there e=ist t/o /a!s of $om+(ting distortion. Do/ever: in $al,istortion: /e $om+(te the distortion for the original $l(ster $enters onl! and not for the (+dated list of $enters. 9t /o(ld have 3een highl! advantageo(s if /e had 3een a3le to $om+(te the distortion for the (+dated list of $entroids: as then /e /o(ld 3e $hoosing the insertion +oint 3ased on the distortion after the iteration and so the $hoi$e /o(ld 3e a 3etter a++ro=imation of the a$t(al insertion $enter. Do/ever: doing so in M"T2"& is not so effi$ient for m(lti+le sets of $l(sters. "s a res(lt /e al/a!s $om+(te the distortion 3ased on the initial set of $enters and not the (+dated $enters. C(rther dis$(ssion on the same is in the $ha+ter on 9m+lementation: se$tion 4.3. 9n this $ha+ter /e have given a detailed des$ri+tion of the fast greed! k-means algorithm and its variant that does advan$e distortion $om+(tation. "ll the ne/ ideas vi>.: the restri$ted filtering algorithm: advan$e distortion $om+(tation and the distortion $om+(tation /ith A(st one added $enter have 3een given d(e em+hasis. The ne=t $ha+ter gives a des$ri+tion of some of the im+lementation iss(es /hen im+lementing the fast greed! k-means algorithm.

Page 3# of 62

% Im*lementation
9n this $ha+ter /e dis$(ss the im+ortant iss(es that /e had to handle in o(r im+lementation of the vario(s algorithms dis$(ssed so far. ?e e=+lain the vario(s mathemati$al form(lae (sed. These in$l(de the $om+(tation of the ne/ $entroids and $om+(tation of the distortion- for 3oth the $(rrent and the (+dated list of $enters. ?e also dis$(ss ho/ /e $om+(ted ea$h of the terms in these form(lae in the a$t(al im+lementation. Cinall! /e give the 3asi$ fields in a single node of the kdtree and ho/ one $an $he$k if one of the $enters is dominating all the other $enters. "ll this data /ill also hel+ in re+li$ating the res(lts o3tained in the e=+erimental res(lts at other +la$es.

Page 36 of 62

%.1

)om*uting the ne, )entroids

The ne/ $entroid is the mean of all the data +oints that 3elong to this $entroid. "s /e assign the data +oints to a $entroid in 3(lk: for ea$h node in the tree /e have to $a$he the s(m of the +oints in its region. ?henever a node is assigned to a $entroid: the $orres+onding s(m needs to 3e added for that $entroid. The ne/ lo$ation of the $entroid is the total s(m divided 3! the total n(m3er of +oints assigned to this $entroid. Th(s $om+(tation of the ne/ $entroids $an 3e done 3! storingF$a$hing the statisti$s of the s(m of +oints in a node and the n(m3er of +oints in ea$h node. ?henever a node is assigned to a $enter: these statisti$s are added to the +ro+erties of that node. ?hen a dire$t $om+(tation is done 3e$a(se of the threshold $riteria 3eing satisfied: /e $an dire$tl! $om+(te the distan$e of ea$h of the +oints from the all the $enters in $onsideration.

Page 3% of 62

%.2

)om*uting the distortion

"s mentioned in $ha+ter 3: there are t/o /a!s in /hi$h the distortion $an 3e $om+(ted 3! maintaining the same statisti$s in a node. ?e restate the t/o form(lae for referen$e-

distortion =

1 | x |2 2 T ( x)x + N | ( x ) |2 ;2< N x x
2 2

distortion = x N ' ( x)
x

;4<

The statisti$s that are re@(ired for the distortion $om+(tation are ;i<. ;ii<. ;iii<. 7@(ared norms of +oints in the nodeE 7(m of +oints in the node ;also re@(ired for finding the ne/ $enter lo$ation<E N(m3er of +oints in the node ;also re@(ired for finding the ne/ $enter lo$ation<. The distortion /ith the (+dated $entroids $an 3e $om+(ted 3! maintaining the n(m3er of data +oints assigned to ea$h $enter: the s(m of +oints and the s(m of the norm s@(ared of these +oints. Tho(gh it a++ears that e@(ation ;4< is al/a!s 3etter to (se as it gives the distortion for the (+dated list of $entroids: there is one dra/3a$k /ith this e@(ation. The distortion $an 3e $om+(ted onl! after the $om+lete traversal of the tree is done: as the val(es of the (+dated $enters /o(ld onl! 3e kno/n then. &! $ontrast: e@(ation ;2< $an 3e (sed in the $om+(tation of distortion for ea$h of the h!+er-re$tangles individ(all!.

Page 30 of 62

%.!

)om*uting distortion for added center

Dere /e are referring to the $al,istortion ste+ vi>. ste+ 3 of the fast greed! k-means algorithm. 4ffi$ien$! is most im+ortant in this ste+. The n(m3er of distortion $om+(tations that /e have to do in this ste+ is m($h more than the distortion $om+(tations that /e do for the $onvergen$e of k-means. 8sing the strength of M"T2"&: /e do $om+(tation in matri$es for all the +ossi3le insertion +oints sim(ltaneo(sl!. 8sing e@(ation ;4< in this ste+ /o(ld 3e a$t(all! 3etter than (sing e@(ation ;2< as then /e are finding the distortion after one r(n of k-means iteration /hi$h is a 3etter a++ro=imation than $om+(ting the distortion /itho(t an! iterations. Do/ever for e@(ation ;3<: /e have to first e=+li$itl! $om+(te the s@(ared norm s(m and the n(m3er of +oints that are in the region of ea$h $l(ster. ?hen /e have a single +artitioning: this is sim+le. Do/ever /hen /e have to maintain the s(m for ea$h of the +ossi3le insertion +oints: the task 3e$omes diffi$(lt. Cor ea$h of the $andidate $enters: the s(m of the +oints ;/hi$h in t(rn is a ve$tor< for ea$h of the $l(sters needs to 3e maintained. This makes the matri= three-dimensional and diffi$(lt to im+lement. 9n $ontrast: the distortion in ea$h region $an 3e $om+(ted easil! (sing the e@(ation ;2< and onl! the distortion val(es s(mmed. "s a res(lt /e (se e@(ation ;2< for the distortion $om+(tation in the $al,istortion ste+. To (se e@(ation ;2< for the $om+(tation of the distortion: /e have to $a$he the s(m of the +oints assigned to a $entroid in a h!+er-re$tangle and the n(m3er of +oints that are $losest to the $orres+onding $entroids in a h!+er-re$tangle. "s the list of $entroids that are to 3e $onsidered d(ring the $al$(lation $annot 3e kno/n d(ring the re$(rsion of $al,istortion: the list of (nfiltered $entroids last $onsidered ;3! kmeans< is also storedF$a$hed in the nodes. The im+lementation of $al,istortion in M"T2"& a$$e+ts a set of +oints instead of a +oint and $om+(tes the distortion /ith ea$h of the +oints added. The ret(rn val(e of $al,istortion is a ve$tor re+resenting the distortion ;in its region< /ith the res+e$tive +oints added to its list of $entroids. The algorithm for the same is de+i$ted in the follo/ing*al,istortion ;Node 8: Points P< S ,o an! initiali>ations to 8 that are ne$essar!

9f the threshold1 and threshold2 $riteria for dire$t $om+(tation is met S ,ire$tl! $om+(te distan$es

Page 31 of 62

)et(rn T for the Points in P /hi$h are dominated 3! $entroid $losest to the $enter of 8 S ,istortion B same as $om+(ted in last k-means iterationE )et(rnE T 9f $losest $entroid not $ontained in 8 S Cor Points in P dominating all the (nfiltered $entroids of the last k-means iteration for node 8 S 22all points in the region belong to these candidate centers in 3 *om+(te distortion 22 all point assigned to the new center in 3 )et(rn T T

,o the ne$essar! initiali>ations to the $hild nodes: ,istortion B *aldistortion ;[Link]: Pnot*om+(ted < J *aldistortion ;[Link]: Pnot*om+(ted <E

)et(rnE

Page 4. of 62

%.%

&aintaining the "ree

"s the kd-tree is a 3inar! tree and also kee+ing in mind the vario(s statisti$s for kmeans and distortion $om+(tation for the added $enter: the follo/ing fields are ne$essar! in a nodeHmax Hmin X //the upper bound of the hyper-rectangle //the lower bound of the hyper-rectangle //the list of points in the hyper-rectangle,

//necessary so that they could be assigned to the //children. Leftchild Rightchild Sum Normsq //sum of all the points in the hyper-rectangle //sum of the norm squared of all the points in //the

hyper-rectangle C Csum W //indices of centroids not filtered //sum of points assigned to the various centroids //number of points assigned to various centroids

"lso: the node needs to 3e initiali>ed onl! on$e to $om+(te the mean and other fields. "fter that it /o(ld onl! need to 3e traversed.

Page 41 of 62

- ./*erimental results
?e +erformed a series of $aref(ll! +re+ared tests on s!ntheti$ data generated (sing M"T2"&E s!ntheti$ data generated (sing a /ell-kno/n algorithm as in 51#6 and on some /ell-kno/n data sets. The vario(s e=+eriments +erformed $om+are the effe$t of vario(s +arameters of the algorithm and its variant /ith advan$ed distortion $om+(tation s(ggested in this +a+er. The e=+eriments also $om+are the algorithm to the greed! glo3al k-means algorithm as in 516. 9n all the tests that /e +erform: /e $om+are the distortion and time of the algorithm. Tests have 3een designed in order to meas(re the effe$t of the vario(s +arameters in the algorithm. These in$l(de the effe$t of n(m3er of +oints in dataset: dimensionalit! of data: n(m3er of $entroids and the threshold val(es for the dire$t $om+(tation. The distortion is $onsidered /hen $onsidering the affe$t of the n(m3er of leaves of the kd-tree on the @(alit! of the sol(tion. "ll the tests are $ond($ted on a *eleron 6.. MD> ma$hine /ith 120 M& )"M. M"T2"& version #.2 on ?indo/s 10 is (sed for the tests.

Page 42 of 62

-.1

Dataset generated using &A"0A'

?e generated a set of datasets (sing M"T2"& for testing the effe$t of vario(s +arameters and si>e of the +ro3lem on the time taken 3! the algorithm. &! $hoosing the re@(ired n(m3er of $l(ster $enters randoml! in the region 5.:1..6 and then $reating a re@(ired n(m3er of +oints /ith a normal distri3(tion aro(nd these +oints /e $reate the datasets on /hi$h the e=+eriments are done.

-.1.1

(ariation of number of insertion *oints

8nder this $ategor! of tests: the effe$t of the n(m3er of insertion +oints on the sol(tion and the time taken for the $om+(tation /as eval(ated. Daving a smaller n(m3er of insertion +oints might affe$t the @(alit! of the sol(tion: as it might not 3e glo3all! o+timal. &(t having man! insertion +ositions /o(ld res(lt in the $al,istortion ste+ taking m($h longer. #.... +oints /ere generated /ith #. $l(ster $enters and 1... +oints 3eing generated aro(nd them /ith a normal distri3(tion. The res(lts are dis+la!ed the fig(re 3elo/.

Page 43 of 62

Figure 57 +ariation in num,er of insertion oints ;5( centers< "s e=+e$ted the @(alit! of the sol(tion im+roves /ith in$reasing n(m3er of insertion +oints 3(t onl! to some val(e ;B1#. in fig #< . "fter that the distortion is more or less the same. The time taken for the $om+(tation in$reases /ith in$reasing n(m3er of insertion +oints. &ased on the res(lts of these e=+eriments: /e set the n(m3er of insertion +oints in o(r other e=+eriments to 3e the same as the n(m3er of $enters.

-.1.2

(ariation of threshold1

Threshold1 is the threshold for dire$t $om+(tation of the distan$es in the restri$ted filtering algorithm. Daving a small threshold1 val(e /ill make the restri$ted filtering algorithm $loser to the filtering algorithm as in 546. " large threshold /ill make the algorithm similar to the nave k-means algorithm. ?e $ond($ted the e=+eriments on t/o different dimensions to also see the variation in threshold 3ased on the dimension: as /e are a/are from +revio(s res(lts in 546 that the filtering algorithm

Page 44 of 62

does not +erform ver! /ell for high dimensions. The n(m3er of insertion +oints /as set to the n(m3er of $enters. The res(lts are sho/n 3elo/.

Figure -7 *hreshold' vs. *ime ta3en. % 0 data

Figure /7 *hreshold' vs. time ta3en. - 0 data

Page 4# of 62

The res(lts o3tained in the e=+eriments sho/ that there e=ists an o+tim(m val(e of the threshold1 for /hi$h the algorithm /o(ld take the least time. The val(e of threshold1 /o(ld +ro3a3l! de+end on the lang(age of im+lementation: as the nave k-means algorithm might 3e highl! effi$ient $om+ared to the im+lementation of the filtering algorithm. Cor e=am+le in M"T2"&: $om+(tation of distan$es for all the +oints to all the $enters $an 3e im+lemented ver! effi$ientl! in a ve$tori>ed fashion /hile the filtering algorithm /o(ld re@(ire several $om+(tations.

-.1.!

(ariation of threshold2

4=+eriments /ere also done to see the affe$t of the variation in the threshold2: /hi$h is the +rod($t of the n(m3er of $andidate $enters in $onsideration and the n(m3er of +oints. The affe$t on the val(e of the time taken for the algorithm to e=e$(te is $onsidered. The data had #.... +oints and /as generated 3! $reating an e@(al n(m3er of +oints /ith normal distri3(tion aro(nd ea$h of these #. $enters. The threshold1 val(e /as set at 1.... in these e=+eriments. The res(lts are dis+la!ed in the follo/ing gra+hs.

Figure )7 *hreshold% vs. *ime ta3en. % 0 data, threshold' fi6ed as '((((

Page 46 of 62

Figure >7 *hreshold% vs. *ime ta3en. - 0 data, threshold' fi6ed as '(((( The res(lts sho/ that as /ith threshold1: there also e=ists an o+timal val(e for the threshold2. Do/ever the variation is not ver! high and is generall! not more than 1. +er$ent of the minim(m time taken if the threshold2 is not set ver! less $om+ared to threshold1.

-.1.%

.ffect of number of *oints

8nder this $ategor! of tests: test r(ns are +erformed to see ho/ the vario(s algorithms s$ale /ith in$reasing n(m3er of +oints. The three algorithms that /e tested /ere the greed! glo3al algorithm in 516: the fast greed! algorithm dis$(ssed in this thesis and the fast greed! algorithm /ith advan$ed distortion $om+(tation in the restri$ted filtering algorithm. The fast greed! algorithm /ith advan$ed distortion $om+(tation is e=+e$ted to +erform marginall! 3etter than the algorithm /ith dire$t $om+(tation as it has one iteration of the restri$ted filtering algorithm less d(ring the $onvergen$e of k-means. The res(lts of the same are sho/n in the follo/ing gra+h.

Page 4% of 62

Figure '(7 +ariation in num,er of oints. % 0

"s e=+e$ted the $om+le=it! of greed! glo3al algorithm in 516 in$reases linearl! /ith time. "lgorithm in 516 also re@(ires a lot of memor! and $o(ld not r(n for more than #.... +oints. The $om+le=it! of the fast greed! algorithm does not in$rease as fast and the rate of in$rease also a++ears to fall /ith in$rease in the n(m3er of +oints. The fast greed! algorithm also fares 3etter than the greed! glo3al. Tho(gh it might 3e e=+e$ted that the distortion for the fast greed! algorithm /ith advan$e distortion $om+(tation might not 3e as good as fast greed! algorithm sin$e the n(m3er of iterations in ea$h ste+ is one less: that /as not so. The distortion of all the three $ases /as almost identi$al.

-.1.-

.ffect of dimensionalit$

8nder this $ategor! of tests the effe$t of variation in the dimensionalit! of the data on the time taken 3! the algorithm is eval(ated. The tests involve #.... +oints and #. $l(sters. The average time taken for 2 r(ns is +lotted in the gra+h 3elo/.

Page 40 of 62

Figure '' *(rio(sl! for normal data: variation in dimension does not seem to have m($h effe$t on the time taken for the fast greed! algorithms. "lso time taken for the algorithm in 516 falls initiall! /ith in$rease in dimensionalit!. These res(lts /ere $onsistent a$ross s($$essive r(ns on data generated in a similar fashion. This might 3e 3e$a(se /e are testing the data of #.... re$ords and the algorithm in 516 re@(ires lot of memor! and therefore might 3e $reating a lot of +age fa(lts. &! $ontrast: the fast greed! algorithm is highl! memor! effi$ient /orks ver! /ell even for 16 dimensional data.

-.1.1

.ffect of number of centers

The effe$t of variation of the n(m3er of $enters on the time $om+le=it! of the algorithm is seen in these tests. "s 3efore /e generate normal data. The total n(m3er of +oints is set at #..... The data in this $ase is 2 dimensional. Tests /ere not $ond($ted on the algorithm in 516 as it re@(ired lot of memor! and $o(ld not r(n even for 1.. $enters /ith #.... +oints.

Page 41 of 62

Figure '% "s e=+e$ted the n(m3er of $enters affe$ts the $om+le=it! of the algorithm more or less in a @(adrati$ /a!. This is also e=+e$ted as a$$ording to a res(lt in 546 the n(m3er of nodes e=+anded in a k-means iteration in filtering algorithm is not more than O(nk2/ 2). Th(s for ea$h k-means iteration: the time $om+le=it! is e=+e$ted to 3e of that order. "s the n(m3er of $l(sters in$reases one 3! one: the time $om+le=it! of the fast greed! algorithm is e=+e$ted to 3e-

n nk O 2 = O 2 k =1
kmax

2 nk max n k max (k max + 1) k = O = O 2 2 2 k =1 k max

Th(s /ith in$reasing n(m3er of $enters to $onverge: the restri$ted k-means algorithm also takes more time. "s a res(lt of this: the fast greed! algorithm $annot 3e (sed for large n(m3er of $l(ster $enters. Do/ever o(r restri$ted k-means algorithm $an still 3e (sed.

Page #. of 62

-.2

Data generated using Dasgu*ta 21-3


a(ssian distri3(tion given +arameters like

9n this set of e=+eriments the dataset is generated (sing the algorithm in 51#6. The algorithm generates a random e$$entri$it!: se+aration degree: dimensionalit! and n(m3er of +oints. Tests are done to $om+are the greed! glo3al k-means algorithm /ith the algorithms +ro+osed in this +a+er.

-.2.1

(ariation in dimensionalit$
a(ssian mi=t(re of e$$entri$it! 2: and

The effe$t of dimensionalit! of data on the time taken is $om+(ted. #.... +oints aro(nd 3. $enters are generated /ith a in the follo/ing diagrams. var!ing degrees of se+aration 3ased on the lines in 51#6. The res(lts o3tained are as

Page #1 of 62

Figure '47 *ime +s 0imension, 5(((( oints, 4( centers, eccentricity %

Page #2 of 62

8nlike the $ase /ith normall! distri3(ted data (niforml! /ith #.... +oints and #. $enters ;the +revio(s $ategor! of tests<: the in$rease in dimensionalit! hardl! affe$ts the time $om+le=it! of the algorithm in 516 3(t drasti$all! in$reases the time $om+le=it! of the fast greed! algorithm. The algorithm in 516 +erforms 3etter than the fast greed! algorithm for higher dimensions. This is also e=+e$ted: as kd-trees do not +erform /ell for high dimensional data.

-.2.2

(ariation in number of *oints

The variation of the n(m3er of +oints is tested in this $ategor!. )andom +oints are generated aro(nd 3. $enters /ith an e$$entri$it! 2 and se+aration degree of 2 in t/o dimensions. The n(m3er of +oints ranged from 2.... to 16..... The res(lts o3tained are ill(strated in the diagram 3elo/.

Figure '&7 *ime vs. Num,er of oints, 4( centers % 0, eccentricity %, se aration degree % The res(lts in this $ase are similar to those o3tained /ith s!ntheti$ data generated (sing M"T2"& alone.

Page #3 of 62

-.2.!

(ariation in )luster 4e*aration

8nder this $ategor! of tests: /e varied the $l(ster se+aration of the $l(sters 3eing generated to see the affe$t of the vario(s algorithms in them. #.... +oints /ere generated aro(nd 3. $enters /ith e$$entri$it! 1 and var!ing $l(ster se+aration. The res(lts o3tained are ill(strated in the fig(re 3elo/.

Figure '57 +ariation in se aration, 4( centers, 5(((( oints, eccentricity % The de$rease in time taken for in$rease in the se+aration is e=+e$ted for t/o reasons. ?ith the in$rease in the se+aration of the $l(sters: the n(m3er of iterations needed for $onvergen$e in k-means /o(ld 3e lessE The $ost of Cast reed! a++roa$hes is also likel! to de$rease 3e$a(se of the

$om+le=it! res(lt +roved in 546: vi>.: the ma=im(m n(m3er of nodes e=+anded in a kmeans iteration on filtering algorithm is O(nk2/ 2), /here nodes e=+anded is e=+e$ted to 3e m($h less. The red($tion in the time taken 3! the algorithm in 516 is onl! 3e$a(se of the first reason stated a3ove. The fast greed! a++roa$h /o(ld re@(ire less time 3e$a(se of 3oth the a3ove reasons. 9n fa$t it a++ears that the affe$t of 3oth of these is e@(al

is the se+aration

degree of the $l(sters. "s /e have a restri$ted k-means a++roa$h: the n(m3er of

Page #4 of 62

sin$e /hile the algorithm in 516 red($ed the time $om+le=it! 3! a fa$tor of A(st t/o: the time $om+le=it! of the fast greed! a++roa$hed red($ed 3! a fa$tor of fo(r. The $(rves also seem to follo/ the r(le of 3eing +ro+ortional to (1/ 2 + constant).

Page ## of 62

-.!

Dataset from 5)I &0 Re*ositor$

Most of the +(3li$l! availa3le real /orld datasets have ver! fe/ re$ords. ?e eval(ated o(r method on the letter-re$ognition dataset of the 9rvine dataset. This dataset $onsists of 2.... data elements in 16 dimensions: /hi$h $onsist of a large n(m3er of 3la$k-and-/hite re$tang(lar +i=el dis+la!s as one of the 26 $a+ital letters in the 4nglish al+ha3et. The $hara$ter images /ere 3ased on 2. different fonts and ea$h letter /ithin these 2. fonts /as randoml! distorted to +rod($e a file of 2.:... (ni@(e stim(li1. This is the largest n(m3er of re$ords that an! of the datasets in the re+ositor! have. "s it /as kno/n that the algorithm does not s$ale /ell for high dimension data: the e=+eriments /ere also $ond($ted 3! $onsidering onl! first fe/ dimensions of the data as /ell. ?e also e=+erimented /ith var!ing val(es of the threshold. The res(lts are sho/n in the follo/ing ta3le.

Threshold1 val(e #... 1#... 2#... 3#... #.... %#... 1..... 1#.... 2..... reed! glo3al 516

Time Taken ;se$onds< for Cast greed! algorithm /ith advan$ed distortion $om+(tation "ll 16 dimensions Cirst 0 dimensions Cirst 4 dimensions Cirst 2 dimensions 3.4.1 23#.3 22%.2 22%.0 22#.6 220.1 22%.6 226.1 220.2 224.6 106.4 10..1 1%%.1 1%4.4 1%2.# 1%1.1 1%3.1 1%2.3 1.2.4 1.#.4 111.% 11%.6 123.1 132.1 136.2 13#.0 131.4 41.. 40.6 #%.1 #1.6 64.4 %2.. %6.# 0..4 03.3

130.3

136.1

1.0.6

62.1

The res(lts o3tained are on e=+e$ted lines kee+ing in mind that the fast greed! algorithm is not effi$ient for large dimensions and for less n(m3er of +oints. The res(lts also sho/ that /ith in$reasing dimensions: the val(e of threshold1 in$reases. Cor A(st 2 dimensions: the 3est sol(tions are o3tained /ith a threshold1 as lo/ as #.... ?ith higher dimensions ;0 and 16<: the sol(tions are 3etter for a larger val(e of
1

Cor a more detailed des$ri+tion of the data: see the 9rvine M2 ,ata re+ositor! at

htt+-FF///.i$s.($[Link](FWmlearnFM2)e+ositor!.html

Page #6 of 62

threshold1. 9f the n(m3er of +oints is less: 2.... for e=am+le: and the n(m3er of dimensions is also less: then /hi$h algorithm vi>.: the greed! glo3al algorithm in 516 or the fast greed! algorithm: is faster de+ends on the kind of data. Dere: the fast greed! algorithm +erforms s(3stantiall! 3etter for 2 dimensions. Do/ever for data that /as generated as des$ri3ed in 51#6: the fast greed! algorithm /as not so m($h faster for so fe/ +oints even in the $ase of 2 dimensional data ;refer fig(re 0<. Do/ever /hen the n(m3er of +oints is large and dimensionalit! lo/: it is al/a!s 3etter to (se the fast greed! algorithm.

Page #% of 62

1 )onclusion
?e have +resented a ne/ k-means 3ased algorithm in this thesis that gives (s near glo3al sol(tions effi$ientl!. T/o main dra/3a$ks of the k-means algorithm vi>.: $onvergen$e to lo$al minima and non-s$ala3ilit! to large n(m3er of +oints have 3een effe$tivel! handled. The im+lementation of the algorithm is also sim+le. The +ro+osed algorithm is a $om3ination of e=isting /orks 3(t also introd($es some ne/ ideas. "n effi$ient method of $om3ining the restri$ted filtering algorithm and the greed! glo3al algorithm in 516 has 3een the ste++ing-stone in 3eing a3le to develo+ a fast greed! glo3al algorithm. The fast distortion $om+(tation /ith an added $enter is a3le to $om+(te the distortion for the large n(m3er of $andidate $enters. This is done effi$ientl! even /hen the n(m3er of insertion +oints /as m($h more than the total n(m3er of k-means $l(sters. "lso as the k-means $enters for all k Q k are also eval(ated: the algorithm $an also 3e (sed to eval(ated the minim(m n(m3er of $entroids that satisf! a given $ondition. The e=+erimental res(lts on 3oth the s!ntheti$ and real data s(ggest that the algorithm +erforms ver! /ell for lo/ dimensional data 3(t not so /ell for data e=$eeding 6 dimensions. "lso the algorithm s(ggested in 516 is faster for less n(m3er of +oints. This is so as the greed! glo3al algorithm in 516 $a$hes the distan$es from ea$h of the +ossi3le insertion +oints to ea$h of the data +oints. The magnit(de of data $a$hed is so large that this algorithm is not a3le to r(n even for medi(m si>ed data of an order 3e!ond #.... +oints and #. $enters. &! $ontrast: the fast greed! algorithm is memor! effi$ient and s$ales ver! /ell /ith large n(m3er of +oints. 9f the algorithm is to 3e (sed onl! for less n(m3er of +oints $a$hing the distan$es from the insertion +oints to the data +oints /hen $om+(ted 3! a node $an 3e $a$hed: as in 516. ?hen this is done: the fast greed! algorithm /ith $a$hing is likel! to r(n faster than the algorithm in 516 if the thresholds are set a++ro+riatel!. Do/ever: all the algorithms dis$(ssed: in$l(ding the one in 516: fare ver! 3adl! in terms of the n(m3er of $enters. The $ost goes (+ @(adrati$all! /ith the n(m3er of $enters. "s a res(lt the algorithm $annot 3e (sed for more than a fe/ h(ndred $enters /hen im+lemented in M"T2"&. Do/ever an im+lementation in H* is likel! to 3e (+ to #. M 1.. times faster making the algorithm feasi3le even for $enters of the order of a fe/ tho(sands. 7ome of the ideas introd($ed in this thesis $an 3e (sed even at +la$es /here it is not +ra$ti$al to (se the fast greed! algorithm 3e$a(se of large n(m3er of $enters

Page #0 of 62

involved. The restri$ted filtering algorithm introd($ed in $ha+ter 3 s$ales ver! /ell for all si>es of n(m3er of $enters and n(m3er of +oints. This method $an 3e (sed /hen one is dealing /ith millions of re$ords in lo/ dimension. This algorithm is also likel! to fare /ell for higher dimensions /ith an in$rease in the threshold val(es.

Page #1 of 62

6 7urther Research
*(rrentl! /e are (sing the $enter of the +artitions formed 3! a kd-tree as in 516 as the insertion +osition for the ne/ $enters. The $enters are (sed as the! give a good +reliminar! $l(stering of the data. Gther /a!s $o(ld 3e em+lo!ed to get a good +reliminar! $l(stering. Cor e=am+le to (se $enters that are s(ggested for initiali>ation in 516 as the insertion +ositions. The idea of having the insertion +oints is onl! to have an a++ro+riate set of insertion lo$ations so that one of the lo$ations al/a!s satisfies the a++ro+riate lo$ation assu#ption ;refer *ha+ter 3< (sed 3! the algorithm. The initiali>ation +oints s(ggested in 516 /o(ld +ro3a3l! give a good +reliminar! $l(stering of data to start /ith instead of the kd-tree as in 516. "nother e=tension of the /ork /o(ld 3e to $ome (+ /ith a form(la for $hoosing the a++ro+riate threshold val(es: a form(la that /o(ld /ork for all dimensions: n(m3er of +oints and $enters. Dere /e have onl! 3een a3le to e=+erimentall! see that an o+tim(m threshold val(e e=ists: 3(t not 3een a3le to $ome to the e=a$t val(e. 9t is also likel! that the thresholds de+end on the t!+e of data. More resear$h is needed for deriving the form(la. The $(rrent algorithm +erforms 3adl! /ith in$reases in dimensionalit!. This is 3e$a(se the most im+ortant data str($t(re (sed in the algorithm: the kd-tree: does not s$ale /ell /ith in$reases in dimension. "n alternative /a! /o(ld 3e to (se some other data str($t(res instead of a kd-tree: like ", trees in 5#6: that s$ale /ell /ith in$reases in dimensionalit!.

Page 6. of 62

8 References
516 "ristidis 2ikas: Nikos 'lassis and Xa$o3 X. 'er3eek- The glo3al k-means $l(stering algorithm. 9n 3attern 'ecognition 'ol 36: No 2: 2..3

526 ,an Pelleg and "ndre/ Moore- "$$elerating e=a$t k-means algorithms /ith geometri$ reasoning.;Te$hni$al )e+ort *M8-*7-..1.#<: *arnegie Mellon 8niversit!: Pitts3(rgh: P".

536 ,an Pelleg and "ndre/ Moore- V-means- 4=tending k-means /ith effi$ient estimation of the n(m3er of $l(sters. 9n 3roceedings of the !eventeenth International )onference on $achine 4earning: Palo "lto: *": X(l! 2....

546 T. Kan(ngo: ,.M. Mo(nt: N.7. Netan!ah(: *. Piatko: ). 7ilverman and ".Y. ?(- "n 4ffi$ient k-means $l(stering algorithm- anal!sis and im+lementation. 9n I555 &ransactions /n 3attern 6nal%sis 6nd $achine Intelligence 'ol. 24: No. %: X(l! 2..2: ++. 001-012

5#6 ,an Pelleg and "ndre/ Moore- *a$hed s(ffi$ient statisti$s for effi$ient ma$hine learning /ith large datasets. 9n 7ournal of 6rtificial Intelligence 'esearch : 0-6%-11: 1110.

566 7ongrit Manee/ongvatana and ,avid M. Mo(nt- "nal!sis of a++ro=imate nearest neigh3or sear$hing /ith $l(stered +oint sets. 9n 8orkshop on 6lgorith# 5ngineering and 5*peri#ents +6459N59199,: Xan(ar! 1111

5%6 "ndre/ ?. Moore- "n introd($tor! t(torial on kd-trees: 5*tract fro# 6ndrew $oore1s 3h: &hesis: 5fficient $e#or%(based learning for 'obot )ontrol : Ph, ThesisE Te$hni$al )e+ort No. 2.1: *om+(ter 2a3orator!: 8niversit! of *am3ridge: 1111

506 Kan ,eng and "ndre/ Moore- M(ltiresol(tion instan$e-3ased learning: 9n &he 3roceedings of I7)6I(9;: +ages 1233-1242. Morgan Ka(fmann: 111#

Page 61 of 62

516 P.7. &radle! and 8sama M. Ca!!ad- )efining initial +oints for k-means $l(stering. 9n 3roceedings Fifteenth International )onference on $achine 4earning : +ages 11-11: 7an Cran$is$o: *": 1110: Morgan Ka(fmann.

51.6 Khaled "lsa3ti: 7anAa! )anka and 'ineet 7ingh- "n 4ffi$ient k-means $l(stering algorithm. 9n 3roceedings of the First 8orkshop on <igh 3erfor#ance :ata $ining : Grlando: C2: Mar$h 1110.

5116 X. Mato(Zek. Gn the a++ro=imate geometri$ k-$l(stering. :iscrete and )o#putational =eo#etr% 24-61-04: 2...

5126 X.2. &entle!: M(ltidimensional ,ivide and *on@(er. )o##unications of the 6)$ 23;4<-214-221: 110.

5136 X.D. Criedman: X.2. &entle! and ).". Cinkel. "n "lgorithm for Cinding &est Mat$hes in 2ogarithmi$ 4=+e$ted Time. 6)$ &rans on $athe#atical !oftware, 3;3<-2.1-226: 7e+tem3er 11%%.

5146 7.M. Gmoh(ndro: 4ffi$ient "lgorithms /ith Ne(ral Net/ork &ehavio(r. 7ournal of )o#ple* !%ste#s: 1;2<-2%3-34%: 110%.

51#6 7anAo! ,asg(+ta: 2earning mi=t(re of "lamitos: *": 1111: ++. 634M644.

a(ssians. >0th 6nnual !%#posiu# on

Foundations of )o#puter !cience +New ?ork, N?, -999 <: 9444 *om+(t. 7o$. Press: 2os

Page 62 of 62

You might also like