Reference: http://en.wikipedia.org/wiki/K-means_clustering 


  k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. It partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean, where "nearest" is measured by squared Euclidean distance. Mathematically, this means partitioning the observations according to the Voronoi diagram generated by the means. 


 k-means clustering is an NP-hard problem; however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These typically use iterative refinement. The standard k-means clustering algorithm is described below. 




As mentioned above, the most common algorithm uses an iterative refinement technique. Given an initial set of k means m[1], ... , m[k], the algorithm proceeds by alternating between two steps: an assignment step and an update step. 


Assignment step: Assign each observation to the cluster whose mean yields the least within-cluster sum of squares. Since the sum of squares is the squared Euclidean distance, this is intuitively the "nearest" mean. 


Update step: Calculate the new means to be the centroids of the observations in the new clusters. 
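The two steps above can be sketched in NumPy as follows. This is a minimal illustration of the standard algorithm, not a production implementation; the function name `kmeans` and the simple Forgy-style initialization are choices made here for the example (empty clusters are not handled).

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal sketch of the standard k-means iteration."""
    rng = np.random.default_rng(seed)
    # Initialize with k randomly chosen observations (Forgy-style).
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: label each observation with the nearest mean,
        # using squared Euclidean distance.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each mean to the centroid of its cluster.
        # (Sketch only: an empty cluster would produce a NaN centroid.)
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_means, means):
            break  # assignments no longer change; converged
        means = new_means
    return means, labels
```

Note that the loop stops when the means stop moving, which is exactly when the assignments stop changing.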


Initialization

  The Forgy and Random Partition methods are commonly used for initialization. The Forgy method randomly chooses k observations from the data set and uses these as the initial means. The Random Partition method first randomly assigns a cluster to each observation and then calculates the new means as the centroids of the observations in the new clusters. 

  The Forgy method tends to spread the initial means out, while Random Partition places all of them close to the center of the data set. For the standard k-means algorithm, the Forgy method is preferable as an initialization technique. 
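The two initialization schemes can be sketched as below; the helper names `forgy_init` and `random_partition_init` are made up for this example.

```python
import numpy as np

def forgy_init(X, k, rng):
    # Forgy: use k randomly chosen observations as the initial means.
    # These are actual data points, so they tend to be spread out.
    return X[rng.choice(len(X), size=k, replace=False)]

def random_partition_init(X, k, rng):
    # Random Partition: give each observation a random cluster label,
    # then use each cluster's centroid as its initial mean.
    # Averages of random subsets land near the center of the data set.
    # (Sketch only: an empty cluster would produce a NaN centroid.)
    labels = rng.integers(k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])
```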


Convergence

 As it is a heuristic algorithm, there is no guarantee that it will converge to the global optimum, and the result may depend on the initial clusters. Because the algorithm is usually very fast, it is common to run it multiple times with different starting conditions. However, in the worst case, k-means can be very slow to converge: it has been shown that there exist point sets, even in two dimensions, on which k-means takes exponential time to converge. 
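The usual remedy for convergence to a poor local optimum, running the algorithm several times and keeping the best result, can be sketched as follows. The function names `lloyd` and `kmeans_restarts` are invented for this example; "best" here means the run with the lowest within-cluster sum of squares.

```python
import numpy as np

def lloyd(X, k, rng, max_iter=100):
    """One k-means run from a random (Forgy-style) start.

    Returns (means, labels, within-cluster sum of squares)."""
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        d = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, means):
            break
        means = new
    inertia = d[np.arange(len(X)), labels].sum()
    return means, labels, inertia

def kmeans_restarts(X, k, n_init=10, seed=0):
    # Run k-means n_init times with different starting conditions and
    # keep the solution with the lowest within-cluster sum of squares.
    rng = np.random.default_rng(seed)
    return min((lloyd(X, k, rng) for _ in range(n_init)),
               key=lambda result: result[2])
```

Libraries take the same approach; for instance, scikit-learn's `KMeans` exposes the number of restarts as its `n_init` parameter.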

