kmeans算法用Python怎么实现

如题所述

函数

loadDataSet(fileName)
从文件中读取数据集
distEclud(vecA, vecB)
计算距离,这里用的是欧氏距离,当然其他合理的距离都是可以的
randCent(dataSet, k)
随机生成初始的质心,这里是虽具选取数据范围内的点
kMeans(dataSet, k, distMeas=distEclud, createCent=randCent)
kmeans算法,输入数据和k值。后面两个事可选的距离计算方式和初始质心的选择方式
show(dataSet, k, centroids, clusterAssment)
可视化结果

    1 #coding=utf-8 2 from numpy import * 3  4 def loadDataSet(fileName): 5     dataMat = [] 6     fr = open(fileName) 7     for line in fr.readlines(): 8         curLine = line.strip().split('\t') 9         fltLine = map(float, curLine)10         dataMat.append(fltLine)11     return dataMat12     13 #计算两个向量的距离,用的是欧几里得距离14 def distEclud(vecA, vecB):15     return sqrt(sum(power(vecA - vecB, 2)))16 17 #随机生成初始的质心(ng的课说的初始方式是随机选K个点)    
    18 def randCent(dataSet, k):19     n = shape(dataSet)[1]20     centroids = mat(zeros((k,n)))21     for j in range(n):22         minJ = min(dataSet[:,j])23         rangeJ = float(max(array(dataSet)[:,j]) - minJ)24         centroids[:,j] = minJ + rangeJ * random.rand(k,1)25     return centroids26     27 def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):28     m = shape(dataSet)[0]29     clusterAssment = mat(zeros((m,2)))#create mat to assign data points
    30                                       #to a centroid, also holds SE of each point31     centroids = createCent(dataSet, k)32     clusterChanged = True33     while clusterChanged:34         clusterChanged = False35         for i in range(m):#for each data point assign it to the closest centroid36             minDist = inf37             minIndex = -138             for j in range(k):39                 distJI = distMeas(centroids[j,:],dataSet[i,:])40                 if distJI < minDist:41                     minDist = distJI; minIndex = j42             if clusterAssment[i,0] != minIndex:
    43                 clusterChanged = True44             clusterAssment[i,:] = minIndex,minDist**245         print centroids46         for cent in range(k):#recalculate centroids47             ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster48             centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean
    49     return centroids, clusterAssment50     51 def show(dataSet, k, centroids, clusterAssment):52     from matplotlib import pyplot as plt  
    53     numSamples, dim = dataSet.shape  
    54     mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']  
    55     for i in xrange(numSamples):  
    56         markIndex = int(clusterAssment[i, 0])  
    57         plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])  
    58     mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']  
    59     for i in range(k):  
    60         plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize = 12)  
    61     plt.show()62       63 def main():64     dataMat = mat(loadDataSet('testSet.txt'))65     myCentroids, clustAssing= kMeans(dataMat,4)66     print myCentroids67     show(dataMat, 4, myCentroids, clustAssing)  
    68     69     70 if __name__ == '__main__':71     main()

    这里是聚类结果,还是很不错的啦

    但是有时候也会收敛到局部最小值,就像下面这样,就是不幸收敛到局部最优了

温馨提示:答案为网友推荐,仅供参考
相似回答