Abstract: To address the low accuracy and slow convergence of the K-means clustering algorithm, an improved K-means algorithm based on optimized sample clustering, named OSCK (Optimization Sampling Clustering K-means) algorithm, was proposed. Firstly, multiple samples were drawn from the massive data set by probability sampling. Secondly, based on a Euclidean-distance similarity principle for optimal clustering centers, the sample clustering results were modeled and evaluated, and sub-optimal sample clustering results were removed. Finally, the final k clustering centers were obtained by weighted integration of the evaluated clustering results, and these k centers were used as the cluster centers of the big data set. Theoretical analysis and experimental results show that, compared with the baseline algorithms, the proposed method achieves better clustering accuracy on massive data and has strong robustness and scalability.
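The three-step pipeline summarized in the abstract (probability sampling, evaluation of sample clustering results by Euclidean distance, and weighted integration of centers) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the helper names (`kmeans`, `osck`), the choice of within-cluster sum of squares as the evaluation score, the inverse-inertia weights, and the nearest-neighbour alignment of centers across runs are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, inertia)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)  # (n, k) distances
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    inertia = (np.linalg.norm(X - centers[labels], axis=1) ** 2).sum()
    return centers, inertia

def osck(X, k, n_samples=5, sample_frac=0.2, keep=3, seed=0):
    rng = np.random.default_rng(seed)
    m = int(sample_frac * len(X))
    # Step 1: probability sampling -- cluster several random subsets.
    runs = []
    for s in range(n_samples):
        idx = rng.choice(len(X), m, replace=False)
        runs.append(kmeans(X[idx], k, seed=s))
    # Step 2: evaluate each sample result by its Euclidean-distance score
    # (inertia here) and discard the sub-optimal runs.
    runs.sort(key=lambda r: r[1])
    runs = runs[:keep]
    # Step 3: weighted integration -- align each kept run's centers to the
    # best run's centers by nearest neighbour; weight runs by 1/inertia.
    ref = runs[0][0]
    acc = np.zeros_like(ref)
    wsum = np.zeros(k)
    for centers, inertia in runs:
        w = 1.0 / (inertia + 1e-12)
        # For each reference center, take the nearest center of this run.
        nearest = np.linalg.norm(ref[:, None] - centers[None], axis=2).argmin(axis=1)
        acc += w * centers[nearest]
        wsum += w
    return acc / wsum[:, None]  # final k centers for the full data set
```

The returned centers would then seed a single assignment pass over the full data set, so the expensive iterative clustering only ever runs on the small samples.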

Key words: big data, K-means, probability sampling, Euclidean distance, clustering accuracy

[1] WU X, ZHU X, WU G, et al. Data mining with big data[J]. IEEE Transactions on Knowledge and Data Engineering, 2014, 26(1): 97-107.
[2] CHEN M-S, HAN J, YU P S. Data mining: an overview from a database perspective[J]. IEEE Transactions on Knowledge and Data Engineering, 1996, 8(6): 866-883.
[3] NIMMAGADDA S L, DREHER H. Petro-data cluster mining: knowledge building analysis of complex petroleum systems[C]//ICIT 2009: Proceedings of the 2009 IEEE International Conference on Industrial Technology. Washington, DC: IEEE Computer Society, 2009: 1-8.
[4] FAHAD A, ALSHATRI N, TARI Z, et al. A survey of clustering algorithms for big data: taxonomy & empirical analysis[J]. IEEE Transactions on Emerging Topics in Computing, 2014, 2(3): 267-279.
[5] KURASOVA O, MARCINKEVICIUS V, MEDVEDEV V, et al. Strategies for big data clustering[C]//ICTAI 2014: Proceedings of the IEEE 26th International Conference on Tools with Artificial Intelligence. Piscataway, NJ: IEEE, 2014: 740-747.
[6] 李建江,崔健,王聃,等.MapReduce并行编程模型研究综述[J].电子学报,2011,39(11):2635-2642. (LI J J, CUI J, WANG D, et al. Survey of MapReduce parallel programming model[J]. Acta Electronica Sinica, 2011, 39(11): 2635-2642.)
[7] GUNARATHNE T, WU T-L, QIU J, et al. MapReduce in the clouds for science[C]//CloudCom 2010: Proceedings of the IEEE Second International Conference on Cloud Computing Technology and Science. Washington, DC: IEEE Computer Society, 2010: 565-572.
[8] 江小平,李成华,向文,等.K-means聚类算法的MapReduce并行化实现[J]. 华中科技大学学报(自然科学版),2011,39(Z1):120-124. (JIANG X P, LI C H, XIANG W, et al. Parallel implementing K-means clustering algorithm using MapReduce programming mode[J]. Journal of Huazhong University of Science and Technology (Natural Science), 2011, 39(Z1): 120-124.)
[9] 赵卫中,马慧芳,傅燕翔,等.基于云计算平台Hadoop的并行K-means聚类算法设计研究[J].计算机科学,2011,38(10):166-168. (ZHAO W Z, MA H F, FU Y X, et al. Research on parallel K-means clustering algorithm design based on Hadoop platform[J]. Computer Science, 2011, 38(10): 166-168.)
[10] 行小帅,潘进,焦李成.基于免疫规划的K-means聚类算法[J].计算机学报,2003,26(5):605-610. (XING X S, PAN J, JIAO L C. A novel K-means clustering algorithm based on immune programming algorithm[J]. Chinese Journal of Computers, 2003, 26(5): 605-610.)
[11] 於跃成,王建东,郑关胜,等.基于约束信息的并行K-means算法[J].东南大学学报(自然科学版),2011,41(3):505-508. (YU Y C, WANG J D, ZHENG G S, et al. Parallel K-means algorithm based on constrained information[J]. Journal of Southeast University (Natural Science Edition), 2011, 41(3): 505-508.)
[12] ANCHALIA P P. Improved MapReduce K-means clustering algorithm with combiner[C]//Proceedings of the 2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation. Washington, DC: IEEE Computer Society, 2014: 386-391.
[13] CUI X, ZHU P, YANG X, et al. Optimized big data K-means clustering using MapReduce[J]. Journal of Supercomputing, 2014, 70(3): 1249-1259.
[14] ARTHUR D, VASSILVITSKII S. K-means++: the advantages of careful seeding[C]//SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA: SIAM, 2007: 1027-1035.
[15] LIAO Q, YANG F, ZHAO J. An improved parallel K-means clustering algorithm with MapReduce[C]//ICCT 2013: Proceedings of the 15th IEEE International Conference on Communication Technology. Piscataway, NJ: IEEE, 2013: 764-768.
[16] PUN W K D, ALI A B M S. Unique distance measure approach for K-means (UDMA-Km) clustering algorithm[C]//TENCON 2007: Proceedings of the 2007 IEEE Region 10 Conference. Piscataway, NJ: IEEE, 2007: 1-4.
[17] BAHMANI B, MOSELEY B, VATTANI A, et al. Scalable K-means++[J]. Proceedings of the VLDB Endowment, 2012, 5(7): 622-633.
[18] SHINDLER M, WONG A, MEYERSON A. Fast and accurate K-means for large datasets[C]//NIPS 2011: Advances in Neural Information Processing Systems 24. Cambridge, MA: MIT Press, 2011: 2375-2383.
[19] CAI X, NIE F, HUANG H. Multi-view K-means clustering on big data[C]//IJCAI '13: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press, 2013: 2598-2604.
[20] WANG J, SU X. An improved K-means clustering algorithm[C]//ICCSN 2011: Proceedings of the IEEE 3rd International Conference on Communication Software and Networks. Piscataway, NJ: IEEE, 2011: 44-46.
[21] CHEN G-P, WANG W-P. An improved K-means algorithm with meliorated initial center[C]//ICCSE 2012: Proceedings of the 7th International Conference on Computer Science & Education. Piscataway, NJ: IEEE, 2012: 150-153.
[22] DONG J, QI M. K-means optimization algorithm for solving clustering problem[C]//WKDD 2009: Proceedings of the Second International Workshop on Knowledge Discovery and Data Mining. Washington, DC: IEEE Computer Society, 2009: 52-55.