Information Technology






About the authors:

FENG Yongqiang (1963- ), male, Han ethnicity, born in Tianjin; manager of the R&D department of Tianjin Haihe Dairy Company; senior engineer. LI Yajun (1993- ), Han ethnicity, born in Xinxiang, Henan; master's student in computer application technology.


CLC number: TP391; TN911.2     Document code: A     Article ID: 2096-4706(2018)02-0000-04

A Document Clustering Model Based on Convolutional Autoencoder

FENG Yongqiang1,LI Yajun2

(1. Tianjin Haihe Dairy Company, Tianjin 300410, China; 2. College of Computer Science and Information Engineering, Tianjin University of Science and Technology, Tianjin 300457, China)

Abstract: Document clustering is the process of automatically grouping a document set into several categories and is an effective way of organizing textual information. To address the high dimensionality that arises when semi-structured text data is converted into structured data, this paper proposes CASC, a document clustering model based on a convolutional autoencoder. CASC exploits the feature-extraction capabilities of convolutional neural networks and autoencoders to embed the original data into a low-dimensional latent space while preserving its internal structure as far as possible, and then applies a spectral clustering algorithm to the embedded data. Experiments show that the CASC model reduces the running time and the time complexity of the algorithm without lowering clustering accuracy.

Keywords: clustering; convolutional neural network; autoencoder; unsupervised model
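The pipeline described in the abstract, embedding documents into a low-dimensional latent space and then running spectral clustering on the embedding, can be sketched as follows. Since this page does not describe the network architecture of CASC, the sketch substitutes scikit-learn's TruncatedSVD for the convolutional autoencoder as the dimensionality-reduction step; the document list and all parameter values are illustrative only.

```python
# Sketch of a CASC-style pipeline: vectorize documents, embed them into a
# low-dimensional latent space, then cluster with spectral clustering.
# NOTE: TruncatedSVD is a stand-in for the paper's convolutional autoencoder,
# whose architecture is not given on this page.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import SpectralClustering

docs = [
    "milk dairy production quality control",
    "dairy company milk processing line",
    "fresh milk cold chain dairy logistics",
    "neural network training gradient descent",
    "convolutional neural network feature extraction",
    "autoencoder latent space representation learning",
    "deep network layers and activation functions",
    "milk powder dairy product inspection",
]

# Step 1: turn the semi-structured text into a high-dimensional TF-IDF matrix.
X = TfidfVectorizer().fit_transform(docs)

# Step 2: embed into a low-dimensional latent space
# (stand-in for the convolutional autoencoder).
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Step 3: apply spectral clustering in the latent space.
labels = SpectralClustering(n_clusters=2, affinity="rbf",
                            random_state=0).fit_predict(Z)
print(labels)
```

Reducing dimensionality before spectral clustering is the point of the model: building the affinity matrix on the low-dimensional embedding rather than the raw TF-IDF vectors is what cuts the running time while preserving the structure the clustering relies on.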

