1 School of Information Science & Engineering, Yunnan University, Kunming 650500, China
2 School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
3 Sichuan Key Laboratory of Software Automatic Generation and Intelligent Service, Chengdu University of Information Technology, Chengdu 610225, China
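Abstract (translated from Chinese)

[Objective] To address the difficulty of extracting news topics and measuring the relevance of sub-events when generating news clues, this paper improves the topic model with a dynamic sliding window to raise the quality of topic extraction from both long and short news texts and, based on the extracted topics, proposes a news clue generation method oriented to news events. [Methods] Building on IBTM (Incremental Biterm Topic Model), a dynamic sliding window narrows the range over which biterms are extracted, yielding the News-IBTM model, which is suited to topic extraction from both long and short news texts. The model is used to extract the topic distribution and the topic-word distributions from news data and to infer the document-topic distributions. JS divergence then measures the differences between document-topic distributions in order to generate news clues. [Results] Experiments on People's Daily Online news and Weibo news show that, for both long and short news texts, News-IBTM outperforms existing classic topic models in perplexity, accuracy, and efficiency. [Limitations] The accuracy of News-IBTM, and of other news clue generation methods, is not high and can be further improved. [Conclusions] The proposed method is suited to addressing topic extraction quality for long and short news texts and can obtain news clues from news events.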

Abstract

[Objective] This paper modifies a topic model with a dynamic sliding window to improve topic extraction from both long and short news texts, and proposes a method for generating news clues from news events. [Methods] We constructed the News-IBTM model on the basis of IBTM (Incremental Biterm Topic Model), adding a dynamic sliding window that narrows the range over which biterms (word pairs) are extracted. We then used this model to extract the topic distribution and topic-word distributions from news data and to infer the document-topic distributions. Finally, we used the JS (Jensen-Shannon) divergence to measure the difference between document-topic distributions and generate news clues. [Results] We evaluated News-IBTM on news from People’s Daily Online and Weibo. On both long and short news texts, the proposed model outperformed existing classic topic models in perplexity, accuracy, and efficiency. [Limitations] The accuracy of the News-IBTM algorithm needs further improvement. [Conclusions] The proposed method can extract quality news topics from long and short news texts and obtain news clues from news events.
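The two building blocks named in the abstract, windowed biterm extraction and JS divergence between document-topic distributions, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the window size is chosen, so `choose_window` and its parameters `short_len` and `cap` are hypothetical, as are the other function names.

```python
import math


def choose_window(n_tokens, short_len=20, cap=10):
    """Hypothetical window rule (an assumption, not taken from the paper):
    short texts use the whole document, longer texts use a capped window."""
    return n_tokens if n_tokens <= short_len else cap


def biterms_in_window(tokens, window):
    """Return unordered word pairs whose positions are fewer than `window` apart."""
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append(tuple(sorted((tokens[i], tokens[j]))))
    return pairs


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    tokens = ["flood", "rescue", "team", "arrives"]
    print(biterms_in_window(tokens, choose_window(len(tokens))))

    # Illustrative document-topic distributions for two news items.
    theta_a = [0.7, 0.2, 0.1]
    theta_b = [0.1, 0.2, 0.7]
    print(js_divergence(theta_a, theta_b))
```

With log base 2 the divergence lies between 0 (identical distributions) and 1 (disjoint support), so a smaller value indicates two news items with more similar topic mixtures. How such scores are turned into clues is not specified in the abstract and is left out of the sketch.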

Key words: News Events; News Clues Generation; Topic Model; Jensen-Shannon Divergence