Short Text Expansion and Classification Based on Word Embedding

Short texts contain few words, considerable noise, and sparse features, so traditional text classification methods perform poorly on them. To improve the classification accuracy of short texts, a feature extension method based on Wikipedia word vectors is proposed. First, word vectors are trained on a Wikipedia corpus. Then, word vectors are combined with document vectors for feature selection. Finally, each text is extended with words that are highly similar to its selected feature items, and the extended text is classified with a traditional classifier. Experimental results show that the proposed method achieves higher short text classification accuracy than other text feature extension algorithms.

Keywords

Word Embeddings; Short Texts; Feature Extension; Text Classification
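
The extension step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors stand in for embeddings trained on a Wikipedia corpus, the names `TOY_VECTORS`, `cosine`, and `expand_text` and the `top_k` and `threshold` parameters are hypothetical, and the document-vector feature selection and the final classifier are omitted. It only shows how a short text might be extended with words whose vectors are highly similar to its own.

```python
import math

# Toy 3-dimensional vectors standing in for word embeddings trained on a
# Wikipedia corpus. The words and values are invented for illustration.
TOY_VECTORS = {
    "football": [0.9, 0.1, 0.0],
    "soccer": [0.85, 0.15, 0.05],
    "goal": [0.7, 0.3, 0.1],
    "stock": [0.05, 0.9, 0.2],
    "market": [0.1, 0.85, 0.25],
    "recipe": [0.0, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_text(tokens, vectors, top_k=2, threshold=0.9):
    """Append up to top_k highly similar words for each known token.

    Assumption: every in-vocabulary token is treated as a feature item.
    The paper selects feature items first; that step is not modeled here.
    """
    expanded = list(tokens)
    seen = set(tokens)
    for token in tokens:
        if token not in vectors:
            continue  # out-of-vocabulary tokens are kept but not extended
        scored = sorted(
            ((cosine(vectors[token], vec), word)
             for word, vec in vectors.items() if word not in seen),
            reverse=True,
        )
        for score, word in scored[:top_k]:
            if score >= threshold:
                expanded.append(word)
                seen.add(word)
    return expanded


print(expand_text(["football", "score"], TOY_VECTORS))
# prints ['football', 'score', 'soccer', 'goal']
```

The extended token list would then be turned into features (for example, bag-of-words counts) and passed to a conventional classifier. The similarity threshold keeps unrelated words such as "stock" out of the extended text, which matters because noisy extensions can hurt accuracy on already sparse inputs.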
