The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification

Text classification (TC) is the task of assigning predefined categories (or labels) to texts for information organization, knowledge management, and many other applications. In library science applications the categories are usually topical, although they can be any labels that suit the application. TC therefore often requires topical analysis, which relies on human knowledge. In recent decades, machine learning (ML) techniques have been applied to TC for efficiency, provided that a sufficient number of training texts is available for each category. In real-world TC tasks, however, the number of texts (documents) per category is often highly skewed. This makes predicting labels for small categories a problem that is viable for humans but challenging for machines. Deep learning (DL) is an emerging class of ML that was inspired by the neural networks of the human brain. This study evaluates whether DL techniques are feasible for this problem by comparing the performance of four off-the-shelf DL methods (CNN, RCNN, fastText, and BERT) with four traditional ML techniques on five skew-distributed datasets (four in Chinese and one in English for comparison). Our results show that BERT is effective for moderately skewed datasets but is still not feasible for highly skewed TC tasks. The other three DL methods (CNN, RCNN, fastText) show no advantage over traditional methods such as SVM on the five TC tasks, even though they capture extra language knowledge in their pretrained word representations. To facilitate future study, all of the Chinese datasets used in this study have been released publicly, together with all of the adapted machine learning and evaluation source code, for verification and further study at https://github.com/SamTseng/Chinese_Skewed_TxtClf.
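
The difficulty with small categories shows up clearly in how a classifier is scored. The following is a minimal sketch and not the study's evaluation code (that is in the repository linked above). The toy labels, the always-majority classifier, and the function names are assumptions made purely for illustration. It contrasts micro-averaged F1, which is dominated by large categories, with macro-averaged F1, which weights every category equally.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def micro_f1(y_true, y_pred):
    """Micro-averaged F1; for single-label tasks this equals accuracy."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Invented, skewed test set: 90 "big", 8 "mid", 2 "small" documents.
y_true = ["big"] * 90 + ["mid"] * 8 + ["small"] * 2
# Hypothetical classifier that always predicts the largest category.
y_pred = ["big"] * 100

micro = micro_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case the micro-averaged score looks strong even though the two smaller categories are never predicted correctly, while the macro-averaged score exposes the failure. This is one reason a skewed TC task can appear solved under one metric and unsolved under another.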

Keywords

Text categorization; Real-world corpus; Deep learning; Performance evaluation

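Parallel Abstract

Text classification is a topic analysis problem in library and information science, and deep learning (DL) is a semantic understanding technique that in recent years has drawn on large amounts of language knowledge. This study evaluates the feasibility of DL for topic analysis by comparing four off-the-shelf DL methods (CNN, RCNN, fastText, and BERT) with four traditional machine learning techniques on five skew-distributed corpora (four in Chinese and one in English). The results show that BERT is effective for moderately skewed corpora, but its performance on highly skewed automatic text classification tasks remains poor. Compared with traditional methods such as SVM, the other three DL methods (CNN, RCNN, fastText) show no advantage on the five text classification tasks, even though they capture extensive additional language knowledge in their pretrained word representations. To facilitate future research, the Chinese corpora used in this study and all of the adapted machine learning and evaluation code have been publicly released.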

Parallel Keywords

Text classification; Corpus; Deep learning; Performance evaluation


