
深度學習應用於部落格文章分類

Topic Classification of Blog Posts Using Deep Learning

Advisor: 丁肇隆

Abstract


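Because a huge volume of articles is produced online every day, correct article classification can speed up reading and searching for readers. According to statistics from the Pixnet (痞客邦) website, nearly 50% of blog posts have no category selected by the author. This thesis proposes a custom loss function to help classify such posts into the correct topic. With the proposed classification system, the Pixnet back end can automatically determine the topic category of a post. Articles are segmented with the Jieba and the CKIP word segmentation systems respectively. The experiments show a classification accuracy of 92.60% with Jieba and 93.35% with CKIP, indicating that CKIP is the first choice for segmenting input articles when classifying Traditional Chinese text. The segmented articles are encoded with pre-trained word vectors and then fed into a Long Short-Term Memory (LSTM) model or a Convolutional Neural Network (CNN) for training. Training with the custom loss function gives an accuracy of 93.35%, better than the 92.98% obtained with the conventionally used loss function, which shows that the proposed custom loss function helps classify blog posts more accurately.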

Parallel Abstract


Because a huge number of articles are published on the Internet every day, well-organized article labels can improve the reading and searching experience. However, according to statistics from the Pixnet website, nearly 50% of blog posts are not labeled by their authors. To address this problem, this thesis proposes a custom loss function for an automatic article labeling system in the website back end, which assigns topic labels to unlabeled articles. Articles are segmented with either the Jieba or the CKIP word segmentation system. In our experiments, classification accuracy is 92.60% with Jieba and 93.35% with CKIP, which suggests that CKIP is the first choice for segmenting Traditional Chinese text. After segmentation, the articles are encoded with pre-trained word vectors and fed into Long Short-Term Memory (LSTM) models or Convolutional Neural Networks (CNN) for training. Training with our custom loss function yields an accuracy of 93.35%, compared with 92.98% for the categorical_crossentropy loss function. These results indicate that the proposed custom loss function helps classify blog articles automatically and more accurately.
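The abstract does not give the form of the custom loss function, so it is not reproduced here. As a point of reference only, the following minimal sketch shows the baseline it is compared against, categorical cross-entropy, written in plain Python for a single example. The three-class setup and the function names `softmax` and `categorical_crossentropy` are illustrative assumptions, not the thesis code.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Baseline loss for one example: -sum(y_true * log(y_pred)).

    y_true is a one-hot vector, y_pred a probability vector.
    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    return -sum(
        t * math.log(min(max(p, eps), 1.0 - eps))
        for t, p in zip(y_true, y_pred)
    )


# Hypothetical example with three topic classes; the true class is index 1.
y_true = [0.0, 1.0, 0.0]
confident_right = softmax([0.5, 3.0, 0.2])  # model favors the true class
confident_wrong = softmax([3.0, 0.5, 0.2])  # model favors a wrong class

loss_right = categorical_crossentropy(y_true, confident_right)
loss_wrong = categorical_crossentropy(y_true, confident_wrong)
```

The loss is small when the model puts most of its probability on the true topic and large otherwise. A custom loss would replace `categorical_crossentropy` at this point in training while the rest of the pipeline (segmentation, word-vector encoding, LSTM or CNN) stays the same.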

