Applying Deep Learning to Blog Article Classification

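Because an enormous volume of articles is published online every day, accurate article classification helps readers read and search more efficiently. According to statistics from the Pixnet website, nearly 50% of blog articles have no category selected. This paper proposes a custom loss function to help assign such articles to the correct topic category. With the proposed classification system, the Pixnet back end can automatically determine an article's topic category. Articles were segmented with the Jieba and CKIP word segmentation systems. Experiments show a classification accuracy of 92.60% with Jieba and 93.35% with CKIP, indicating that CKIP is the preferred segmenter for input articles when classifying Traditional Chinese text. The segmented articles are encoded with pre-trained word vectors and then fed into a long short-term memory (LSTM) model or a convolutional neural network (CNN) for training. Training with the custom loss function yields an accuracy of 93.35%, better than the 92.98% obtained with the conventionally used loss function. This shows that the proposed custom loss function helps classify blog articles more accurately.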

Keywords

Natural language processing; Machine learning; Social networking sites; Loss function; Word segmentation system

Parallel Abstract

Because a huge number of articles are published on the Internet every day, well-organized article labels can improve the user experience of reading and searching. However, according to statistics from the Pixnet website, nearly 50% of blog posts are not labeled by their authors. To address this problem, this paper proposes a custom loss function for an automatic article labeling system in the website back end, which assigns accurate labels to unlabeled articles. We segment articles with the Jieba and CKIP word segmentation systems. The experimental results show a classification accuracy of 92.60% with Jieba and 93.35% with CKIP. Thus, for Traditional Chinese text, CKIP is the first choice for word segmentation. After segmentation, the articles are encoded with pre-trained word vectors and fed into Long Short-Term Memory (LSTM) models or Convolutional Neural Networks (CNNs) for training. Training with our custom loss function yields an accuracy of 93.35%, higher than the 92.98% obtained with the categorical_crossentropy loss function. In conclusion, the custom loss function proposed in this paper helps classify blog articles automatically and more accurately.
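For readers unfamiliar with the baseline named above, the following is a minimal sketch of what a categorical cross-entropy loss computes for one-hot topic labels. It is an illustration only: the abstract does not define the custom loss function, so none is shown here, and the function name, the clipping constant `eps`, and the toy batch are assumptions made for this example rather than details from the paper.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean categorical cross-entropy over a batch.

    y_true: list of one-hot label vectors, one per article
    y_pred: list of predicted probability vectors, one per article
    eps:    assumed clipping constant that keeps log() finite
    """
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        # Clip each probability into [eps, 1 - eps], then sum -t * log(p).
        total += -sum(
            t * math.log(min(max(p, eps), 1.0 - eps))
            for t, p in zip(true_row, pred_row)
        )
    return total / len(y_true)


# Hypothetical toy batch: 2 articles, 3 topic categories.
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = categorical_crossentropy(y_true, y_pred)
```

The loss shrinks toward zero as the probability assigned to the correct category approaches 1, which is the behavior any replacement loss would be compared against.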

Parallel Keywords

Natural language processing; Machine learning; Social network; Loss function; Word segmentation system

