
詞向量化方法之比較與應用

Word Vectorization Methods: Comparisons and Applications

Advisor: 盧信銘

Abstract


Word vectorization (also known as word embedding or distributed word representation) is a family of methods that represent each word as a fixed-length vector; such representations are widely used in text mining and natural language processing. However, few studies have comprehensively compared the performance of these methods. This study compares eight word vectorization methods based on different techniques, including matrix factorization, topic models, and neural networks. We evaluate them with both intrinsic and extrinsic evaluations: the intrinsic evaluations measure association, similarity, and analogy relationships between word vectors, while the extrinsic evaluation uses named entity recognition (NER) as the benchmark task. On the intrinsic evaluations, the neural-network-based methods (CBOW and Skip-gram) performed best, followed by GloVe, whereas methods trained on document-level information, such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), performed relatively poorly in our comparison. On the extrinsic evaluation, Skip-gram and HAL (a relatively simple matrix factorization method) brought the largest improvements to NER performance, while LDA and CBOW brought the smallest. These results suggest that rankings under intrinsic evaluation may not be consistent with rankings under extrinsic evaluation; future work could therefore include more extrinsic evaluation tasks to uncover the relationship between intrinsic and extrinsic performance.
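To make the best-performing methods above concrete, here is a minimal, illustrative sketch (not the thesis's actual experimental code) of training the two neural-network-based models, CBOW and Skip-gram. It assumes the gensim library (4.x API); the toy corpus and all variable names are hypothetical stand-ins.

    # Minimal sketch, assuming gensim 4.x; the toy corpus stands in for a real one.
    from gensim.models import Word2Vec

    corpus = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["a", "man", "walks", "in", "the", "city"],
        ["a", "woman", "walks", "in", "the", "city"],
    ]

    # sg=0 trains CBOW (predict a word from its context);
    # sg=1 trains Skip-gram (predict context words from the target word).
    cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
    skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

    # Each vocabulary word is now a fixed-length vector (here, 50 dimensions).
    print(skipgram.wv["king"].shape)  # -> (50,)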

Abstract (English)


Word vectorization (a.k.a. word embedding or distributional word representation) is a family of approaches that convert words into fixed-length vectors. It is widely used in text mining and natural language processing tasks. However, few studies have systematically compared the performance of these methods. This study investigated eight word vectorization methods drawing on different technical approaches, including matrix factorization, topic models, and neural networks. We compared their performance using both intrinsic and extrinsic evaluations. The intrinsic evaluations examined association, similarity, and analogy relationships between word vectors; the extrinsic evaluation used a named entity recognition (NER) task. For the intrinsic evaluations, the results suggest that neural-network-based methods such as continuous bag-of-words (CBOW) and Skip-gram performed best, followed by GloVe, a method that extracts latent vectors from a word-context matrix. Methods that adopt document-wide information, such as latent semantic analysis (LSA) and latent Dirichlet allocation (LDA), did not perform well in our evaluation. For the extrinsic evaluation, Skip-gram and HAL, a relatively simple matrix factorization method, brought the largest improvement to NER performance, while LDA and CBOW brought the least. This result implies that the ranking of methods under intrinsic evaluation may be inconsistent with their ranking under extrinsic evaluation. Thus, future studies could include more extrinsic evaluation tasks to help establish the relationship between intrinsic evaluation results and downstream task performance.
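As an illustration of the intrinsic evaluations described above, the following sketch scores a model by the Spearman rank correlation between its cosine similarities and human similarity ratings, and runs one analogy query. It continues from the training sketch earlier (reusing its skipgram model); the word pairs and ratings are hypothetical, and NumPy, SciPy, and gensim are assumed.

    import numpy as np
    from scipy.stats import spearmanr

    def cosine(u, v):
        # Cosine similarity between two word vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Hypothetical word pairs with illustrative human similarity ratings.
    pairs = [("king", "queen", 8.5), ("man", "woman", 8.3), ("king", "city", 1.2)]

    model_scores = [cosine(skipgram.wv[a], skipgram.wv[b]) for a, b, _ in pairs]
    human_scores = [rating for _, _, rating in pairs]

    # Similarity evaluation: rank correlation with human judgments.
    rho, _ = spearmanr(model_scores, human_scores)
    print(rho)

    # Analogy evaluation: vector("king") - vector("man") + vector("woman")
    # should land near vector("queen").
    print(skipgram.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))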

