
Information preservation and its applications to natural language processing

Advisor: 項潔

Abstract

In this dissertation, we motivate a mathematical concept called information preservation in the context of probabilistic modeling. Our approach provides a common ground for relating various optimization principles, such as maximum and minimum entropy methods. In this framework, we make explicit the assumption that model induction is a directed process toward some reference hypothesis. To verify this theory, we conducted extensive empirical studies on unsupervised word segmentation and static index pruning. In unsupervised word segmentation, our approach significantly boosts the segmentation accuracy of an ordinary compression-based method and achieves performance comparable to several state-of-the-art methods in terms of efficiency and effectiveness. For static index pruning, the proposed information-based measure achieves state-of-the-art performance, and does so more efficiently than competing methods. Our approach to model induction has also led to new discoveries, such as a new regularization method for cluster analysis. We expect that this deeper understanding of induction principles will produce new methodologies for probabilistic modeling and eventually lead to breakthroughs in natural language processing.
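For context, the maximum and minimum entropy methods mentioned above refer to two classical model-selection principles; the formulation below is the standard textbook statement of the maximum entropy principle, not the specific framework developed in this dissertation. Among all distributions p satisfying a set of empirical constraints, it selects the one with the largest entropy:

\[
p^{\ast} \;=\; \arg\max_{p} H(p)
\;=\; \arg\max_{p} \Bigl( -\sum_{x} p(x)\log p(x) \Bigr)
\quad\text{subject to}\quad
\sum_{x} p(x)\, f_i(x) = \tilde{\mu}_i \;\; (i = 1,\dots,k),
\qquad
\sum_{x} p(x) = 1,
\]

where the \(f_i\) are feature functions and the \(\tilde{\mu}_i\) their observed expectations (generic notation, not taken from the dissertation). Minimum entropy methods reverse this preference, favoring the most concentrated distribution consistent with the constraints; the information preservation framework summarized above is presented as a common ground for relating such principles.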

