A Study on Unknown Word Extraction in Code-Switched Environments

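Unknown word extraction plays an important role in Chinese semantic analysis. Some of the embedded phrases found in current code-switched corpora do not appear in existing dictionaries, so word segmentation may fail on them. This study therefore aims to find unknown words in code-switched corpora, especially the embedded phrases. We propose several methods and compare their test results. We mainly use pointwise mutual information (PMI) to measure the cohesion between two words, and apply a threshold to select adjacent words with high PMI values as candidate new words. PMI only considers whether two words frequently occur next to each other, yet words that frequently occur together do not necessarily form a new word when merged. For this reason, in addition to PMI, we also use the entropy of the preceding and following context to filter out irrelevant candidates and improve precision.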
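The two-stage idea described above (PMI thresholding followed by a context-entropy filter) can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function names, the toy corpus, the threshold values, the `<s>`/`</s>` boundary symbols, and the rule that a candidate is kept only when both its left and right context entropies are high are all assumptions made for the example.

```python
import math
from collections import Counter

def pmi_candidates(sentences, threshold):
    """Return {(w1, w2): pmi} for adjacent token pairs with PMI >= threshold."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= threshold:
            scores[(w1, w2)] = pmi
    return scores

def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by the counts."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def context_entropy(sentences, pair):
    """Entropy of the tokens seen immediately left and right of the pair."""
    left, right = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens) - 1):
            if (tokens[i], tokens[i + 1]) == pair:
                # Assumption: sentence boundaries count as one context symbol.
                left[tokens[i - 1] if i > 0 else "<s>"] += 1
                right[tokens[i + 2] if i + 2 < len(tokens) else "</s>"] += 1
    return entropy(left), entropy(right)

def extract_unknown_words(sentences, pmi_threshold, entropy_threshold):
    """Keep PMI candidates whose left and right contexts are both varied."""
    kept = []
    for pair in pmi_candidates(sentences, pmi_threshold):
        left_h, right_h = context_entropy(sentences, pair)
        # Assumption: varied contexts on both sides suggest a free-standing unit.
        if min(left_h, right_h) >= entropy_threshold:
            kept.append(pair)
    return kept

# Hypothetical pre-segmented code-switched sentences.
corpus = [
    ["我", "要", "check", "in"],
    ["他", "去", "check", "in", "了"],
    ["你", "先", "check", "in", "吧"],
    ["我", "要", "去"],
]
print(extract_unknown_words(corpus, pmi_threshold=2.0, entropy_threshold=1.0))
# prints [('check', 'in')]
```

In this toy corpus, rare pairs such as ("你", "先") receive a higher PMI than ("check", "in") simply because they are infrequent, which illustrates why a PMI threshold alone is not sufficient. Under the assumptions stated above, the entropy filter removes them because each appears in only one context.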

Keywords

Code-switching; unknown word extraction; mutual information; entropy

Parallel Abstract

Unknown word extraction plays an important role in Chinese language analysis. Many short phrases embedded in code-switched text are not yet available in dictionaries, which degrades Chinese word segmentation. This research focuses on extracting unknown words from code-switched sentences, especially embedded short phrases. It proposes several approaches and compares their results. The research primarily uses pointwise mutual information (PMI) to measure the association between two adjacent terms, and selects adjacent pairs whose PMI exceeds a threshold as candidates. Because PMI only reflects how often words appear next to each other, and frequently adjacent words do not always form a new word, I also use the entropy of the surrounding context to filter out irrelevant candidates and improve precision.

Parallel Keywords

Code-switching; unknown word extraction; mutual information; entropy


