Unknown word extraction plays an important role in Chinese semantic analysis. Many of the short phrases embedded in code-switched (mixed-language) corpora are not yet listed in existing dictionaries, so sentences containing them may not be segmented correctly. This research therefore aims to extract unknown words from a code-switched corpus, with particular attention to the embedded short phrases. We propose several approaches and compare their test results. We primarily use Pointwise Mutual Information (PMI) to measure the cohesion between two words, and apply a threshold to select adjacent word pairs with high PMI values as candidate new words. However, PMI only considers whether two words frequently occur next to each other, and words that are frequently adjacent do not necessarily form a new word when merged. For this reason, in addition to PMI, we use the entropy of the preceding and following context to filter out irrelevant candidates and improve precision.
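The two-stage filter described above (a PMI threshold followed by a context-entropy check) can be illustrated with a minimal sketch. It is not the implementation used in this research. It assumes the corpus has already been split into token lists, estimates probabilities by relative frequency, and uses base-2 logarithms. The function names, the boundary markers `<s>` and `</s>`, and the rule that both the left and right context entropy must reach the threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def pmi_candidates(sentences, threshold):
    """Return {(w1, w2): pmi} for adjacent pairs whose PMI meets the threshold.

    PMI(w1, w2) = log2( p(w1, w2) / (p(w1) * p(w2)) ), with probabilities
    estimated from unigram and adjacent-bigram relative frequencies.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi >= threshold:
            scores[(w1, w2)] = pmi
    return scores


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by the counts."""
    total = sum(counter.values())
    return sum((c / total) * math.log2(total / c) for c in counter.values())


def context_entropy(sentences, pair):
    """Return (left_entropy, right_entropy) of the tokens surrounding `pair`."""
    left, right = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens) - 1):
            if (tokens[i], tokens[i + 1]) == pair:
                # Sentence boundaries are counted as context tokens (assumption).
                left[tokens[i - 1] if i > 0 else "<s>"] += 1
                right[tokens[i + 2] if i + 2 < len(tokens) else "</s>"] += 1
    return entropy(left), entropy(right)


def extract_new_words(sentences, pmi_threshold, entropy_threshold):
    """Keep PMI candidates whose left and right contexts are both varied."""
    result = []
    for pair in pmi_candidates(sentences, pmi_threshold):
        left_h, right_h = context_entropy(sentences, pair)
        # Hypothetical rule: the lower of the two entropies must pass.
        if min(left_h, right_h) >= entropy_threshold:
            result.append(pair)
    return result
```

The entropy check rests on the idea that a genuine word tends to appear in many different contexts. A pair that is always followed by the same token has zero right-context entropy, which suggests it is a fragment of a longer unit or a fixed sequence rather than a word in its own right. Both thresholds are free parameters here and would need to be tuned on real data.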