Syllable word segmentation, a component of Chinese phonetic (Zhuyin/Pinyin) input methods (CPIMs), involves more overlapping boundaries than Chinese word segmentation because of homophone ambiguity. CPIMs usually assume that the input is a complete sentence, and their performance is evaluated on well-formed corpora. However, most Pinyin users prefer to enter text progressively in short chunks, mostly one or two words each, a habit that is even more common on handheld devices with limited computing resources. Short chunks often do not provide enough context for the best possible syllable-to-character conversion, especially when a chunk contains overlapping boundaries. These overlapping ambiguities show directional tendencies. This dissertation therefore proposes a double ranking (DR) strategy that exploits the left and right context. Experiments show that DR outperforms the frequency-based method, which is fast and needs little memory, and that it performs comparably to the conditional random fields (CRF) model, which needs far more memory and is much slower, while using very little memory itself.
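To make the notion of an overlapping boundary concrete, the following is a minimal sketch, not the DR method proposed in this dissertation. It assumes a hypothetical toy lexicon keyed by toneless Pinyin syllables, and the names `TOY_LEXICON` and `overlapping_spans` are illustrative only. It reports pairs of lexicon matches that cross each other, so that the shared syllable could attach to the word on its left or to the word on its right.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # half-open [start, end) indices into the syllable list

# Hypothetical toy lexicon of syllable sequences (assumption, for illustration only).
TOY_LEXICON: Set[Tuple[str, ...]] = {
    ("fa", "zhan"),    # e.g. "develop"
    ("zhong", "guo"),  # e.g. "China"
    ("guo", "jia"),    # e.g. "country"
}


def overlapping_spans(
    syllables: List[str],
    lexicon: Set[Tuple[str, ...]],
    max_len: int = 4,
) -> List[Tuple[Span, Span]]:
    """Return pairs of lexicon matches that cross without one containing the other."""
    n = len(syllables)
    # Collect every span whose syllables form a lexicon entry.
    spans = [
        (i, j)
        for i in range(n)
        for j in range(i + 1, min(i + max_len, n) + 1)
        if tuple(syllables[i:j]) in lexicon
    ]
    # Two spans cross when the second starts inside the first and ends after it.
    return [(a, b) for a in spans for b in spans if a[0] < b[0] < a[1] < b[1]]


if __name__ == "__main__":
    chunk = ["fa", "zhan", "zhong", "guo", "jia"]
    for left, right in overlapping_spans(chunk, TOY_LEXICON):
        print(chunk[left[0]:left[1]], "overlaps", chunk[right[0]:right[1]])
```

In this toy chunk the syllable `guo` is shared by `zhong guo` and `guo jia`. With a real lexicon, homophones let a single syllable sequence match many more entries, which is why such crossings are more frequent at the syllable level than at the character level.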