分詞 是 中文 語言處理e 基礎問題, 也是 困難e問題。 困難來源 主要來自 詞界線難定。 現有e 分詞規範 主要倚靠 語意、語法, 除了 規則眾多 以外, 分詞結果 也常常 無夠一致。另外一面, 字串e 表面形式 卻是 容易辨認e 要件。本文 提出 延複詞, 自 六種 字面形式 認定 延複詞: 1. 重複型式、 2. 二字寬鬆、 3. 2+1名詞、 4. 並列並合、 5. 總字數、 6. 總成分數。延複詞 包括 簡單延複詞、一般延複詞; 簡單延複詞 約略等於 現有e分詞, 一般延複詞 則放寬到長度 四五字左右, 但是 保持 語法結構 簡單。趣味e是, 讓格書寫 以及分延複詞 差別無大。阮 同時探討 標記 延複詞e 語法類別, 稱 延複詞類; 因為 標記e單位 卡大, 提供機會 免標記 語法行為 複雜e 濟濟單字詞, 因此得著 標記 卡簡單e 延複詞類。
In traditional Chinese and Taiwanese orthography, text is written without spaces between words, so word segmentation is both fundamental and difficult. The difficulty arises because there are hardly any clear boundaries between words and compound words, or between words and phrases. The current segmentation standard proposed by CKIP (1996) relies mainly on semantics and syntax; it requires many rules and still gives noticeably inconsistent segmentation results. By contrast, the literal forms of character strings are much easier to recognize. We therefore propose segmenting text into units that we call extended words, identified by six literal-form criteria: 1. character repetition patterns; 2. a loose rule under which two-character strings count as extended words; 3. nouns in 2+1 shape with the head on the right; 4. concatenation of parallel words; 5. total number of characters; 6. total number of constituents. Extended words comprise simple extended words and general extended words. Simple extended words correspond roughly to the units segmented under the CKIP standard, but are defined by much simpler rules. General extended words consist of multiple constituents with a total length of up to four or five characters, while keeping the syntactic structure simple. Interestingly, segmentation into extended words and the LangGeh orthography (江永進 et al., 2009) give similar results. We also explore tagging extended words with syntactic categories. Because the tagged unit is larger, we can avoid tagging the many single-character constituents whose syntactic behavior is complex, which results in a simpler tagging process.
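As a rough illustration of why literal forms are easy to recognize mechanically, the following Python sketch computes a few surface features of a candidate string. It is not the procedure of this paper: the reduplication templates (AA, AAB, ABB, AABB, ABAB), the function names, and the five-character cap (taken from "four or five characters") are all assumptions made for the example, and the input is assumed to be already split into constituents. Criteria that need linguistic knowledge, such as whether a 2+1 string is a noun with its head on the right, are deliberately left out.

```python
import re

# Hypothetical reduplication templates. The abstract names "character
# repetition patterns" without listing them, so these are assumptions.
REPETITION_PATTERNS = {
    "AA": re.compile(r"(.)\1"),
    "AAB": re.compile(r"(.)\1(?!\1)."),
    "ABB": re.compile(r"(.)(?!\1)(.)\2"),
    "AABB": re.compile(r"(.)\1(?!\1)(.)\2"),
    "ABAB": re.compile(r"(.)(?!\1)(.)\1\2"),
}


def repetition_pattern(text):
    """Return the name of the first template that matches the whole string."""
    for name, pattern in REPETITION_PATTERNS.items():
        if pattern.fullmatch(text):
            return name
    return None


def surface_profile(constituents, max_chars=5):
    """Summarize surface features of a candidate made of given constituents.

    `max_chars` is an assumed cap, not a value fixed by the paper.
    """
    text = "".join(constituents)
    return {
        "text": text,
        "total_chars": len(text),
        "total_constituents": len(constituents),
        # Length shape only, e.g. "2+1"; says nothing about part of speech.
        "shape": "+".join(str(len(c)) for c in constituents),
        "repetition": repetition_pattern(text),
        "is_two_chars": len(text) == 2,
        "within_length_cap": len(text) <= max_chars,
    }


if __name__ == "__main__":
    print(surface_profile(["高高興興"]))
    print(surface_profile(["臺灣", "人"]))
```

Every feature here is read off the characters alone, which is the property the literal-form criteria rely on; deciding which feature combinations actually license an extended word is left to the rules in the body of the paper.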