A Preliminary Study of Extended Words and Their Syntactic Categories (延複詞及延複詞類初探)

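Word segmentation is a fundamental problem in Chinese language processing, and also a difficult one. The difficulty comes mainly from the fact that word boundaries are hard to determine. Existing segmentation standards rely mainly on semantics and syntax; besides having numerous rules, they often produce segmentation results that are not consistent enough. On the other hand, the surface form of a character string is a criterion that is easy to recognize. This paper proposes extended words (延複詞), identified from six literal forms: 1. repetition patterns, 2. loose treatment of two-character strings, 3. 2+1 nouns, 4. parallel juxtaposition and merging (並列並合), 5. total number of characters, 6. total number of constituents. Extended words include simple extended words and general extended words; simple extended words are roughly equal to the units of existing segmentation, while general extended words are relaxed to a length of around four or five characters but keep their syntactic structure simple. Interestingly, LangGeh orthography (讓格書寫) and segmentation into extended words differ little. We also explore tagging extended words with syntactic categories, which we call extended word categories (延複詞類); because the unit being tagged is larger, there is an opportunity to avoid tagging the many single-character words whose syntactic behavior is complex, and thus to obtain simpler extended word categories.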

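Keywords

extended word (延複詞); extended phrase (延複詞組); word breaking (斷詞); word segmentation (分詞); LangGeh orthography (讓格書寫)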

Parallel Abstract

In traditional Chinese or Taiwanese orthography, text is written without spaces between words, which makes segmentation both fundamental and difficult. It is difficult because there are hardly any clear boundaries between words and compound words, or between words and phrases. The current segmentation standard proposed by CKIP (1996) relies mainly on semantics and syntax, and gives noticeably inconsistent segmentation results. On the other hand, we find that the literal forms of character strings are much easier to recognize. We therefore propose segmenting text into so-called extended words. We use six literal forms to define extended words: 1. character repetition patterns, 2. two-character strings, which are loosely accepted as extended words, 3. nouns of 2+1 shape with the head word on the right, 4. concatenation of parallel words, 5. total number of characters, 6. total number of constituents. Extended words include simple extended words and general extended words. Simple extended words correspond roughly to the units segmented under the CKIP standard, but are defined by much simpler rules. A general extended word consists of multiple constituents with a total length of up to four or five characters, while keeping its syntactic structure simple. Interestingly, segmentation into extended words and the LangGeh orthography (江永進等 (2009)) give similar results. We also try to tag extended words with syntactic categories. Because we use a larger unit, we can omit tagging those single-character constituents that are syntactically complex, which results in a simpler tagging process.
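As a rough illustration of the first literal form, character repetition patterns, the following Python sketch abstracts a short string into a shape such as AABB or ABAB and checks it against a small set of reduplication shapes. The names `shape`, `is_reduplication`, and `REDUP_SHAPES` are hypothetical, and the list of shapes is an assumption made for illustration only; the abstract does not enumerate the patterns the authors actually use.

```python
# Assumed, illustrative set of reduplication shapes.
# The paper's own inventory of repetition patterns is not given in the abstract.
REDUP_SHAPES = {"AA", "AAB", "ABB", "AABB", "ABAB"}


def shape(s: str) -> str:
    """Abstract a string into a letter pattern.

    Each distinct character is mapped to A, B, C, ... in order of
    first appearance, so "高高興興" becomes "AABB".
    """
    labels = {}
    out = []
    for ch in s:
        if ch not in labels:
            # Assign the next letter to a character seen for the first time.
            labels[ch] = chr(ord("A") + len(labels))
        out.append(labels[ch])
    return "".join(out)


def is_reduplication(s: str) -> bool:
    """Return True if the string's shape is one of the assumed repetition patterns."""
    return shape(s) in REDUP_SHAPES
```

Under these assumptions, 高高興興 maps to AABB and 研究研究 to ABAB, so both count as repetition patterns, while 研究 maps to AB and does not. The check looks only at the surface form of the string, which is the property the abstract argues is easier to recognize than semantic or syntactic criteria.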

Parallel Keywords

extended compound word; extended compound part-of-speech; segmentation; LangGeh orthography

References

Collins, M. (1999). Head-Driven Statistical Models for Natural Language Parsing. PhD dissertation, University of Pennsylvania.

朱德熙(1984).

Angela Troni (2009). 《德語一本通》. Translated by 陳黎娟 and 方秀芬. Changhua City: 陵曦文化.

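ckip規範 (1996). 《「搜」文解字——中文詞界研究與資訊用分詞標準》. Technical Report 96-1, Chinese Knowledge Information Processing Group (中文詞知識庫小組). Taipei: Institute of Information Science and Institute of History and Philology, Academia Sinica. (Referred to as the segmentation standard or ckip規範.)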

ckip斷詞 (2010). Chinese Word Segmentation System (中文斷詞系統) (http://ckipsvr.iis.sinica.edu.tw/). (Provides an online segmentation service.)

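Cited By

吳戴任 (2011). 論前音節輸入法 [Master's thesis, National Tsing Hua University]. Airiti Library. https://doi.org/10.6843/NTHU.2011.00667

李柏宏 (2011). 台華平行語料中台語簡短詞組的詞類標記 [Master's thesis, National Tsing Hua University]. Airiti Library. https://doi.org/10.6843/NTHU.2011.00666

孫玉萍 (2010). 讓格書寫下延複詞類自動標記初探 [Master's thesis, National Tsing Hua University]. Airiti Library. https://doi.org/10.6843/NTHU.2010.00061

王建傑 (2013). 讓格書寫下之斷詞探討 [Master's thesis, National Tsing Hua University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0016-2511201311361262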
