英文書寫 使用 空格 將詞組分隔, 但中文書寫時, 詞和詞之間, 並不會 以空白區分。 這對 自然語言處理 造成了 某些問題, 如分詞和 句法分析。 江永進等 (2009) 提出的 讓格書寫, 使用 簡單的短詞組 作為 基本書寫單元。 用簡短的 詞組, 分詞的問題 就變得 容易多了, 也有 方便閱讀, 不易模糊 的好處。 而且, 讓格詞組 也與 台語變調模式, 息息相關。 本文 進行了 三項探討。 首先是 新詞組採礦, 從 報章文字 自動提取。 第二個 和 第三個是 中文 無間書寫句子 轉換成 寬詞書寫 以及 讓格書寫。 放寬 傳統 詞 的定義 較不易操作, 我們 提議 寬詞, 並且 提出 高可操作性 計算字數的 寬詞1234原則。
Unlike written English, which separates words with spaces, current Chinese orthography uses no spaces between words. This poses problems for automatic text processing, such as word segmentation and syntactic parsing. Chiang et al. (2009) proposed Spaced (讓格) orthography, which uses simple short phrases as the basic writing units. With simple short phrases, word segmentation becomes much easier, and texts in Spaced orthography are also easier to read and less prone to ambiguity. For Taiwanese, the phrases of Spaced orthography also appear to be closely related to tone sandhi patterns. This paper studies three tasks. The first is the mining of new words, that is, their automatic extraction from newspaper texts. The second and third are the automatic conversion of conventional unspaced Chinese sentences into so-called generalized words (寬詞) and into the simple short phrases of Spaced orthography. Because relaxing the traditional definition of a word is difficult to operationalize, we propose generalized words together with a set of "1234 rules" that specify them by character count, which makes them easy to apply.