Professional or formal writing must be concise and elegant rather than wordy. Starting from a colloquial first draft, text is trimmed and refined into formal written prose. Deleting characters and words is among the most frequent operations in newspaper editing, yet it is still performed manually. We obtained one year of newspaper editing records to study this deletion phenomenon. Chinese distinguishes a literary style that favors single-character words from a vernacular style that favors two-character words; in speech the latter helps avoid homophone ambiguity, but in writing part of a word can often be dropped while its meaning is preserved, which makes subword deletion possible. Other languages have fixed sets of abbreviations, but nothing as pervasive as the thousands of distinct word reductions in Chinese, where the same word may even be shortened in several different ways. With many editors and no explicit rules or standards, consistency is hard to achieve, compounded by the inherent ambiguity of the problem and the multiple valid choices. Our experiments show that although a machine translation model achieves high precision, its recall is low, especially for subword deletion. A multi-level sequence labeling model combining word and character features attains the best performance. Given the ambiguity of the problem, the multiple valid choices, and only a single reference answer per sentence, our model performs well on high-frequency words and function words, but open word classes such as nouns and verbs, with only a few instances per word, are much harder; our experiments show that part-of-speech information is particularly helpful there.
Writing in a professional or formal context requires conciseness. Starting from a colloquial draft, text is gradually refined and wordiness is removed, resulting in a more formal style. In newspaper editing this is among the most frequent operations, yet it is still carried out manually. We obtained a year of editing records and provide some insight into this phenomenon. In spoken Chinese, many words are composed of two or more characters, while in writing the same meaning can often be conveyed by a subsequence of those characters; this gives rise to subword deletion. We show this to be an open-class problem, with thousands of different word reduction pairs. Often there exist several reduction or deletion possibilities for the same word, which makes consistency hard to achieve across a variety of human annotators, given only a single reference and no explicitly formulated rules. We show that a neural machine translation based model can usually judge with very high precision whether to delete a word, but suffers from low recall, especially at the subword level. By combining sequence labeling at the word and character levels in a single model, we attain the best performance for both full-word and subword deletion. Considering the ambiguity inherent in the problem and given only a single reference, our model attains reasonable consistency, especially on grammatical function words with hundreds or even thousands of instances available for training. Open word classes are more difficult to handle, with in many cases only a few instances per word; we show that syntactic features are particularly helpful for these.
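To make the task concrete, the deletion problem described above can be framed as character-level sequence labeling: each character receives a KEEP or DEL label, and applying the labels yields the edited sentence. The sketch below illustrates only this labeling scheme, not the paper's neural model; the label names and the helper function are illustrative assumptions.

```python
# Minimal sketch of character-level deletion labeling (illustrative only;
# the actual system uses a learned multi-level sequence labeler).

def apply_labels(chars, labels):
    """Keep characters labeled 'KEEP', drop those labeled 'DEL'."""
    return "".join(c for c, l in zip(chars, labels) if l == "KEEP")

# Toy subword deletion: the two-character word 但是 ("but") is
# reduced to the single character 但 in formal written style.
chars = list("但是他來了")
labels = ["KEEP", "DEL", "KEEP", "KEEP", "KEEP"]
print(apply_labels(chars, labels))  # → 但他來了
```

Full-word deletion is the special case where every character of a word is labeled DEL; subword deletion, as in the example, drops only some characters while the remainder preserves the word's meaning.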