Professional or formal writing must be concise and elegant rather than wordy. Starting from a colloquial first draft, text is trimmed and refined into formal written prose. Deleting characters and words is among the most frequent operations in newspaper editing, yet it is still performed manually. We obtained one year of newspaper editing records to study this deletion phenomenon. Chinese distinguishes a literary style that favors single-character words from a vernacular style that favors two-character words; in speech the latter helps avoid homophone ambiguity, but in writing part of a word can often be dropped while its meaning is preserved, which makes subword deletion possible. Other languages have fixed sets of abbreviations, but nothing as pervasive as the thousands of distinct word reductions in Chinese, where the same word may even be shortened in several different ways. With many editors and no explicit rules or standards, consistency is hard to achieve, compounded by the inherent ambiguity of the problem and the multiple valid choices. Our experiments show that although a machine translation model achieves high precision, its recall is low, especially for subword deletion. A multi-level sequence labeling model combining word and character features attains the best performance. Given the ambiguity of the problem, the multiple valid choices, and only a single reference answer per sentence, our model performs well on high-frequency words and function words, but open word classes such as nouns and verbs, with only a few instances per word, are much harder; our experiments show that part-of-speech information is particularly helpful there.
Writing in a professional or formal context requires conciseness. Starting from a colloquial draft, text is gradually refined and wordiness is removed, resulting in a more formal style. In newspaper editing this is among the most frequent operations, yet it is still carried out manually. We obtained a year of editing records and provide some insight into this phenomenon. In spoken Chinese, many words are composed of two or more characters, while in writing the same meaning can often be conveyed by a subsequence of those characters; this gives rise to subword deletion. We show this to be an open-class problem, with thousands of different word reduction pairs. Often there exist several reduction or deletion possibilities for the same word, which makes consistency hard to achieve across a variety of human annotators, given only a single reference and no explicitly formulated rules. We show that a neural machine translation based model can usually judge with very high precision whether to delete a word, but suffers from low recall, especially at the subword level. By combining sequence labeling at the word and character levels in a single model, we attain the best performance for both full-word and subword deletion. Considering the ambiguity inherent in the problem and given only a single reference, our model attains reasonable consistency, especially on grammatical function words with hundreds or even thousands of instances available for training. Open word classes are more difficult to handle, with in many cases only a few instances per word; we show that syntactic features are particularly helpful for these.
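To make the task concrete, the deletion problem described above can be framed as character-level sequence labeling: each character receives a KEEP or DEL label, and applying the labels yields the edited sentence. The sketch below illustrates only this labeling scheme, not the paper's neural model; the label names and the helper function are illustrative assumptions.

```python
# Minimal sketch of character-level deletion labeling (illustrative only;
# the actual system uses a learned multi-level sequence labeler).

def apply_labels(chars, labels):
    """Keep characters labeled 'KEEP', drop those labeled 'DEL'."""
    return "".join(c for c, l in zip(chars, labels) if l == "KEEP")

# Toy subword deletion: the two-character word 但是 ("but") is
# reduced to the single character 但 in formal written style.
chars = list("但是他來了")
labels = ["KEEP", "DEL", "KEEP", "KEEP", "KEEP"]
print(apply_labels(chars, labels))  # → 但他來了
```

Full-word deletion is the special case where every character of a word is labeled DEL; subword deletion, as in the example, drops only some characters while the remainder preserves the word's meaning.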