Chinese Tree Structure Labeling Based on Conditional Random Fields

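Syntactic tree parsing is an essential task in Chinese natural language processing. In Chinese, the word is the smallest meaningful linguistic unit, and a sentence is composed of multiple words; structure parsing determines how words connect to one another and which words should be connected first. Recent research tends to apply machine learning to Chinese word segmentation and parsing. Traditional Chinese parsers reach an F-measure of about 70% on tree structure labeling. The first part of this study uses conditional random fields (CRF) to train and label Chinese syntactic structure. The training corpus consists of parses produced by the parsing system of the CKIP group (中研院詞庫小組) at Academia Sinica. Because these parses are not entirely correct, some of the erroneous parses were manually corrected before the CRF model was trained and used for labeling. Evaluation on the test corpus reaches an F-measure above 80%. Previous literature shows that Chinese syntactic structure and Mandarin prosodic structure are related to some degree. The second part of this study therefore defines a prosodic break tree according to pause duration and labels it with the same machine learning method used in the first part. The results show accuracies of 57.80% and 81.25% for the more easily identified long breaks B3 and B4, respectively, but only 35.54% for the short break B2-2, which is harder to identify.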

Keywords

Conditional random field; Chinese syntax tree; break labeling

Parallel Abstract

Syntax tree parsing is an important topic in Chinese natural language processing (NLP). In Mandarin, the word is the smallest meaningful unit, and a sentence is composed of many words. The role of parsing is therefore to decide how the words connect and which of them should be connected first. Recent studies tend to use machine learning for parsing. Traditional Chinese parsers achieve an F-measure of about 70%. In our system, we train and label tree structures with a conditional random field (CRF). The training data are parsing results produced by the CKIP parser; because these results are not entirely correct, we manually corrected some of the erroneous parses before model training. Labeling the test data with the CRF-based model achieves an F-measure of over 80%. Past work shows that Mandarin syntax is related to Mandarin prosody to some degree. We therefore define a prosodic break tree based on the pause duration between words and label it with the same method used for the syntax tree. We achieve accuracies of 57.80% and 81.25% for the long pauses B3 and B4, respectively, but only 35.54% for the short pause B2-2.
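The abstract reports tree-labeling quality as an F-measure but does not spell out how it is computed. The sketch below assumes a PARSEVAL-style definition, in which each constituent is a labeled span and a predicted span counts as correct only when its label and both boundaries match the gold tree. The function name `bracket_f_measure`, the `(label, start, end)` span format, and the toy spans are illustrative assumptions, not taken from the thesis.

```python
from collections import Counter


def bracket_f_measure(gold, predicted):
    """Return (precision, recall, F-measure) over labeled spans.

    Assumption: each span is a (label, start, end) tuple and a predicted
    span is correct only if it matches a gold span exactly.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a repeated span is matched at most as many
    # times as it occurs in both lists.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example (hypothetical spans over a five-word sentence).
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 4, 5), ("PP", 3, 4)]
p, r, f = bracket_f_measure(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} F={f:.4f}")
```

In this toy case three of the five predicted spans match three of the four gold spans, so precision is 0.60, recall is 0.75, and the F-measure is their harmonic mean, about 0.67.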

Parallel Keywords

Conditional Random Field; Chinese syntax tree; Break prediction

