
基於深度學習判定日文句型之研究

Research on Determining Japanese Sentence Patterns Based on Deep Learning

Advisor: 魏世杰

Abstract


For beginners, Japanese sentence patterns are complex and easily confused. For example, Japanese particles include 「は」, 「が」, 「に」, and 「で」, and the same particle can often be classified into sentence patterns with different uses depending on its semantics. Traditional sentence-pattern determination relies on part-of-speech information, but parts of speech are themselves hard to determine. The recently popular deep-learning BERT language model has part-of-speech tagging ability, so this study uses the BERT model to determine the Japanese sentence pattern directly from an input sentence while marking the keywords critical to the decision, skipping the part-of-speech step. To test BERT's ability to mark sentence patterns, this study examines the following four classes of basic patterns: (1) determining a basic pattern and marking consecutive keywords, taking the plain-form verb as an example; (2) determining a basic pattern and marking non-consecutive keywords, taking the 「たり」 pattern as an example; (3) determining cases where the surface form is identical but the patterns differ, taking the 「のに」 pattern as an example; (4) determining and marking patterns about the existence of people or things at a specific place, taking the 「ある/いる」 pattern as an example. The experimental results, measured by the critical success index (CSI), show that after fine-tuning, the plain-form verb model attains a training CSI of 98.3% and a validation CSI of 98.6%; the 「たり」 model, 98.4% and 98.5%; the 「のに」 model, 96.3% and 93.2%; the 「ある/いる」 model, 98.4% and 78.6%. These results show that deep learning can correctly determine nearly 80% or more of the tested sentence patterns, partly solving the problem that some patterns are hard to determine with regular expressions. The results can serve as a reference for building future sentence-pattern analysis systems, helping learners become more familiar with the use of Japanese sentence patterns and strengthening their self-learning ability. This study shows that the deep-learning BERT language model has great potential for sentence-pattern determination and marking, and is worth further exploration toward the goal of assisting beginning learners of Japanese.
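The thesis code is not shown here, but the marking task it describes can be framed as token classification: each token is labeled as part of a pattern keyword or not, and a BERT model is fine-tuned to predict those labels. A minimal sketch of the labeling target for the non-consecutive 「たり」 pattern follows; the example sentence, the naive word-level tokenization, and the span positions are invented for illustration (a Japanese BERT would use subword tokenization):

```python
# Illustrative sketch (not the thesis code): building BIO labels for the
# "たり" pattern, whose keywords are non-consecutive in the sentence.
# Tokens inside a keyword span get B (begin) or I (inside); all others get O.

def bio_labels(tokens, keyword_spans):
    """Assign B/I labels over keyword spans (end-exclusive), O elsewhere."""
    labels = ["O"] * len(tokens)
    for start, end in keyword_spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels

# "On days off, I do things like reading books and listening to music."
tokens = ["休み", "の", "日", "は", "本", "を", "読ん", "だり",
          "音楽", "を", "聞い", "たり", "し", "ます"]
# Two discontinuous keyword spans: 読ん+だり and 聞い+たり.
spans = [(6, 8), (10, 12)]

print(bio_labels(tokens, spans))
# → ['O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O', 'B', 'I', 'O', 'O']
```

A fine-tuned token-classification head then predicts one such label per token, which both identifies the pattern and marks its keywords in a single pass.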

Parallel Abstract


For beginners, Japanese sentence patterns are complex and confusing. For example, Japanese particles include "は" (ha), "が" (ga), "に" (ni), and "で" (de). The same particle can often be classified into different sentence patterns with different semantics. The traditional detection of sentence patterns requires the assistance of parts of speech, but it is in itself a difficult task to determine the parts of speech. The recently popular deep learning BERT language model has the ability to tag parts of speech, so we want to use the BERT model to directly determine the Japanese sentence pattern. Given an input sentence, the BERT model will detect a sentence pattern and mark the critical keywords of the pattern, thus bypassing the step of tagging parts of speech. In order to test the ability of BERT to detect sentence patterns, this study examines the following four types of patterns: (1) basic sentence patterns with consecutive keywords, taking the plain present tense verb as an example; (2) basic sentence patterns with non-consecutive keywords, taking the "たり" (tari) pattern as an example; (3) different basic sentence patterns with the same surface form, taking the "のに" (noni) pattern as an example; (4) sentence patterns related to the existence of people or things in a place, taking the "ある/いる" (aru/iru) pattern as an example. The experimental results show that, in terms of the critical success index (CSI), after fine-tuning, the verb model has a CSI of 98.3% in training and 98.6% in validation; the "たり" (tari) model has a CSI of 98.4% in training and 98.5% in validation; the "のに" (noni) model has a CSI of 96.3% in training and 93.2% in validation; the "ある/いる" (aru/iru) model has a CSI of 98.4% in training and 78.6% in validation. It is verified that through deep learning, nearly 80% or more of the tested sentence patterns can be correctly determined. It can also solve the problem that some sentence patterns are hard to detect by regular expressions.
This experience can be used for building an analysis system of Japanese sentence patterns, thereby enriching the self-learning ability of Japanese language learners. The study reveals that the BERT language model has great potential for sentence pattern detection and tagging, and it is worthy of further exploration to help more beginners learn Japanese.
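The critical success index used above is a standard metric, CSI = TP / (TP + FN + FP), i.e. hits divided by hits plus misses plus false alarms; note that, unlike accuracy, it ignores true negatives. A minimal sketch under one plausible token-level reading (any non-"O" keyword tag counts as positive; this reading is an assumption, not taken from the thesis):

```python
# Illustrative sketch: token-level critical success index (CSI).
# CSI = TP / (TP + FN + FP); true negatives (correct "O" tags) are ignored,
# so a model cannot score well just by predicting "O" everywhere.

def critical_success_index(y_true, y_pred):
    """CSI over aligned tag sequences; any non-'O' tag counts as positive.
    Assumes at least one positive appears in y_true or y_pred."""
    tp = sum(t != "O" and p != "O" for t, p in zip(y_true, y_pred))
    fn = sum(t != "O" and p == "O" for t, p in zip(y_true, y_pred))
    fp = sum(t == "O" and p != "O" for t, p in zip(y_true, y_pred))
    return tp / (tp + fn + fp)

# One keyword token missed (index 4): 3 hits, 1 miss, 0 false alarms.
gold = ["O", "B", "I", "O", "B", "I"]
pred = ["O", "B", "I", "O", "O", "I"]
print(critical_success_index(gold, pred))
# → 0.75
```

Under this reading, the "ある/いる" model's validation CSI of 78.6% means that roughly one in five keyword tokens was missed or falsely marked, which is consistent with the abstract's "nearly 80%" claim.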

