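The goal of this study is to develop a statistical method that automatically extracts and identifies such erroneous interlanguage. Interlanguage, as defined by linguists, is an intermediate language that anyone produces while learning a second language: it takes the form of the second language being learned but is influenced by characteristics of the learner's native language, and learners cannot tell by themselves whether they are writing correct second-language text or interlanguage. The framework proposed in this thesis is highly flexible: training requires no human-labeled sentences, so it can easily be transferred to any pair of native and second languages. The system first uses machine translation to simulate the characteristics of interlanguage and uses the output as training data for a classifier that judges whether a sentence resembles the machine-simulated interlanguage. It then labels the training data with this classifier, takes the sentences judged most similar to interlanguage as interlanguage training data, and retrains a better interlanguage classifier. The system is evaluated in a setting where the native language is Chinese and the second language is English. In the experiment, the system is applied to sentences from the abstracts of theses written by master's and doctoral students in the Republic of China, taken from 中華民國碩博士論文網 (the national repository of master's and doctoral theses), and it achieves 64.58% precision and 56.67% recall.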
This paper describes a statistical method for automatically retrieving and identifying interlanguage sentences. Interlanguage is the kind of language produced by a second language learner who is not yet fully proficient but is trying to approximate the target language. The framework requires no human-annotated data and is language-independent, so it can be applied to retrieve interlanguage between any two given languages. The framework has three stages: the first approximates interlanguage with an order-preserving phrasal machine translator, the second trains a classifier to identify interlanguage sentences, and the third refines the classifier by retraining a new classifier on the interlanguage identified by the second-stage classifier. The framework is evaluated on Chinese-English interlanguage: when identifying Chinese-English interlanguage sentences among normal English sentences in the English abstracts of theses written by graduate students in Taiwan, it achieves 64.58% precision and 56.67% recall.
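
The three stages described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the toy phrase table, the bigram naive Bayes scorer (`BigramNB`), the `top_k` cutoff, and the choice to treat the remaining sentences as normal English during retraining are all assumptions made only for illustration.

```python
import math
from collections import Counter

def translate_order_preserved(tokens, phrase_table):
    """Stage 1 (toy): translate phrase by phrase, keeping the source word order."""
    return " ".join(phrase_table.get(t, t) for t in tokens)

def bigrams(sentence):
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    return list(zip(words, words[1:]))

class BigramNB:
    """Hypothetical two-class scorer over word bigrams with add-one smoothing."""

    def fit(self, interlanguage, normal):
        # Stage 2: count bigrams separately for each class.
        self.counts = {1: Counter(), 0: Counter()}
        for label, sentences in ((1, interlanguage), (0, normal)):
            for s in sentences:
                self.counts[label].update(bigrams(s))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        self.totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        return self

    def score(self, sentence):
        """Sum of log-likelihood ratios; higher means more interlanguage-like.
        Class priors are ignored in this sketch."""
        v = len(self.vocab) + 1  # +1 reserves mass for unseen bigrams
        total = 0.0
        for bg in bigrams(sentence):
            p1 = (self.counts[1][bg] + 1) / (self.totals[1] + v)
            p0 = (self.counts[0][bg] + 1) / (self.totals[0] + v)
            total += math.log(p1 / p0)
        return total

def refine(classifier, unlabeled, top_k):
    """Stage 3: relabel with the first classifier, then retrain on its output.
    Assumption: the top_k highest-scoring sentences become the interlanguage
    class and the rest are treated as normal English."""
    ranked = sorted(unlabeled, key=classifier.score, reverse=True)
    return BigramNB().fit(ranked[:top_k], ranked[top_k:])

# Toy data, invented for illustration only.
phrase_table = {"我": "I", "昨天": "yesterday", "去": "go", "學校": "school"}
simulated = [translate_order_preserved(["我", "昨天", "去", "學校"], phrase_table),
             "he tomorrow go market"]
normal = ["I went to school yesterday", "he will go to the market tomorrow"]
first = BigramNB().fit(simulated, normal)

unlabeled = ["she yesterday go school", "she went to school yesterday",
             "they tomorrow go market", "they will go to the market tomorrow"]
refined = refine(first, unlabeled, top_k=2)
```

In this sketch the first classifier only ever sees machine-simulated interlanguage, while the refined one is trained on sentences drawn from the target corpus itself, which mirrors the motivation for the third stage.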