  • Thesis

文字部件為本的語料分析:一個子字詞層次的中文語料庫工具

Glyph-based Corpus Analysis: A Toolkit for Sub-character Analysis of Chinese Corpora

Advisor: 謝舒凱

Abstract


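The Chinese writing system holds a unique position among the world's writing systems because the vast majority of Chinese characters are logograms. Chinese characters therefore carry semantic information themselves, unlike many other writing systems, which carry meaning by mapping spelled-out sounds onto words. In addition, Chinese characters can usually be decomposed into smaller elements, and these elements often carry meaning and pronunciation related to the character. However, because of how characters are encoded, computer users cannot easily access this rich information: a character corresponds to a single code point, and the code point itself does not record the character's internal structure. For example, users of Chinese know that 淋 and 霖 are pronounced the same because they share the component 林. But this shared component cannot be obtained from the encodings of 淋 and 霖: in Unicode they correspond to U+6DCB and U+9716 respectively, and these code points cannot represent the fact that the two characters are related. To address this limitation, we developed a Chinese corpus tool capable of analysis at the sub-character level. The tool gives users access to the rich component information of Chinese characters (both radicals and non-radical components). For example, users can search for characters by a shared component (through the shared component 林, one can retrieve 淋, 霖, 琳, 箖, and 惏) and can use such information for quantitative analysis of corpus data. In addition to the corpus tool, we conducted a case study to verify with empirical data whether sub-character information is useful and to explore the semantic relationships between this level and higher levels. The results show that the semantic information of certain radicals is significantly associated with the semantic information of words, whereas most radicals and word types have no clear mapping. Finally, we point out some implications of the highly recursive internal structure of Chinese characters for the present study and discuss potential ways to resolve the resulting difficulties.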

Parallel Abstract


The Chinese writing system is exceptional among the world’s writing systems in that Chinese characters are predominantly logograms that denote words or morphemes. Hence, Chinese characters carry semantic information directly, without the reference to pronunciation that other writing systems require. Furthermore, Chinese characters are often decomposable into smaller elements that hint at their meaning and pronunciation. This rich internal information, however, is not easily accessed on computers because of the way characters are encoded: a Chinese character is mapped onto a single code point, and the code point carries no information about the character's internal structure. For instance, users of Chinese characters know that 淋 and 霖 are pronounced identically because they share the phonetic component 林. But this common component cannot be recovered from the encodings of 淋 and 霖, which are U+6DCB and U+9716 in Unicode, respectively. To address this limitation, the current work develops a software toolkit for analyzing Chinese corpora at and below the level of characters. The toolkit provides access to character internals (radicals and non-radical components) that were previously unavailable to users. This allows users, for instance, to search for characters by a common component (e.g., 淋, 霖, 琳, 箖, and 惏 can be retrieved by their common component 林) and to analyze corpus data quantitatively with such sub-character information. In addition to introducing the corpus toolkit, a case study is conducted to collect empirical evidence for the usefulness of character internals and to explore their relationships to larger units of the Chinese writing system. The results indicate that the semantic information encoded in certain character radicals has a reliable relationship with word semantic types, although most radicals show no clear one-to-one correspondence with word semantic types. Finally, several complications arising from the highly recursive internal structure of Chinese characters are pointed out, and potential solutions to these complications are discussed.
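
The encoding limitation described in the abstract can be illustrated with a minimal Python sketch. It is not the toolkit's API or data: the `COMPONENTS` table, `code_point`, and `search_by_component` are hypothetical names introduced here, and the decompositions are hand-written, approximate, and only one level deep. The sketch shows that the code points of 淋 and 霖 reveal nothing about the shared component 林, whereas an external decomposition table makes component-based search possible.

```python
# Hypothetical, hand-written decomposition table for illustration only.
# It is NOT the toolkit's data; component labels are approximate and
# only one level deep (real characters decompose recursively).
COMPONENTS = {
    "淋": ["氵", "林"],
    "霖": ["雨", "林"],
    "琳": ["王", "林"],
    "箖": ["竹", "林"],
    "惏": ["忄", "林"],
    "河": ["氵", "可"],
}


def code_point(char):
    """Return the Unicode code point of a single character as 'U+XXXX'."""
    return f"U+{ord(char):04X}"


def search_by_component(component, table=COMPONENTS):
    """Return every character in the table that lists `component`."""
    return [ch for ch, parts in table.items() if component in parts]


# The code points are unrelated numbers; nothing in them points to 林.
print(code_point("淋"), code_point("霖"))

# With the decomposition table, characters sharing 林 can be retrieved.
print(search_by_component("林"))
```

Under these assumptions, the first `print` shows the two code points given in the abstract, and the second lists the five characters that share 林. A real resource would also need to handle the recursive structure the abstract mentions, since a component such as 林 itself decomposes into 木 and 木.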

