
臺灣客語語料庫建置與客語詞彙使用初探

The Construction of Taiwan Hakka Corpus and Preliminary Analysis of Hakka Lexical Usage

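Abstract


This paper introduces the Taiwan Hakka Corpus, which is currently under construction. Its significance lies in being the first tagged Hakka corpus in Taiwan to include both written and spoken data, systematically collecting material from the six dialects of Taiwan Hakka. To overcome the many challenges encountered during construction, the corpus team formulated guidelines that reflect authentic Hakka usage, resolved problems with Hakka character selection and the input of rare characters, provided an interface entirely in Chinese, and independently developed a concordance and word segmentation system dedicated to Hakka. Taking high-frequency words as a starting point, the paper then examines the frequency rankings of the top 300 words in the Taiwan Hakka Corpus, the Academia Sinica Balanced Corpus of Modern Chinese (Taiwan Mandarin), and the Corpus of Contemporary American English (American English) to test whether all three natural languages conform to Zipf's law. It goes on to compare the ten most frequent words in Taiwan Hakka and Taiwan Mandarin, demonstrating how corpus research combines quantitative statistics and qualitative textual analysis in a single empirical application.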

Parallel Abstract


This paper introduces the Taiwan Hakka Corpus, currently under construction, and the procedures behind it. Systematically collecting both written and spoken varieties of Taiwan Hakka, it is the first part-of-speech-tagged Hakka corpus in Taiwan to do so. To meet the various challenges of construction, the corpus formulates standards based on authentic Hakka language use and tackles the problem of inputting Hakka characters, including rarely used ones. In addition, a concordance and segmentation system has been developed exclusively for Taiwan Hakka, with an interface entirely in Chinese to make the corpus easier for users to access. The distribution of the top 300 words in three corpora is then compared to examine whether Zipf's law for word frequencies holds in all three languages (Taiwan Hakka in the Taiwan Hakka Corpus; Taiwan Mandarin in the Academia Sinica Balanced Corpus of Modern Chinese [Sinica Corpus]; American English in the Corpus of Contemporary American English [COCA]). The results illustrate the kind of empirical study, both quantitative and qualitative, that the construction of this corpus makes possible for Taiwan Hakka.
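
The Zipf's law check described in the abstract can be illustrated with a small sketch. Zipf's law predicts that a word's frequency is roughly inversely proportional to its frequency rank, so plotting log frequency against log rank should give an approximately straight line with a slope near -1. The function name `zipf_slope`, the least-squares fit, and the toy token list below are assumptions made for illustration; the abstract does not say how the authors carried out the comparison.

```python
import math
from collections import Counter


def zipf_slope(tokens, top_n=300):
    """Least-squares slope of log(frequency) against log(rank).

    Hypothetical helper: a slope close to -1 over the top_n words is
    consistent with Zipf's law. This is not the authors' procedure.
    """
    # (word, count) pairs sorted from most to least frequent
    ranked = Counter(tokens).most_common(top_n)
    if len(ranked) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for _, freq in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data (not corpus data): the word at rank r occurs 2520 / r times,
# which is an exactly Zipfian distribution over ten words.
toy_tokens = []
for rank in range(1, 11):
    toy_tokens.extend([f"w{rank}"] * (2520 // rank))

print(zipf_slope(toy_tokens, top_n=10))
```

With real data, `tokens` would be the segmented word list from one corpus, and the slopes for the three corpora could be compared side by side. A regression on log-log values is only one rough way to assess the fit.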


