
自然語言處理技術應用於科技華語詞彙分析

A Vocabulary Analysis of Chinese for Science and Technology with Natural Language Processing Methods

Advisor: 洪嘉馡

Abstract


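This study applies natural language processing techniques to authentic language data in Chinese for science and technology (CST). Using 11,067 texts from PanSci (《泛科學》) as training data, it trains an LDA topic model and a Word2Vec word vector model, with the aim of supporting CST vocabulary teaching.

As internationalization and technological development reinforce each other, a growing number of foreign learners come to Taiwan to study in natural science-related departments. To meet their need to take specialized courses and to communicate academically with their peers, CST courses must bridge the gap between Chinese for general purposes (CGP) and academic Chinese for science and technology. Language research on this specialized domain remains insufficient, however, which leaves CST courses and teaching materials with two major problems: first, they cannot select vocabulary suited to the learners' different disciplines; second, they lack analysis of how vocabulary is used in the context of scientific texts. This study therefore focuses on three objectives: first, to delimit the range of CST vocabulary for different disciplinary topics and to propose reference word lists; second, to analyze how the co-occurring words of CST vocabulary differ between CGP and CST contexts; third, to compare the usage contexts and co-occurring words of CST near-synonyms.

First, based on the LDA topic modeling results, the study finds nine latent science and technology topics in popular science texts: "food science, nutrition", "biology, life science", "medicine, pharmacy, public health", "academic life", "information and communication technology, electrical and electronic engineering", "earth science, environmental science", "astronomy, aerospace engineering", "physics, chemistry, materials science", and "neuropsychology, statistics". Next, the words associated with each topic are graded for difficulty with the word grading standard retrieval system (詞語分級標準檢索系統) of the National Academy for Educational Research, and a recommended word list is built for each CST topic area. Finally, taking CST words from these lists as examples, the Word2Vec model is used to compute semantic similarity between words, to compare how CST words are used in CGP and CST contexts, and to analyze CST near-synonyms, in the hope of providing a reference for CST vocabulary teaching.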

Parallel Abstract


In this study, an LDA topic model and a Word2Vec word vector model were trained on 11,067 texts from PanSci in order to support the teaching of Chinese for science and technology (CST) vocabulary.

As internationalization and technological development interact, the number of foreign learners coming to Taiwan to study science-related subjects is increasing. To meet their learning needs in professional courses and in academic communication with their peers, the CST curriculum needs to bridge the gap between Chinese for general purposes (CGP) and academic Chinese for science and technology. However, language-related research in this specialized area is lacking, which has led to two major problems in CST curricula and materials: the inability to select appropriate vocabulary for learners in different disciplines, and the lack of analysis of vocabulary usage in scientific text contexts. This study therefore focuses on the following objectives: first, to select the range of CST words in different subject areas and propose a reference word list; second, to analyze the differences in the co-occurring words of CST vocabulary between CGP and CST contexts; and third, to compare the usage contexts and co-occurring words of CST synonyms.

First, based on the results of the LDA topic model, we found nine latent topics in popular science texts: "food science, nutrition", "biology, life science", "medicine, pharmacy, public health", "academic life", "information and communication technology, electrical and electronic engineering", "earth science, environmental science", "astronomy, aerospace engineering", "physics, chemistry, materials science", and "neuropsychology, statistics". Then, the vocabulary associated with each topic was graded for difficulty with the National Academy for Educational Research's word grading system, and a list of recommended words was created for each CST field. Finally, taking CST words from these lists as examples, we applied the Word2Vec model to calculate the semantic similarity between words, compared how CST words are used in CGP and CST contexts, and analyzed CST synonyms, so that the results can serve as a reference for teaching CST vocabulary.
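The second objective, comparing the co-occurring words of one vocabulary item across CGP and CST contexts, can be illustrated with a minimal Python sketch. It is not the study's method: it uses plain window-based co-occurrence counts instead of a trained Word2Vec model, and the toy sentences, the target word 反應 ("response" or "reaction"), the window size, and all function names are assumptions made up for illustration.

```python
from collections import Counter
from math import sqrt


def cooccurrence_counts(sentences, target, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    counts = Counter()
    for tokens in sentences:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, pre-segmented toy sentences; not data from the thesis.
general_corpus = [
    ["他", "的", "反應", "很", "快"],      # "his response is quick"
    ["大家", "的", "反應", "很", "熱烈"],  # "everyone's response was warm"
]
science_corpus = [
    ["化學", "反應", "產生", "熱能"],      # "chemical reaction produces heat"
    ["酵素", "催化", "反應", "速率"],      # "enzyme catalyzes reaction rate"
]

general = cooccurrence_counts(general_corpus, "反應")
science = cooccurrence_counts(science_corpus, "反應")

# Words that co-occur with the target only in the science corpus.
only_in_science = sorted(set(science) - set(general))

# Overlap between the two co-occurrence profiles (1.0 means identical).
overlap = cosine(general, science)
```

A low `overlap` value suggests that the word keeps different company in the two corpora, which is the kind of contrast the abstract describes. With a trained embedding model, the same comparison would instead be made between each corpus's nearest neighbors of the target word.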

