The Secret of Lexical Conventionalization: A Qualitative and Quantitative Analysis across Linguistic Aspects

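Previous research on words offers many insights, which fall into two broad areas: theoretical linguistic analysis and applications in language processing. Theoretical analysis takes three main perspectives: historical linguistics, which studies the historical development of linguistic phenomena; lexical semantics, which examines the synchronic behavior of neologisms; and computational linguistics, which predicts the survival of words. All three can inform lexicography, the design of language teaching materials, and the construction of resources for natural language processing. However, few related studies combine quantitative and qualitative perspectives. In addition, the target words selected in previous studies are limited in scope. Temporal information and a range of linguistic variables should also be brought into the discussion to gain a deeper understanding of what causes a word to become conventionalized. The way words are organized through conceptual links, and the mental lexicon that accumulates over time, should likewise be taken into account. This thesis therefore approaches the topic from both quantitative and qualitative viewpoints. It proposes three possible life stages of a word (diffusion, conventionalization, and inactivation), examines them using temporal information and six linguistic aspects (phonology, morphology, semantics, syntax, pragmatics, and sociolinguistics), and aims to apply the results to lexical prediction and resource construction. On the quantitative side, linear regression models are used to identify the linguistic features that characterize words from different time points. Pragmatics significantly explains how stably words that existed before 1950 are used, whereas for words coined after 1950, stable use depends on syntactic factors. This suggests that the longer a word lives, the more it is tied to experiential and pragmatic knowledge, while for recently coined words, syntactic combinability is decisive for whether they come to be used stably. Newly diffused words and words that have existed for centuries are very similar in stability of use, but a logistic regression model shows that number of syllables, number of near-synonyms, number of synonyms, activeness in comments, and loanword status successfully distinguish the two groups. In terms of linguistic characteristics, on the other hand, words coined after 1950 resemble recently diffused words. Words coined after 1950 are therefore used as training data for a prediction model of the future development of currently diffusing words. The results show that the number of distinct word types co-occurring before and after the target word has significant predictive power, and the model reaches an accuracy of 0.6335. The qualitative analysis examines competition among synonyms: syntactic compatibility and the richness of a word's conceptual relations appear to be the key to whether it wins out over its synonyms and comes into wide use. In addition, words from different time points differ in how active they are in posts and in comments. Unlike the other two groups, diffused words are more active in comments, which suggests that they spread more readily in a reply-oriented, spoken-like style and in interaction. These findings can be applied to adding words to language resources. Pragmatic stability, syntactic combinability, and semantics can serve as criteria for adding words. The added words include widely used variants, words used more stably in expressing a meaning, and items lexicalized from the same concept, which indicates the coverage of the proposed criteria.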

Keywords

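lexical conventionalization; lexical life; neologisms; lexical diffusion; internet language; language change; quantitative linguistics; corpus; lexicography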

Parallel Abstract

Previous studies offer many insights into lexical items. They fall broadly into two areas: linguistic analysis and application. Linguistic analysis takes three main angles: studies of the historical development of linguistic phenomena in historical linguistics, investigations of the synchronic emergence of neologisms in lexical semantics, and prediction models of word survival in computational linguistics. All of these can be applied to selecting words for lexicography, designing language teaching materials, and constructing resources for natural language processing. However, a single work rarely combines quantitative and qualitative methods. Second, the generality of the target words included in previous studies needs reconsideration. Meanwhile, temporal information about lexical items and a range of linguistic aspects should be brought in to probe more deeply into the factors that contribute to the conventionalization of a word. Both the organization of the mental lexicon through conceptual associations and its accumulation over time should be considered when addressing this issue. This thesis therefore conducts quantitative profiling and qualitative analysis and applies them to constructing lexical resources. It proposes three life stages of lexical items (diffusion, conventionalization, and inactivation), includes target words from different temporal points, and adopts linguistic variables from six linguistic aspects (phonology, morphology, semantics, syntax, pragmatics, and sociolinguistics). In the quantitative profiling, linear regression models are built to distinguish words from different temporal points. The results show that pragmatics best accounts for the behavior of words that existed before 1950 and syntax best captures words coined after 1950. This implies that longer-lived words may be associated with rich experiential and pragmatic knowledge, whereas for recently coined words, syntactic compatibility plays an important role in determining fluctuation in their use. Diffused words are similar to words that have existed for centuries in their Revised Constant U. A logistic regression model shows that number of syllables, number of near-synonyms, number of synonyms, activeness in comments, and whether the word is borrowed from another language are statistically significant variables that distinguish diffused words from words that have existed for centuries. On the other hand, words coined after 1950 and diffused words are quite similar in their linguistic characteristics. A prediction model trained on words coined after 1950 is therefore built to forecast the potential life of diffused words. It shows that the number of word types co-occurring before the target word is a statistically significant predictor. With words from before 1950 and recently diffused words as test data, the model reaches an accuracy of 0.6335. Qualitative analysis of competition among words from the same synset indicates that structural compatibility and the conceptual relations a word is involved in may be the key for one lexical item to win out over its synonyms. In addition, words from different temporal points differ in how active they are in comments and in posts on PTT. Diffused words are more active in comments, which implies that they are associated with a feedback-oriented, spoken-like style and diffuse through interaction. These findings can be applied to making suggestions for lexicography. Pragmatic stability in use, syntactic compatibility, and number of senses are taken as criteria for expanding the inclusion of words. The updated inclusion covers popularly used variants, words with more stable semantic representation, and words lexicalized from the same conceptual experiences, which indicates the inclusiveness of the proposed criteria.
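
The logistic regression step summarized in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two features (syllable count and a loanword flag), the eight toy rows, the labels, and the function names are hypothetical and are not the thesis data or code. The sketch fits a binary classifier by plain gradient descent and reports classification accuracy, which is the same kind of figure as the 0.6335 reported for the prediction model.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def score(weights, bias, x):
    """Linear score w . x + b for one feature row."""
    return sum(w * v for w, v in zip(weights, x)) + bias

def mean_log_loss(weights, bias, rows, labels):
    """Average negative log-likelihood over the data set."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = sigmoid(score(weights, bias, x))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(rows)

def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(score(weights, bias, x)) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def accuracy(weights, bias, rows, labels):
    """Share of rows whose predicted class (threshold 0.5) matches the label."""
    hits = 0
    for x, y in zip(rows, labels):
        predicted = 1 if sigmoid(score(weights, bias, x)) >= 0.5 else 0
        hits += int(predicted == y)
    return hits / len(rows)

# Hypothetical toy data: (number of syllables, is_loanword).
# Label 1 = diffused word, 0 = word that has existed for centuries.
toy_rows = [(2, 0), (2, 0), (1, 0), (2, 0), (3, 1), (4, 1), (3, 0), (4, 1)]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

initial_loss = mean_log_loss([0.0, 0.0], 0.0, toy_rows, toy_labels)
weights, bias = fit_logistic(toy_rows, toy_labels)
final_loss = mean_log_loss(weights, bias, toy_rows, toy_labels)
acc = accuracy(weights, bias, toy_rows, toy_labels)
print("loss before:", initial_loss, "loss after:", final_loss, "accuracy:", acc)
```

In practice an established statistics package would normally replace the hand-written fitting loop, and accuracy would be measured on held-out test words rather than on the training rows as in this toy example.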

Parallel Keywords

conventionalization; life cycle of words; neologism; diffusion; internet language; language change; quantitative linguistics; corpus; lexicology

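Cited By

Hsieh, S.-K., & Tseng, Y.-H. (2019). Deep lexicon: Toward a knowledge-oriented foundation for artificial intelligence [深度詞庫：邁向知識導向的人工智慧基礎]. Chinese Journal of Psychology (中華心理學刊), 61(3), 231-247. https://doi.org/10.6129/CJP.201909_61(3).0004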
