
應用語意分析於衡量文獻引用關係之探討

A Study on Applying Semantic Analysis in Measuring Citation Relationships

Advisor: 陳光華
The full text will be available for download on 2024/07/07.

Abstract


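This study analyzes three widely used citation relationships, namely direct citation, bibliographic coupling, and co-citation, to understand how the analysis results agree and differ when the relationships are measured with different models. Six models are analyzed: the classical model used in practice, two semantic models designed in this study, and the frequency, distance, and lexical models proposed in related research. For the semantic models, this study uses open-source natural language processing tools built on Wordnet and BERT, trained on the Awais (2011) dataset, to determine the sentiment polarity of citing sentences and the semantic similarity between them. On one hand, direct citation relationships are classified by the sentiment polarity of the citing sentence; on the other hand, the semantic similarity between citing sentences is used to revise the strength of bibliographic coupling and co-citation relationships. The frequency model adjusts the strength of direct citation, bibliographic coupling, and co-citation by in-text citation frequency. The lexical model adjusts the strength of bibliographic coupling and co-citation by the similarity of the vocabulary used in citing sentences. The distance model adjusts co-citation strength by the relative positions of in-text citations.
The results of each model are compared in terms of network structure, clustering results, key nodes, and strong relationships to assess performance in citation analysis. The citation networks formed by each model are divided into whole networks and core networks. For both types, the models are compared on number of nodes, number of edges, network density, number of connected components, transitivity, and average clustering coefficient. To compare clustering results, this study clusters the nodes of each model's core network with a modularity-based clustering algorithm. After a preliminary inspection of the number of clusters, the number of isolated nodes (singletons), and cluster sizes, the Adjusted Rand Index is used to determine how similar the clustering results are. Textual coherence is then used to quantify the quality of the clustering results. The topic of each cluster is identified from the high-frequency terms in the titles of its documents, and the topic analysis results are compared across models. Finally, at the node and relationship level, the study examines, for each model, the sentiment polarity with which source articles are cited and their citation counts in the direct citation network, as well as the strongly related document pairs in the bibliographic coupling and co-citation networks. By analyzing in this way how the models perform in measuring each type of citation relationship, this study summarizes the strengths and weaknesses of applying semantic analysis techniques in current citation analysis.
Based on this research design, this study selected fifteen journals in library and information science and took the 10,088 articles published in them as the research subjects. At the network level, for direct citation, determining sentiment polarity and removing negative citations has no evident effect on the structure of either the whole network or the core network. In the bibliographic coupling network, adjusting relationship strength produces larger differences in core network structure but no evident change in whole network structure. In the co-citation network, every network indicator shows evident differences in both the whole network and the core network.
In terms of the similarity of core network clustering results, for direct citation only the classical model yields clearly different results, while the clustering results of the frequency, Wordnet, and BERT models are very similar. For bibliographic coupling, the results differ slightly across models, but apart from the more pronounced difference of the lexical model, the differences among the other models are not evident. For co-citation, the indicators show that most models differ clearly from one another. Textual coherence and topic analysis show that, when applied to co-citation, the semantic models achieve higher textual coherence and are able to uncover new topics in the research field. When applied to direct citation and bibliographic coupling, however, they do not clearly improve textual coherence, and the topic analysis results are also very similar.
At the node and relationship level, source documents that have been cited positively are more likely to have higher direct citation counts than those that have not. This tendency becomes more evident when multiple semantic models all judge the source document to have been cited positively, or when the time needed to accumulate citations is taken into account. For bibliographic coupling and co-citation relationships, however, models using semantic analysis were not observed to perform better.
Overall, the effect of the semantic analysis models designed here varies with the type of citation relationship and the level of analysis. At the network level, excluding negative citations has very little effect on the network results, which may mean that current semantic analysis models still fall short in detecting negative citations, or that the effect of negative citations is less pronounced than previous scholars expected. For bibliographic coupling and co-citation, the models clearly affect core network structure. The comparison of clustering results shows that the current semantic analysis models yield a clear improvement only when applied to co-citation: besides performing better on textual coherence, their topic analysis results better reflect changes in the field. When applied to direct citation and bibliographic coupling, no clear improvement is found. At the node and relationship level, using semantic analysis models to distinguish the sentiment polarity of citing sentences helps assess the influence of cited documents, but using semantic similarity to revise bibliographic coupling and co-citation yields no observed further improvement.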
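As an illustration of the three citation relationships in their classical (unweighted) form, the following minimal sketch derives direct citation pairs, bibliographic coupling strength, and co-citation strength from reference lists. The toy data and all function names are hypothetical and are not taken from the thesis; the frequency, lexical, distance, and semantic models described above would replace the simple counts with weighted values.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each citing paper maps to the references it cites.
citations = {
    "A": ["X", "Y", "Z"],
    "B": ["X", "Y"],
    "C": ["Y", "Z"],
}


def direct_citation(citations):
    """Return the set of (citing, cited) pairs."""
    return {(src, ref) for src, refs in citations.items() for ref in set(refs)}


def bibliographic_coupling(citations):
    """Count the references shared by every pair of citing papers."""
    strength = Counter()
    for (p, refs_p), (q, refs_q) in combinations(sorted(citations.items()), 2):
        shared = len(set(refs_p) & set(refs_q))
        if shared:
            strength[(p, q)] = shared
    return strength


def co_citation(citations):
    """Count how many citing papers cite both members of each reference pair."""
    strength = Counter()
    for refs in citations.values():
        for pair in combinations(sorted(set(refs)), 2):
            strength[pair] += 1
    return strength


dc = direct_citation(citations)
bc = bibliographic_coupling(citations)
cc = co_citation(citations)
```

In this toy data, papers A and B share two references, so their coupling strength is 2, and references X and Z are cited together only by A, so their co-citation strength is 1.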

Parallel Abstract


The present study investigates three kinds of citation relationships, direct citation (DC), bibliographic coupling (BC), and co-citation (CC), to understand the effects of considering semantic meaning when conducting citation analysis. Six models were included in this study. The classical model is the general way to implement citation analysis. The frequency model adjusts the strength of DC, BC, and CC by the number of in-text citations. The lexical model revises BC and CC strength based on the lexical similarity of citances. The distance model weights CC strength by the relative locations of citations. The other two models, the Wordnet and BERT models, are based on open-source tools and trained on the corpus provided by Awais (2011) to determine the sentiment polarity of citations and measure the semantic similarity between two citations. Sentiment polarity was used to classify DC, and semantic similarity was used to weight BC and CC. To evaluate these models, the present study compares their results at three levels: network, cluster, and node/relationship. At the network level, six indicators were used: number of nodes, number of edges, network density, number of connected components, transitivity, and average clustering coefficient. At the cluster level, the clusters produced by a modularity-based clustering algorithm were first examined by number of clusters, number of singletons, and cluster size. Then, the Adjusted Rand Index was used to measure the similarity between the clustering results. This study further evaluated the quality of the clustering results based on textual coherence and subject analysis. At the node/relationship level, this research examined the correlation between the sentiment types of a reference's citations and its DC count. It also investigated whether citation strength is higher when two works' topics are highly similar.
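The network-level indicators named above can be sketched in plain Python as follows. This is a minimal illustration that assumes an undirected graph without self-loops or duplicate edges; the edge list and function names are hypothetical, and the thesis does not state which software or graph representation it used.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """Share of possible node pairs that are actually connected."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))


def connected_components(adj):
    """Count connected components with an iterative depth-first search."""
    seen, count = set(), 0
    for start in adj:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
    return count


def _triplets(adj, node):
    """Return (closed, total) neighbor pairs centered on node."""
    nbrs = adj[node]
    total = len(nbrs) * (len(nbrs) - 1) // 2
    closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return closed, total


def transitivity(adj):
    """Closed triplets divided by all connected triplets in the graph."""
    closed = total = 0
    for node in adj:
        c, t = _triplets(adj, node)
        closed += c
        total += t
    return closed / total if total else 0.0


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    coeffs = []
    for node in adj:
        c, t = _triplets(adj, node)
        coeffs.append(c / t if t else 0.0)
    return sum(coeffs) / len(coeffs) if coeffs else 0.0


# Hypothetical toy network: a triangle a-b-c, a pendant node d, and a pair e-f.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("e", "f")]
adj = build_adjacency(edges)
```

Transitivity and the average clustering coefficient can diverge, as they do on this toy network, because the former weights every triplet equally while the latter weights every node equally.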
The present study chose 10,088 articles published in fifteen journals of Library and Information Science (LIS) as the research subjects. The network-level examination showed that removing negative citations does not significantly affect the DC citation network. For the BC and CC citation networks, weighting strength by semantic meaning yields different networks, especially core networks. Comparing the clustering results of the DC core networks indicated that the results of the frequency, Wordnet, and BERT models were highly similar; only the classical model showed a different pattern. For the BC core networks, no noticeable differences existed between the results of the models except for the lexical model. Examining the clustering results of the CC core networks revealed evident divergence. Textual coherence and subject analysis support that the clustering results of the CC core network based on the Wordnet and BERT models have higher textual coherence. The subjects identified from the clustering results of these two models better reflected the development of LIS in this period. The examination at the node/relationship level revealed that the DC count is likely higher if the source article has been cited positively. The tendency is more evident when multiple semantic models agree or when time effects are considered. However, applying semantic models to weight BC and CC did not improve their results. Overall, the effect of the semantic models proposed in this study varies by the type of citation relationship and the level at which researchers analyze the result. At the network level, removing negative citations has only a slight effect. This suggests that the current semantic tools may have difficulty identifying negative citations, or that the effects of negative citations are not as critical as previous studies argued. For BC and CC, however, applying semantic models significantly affects the networks.
The examination at the cluster level indicates that applying semantic models to CC improves textual coherence and better reflects the evolution of the domain. Yet no similar effect is found when using semantic models for DC and BC. At the node/relationship level, classifying citations by their sentiment polarity helps identify the influence of the cited works; adjusting BC and CC based on semantic similarity, however, may not improve the result.
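For readers unfamiliar with the Adjusted Rand Index used to compare clustering results, the following minimal sketch computes it from two label assignments over the same documents, using the standard pair-counting formula. The function name and the example labels are hypothetical; the thesis does not specify its implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting Adjusted Rand Index between two clusterings."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    # Pairs placed together in both clusterings (contingency-table cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each clustering separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    total_pairs = comb(n, 2)
    expected = sum_a * sum_b / total_pairs if total_pairs else 0.0
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case: both partitions are trivial and therefore identical.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical cluster labels for four documents under three models.
model_1 = [0, 0, 1, 1]
model_2 = [1, 1, 0, 0]  # same grouping as model_1, different label names
model_3 = [0, 1, 0, 1]  # a grouping that cuts across model_1
```

A value of 1 means the two clusterings group the documents identically regardless of label names, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance.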

