Constructing a Chinese Abbreviation and Synonym Database Based on Wikipedia

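Although abbreviation recognition has been studied extensively, prior work has not covered generalized abbreviations. In addition, the continual growth and change of vocabulary has become the greatest problem for information retrieval and lexicon maintenance. Unlike earlier statistical approaches, this study builds on the structural composition of Wikipedia article text and proposes several novel, lightweight methods for identifying synonym pairs. Because synonyms have no absolutely objective gold standard to check against, we validated the proposed methods with a two-stage evaluation combining subjective and objective measures. The experimental results show that the proposed methods not only effectively extract abbreviations, variant-form synonyms, and homographs, but also identify generalized abbreviations, which previous studies could not handle. In the first-stage evaluation, precision averaged 72% and recall 82%; precision for abbreviations reached 92%, and recall for generalized abbreviations was 90%. In the second-stage evaluation, user acceptance reached 91%. In terms of efficiency, finding one set of synonyms took only 0.01 seconds on average.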

Keywords

Synonyms; abbreviations; generalized abbreviations; Wikipedia; homographs

Parallel Abstract

Purpose-A synonym is a word, of any part of speech, that has the same or a similar meaning as another word; broadly speaking, abbreviations fall within this scope. Authors conventionally use numerous synonyms to lend their writing literary quality. Because synonyms are interchangeable and new usages grow rapidly, they complicate Natural Language Processing (NLP) and vocabulary maintenance. Unlike traditional approaches, which rely on statistical methods to determine synonyms and are prone to erroneous results, this study aims to construct a comprehensive synonym database using lightweight methods that also take database updating into account. Design/methodology/approach-The study proposes a research framework based on an analysis of the contextual structure of Wikipedia. Because no recognized gold-standard corpus exists for assessing synonyms, we adopted a two-stage evaluation combining subjective and objective measures. By virtue of continuous user involvement and suggestions, the constructed synonym database is updated synchronously. Findings-The proposed methods not only correctly identify abbreviations, synonyms, and homographs but also successfully extract generalized terms together with their multiple sub-terms, which had not been achieved before. This finding suggests that the comma algorithm could be deployed more widely in other customized applications. The precision and recall of the first-stage evaluation are 72% and 82%, respectively. The user acceptance rate in the second-stage evaluation reached 91%, which is very promising. In terms of efficiency, extracting one set of synonyms took only 0.01 seconds. Research limitations/implications-This study focused mainly on formal descriptions extracted from Wikipedia. Future research may consider applying the methods to confusable word sets or social media to fill this gap.
Practical implications-This paper contributes to research on automatic synonym construction in several ways and has a number of practical implications. First, it demonstrates that a statistics-free, lightweight method can effectively produce comprehensive synonym coverage. Second, the method can work with search engines to support big data analysis. Third, the study shows that synonym construction can be framed in terms of an ontology architecture, which helps ensure the sustainability of knowledge and the growth of users' literacy competencies. Originality/value-Although synonyms have been studied extensively, no prior work has proposed a way to identify a generalized term together with its multiple sub-terms. This study is the first to address this problem. In addition, words are labeled with their named-entity types, such as names of people, places, and organizations. Search results are displayed according to the ontology architecture, so that word associations can be clearly visualized.