Hierarchical Automatic Labeling for Clusters of Related Documents

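For researchers, understanding an academic field begins with reading the prior literature to grasp the field's background and context. Databases and search engines now let us store and search large volumes of data quickly in an age of information explosion, so finding literature online is no longer difficult. However, a researcher can locate useful information precisely within this volume of data only with an overview of the field good enough to describe the needed information in detail before querying a database. Otherwise, the researcher must manually read and sift the retrieved results to determine which of them are actually wanted.

In recent years, a growing number of scholars have studied how to identify academic domains, so that researchers can understand how a field has developed and mine the information they need from a large body of knowledge. This study proposes a procedure that automatically extracts labels from a dataset with a two-level hierarchical structure, for the purpose of analyzing the topics of an academic field. Using information retrieval techniques, the procedure computes term weights and reduces term dimensionality based on how often a term occurs in a document and how many documents in the dataset contain it, removing unimportant terms from the documents. It adjusts the weights according to a term's relative importance in titles and in body text, and combines them with semantic similarity between terms to extract the important keywords of a document cluster. It then applies measures of association between terms, such as a keyword network and modified mutual information, to extract the key terms and key phrases that represent the cluster's topic. Key terms are single words; key phrases consist of two or three words. Next, following a document-cluster labeling method, the weights of key terms and two-word key phrases are adjusted to extract labels that adequately represent the cluster's topic, and three-word key phrases are extracted as additional labels to compensate for what single-word and two-word labels fail to express. Finally, from the labels of all document clusters, the procedure identifies the topics that represent the first-level datasets in the hierarchy.

This study uses the topic descriptions of the ACM-defined classification scheme for academic papers as the basis for collecting the paper datasets used for validation, and also as the criterion for evaluating the experimental results. A comparison with the results of systems implemented by previous researchers shows that the proposed method finds key phrases or key terms matching the ACM topic descriptions more efficiently. Finally, the system was used to analyze the topics of a dataset of papers from the Information Visualization field. The results show that the labels found by the system help researchers quickly identify the subtopics discussed in Information Visualization and what they mean, and further infer the relationships among topics in order to understand the field's background and development, which reduces the burden of searching the literature and entering the field.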
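The term-weighting and term-association steps described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the thesis's formulas, so standard TF-IDF (with a smoothed IDF) and plain pointwise mutual information stand in for the study's weighting scheme and its "modified mutual information". The function names and the `title_boost` and `min_df` parameters are hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_weights(docs, title_boost=2.0, min_df=2):
    """Weight the terms of one document cluster.

    docs is a list of (title_tokens, body_tokens) pairs.
    title_boost and min_df are assumed values, not taken from the thesis.
    """
    n_docs = len(docs)
    tf = defaultdict(float)  # boosted term frequency over the cluster
    df = Counter()           # number of documents containing each term
    for title, body in docs:
        for term in body:
            tf[term] += 1.0
        for term in title:
            tf[term] += title_boost  # title occurrences count more
        df.update(set(title) | set(body))

    weights = {}
    for term, freq in tf.items():
        if df[term] < min_df:
            continue  # dimension reduction: drop terms found in too few documents
        # Smoothed IDF (an assumption) keeps the weight positive for common terms.
        weights[term] = freq * math.log(1 + n_docs / df[term])
    return weights


def bigram_pmi(docs):
    """Score adjacent word pairs in the body text by pointwise mutual information."""
    unigrams = Counter()
    bigrams = Counter()
    for _, body in docs:
        unigrams.update(body)
        bigrams.update(zip(body, body[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / n_bi
        p_first = unigrams[first] / n_uni
        p_second = unigrams[second] / n_uni
        scores[(first, second)] = math.log(p_pair / (p_first * p_second))
    return scores
```

In a pipeline like the one the abstract describes, the highest-weighted single terms and the highest-scoring two-word phrases would become candidate labels for a cluster; the semantic-similarity and label-adjustment steps are not covered by this sketch.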

Keywords

Information retrieval; semantic similarity; term association; academic domain identification; automatic labeling

Parallel Abstract

Thanks to advances in computer technology and the growth of the Internet, it is not difficult to retrieve a huge amount of information from search engines and databases on the World Wide Web (WWW). However, finding relevant information on the WWW remains a great challenge. Extracting concepts from a collection of related literature is a useful technique with many potential applications. A concept consists of a list of labels, where labels are defined as representative key terms or key phrases in the documents.

This study developed novel label extraction procedures that combine techniques from research on information retrieval, semantic similarity analysis, co-occurrence correlation calculation, and automatic labeling. The procedures were applied to two hierarchical datasets collected from the ACM Digital Library and the CiteSeer citation database to gauge their effectiveness.

The procedures are capable of extracting descriptive labels from a document cluster derived through bibliometric methods. The experimental results showed that the descriptive labels derived by our procedures agree with the ACM classification scheme. An experiment using a literature corpus from the Information Visualization knowledge domain was performed to show that the descriptive labels extracted by our method clearly describe the content of each document cluster.

Parallel Keywords

Information Retrieval; Semantic Similarity; Correlation of Co-occurrence; Science Mapping; Automatic Labeling
