A Study of a Chinese Keyword Retrieval System for Antiquities of the National Palace Museum

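The National Palace Museum in Taiwan preserves antiquities handed down from the past for the public to study and appreciate. With advances in technology, many people now search for antiquities through the museum's artifact collection retrieval system. In practice, a search by item name returns only those artifacts in the database whose names match the query exactly; if the query does not fully match the item-name format, the system fails to find the artifact. This study analyzes data from the artifact collection system, covering bronzes, jades, and porcelain, and applies three methods to improve the current retrieval system. First, a lexicon is built from the collection retrieval system's data; artifact keywords are extracted with an indexing method, and results are ranked by their similarity scores, which indicate how closely a keyword relates to an artifact. Second, a neural network predicts the next keyword likely to appear, and the top-ranked predicted terms are shown in the retrieval interface for the user to choose from. Finally, Latent Dirichlet Allocation clustering identifies the keywords generated under each topic, and the search results suggest to the user the top few keywords for the corresponding vessel form and function.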

Keywords

Chinese text mining; retrieval system; cosine similarity; neural network; Latent Dirichlet Allocation

Parallel Abstract

The National Palace Museum preserves an ancient legacy that the general public can study and appreciate. As technology has developed, many people use the museum's artifact search system to look for antiquities. When searching by name, the system returns only artifacts whose names in the database exactly match the query keywords. If the search keywords do not exactly match a name in the database, the search fails. This study examines and analyzes data from the National Palace Museum's artifact collection system, in which the antiquities fall into three categories: bronze, jade, and porcelain. We apply three methods to improve the current retrieval system. First, we build a dictionary from the collection data and extract keywords through an index; the cosine similarity between the keywords and each artifact is then used to rank the search results, and its value indicates how closely the keywords relate to the artifact. Second, we use a neural network to predict the next keyword, and the search interface displays the top few predicted words for the user to choose from. Lastly, we use Latent Dirichlet Allocation to obtain the keywords for each topic, and the search results recommend to the user the top few keywords for the corresponding vessel form and function.

Parallel Keywords

Text Mining; Retrieval System; Cosine Similarity; Neural Network; Latent Dirichlet Allocation
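The similarity-ranking step described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the abstract does not specify how keywords are tokenized or weighted, so this example assumes overlapping character bigrams as a stand-in for the lexicon-based keywords, and the catalog names, `char_ngrams`, and `rank_by_cosine` are hypothetical. It shows how a partial query such as "龍紋" can still retrieve and rank artifacts that an exact-name match would miss.

```python
import math
from collections import Counter

# Hypothetical artifact names, made up for illustration only.
CATALOG = ["青花龍紋瓶", "白玉龍紋璧", "青銅鼎", "青花纏枝蓮紋碗"]


def char_ngrams(text, n=2):
    """Split a string into overlapping character n-grams.

    Assumption: bigrams stand in for the dictionary keywords the study extracts.
    """
    text = text.strip()
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, names, n=2):
    """Return (name, score) pairs with nonzero similarity, highest score first."""
    query_vec = Counter(char_ngrams(query, n))
    scored = []
    for name in names:
        score = cosine_similarity(query_vec, Counter(char_ngrams(name, n)))
        if score > 0:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in rank_by_cosine("龍紋", CATALOG):
        print(f"{score:.3f}  {name}")
```

Character bigrams keep the sketch free of any segmentation library; a dictionary-based segmenter or TF-IDF weighting could replace `char_ngrams` without changing the ranking logic.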

