Applying Clustering Techniques to the Filtering and Optimization of Synonymous Bibliographic Records

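The machine-readable cataloging format (MARC21, hereafter "the bibliographic format") is the standard specification that libraries worldwide use to build bibliographic databases, and it also serves to record and describe the content of books and documents. Most library automation systems therefore adopt it as their storage standard and as the basis for document retrieval and bibliographic data exchange. Because national circumstances differ, the National Central Library established the Chinese MARC Format (CMARC) in 1982 (ROC year 71) as the standard for bibliographic development in our country. New books are published continuously and in large numbers, so most bibliographic records are exchanged through interlibrary cooperation. Cataloging is done by hand, however, and input errors and differing interpretations of the cataloging standard among catalogers inevitably introduce mistakes. The same book ends up with several differing bibliographic records, which muddles the bibliographic data and sharply reduces its reference value. Given the volume of records and the peculiarities of the format, using information technology to help tidy bibliographic data is a considerable challenge. This paper therefore proposes weighting the fields of the bibliographic format according to their importance and converting each record into a vector, then performing dynamic data clustering in the vector space, so that records in the same cluster represent similar bibliographic records. Similarity is then computed among the records within each cluster, and records whose similarity exceeds a set threshold are selected as probable duplicate (synonymous) records of the same book. Finally, a score is computed for each record, the poorer records are filtered out, and the best record is retained. Experimental results show that the proposed method, which applies clustering, selects key discriminating fields according to the characteristics of bibliographic data, and weights those fields more heavily as the basis for comparison, filters and optimizes synonymous records more accurately than earlier rule-based filtering and substantially shortens comparison time. It offers a new direction for cleaning up duplicate records, and we believe that with further fine-tuning it could be put to practical use in libraries.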
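The weighting, vectorization, and threshold comparison described above might be sketched as follows. This is a minimal illustration, not the authors' implementation: the field names, the values in `FIELD_WEIGHTS`, the whitespace tokenizer, and the 0.8 threshold are all assumptions, since the abstract states none of them. Chinese text would need a word segmenter in place of `split()`, and the clustering step that restricts which pairs are compared is omitted here.

```python
import math
from collections import Counter

# Hypothetical field weights: the abstract says important fields are weighted
# more heavily but gives no values, so these numbers are placeholders.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "publisher": 1.0}


def record_to_vector(record):
    """Turn a {field: text} record into a sparse weighted term vector."""
    vector = Counter()
    for field, text in record.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)  # unlisted fields get weight 1
        for term in text.lower().split():
            vector[term] += weight
    return vector


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(rec_a, rec_b, threshold=0.8):
    """Flag two records as probable duplicates (threshold is an assumed value)."""
    similarity = cosine_similarity(record_to_vector(rec_a), record_to_vector(rec_b))
    return similarity >= threshold
```

With these placeholder weights, two records that agree on title and author but differ slightly in the publisher field still score close to 1, which is the behavior the field weighting is meant to produce.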

Keywords

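Synonymous bibliographic record filtering; duplicate bibliographic records; Chinese MARC format (CMARC); MARC format; dynamic data clustering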

Parallel Abstract

MARC21 (MAchine-Readable Cataloging for the 21st century) is the standard specification for bibliographic databases in libraries worldwide and was developed to describe the content of books. Most library automation systems therefore adopt this format as their storage standard and as the basis for bibliographic retrieval and the exchange of bibliographic records. The bibliographic format used in our country is CMARC (Chinese MARC), which is suited to Chinese. Because so many books are published, most bibliographic records are exchanged through interlibrary cooperation. Records are compiled manually, however, so errors and inconsistencies are inevitable, and the same book ends up with multiple differing bibliographic records. The resulting confusion greatly reduces the reference value of the bibliographic data. Because a single book may have many records, using information technology to help reconcile bibliographic data is a major challenge. This research therefore presents the following approach to identifying duplicate bibliographic records. First, Feature Selection: the words in the important fields of CMARC are chosen as features and weighted according to the importance of their fields. Second, Vector Construction: the feature weights are combined with tf-idf computation, and every book record is represented as one vector. Third, Dynamic Data Clustering: grouping is performed in the vector space, so that book records in the same cluster are similar. Fourth, Synonymous Book Records Filtering: the similarity between each pair of vectors in the same cluster is computed, and all vectors whose similarity is above the threshold are regarded as duplicate bibliographic records. Fifth, Book Records Optimization: a score is calculated for each duplicate bibliographic record, and the best one is retained as the standard bibliographic record for the book.
According to the experimental results, the presented methods are more accurate and faster than previous rule-based methods. We believe that, after detailed adjustment, they can be put to practical use in libraries.
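The second step, combining field weights with tf-idf, could look roughly like the sketch below. The abstract does not say how the two are integrated; multiplying a field-weighted term count by log(N/df) is one common choice and is an assumption here, as are the field names, the values in `FIELD_WEIGHTS`, and the whitespace tokenizer.

```python
import math
from collections import Counter

# Hypothetical field weights; the abstract does not state the actual values.
FIELD_WEIGHTS = {"title": 3.0, "author": 2.0, "publisher": 1.0}


def weighted_term_counts(record):
    """Sum the field weight over every occurrence of each term in one record."""
    counts = Counter()
    for field, text in record.items():
        for term in text.lower().split():
            counts[term] += FIELD_WEIGHTS.get(field, 1.0)
    return counts


def tfidf_vectors(records):
    """Return one sparse {term: weight} vector per record.

    weight = (field-weighted term count) * log(N / document frequency),
    which is an assumed way of combining the two signals.
    """
    counts = [weighted_term_counts(record) for record in records]
    # Each term is counted once per record that contains it.
    doc_freq = Counter(term for c in counts for term in c)
    n = len(records)
    return [
        {term: tf * math.log(n / doc_freq[term]) for term, tf in c.items()}
        for c in counts
    ]
```

Under this scheme a term that occurs in every record receives weight zero, so very common words contribute nothing to the similarity between records, while a rare title word counts more than an equally rare publisher word.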

Parallel Keywords

Synonymous Book Records Filtering; CMARC (Chinese MARC); Dynamic Data Clustering; MARC21 (MAchine-Readable Cataloging for the 21st century); Duplicate Bibliographic Records
