Classifying XML Documents Using Tree Structure and New-Term Detection

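The eXtensible Markup Language (XML) specification was developed by the World Wide Web Consortium (W3C) and became a W3C Recommendation in February 1998. XML has gradually become the new standard for exchanging information among different systems and databases on the web, and because of its structured nature, classifying large volumes of XML documents has become an important problem. Existing approaches to XML document classification include the Naïve Bayes algorithm, template recognition and image-segmentation techniques, part-of-speech tagging and rule-based techniques, and TFIDF. Because previous studies have rarely analyzed the content of the documents themselves, ambiguous documents or derived, related documents may not be classified correctly. This study first uses the tree structure of a document to determine the importance level of each term, then applies the TFIDF method to obtain feature terms, so that a document can be classified by matching it against the feature terms of each category. During classification, important new terms in the document are also taken into account to improve accuracy. So that the classifier is not confined to its existing feature terms, this study also proposes a mechanism for adding important feature terms, allowing the classifier to adapt to documents with a wide range of content. Finally, the method is compared with an XML document classification method that also uses hierarchical characteristics, and the results show that the proposed method significantly improves classification accuracy.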
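The pipeline described above (tree-based term importance, TFIDF feature terms, matching against category features, and absorbing important new terms) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the 1 / (depth + 1) decay, the `top_k` cutoff, the `new_term_threshold` value, and all function names are hypothetical choices made only for illustration.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict


def weighted_terms(xml_text):
    """Map each term to a tree-based weight: text at depth d adds 1 / (d + 1).

    The 1 / (d + 1) decay is an assumption made for illustration only.
    Tail text (text following a child element) is ignored for brevity.
    """
    weights = Counter()

    def walk(node, depth):
        for token in re.findall(r"[a-z]+", (node.text or "").lower()):
            weights[token] += 1.0 / (depth + 1)
        for child in node:
            walk(child, depth + 1)

    walk(ET.fromstring(xml_text), 0)
    return weights


def train(labeled_docs, top_k=5):
    """Build per-category feature terms from (category, xml_text) pairs."""
    parsed = [(cat, weighted_terms(text)) for cat, text in labeled_docs]
    doc_freq = Counter()
    for _, terms in parsed:
        doc_freq.update(terms.keys())
    totals = defaultdict(Counter)
    for cat, terms in parsed:
        for term, weight in terms.items():
            # Tree-weighted term frequency times inverse document frequency.
            totals[cat][term] += weight * math.log(len(parsed) / doc_freq[term])
    profiles = {cat: dict(c.most_common(top_k)) for cat, c in totals.items()}
    return profiles, set(doc_freq)


def classify(profiles, vocabulary, xml_text, new_term_threshold=0.5):
    """Return (best category, important new terms) and extend the profile."""
    terms = weighted_terms(xml_text)
    scores = {
        cat: sum(terms[t] * w for t, w in feats.items())
        for cat, feats in profiles.items()
    }
    best = max(scores, key=scores.get)
    # A "new term" here is one never seen in training; it counts as important
    # when its tree weight reaches the (hypothetical) threshold.
    new_terms = sorted(
        t for t, w in terms.items()
        if t not in vocabulary and w >= new_term_threshold
    )
    for term in new_terms:
        profiles[best][term] = terms[term]
        vocabulary.add(term)
    return best, new_terms


# Toy training data, invented for this example.
training = [
    ("sports", "<article><title>football match</title><body><p>the team won the football game</p></body></article>"),
    ("sports", "<article><title>tennis final</title><body><p>the player won the match</p></body></article>"),
    ("finance", "<article><title>stock market</title><body><p>the bank raised the interest rate</p></body></article>"),
    ("finance", "<article><title>bank report</title><body><p>the stock price fell</p></body></article>"),
]
profiles, vocabulary = train(training)
label, new_terms = classify(
    profiles,
    vocabulary,
    "<article><title>football league</title><body><p>the team signed the striker</p></body></article>",
)
print(label, new_terms)
```

In this sketch a term in a title element outweighs the same term in a nested paragraph because it sits closer to the root. An unseen term is added to the predicted category's feature terms only when it appears at a sufficiently shallow level, which is one simple way to keep a classifier from being confined to its initial feature set.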

Keywords

Extensible Markup Language; tree structure; document classification; association information extraction; new term

Parallel Abstract

The Extensible Markup Language (XML), devised by the W3C, is a universally accepted and recommended specification. XML has gradually become a standard format for information interchange among different systems and databases on the web. Moreover, because XML documents are structured, classifying the large and growing number of XML documents is especially important in the field of knowledge management. Various approaches to XML classification have been proposed, such as Naïve Bayes, template recognition, image processing, part-of-speech tagging, and TFIDF. However, these approaches rarely analyze the contents of documents and thus sometimes produce incorrect classifications. In this paper, we employ a tree-like structure to obtain the importance of each term and use the TFIDF calculation to obtain the feature terms of documents. Classification can then be performed by identifying these feature terms in documents. New terms in documents are also taken into account to improve classification accuracy. Finally, the proposed approach was compared with other similar approaches, and the results showed that it can significantly improve classification accuracy.

Parallel Keywords

XML; tree-like structure; classification; association extraction; new-term

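Cited By

黃聖翔 (2011). Classification evaluation of TFIDF and the entropy method on support vector machines: A case study of statistics exam questions [Master's thesis, National Taipei University of Technology]. Airiti Library. https://doi.org/10.6841/NTUT.2011.00706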
