Progressive Web Document Classification Technique

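In this thesis, we propose a progressive web document classification technique (PAS for short). With this technique, the classifier only needs to analyze a few key blocks of a document to confirm its category, which improves the efficiency of web page classification. In general, a web document can be segmented into many small tag-regions according to its DOM structure. Each tag-region is usually rendered in the browser window in a specific visual type, which is formed by the paired HTML tags attached to that tag-region. We observe that, owing to web authoring conventions, the profitability of a tag-region's content for classification tends to differ with its visual type. In addition, tag-regions of the same visual type within a document may also differ in profitability because of document writing techniques. In this thesis, by analyzing a large number of web documents with the help of pattern recognition techniques such as EM and HMM, we identify the profitability characteristics of each visual type, namely its profitability tendency and its profitability transition pattern. We integrate these two characteristics into a profitability forecasting mechanism for tag-regions. During classification, this mechanism dynamically forecasts the profitability of each tag-region not yet analyzed, and the most profitable tag-regions are progressively extracted for classification until the category of the web page is confirmed. To reduce the chance of wrong forecasts, the mechanism tunes itself according to the actual profitability of the tag-regions already analyzed. Moreover, when forecasting the profitability of a rare visual type, the mechanism also consults the profitability characteristics of similar visual types in order to obtain a more accurate forecast. Through experiments, we show how parameter settings affect classifier performance and verify the superiority of the proposed web classification technique.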
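To make the notion of tag-regions and visual types concrete, the following is a minimal sketch, not the thesis's own segmentation procedure. It assumes that a tag-region is a run of text between two tags and that its visual type can be approximated by the tuple of currently open HTML tags. The names `TagRegionParser`, `segment`, and `VOID_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: tags without a closing pair do not contribute to a visual type.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TagRegionParser(HTMLParser):
    """Collect (visual_type, text) pairs from an HTML string.

    visual_type is the tuple of open tags around the text, outermost first.
    This is an illustrative approximation of "nested tag-pairs".
    """

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags
        self.regions = []  # collected (visual_type, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.regions.append((tuple(self.stack), text))


def segment(html):
    """Return the list of tag-regions found in the given HTML string."""
    parser = TagRegionParser()
    parser.feed(html)
    parser.close()
    return parser.regions
```

For example, `segment("<body><h1>Title</h1><p>Some <b>bold</b> text</p></body>")` yields one region for the heading and three for the paragraph, with the bold run carrying the visual type `("body", "p", "b")`.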

Keywords

Web Mining; Web Page Classification; Progressive Analysis

Parallel Abstract

In this thesis, we propose a web document classification scheme, called the Progressive Analysis Scheme (PAS), which improves classification efficiency by analyzing only the few key parts of a document that suffice to confirm its category. Based on the DOM tag-tree structure, a web document can be segmented into small tag-regions. Each tag-region is rendered in a visual type, which corresponds to a specific nested combination of tag-pairs. We observe that, because of web authoring conventions, the profitability of tag-regions for classification varies across visual types. In addition, within a document, the profitability of tag-regions of the same visual type may also vary because of document writing techniques. For each visual type, we model these two kinds of variation as a profitability tendency and a tendency transition pattern, using Expectation Maximization (EM) and Hidden Markov Models (HMM), and integrate them into a profitability forecasting strategy. During classification, the forecasting strategies predict the potential profitability of unanalyzed tag-regions, and the most profitable unanalyzed tag-regions are extracted one after another until the category is confirmed. The strategies are optimized dynamically for the document at hand by feeding back the actual profitabilities of the tag-regions already analyzed, so that the profitability of the next tag-regions can be forecast more accurately. In addition, when the model of a visual type is unreliable because it was trained on a sparse set of samples, its forecasts are supported by the strategies of similar visual types. Simulation results show that PAS achieves better classification performance than previous approaches, such as full-text classifiers (e.g., SVM) and sequential classifiers.
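The progressive loop described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the thesis's implementation: the EM/HMM forecasting models are replaced by a plain per-visual-type score updated with exponential smoothing, the classifier is a toy keyword-weight scorer, and category confirmation is a simple score margin. All names and parameters (`progressive_classify`, `prior`, `keyword_weights`, `margin`, `alpha`) are hypothetical.

```python
def progressive_classify(regions, prior, keyword_weights, margin=2.0, alpha=0.5):
    """Analyze tag-regions in order of forecast profitability until confirmed.

    regions:         list of (visual_type, tokens) pairs
    prior:           visual_type -> initial forecast profitability (assumed given)
    keyword_weights: category -> {token: weight}, a stand-in for a real classifier
    margin:          lead over the runner-up that counts as confirmation
    alpha:           smoothing factor for the feedback step
    """
    forecast = dict(prior)  # per-document copy, adapted by feedback
    scores = {category: 0.0 for category in keyword_weights}
    pending = list(range(len(regions)))
    analyzed = []

    while pending:
        # Pick the unanalyzed region whose visual type has the highest forecast.
        best = max(pending, key=lambda i: forecast.get(regions[i][0], 0.0))
        pending.remove(best)
        visual_type, tokens = regions[best]

        # Classify the region; its actual profitability is taken to be the
        # largest score contribution it makes to any category.
        gain = 0.0
        for category, weights in keyword_weights.items():
            delta = sum(weights.get(token, 0.0) for token in tokens)
            scores[category] += delta
            gain = max(gain, delta)
        analyzed.append(best)

        # Feedback: move the forecast for this visual type toward what was seen.
        old = forecast.get(visual_type, 0.0)
        forecast[visual_type] = (1 - alpha) * old + alpha * gain

        # Stop once the leading category is far enough ahead.
        ranked = sorted(scores.values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        if ranked[0] - runner_up >= margin:
            break

    return max(scores, key=scores.get), analyzed
```

The second return value lists the indices of the regions that were actually analyzed, which makes it easy to check how much of a document was skipped before the category was confirmed.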

Parallel Keywords

Web Mining; Web Document Classification; Progressive Analysis
