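Web usage mining (WUM) applies data mining techniques to extract knowledge from Web site logs, which can then be used to improve Web site design, predict user behavior, or personalize Web sites. WUM can be divided into three main stages: data preprocessing, pattern discovery, and pattern analysis. Of these, data preprocessing takes up more than 60% of the whole process and is the most time-consuming stage. Cooley et al. further divide data preprocessing into four steps plus one additional step: data cleaning, user/session identification, path completion, and page view identification, with transaction identification as the additional step. Until now, data preprocessing for Web usage mining has required external domain knowledge, such as the Web structure and a classification of Web content, and this requirement greatly affects how Web usage mining can be applied. Analysts must spend a great deal of time becoming familiar with the site architecture and page content, while Web administrators must first consider the confidentiality of site data before providing a detailed Web structure to analysts. We believe a platform should be built into the Web usage mining process to help analysts and Web administrators communicate better. This thesis proposes a mechanism that constructs the Web structure and identifies page relationships from the information implicit in Web site logs. The experimental results show that the precision of Web structure reconstruction and page relationship discovery exceeds 90%. The method can be easily embedded in current preprocessing steps and is a practical, feasible alternative.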
Web usage mining (WUM) applies data mining techniques to extract knowledge from Web server logs. The mining results can be used to improve Web site design, predict user behavior, and personalize Web sites. Web usage mining has three major stages: data preprocessing, pattern discovery, and pattern analysis. Data preprocessing, which normally accounts for more than 60% of the whole mining process, is the most time-consuming stage. Cooley et al. divided data preprocessing into four steps plus one optional step: data cleaning, user/session identification, path completion, and page view identification, with transaction identification as the optional step. Until now, preprocessing for Web usage mining has required external domain knowledge, such as the Web structure and a classification of Web content, and this requirement greatly limits the application of Web usage mining. Analysts must spend considerable time becoming familiar with the Web structure and content, and Web administrators may have concerns about the confidentiality of Web data when giving the detailed Web structure to an analyst. We therefore want to create a platform between analysts and Web administrators that helps them communicate better during the Web usage mining process. In this thesis, we propose a framework that reconstructs the Web structure and discovers page relationships from the implicit information in Web server logs. The experimental results show that Web site reconstruction and page relationship discovery achieve a precision of more than 90%. The method can be easily embedded in the existing preprocessing stage and is a workable, practical alternative.
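To illustrate the idea of recovering Web structure from a log alone, the following is a minimal sketch and not the framework proposed in the thesis. The abstract does not say which log fields carry the implicit information, so the sketch assumes that the log uses the Combined Log Format and that the Referer field of each request reveals a hyperlink from the referring page to the requested page. The names `reconstruct_links`, `LOG_PATTERN`, `IGNORED_SUFFIXES`, and `SAMPLE_LOG`, the host `www.example.com`, and the sample entries are all hypothetical.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed input: Combined Log Format, which records a Referer field.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Simplified data cleaning: embedded resources are not page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")


def reconstruct_links(log_lines, site_host):
    """Return {source_page: set_of_target_pages} inferred from referrers."""
    links = defaultdict(set)
    for line in log_lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        path = match["path"].split("?", 1)[0]  # drop any query string
        if match["status"] != "200" or path.lower().endswith(IGNORED_SUFFIXES):
            continue  # keep only successful page requests
        referrer = urlparse(match["referrer"])
        if referrer.netloc != site_host:
            continue  # no referrer, or the visitor arrived from another site
        source = referrer.path or "/"
        if source != path:
            links[source].add(path)  # assumed hyperlink: source -> path
    return dict(links)


# Hypothetical log entries, invented for illustration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2008:13:55:36 +0800] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2008:13:55:40 +0800] "GET /logo.gif HTTP/1.1" 200 512 "http://www.example.com/index.html" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2008:13:56:02 +0800] "GET /products.html HTTP/1.1" 200 4100 "http://www.example.com/index.html" "Mozilla/5.0"',
    '192.0.2.7 - - [10/Oct/2008:14:01:11 +0800] "GET /contact.html HTTP/1.1" 200 1800 "http://www.example.com/index.html" "Mozilla/5.0"',
    '192.0.2.7 - - [10/Oct/2008:14:02:30 +0800] "GET /products/widget.html HTTP/1.1" 200 3050 "http://www.example.com/products.html" "Mozilla/5.0"',
    '192.0.2.7 - - [10/Oct/2008:14:03:00 +0800] "GET /missing.html HTTP/1.1" 404 300 "http://www.example.com/products.html" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for source, targets in sorted(reconstruct_links(SAMPLE_LOG, "www.example.com").items()):
        print(source, "->", sorted(targets))
```

A structure recovered this way contains only links that some visitor actually followed, so it may be incomplete for rarely visited pages, and referrer data can be missing or unreliable. The sketch makes no claim about how the thesis handles those cases.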