Data Capture, Management, and Analysis on the World Wide Web

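Today the World Wide Web is a popular interactive medium for disseminating information. The Internet has become a huge and largely unstructured data repository, and peer-to-peer (P2P) systems have become widely used file-sharing platforms. In this dissertation, we examine three techniques: capturing individual users' access patterns for Web data mining, the influence of users' clicking behavior and interests on Web structure mining, and search strategies for P2P systems. To capture individual users' access patterns, we design and implement an access pattern collection server to carry out Web data mining. Using the concept of page conversion, the proposed method overcomes in practice the difficulty that proxy servers create for collecting user behavior. The results confirm that the traversal patterns produced by our method not only contain more information than those produced by the original Web server but are also more accurate. In addition, to examine the contribution of page readers to Web structure mining, we discuss the influence of users' reading behavior in the VIPAS system. We design a new algorithm, called AC-VIPAS, which fine-tunes the ranking of Web pages according to the recommendations of users with similar interests. We set up experiments to evaluate the performance of content-based user clustering. The experimental results show that the proposed content-based user clustering algorithm is more accurate than the traditional count-based user clustering algorithm. Finally, to improve search efficiency in P2P systems, we propose a cluster-based P2P system called PeerCluster. In PeerCluster, every participating computer is assigned to an interest cluster, and all computers within an interest cluster share an interest in the same topic. To route and broadcast quickly among interest clusters, we implement the system with a hypercube network topology. We also augment PeerCluster with an automatic system recovery mechanism to withstand unpredictable computer failures and network outages.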

Keywords

Web data mining; distributed computing; peer-to-peer systems

Parallel Abstract

The World Wide Web is a popular, interactive medium for disseminating information. The Web has become a huge and mostly unstructured data repository, and peer-to-peer (P2P) systems have become popular file-sharing platforms in recent years. In this dissertation, we consider three issues: capturing individual users' access patterns for Web data mining, the influence of users' clicking behavior and interests on Web structure mining, and the search policy for P2P systems. To capture individual users' access patterns, we design and implement an access pattern collection server to conduct data mining on the Web. By using the concept of page conversion, the proposed method resolves the difficulty imposed by proxy servers and captures Web user behavior effectively. Using the devised mechanism, traversal patterns are generated and compared with those produced by ordinary Web servers to validate our results. In addition, to account for page readers' contribution to Web structure mining, we discuss the influence of users' interests in the VIPAS system. We devise a new algorithm, called Adjustable Cluster based VIPAS (AC-VIPAS), which adjusts Web pages' scores according to the recommendations of users with similar interests. An experiment is conducted to evaluate the performance of the content-based user clustering. Finally, to improve search performance in P2P systems, we propose a cluster-based P2P system called PeerCluster. In PeerCluster, all participating computers are grouped into interest clusters, each of which contains computers that share the same interests. To route and broadcast messages efficiently across and within interest clusters, a hypercube topology is employed. Moreover, we augment PeerCluster with a system recovery mechanism to make it robust against unpredictable computer and network failures.
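
The abstract does not spell out how PeerCluster routes or broadcasts, so the following is only a minimal sketch of the generic techniques a hypercube topology makes possible, not the dissertation's protocol. It assumes nodes are labeled with `dim`-bit integers and that two nodes are neighbors when their labels differ in exactly one bit. The function names `route` and `broadcast` are hypothetical.

```python
from typing import Dict, List


def route(src: int, dst: int, dim: int) -> List[int]:
    """Bit-fixing route: flip each differing bit, lowest dimension first.

    The path length equals the number of bits in which src and dst differ,
    so it never exceeds dim hops.
    """
    path = [src]
    node = src
    for bit in range(dim):
        mask = 1 << bit
        if (node ^ dst) & mask:
            node ^= mask          # move to the neighbor across this dimension
            path.append(node)
    return path


def broadcast(root: int, dim: int) -> Dict[int, int]:
    """Return {node: parent} for a spanning-tree broadcast from root.

    A node that received the message over dimension k forwards it only
    over dimensions greater than k, so every node is reached exactly once.
    """
    parent = {root: root}
    frontier = [(root, -1)]       # (node, dimension the message arrived on)
    while frontier:
        node, arrived = frontier.pop()
        for bit in range(arrived + 1, dim):
            child = node ^ (1 << bit)
            parent[child] = node
            frontier.append((child, bit))
    return parent


if __name__ == "__main__":
    print(route(0b000, 0b101, 3))       # hops from node 0 to node 5
    print(len(broadcast(0, 3)))         # number of nodes reached
```

In a 3-dimensional hypercube, the route from node 0 to node 5 passes through node 1, and the broadcast tree reaches all 8 nodes without sending any duplicate message. This is the property that makes the topology attractive for the routing and broadcasting the abstract mentions.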

Parallel Keywords

Web Data Mining; Distributed Computing; Peer-to-Peer System
