透過您的圖書館登入
IP:18.191.195.110
  • 學位論文

基於圖論之交易、地理與社群資料探勘技術

Graph-based Data Mining for Transactional,Spatial and Social-networking Data

指導教授 : 陳銘憲

摘要


資料探勘是一門與資料應用方式相關的技術領域,在過去十年此領域已受到相當的矚目。在此領域興起之初,各種技術發展主要以處理集合式資料以及序列式資料為主,應用範圍包含市場交易、電腦網路訊息、生物序列處理……等。然而,在更多實際的應用範圍裡,資料是與結構化資訊息息相關,顯示出研發新技術需求的急迫性,因此,基於圖論的資料探勘技術之發展顯得越來越重要。基於圖論之資料探勘技術其應用範圍相當廣泛,在本論文中,我們將依序各別探討其在三種複雜度之資料應用的實際問題。 首先,由於雲端計算的興起,那些缺乏資料探勘專業技術的人們如今也能靠著外包服務享受資料探勘所帶來的好處,然而,任何外包服務也將伴隨著資料隱私安全性的疑慮。在第二章,我們探討如何保護頻繁項集挖掘外包服務中的資料隱私安全。此議題主要有兩個挑戰:如何用合理的開銷保護資料隱私安全,以及如何對抗具有相關背景知識者之攻擊。為了解決這些挑戰,我們提出了k-support anonymity隱私保護概念與k-support anonymization演算法。此演算法藉由建立假的分類樹隱藏敏感物項,通過利用只有在葉級層的物項需要出現在交易資料中的特性,在保護資料隱私的同時亦能限制儲存開銷。 接下來,我們注意到由感應器收集到的資料包含地理位置屬性和訊息屬性,一般空間叢集因為只考慮地理屬性,只能找出資料點密集區,因此不適用於規劃訊息相似的空間區塊。所以,我們在第三章探討訊息數據的空間叢集問題。此議題的挑戰在於地理位置屬性和訊息屬性代表著不同的概念,在叢集的過程中不適合用同一種方式處理。為因應此挑戰,我們提出BiAgree演算法,此演算法藉由建立一圖形結構,將訊息屬性和地理位置屬性分別整合在圖形結構的節點與邊上。之後,只要將圖形結構分割成訊息一致的連接組件,就可以不受資料點密度的影響,規劃出訊息相似的空間區塊。此外,靠著維護此圖形結構,BiAgree演算法也具有即時計算能力,以獲得高質量或高效率的空間叢集結果。 最後,有鑑於社群資料及其在各業應用與服務的快速增長,社群資料中用戶者的隱私問題也逐漸受到關注。近幾年已有些研究被提出來解決用戶個人資料、用戶身分鑑定、用戶關係鑑定……等隱私議題,然而,相對於社群資料中富含的大量訊息,在被公開分享的社群資料中,用戶隱私議題尚未被充分解決。在第四章,我們探討一新的隱私議題──群體關係鑑定。一個人的群體關係資訊會自然隱含在社群資料之中,指示此人所參與的團體以及和他人的連繫關係,也可能代表著一些敏感的個人隱私,如在網路上的政治活動、疾病支持群組、朋友群……等等。為保護這種資訊,我們提出k-structural diversity隱私模型,並設計整數規劃方法尋找最加解決方案。此外,考量社群資料的龐大,我們也從不同的觀點,提出三個具擴展性的經驗演算法。

並列摘要


Data Mining is a data-and-application dependant technique, and has received significant attentions in the last decade. In the past years, various techniques have been developed to deal with set or sequence data in business marketing, computer networks, bioinformatics, to name a few. Many real applications, however, have called for the need of new techniques to tackle data with structural information, i.e., graphs. Graph-based data mining, which discovers novel knowledge in graph-represented data, is thus becoming more and more important. In this dissertation, motivated by the fact that graph-based data mining is still in its fancy compared to the wide applications, we attempt to address the use of graph-based data mining in realistic problems with three kinds of data complexity, respectively. First, due to the rise of cloud computing, people who lack of expertise in data mining and/or computational resources now can also take advantages from data mining by outsourcing their mining tasks. However, for any outsourcing service, privacy is a major concern. In Chapter 2, we study the problem of privacy protection in outsourcing frequent itemset mining. This problem has two challenges. One is on how to protect sensitive information, including the raw data and the frequent itemsets, with reasonable overhead and preserve the precise mining results. The other is how to protect against an attacker with related background knowledge such as item support information. To overcome these challenges, we propose k-support anonymity and develop a novel encryption approach that constructs a pseudo taxonomy tree to hide sensitive items. By leveraging the property that only the items at the leaf level of the taxonomy need to be appear at the transactions, the storage overhead is limited while the privacy protection is conformed. Second, note that data collected by sensors can consist of not only geographic attributes but also informative attributes. Since the spatial-alone clustering approaches consider only the geographic attributes to identify spatial clusters at data-dense regions, it is infeasible to obtain spatial clusters with informatively similar data points from such data by the spatial-alone clustering approaches. Therefore, we address the informative spatial data clustering (ISDC) problem in Chapter 3. One of the main challenges in this problem is that geographic and informative attributes represent different concepts and should not be tackled in the same way in clustering. To overcome this challenge, we proposed Algorithm BiAgree that introduces a graph structure, named NeiGraph, to integrate informative attributes and geographic attributes in vertices and edges, respectively. Afterward, Algorithm BiAgree is able to identify informatively similar regions regardless of the data density by partitioning NeiGraph into informative-consistent connected components. In addition, by maintaining NeiGraph, Algorithm BiAgree also provides the online computing capability to acquire the solutions with high quality and smaller computation time respectively. Finally, as the rapid growth in the number of services and applications leverage social network data, there is increasing concern about privacy issues in published social networks. Recently several studies have addressed the privacy issues on vertex/edge attributes, vertex identity, link disclosure, and so on. However, compared to the rich information inherent in graph data, the privacy issues in publications of social networks have not been fully solved. In Chapter 4, we address a new privacy issue, referred to as the community identification. The community identity of an individual is a kind of structural information that indicates the neighborhood or connections of the individual. The community identity could also represent the personal privacy information sensitive to the public, such as on-line political activity group, on-line disease support group information, or friend group in a social network. To protect such information, therefore, we propose a new privacy model, named k-structural diversity, and develop an Integer Programming formulation to find the optimal solutions to k-SDA. Moreover, we devise three scalable heuristics to solve the large instances of k-SDA with different perspectives.

參考文獻


[1] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of ICDE, 1995.
[2] R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proc. of ACM SIGMOD, 2000.
[3] C. D. Ahrens. Meteorology Today, 9th ed., Brooks-Cole, 2009.
[4] L. Backstrom, C. Dwork, and J. M. Kleinberg. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography. In Proc. of WWW, 2007.
[7] R. Buyya, C. S. Yeo, and S. Venugopal. Market-oriented cloud computing: Vision, hype, and reality for delivering it services as computing utilities, in. In Proc. of CSSE, pages 10–1016, 2008.

延伸閱讀