透過您的圖書館登入
IP:18.217.150.123
  • 學位論文

滿足差分隱私之多支持度閾值頻繁項集挖掘

Frequent Itemset Mining Under Differential Privacy with Multiple Minimum Supports

指導教授 : 郭斯彥

摘要


頻繁項集挖掘是一個被廣泛研究的資料探勘領域,它的目的是找到資料庫中,項目與項目間的關聯。但是,懷有惡意的使用者可能會在挖掘的過程中獲取敏感的私人資訊。而差分隱私是現今廣泛使用的隱私保護標準,透過添加精心設計的噪音,可提供強大的隱私保證,同時又可獲取具有統計意義的資訊。但是如何加入噪音又能讓資料保持可用性,是個困難的課題。 在現有研究中,多數頻繁項集挖掘的演算法都使用同一個閾值來判斷項集是否頻繁。但是,使用同個閾值來挖掘項集會產生「稀有項目問題」,且在一般使用情況下,為每個項目設置各自的支持度閾值,更能合理反映項目的本質。此外,現有方法的執行效率不佳也是一大挑戰。 在本文中,我們提出了一種基於FP-growth的演算法——DPCFP-growth++,在滿足差分隱私的情況下,解決頻繁項目集挖掘中的稀有項目問題。我們首先通過隨機截斷來降低每筆交易的敏感性,並計算各個項目的支持度與分配項目各自的閾值。最後,將資料庫中的每筆資料依序存入MIS樹,並從樹中推得出頻繁項集。我們在每個步驟中皆加入拉普拉斯噪音,並證明此演算法符合差分隱私之定義。在真實數據集和合成數據集上的實驗顯示,我們的演算法在噪音與資料可用性之間取得良好平衡,並且在執行時間方面遠勝過現有的演算法。

並列摘要


Frequent itemset mining is an extensively studied research domain of data mining. The aim is to find interesting correlations between items in a transaction database. However, malicious user might gain sensitive information in the mining process. Differential privacy is the de facto standard when it comes to protecting data. Combining differential privacy with frequent itemset mining can provide strong privacy guarantee while generating statistical information from sensitive data. In existing studies, most algorithms are focusing on finding frequent patterns using one predefined threshold with the protect of differential privacy. However, using single threshold to extract itemsets creates “rare item problem”, and setting respective support thresholds for each item is more adequate to reflect the nature of widely varied items. Also, the execution time of prior approaches are lengthy. In this thesis, we propose a novel FP-growth-like algorithm DPCFP-growth++ to solve the rare item problem of frequent itemset mining while guarantee differential privacy at the same time. We first limit the sensitivity by truncating the transactions, and assign each item with its own threshold. Finally, we construct a differentially private MIS-tree and derive frequent itemsets from the tree. We add Laplace noise in each step, and prove our algorithm satisfies differential privacy. The experiments on real-world and synthetic datasets illustrate that our algorithm achieves both high utility and efficiency.

參考文獻


[1] R. Agrawal, T. Imielinski, and A. Swami. “Mining association rules between sets of items in large databases,” in Proceedings of the 1993 ACM SIGMOD international conference on Management of data (SIGMOD ’93), pp. 207-216, 1993.
[2] B. Liu, W. Hsu, and Y. Ma. “Mining association rules with multiple minimum supports,” in Proceedings of the 5th ACM SIGKDD international conference on Knowledge discovery and data Mining(KDD ’99), 1999.
[3] Y. Hu, and Y. Chen. “Mining association rules with multiple minimum supports: a new mining algorithm and a support tuning mechanism,” in Decision Support System, vol. 42, no. 1, pp. 1-24, 2006.
[4] R. Uday Kiran, and P. Krishna Reddy. “Novel techniques to reduce search space in multiple minimum supportsbased frequent pattern mining algorithms,” in Proceedings of the 14th International Conference on Extending Database Technology (EDBT/ICDT ’11), pp. 11-20, 2011.
[5] W. Gan, J. C.W. Lin, P. Fournier-Viger, H. Chao, and J. Zhan. “Mining of frequent patterns with multiple minimum supports,” in Engineering Applications of Artificial Intelligence, vol. 60, pp. 83-96, 2017.

延伸閱讀