Applying Rough Set Theory to Facilitate Big Data Analysis with Imprecise Data

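With the arrival of the Big Data era, large volumes of data are continuously being produced and stored. The research firm International Data Corporation (IDC) reports that the global volume of data is growing at 50% per year and is expected to increase tenfold over the next six years. However, since the term "big data" emerged, the government has stored all data without filtering it, which lowers the usability of the data; how to improve data usability has therefore become one of the most closely watched issues. In addition, processing big data with non-parallel methods runs into limits on processing speed, memory, and storage space. This study therefore proposes integrating data mining techniques with distributed parallel computing in order to reduce and then analyze big data. The study applies the Data Reduction (DR) techniques of data mining, considering both Record Reduction and Value Reduction. The data mining techniques adopted are Fuzzy C-means (FCM) clustering and Rough Set Theory (RST). The procedure is: (1) cluster the data with FCM; (2) apply RST to derive more concise knowledge rules; and (3) use the approximation sets of RST to pick out the imprecise records from the original data set. Together, the two techniques achieve the goal of DR, make the remaining data clearer and more effective, and substantially improve data usability, which in turn benefits the analysis and application of big data on the Hadoop cloud parallel computing platform.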

Keywords

Big Data; Hadoop; Data Mining; Fuzzy Clustering; Rough Set Theory; Data Reduction; Record Reduction; Value Reduction

Parallel Abstract

Since the advent of the Big Data era, large amounts of data have been constantly produced and stored. The research firm International Data Corporation (IDC) has pointed out that the global amount of data is growing at an annual rate of 50% and is forecast to grow tenfold over the next six years. Because the government has saved all data without filtering since the term "big data" appeared, the usability of the data has become relatively low, and improving data usability is now one of the topics attracting the most attention. Furthermore, non-parallel processing of big data faces restrictions on processing speed, memory, and storage space. This study therefore integrates data mining with distributed parallel computing to simplify and analyze large volumes of data. It uses the Data Reduction (DR) techniques of data mining and considers both Record Reduction and Value Reduction. The data mining techniques used to achieve DR are Fuzzy C-means (FCM) and Rough Set Theory (RST): (1) the data are first clustered with FCM; (2) RST is then used to derive more condensed knowledge rules; and (3) the approximation-set concept of RST is used to single out the imprecise records in the original data set. Using these two techniques achieves the DR goal, makes the subsequent data clearer and more effective, and significantly increases data usability, which supports the analysis and application of large amounts of data on the Hadoop cloud parallel computing platform.
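
Step (3) above relies on the lower and upper approximations of Rough Set Theory. The following is a minimal sketch of that idea, not the implementation used in this study: records with identical condition-attribute values form one indiscernibility class, and a class whose members disagree on the decision attribute falls in the boundary region, which can be read as the imprecise portion of the data. The attribute names (`temp`, `load`, `fault`) and the sample records are hypothetical, and the FCM clustering and Hadoop stages are not shown.

```python
from collections import defaultdict


def indiscernibility_classes(records, condition_attrs):
    """Group record indices whose condition-attribute values are identical."""
    classes = defaultdict(list)
    for idx, rec in enumerate(records):
        key = tuple(rec[a] for a in condition_attrs)
        classes[key].append(idx)
    return list(classes.values())


def approximations(records, condition_attrs, decision_attr, decision_value):
    """Return (lower, upper, boundary) index sets for one decision value."""
    # Target concept: every record that carries the given decision value.
    target = {i for i, r in enumerate(records)
              if r[decision_attr] == decision_value}
    lower, upper = set(), set()
    for cls in indiscernibility_classes(records, condition_attrs):
        members = set(cls)
        if members <= target:   # class lies entirely inside the concept
            lower |= members
        if members & target:    # class overlaps the concept
            upper |= members
    # Boundary region: records that cannot be classified with certainty.
    return lower, upper, upper - lower


# Hypothetical toy data: records 0 and 1 share conditions but disagree on 'fault'.
records = [
    {"temp": "high", "load": "heavy", "fault": "yes"},
    {"temp": "high", "load": "heavy", "fault": "no"},
    {"temp": "low", "load": "light", "fault": "no"},
    {"temp": "high", "load": "light", "fault": "yes"},
    {"temp": "low", "load": "light", "fault": "no"},
]

lower, upper, boundary = approximations(records, ["temp", "load"], "fault", "yes")
print("lower:", sorted(lower))        # certainly 'yes'
print("upper:", sorted(upper))        # possibly 'yes'
print("boundary:", sorted(boundary))  # imprecise records
```

In this toy data only record 3 is certainly a fault case, while records 0 and 1 are indistinguishable on their conditions yet disagree on the outcome, so they form the boundary region. Under this reading, singling out the boundary records is one way to separate the imprecise data before rules are derived from the rest.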