
Distributed Classification of Asynchronous Partial Model for Non-regular Drifting Data

Advisor: Ming-Syan Chen (陳銘憲)

Abstract


Big Data emphasizes five characteristics: volume, velocity, variety, value, and veracity. It covers many types of data, including scientific and engineering analysis, social networks, sensors, the Internet of Things, and multimedia applications. Turning such data efficiently into structured information makes the demand for data-mining techniques ever more urgent and challenging. Distributed classification systems play a key role in integrating distributed models and data: they combine the models of local sub-databases through statistical analysis and collaborative integration, allowing multiple local devices to collect data concurrently. With the spread of Big Data applications and of wireless and mobile technology, the amount of data with diverse characteristics generated by distributed devices keeps growing. Distributed classification models therefore face several challenges arising from massive data: 1) distributed classification over asynchronous partial data: constrained by limited resources such as power and storage, and by regional or other planning factors, local devices collect data with incomplete attributes and only partial coverage, so traditional methods that gather complete data, or integrate it through sampling and related statistical techniques, no longer apply in a distributed environment with asynchronous, incomplete data; 2) with respect to the data itself, rapidly changing data distributions, whose change patterns and trends also shift with the external environment, make it far more complex for traditional analyses to decide whether the data have changed, and observing the data through a single fixed-length time window greatly weakens a prediction model's ability to react to change; 3) to further scale up distributed classification, model-transformation techniques are needed to convert widely used, highly efficient non-rule-based models into transmittable rule-based models, so how a non-rule-based model is converted into an appropriate rule-based model becomes the key to classification performance.

This dissertation attempts to solve the above problems. We first focus on the distributed classification system and design a method for integrating distributed models built from asynchronous partial data, so that the local models of the whole distributed classification system can be fully exploited. Because the system allows local devices to collect varying amounts of local data, the diversity of the data and of its changes drives up the error rate caused by a single time window when models are built, sharply degrading system performance; we therefore propose a sequential clustering method that lets the system partition the data appropriately by time and data distribution, producing models that fit the distribution. Finally, the dissertation proposes two model-transformation methods that convert non-rule-based models, which traditionally cannot be sent to the server for integration, into rule-based models, extending the usable scale of the distributed classification model and improving overall performance. In both theoretical analysis and experiments, the proposed distributed classification model outperforms traditional distributed classification and applies more broadly.
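The integration of asynchronous partial local models described above can be illustrated with a minimal weighted-voting sketch. This is not the dissertation's actual algorithm; the function name, the weight scheme (e.g., local accuracy scaled by data volume), and the data layout are all assumptions made for illustration:

```python
from collections import defaultdict

def combine_partial_models(predictions, weights):
    """Weighted majority vote across local models.

    predictions: {model_id: predicted_label}. Each local model may be
    built from partial, asynchronously collected data, so only the
    models able to score this instance appear in the dict.
    weights: {model_id: weight}, e.g. local accuracy x local data volume.
    """
    scores = defaultdict(float)
    for model_id, label in predictions.items():
        # Accumulate each model's weight behind the label it voted for.
        scores[label] += weights.get(model_id, 0.0)
    # Return the label with the highest accumulated weight.
    return max(scores, key=scores.get)
```

Because only aggregated votes and weights travel between devices and the server, a scheme like this transmits organized information rather than raw instances, which is the bandwidth-saving property the abstract emphasizes.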

Abstract (English)


Big Data is characterized by the five Vs (volume, velocity, variety, value, and veracity) across many kinds of data, including scientific and engineering, social-network, sensor/IoT/IoE, and multimedia (audio, video, and image) data, which together create the Big Data challenges. This phenomenon makes it urgent to turn raw data efficiently into structured information. One predominant approach is the distributed classification ensemble, which improves prediction efficiency by using an ensemble of distributed models, or by integrating distributed information via statistics, and allows multiple devices to collect data concurrently. With the popularity of Big Data applications and of wireless and mobile technology, the amount of data with diverse characteristics generated by distributed devices has been increasing tremendously. As a result, distributed classification over Big Data faces new challenges. There are three main challenges in distributed big-data systems: 1) the distributed classification models are asynchronous and incomplete, since they come from distributed devices; traditional distributed classification algorithms, which rely on horizontal or vertical sub-databases, cannot be applied in this scenario. 2) Because of the diverse characteristics of Big Data, simply splitting the data into equal-sized chunks for model construction forfeits much of the performance benefit of classification models; in particular, non-regular recurring data are especially vulnerable to models derived from equal-sized windows, because noisy data interfere with most of the models in fixed-size buckets. 3) In our distributed environment, transforming popular lazy models into rules increases the diversity of local models and reduces the transmission bandwidth they consume. This dissertation tries to solve the above problems. First, it focuses on the distributed streaming scenario and proposes a rule-based distributed classification method for asynchronous partial data (DIP).
Our proposed method, DIP, selects models based on the amount of local data and the quality of local models, so that the performance gain can be fully utilized. DIP saves communication bandwidth by transferring organized information instead of individual instances, and it allows local devices to collect varying amounts of local data. Because of data diversity and distribution change, the performance of classification models built from fixed-size windows or chunks declines. We investigate the characteristics of non-regular data and introduce sequential clustering, which adaptively forms sequential clusters of data based on data distribution and time, reducing the inter-cluster interference of noisy data and improving the prediction accuracy of the derived models. Finally, this dissertation proposes two model-transformation methods, which transform data distributions into rules, so that popular lazy classifiers can participate in our distributed classifier. In both theoretical analysis and experiments, the proposed distributed classification framework achieves significant performance gains and broader applicability compared with the traditional distributed classification ensemble and existing methods for dynamically changing data.
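The sequential-clustering idea above, cutting a time-ordered stream by distribution shift rather than into fixed-size windows, can be sketched minimally as follows. The one-dimensional running-mean threshold rule and the function name are illustrative assumptions, not the dissertation's method:

```python
def sequential_clusters(stream, threshold):
    """Cut a time-ordered stream into contiguous clusters.

    A new cluster starts whenever a value deviates from the current
    cluster's running mean by more than `threshold`, so the window
    length adapts to the data distribution instead of being fixed.
    """
    clusters = []
    current = []
    for x in stream:
        if current:
            mean = sum(current) / len(current)
            if abs(x - mean) > threshold:
                # Distribution shift detected: close the current
                # cluster and begin a new one at this point in time.
                clusters.append(current)
                current = []
        current.append(x)
    if current:
        clusters.append(current)
    return clusters
```

Models are then derived per cluster, so a value drawn from a shifted distribution no longer pollutes the model of the preceding window, which is the inter-cluster noise reduction the abstract describes.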

