Applying a Virtual Sample Method to Improve Classification Performance on Imbalanced Big Data

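In the era of big data, enterprises can often obtain large amounts of data to build a learning model for decision making. With big data, such a model is likely to be affected by an imbalanced data set, which biases training so that the model favors the majority class. Building a reliable big-data learning model from a class-imbalanced data set is therefore one of the most important challenges enterprises currently face. To address this problem, this study proposes a new over-sampling method that increases the number of minority-class samples. The method uses the mega-trend-diffusion (MTD) technique to generate virtual samples and applies a plausibility assessment mechanism (PAM) to evaluate their suitability. Its aim is to reduce the false positive rate (FPR) in classification without affecting other classification performance measures such as accuracy, geometric mean (Gmean), and F1-measure (F1). We use a simulated data set to build a support vector machine (SVM) classification model, and the experimental results show that the proposed method can effectively improve classification performance on imbalanced big data.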
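To make the idea of generating virtual minority-class samples concrete, the following is a minimal single-attribute sketch. It follows a formulation of MTD commonly given in the literature (a diffused domain around the midpoint of the observed range, with a triangular membership function); the abstract does not give the study's exact variant, so every function name here (`mtd_bounds`, `membership`, `generate_virtual_values`) is hypothetical. The acceptance step, which keeps a candidate with probability equal to its membership value, is only a stand-in for illustration and should not be read as the study's PAM.

```python
import math
import random


def mtd_bounds(values):
    """Estimate a diffused domain (lower, center, upper) for one attribute.

    Assumption: this is a commonly cited MTD formulation, not necessarily
    the exact variant used in the study.
    """
    n = len(values)
    lo, hi = min(values), max(values)
    center = (lo + hi) / 2.0
    n_low = sum(1 for v in values if v < center)
    n_up = sum(1 for v in values if v > center)
    if n < 2 or n_low == 0 or n_up == 0:
        return lo, center, hi  # degenerate case: nothing to diffuse

    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance
    skew_low = n_low / (n_low + n_up)
    skew_up = n_up / (n_low + n_up)
    spread = -2.0 * math.log(1e-20)  # diffusion constant used by MTD

    lower = center - skew_low * math.sqrt(spread * var / n_low)
    upper = center + skew_up * math.sqrt(spread * var / n_up)
    # The diffused domain should never be narrower than the observed range.
    return min(lower, lo), center, max(upper, hi)


def membership(x, lower, center, upper):
    """Triangular membership value of x, peaking at the center."""
    if x <= lower or x >= upper:
        return 0.0
    if x <= center:
        return (x - lower) / (center - lower)
    return (upper - x) / (upper - center)


def generate_virtual_values(values, count, rng):
    """Draw `count` virtual values inside the diffused domain.

    Hypothetical acceptance rule (NOT the study's PAM): keep a candidate
    with probability equal to its membership value.
    """
    lower, center, upper = mtd_bounds(values)
    if lower == upper:
        return [center] * count
    virtual = []
    while len(virtual) < count:
        x = rng.uniform(lower, upper)
        if rng.random() <= membership(x, lower, center, upper):
            virtual.append(x)
    return virtual


rng = random.Random(0)
minority = [4.1, 4.8, 5.0, 5.3, 6.2]  # made-up minority-class values
lower, center, upper = mtd_bounds(minority)
virtual = generate_virtual_values(minority, 20, rng)
```

In this sketch, the virtual values would be appended to the minority class before training the classifier. A real data set has several attributes, and the sketch does not show how the study handles them jointly.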

Keywords

Imbalanced big data; over-sampling; virtual sample; false positive rate

Parallel Abstract

In the age of big data, enterprises can usually obtain large amounts of data to build a learning model for decision making. With big data, such a learning model tends to favor the majority class, because an imbalanced data set is likely to lead to biased training. Hence, using an imbalanced data set to build a reliable learning model for big data is one of the most important challenges for enterprises. To address this, this paper proposes a new over-sampling method that increases the data size of the minority class. The proposed method uses the mega-trend-diffusion (MTD) technique to generate virtual samples and a plausibility assessment mechanism (PAM) to assess the suitability of each virtual sample. The aim is to decrease the false positive rate (FPR) in classification without affecting the other indices used to assess classification performance, such as accuracy, geometric mean (Gmean), and F1-measure (F1). A simulated data set is used to build a support vector machine (SVM) classification model, and the experimental results show that the proposed method can effectively improve classification performance on imbalanced big data sets.
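The four indices named in the abstract can all be computed from a binary confusion matrix. The sketch below uses their standard definitions and assumes the minority class is treated as the positive class, which the abstract does not state. The function name `classification_metrics` and the counts in the example are made up for illustration and are not results from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, FPR, Gmean and F1 from confusion-matrix counts.

    Assumption: the minority class is the positive class.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # recall (sensitivity)
    tnr = tn / (tn + fp) if (tn + fp) else 0.0  # specificity
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    gmean = math.sqrt(tpr * tnr)  # geometric mean of TPR and TNR
    f1 = (2 * precision * tpr / (precision + tpr)) if (precision + tpr) else 0.0
    return {"accuracy": accuracy, "fpr": fpr, "gmean": gmean, "f1": f1}


# Illustrative counts only: 50 minority (positive) and 950 majority samples.
m = classification_metrics(tp=40, fp=30, tn=920, fn=10)
```

With imbalanced classes, accuracy alone can look high even when the minority class is poorly classified, which is why FPR, Gmean, and F1 are reported alongside it.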

Parallel Keywords

Imbalanced Big Data; IBD; Over-sampling; Virtual Sample; VS; False Positive Rate; FPR

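Cited By

曾冠倫 (2017). 以工業4.0為基礎之智慧工廠大數據平台建置 [Construction of an Industry 4.0-based smart factory big data platform] [Master's thesis, Chung Yuan Christian University]. Airiti Library. https://doi.org/10.6840/cycu201700450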
