Data Mining Techniques for Handling Long-Tailed Distributions and Attribute Data Distortion

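Long-tailed distributions and attribute data distortion are two common data characteristics, and both affect the size of prediction errors and the level of accuracy. Long-tailed data arise in many domains; because observations in the tail are scarce, analysis and prediction errors there are correspondingly larger, which hampers decision making. For the long-tailed distribution problem, this thesis proposes two different techniques, each aimed at reducing the excessive prediction error on tail data. The first technique converts the number of neighbors each data point has within a specified range into a distribution of sampling probabilities, draws multiple training sets by repeated sampling, and combines the predictions of all resulting models into the final prediction. The thesis also proposes a new hybrid strategy that reconciles the strengths of the traditional model and the ensemble model when predicting data from different positions in the long-tailed distribution. The second technique addresses the long-tailed problem through resampling, using oversampling and undersampling; likewise, the thesis proposes a new hybrid strategy for this second technique that reconciles the strengths of the traditional model and the improved model when predicting data from different positions in the distribution. According to the empirical evaluation, both techniques significantly reduce the error on tail data, and each hybrid strategy lets the traditional and improved models offset one another's weaknesses. Beyond the shortage of data in the long tail, the correctness of the data itself also affects analysis and prediction accuracy. Sources of incorrect data include random error and systematic error; among systematic errors, respondent error is common in the datasets collected by many kinds of systems. In other words, observed values are not always the actual values. Existing methods are mostly limited to handling random error, or to handling distorted categorical data with decision tree algorithms only. For a categorical attribute with distorted data, this thesis uses prior information supplied by experts about that attribute to convert each observed outcome, according to the corresponding conditional probabilities, into possible true outcomes, which are represented across multiple sample sets. For each sample set, several training sets are drawn by repeated sampling, and the models trained on them are combined to give that sample set's prediction; the predictions of all sample sets are then combined into the final result. According to the empirical evaluation, this technique is significantly more accurate than traditional models at handling distorted categorical data.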
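The first technique (neighbor counts turned into sampling probabilities, repeated sampling, then combining the models) can be illustrated with a minimal sketch. The abstract does not give the exact mapping from neighbor count to probability, so several choices here are assumptions made for illustration only: neighbors are counted on the target values, the weight is `1 / (1 + count)` so that sparse tail points are drawn more often, the base learner is a toy mean predictor, and the ensemble averages its members. All function names are hypothetical.

```python
import random

def neighbor_counts(targets, radius):
    """For each target value, count the other targets within `radius` of it."""
    return [
        sum(1 for j, other in enumerate(targets)
            if j != i and abs(other - value) <= radius)
        for i, value in enumerate(targets)
    ]

def sampling_probabilities(counts):
    """Assumed mapping: weight 1 / (1 + count), normalised to sum to 1,
    so points in sparse regions are drawn more often."""
    weights = [1.0 / (1 + c) for c in counts]
    total = sum(weights)
    return [w / total for w in weights]

def draw_training_sets(n_points, probs, n_sets, rng):
    """Draw `n_sets` index lists of length `n_points`, with replacement."""
    indices = list(range(n_points))
    return [rng.choices(indices, weights=probs, k=n_points)
            for _ in range(n_sets)]

def fit_mean_model(targets, sample):
    """Toy stand-in for a base learner: always predicts the sample mean."""
    mean = sum(targets[i] for i in sample) / len(sample)
    return lambda x: mean

def ensemble_predict(models, x):
    """Combine the base models by averaging their predictions."""
    return sum(model(x) for model in models) / len(models)

targets = [1.0, 1.1, 1.2, 1.3, 9.0]  # four head values, one isolated tail value
counts = neighbor_counts(targets, radius=0.5)
probs = sampling_probabilities(counts)
rng = random.Random(0)
samples = draw_training_sets(len(targets), probs, n_sets=10, rng=rng)
models = [fit_mean_model(targets, s) for s in samples]
prediction = ensemble_predict(models, x=None)
```

In this toy data the isolated value 9.0 has no neighbors, so under the assumed weighting it receives half of the total sampling probability, while each of the four clustered values receives one eighth. A real base learner would replace `fit_mean_model`.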

Keywords

Machine learning; long-tailed distribution; class imbalance; resampling; distorted data

Parallel Abstract

Data characteristics are critical to prediction effectiveness, especially in the long-tailed regression problem and the specific attribute distortion problem. However, current techniques are designed for general prediction tasks and cannot deal with such specific data characteristics. Two techniques, density bagging and bin-resampling, are developed to solve the long-tailed regression problem. However, both techniques pay for this with lower accuracy in the head and even the central part of the long-tailed distribution. This thesis therefore proposes two hybrid methods, corresponding to density bagging and bin-resampling respectively, which improve prediction performance for the tail part of the long-tailed distribution without sacrificing more prediction accuracy in the head and central parts. Three datasets are used to evaluate the proposed techniques and their hybrid methods, which are compared with several ensemble methods. In the specific attribute distortion problem, the observed outcome of an instance for an input attribute is not always the true outcome in real-world applications. We develop state populate bagging to handle specific attribute distortion in classification analysis. We first transform several true datasets into observed datasets according to the distortion matrices of their specific attributes, and then transform each observed dataset into a possible true dataset according to the reverse distortion matrices. The next step is to draw several same-size samples with replacement from each possible true dataset. State populate bagging with two voting layers not only maps an observed outcome onto possible true outcomes but also captures the ensemble gain for any classification algorithm, rather than being limited to a specific one. Finally, several true datasets from the UCI Machine Learning Repository are converted into observed datasets, and we evaluate the performance of state populate bagging and compare it with several benchmark algorithms.
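As a rough illustration of the reverse distortion step and the two voting layers, the sketch below assumes that the reverse distortion matrix is the Bayes posterior P(true | observed), computed from a distortion matrix P(observed | true) and an expert-supplied prior. The abstract does not state this construction explicitly, so it is an assumption, as are all function names, the example probabilities, and the use of simple majority voting in both layers.

```python
import random

def reverse_distortion(distortion, prior):
    """distortion[t][o] = P(observed=o | true=t); prior[t] = P(true=t).
    Returns reverse[o][t] = P(true=t | observed=o) by Bayes' rule.
    Assumes every observed state has non-zero total probability."""
    states = list(prior)
    reverse = {}
    for obs in states:
        joint = {true: distortion[true][obs] * prior[true] for true in states}
        evidence = sum(joint.values())
        reverse[obs] = {true: joint[true] / evidence for true in states}
    return reverse

def populate(observed, reverse, rng):
    """Replace each observed value with a sampled possible true value."""
    populated = []
    for obs in observed:
        states = list(reverse[obs])
        weights = [reverse[obs][s] for s in states]
        populated.append(rng.choices(states, weights=weights, k=1)[0])
    return populated

def majority(labels):
    """Most frequent label; ties go to the alphabetically first label."""
    return max(sorted(set(labels)), key=labels.count)

def two_layer_vote(predictions):
    """predictions[s][m] is the label from model m trained on sample set s.
    Layer 1 votes within each sample set, layer 2 votes across sample sets."""
    return majority([majority(per_set) for per_set in predictions])

# Hypothetical attribute: true "yes" is reported as "no" 30% of the time,
# while true "no" is always reported correctly.
distortion = {"yes": {"yes": 0.7, "no": 0.3},
              "no": {"yes": 0.0, "no": 1.0}}
prior = {"yes": 0.5, "no": 0.5}
reverse = reverse_distortion(distortion, prior)

rng = random.Random(0)
observed = ["yes", "no", "no", "no"]
sample_sets = [populate(observed, reverse, rng) for _ in range(3)]
final = two_layer_vote([["a", "a", "b"], ["b", "b", "a"], ["a", "b", "a"]])
```

With these example numbers an observed "yes" can only come from a true "yes", whereas an observed "no" is a true "yes" with probability 0.15 / 0.65. Each populated sample set would then be bootstrapped and passed to any classifier before the two voting layers are applied.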

Parallel Keywords

Machine Learning; Long Tail Distribution; Class Imbalance; Resampling; Distortion Data
