The Impact of Data Complexity Indices on Classification Techniques in Data Mining

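Classification techniques from the field of data mining are frequently applied to a wide range of classification problems, which makes selecting a suitable method from the many available techniques an important question. Past evaluations of classifier performance have usually compared prediction accuracy on a set of test datasets, model training time, and the like. In practice, however, every classification problem has its own data complexity, so an evaluation that gives all test datasets equal weight is clearly too idealized. This study therefore introduces nine data complexity indices to quantify the data characteristics of classification problems, and uses classification error rate, sensitivity, and specificity to observe how these indices affect seven commonly used classification techniques. The results show that different data characteristics do affect the performance of classification techniques. When handling future classification problems, researchers can refer to these results, first computing the more representative data complexity indices to anticipate the likely classification outcome, and then choosing a more suitable classification method according to the structure and characteristics of the data for subsequent research.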
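The three evaluation measures named in the abstract can be computed from a binary confusion matrix. The following is a minimal sketch, assuming a two-class problem with one label treated as the positive class; the function name `binary_metrics` and the toy labels are illustrative and do not come from the thesis.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return error rate, sensitivity, and specificity for binary labels.

    Hypothetical helper for illustration; `positive` marks the class
    treated as the positive one.
    """
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1  # positive case correctly detected
        elif truth == positive:
            fn += 1  # positive case missed
        elif pred == positive:
            fp += 1  # negative case flagged as positive
        else:
            tn += 1  # negative case correctly rejected

    total = tp + fp + tn + fn
    return {
        # share of all cases that were misclassified
        "error_rate": (fp + fn) / total,
        # share of actual positives that were detected
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # share of actual negatives that were rejected
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy labels (made up for this example): 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside the error rate matters when the classes are unbalanced, because a low overall error rate can hide poor detection of the smaller class.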

Keywords

data complexity; data mining; classifier; classification error rate; sensitivity; specificity

Parallel Abstract

Classification techniques in data mining are often used to handle a variety of classification problems, so choosing a suitable method from the many available techniques becomes an important issue. Classifier performance has usually been evaluated by comparing classifiers on several datasets in terms of classification accuracy, training time, and similar criteria. In practice, however, each classification problem has its own data complexity, and an assessment that gives the same weight to all datasets is clearly idealistic. Therefore, we adopt nine data complexity indices to quantify data characteristics and use classification error rate, sensitivity, and specificity to observe the influence of these indices on seven commonly used classification techniques. The results show that different data characteristics do affect classification performance. When dealing with classification problems, researchers can therefore first calculate the data complexity indices suggested in this paper to estimate classification difficulty, and then use these indices to choose an appropriate classification method for follow-up study.
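The abstract does not list the nine indices. As an illustration of what a data complexity index measures, here is a minimal sketch of one widely used measure, the maximum Fisher's discriminant ratio, computed per feature as the squared difference of class means divided by the sum of class variances. Whether this index is among the nine used in the thesis is an assumption; the function name `fisher_ratio`, the use of population variance, and the toy data are also illustrative choices.

```python
from statistics import mean, pvariance


def fisher_ratio(X, y):
    """Maximum Fisher's discriminant ratio over features (two classes).

    Hypothetical helper for illustration. Larger values suggest that at
    least one feature separates the classes well, i.e. an easier problem.
    """
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = classes

    best = 0.0
    for j in range(len(X[0])):
        col_a = [row[j] for row, label in zip(X, y) if label == a]
        col_b = [row[j] for row, label in zip(X, y) if label == b]
        denom = pvariance(col_a) + pvariance(col_b)
        if denom == 0:
            # Simplification: skip features with no within-class spread.
            continue
        ratio = (mean(col_a) - mean(col_b)) ** 2 / denom
        best = max(best, ratio)
    return best


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0],
     [7.0, 5.0], [8.0, 4.0], [9.0, 6.0]]
y = [0, 0, 0, 1, 1, 1]
f1 = fisher_ratio(X, y)
print(f1)
```

In this toy dataset the first feature yields a ratio of about 27 while the second yields 0, so the index reports the problem as easy along at least one dimension.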

Parallel Keywords

data complexity; data mining; classifiers; classification error rate; sensitivity; specificity

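Cited By

葉丞峻 (2017). An automatic classification system for binary imbalanced data with categorical variables [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2017.00933

王詩詠 (2013). The importance of data complexity indices in data mining classification methods [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2013.00426

江奕 (2013). Applying data mining techniques to the prediction of patient survival status [Master's thesis, Tamkang University]. Airiti Library. https://doi.org/10.6846/TKU.2013.00063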
