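This thesis uses Instance Neighbor Entropy (INE) to evaluate the Class Structure Ambiguity (CSA) of a class structure. The main idea is to examine the k nearest neighbors of each instance x: if most of these k neighbors belong to the same class as x, the entropy of x is low; otherwise it is high. A class structure that contains many high-entropy instances is therefore more ambiguous. To reduce the problems caused by a value of k that is too large or too small, the inverse of the distance between two points is used as the weight when computing an instance's ambiguity. To show that the proposed INE method can assess CSA, the CSA values are compared with the classification accuracy obtained by an SVM classifier. In general, the higher the CSA, the lower the accuracy, and the lower the CSA, the higher the accuracy, so the two are negatively correlated. Pearson's correlation coefficient ρ is used to measure the relationship between the CSA values and the classification accuracy. The experiments cover two types of class-structure data: in the first, the class ambiguity is controlled artificially; the second consists of real-world datasets provided by LIBSVM. The former, with different artificially controlled levels of class ambiguity, is used to show that the INE method evaluates CSA effectively; the latter is used to explore the ambiguity of real-world class structures. The experimental results show that ρ is close to -1 (perfect negative correlation) on the artificially controlled data, and that computing CSA with weighting yields a ρ closer to -1 than computing it without weighting. The LIBSVM data are used to explore how the ambiguity measure applies to real-world data.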
This thesis proposes Instance Neighbor Entropy (INE) with weighting to estimate the Class Structure Ambiguity (CSA) of a class structure. The main idea of INE(x)_k for an instance x is to compute the weighted entropy of the class probability distribution over the k nearest neighbors of x. The weight associated with that entropy is determined by the inverse of the distance between x and each of the other instances. An instance is regarded as ambiguous if most of its neighbors come from other classes, so a class structure may be ambiguous if it contains many ambiguous instances. To evaluate the effectiveness of CSA via INE, Pearson's correlation coefficient ρ between the accuracy achieved by SVM classifiers and the CSA values is computed; it is expected to be as close to -1 (perfect negative correlation) as possible. The experiments use two types of datasets. The first is synthetic: seed points are chosen for each class and, for each seed point, a fixed number of instances is generated randomly from a normal distribution, with the class ambiguity under control. The second consists of real-world datasets selected from LIBSVM. The experimental results show that CSA evaluated via INE(x)_k does reveal the degree of class ambiguity on the randomly generated datasets, since ρ is almost -1, and that INE(x)_k with weighted entropy evaluates ambiguity more precisely than INE(x)_k without weighting on both types of datasets.
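The abstract describes the INE idea only in words, so the following is a minimal sketch of one possible reading, not the thesis's exact formulation. It assumes that the class probability of a neighbor class is the share of inverse-distance weight carried by neighbors of that class, that distances are Euclidean, that the entropy uses base 2, and that CSA is the mean INE over all instances. The function names (instance_neighbor_entropy, class_structure_ambiguity, pearson_rho) and the eps guard against zero distances are hypothetical choices made for illustration.

```python
import numpy as np


def instance_neighbor_entropy(X, y, k=5, weighted=True, eps=1e-12):
    """Entropy of the class distribution among each instance's k nearest neighbors.

    Assumption: the probability of class c is the fraction of (inverse-distance)
    weight held by neighbors labelled c. The thesis may define it differently.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)

    # Pairwise Euclidean distances; a point is never its own neighbor.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    # Indices of the k nearest neighbors of every instance.
    nn = np.argsort(dist, axis=1)[:, :k]

    ine = np.zeros(len(X))
    for i in range(len(X)):
        d = dist[i, nn[i]]
        # Inverse-distance weights, or uniform weights for the unweighted variant.
        w = 1.0 / (d + eps) if weighted else np.ones(len(d))
        labels = y[nn[i]]
        p = np.array([w[labels == c].sum() for c in classes]) / w.sum()
        p = p[p > 0]  # skip empty classes so log2 is defined
        ine[i] = -(p * np.log2(p)).sum()
    return ine


def class_structure_ambiguity(X, y, k=5, weighted=True):
    """Assumed aggregation: CSA as the mean INE over all instances."""
    return float(instance_neighbor_entropy(X, y, k, weighted).mean())


def pearson_rho(a, b):
    """Pearson's correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(a, b)[0, 1])
```

Under these assumptions, a dataset whose classes form well-separated clusters gives a CSA of zero, and interleaving the labels raises it. Given CSA values and SVM accuracies measured on several datasets, pearson_rho(csa_values, accuracies) would produce the ρ that the abstract expects to be close to -1. Note that a plain neighbor-label entropy is also low when all neighbors share one class that differs from that of x; the abstract does not say how this case is handled.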