A Fuzzy Document Classification Method Based on AdaBoost.MH

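In this paper, we propose a Fuzzy AdaBoost.MH algorithm and apply it to document classification. The central idea of boosting is to take many weak hypotheses, obtain a weight for each of them through the boosting framework, and combine them into a single, highly accurate strong classifier. We use fuzzy rules as the weak hypotheses, with a decision stump rule as the basis for discrimination, and each fuzzy rule is built on a term in the document. In document feature representation, n-gram terms commonly serve as the most basic features; however, the number of n-grams in a document is often very large. In our system design we therefore filter terms by their frequency of occurrence and place the terms that pass the filter into a rule pool. In each round, Fuzzy AdaBoost.MH selects the best fuzzy rule from the rule pool, and the set of all selected fuzzy rules forms the basis for classification. We also propose a fuzzy number representation for the confidence of each fuzzy rule; this confidence information is the basis for inferring the classification result. Once training is complete, the final fuzzy classification result is inferred through a degree transformation process. Experiments on three document collections show that Fuzzy AdaBoost.MH achieves good classification results on all of them.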
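The term-filtering step can be illustrated with a minimal sketch. It assumes word-level n-grams, a raw occurrence count as the frequency measure, and a fixed threshold; the abstract does not state the n-gram unit, the exact frequency statistic, or the cutoff, so the names `word_ngrams`, `build_rule_pool`, `n_max`, and `min_count` are hypothetical.

```python
from collections import Counter

def word_ngrams(tokens, n_max=2):
    """Yield every word n-gram of length 1..n_max as a tuple."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])

def build_rule_pool(docs, n_max=2, min_count=2):
    """Keep the n-gram terms that occur at least min_count times overall.

    Assumption: the filter is a simple corpus-wide occurrence count.
    Each surviving term would later back one candidate rule.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(word_ngrams(tokens, n_max))
    return sorted(term for term, c in counts.items() if c >= min_count)

docs = [
    "stock market falls".split(),
    "stock market rallies".split(),
    "team wins final".split(),
]
pool = build_rule_pool(docs)
print(pool)
```

Only terms seen at least twice survive here, so the pool stays far smaller than the full n-gram vocabulary, which is the purpose the abstract gives for the filter.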

Keywords

fuzzification; document classification; ensemble learning

Parallel Abstract

In this paper, we propose a fuzzy AdaBoost.MH algorithm and apply it to the document classification domain. The main idea of boosting is to generate many relatively weak hypotheses and to combine them into a single, highly accurate classifier. In rule design, we employ a decision stump rule as the basic discriminative function, and each rule corresponds to a weak hypothesis. In system design, we employ term frequency as the filtering criterion to construct a rule pool. In each round, the best fuzzy rule is selected from the pool using the AdaBoost framework. We also propose a fuzzy number representation for each rule's confidence. These fuzzy rules, together with their confidence information, are the basis of classification inference. When the training phase is completed, the final fuzzy classification result is obtained from the inference result through a degree transformation process. The experimental results show that fuzzy AdaBoost.MH performs well on three data corpora.
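The round-by-round rule selection can be sketched as follows. This is the standard (non-fuzzy) AdaBoost.MH with real-valued, term-presence decision stumps, shown only to make the boosting loop concrete; the fuzzy number confidence and the degree transformation are not specified in the abstract and are not modelled. The function names, the smoothing constant `eps`, and the toy data are assumptions.

```python
import math

def train_adaboost_mh(X, Y, n_terms, n_labels, rounds=5, eps=1e-6):
    """X: list of sets of term ids. Y: per-document list of +1/-1 per label.

    Returns a list of (term, c), where c[j][l] is the confidence for
    label l when the term is absent (j = 0) or present (j = 1).
    """
    m = len(X)
    # Distribution over (document, label) pairs, initially uniform.
    D = [[1.0 / (m * n_labels)] * n_labels for _ in range(m)]
    model = []
    for _ in range(rounds):
        best = None
        for t in range(n_terms):
            # W[j][l][s]: weight in block j, label l, sign s (0: -1, 1: +1).
            W = [[[0.0, 0.0] for _ in range(n_labels)] for _ in range(2)]
            for i in range(m):
                j = 1 if t in X[i] else 0
                for l in range(n_labels):
                    s = 1 if Y[i][l] > 0 else 0
                    W[j][l][s] += D[i][l]
            # Normalisation factor Z; the stump with the smallest Z is best.
            Z = 2.0 * sum(math.sqrt(W[j][l][0] * W[j][l][1])
                          for j in range(2) for l in range(n_labels))
            if best is None or Z < best[0]:
                c = [[0.5 * math.log((W[j][l][1] + eps) / (W[j][l][0] + eps))
                      for l in range(n_labels)] for j in range(2)]
                best = (Z, t, c)
        _, t, c = best
        model.append((t, c))
        # Reweight: pairs the chosen stump gets wrong gain weight.
        total = 0.0
        for i in range(m):
            j = 1 if t in X[i] else 0
            for l in range(n_labels):
                D[i][l] *= math.exp(-Y[i][l] * c[j][l])
                total += D[i][l]
        for i in range(m):
            for l in range(n_labels):
                D[i][l] /= total
    return model

def scores(model, x, n_labels):
    """Sum the confidences of all selected stumps for document x."""
    out = [0.0] * n_labels
    for t, c in model:
        j = 1 if t in x else 0
        for l in range(n_labels):
            out[l] += c[j][l]
    return out

# Toy data (hypothetical): terms 0="stock", 1="goal", 2="the";
# labels 0=finance, 1=sports.
X = [{0, 2}, {0}, {1, 2}, {1}]
Y = [[1, -1], [1, -1], [-1, 1], [-1, 1]]
model = train_adaboost_mh(X, Y, n_terms=3, n_labels=2, rounds=3)
finance_doc = scores(model, {0}, 2)
sports_doc = scores(model, {1, 2}, 2)
```

In this sketch the sign of each summed score gives the predicted label membership and its magnitude acts as a crisp confidence; a fuzzy variant would replace that scalar with the paper's fuzzy number representation.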

Parallel Keywords

fuzzy; document classification; AdaBoost

