Cost-Sensitive Multi-Label Learning: Algorithms and Applications

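The first part of this thesis studies a new machine learning problem, which we call cost-sensitive multi-label classification. In this problem, each label of each instance can carry a different misclassification cost. We first use problem-transformation (reduction) techniques from machine learning to reduce cost-sensitive multi-label classification to cost-sensitive single-label classification. In addition, we propose a method based on a basis expansion model for solving cost-sensitive multi-label classification, called the Generalized k-Labelsets Ensemble. In this ensemble, each basis function is a label powerset classifier, and the coefficients of the basis functions are learned by minimizing the cost-sensitive error. We derive a fast procedure for computing the coefficients. The method can also be applied to ordinary multi-label classification. Experimental results on both ordinary and cost-sensitive multi-label classification confirm that the proposed method achieves better predictive performance.

How to determine misclassification costs in an application is an important practical question. The second part of this thesis studies two applications of cost-sensitive classification: medical image classification and social tag prediction. In medical image classification, we identify a patient-imbalance problem among the positive examples, which severely degrades the predictive ability of the image classifier. We use cost-sensitive learning to design patient-balanced learning algorithms, and with this approach we won first place in KDD Cup 2008. In social tag prediction, we propose using tag counts as misclassification costs and solve the problem with cost-sensitive multi-label learning. Experiments show that cost-sensitive multi-label learning outperforms our winning method from the MIREX 2009 music tag prediction contest under both cost-sensitive and ordinary evaluation metrics. Experiments on social bookmark prediction likewise show that the proposed method predicts better than the other methods.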

Keywords

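cost-sensitive multi-label classification; multi-label classification; ensemble method; music tagging and retrieval; medical image classification; patient-balanced learning; cost-sensitive classification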

Parallel Abstract

We study a generalization of traditional multi-label classification, which we refer to as cost-sensitive multi-label classification (CSML). In this problem, the misclassification cost can be different for each instance-label pair. To solve the problem, we propose two novel and general strategies based on the problem transformation technique. The proposed strategies transform the CSML problem into several cost-sensitive single-label classification problems. In addition, we propose a basis expansion model for CSML, which we call the Generalized k-Labelsets Ensemble (GLE). In the basis expansion model, each basis function is a label powerset classifier trained on a random k-labelset. The expansion coefficients are learned by minimizing the cost-weighted global error between the prediction and the ground truth. GLE can also be used for traditional multi-label classification. Experimental results on both multi-label classification and cost-sensitive multi-label classification demonstrate that our method performs better than the other methods.

Cost-sensitive classification assumes that the cost is given by the application, so "Where does the cost come from?" is an important practical issue. We study two real-world prediction tasks, medical image classification and social tag prediction, and link their data distributions to the cost information. In medical image classification, we observe a patient-imbalance phenomenon that seriously hurts the generalization ability of the image classifier. We design several patient-balanced learning algorithms based on cost-sensitive binary classification. The effectiveness of our patient-balanced learning methods is demonstrated by our winning entry in KDD Cup 2008. For social tag prediction, we propose to treat the tag counts as the misclassification costs and model the social tagging problem as a cost-sensitive multi-label classification problem. The experimental results in audio tag annotation and retrieval demonstrate that the CSML approaches outperform our winning method in the Music Information Retrieval Evaluation eXchange (MIREX) 2009 in terms of both cost-sensitive and cost-insensitive evaluation metrics. The results on social bookmark prediction also demonstrate that our proposed method performs better than the other methods.
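The coefficient-learning step of a basis expansion model such as GLE can be illustrated with a small sketch. It is not the thesis's implementation: it assumes labels encoded as +1/-1, basis outputs that are 0 for labels outside a basis function's k-labelset, and a cost-weighted squared error as the surrogate for the "cost-weighted global error". The function names `fit_coefficients`, `predict`, and `cost_weighted_error` are hypothetical, and the basis outputs are taken as given rather than produced by trained label powerset classifiers.

```python
import numpy as np

def fit_coefficients(H, Y, C):
    """Fit expansion coefficients by cost-weighted least squares (an assumed surrogate).

    H: (M, N, K) basis outputs in {-1, 0, +1}; 0 marks labels outside a basis's labelset.
    Y: (N, K) ground-truth labels in {-1, +1}.
    C: (N, K) non-negative misclassification costs, one per instance-label pair.
    """
    M = H.shape[0]
    A = H.reshape(M, -1).T.astype(float)      # (N*K, M): one column per basis function
    y = Y.reshape(-1).astype(float)           # (N*K,)
    w = np.sqrt(C.reshape(-1).astype(float))  # sqrt of costs turns this into weighted LS
    beta, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
    return beta

def predict(H, beta):
    """Combine basis outputs with coefficients and threshold at zero."""
    scores = np.tensordot(beta, H, axes=1)    # (N, K)
    return np.where(scores >= 0, 1, -1)

def cost_weighted_error(Y_pred, Y, C):
    """Fraction of total cost incurred by wrongly predicted instance-label pairs."""
    return float((C * (Y_pred != Y)).sum() / C.sum())
```

With uniform costs `C`, the same routine covers the ordinary multi-label case. Setting all coefficients to 1 corresponds to plain voting among the basis functions, which gives a baseline to compare learned coefficients against.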

Parallel Keywords

cost-sensitive multi-label classification; multi-label classification; ensemble method; tag-based music annotation and retrieval; medical image classification; patient-balanced learning; cost-sensitive learning

