
多標籤分類中對稀有標籤的閥值調整策略之討論

On the Thresholding Strategies for Rare Labels in Multi-label Classification

Advisor: 林智仁 (Chih-Jen Lin)

Abstract


In multi-label classification tasks, the imbalance among label frequencies is a common problem. For labels that occur rarely, the default threshold used to produce binary predictions is often sub-optimal. However, past work has observed that selecting a new threshold by directly optimizing the F-measure easily leads to overfitting. In this thesis, we explain why tuning thresholds to optimize the F-measure and similar evaluation metrics is particularly prone to overfitting. We then analyze the FBR heuristic, an existing remedy for this overfitting. We provide an explanation for its success, but also point out potential problems with FBR. To address the issues we identify, we propose a new technique that smooths the F-measure during threshold optimization. We prove theoretically that, with properly chosen parameters, smoothing yields desirable properties for the tuned threshold. Building on the idea of smoothing, we further propose jointly optimizing the micro-averaged F and the macro-averaged F. This approach enjoys the benefits of smoothing while being more lightweight, requiring no additional hyperparameters to tune. We verify the effectiveness of the new methods on text and node classification datasets, where they consistently outperform the FBR heuristic.
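For concreteness, the metrics discussed above can be written out. The following is the standard definition of the per-label F-measure and of its macro and micro averages over L labels; the notation (per-label counts TP_j, FP_j, FN_j obtained by binarizing label j's decision values at some threshold θ_j) is ours for illustration and is not taken from the thesis.

```latex
% Counts for label j after binarizing its decision values at threshold \theta_j.
F_j = \frac{2\,\mathrm{TP}_j}{2\,\mathrm{TP}_j + \mathrm{FP}_j + \mathrm{FN}_j},
\qquad
\text{Macro-}F = \frac{1}{L}\sum_{j=1}^{L} F_j,
\qquad
\text{Micro-}F = \frac{2\sum_{j=1}^{L}\mathrm{TP}_j}{2\sum_{j=1}^{L}\mathrm{TP}_j + \sum_{j=1}^{L}\mathrm{FP}_j + \sum_{j=1}^{L}\mathrm{FN}_j}.
```

For a rare label, TP_j, FP_j, and FN_j are small integers on any held-out set, so F_j is a step function of θ_j with only a few jumps; this is the setting in which direct maximization can latch onto noise.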

Abstract (English)


In multi-label classification, the imbalance among labels is often a concern. For a label that seldom occurs, the default threshold used to generate binarized predictions of that label is usually sub-optimal. However, directly tuning the threshold to optimize the F-measure has been observed to overfit easily. In this work, we explain why tuning the thresholds for rare labels to optimize F-measure (and similar metrics) is particularly prone to overfitting. We then analyze the FBR heuristic, a previous technique proposed to address the overfitting issue. We explain its success but also point out its potential problems. We then propose a new technique based on smoothing the F-measure when tuning the threshold. We theoretically prove that, with proper parameters, smoothing results in desirable properties of the tuned threshold. Building on the idea of smoothing, we further propose jointly optimizing micro-F and macro-F as a lightweight alternative free from extra hyperparameters. Our methods are empirically evaluated on text and node classification datasets. The results show that our methods consistently outperform the FBR heuristic.
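As a rough illustration of the baseline procedure discussed above (often called SCut in the thresholding literature), the sketch below tunes one threshold per label by directly maximizing F1 on held-out decision values. It is a minimal example under our own assumptions, not the thesis's implementation; the function names and the candidate-threshold scheme are illustrative.

```python
import numpy as np

def f1_from_counts(tp, fp, fn):
    """F-measure from confusion counts; defined as 0 when nothing is relevant or predicted."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def tune_threshold(scores, y):
    """Pick the threshold maximizing F1 for one label on held-out data.

    scores: decision values for one label on a validation set, shape (n,).
    y:      binary relevance of that label, shape (n,), values in {0, 1}.
    For a rare label only a handful of positives drive this choice, which is
    why direct maximization is prone to the overfitting described above.
    """
    best_t, best_f = 0.0, -1.0
    for t in np.unique(scores):          # candidate thresholds: observed decision values
        pred = scores >= t
        tp = int(np.sum(pred & (y == 1)))
        fp = int(np.sum(pred & (y == 0)))
        fn = int(np.sum(~pred & (y == 1)))
        f = f1_from_counts(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f

# Usage sketch: one threshold per label from validation decision values.
# scores_val has shape (n_samples, n_labels); Y_val is the binary label matrix.
# thresholds = [tune_threshold(scores_val[:, j], Y_val[:, j])[0]
#               for j in range(Y_val.shape[1])]
```

The thesis's contribution, as summarized above, is to replace this direct maximization with a smoothed objective, or a joint micro-F/macro-F objective, so that the tuned thresholds behave better for rare labels.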

References


Janez Brank, Marko Grobelnik, Natasa Milic-Frayling, and Dunja Mladenic. Training text classifiers with SVM on very few positive examples. Technical Report MSR-TR-2003-34, Microsoft Corporation, 2003.
Wei-Cheng Chang, Daniel Jiang, Hsiang-Fu Yu, Choon-Hui Teo, Jiong Zhang, Kai Zhong, Kedarnath Kolluri, Qie Hu, Nikhil Shandilya, Vyacheslav Ievgrafov, Japinder Singh, and Inderjit S Dhillon. Extreme multi-label learning for semantic matching in product search. In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2021.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
Rong-En Fan and Chih-Jen Lin. A study on threshold selection for multi-label classification. Technical report, Department of Computer Science, National Taiwan University, 2007.
Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 855–864, 2016.
