
Using Classification Trees for Detecting (Almost)-Perfectly-Classified Minority Groups

Advisor: 徐茉莉

Abstract


The emergence of algorithmic bias has attracted much attention in the data mining field. Although algorithms aim to reduce the human bias involved in decision making, they are trained on large amounts of data; when that data suffers from problems such as a lack of diversity, the data itself introduces bias into the algorithm, producing unfavorable outcomes for certain groups. Past research has focused mostly on bias caused by imbalance in the training data, including class imbalance and predictor imbalance. This study aims to detect underrepresented, highly homogeneous groups defined by combinations of predictors. We extend the problem from detecting discriminated individuals or groups to identifying specific behaviors, such as a subset of users exhibiting very similar outcomes in an application. By detecting highly homogeneous subgroups and studying the patterns within them, we can offer app designers suggestions for interface design and optimization; moreover, this information can indicate which users warrant further qualitative study. We compare the suitability of impurity-based classification trees and statistics-based classification trees for this problem. The results show that, compared with statistics-based trees that rely on a permutation test, impurity-based trees, which evaluate every candidate split point of each predictor with an impurity measure (in particular the Gini impurity measure) and do not automatically correct for predictor bias, are more likely to detect underrepresented, highly homogeneous groups. To illustrate the problem, we apply classification tree algorithms to a large dataset on app user behavior collected by an electric-scooter sharing service company.

English Abstract


The presence of algorithmic bias has recently attracted a lot of attention in the data mining community. Although data mining algorithms are designed to tackle and reduce human bias in decision making, the algorithms are trained on data, which itself can still introduce bias into the algorithms and thus generate unwanted outcomes that discriminate against certain categories of people. Previous studies have focused on biases that arise from imbalance issues in the training data: class imbalance (unbalanced outcome) as well as predictor imbalance can both lead to bias towards the majority class. In this research, we extend the study of detecting minority subgroups by considering combinations of predictors that create subgroups that are almost perfectly classified. We also extend the problem from detecting discriminated individuals or groups to identifying specific behavior profiles, such as on mobile applications, that have extremely homogeneous outcomes. By detecting homogeneous subgroups and studying their different patterns and profiles, detection can provide insights for app designers. Such information can also point to patterns that require further qualitative investigation. We focus on decision trees and compare the suitability of impurity-based trees and statistics-based trees for this task. We find that the most potent approach is using impurity-based CART-type trees, such as those constructed by rpart in R, which do not correct for predictor bias and use impurity measures for selecting splits. Specifically, we find that the Gini impurity measure is most suitable. This approach is more likely to find homogeneous subgroups than the two-step, permutation-test-based approach taken by statistics-based trees such as ctree. To illustrate these issues, we apply the different tree approaches to a large dataset on app user behavior collected by a leading e-scooter sharing-economy service.
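As a minimal sketch of the split criterion discussed above (not code from the thesis; the data and function names are hypothetical), the following Python shows why a Gini-impurity-based split can isolate a small, almost perfectly classified subgroup: splitting off a homogeneous minority lowers the weighted impurity, so an impurity-based tree is rewarded for finding it.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Size-weighted average impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical outcomes: a large mixed majority plus a small,
# almost perfectly classified minority subgroup.
majority = [0, 1] * 45          # 90 users with 50/50 outcomes
minority = [1] * 9 + [0]        # 10 users, 9 of 10 share one outcome

print(round(gini(majority + minority), 4))          # → 0.4968 (before split)
print(round(weighted_gini(majority, minority), 4))  # → 0.468 (after split)
```

Because 0.468 < 0.4968, a CART-type tree considering this split sees an impurity reduction even though the homogeneous subgroup holds only 10% of the observations; a permutation-test-based tree may instead judge the association statistically insignificant at this sample size and never make the split.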

