  • Thesis

釐清 Gains chart 與 Lift chart 之混淆以增進實務中的有效應用

Clarifying Confusions about Gains and Lift Charts to Improve Their Current Underuse in Practice

Advisor: 徐茉莉

Abstract

Gains charts and lift charts are two useful data mining performance measures for evaluating ranking problems. Both charts rank the data by predicted classification probability, which helps in choosing a threshold for targeting a subset of top-ranked records. Although they are deployed in a number of application areas and are mentioned in textbooks and papers, the terminology and definitions surrounding gains and lift charts are inconsistent, which leads to difficulty in use and to misinterpretation. In this research, we clarify these confusions to improve the charts' current underuse in practice. We first document the issue through a literature search that shows the dominance of other classification evaluation criteria, such as accuracy, the ROC (Receiver Operating Characteristic) curve, sensitivity, and specificity. Our survey also shows that the names and definitions of the gains chart and the lift chart are often mixed up in both publications and data mining software. We organize the disparate terminology, computation approaches, and perspectives on gains and lift charts in a clear, methodical way to clarify their uses and reproducibility. We then introduce decile, profit, and non-cumulative charts based on gains and lift values. To integrate this work, we created the gainslift R package, which provides clear and consistent gains and lift charts. Finally, we propose three main uses of the charts for comparing the performance of data mining algorithms under different circumstances, and we illustrate them with a practical case from the Kaggle platform, including example gains and lift charts produced with our package.
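
As a rough illustration of the ranking idea described in the abstract, the sketch below computes cumulative gains and lift values in plain Python. It is not the gainslift R package and does not reproduce its interface. The function name `cumulative_gains_and_lift`, the toy labels, and the toy scores are all hypothetical. The definitions used here (gains as the share of all positives captured in the top-ranked records, lift as gains divided by the share of records targeted) are one common convention, assumed for this example; as the abstract notes, definitions vary across sources.

```python
def cumulative_gains_and_lift(y_true, scores):
    """Return (fraction_targeted, gains, lift), one entry per top-k cut.

    Assumed convention (one of several in use):
      gains[k] = positives among the top-k records / all positives
      lift[k]  = gains[k] / (k / n)
    """
    if len(y_true) != len(scores):
        raise ValueError("y_true and scores must have the same length")
    n = len(y_true)
    total_pos = sum(y_true)
    if n == 0 or total_pos == 0:
        raise ValueError("need at least one record and one positive label")

    # Rank records from highest to lowest predicted probability.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)

    fraction_targeted, gains, lift = [], [], []
    captured = 0
    for k, i in enumerate(order, start=1):
        captured += y_true[i]          # positives found so far
        frac = k / n                   # share of records targeted
        gain = captured / total_pos    # share of positives captured
        fraction_targeted.append(frac)
        gains.append(gain)
        lift.append(gain / frac)       # improvement over random targeting
    return fraction_targeted, gains, lift


# Hypothetical toy data: 10 records, 3 positives.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
fraction_targeted, gains, lift = cumulative_gains_and_lift(y_true, scores)
```

Under these assumptions, plotting `gains` against `fraction_targeted` gives a cumulative gains chart, and plotting `lift` against `fraction_targeted` gives a cumulative lift chart. A lift above 1 at a given cut means the ranking captures more positives than random targeting of the same number of records would be expected to.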
