How to Adjust the Significance Level in Big Data Analysis

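It is well known that in big data analysis the p-value is extremely small whatever test is performed, so applying the conventional significance level leads to rejection in every test. The significance level, however, is the upper bound on the tolerable probability of a Type I error, and it is a given value. Why, then, should the significance level be adjusted in big data analysis, and how? Because the sample size in big data approaches the population size, the p-value changes with the ratio of the sample size to the population size: the larger this ratio, the smaller the p-value. Moreover, because the population size is unknown, the adjusted p-value cannot be computed for big data. When existing statistical software is used to analyze big data, we therefore take a workaround and adjust the significance level instead. Based on the relationship between the p-value and the ratio of the sample size to the population size, this study recommends that, for a one-tailed test of a single population mean, a significance level of 0.05 be adjusted to 0.0001 when the ratio R = 80%, to 9.861×10^(-8) when R = 90%, and to 9.426×10^(-14) when R = 95%. In big data, when the sample accounts for more than 95% of the population, the sample mean is already very close to the population mean, and this study recommends that descriptive statistics suffice and that no hypothesis testing is needed.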
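The abstract does not state the formula that links the significance level to the ratio R. As an illustration only, the sketch below assumes a finite-population-correction style rescaling in which the one-tailed critical value z_α is divided by sqrt(1 − R) and the adjusted level is the upper-tail normal probability beyond the rescaled value. The function name `adjusted_alpha` and this formula are assumptions made for the example, not the authors' stated method. Under this assumption the results come out close to the adjusted levels quoted above.

```python
import math
from statistics import NormalDist


def adjusted_alpha(alpha: float, ratio: float) -> float:
    """Adjusted one-tailed significance level for sampling ratio R = n / N.

    ASSUMPTION (not stated in the abstract): the critical value z_alpha is
    rescaled by 1 / sqrt(1 - R), in the spirit of the finite population
    correction, and the adjusted level is the upper-tail probability of the
    standard normal distribution beyond the rescaled value.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must satisfy 0 <= ratio < 1")

    # One-tailed critical value at the nominal level (about 1.645 for 0.05).
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)

    # Rescale the critical value by the assumed correction factor.
    z_adjusted = z_alpha / math.sqrt(1.0 - ratio)

    # Upper-tail probability via erfc, which stays accurate for tiny tails
    # where 1 - cdf(z) would lose precision.
    return 0.5 * math.erfc(z_adjusted / math.sqrt(2.0))


if __name__ == "__main__":
    for ratio in (0.80, 0.90, 0.95):
        level = adjusted_alpha(0.05, ratio)
        print(f"R = {ratio:.0%}: adjusted level ~ {level:.3e}")
```

With R = 0 the function returns the nominal level unchanged, and the adjusted level shrinks rapidly as R approaches 1, which is consistent with the recommendation to rely on descriptive statistics once R exceeds 95%.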

Keywords

big data; hypothesis testing; significance level; p-value

Parallel Abstract

In big data analysis, a typical hypothesis test may return a very small p-value. If the significance level is not adjusted, every test may end in rejection. However, the significance level, which is the upper limit of the probability of a Type I error, is a given value. Why and how, then, should the significance level be adjusted in big data analysis? Because the sample size approaches the population size in big data analysis, the p-value changes with the ratio of the sample size to the population size: the larger the ratio, the smaller the p-value. Additionally, because the population size is unknown, it is impossible to calculate the adjusted p-value. Hence, when using current statistical packages to analyze big data, we adjust the significance level instead. According to the relationship between the p-value and the ratio R of the sample size to the population size, it is suggested that, for a one-tailed test of a single population mean, the significance level of 0.05 be adjusted to 0.0001 if R is 80%, to 9.861×10^(-8) if R is 90%, and to 9.426×10^(-14) if R is 95%. When R exceeds 95% in big data, the sample mean is very close to the population mean, and it is suggested that descriptive statistics suffice and that hypothesis testing is not necessary.

Parallel Keywords

big data; hypothesis testing; significance level; p-value
