
巨量資料分析下如何調整顯著水準

How to Adjust Significance Level for Big Data Analysis

Abstract


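It is well known that in big data analysis, whatever test is performed, the p-value is extremely small, so applying the conventional significance level causes every null hypothesis to be rejected. Yet the significance level is the upper bound on the tolerable probability of a Type I error, and it is a given value. Why, then, should the significance level be adjusted in big data analysis, and how? Because the sample size of big data approaches the population size, the p-value changes with the ratio of the sample size to the population size: the larger the ratio, the smaller the p-value. Moreover, because the population size is unknown, the adjusted p-value cannot be computed; when big data are analyzed with existing statistical software, we instead take the workaround of adjusting the significance level. Based on the relationship between the p-value and the ratio R of the sample size to the population size, this study recommends that, for a one-tailed test of a single population mean, the significance level of 0.05 be adjusted to 0.0001 when R = 80%, to 9.861×10^(-8) when R = 90%, and to 9.426×10^(-14) when R = 95%. When the sample exceeds 95% of the population in big data, the sample mean is already very close to the population mean, and this study recommends reporting descriptive statistics only, with no hypothesis testing.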

Keywords

big data; hypothesis testing; significance level; p-value

Parallel Abstract


In big data analysis, a typical hypothesis test returns a very small p-value, so if the significance level is not adjusted, every null hypothesis may be rejected. However, the significance level, the upper limit on the probability of a Type I error, is a given value. Why and how, then, should the significance level be adjusted in big data analysis? Because the sample size approaches the population size in big data analysis, the p-value changes with the ratio of the sample size to the population size: the larger the ratio, the smaller the p-value. Additionally, because the population size is unknown, the adjusted p-value cannot be calculated. Hence, when analyzing big data with existing statistical packages, we adjust the significance level instead. Based on the relationship between the p-value and the ratio R of the sample size to the population size, this study suggests that, for a one-tailed test of a single population mean, the significance level of 0.05 be adjusted to 0.0001 if R is 80%, to 9.861×10^(-8) if R is 90%, and to 9.426×10^(-14) if R is 95%. When R exceeds 95% in big data, the sample mean is very close to the population mean, and this study suggests that descriptive statistics are sufficient and that hypothesis testing is unnecessary.
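The abstract reports the adjusted significance levels but not the formula that produces them. As a rough check, the sketch below assumes the relationship is α' = 1 − Φ(z / √(1 − R)), where Φ is the standard normal CDF and z = 1.645 is the rounded one-tailed critical value for α = 0.05. This formula is inferred from the three reported values, not quoted from the paper, and the names `upper_tail` and `adjusted_alpha` are hypothetical.

```python
import math

# Assumption: the rounded one-tailed critical value for alpha = 0.05.
Z_CRIT_ONE_TAILED_05 = 1.645


def upper_tail(z):
    """Upper-tail probability of the standard normal, 1 - Phi(z)."""
    # erfc stays accurate for large z, where 1 - Phi(z) would underflow.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def adjusted_alpha(ratio, z_crit=Z_CRIT_ONE_TAILED_05):
    """Adjusted significance level for a sample-to-population ratio R.

    Assumed (inferred) relationship: alpha' = 1 - Phi(z_crit / sqrt(1 - R)).
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must satisfy 0 <= R < 1")
    return upper_tail(z_crit / math.sqrt(1.0 - ratio))


if __name__ == "__main__":
    for r in (0.80, 0.90, 0.95):
        print(f"R = {r:.0%}: adjusted alpha ~ {adjusted_alpha(r):.3e}")
```

Under this assumption, the computed levels for R = 90% and R = 95% agree with the reported 9.861×10^(-8) and 9.426×10^(-14) to within rounding, and the level for R = 80% rounds to the reported 0.0001. At R = 0 the function returns approximately the unadjusted 0.05.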
