
巨量資料分析下如何調整顯著水準

How to Adjust Significance Level for Big Data Analysis

Abstracts


眾所周知,在巨量資料分析時,不論進行任何檢定,P值都會非常小,如果使用過去的顯著水準,會造成所有檢定都被拒絕。但顯著水準是可容忍犯型I錯誤發生機率的上限,它是給定值。在巨量資料分析時,為什麼要調整顯著水準,以及要如何調整顯著水準?因為巨量資料的樣本數接近母體總數,所以P值會隨著樣本數佔母體總數的比例而改變,此比例愈大,P值愈小。此外,因為母體總數未知,故無法計算巨量資料下調整後的P值,在使用現有的統計軟體去分析巨量資料的情況下,我們改以變通的作法,調整顯著水準來因應。根據P值與樣本數佔母體總數的比例之關係式,本研究建議在一個母體平均數的單尾檢定時,當樣本數佔母體總數的比例R=80%時,顯著水準0.05要調整為0.0001;R=90%時,要調整為9.861×10^(-8);R=95%時,要調整為9.426×10^(-14)。在巨量資料下,當樣本佔母體的比例大於95%時,樣本平均數已經非常接近母體平均數,本研究建議此時分析作敘述統計即可,不需要再做任何的假設檢定。

Keywords

巨量資料;檢定;顯著水準;P值

Parallel abstracts


It is well known that in big data analysis almost any hypothesis test returns a very small p-value. If the conventional significance level is left unadjusted, every test ends up rejecting its null hypothesis. However, the significance level is the upper limit on the tolerable probability of a type I error, and it is a given value. Why, then, should the significance level be adjusted in big data analysis, and how? Because the sample size in big data approaches the population size, the p-value changes with the ratio of the sample size to the population size: the larger the ratio, the smaller the p-value. In addition, because the population size is unknown, the adjusted p-value cannot be calculated. As a workaround when analyzing big data with existing statistical software, we adjust the significance level instead. Based on the relationship between the p-value and the ratio R of the sample size to the population size, this study suggests that, for a one-tailed test of a single population mean, the significance level of 0.05 be adjusted to 0.0001 when R is 80%, to 9.861×10^(-8) when R is 90%, and to 9.426×10^(-14) when R is 95%. When R exceeds 95%, the sample mean is already very close to the population mean, and this study suggests that descriptive statistics suffice and no hypothesis testing is needed.
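The abstract does not state the relationship between the p-value and R, so the following is only a sketch under an assumption. The three reported levels appear to be numerically consistent with evaluating the standard normal upper tail at 1.645/sqrt(1 - R), where sqrt(1 - R) is the large-population form of the finite population correction and 1.645 is the rounded one-tailed critical value for 0.05. The function names `upper_tail` and `adjusted_alpha` are hypothetical, and this should not be read as the authors' actual derivation.

```python
import math

# Assumption: the conventional rounded one-tailed critical value for alpha = 0.05.
Z_05 = 1.645


def upper_tail(z: float) -> float:
    """Return P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def adjusted_alpha(ratio: float, z_crit: float = Z_05) -> float:
    """Hypothetical helper: significance level for a sample-to-population ratio R.

    Assumes the critical value is divided by sqrt(1 - R), the approximate
    finite population correction factor. This relation is inferred from the
    reported numbers, not quoted from the paper.
    """
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must satisfy 0 <= R < 1")
    return upper_tail(z_crit / math.sqrt(1.0 - ratio))


if __name__ == "__main__":
    for r in (0.80, 0.90, 0.95):
        print(f"R = {r:.0%}: adjusted significance level ~ {adjusted_alpha(r):.3e}")
```

Under this assumption the R = 90% and R = 95% cases agree closely with the reported 9.861×10^(-8) and 9.426×10^(-14), and the R = 80% case comes out near 1.2×10^(-4), which the abstract appears to report as 0.0001. At R = 0 the helper returns approximately 0.05, the unadjusted level.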

Parallel keywords

big data; hypothesis testing; significance level; p-value
