

How to Adjust Significance Level for Big Data Analysis




Big data, hypothesis testing, significance level, p-value


In big data analysis, a typical hypothesis test may return a very small p-value. If the significance level is not adjusted, the null hypothesis may be rejected in every test. However, the significance level, that is, the upper limit on the probability of a type I error, is a value fixed in advance. Why, and how, should it be adjusted in big data analysis? Because the sample size approaches the population size in big data analysis, the p-value changes with the ratio of the sample size to the population size: the larger the ratio, the smaller the p-value. In addition, because the population size is unknown, the adjusted p-value cannot be calculated. Hence, when using current statistical packages, we must adjust the significance level instead in order to analyze big data. Based on the relationship between the p-value and the ratio R of the sample size to the population size, it is suggested that a significance level of 0.05 be adjusted to 0.0001 if R is 80%, to 9.861×10^(-8) if R is 90%, and to 9.426×10^(-14) if R is 95%. When R exceeds 95%, the sample mean is very close to the population mean; descriptive statistical analysis is then suggested, and inference and hypothesis testing are not necessary.
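The abstract does not state the formula that links the p-value to the ratio R. The short Python sketch below shows one relationship that appears to be consistent with the reported values: a one-sided z test whose standard error is shrunk by a finite population correction of roughly sqrt(1 - R). This is an assumption made for illustration, not the authors' stated method, and the function name `adjusted_alpha` is hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def adjusted_alpha(alpha: float, ratio: float) -> float:
    """Adjust a significance level for the sampling ratio R = n / N.

    Assumed rule (not taken from the abstract): a one-sided z test whose
    critical value is divided by the finite population correction
    sqrt(1 - R), then converted back to an upper-tail probability.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if not 0.0 <= ratio < 1.0:
        raise ValueError("ratio must lie in [0, 1)")
    # One-sided critical value for the nominal level, about 1.645 at 0.05.
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    # The correction shrinks the standard error, which inflates z.
    z_adj = z_crit / sqrt(1.0 - ratio)
    # Upper-tail probability via erfc, which stays accurate for tiny tails.
    return 0.5 * erfc(z_adj / sqrt(2.0))


if __name__ == "__main__":
    for r in (0.80, 0.90, 0.95):
        print(f"R = {r:.0%}: adjusted level = {adjusted_alpha(0.05, r):.3e}")
```

Under this assumption the results for R = 80%, 90%, and 95% should come out close to the levels quoted in the abstract. Small differences are expected because the sketch uses the exact normal quantile rather than a rounded critical value. With R = 0 the function returns the nominal level unchanged.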
