

Data Construction Method to Small Sample Sets: Theory and Applications

Advisor: 王小璠


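Small-sample problems refer broadly to situations in which insufficient data lead to poor analytical performance and erroneous inferences. Collecting more data is clearly the most direct and effective remedy. In some cases, however, obtaining enough data is very difficult or even infeasible. For example, when a product has a short life cycle, firms must often draw up production strategies from limited data in order to respond quickly to market demand. In particular, for rare real-world events such as disastrous earthquakes, tsunamis, tornadoes, terrorist attacks, and mental illness, the available data are inherently scarce, yet most of these events carry social costs that are hard to estimate. How to extract more useful information from a small amount of data is therefore an urgent issue. This thesis defines multiset division and from it develops a new approach, called the Data Construction Method, which generates additional data within the value range given by the original sample, in order to overcome the problem that the behavior of the data cannot emerge when the sample is too small. In addition, the Data Construction Method can derive a membership function from the generated data for use in further analysis. With these properties, the Data Construction Method can be used to fill the information gaps caused by small samples and to improve the accuracy of inference. The thesis first describes the theoretical background, properties, and procedure of the Data Construction Method. To verify its effectiveness, the generated data are applied to two small-sample problems, constructing confidence intervals to estimate an unknown population mean and improving the classification accuracy of supervised neural networks, and the performance of several existing methods on these two problems is compared. The experimental results show that the additional data generated by the Data Construction Method not only provide useful information but also yield better analytical performance. The thesis then applies the Data Construction Method to two case studies: assessing the distribution of disastrous earthquakes in Taiwan and predicting when patients with schizophrenia will relapse. The results show that the membership functions derived by the Data Construction Method provide good predictive performance for both of these rare-event problems, one spatial and one temporal.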


Small-sample problems arise when a lack of sufficient training data leads to poor prediction performance and erroneous conclusions. Collecting more data is the simplest and most direct remedy. In some cases, however, it is difficult or even impossible to obtain additional data for analysis. For example, as product life cycles have become shorter and shorter, it has become increasingly difficult to collect sufficient data for acquiring management knowledge in the early stages of manufacturing systems, which puts managers in an awkward position. There are also rare events that occur infrequently in the real world, e.g. severe earthquakes, tsunamis, tornadoes, terrorist attacks, and periodic psychotic episodes of individual schizophrenics. The data available for analyzing such events are scarce by nature, and the events often carry a high socio-economic cost. Thus, how to extract as much information as possible from a small sample remains a critical issue. To meet this need, we propose a heuristic method, termed the Data Construction Method (DCM), based on multiset division. The DCM not only generates additional data within the value range of the given sample to reveal patterns in the data, but also derives a membership function from the generated data for further applications. In this way, the DCM is used to fill the information gaps caused by small sample sets. To demonstrate its effectiveness, after presenting the DCM’s theoretical background, properties, and algorithm, we compared the DCM with several existing approaches in estimating the population mean and in improving the learning performance of supervised neural networks. The results show that the DCM compares favorably with these approaches.
We then applied the membership function derived from the DCM-generated data to two case studies: predicting severe earthquakes in Taiwan and forecasting psychotic episodes of individual schizophrenics. The results show that the DCM can provide an appropriate reference for prediction.


