大量時間序列之預測與叢聚分析

Forecasting and Clustering Large Collections of Time Series

Advisor: 徐茉莉

Abstract


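This thesis focuses on two main topics in the analysis of large collections of time series: forecasting and clustering. We first propose a fast and easy-to-use ordinary least squares linear regression (OLS-LR) model which, when applied to forecasting large collections of time series, closely approximates more complex methods such as autoregressive integrated moving average (ARIMA), exponential smoothing (ETS), and state-space models. The thesis uses this OLS-LR model as the basis for several methods for forecasting and clustering time series.

The OLS-LR model can be used to forecast each time series separately or, after pre-processing (such as data cleaning and clustering), as a single model for forecasting multiple time series. In this thesis, we use the OLS-LR model to forecast many short time series. The approach pools all the series together and uses a single estimated model to forecast each series individually. To demonstrate and evaluate the approach, we model the number of first-grade students at each school in Taiwan. Using data up to 2014, we build a forecasting model for the annual number of first-grade classrooms at each school in Taiwan for 2015-2019.

Next, we adopt the OLS-LR model for forecasting hierarchical and grouped time series. Forecasting hierarchical or grouped time series involves two steps: computing base forecasts, and reconciling the forecasts so that the values of the disaggregated series add up to the corresponding aggregated values. Base forecasts can be computed with popular time series forecasting methods such as ETS and ARIMA models. The reconciliation step is a linear process that adjusts the base forecasts to ensure that they are coherent. However, because each model must be numerically optimized for each series, using ETS or ARIMA for the base forecasts can be computationally challenging when the number of series to forecast is large. We propose a solution based on the OLS-LR model that avoids this computational problem and uses a single-step approach to obtain reconciled forecasts, rather than the usual two-step approach. The proposed method adds flexibility by allowing external data to be incorporated and missing values to be handled. We demonstrate it on two datasets: monthly Australian domestic tourism and daily Wikipedia pageviews. Compared with reconciliation based on ETS and ARIMA, the proposed method is faster and achieves similar forecast accuracy.

For clustering many time series, we propose a set of two new methods built on the OLS-LR model that capture temporal information (such as trend, seasonality, and autocorrelation) as well as domain-relevant cross-sectional attributes. The methods are based on model-based partitioning (MOB) trees and can serve as an automated yet transparent tool for clustering large collections of time series. We propose a single-step method and a two-step method. The single-step method uses a single OLS-LR model and clusters the series using trend, seasonality, time series lags, and domain-relevant cross-sectional attributes. The two-step method first clusters by trend, seasonality, and domain-relevant cross-sectional attributes, and then further clusters the residual series by autocorrelation and the domain-relevant cross-sectional attributes. Both methods produce clusters that domain experts can interpret. We demonstrate the usefulness of the proposed clustering methods through a forecasting application to Wikipedia article pageview time series. A comparison with alternative methods shows that the tree-based approach produces forecasts comparable to those of ARIMA models fitted to the individual series, while being faster and more efficient, which makes it suitable for scaling to large collections of time series. In addition, our method produces simple parametric forecasting models that can be used to interpret the resulting clusters of time series.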
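
As a rough illustration of the pooled idea described in the abstract (all series combined, one estimated model, each series forecast individually), the Python sketch below stacks the regression rows of several series and fits a single OLS model. The choice of regressors (intercept, linear trend, seasonal dummies, one lag), the function names, and the simulated data are assumptions made for this example and are not taken from the thesis.

```python
import numpy as np

def design_row(t, prev_value, period):
    """One regression row: intercept, linear trend, seasonal dummies, lag-1 value."""
    season = np.zeros(period - 1)
    if t % period != 0:              # season 0 is the baseline category
        season[t % period - 1] = 1.0
    return np.concatenate(([1.0, float(t)], season, [prev_value]))

def fit_pooled_ols(series_list, period):
    """Stack the rows of every series and estimate one shared coefficient vector."""
    rows, targets = [], []
    for y in series_list:
        for t in range(1, len(y)):
            rows.append(design_row(t, y[t - 1], period))
            targets.append(y[t])
    beta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return beta

def forecast_next(y, beta, period):
    """One-step-ahead forecast for a single series from the shared model."""
    return float(design_row(len(y), y[-1], period) @ beta)

def simulate(start, n, period=4):
    """Toy noise-free generator (hypothetical coefficients, for illustration only)."""
    seasonal = [0.0, 1.0, -0.5, 2.0]
    y = [start]
    for t in range(1, n):
        y.append(2.0 + 0.5 * t + seasonal[t % period] + 0.3 * y[-1])
    return np.array(y)

# Three short series that differ only in their starting level.
series = [simulate(s, 24) for s in (1.0, 5.0, 10.0)]
beta = fit_pooled_ols(series, period=4)
forecasts = [forecast_next(y, beta, period=4) for y in series]
print(np.round(beta, 3))
print(np.round(forecasts, 3))
```

Because the model is linear in its coefficients, fitting reduces to one least-squares solve over the stacked rows, with no per-series numerical optimization.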

Parallel Abstract


In this thesis, we focus on two main topics related to the analysis of large collections of time series: forecasting and clustering. We start by proposing a fast and user-friendly Ordinary Least Squares Linear Regression (OLS-LR) model, which can be a good approximation for more complex methods such as Autoregressive Integrated Moving Average (ARIMA), Exponential Smoothing (ETS), and state-space models when forecasting large collections of time series. We use this OLS-LR model as the basis for several methods for forecasting and clustering time series.

The OLS-LR model can be used to forecast each time series individually and, after some pre-processing (such as data cleaning and clustering), as a single model for forecasting multiple time series. We use the OLS-LR model for forecasting many short time series. Our approach combines all the series together and uses a single estimated model to forecast each series individually. To illustrate and evaluate this approach, we model the number of first-grade students in each school in Taiwan. Using data until 2014, we developed a forecasting model for the annual number of first-grade classrooms at each school in Taiwan in 2015-2019.

Next, we adopt the OLS-LR model for forecasting hierarchical and grouped time series. Forecasting hierarchical or grouped time series involves two steps: computing base forecasts, and reconciling the forecasts so that the values of the disaggregated series add up to the corresponding aggregated values. Base forecasts can be computed by popular time series forecasting methods such as ETS and ARIMA models. The reconciliation step is a linear process that adjusts the base forecasts to ensure they are coherent. However, using ETS or ARIMA for base forecasts can be computationally challenging when there is a large number of series to forecast, as each model must be numerically optimized for each series. We propose a solution based on the OLS-LR model that avoids this computational problem and uses a single-step approach to obtain the reconciled forecasts, rather than the usual two-step approach. The proposed method adds flexibility by allowing the incorporation of external data and the handling of missing values. We illustrate our approach using two datasets: monthly Australian domestic tourism and daily Wikipedia pageviews. We compare our approach to reconciliation using ETS and ARIMA, and show that our approach is much faster while providing similar levels of forecast accuracy.

For clustering many time series, we propose a set of two new methods based on the OLS-LR model that capture temporal information (trend, seasonality, and autocorrelation) as well as domain-relevant cross-sectional attributes. The methods are based on model-based partitioning (MOB) trees and can be used as an automated yet transparent tool for clustering a large collection of time series. We propose a single-step approach and a two-step approach. The single-step method clusters series using trend, seasonality, time series lags, and domain-relevant cross-sectional attributes, using a single OLS-LR model. The two-step method first clusters by trend, seasonality, and domain-relevant cross-sectional attributes, and then further clusters the residual series by autocorrelation and the domain-relevant cross-sectional attributes. Both methods produce clusters that are interpretable by domain experts. We illustrate the usefulness of the proposed clustering approaches by considering a forecasting application to Wikipedia article pageview time series. We compare the proposed clustering approach to alternatives and show that the tree-based approach produces forecasts that are practically on par with ARIMA models fitted to the individual series, yet is significantly faster and more efficient, and is therefore suitable for scaling to large collections of time series. Moreover, our method produces simple parametric forecasting models for interpretable clusters of time series.
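
To make the reconciliation step mentioned in the abstract concrete, the sketch below shows a generic OLS reconciliation for a tiny hierarchy in which a total equals the sum of two bottom-level series. It illustrates the usual second step of the two-step approach (adjusting incoherent base forecasts by a linear projection), not the single-step method proposed in the thesis. The hierarchy, the base forecast values, and the function name are hypothetical.

```python
import numpy as np

# Summing matrix: rows are (Total, A, B); columns are the bottom-level series (A, B).
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Hypothetical base forecasts that are incoherent: 55 + 40 != 100.
base = np.array([100.0, 55.0, 40.0])

def reconcile_ols(S, base):
    """Project base forecasts onto the coherent subspace spanned by S (OLS weights)."""
    bottom, *_ = np.linalg.lstsq(S, base, rcond=None)  # least-squares bottom-level values
    return S @ bottom                                  # rebuild every level from the bottom

reconciled = reconcile_ols(S, base)
print(np.round(reconciled, 3))
```

After the projection, the reconciled total equals the sum of the reconciled bottom-level forecasts by construction, which is what "coherent" means here.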
