Applying Text Mining, Google Trends Keywords, and Least Squares Support Vector Regression to Stock Price Prediction

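The direction of the stock market has long been a popular topic of discussion in academia, so this study predicts three widely followed US market indices: the Dow Jones Industrial Average, the Nasdaq Composite, and the Russell 2000. Before the era of big data, the literature mostly forecast stock market movements from historical data or time series and gave little consideration to factors external to the market, while the commonly used technical indicators are overly abstruse; both approaches also struggle to break out of a cause-and-effect framework. In big data research, forecasting with web search volume and social networks is the current trend. In view of these market and research trends, this study uses historical data to represent internal factors and keyword search volume to represent external factors, and builds hybrid and single-source forecasts to compare the experimental results. To address the problem of choosing keywords, this study proposes two keyword selection methods. The first is manual selection of popular keywords from the Google homepage. The second is automated mining of Twitter user text, with a term extractor used to compose keywords; this is called the text-mining social network method. A three-stage experimental framework is then established. In the first stage, which uses Google Trends alone, this study finds that keywords identified from Twitter text do help prediction, but in stock market prediction the MAPE did not fall within the best range; future work could try applying the method to data from other domains. In the second stage, which combines Google Trends and historical data to predict $y_t$ and $y_{t+1}$, this study finds that when predicting $y_{t+1}$, historical data alone actually performs better, which indicates that Google Trends data is time-sensitive and cannot be applied to predicting $y_{t+1}$ or $y_{t+n}$. When predicting $y_t$, the MAPE of the hybrid of the two is better than that of the control groups in every case and is the best result among all combinations. Because the historical values cannot be obtained in advance when historical data is mixed in to predict $y_t$, we add a third-stage experiment that uses a GARCH model to estimate the historical attributes and feeds those estimates into the framework of Experiment 1 of the second stage. This study finds that, when paired with a good autoregressive model, predictions close to the best result can be obtained.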
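
For reference, MAPE in the comparisons above is the mean absolute percentage error. The following assumes the standard definition; the symbols $\hat{y}_t$ (the forecast for period $t$) and $n$ (the number of forecast periods) are introduced here for illustration and are not taken from the text:

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|
```

A lower MAPE indicates a forecast closer to the observed index values.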

Keywords

Stock market prediction; text mining; Google Trends; least squares support vector regression; Twitter

Parallel Abstract

In this study, values of three stock market indices, the Dow Jones Industrial Average, the Nasdaq Composite, and the Russell 2000, are predicted. Traditionally, time series models have been applied to forecast stock markets without considering external factors. This study uses a Least Squares Support Vector Regression (LSSVR) model with hybrid data containing historical data and Google Trends keywords to forecast stock markets. This study proposes two ways to select keywords for Google Trends. The first is the selection of popular keywords on the Google Trends homepage, and the second is based on text from Twitter. A three-stage experimental architecture is proposed to forecast stock markets, and the Autoregressive Integrated Moving Average (ARIMA) model is used to predict time series data of stock markets. Numerical results show that the proposed model is a feasible way to predict stock markets.
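
As a rough illustration of the LSSVR model on hybrid inputs, the sketch below fits the standard least squares support vector regression formulation by solving a single linear system. It is a minimal sketch under stated assumptions, not the study's implementation: the RBF kernel, the hyperparameters `gamma` and `sigma`, the function names, and the synthetic two-column feature layout (a lagged index value and a search-volume score) are all hypothetical choices made for this example.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma**2))

def lssvr_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the LSSVR system [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # bias b, dual weights alpha

def lssvr_predict(X_train, b, alpha, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i alpha_i * K(x, x_i) + b for each row of X_new."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b

if __name__ == "__main__":
    # Synthetic stand-in for hybrid data (assumption, not real market data):
    # column 0 plays the role of a lagged index value, column 1 a search-volume score.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))
    y = 100 + 2 * X[:, 0] + X[:, 1]
    b, alpha = lssvr_fit(X, y)
    print(lssvr_predict(X, b, alpha, X[:3]))
```

In this formulation the regularization parameter `gamma` trades off fit against smoothness: the training residual at each point equals `alpha_i / gamma`, so larger values fit the training data more closely. In practice the features would be scaled and the hyperparameters tuned on held-out data.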