
以排序佳化為基礎的數值疊代及計算資源分配演算法之研發

Design of Ordinal Optimization-Based Value Iteration with Computing Budget Allocation

Advisor: 張時中

Abstract


Many practical problems, such as inventory control and computer and communication networks, can be modeled as stationary Markov decision problems (StMDPs) in which, because of problem complexity and uncertainty, the transition probabilities and transition costs must be estimated by simulation. Simulation-based policy iteration (SBPI) is the typical method for solving such problems. SBPI consists of two steps, policy evaluation and policy improvement. Policy evaluation estimates the expected cost-to-go (CTG) of every state by multi-stage simulation and is the most time-consuming step of SBPI, so any reduction of simulation time must target the policy evaluation step. Simulation experiments with SBPI show that rough CTG estimates can still yield a good enough policy while saving a great deal of simulation time, and that although the estimates are rough, policy accuracy improves gradually as the iterations proceed. This observation motivated our research and the ideas behind the improved methods proposed here.

This thesis first proposes a simulation-based value iteration (SBVI) method for solving StMDPs. Compared with SBPI, the policy evaluation of one SBVI iteration simulates only a single stage per state to estimate the single-stage cost, and adds the CTG estimated in the previous iteration to form the new CTG estimate. Although such CTG estimates are rougher than those of SBPI in early iterations, they still lead to reasonably good policies while saving simulation time, and the policy accuracy gradually approaches or reaches the optimal policy as the iterations proceed. Experiments on a medium-dimension problem show that SBVI saves roughly a hundredfold in simulation time while achieving the same policy accuracy as SBPI. However, in the high-accuracy range the simulation time required by SBVI grows rapidly with the required accuracy and the problem dimension.

We further exploit the property that the ranking of near-optimal decisions is already formed while the value estimates are still rough, and use the concept of ordinal optimization to develop an ordinal optimization-based value iteration algorithm (OOBVI). Unlike SBVI, which stops the policy evaluation simulation based on the precision of the CTG estimates, OOBVI stops it based on a ranking confidence measure (APCS). APCS allows the number of simulation replications to be adjusted iteration by iteration, which further reduces the simulation needed for CTG estimation while still achieving the required policy accuracy. Simulation results on a medium-dimension problem show that OOBVI saves roughly a factor of four in simulation time while achieving the same policy accuracy as SBVI. Further experiments show that the simulation time of OOBVI grows approximately linearly with the number of states, whereas that of SBVI grows approximately exponentially; we therefore expect OOBVI to be more efficient than SBVI on large-dimension problems.

OOBVI applies the same ranking-accuracy requirement to every state during policy evaluation, but we argue that this is unnecessary. By using a variable stopping criterion for policy evaluation, this allocation of computing budget over states (CBA-S) allows OOBVI to obtain a policy accuracy close to that achieved with a high ranking-confidence requirement, while the simulation time can be expected to decrease. Simulation results on a medium-dimension problem show that the policy accuracy of OOBVI with CBA-S is about one percentage point lower than that obtained with the high ranking-confidence requirement, but the simulation time is reduced by roughly a factor of ten.

In summary, the contribution and value of this research is an improvement over SBPI, the typical method for solving StMDPs that require simulation. Based on our limited simulation experiments, we expect OOBVI with CBA-S to be the most promising method, saving a large amount of simulation time while achieving the same policy accuracy.
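As a rough illustration of the SBVI update described above (simulate one stage per state and action, then bootstrap on the previous iteration's CTG estimate), the following Python sketch runs SBVI on a small synthetic MDP. The state and action counts, cost ranges, noise level, discount factor, and replication counts are illustrative assumptions, not values from the thesis; the simulate function merely stands in for the simulator assumed by the thesis.

    # Minimal sketch of Simulation-Based Value Iteration (SBVI); the MDP below
    # is a toy stand-in, not the thesis test problem.
    import random

    N_STATES, N_ACTIONS = 5, 3
    random.seed(0)
    # Hidden ground truth, visible only to the simulator.
    P = [[[random.random() for _ in range(N_STATES)] for _ in range(N_ACTIONS)]
         for _ in range(N_STATES)]
    for s in range(N_STATES):
        for a in range(N_ACTIONS):
            z = sum(P[s][a])
            P[s][a] = [p / z for p in P[s][a]]
    COST = [[random.uniform(0, 10) for _ in range(N_ACTIONS)]
            for _ in range(N_STATES)]

    def simulate(s, a):
        """One simulated transition: (noisy stage cost, next state)."""
        c = COST[s][a] + random.gauss(0, 1.0)
        s_next = random.choices(range(N_STATES), weights=P[s][a])[0]
        return c, s_next

    def sbvi(n_iter=200, reps=20, gamma=0.9):
        """Each iteration simulates only ONE stage per (state, action) and
        bootstraps on the CTG estimate of the previous iteration."""
        ctg = [0.0] * N_STATES
        policy = [0] * N_STATES
        for _ in range(n_iter):
            new_ctg = [0.0] * N_STATES
            for s in range(N_STATES):
                q = []
                for a in range(N_ACTIONS):
                    total = 0.0
                    for _ in range(reps):                 # single-stage replications
                        c, s_next = simulate(s, a)
                        total += c + gamma * ctg[s_next]  # previous-iteration CTG
                    q.append(total / reps)
                policy[s] = min(range(N_ACTIONS), key=lambda a: q[a])
                new_ctg[s] = q[policy[s]]
            ctg = new_ctg
        return policy, ctg

    if __name__ == "__main__":
        print(sbvi())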

Abstract (English)


Many real problems, such as inventory control and computer and communication networks, can be modeled as stationary Markov decision problems (StMDPs) in which both the transition probabilities and the cost function must be estimated via computer simulation because of problem complexity and uncertainty. Simulation-based policy iteration (SBPI) is a typical method for solving such problems. SBPI consists of a sequence of policy evaluation and policy improvement steps. In the policy evaluation step, the cost-to-go (CTG) of each state is evaluated by multi-stage simulation; this is the most time-consuming step of the SBPI algorithm, so decreasing CPU time while maintaining good policy accuracy (PA) requires shortening the CPU time spent in policy evaluation. Simulation experiments with SBPI show that rough CTG estimates may still lead to a good enough policy while requiring much less simulation time; although the value estimates are rough, policy accuracy improves gradually over the iterations. This observation motivates our research and the improved algorithms proposed here.

We first propose simulation-based value iteration (SBVI) to solve StMDPs. In the policy evaluation step, instead of the multi-stage simulation of SBPI, SBVI simulates only the stage-wise cost of each state and adds the CTG estimated in the previous iteration to obtain the updated CTG value. The CTG estimates are rougher than those of SBPI in early iterations but still lead to good enough policies while requiring less simulation time, and the policy accuracy approaches or reaches the optimal policy as the iterations proceed. In a numerical study on a medium-dimension problem, SBVI saves about two orders of magnitude of simulation time compared with SBPI at the same level of policy accuracy. However, in the high-PA range the simulation time of SBVI grows rapidly with the required PA and the problem dimension.

We further exploit the property that the ranking of near-optimal decisions forms early, while the estimation accuracy is still rough, and propose ordinal optimization-based value iteration (OOBVI) using the concept of ordinal optimization (OO). OOBVI adopts ranking accuracy (APCS) rather than estimation accuracy as the criterion for stopping the policy evaluation simulation. APCS allows the number of simulation replications to be adjusted over the iterations, saving simulation time at the same desired policy accuracy. Simulation results on a medium-dimension problem show that OOBVI saves about four times the simulation time of SBVI at the same PA. Further simulations show that, to reach the same desired PA, the simulation time of OOBVI grows approximately linearly with the number of states, whereas that of SBVI grows approximately exponentially; we therefore expect OOBVI to be more efficient than SBVI for large-dimension problems.

OOBVI uses the same ranking-accuracy requirement for every state in the policy evaluation step, but we consider this unnecessary. The key idea is a variable stopping criterion over states. Combining computing budget allocation over states (CBA-S) with OOBVI can be expected to achieve nearly the PA of a high stopping criterion while clearly reducing simulation time. On a medium-dimension problem, OOBVI with CBA-S achieves almost the same PA as the high stopping criterion while saving about a factor of ten in simulation time.

In summary, the contribution and value of this research is an improvement over SBPI, the typical method for solving StMDPs that require simulation. Based on our limited simulation experiments, we expect OOBVI with CBA-S to be the most promising algorithm, achieving the same PA with far less simulation time.
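As a rough illustration of the APCS stopping rule and the CBA-S idea of varying the stopping threshold across states, the Python sketch below uses the normal-approximation form of APCS that is common in the ordinal optimization literature. The sampler, thresholds, batch size, and replication cap are illustrative assumptions rather than the settings used in the thesis.

    # Hedged sketch of APCS-based stopping: sampling of a state's Q-values stops
    # once the approximate probability that the observed best action is truly
    # best exceeds a (possibly per-state) threshold.
    import math
    import random

    def phi(x):
        """Standard normal CDF."""
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def apcs(means, variances, counts):
        """APCS >= 1 - sum over non-best actions of P(observed best loses)."""
        b = min(range(len(means)), key=lambda i: means[i])  # observed best (min cost)
        acc = 0.0
        for i in range(len(means)):
            if i == b:
                continue
            s = math.sqrt(variances[b] / counts[b] + variances[i] / counts[i])
            acc += phi((means[b] - means[i]) / s) if s > 0 else 0.0
        return 1.0 - acc

    def rank_actions(sampler, n_actions, threshold, batch=5, max_reps=500):
        """Add simulation replications until APCS clears the threshold."""
        samples = [[] for _ in range(n_actions)]
        while True:
            for a in range(n_actions):
                samples[a].extend(sampler(a) for _ in range(batch))
            means = [sum(s) / len(s) for s in samples]
            variances = [sum((x - m) ** 2 for x in s) / (len(s) - 1)
                         for s, m in zip(samples, means)]
            counts = [len(s) for s in samples]
            if apcs(means, variances, counts) >= threshold or counts[0] >= max_reps:
                return min(range(n_actions), key=lambda a: means[a]), counts[0]

    if __name__ == "__main__":
        random.seed(1)
        true_q = [3.0, 3.5, 5.0]                    # hypothetical noisy Q-values
        sampler = lambda a: true_q[a] + random.gauss(0, 1.0)
        # CBA-S idea: states deemed less critical get a looser threshold.
        for thr in (0.99, 0.90):
            best, reps = rank_actions(sampler, len(true_q), thr)
            print(f"threshold={thr}: best action {best} after {reps} reps/action")

A looser threshold generally stops after fewer replications per action, which is the saving that CBA-S exploits by assigning different thresholds to different states.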
