
Design of Ordinal Optimization and Simulation-Based Policy Iteration

Advisor: 張時中 (Shi-Chung Chang)

Abstract


The stationary Markov decision process (MDP) problem is an important class of stochastic optimal control problems, with applications in economic, computer, communication, and manufacturing systems. In real problems, the transition cost is often complex and uncertain and must be estimated by simulation; simulation-based policy iteration (SBPI) was developed to solve problems with this characteristic. SBPI consists of two steps: policy evaluation and policy improvement. Policy evaluation simulates the given control policy to estimate the one-step-analysis cost-to-go, and simulation stops only when the estimate reaches a prescribed accuracy. Policy improvement then selects and updates the best control of each state according to the estimated one-step-analysis cost-to-go, and the two steps repeat until convergence. Simulation experiments show that policy evaluation is the most time-consuming step of SBPI.

Motivated by this observation, and in order to find an optimal policy quickly, this thesis applies the concept of ordinal optimization (OO) to develop an ordinal optimization and simulation-based policy iteration algorithm (OOBPI). Unlike traditional SBPI, which in each iteration simulates until an accurate estimate of the one-step-analysis cost-to-go is obtained before selecting the best control of each state, OOBPI in each iteration simulates the admissible controls of each state only until the best control can be ranked out at a prescribed confidence level; the best controls of all states then compose the policy for the next iteration. In spirit, this method exploits the fact that the ordinal ranking of OO converges much faster than performance estimation, thereby reducing the computation time of the policy evaluation step.

Based on the convergence conditions of SBPI proposed by Cooper and Henderson, we also find a set of parameters that satisfies those conditions. Combining this with the method proposed by Professor C.-H. Chen for finding a lower bound of the satisfactory confidence level, we give a sufficient condition on the minimum confidence level to be set at each iteration so that OOBPI converges. Finally, we compare OOBPI with SBPI on a replacement problem and on the model implemented for SBPI. Under the same, reasonably good benchmark for optimal-policy accuracy, OOBPI efficiently finds a good enough policy, and as the problem dimension grows it also finds an optimal policy faster than SBPI.
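As a reading aid for the SBPI structure described above, the following minimal Python sketch shows the evaluate-then-improve loop on a toy MDP. All names and numbers here (simulate_transition, ALPHA, N_SAMPLES, the toy dynamics) are illustrative assumptions, not the thesis's model or implementation.

import random

N_STATES = 5       # assumed toy state space {0, ..., 4}
CONTROLS = [0, 1]  # assumed admissible controls for every state
ALPHA = 0.9        # assumed discount factor
N_SAMPLES = 200    # assumed simulation budget per (state, control) estimate

def simulate_transition(state, control):
    # Assumed stochastic model: return (one-stage cost, next state) by simulation.
    cost = abs(state - control) + random.random()
    next_state = random.randrange(N_STATES)
    return cost, next_state

def estimate_q(state, control, J):
    # Monte Carlo estimate of the one-step-analysis cost-to-go of (state, control)
    # given the current cost-to-go vector J.
    total = 0.0
    for _ in range(N_SAMPLES):
        cost, nxt = simulate_transition(state, control)
        total += cost + ALPHA * J[nxt]
    return total / N_SAMPLES

def policy_iteration(max_iterations=50):
    policy = [CONTROLS[0]] * N_STATES
    J = [0.0] * N_STATES
    for _ in range(max_iterations):
        # Policy evaluation: sweep the estimated cost-to-go under the current policy.
        for _ in range(50):
            J = [estimate_q(s, policy[s], J) for s in range(N_STATES)]
        # Policy improvement: pick the control with the smallest estimated cost-to-go.
        new_policy = [min(CONTROLS, key=lambda u: estimate_q(s, u, J))
                      for s in range(N_STATES)]
        if new_policy == policy:  # stop when the policy no longer changes
            break
        policy = new_policy
    return policy, J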

Parallel Abstract (English)


The stationary MDP is an important class of stochastic optimal control problems, with applications in economic, computer, communication, and manufacturing systems. In real problems, the transition cost is uncertain and complex, and it can only be estimated via Monte Carlo simulation. Simulation-based policy iteration (SBPI) has been developed to solve problems with such complex transition costs. SBPI consists of a sequence of policy evaluation and policy improvement steps. The policy evaluation step estimates the one-step-analysis cost-to-go of the given policy via simulation; simulation stops once the estimate reaches a specified accuracy level. The policy improvement step updates the policy by selecting, for every state, the best control among all admissible controls according to the estimated one-step-analysis cost-to-go, and composes the selected (state, control) pairs into an improved policy. The two steps iterate until the improved policy is identical to the current one. Simulation results show that policy evaluation is the most time-consuming step of SBPI.

Based on this observation, and in order to select an optimal policy quickly, this thesis combines the concept of OO with SBPI and proposes the OOBPI algorithm. Traditional SBPI selects the optimal control only after obtaining an accurate estimate of the one-step-analysis cost-to-go. OOBPI instead compares the relative order of the one-step-analysis cost-to-go among the admissible controls of each state up to a specified confidence level, and the improved policy is composed of the best control selected in every state. The algorithm exploits the fact that relative ranking converges much faster than performance estimation to reduce the simulation time of the policy evaluation step.

Based on the sufficient convergence conditions proposed by Cooper and Henderson, we find a set of parameters that satisfies these conditions. Combining this with the approach proposed by C.-H. Chen for finding a lower bound of the satisfactory confidence level, we give a sufficient condition on how to set the lowest satisfactory confidence level at every iteration so that OOBPI converges to an optimal policy with probability 1. Comparing OOBPI with SBPI on a replacement problem and on the model implemented for SBPI, we find that OOBPI efficiently finds a good enough policy under the same, reasonably good benchmark for optimal-policy accuracy. When the model size increases, OOBPI also finds an optimal policy in a shorter CPU time.
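The ordinal-selection step described above can be illustrated with a small sketch: sampling continues only until the observed ranking of the admissible controls of a state is trusted at a prescribed confidence level. The Python sketch below uses a generic product of pairwise normal-comparison probabilities as an approximate probability of correct selection; the names (rank_controls, samplers, batch) and this particular bound are illustrative assumptions, not the exact confidence measure derived in the thesis.

import math
import random
import statistics

def rank_controls(samplers, confidence=0.95, batch=20, max_samples=2000):
    # samplers: one zero-argument function per admissible control; each call
    # returns one simulated one-step-analysis cost sample for that control.
    data = [[draw() for _ in range(batch)] for draw in samplers]
    while True:
        means = [statistics.mean(d) for d in data]
        best = min(range(len(samplers)), key=lambda i: means[i])
        # Approximate the probability that 'best' truly has the smallest mean
        # cost as a product of pairwise normal comparison probabilities.
        p_correct = 1.0
        for i in range(len(samplers)):
            if i == best:
                continue
            se = math.sqrt(statistics.variance(data[best]) / len(data[best])
                           + statistics.variance(data[i]) / len(data[i]))
            z = (means[i] - means[best]) / max(se, 1e-12)
            p_correct *= 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
        if p_correct >= confidence or len(data[best]) >= max_samples:
            return best
        for d, draw in zip(data, samplers):  # not confident yet: sample more
            d.extend(draw() for _ in range(batch))

# Hypothetical usage: pick between two controls whose simulated costs are noisy.
best = rank_controls([lambda: random.gauss(1.0, 0.5), lambda: random.gauss(1.3, 0.5)])

In an OOBPI-style improvement step, a routine like rank_controls would be called once per state, with one sampler per admissible control, and the returned best controls would compose the policy for the next iteration.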

References


[1] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, 1994.
[3] B.-W. Hsieh, C.-H. Chen, and S.-C. Chang, "Scheduling semiconductor wafer fabrication by using ordinal optimization-based simulation," IEEE Transactions on Robotics and Automation, vol. 17, no. 5, pp. 599-608, Oct. 2001.
[4] V. Fabian, "Stochastic approximation," in Optimization Methods in Statistics, Academic Press, New York, 1971.
[5] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems, Springer-Verlag, New York, 1978.
[6] L. Dai, "Convergence properties of ordinal comparison in the simulation of discrete-event dynamic systems," Journal of Optimization Theory and Applications, vol. 91, no. 2, pp. 363-388, Nov. 1996.
