
以增強式學習法設計機台派工法則之研究

Research on Design of Machine Dispatching Policy Using Reinforcement Learning

Advisor: 張時中

Abstract


Semiconductor wafer fabrication lines are characterized by a wide variety of products, complex re-entrant process flows, machine uncertainty, customer orientation, high capital investment, and short production cycles. How to dispatch different product types to machines effectively, so that different products can be manufactured flexibly and delivered on time, remains a challenging research problem. When a machine processes wafers of different process steps, a changeover is required, and each changeover incurs a setup time. On the one hand, producing different products flexibly and on time requires adjusting the number of setups at the right moments; on the other hand, reducing the number of setups lowers work-in-process and waiting time. Re-entrant flow, in which a product visits the same machine type more than once along its route, also makes different process steps of the same product compete for the same machine, so capacity must be allocated appropriately to achieve on-time delivery and a smooth, balanced production flow. This thesis studies the dispatching problem of a single machine with setup times and the rate-switching problem of a machine with an adjustable service rate. The former focuses on the trade-off between average waiting time and the number of setups; the latter seeks the best trade-off between waiting cost and the production costs of the high and low service rates. How to decide the type of the next job to process, and when to switch the service rate, are the challenges addressed in this thesis.

Because the shop-floor environment changes over time, the dispatching policy must be adjusted continuously. This thesis applies reinforcement learning (RL) to both problems. RL can keep learning as the environment changes: by defining a reasonable reward and estimating a value function, it finds the most suitable dispatching decisions. Since dispatching decisions are made according to the current shop-floor state, we assume that the states satisfy the Markov property and approximate the dispatching problem as a continuous-time Markov decision process (MDP).

For the single machine with setup times, we solve the case of stationary job arrival rates to optimality with policy iteration. Policy iteration, however, cannot handle non-stationary problems or problems whose probabilistic characteristics are unknown. We therefore apply the Sarsa reinforcement learning algorithm [RsA98], an on-policy method in which every action follows a specific policy; it is conceptually and computationally simple and solves MDPs without requiring the system dynamics. Experiments on the stationary MDP show that, as the number of learning steps increases, the learned policy agrees with the optimal policy on 95% of the states. We then apply RL to the single-machine dispatching problem with setup times in a non-stationary environment and compare it with a random policy that selects each product type with equal probability. The results show that under non-stationary arrival rates, RL keeps the average waiting time stable around a fixed value, whereas the average waiting time under the random policy grows steadily. RL also achieves 30% more throughput and requires far fewer setups than the random policy. These results confirm that RL can solve dispatching problems that policy iteration cannot, namely MDPs with non-stationary environments. However, the learning speed of RL is not efficient, and the optimal policy in the non-stationary environment cannot be verified. When learning starts from the clearing policy proposed by Kumar and Seidman (1991), RL reduces the average waiting time as learning proceeds, compared with the clearing policy itself, which has no learning capability.

For the rate-switching problem of a machine with an adjustable service rate, we consider the trade-off between production cost and weighted waiting cost and try to find the optimal time to switch the service rate. As with the dispatching problem above, we formulate the problem as an MDP and use RL to learn the optimal switching time. The results show that more than ten million learning steps are needed to reach the optimal policy, which is not efficient. We also use policy iteration to verify the optimal switching time and examine how it depends on the parameters, including the job arrival rate and the cost of high-speed processing. The experiments show that a higher arrival rate and a lower high-speed processing cost move the switching point to a smaller number of waiting jobs. Finally, we find that when RL starts from the optimal policy, it learns the new optimal policy after the environment changes more than a thousand times faster than learning from scratch. Applying this learning approach to dispatching on a real production line requires further evaluation.
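To make the Sarsa-based dispatching idea concrete, the following is a minimal Python sketch of an on-policy (Sarsa) learning loop for a two-product single machine with setup times. The simulation model, parameter values, and reward definition here are illustrative assumptions chosen for exposition only; they are not the thesis's actual experimental settings.

```python
import random
from collections import defaultdict

# Illustrative parameters (assumed, not from the thesis).
ARRIVAL = [0.3, 0.3]      # arrival rates per product type
SERVICE_T = 1.0           # mean service time per lot
SETUP_T = 2.0             # setup time when switching product type
MAX_Q = 10                # queue truncation to keep the state space finite
ALPHA, GAMMA, EPS = 0.1, 0.95, 0.1

Q = defaultdict(float)    # Q[(state, action)] -> estimated value

def eps_greedy(state, actions):
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def step(queues, setup_type, action):
    """Serve one lot of type `action`; return next state and reward."""
    elapsed = SERVICE_T + (SETUP_T if action != setup_type else 0.0)
    # Reward: negative waiting cost accumulated over the elapsed time,
    # plus a small penalty for performing a setup.
    reward = -sum(queues) * elapsed - (1.0 if action != setup_type else 0.0)
    new_queues = list(queues)
    if new_queues[action] > 0:            # serving an empty queue just idles the machine
        new_queues[action] -= 1
    for t in range(2):                    # crude Bernoulli approximation of Poisson arrivals
        if random.random() < ARRIVAL[t] * elapsed:
            new_queues[t] = min(MAX_Q, new_queues[t] + 1)
    return tuple(new_queues), action, reward

def train(episodes=2000, horizon=200):
    for _ in range(episodes):
        queues, setup_type = (0, 0), 0
        state = (queues, setup_type)
        action = eps_greedy(state, [0, 1])
        for _ in range(horizon):
            queues, setup_type, r = step(queues, setup_type, action)
            next_state = (queues, setup_type)
            next_action = eps_greedy(next_state, [0, 1])
            # Sarsa (on-policy) temporal-difference update.
            td = r + GAMMA * Q[(next_state, next_action)] - Q[(state, action)]
            Q[(state, action)] += ALPHA * td
            state, action = next_state, next_action

if __name__ == "__main__":
    train()
    print(Q[(((3, 1), 0), 0)], Q[(((3, 1), 0), 1)])
```

Because the update always evaluates the action actually chosen by the epsilon-greedy behavior policy, the sketch follows the on-policy character of Sarsa described above; only the reward definition would need to change to reflect the thesis's weighted waiting and setup costs.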

Keywords

Dispatching, Setup, Capacity allocation, Reinforcement learning

Parallel Abstract


Semiconductor fabrication is characterized by a wide variety of products, complex re-entrant flow, machine uncertainty, customer orientation, high capital investment, and short product life cycles. Effective methods for dispatching different lot types so as to achieve production flexibility and on-time delivery still pose significant challenges to both researchers and practitioners. A setup time is incurred whenever a machine changes the type of process it performs. On the one hand, setups are needed at appropriate times to produce different lots flexibly and on time; on the other hand, reducing the number of setups lowers the work-in-process level and waiting time. Moreover, re-entrant flow causes different lot types to compete for the same machine, so machine capacity must be allocated effectively to achieve on-time delivery and a balanced production flow.

We study the dispatching problem of a single machine with setup times and the rate-switching problem of a machine with an adjustable service rate. The objective of the former is to trade off average waiting time against the number of setups, and that of the latter is to trade off waiting cost against service cost. How to choose the next product type, and when to switch the service rate, are the challenges we address. Because the environment changes over time, the dispatching policy must be adjusted continuously. We solve these problems using reinforcement learning (RL), which interacts with the environment and finds a suitable policy through a reward function and a value function. Assuming that the states have the Markov property, we formulate the dispatching problem as a continuous-time Markov decision process (MDP).

For the single machine with setup times, we use policy iteration (PI) to find the optimal policy under a stationary job arrival environment. PI, however, cannot solve non-stationary problems or problems with unknown system dynamics. We therefore apply the Sarsa RL algorithm [RsA98] to our dispatching problem. Sarsa is an on-policy method that learns the value of the policy used to make decisions; it is conceptually and computationally simple and solves MDPs without knowledge of the system dynamics. In the stationary case, RL learns a policy that agrees with the optimal policy on 95% of the states given enough learning steps. We then apply RL to a non-stationary dispatching environment and compare it with a random policy. The results show that RL stabilizes the average weighted waiting time while the random policy does not; RL also increases throughput by 30% and requires fewer setups than the random policy. These results show that RL can handle dispatching problems that PI cannot, although its learning speed is not efficient and the optimal policy in the non-stationary environment remains unknown. Moreover, starting from the clearing policy proposed by Kumar and Seidman (1991), RL achieves a lower average waiting time than the clearing policy itself.

For the adjustable-service-rate machine, we consider the trade-off between service cost and waiting cost and try to find the right time to switch the service rate. We again formulate the problem as an MDP and apply RL to solve it. The results show that RL needs about 10 million learning steps to find the optimal switching point. We also study how the switching point depends on the parameters, including the arrival rate and the cost of the high service rate: a higher arrival rate and a lower high-service-rate cost move the switching point to a smaller number of waiting jobs. Finally, RL initialized with prior knowledge learns more than 1000 times faster than RL without it. Application to dispatching on a real production line still requires further evaluation.
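As an illustration of the policy-iteration baseline for the adjustable-service-rate problem, the sketch below solves a uniformized continuous-time MDP over a truncated queue and reports the queue length at which the greedy policy switches to the high rate. All rates, costs, the discount factor, and the truncation level are assumed values chosen for illustration; they do not reproduce the thesis's experiments.

```python
import numpy as np

# Illustrative parameters (assumed, not from the thesis).
LAM = 0.7                           # job arrival rate
MU = {"low": 0.8, "high": 1.5}      # adjustable service rates
COST = {"low": 1.0, "high": 4.0}    # service cost per unit time at each rate
HOLD = 1.0                          # waiting (holding) cost per job per unit time
BETA = 0.05                         # continuous-time discount rate
N = 30                              # queue-length truncation
UNIF = LAM + max(MU.values())       # uniformization constant

def q_value(n, a, V):
    """Discounted cost of using rate `a` in state n, then following V (uniformized)."""
    cost = HOLD * n + (COST[a] if n > 0 else 0.0)
    mu = MU[a] if n > 0 else 0.0
    up, down = min(n + 1, N), max(n - 1, 0)
    stay = UNIF - LAM - mu
    ev = LAM * V[up] + mu * V[down] + stay * V[n]
    return (cost + ev) / (BETA + UNIF)

def policy_iteration():
    policy = ["low"] * (N + 1)
    V = np.zeros(N + 1)
    while True:
        # Policy evaluation: iterate the fixed-point equation for the current policy.
        for _ in range(2000):
            V = np.array([q_value(n, policy[n], V) for n in range(N + 1)])
        # Policy improvement: choose the greedy service rate in every state.
        new_policy = [min(MU, key=lambda a: q_value(n, a, V)) for n in range(N + 1)]
        if new_policy == policy:
            return policy, V
        policy = new_policy

if __name__ == "__main__":
    policy, V = policy_iteration()
    threshold = next((n for n, a in enumerate(policy) if a == "high"), None)
    print("switch to the high rate at queue length:", threshold)
```

Raising LAM or lowering COST["high"] in this sketch moves the reported threshold to a smaller queue length, which is the qualitative relationship between the parameters and the switching point described in the abstract.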

Parallel Keywords

Reinforcement learning, Setup, Dispatching, Machine allocation

References


[Bel57] R. E. Bellman, Dynamic Programming, Princeton University Press, 1957.
[Ber95] D. P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Massachusetts, 1995.
[Cha02] Y. Z. Chang, "A Learning Agent for Supervisors of Semiconductor Tool Dispatching," NTUEE Master Thesis, 2002.
[Chi94] C. Chiu, "A Learning-Based Methodology for Dynamic Scheduling in Distributed Manufacturing Systems," Ph.D. Dissertation, Purdue University, 1994.
[CWH97] D. W. Collins, K. Williams, and F. C. Hoppensteadt, "Implementation of Minimum Inventory Variability Scheduling 1-Step Ahead Policy in a Large Semiconductor Manufacturing Facility," Proceedings of the 1997 6th International Conference on Emerging Technologies and Factory Automation, 1997.

Cited By


楊鈞傑 (2012). 使用遞迴式增強學習法建立股價指數期貨交易策略 [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2012.10025
李文仙 (2011). 彈性流線型工廠排程派工法則選擇之研究-以TFT-LCD偏光板製程為例 [Master's thesis, Yuan Ze University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0009-2801201414592945
