
Simulation-based Deep Reinforcement Learning for Job Shop Scheduling Problem

Advisor: 林則孟

Abstract


This research investigates deep reinforcement learning for the job shop scheduling problem. Considering the dynamic scenario triggered by job arrival events in the system, the Deep Q Network method is adopted to minimize the mean flow time of jobs over a period of time, addressing the dynamic scheduling and centralized dispatching problem.

The reinforcement learning "environment" is constructed on a simulation basis. Discrete event simulation is combined with reinforcement learning: time is advanced by the next-event mechanism, which replaces the unknown state transition probabilities of the production system. Following the OpenAI Gym framework, state-transition interface functions are introduced into the simulation model so that it can interact with the learning agent. The agent observes the real-time state of the environment and decides on an action; the environment executes the action, performs the state transition, and returns a reward. Training samples are collected from this repeated interaction, and the reinforcement learning agent is trained on them.

This research proposes a simulation-based deep reinforcement learning method and applies it to a job shop production system. The reinforcement learning elements, including the state, action, and reward function, are first defined. A Deep Q Network based on the self-attention mechanism is then constructed; by encoding the state so as to preserve the temporal information of dynamic job arrivals, the agent learns the relationship between states and dispatching rules from past experience and explores more potentially optimal decisions.

The experimental results verify that, with appropriate designs of the reward and the neural network model, the deep reinforcement learning method learns to adjust dispatching in real time according to the system state and to adapt its decisions to different dynamic scenarios. Under dynamic job arrivals, it achieves performance exceeding the best single dispatching rule, demonstrating the strength of reinforcement learning for dynamic decision problems.
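To make the environment construction concrete, the following is a minimal Python sketch of a Gym-style environment driven by next-event discrete-event simulation, in the spirit described above. The single-machine shop, the class name JobShopSimEnv, the two dispatching rules, the exponential arrival/processing times, and the flow-time-based reward are illustrative assumptions, not the implementation from the thesis.

import heapq
import itertools
import random

class JobShopSimEnv:
    """Gym-style environment sketch: a single-machine shop with dynamic job
    arrivals, advanced by next-event discrete-event simulation. Names, the
    reward shaping, and the observation are illustrative assumptions."""

    ACTIONS = ("SPT", "FIFO")  # each action picks a dispatching rule

    def __init__(self, arrival_rate=0.5, horizon=200.0, seed=0):
        self.arrival_rate = arrival_rate
        self.horizon = horizon
        self.rng = random.Random(seed)

    def reset(self):
        self.now = 0.0
        self.tick = itertools.count()   # tie-breaker for the event list
        self.events = []                # heap of (time, tick, kind, job)
        self.queue = []                 # waiting jobs: (arrival, proc_time)
        self.busy = False
        self.flow_times = []
        self._schedule_arrival()
        return self._advance()

    def step(self, action):
        # Dispatch one waiting job with the chosen rule, then advance the
        # clock event by event to the next decision point.
        idx = (min(range(len(self.queue)), key=lambda i: self.queue[i][1])
               if self.ACTIONS[action] == "SPT" else 0)
        job = self.queue.pop(idx)
        self.busy = True
        heapq.heappush(self.events, (self.now + job[1], next(self.tick),
                                     "departure", job))
        n_done = len(self.flow_times)
        obs = self._advance()
        # Reward sketch: negative flow time of jobs completed during this
        # transition, so the return tracks (negative) total flow time.
        reward = -sum(self.flow_times[n_done:])
        return obs, reward, self.now >= self.horizon, {}

    def _schedule_arrival(self):
        t = self.now + self.rng.expovariate(self.arrival_rate)
        heapq.heappush(self.events, (t, next(self.tick), "arrival", None))

    def _advance(self):
        # Next-event time advancement: jump straight to the next event; a
        # decision point is reached when the machine is idle and jobs wait.
        while not (self.queue and not self.busy):
            self.now, _, kind, job = heapq.heappop(self.events)
            if kind == "arrival":
                self.queue.append((self.now, self.rng.expovariate(1.0)))
                self._schedule_arrival()
            else:
                self.busy = False
                self.flow_times.append(self.now - job[0])
        return (self.now, len(self.queue), int(self.busy))

A DQN agent would call reset() once per episode and then alternate step(action) with its own learning updates; each action simply names the dispatching rule applied at that decision point, matching the centralized dispatching formulation above.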

Parallel Abstract


This research applies deep reinforcement learning to the job shop scheduling problem. Considering a system with dynamic job arrivals, the Deep Q Network learning algorithm is used to minimize the mean flow time over a period of time for the dynamic scheduling and centralized dispatching problem.

The "environment" of reinforcement learning is constructed based on simulation. Using discrete event simulation, the next-event time advancement method drives state transitions without requiring the unknown transition probabilities of the production system. This study introduces OpenAI Gym compatible interfaces into the simulation model so that the environment can interact with the RL agent. When time advances to a decision point, the agent observes the state of the environment and decides on an action; the environment then transitions to the next state and feeds back a reward after executing the action. By collecting these state transitions as training samples, the RL agent is trained from experience.

This research proposes a simulation-based deep reinforcement learning approach for the job shop scheduling problem. First, the RL elements, such as the state, action, and reward, are defined; then a Deep Q Network with a self-attention module is constructed. By encoding the state with the temporal information about the dynamic arrival of jobs, the agent is able to learn the relationship between states and dispatching rules from past experience and to explore better decisions. The experimental results verify that, with appropriate designs of the reward function and neural network architecture, the deep reinforcement learning dynamic scheduling method can outperform traditional dispatching methods.
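As an illustration of the network described above, the following is a hedged PyTorch sketch of a Q-network that encodes the set of waiting jobs with self-attention, using a learned positional encoding to retain arrival-order information. All dimensions, layer choices, and names (AttentionQNetwork, job_feat, n_rules) are assumptions for illustration, not the architecture from the thesis.

import torch
import torch.nn as nn

class AttentionQNetwork(nn.Module):
    """Sketch of a Q-network with self-attention over the set of waiting
    jobs. A learned positional encoding keeps the arrival-order (temporal)
    information; all sizes and layers are illustrative assumptions."""

    def __init__(self, job_feat=4, d_model=32, n_heads=4, n_rules=6,
                 max_jobs=64):
        super().__init__()
        self.embed = nn.Linear(job_feat, d_model)
        self.pos = nn.Parameter(torch.randn(1, max_jobs, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(),
                                  nn.Linear(64, n_rules))

    def forward(self, jobs, pad_mask=None):
        # jobs: (batch, n_jobs, job_feat); pad_mask: True at padded slots.
        x = self.embed(jobs) + self.pos[:, : jobs.size(1)]
        x, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        if pad_mask is not None:
            x = x.masked_fill(pad_mask.unsqueeze(-1), 0.0)
            pooled = x.sum(1) / (~pad_mask).sum(1, keepdim=True).clamp(min=1)
        else:
            pooled = x.mean(1)       # pool job embeddings into one vector
        return self.head(pooled)     # one Q-value per dispatching rule

# Example: Q-values for a batch of 2 states, each with 10 waiting jobs.
q = AttentionQNetwork()(torch.randn(2, 10, 4))   # shape: (2, 6)

Pooling the attended job embeddings into a single vector keeps the Q-value head independent of how many jobs are currently waiting, which is what allows the same network to be queried at every decision point as the queue grows and shrinks.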

