
Job Shop Scheduling with Generative Adversarial Imitation Learning

Advisor: 劉建良

Abstract


Job shop scheduling has long been one of the most prominent problems in production scheduling. The problem requires that the operations of each job be executed in a given order and on specific machines, and these constraints make job shop scheduling considerably harder to solve than other scheduling problems. Early work mostly solved the problem with mathematical programming, but the solution time of this approach grows exponentially as the problem size increases, making it difficult to use in practice; heuristic algorithms are used instead to obtain near-optimal solutions within an acceptable amount of time. With the development of computer science, DeepMind's AlphaGo, trained with reinforcement learning, defeated professional players at Go, and this year AlphaStar won matches in a game with a high-dimensional action space. These real-world achievements of reinforcement learning give us a different option for solving the problem. Reinforcement learning requires long offline training, but online it can react to the state of the environment and make decisions quickly based on past experience; the greatest difficulty in training is defining the reward for interactions in a complex environment. This thesis solves the job shop scheduling problem with generative adversarial imitation learning (GAIL), a model derived from reinforcement learning that combines the advantages of generative adversarial networks and imitation learning, so that we do not need to define a reward for job selection in job shop scheduling as previous reinforcement learning approaches did; instead, the discriminator of the generative adversarial network provides the reward directly. The purpose of this thesis is to verify the performance of the GAIL model on the job shop scheduling problem and to compare it with other models.
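To make the constraints described above concrete, the following is a minimal sketch in Python (not the environment used in this thesis; the instance data and function names are illustrative only) of a tiny job shop instance: each operation is bound to a specific machine, operations within a job must run in order, and the makespan of a given dispatching order follows from those two constraints.

    # A tiny 2-job x 2-machine instance: each operation is (machine_id, processing_time),
    # and the operations of a job must be executed in the listed order.
    jobs = [
        [(0, 3), (1, 2)],   # job 0: machine 0 for 3 units, then machine 1 for 2 units
        [(1, 4), (0, 1)],   # job 1: machine 1 for 4 units, then machine 0 for 1 unit
    ]
    num_machines = 1 + max(m for job in jobs for m, _ in job)

    def makespan(dispatch_order):
        """dispatch_order is a sequence of job ids; each occurrence schedules that job's
        next unscheduled operation as early as the two constraints allow."""
        next_op = [0] * len(jobs)           # index of the next operation of each job
        job_ready = [0] * len(jobs)         # time at which each job's previous operation ends
        machine_ready = [0] * num_machines  # time at which each machine becomes free
        for j in dispatch_order:
            machine, proc_time = jobs[j][next_op[j]]
            start = max(job_ready[j], machine_ready[machine])
            job_ready[j] = machine_ready[machine] = start + proc_time
            next_op[j] += 1
        return max(job_ready)

    print(makespan([0, 1, 0, 1]))  # makespan of one feasible dispatching order
    print(makespan([1, 0, 1, 0]))  # another feasible order for comparison

A scheduling policy in this setting chooses, at each decision point, which job's next operation to dispatch; the quality of the resulting schedule is measured by objectives such as the makespan above.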

Parallel Abstract


Job Shop Scheduling Problems (JSSPs) have been among the most notable problems in production scheduling for decades. The restrictions that the operations of a job must follow a given order and that each operation must be processed on a specific machine make JSSPs difficult to solve. JSSPs can be formulated as constrained optimization problems, which explains why much previous research has used mathematical programming to obtain optimal solutions. However, mathematical programming easily becomes intractable on large-scale instances, so it is difficult to apply to most real-world problems. On the other hand, meta-heuristic algorithms make it possible to search for sub-optimal solutions, or even optimal solutions in some problems, within acceptable time limits. Advances in deep learning have produced promising results in many domains. For example, DeepMind developed AlphaGo with deep reinforcement learning (DRL) and defeated professional players at Go; furthermore, AlphaStar won eSports competitions in high-dimensional action games this year. These achievements inspire us to use DRL to deal with JSSPs. Although training a DRL agent takes a long time, the trained agent can make quick decisions in different environment states based on its experience when applied to the problem. When applying DRL to JSSPs, one of the challenges is to define rewards in such complicated environments; without well-designed rewards, the model cannot explore the solution space well. In this thesis, we solve JSSPs with Generative Adversarial Imitation Learning (GAIL), which is extended from reinforcement learning and combines the characteristics of Generative Adversarial Networks and Imitation Learning. The reward in GAIL is returned directly by the discriminator. The purpose of this thesis is to verify the performance of GAIL on JSSPs and to compare it with other models.
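As a rough illustration of the reward mechanism described above, the sketch below (assuming PyTorch; the network sizes, shapes, and names are illustrative, not the thesis's actual model) shows one common way a GAIL discriminator can be trained on expert versus policy state-action pairs and then queried to produce the reward signal in place of a hand-crafted reward.

    # Assumed framework: PyTorch. A minimal GAIL-style discriminator-as-reward sketch.
    import torch
    import torch.nn as nn

    class Discriminator(nn.Module):
        """Scores a state-action pair; trained so that expert pairs score high."""
        def __init__(self, state_dim, action_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, 1),
            )

        def forward(self, states, actions):
            return self.net(torch.cat([states, actions], dim=-1))  # raw logits

    def discriminator_loss(disc, expert_s, expert_a, policy_s, policy_a):
        """Binary cross-entropy: expert pairs are labeled 1, policy pairs 0."""
        bce = nn.BCEWithLogitsLoss()
        expert_logits = disc(expert_s, expert_a)
        policy_logits = disc(policy_s, policy_a)
        return (bce(expert_logits, torch.ones_like(expert_logits)) +
                bce(policy_logits, torch.zeros_like(policy_logits)))

    def gail_reward(disc, states, actions):
        """One common convention: reward = -log(1 - D(s, a)), which is large when the
        discriminator believes the pair came from the expert demonstrations."""
        with torch.no_grad():
            prob_expert = torch.sigmoid(disc(states, actions))
            return -torch.log(1.0 - prob_expert + 1e-8)

The policy, trained with a policy-gradient method as in the original GAIL formulation, then treats gail_reward as the environment reward at each scheduling decision, so no hand-designed reward for job selection is required.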

