Training Neural Networks with Advantageous Conditions: A Case Study on Connect6

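DeepMind's Go program AlphaGo achieved a landmark result by defeating the Korean professional 9-dan player Lee Sedol. DeepMind subsequently proposed a general algorithm, AlphaZero, which not only surpassed AlphaGo at Go but was also trained successfully on chess and shogi. However, in addition to an effective algorithm, DeepMind relied on enormous computing resources to reach top-level strength in a game as complex and varied as Go. The game studied in this research is Connect6, invented by Professor I-Chen Wu of the Department of Computer Science, National Chiao Tung University. Connect6 mitigates the first-player advantage found in Gomoku and has higher complexity, and compared with Go it has some known advantageous strategies. Our experiments use the AlphaZero algorithm and apply some of these known strategies to neural network training in order to reduce the Monte Carlo tree search space and avoid useless positions. We expect this to yield more effective features and to improve training efficiency under limited hardware resources. For Connect6, we try two methods to improve neural network training. The first restricts the area of legal moves. The idea comes from the established result in Connect6 that the side playing away from the fighting area will lose, so during training we restrict moves to the 3×3 grid around existing stones, so that the network learns close-contact play more quickly. The second applies domain knowledge: based on the characteristics of Connect6, we define winning moves and defensive moves, and we modify the expansion step of Monte Carlo tree search so that a node is expanded with priority when it is identified as a winning or defensive move. Besides reducing unnecessary search, this lets the network learn attacking and defensive behavior. Both methods were successful. In the experimental data, both restricting the move area and applying domain knowledge produced neural network models that surpass the original version with less training time and a smaller amount of training. The model trained with domain knowledge performed notably well and achieved a fairly high win rate. We therefore infer that training neural networks with advantageous conditions is quite feasible.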

Keywords

Computer games; Connect6; Monte Carlo method; neural networks; deep learning

Parallel Abstract

DeepMind developed AlphaGo, a computer Go program that beat the South Korean professional Go player Lee Sedol. Soon afterwards, DeepMind introduced a more general algorithm, AlphaZero, which not only outperformed AlphaGo at Go but also achieved great success in chess and shogi. However, besides an effective algorithm, DeepMind used huge computing resources to master the game of Go. This research focuses on developing a program for Connect6, a game proposed by Professor I-Chen Wu. Connect6 is similar to Gomoku, but it is more complex and eliminates the first player's advantage. In contrast to Go, Connect6 has some known advantageous strategies. Our experiments apply some of these strategies within the AlphaZero algorithm when training neural networks, in order to reduce the Monte Carlo tree search space and avoid invalid boards. We expect this approach to let the neural networks learn better features and to improve training efficiency with limited hardware. For Connect6, we add two methods to neural network training. First, because it has been proved in Connect6 that playing breakaway moves loses the game, this research limits the valid moves to the 3×3 grid areas around the existing stones and trains a new model that acquires the breakaway-prevention feature. Second, this research applies domain knowledge. Connect6 has some known advantageous moves; this research focuses on threat moves and defensive moves and preferentially expands them in Monte Carlo tree search. This reduces the search space and lets the neural networks acquire attack and defense features. The experiments show that both limiting the area of valid moves and applying domain knowledge are feasible: these methods achieve a higher win rate with less training time and tree search time. Based on the results, we infer that the new methods are better than the original ones.
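
The first method, limiting valid moves to the 3×3 areas around existing stones, can be illustrated with a minimal Python sketch. It is not the implementation used in this research. The function name `restricted_moves`, the board encoding (0 for empty, non-zero for a stone), the 19×19 board size, and the fallback to the center point on an empty board are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

EMPTY = 0  # assumed encoding: 0 = empty, any other value = a stone

Move = Tuple[int, int]


def restricted_moves(board: List[List[int]], radius: int = 1) -> Set[Move]:
    """Return the empty cells within `radius` steps (in any of the 8
    directions) of at least one stone; radius=1 gives a 3x3 window."""
    size = len(board)
    stones = [(r, c) for r in range(size) for c in range(size)
              if board[r][c] != EMPTY]
    if not stones:
        # Assumed fallback: on an empty board, offer only the center point.
        return {(size // 2, size // 2)}

    moves: Set[Move] = set()
    for r, c in stones:
        for dr in range(-radius, radius + 1):
            for dc in range(-radius, radius + 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < size and 0 <= nc < size
                if inside and board[nr][nc] == EMPTY:
                    moves.add((nr, nc))
    return moves


# Usage: one stone in the middle of an assumed 19x19 board.
board = [[EMPTY] * 19 for _ in range(19)]
board[9][9] = 1
candidates = restricted_moves(board)
print(len(candidates))  # 8 empty neighbors around the single stone
```

In a tree search, a set like `candidates` would replace the full list of empty cells when a node is expanded, which is one plausible way to shrink the search space as described above.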