Raiju is a reinforcement learning experiment for Halite, an AI programming competition where bots face off in a game with "a branching factor billions of times higher than that of Go".

https://github.com/k15z/raiju

Introduction

Halite is a game played on variable-size boards with a variable number of players who fight for control of the board. Each square has a randomly generated production value, and when a player's piece sits still on a square, it gains that square's production in strength every turn it remains there. In addition, the board wraps around, so pieces that move up off the top of the board reappear at the bottom.

The key takeaway here is that with variable-size boards, we can't simply feed the entire board into a neural network and expect good results. Instead, at each timestep, we look at the game from the perspective of each individual piece we control and choose an action for that piece.

The pros of this approach are that it works for boards of any size and that it naturally handles the wrap-around effect: we just take an N-by-N box centered on our current piece, where N can be any number. If N > M, where M is the size of the board, we can simply "extend" the board by placing copies of it to the right, below, and to the bottom-right. Another pro is that this approach is, in essence, a neural-network-controlled cellular automaton, and cellular automata are known to be able to model a diverse range of situations.
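
As a concrete illustration, here is a minimal sketch of how such a wrap-around window might be extracted, assuming the board is stored as a NumPy array with production, strength, and owner channels (the helper name is ours):

    import numpy as np

    def extract_window(board, cy, cx, n=36):
        """Return an n-by-n window of `board` centered on (cy, cx).

        `board` is an (H, W, C) array; mode='wrap' makes the indices wrap
        around the torus, and when n exceeds the board size the same squares
        repeat, which "extends" the board with copies of itself.
        """
        half = n // 2
        rows = np.arange(cy - half, cy - half + n)
        cols = np.arange(cx - half, cx - half + n)
        return board.take(rows, axis=0, mode='wrap').take(cols, axis=1, mode='wrap')

    # Example: a random 30x30 board with 3 channels, piece at (0, 0).
    board = np.random.rand(30, 30, 3)
    state = extract_window(board, 0, 0, n=36)   # shape (36, 36, 3)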

The con of this approach is that we implicitly assume pieces should move independently of one another, conditioned on the current state of the board. Without more data, we can't determine what effect this has on our player's effectiveness, but the model is sufficient for now; we will relax this constraint later by allowing pieces to communicate.

Dueling DQN + SARSA

We start by setting up a simple dueling deep Q-network (DQN). Our state matrix has shape (36, 36, 3) and contains the production, strength, and player_id of the 36x36 region centered on the current cell. We feed this state matrix through two convolutional layers - (4, 4, 8) and (8, 8, 16), i.e. kernel size followed by filter count - with ReLU activations, before applying 2x2 max pooling to produce a hidden representation of shape (18, 18, 16). We flatten this hidden representation and use it as the input to two separate fully-connected layers.

The first fully-connected layer produces a single output value - the value/reward function. The second fully-connected layer produces five output values - the advantage/action function for each of the five actions, STAY, LEFT, RIGHT, UP, DOWN. We add the value/reward to each of the five outputs to produce the final estimate of Q.

Note: An excellent explanation of the architecture of a dueling deep Q-network as well as the intuition behind it can be found on the Torch blog at http://torch.ch/blog/2016/04/30/dueling_dqn.html.
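
For concreteness, here is a minimal Keras sketch of the network described above; the layer names, 'same' padding (needed to keep the 36x36 spatial size before pooling), optimizer, and loss are our own assumptions, while the shapes and layer sizes follow the description.

    from tensorflow import keras
    from tensorflow.keras import layers

    NUM_ACTIONS = 5  # STAY, LEFT, RIGHT, UP, DOWN

    state = layers.Input(shape=(36, 36, 3), name="state")

    # Two convolutional layers; 'same' padding keeps the 36x36 spatial size.
    x = layers.Conv2D(8, (4, 4), padding="same", activation="relu")(state)
    x = layers.Conv2D(16, (8, 8), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)   # -> (18, 18, 16)
    x = layers.Flatten()(x)

    # Dueling heads: a scalar state value and one advantage per action.
    value = layers.Dense(1, name="value")(x)
    advantage = layers.Dense(NUM_ACTIONS, name="advantage")(x)

    # Q = V + A, with V broadcast across the five actions. (The canonical
    # dueling DQN also subtracts the mean advantage for identifiability.)
    q_values = layers.Lambda(lambda va: va[0] + va[1], name="q_values")([value, advantage])

    model = keras.Model(inputs=state, outputs=q_values)
    model.compile(optimizer="adam", loss="mse")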

Now, we perform offline reinforcement learning by running a dozen matches on randomly generated boards. We extract (state, action, reward, state', action') tuples from the game logs, where the reward is defined as the total strength of all of our pieces minus the total strength of all enemy pieces. We use the standard SARSA update rule and train our network for a few epochs before running more matches and repeating.

Note: Although the winner of a game is determined by territory (the total number of pieces), except in the case of ties, we achieved slightly better results by defining the reward in terms of total strength, which is strongly correlated with territory but may not lead to the same strategies.
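
A rough sketch of one training pass over the extracted tuples, using the SARSA target Q(s, a) <- r + gamma * Q(s', a'). The replay arrays and batch settings here are hypothetical; gamma starts at 0.5, as noted in the Modifications section below.

    import numpy as np

    GAMMA = 0.5  # discount factor (later raised to 0.9, see Modifications)

    def sarsa_targets(model, states, actions, rewards, next_states, next_actions):
        """Build Q-value targets with the SARSA rule: Q(s, a) <- r + gamma * Q(s', a').

        Only the entry for the action actually taken is changed; the other
        actions keep the network's current estimates so their error is zero.
        """
        targets = model.predict(states, verbose=0)          # (batch, 5)
        q_next = model.predict(next_states, verbose=0)      # (batch, 5)
        idx = np.arange(len(actions))
        targets[idx, actions] = rewards + GAMMA * q_next[idx, next_actions]
        return targets

    # One offline pass over tuples extracted from the replays:
    # targets = sarsa_targets(model, s, a, r, s2, a2)
    # model.fit(s, targets, epochs=3, batch_size=64)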

ZBot3 (Raiju) vs ZBot3 (Raiju)

After just two simulate-train cycles, our simple DQN has learned to easily crush the random player provided in the Python starter kit, and it creates some interesting visualizations when pitted against itself.

Modifications

Next, we make some minor modifications to help enhance our model. We start by concatenating an additional variable to the hidden representation in our neural network which represents the maximum time remaining in the game, allowing our network to create different behaviors for the opening, middlegame, and endgame. We also carefully increase our discount factor from its original value of 0.5 to 0.9 to emphasize long-term rewards, keeping a close eye on the results to make sure our model doesn't diverge.
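
Concretely, the concatenation might look like this (the full stack is repeated for context; the input name and the idea of normalizing the turns remaining to [0, 1] are our own):

    from tensorflow import keras
    from tensorflow.keras import layers

    NUM_ACTIONS = 5

    state = layers.Input(shape=(36, 36, 3), name="state")
    time_left = layers.Input(shape=(1,), name="time_left")  # normalized turns remaining

    x = layers.Conv2D(8, (4, 4), padding="same", activation="relu")(state)
    x = layers.Conv2D(16, (8, 8), padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)

    # Concatenate the time-remaining scalar onto the hidden representation.
    x = layers.Concatenate()([x, time_left])

    value = layers.Dense(1)(x)
    advantage = layers.Dense(NUM_ACTIONS)(x)
    q_values = layers.Lambda(lambda va: va[0] + va[1])([value, advantage])

    model = keras.Model(inputs=[state, time_left], outputs=q_values)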

Communication

After retraining our model, we try to weaken our earlier independence assumption by adding inter-piece communication. The simplest way to let pieces communicate is to append the previous move taken at each square to our state matrix, creating a new matrix of shape (36, 36, 4). This means that at each timestep, each piece can see what all of the nearby pieces chose to do during the previous timestep.
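
A minimal sketch of how the extra channel might be assembled (the channel ordering and move encoding are assumptions on our part):

    import numpy as np

    # Encode the previous move at each square as an integer plane:
    # 0 = STAY/none, 1 = LEFT, 2 = RIGHT, 3 = UP, 4 = DOWN.
    def build_state(production, strength, owner, prev_moves):
        """Stack the four (36, 36) planes into a (36, 36, 4) state matrix."""
        return np.stack([production, strength, owner, prev_moves], axis=-1)

    # Example with dummy planes:
    planes = [np.zeros((36, 36)) for _ in range(4)]
    state = build_state(*planes)   # shape (36, 36, 4)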

Think of this like honey bees "dancing" to communicate - it slows them down a little and makes them less efficient at collecting nectar individually but makes them more successful as a group. Although this severely constrains the amount of information that each piece can communicate, we choose to implement this rather than create a dedicated message-passing channel because it is finals week and I have more important things to do than rewrite the built-in Halite game logger.

References