Training an Unbeatable Tic-Tac-Toe AI using Reinforcement Learning

Introduction

The Goal

In this article, we will train a neural network to play the classic game of tic-tac-toe. The network will be trained using a methodology known as reinforcement learning, one of the main machine learning paradigms. Our code will be written in Python and leverage the Ray RLlib package for its rich feature support and industry-grade performance. By the end, we will have created an AI model that never loses and runs in real-time on any platform supporting the Open Neural Network Exchange (ONNX) file format.

This article assumes you are comfortable with Python and have a decent handle on machine learning concepts. There are plenty of great resources to learn all about machine learning, neural networks, gradient descent, etc. If you find that I am referencing something you don't understand, a quick Google search should help you out. If you don't have Python up and running, a Google search can also help you with that. Optionally, you can check out my article to get started.

Our Approach

To create our unbeatable tic-tac-toe AI, we will need to:

  • Design an environment in which two agents can play tic-tac-toe by taking actions and receiving reward
  • Write scaffolding code to manage training, hyperparameter optimization, and checkpoint handling
  • Identify the best model and export to a transferable format (ONNX)
  • Evaluate model performance to ensure it does not lose

Our environment will be built using PettingZoo, a library capable of representing generalized multi-agent RL problems. The model itself is a neural network implemented in PyTorch. However, we won't be directly calling PyTorch APIs to train the model. Instead, we will use the Ray RLlib library to train the model for us, converting network outputs into actions and using subsequent rewards to update the weights.
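To get a feel for how a PettingZoo environment is driven, here is a short sketch using the library's built-in tic-tac-toe environment (tictactoe_v3) with two players choosing random legal moves. This is only for illustration; exact observation contents and method signatures can vary between PettingZoo versions, and it is not the environment we will ultimately train against.

```python
from pettingzoo.classic import tictactoe_v3

# Two random players take turns until the game ends.
env = tictactoe_v3.env()
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None  # PettingZoo expects None once the game is over
    else:
        mask = observation["action_mask"]              # only legal squares
        action = env.action_space(agent).sample(mask)  # pick a random legal move
    env.step(action)

env.close()
```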

We will also use other parts of the Ray ecosystem to monitor training and allocate resources (compute time) to the most promising combinations of hyperparameters. Once we have a desirable model, we will export the PyTorch neural network to ONNX format. Finally, we can test the model to confirm that it never loses.
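For the export step, PyTorch's built-in torch.onnx.export function does the heavy lifting. The sketch below shows the general shape of that call; the network architecture, input size (a flattened 3x3 board with two planes), and file name are placeholders, since the real network will come out of our RLlib training run.

```python
import torch
import torch.nn as nn

# Stand-in for the trained policy network; the real one comes from the RLlib
# checkpoint. The input size (18) and output size (9 possible moves) are
# assumptions about how the board observation will be encoded.
policy_net = nn.Sequential(nn.Linear(18, 64), nn.ReLU(), nn.Linear(64, 9))
policy_net.eval()

example_obs = torch.zeros(1, 18)  # dummy observation used to trace the network

torch.onnx.export(
    policy_net,
    example_obs,
    "tictactoe_policy.onnx",
    input_names=["obs"],
    output_names=["action_logits"],
)
```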

Is this Overkill?

Readers familiar with the Minimax algorithm may question the need for machine learning in this situation. By implementing the minimax algorithm we can easily build an unbeatable tic-tac-toe program. Make two of these programs face off against each other, and every move will be the best possible move — resulting in a draw. This solution requires no machine learning, is lightweight, and examples can be found all over the internet.

Why then should we attempt to solve the problem with reinforcement learning? Our solution will be (marginally) slower than minimax, and will take a considerably longer time to develop. The answer is two-fold:

  1. We want to learn about reinforcement learning and how to apply it in Python using the Ray RLlib library. Choosing a simple application such as tic-tac-toe makes it easier to stay focused on what we want to learn.
  2. Implementing the minimax algorithm may not be feasible when the complexity of the game grows.

When evaluating which move to make next, the minimax algorithm iterates through all possible moves each player could make, alternating turns until it encounters game-ending conditions. By doing this, it builds a map of every outcome that could possibly arise from the next move and uses this map to pick the move that guarantees the best possible outcome, assuming the opponent also plays optimally.

In a relatively simple game like tic-tac-toe, it is quick to iterate through all possible outcomes given any move. Any game lasts at most 9 moves before the board fills up and a tie is reached, and if symmetry is eliminated, there are only 765 essentially different board positions in the entire game. This is easily handled by a recursive function, and there are many examples on the internet of how to implement this.
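Purely for comparison (this is not part of our RL solution), here is a minimal sketch of that recursive approach. It assumes the board is a plain list of nine cells containing 'X', 'O', or None, with scores reported from X's perspective.

```python
# Winning triples of board indices (rows, columns, diagonals).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, otherwise None."""
    for a, b, c in LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move) with score from X's perspective:
    +1 if X can force a win, -1 if O can, 0 if best play is a draw."""
    w = winner(board)
    if w == "X":
        return 1, None
    if w == "O":
        return -1, None
    if all(cell is not None for cell in board):
        return 0, None  # board full: draw
    best_move = None
    best_score = -2 if player == "X" else 2
    for i in range(9):
        if board[i] is None:
            board[i] = player
            score, _ = minimax(board, "O" if player == "X" else "X")
            board[i] = None
            if (player == "X" and score > best_score) or \
               (player == "O" and score < best_score):
                best_score, best_move = score, i
    return best_score, best_move

# Optimal play from an empty board always ends in a draw (score 0).
score, move = minimax([None] * 9, "X")
```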

But what about a game like chess, where a player has many more options on their turn? American mathematician Claude Shannon estimated that there are 10^120 possible chess games — more than the number of atoms in the observable universe! Obviously it is not feasible to compute every game-end state in this situation. Modifications to the naive minimax implementation, such as limiting how far you look into the future before deciding which move to make, avoid a fully exhaustive search.

While we could explore these modified versions of minimax, reinforcement learning could also provide a viable solution. The simple game of tic-tac-toe serves as an opportunity to get our feet wet, familiarizing ourselves with RL in preparation for more complex applications.

Reinforcement Learning Crash Course

Before we dive in, it would be helpful to understand the gist of reinforcement learning. So what is it? Reinforcement learning teaches an agent how to make decisions that maximize the reward signal it receives from the environment. Let's break that down:

  1. One or more agents live in an environment
  2. Agents observe things in their environment and use this information to make decisions based on a policy
  3. The environment rewards agents based on their actions (reward can be negative)
  4. Agents update their policy based on the reward they receive
  5. This cycle of observations, actions, rewards, and updates continues until we reach a stopping condition (a code sketch of this loop follows below)
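In code, the loop above looks roughly like the following single-agent sketch using the Gymnasium API. The CartPole-v1 environment and random actions are purely for illustration; a real agent would choose actions with its policy, and our tic-tac-toe setup will use PettingZoo's multi-agent version of this loop.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")            # any environment works for illustration
observation, info = env.reset(seed=0)

episode_over = False
while not episode_over:
    action = env.action_space.sample()   # a trained agent would query its policy here
    observation, reward, terminated, truncated, info = env.step(action)
    # a learning agent would use (observation, action, reward) to update its policy
    episode_over = terminated or truncated

env.close()
```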

The policy of an agent determines how it makes decisions, and is ultimately what we are trying to train. This is the part that is represented by a neural network. As with any good challenge, there are a few difficulties that we must overcome when training an agent with RL:

  1. The reward for an action may not arrive immediately. The agent must be able to understand how actions taken now affect its chances of receiving rewards in the future.
  2. Consider an agent whose policy is good, but could be better. The agent must decide whether to stick with the strategy it knows produces acceptable reward, or explore modifications to its strategy at the risk of missing out on reward it could have received. This tension is known as the exploration-exploitation trade-off; one simple way to handle it is sketched below.
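One simple, widely used heuristic for that trade-off is epsilon-greedy action selection: usually exploit the best-known action, but occasionally explore a random one. The sketch below is illustrative only; the algorithms mentioned next handle exploration with their own mechanisms.

```python
import random

def epsilon_greedy(action_values, epsilon=0.1):
    """With probability epsilon take a random action (explore); otherwise take
    the action the agent currently believes is best (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])

# The agent currently believes action 2 is best, but will still
# occasionally try the others.
chosen = epsilon_greedy([0.1, 0.5, 0.9, 0.2])
```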

Thankfully, machine learning researchers and engineers have developed algorithms and strategies that address these issues. Some popular RL algorithms include Proximal Policy Optimization (PPO), Deep Q Networks (DQN), Soft Actor Critic (SAC), and IMPALA. For more details on RL and commonly used algorithms, take a glance at this paper.