Training an Unbeatable Tic-Tac-Toe AI using Reinforcement Learning

Validation and Exporting

Why Test our Model?

As with other areas of machine learning, it is important to have a metric of performance that is not derived from the training data. In supervised and unsupervised learning, we would set aside a small portion of our dataset before training begins. This way we can evaluate how well our model generalizes to new situations. If the model performs poorly on the test set, it may have overfit.

In reinforcement learning, we don't have an explicit dataset like we do in other ML paradigms. Our training data comes from simulating an environment and is inherently built into the training process. However, we can still devise methods to evaluate how our model performs without relying entirely on training statistics.

If we ever plan on using this model for practical purposes, we will need to extract it from the Ray ecosystem. Users are not likely to install and configure the software stack we used to train the model (Python, CUDA, etc.). Thankfully, PyTorch provides an API for exporting to the Open Neural Network Exchange (ONNX) file format. ONNX is supported by a wide variety of execution environments (desktop, web browser, etc.), making it ideal for our use case.

Once we have exported our model, we can test it against a variety of tic-tac-toe games to see how it performs. Keep track of the results, and we have the statistics we want.

Exporting to ONNX

First things first, we need to export our model. As it turns out, the agent we trained with RLlib inherits from the standard PyTorch class nn.Module. This allows us to leverage PyTorch's API to export directly to the ONNX format.

But there's a catch. RLlib's new API is still a little hazy on some of the details. While Ray's RLModule inherits from PyTorch's nn.Module, for some reason it does not support torch.onnx.export. After much experimentation, I figured out that we need to:

  • Inspect the weights file (module_state.pt) created by Ray when it saved the checkpoints during training.
  • Write our own PyTorch class that mimics the architecture we infer from the weights file.
  • Load the weights into an instance of our custom class.
  • Export to ONNX.

To begin, we first need to identify where the weights file was saved. In my case, the file was located at:

results/My-Experiment/d0b80b71/checkpoint_000011/learner_group/learner/rl_module/pX/module_state.pt

The trial name (d0b80b71) and checkpoint number will likely be different if you ran training on your own machine. As you may have guessed, this is the weights file for agent X. For agent O, simply change pX to pO. Once you have located the state file, restore the RLModule and inspect its state dict:

from os import PathLike

import numpy as np
from ray.rllib.core.rl_module import RLModule


def show_model_interior(state_folder: str | PathLike) -> None:
    module = RLModule.from_checkpoint(state_folder)
    state_dict = module.get_state(inference_only=True)
    print('Model State Dict:')
    for name in state_dict:
        data: np.ndarray = state_dict[name]
        print(f'{name}\t{data.shape}\t{type(data)}')
    # Show the class hierarchy and the module's forward method / architecture.
    print(type(module).__mro__)
    print(module.forward)
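
For example, with the checkpoint path above, the call would look something like this (I am assuming from_checkpoint wants the pX directory that contains module_state.pt; adjust the trial name and checkpoint number to match your own run):

show_model_interior(
    'results/My-Experiment/d0b80b71/checkpoint_000011'
    '/learner_group/learner/rl_module/pX'
)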

You should see something like this:

Model State Dict:
encoder.actor_encoder.net.mlp.0.weight (256, 9) <class 'numpy.ndarray'>
encoder.actor_encoder.net.mlp.0.bias (256,) <class 'numpy.ndarray'>
encoder.actor_encoder.net.mlp.2.weight (256, 256) <class 'numpy.ndarray'>
encoder.actor_encoder.net.mlp.2.bias (256,) <class 'numpy.ndarray'>
pi.net.mlp.0.weight (10, 256) <class 'numpy.ndarray'>
pi.net.mlp.0.bias (10,) <class 'numpy.ndarray'>
(<class 'rl_module.ActionMaskingTorchRLModule'>, <class 'rl_module.ActionMaskingRLModule'>, <class 'ray.rllib.algorithms.ppo.torch.ppo_torch_rl_module.PPOTorchRLModule'>, <class 'ray.rllib.core.rl_module.torch.torch_rl_module.TorchRLModule'>, <class 'torch.nn.modules.module.Module'>, <class 'ray.rllib.algorithms.ppo.ppo_rl_module.PPORLModule'>, <class 'ray.rllib.core.rl_module.rl_module.RLModule'>, <class 'ray.rllib.utils.checkpoints.Checkpointable'>, <class 'abc.ABC'>, <class 'object'>)
<bound method TorchRLModule.forward of ActionMaskingTorchRLModule(
  (encoder): TorchActorCriticEncoder(
    (actor_encoder): TorchMLPEncoder(
      (net): TorchMLP(
        (mlp): Sequential(
          (0): Linear(in_features=9, out_features=256, bias=True)
          (1): ReLU()
          (2): Linear(in_features=256, out_features=256, bias=True)
          (3): ReLU()
        )
      )
    )
    (critic_encoder): TorchMLPEncoder(
      (net): TorchMLP(
        (mlp): Sequential(
          (0): Linear(in_features=9, out_features=256, bias=True)
          (1): ReLU()
          (2): Linear(in_features=256, out_features=256, bias=True)
          (3): ReLU()
        )
      )
    )
  )
  (pi): TorchMLPHead(
    (net): TorchMLP(
      (mlp): Sequential(
        (0): Linear(in_features=256, out_features=10, bias=True)
      )
    )
  )
  (vf): TorchMLPHead(
    (net): TorchMLP(
      (mlp): Sequential(
        (0): Linear(in_features=256, out_features=1, bias=True)
      )
    )
  )
)>

We are interested in the actor_encoder and pi, which are used for inference. The actor encoder contains two linear layers with ReLU activations. The first layer accepts 9 inputs, which correspond to the 9 squares of the tic-tac-toe board. The second layer has 256 outputs, which are fed into pi. Pi narrows this down to 10 output features, which correspond to the 10 actions our agent had during training. We can create a PyTorch class that mimics this setup:

  • In the __init__ method, we define our layers to mimic the names created by RLlib.
  • In the forward method, we define how these layers connect. Our forward method takes two inputs: the observations x and the action mask y. The line logit[y == 0] = FLOAT_MIN ensures that we do not select an invalid action.
  • At the very end, we apply the softmax function to receive a probability for each action. We then select the action with the highest probability.
import torch
from torch import nn
from ray.rllib.utils.torch_utils import FLOAT_MIN


class ModelExport(nn.Module):
    def __init__(self):
        super().__init__()
        # Layer names mirror the RLlib state dict so the weights load cleanly.
        self.actor_encoder = nn.Sequential(
            nn.Linear(in_features=9, out_features=256, bias=True),
            nn.ReLU(),
            nn.Linear(in_features=256, out_features=256, bias=True),
            nn.ReLU(),
        )
        self.pi = nn.Sequential(
            nn.Linear(in_features=256, out_features=10, bias=True),
        )
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x, y):
        # x: the 9-square board observation, y: the 10-entry action mask.
        x = self.actor_encoder(x)
        logit = self.pi(x)
        logit[y == 0] = FLOAT_MIN  # rule out any action the mask forbids
        weights: torch.Tensor = self.softmax(logit)
        idx = torch.argmax(weights)
        return idx
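
Before loading the trained weights, it can be worth a quick shape check with a dummy input. This snippet is my own addition, not part of the original code: it pushes an empty board and an all-ones mask through an untrained ModelExport and should come back with a single action index.

model = ModelExport()
board = torch.zeros(9)    # an empty board: all 9 squares unclaimed
mask = torch.ones(10)     # all 10 actions allowed
with torch.no_grad():
    action = model(board, mask)
print(action)             # a tensor holding a single index between 0 and 9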

Now we are ready to create the ONNX file using some helper functions. You can find the full code in the repository:

folder = construct_state_path(
    checkpoint_dir='Random-First-Move/d0b80b71/checkpoint_000011',
    policy_name='pX',
)
replacement_map = {
    'encoder.actor_encoder.net.mlp': 'actor_encoder',
    'pi.net.mlp': 'pi',
}
model = create_torch_model(ModelExport(), folder, replacement_map)
onnx_export(model, 'exports/model-X.onnx')
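
For reference, here is a rough sketch of what create_torch_model and onnx_export might boil down to. The function names and arguments come from the snippet above, but the bodies are my own approximation rather than the repository's exact code: rename the keys in the RLlib state dict using the replacement map, load them into our class, and trace the result with torch.onnx.export.

import torch
from ray.rllib.core.rl_module import RLModule


def create_torch_model(model, state_folder, replacement_map):
    # Pull the trained weights out of the RLlib checkpoint.
    module = RLModule.from_checkpoint(state_folder)
    state_dict = module.get_state(inference_only=True)
    # Rename keys like 'encoder.actor_encoder.net.mlp.0.weight' to
    # 'actor_encoder.0.weight' so they match our ModelExport class.
    renamed = {}
    for name, value in state_dict.items():
        for old, new in replacement_map.items():
            name = name.replace(old, new)
        renamed[name] = torch.from_numpy(value)
    # strict=False skips the critic weights, which ModelExport does not use.
    model.load_state_dict(renamed, strict=False)
    return model.eval()


def onnx_export(model, path):
    # Dummy inputs for tracing: an empty board and an all-ones action mask.
    board = torch.zeros(9)
    mask = torch.ones(10)
    torch.onnx.export(
        model, (board, mask), path,
        input_names=['obs', 'action_mask'],
        output_names=['action'],
    )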

You should now have a file named model-X.onnx in your exports/ folder. Yay!
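
As a quick sanity check (my own addition, assuming the onnxruntime package is installed and that the model was exported with the input names from the sketch above), you can load the file in Python and ask it for a move on an empty board:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('exports/model-X.onnx')
board = np.zeros(9, dtype=np.float32)   # empty board
mask = np.ones(10, dtype=np.float32)    # every action allowed
# The input names must match the ones passed to torch.onnx.export.
outputs = session.run(None, {'obs': board, 'action_mask': mask})
print(outputs[0])                       # the square agent X wants to claim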

Our Recursive Test Algorithm

To test the model, I devised a recursive function that iterates through every possible game of tic-tac-toe that the model allows. For those unfamiliar with recursion, I will leave you with the classic computer science joke: "To understand recursion, you must first understand recursion" and a link to the Wikipedia article.

We start with an empty board, alternating turns until the game is over. Sounds like a normal game, right? The difference is, when our algorithm takes a turn, it copies the game and tries every possible move. On the next turn, the AI agent is asked to respond to each of these games. The process continues until all games are terminated.

This has the potential to grow quickly. Say we are testing agent X. On our turn, we now have 8 different games — one for each empty square. In each of those 8 games, X makes a move, leaving us with 6 empty squares. We now have 8*6=48 games. If we assume the game does not end early, this gives us 8*6*4*2=384 games. If we are testing agent O, we will start with 9 games, giving us 9*7*5*3*1=945 possibilities.

However, we will likely not reach every possible sequence of moves. The agent only ever plays its single preferred move in each position, and because it has been trained to win, many games terminate early. This is by design: it ensures we are not putting the model in a situation it would never allow. If our model is good, it will never let the opponent set up two ways to win, so it would be misleading to claim that our model loses in that situation, since it would never be in that situation in the first place.
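
To make the idea concrete, here is a self-contained sketch of the recursive enumeration. The TicTacToe class, the agent_move stub, and test_all_games below are illustrative rather than the repository's exact code; in the real test, agent_move feeds the board and action mask to the exported ONNX model instead of picking the first legal square.

WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
    (0, 4, 8), (2, 4, 6),              # diagonals
]


class TicTacToe:
    """Minimal game state: squares 0-8, 'X' always moves first."""

    def __init__(self):
        self.board = [None] * 9
        self.current_player = 'X'

    def copy(self):
        clone = TicTacToe()
        clone.board = self.board.copy()
        clone.current_player = self.current_player
        return clone

    def legal_moves(self):
        return [i for i, square in enumerate(self.board) if square is None]

    def play(self, square):
        self.board[square] = self.current_player
        self.current_player = 'O' if self.current_player == 'X' else 'X'

    def winner(self):
        for a, b, c in WIN_LINES:
            if self.board[a] is not None and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return None

    def is_over(self):
        return self.winner() is not None or not self.legal_moves()


def agent_move(game):
    # Placeholder: the real version queries the exported ONNX model with the
    # board and action mask, and returns the index the model selects.
    return game.legal_moves()[0]


def test_all_games(game, agent_plays_x, results):
    """Recursively play out every opponent line against the agent."""
    if game.is_over():
        results.append(game.winner() or 'draw')
        return
    agents_turn = (game.current_player == 'X') == agent_plays_x
    if agents_turn:
        # The agent answers each position with its single chosen move.
        game.play(agent_move(game))
        test_all_games(game, agent_plays_x, results)
    else:
        # The opponent branches: copy the game and try every empty square.
        for square in game.legal_moves():
            branch = game.copy()
            branch.play(square)
            test_all_games(branch, agent_plays_x, results)


results = []
test_all_games(TicTacToe(), agent_plays_x=False, results=results)
print(len(results), 'games finished')

With agent_plays_x=False the opponent plays X and branches first, which is the 945-game case described above; agent_plays_x=True gives the 384-game case.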

Results

Using this code, I found that agent X is unbeatable, while agent O loses in three situations.

To see what is happening, we can print the move sequences that lead to agent O losing (the GIF below shows an animation of the first one):

  • [4, 0, 1, 7, 5, 3, 2, 8, 6]
  • [4, 0, 5, 3, 1, 7, 2, 8, 6]
  • [4, 0, 5, 3, 7, 2, 1]
[Animation of flaw 1]

The model misses an opportunity to win, choosing to block us instead. This only occurs because X deliberately does not prevent O from having a winning move: on turn 6, X claims square 2 instead of square 6. Agent O never encounters this kind of behavior during training because it was trained against agent X, which would never make such a mistake, so it does not realize it can win and blocks the opponent instead.

So our models are (almost) unbeatable. I will leave it as an exercise for the reader to try and eliminate these three flaws in agent O.

Demonstration

Since we exported the model to ONNX, it can be used from any programming language that supports an ONNX runtime. ONNX Runtime even runs in the browser (ONNX Runtime Web), so I created a demonstration where you can test your skills against the AI we created! You can find it here.

That's it for now! Thanks for following along. Be sure to look at the repository for the full source code.