Approximate Nash Equilibrium by Regret Matching

Two-player zero-sum games can be solved with a very intuitive algorithm called Regret Matching [1]. Players adjust their action probabilities according to so-called regrets (also known as advantages), which can be thought of as the consequences of alternative choices. For a good overview of the topic, see the friendly yet detailed introduction by Neller and Lanctot [2].
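
As a toy illustration of the update rule (a minimal sketch with arbitrarily chosen numbers): if the cumulative regrets of three actions are 2, -1 and 3, the negative entry is ignored and the next strategy mixes the remaining actions in proportion 40%/0%/60%.

import torch

cum_regret = torch.tensor([2., -1., 3.])   # hypothetical cumulative regrets of three actions
positive = cum_regret.clamp(min=0)         # ignore actions we do not regret having skipped
strategy = positive / positive.sum()       # play each action proportionally to its positive regret
print(strategy)                            # tensor([0.4000, 0.0000, 0.6000])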

The following code snippet demonstrates an efficient PyTorch implementation for two-player zero-sum finite games and tests it on the game Rock-Paper-Scissors with Crash. In this variant, when scissors crashes against rock, the loss is doubled. This shifts the equilibrium in favour of “paper”: to keep the opponent indifferent, paper must be played twice as often as rock or scissors, giving the exact proportions 25%/50%/25%. Remarkably, regrets also provide an estimate of the approximation error, which vanishes under mild conditions so that the averaged strategies approach the exact equilibrium [3]. This is demonstrated in the snippet as well.

import torch

## Game: Rock-Paper-Scissors with Crash; the equilibrium is 25%/50%/25%

N_ACTIONS = 3 # 0: rock, 1: paper, 2: scissors
PAYOFF = torch.tensor([[0,-1,2], [1,0,-1], [-2,1,0]], dtype=torch.float).cuda() # gain of row player / loss of col player

## Utils

def getStrategy(cfr):
  """Return a strategy corresponding to observed regrets
  Args:
    cfr (array): counter-factual regret, shape (N_ACTIONS,)
  Returns:
    weights (array): strategy, shape (N_ACTIONS,)
  """
  weights = torch.clip(cfr,0,torch.inf)                  # keep only positive regrets
  weights = weights/torch.sum(weights)                   # normalise; gives NaN if all regrets are non-positive
  N_ACTIONS = weights.shape[0]
  weights = torch.nan_to_num(weights,nan=1/N_ACTIONS)    # fall back to the uniform strategy in that case
  return weights

#@torch.jit.script
@torch.compile(mode='max-autotune')
def getEquilibrium(PAYOFF, N_ITER:int = 500, stochastic_advantage:bool=False):
  # auxiliary variables
  N_ACTIONS = PAYOFF.shape[0]
  cumCFR1 = torch.zeros(N_ACTIONS,).cuda()
  cumCFR2 = torch.zeros(N_ACTIONS,).cuda()
  cumStrategy1 = torch.zeros(N_ACTIONS,).cuda()
  cumStrategy2 = torch.zeros(N_ACTIONS,).cuda()
  strategy1 = torch.ones(N_ACTIONS,).cuda()/N_ACTIONS
  strategy2 = torch.ones(N_ACTIONS,).cuda()/N_ACTIONS
  # training loop
  for _ in range(N_ITER):
    # sample actions and observe regrets
    if stochastic_advantage:
      # a) stochastic variant, often implemented in tutorials
      action1 = torch.multinomial(strategy1,num_samples=1).squeeze()
      action2 = torch.multinomial(strategy2,num_samples=1).squeeze()
      cfr1 = PAYOFF[:,action2]-PAYOFF[action1,action2]
      cfr2 = - (PAYOFF[action1,:]-PAYOFF[action1,action2])
    else:
      # b) averaged variant
      PAYOFF_avg = strategy1.view(1,-1).mm(PAYOFF).mm(strategy2.view(-1,1))
      cfr1 = (PAYOFF.mm(strategy2.view(-1,1))-PAYOFF_avg).squeeze()
      cfr2 = (strategy1.view(1,-1).mm(PAYOFF)-PAYOFF_avg).squeeze() * (-1)
    # track cumulated regrets and the strategies that produced them
    cumCFR1 += cfr1
    cumCFR2 += cfr2
    cumStrategy1 += strategy1
    cumStrategy2 += strategy2
    # update strategies proportionally to the positive part of cumulated regrets
    strategy1 = getStrategy(cumCFR1)
    strategy2 = getStrategy(cumCFR2)

  # averaged strategies converge to Nash Equilibrium
  avgStrategy1 = cumStrategy1/cumStrategy1.sum()
  avgStrategy2 = cumStrategy2/cumStrategy2.sum()
  # estimate approximation error (upper bound): if both average regrets are at most eps/2,
  # the averaged strategies form an eps-equilibrium (folk theorem)
  eps = 2*torch.max(cumCFR1.max(),cumCFR2.max())/N_ITER
  torch.cuda.synchronize()
  return (avgStrategy1,avgStrategy2,eps)

getEquilibrium(PAYOFF) # eps < 0.03
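
As a sanity check (an illustrative sketch, not part of the original snippet), the averaged strategies can be compared with the analytic 25%/50%/25% equilibrium, and the folk-theorem guarantee can be verified numerically: the best pure deviations of both players gain at most eps in total.

## Sanity check: analytic equilibrium and exploitability bound
avgStrategy1, avgStrategy2, eps = getEquilibrium(PAYOFF)
print(avgStrategy1, avgStrategy2, eps)            # both strategies close to [0.25, 0.50, 0.25]
value = avgStrategy1 @ PAYOFF.mv(avgStrategy2)    # expected payoff of the averaged strategy pair
gain1 = PAYOFF.mv(avgStrategy2).max() - value     # best pure deviation of the row player
gain2 = value - (avgStrategy1 @ PAYOFF).min()     # best pure deviation of the column player
assert gain1 + gain2 <= eps + 1e-4                # exploitability bounded by eps (small numerical slack)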

Regrets are expensive to compute in large games, but they can be approximated with modern machine-learning techniques. This approach has recently found many applications, including solvers for poker and even larger card games [4,5].
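
A minimal sketch of this idea, purely as an illustration: a small network regresses regrets from an encoding of the information state, and strategies are then obtained by regret matching on the predicted regrets. Here STATE_DIM, the architecture and the regression step are assumptions for the sketch, not the actual design of the Deep CFR solvers cited below.

import torch
import torch.nn as nn

STATE_DIM = 16                                   # assumed size of the information-state encoding

regret_net = nn.Sequential(                      # predicts one regret value per action
  nn.Linear(STATE_DIM, 64), nn.ReLU(),
  nn.Linear(64, N_ACTIONS),
)
optimizer = torch.optim.Adam(regret_net.parameters(), lr=1e-3)

def fit_regrets(states, regret_targets):
  """One regression step: fit the network to regret targets collected from game traversals."""
  loss = nn.functional.mse_loss(regret_net(states), regret_targets)
  optimizer.zero_grad()
  loss.backward()
  optimizer.step()
  return loss.item()

def predicted_strategy(state):
  """Regret matching on top of the predicted regrets, reusing getStrategy from above."""
  with torch.no_grad():
    return getStrategy(regret_net(state))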

  1. Hart S, Mas-Colell A. A Simple Adaptive Procedure Leading to Correlated Equilibrium. Econometrica. 2000:1127-1150. doi:10.1111/1468-0262.00153
  2. Neller TW, Lanctot M. An Introduction to Counterfactual Regret Minimization. Teaching materials, 2013. Accessed 2023. https://www.ma.imperial.ac.uk/~dturaev/neller-lanctot.pdf
  3. Waugh K. Abstraction in Large Extensive Games. University of Alberta Libraries. 2009. doi:10.7939/R3CM74
  4. Brown N, Lerer A, Gross S, Sandholm T. Deep Counterfactual Regret Minimization. In: Proceedings of the 36th International Conference on Machine Learning (ICML); 2019.
  5. Adams D. The Feasibility of Deep Counterfactual Regret Minimisation for Trading Card Games. AI 2022: Advances in Artificial Intelligence. 2022:145-160. doi:10.1007/978-3-031-22695-3_11
