Background: Multi Armed Bandits (MAB) are a method of choosing the best action from a bunch of options. In order to choose the best action there are several problems to solve. These are:

  1. How do you know what action is "best"?
  2. What if the "best" action changes over time? How do you know it's changed?
  3. How do I balance choosing the best action with collecting new information?

Purpose: The purpose of this post is to explain what a Multi Armed Bandit (MAB) is, and how it solves those problems.

MAB Setup Requirements

In it's essence, MAB is just A/B testing in an intelligent and automated way. Because it is like A/B testing it has very similar requirements for setup, but because it is automated it often requires defining these items with a bit more precision. Let's start with what we need to have before we can use a MAB.

Things to Define The things needed to set up new actions are the same as needed to set up an A/B test:

  1. What the action is (ie send an email)
  2. When selected, what should be done (ie use x template to send automated email to y customer)
  3. What is the success criteria
  4. What is the metric

Most of the above is business information that business stakeholders decide in conjunction with technical teams. The tricky piece is all metrics must be comparable to each other. This is easy if it's all emails and you just want to measure click through rate, but sometimes comparisons are harder. A common approach is to define each of these in terms of dollars (if possible).

Example:Which is better: A 10% click through email rate, or 2% phone answering rate? Without defining it in terms of dollars (profit or revenue), how do you know which is better for the business?

Terminology

Before we go further there are 2 terms we need to understand:

  • Exploitation: This simply means we are going to pick the action we think is best based on the metric
  • Exploration: This means we are not going to pick the best action, but rather we are going to survey other action. One particular action may change in effectiveness over time and we want to spend some amount of time exploring so we can stay on top of these changes.

The Minimal Implementation

Let's describe what the simplest implementation would look like. The point of this is to understand the core of what a MAB is. Once we understand that we can look at all the 'tricks' that are used to make it fit a particular business and work better.

First we will do 2 things:

  1. What percent of the time will be exploiting vs exploring (see terminology section above)
  2. Calculate metrics for all actions

Then loop through this repeatedly:

  1. Pick action (Using probability to define whether we are exploiting or exploring)
  2. Update data with results when available
  3. Recalculate metrics using new data
  4. Repeat

This is the simplest MAB we can create. Let's look at what we need to do to make it 'smarter' and fit a business.

Enhancements & Customization

Now that we understand what a MAB is, let's look at how to make it smarter. Which of these we do is problem dependent, but these are the key ideas to consider.

1. Use segments

Rather than calculate metrics over the entire universe of customers.

Different segments of customers may behave differently. The "best" action for 1 group of customers may be different than the "best action" for a different group of customers. Instead of treating all customers equal, break them into segments so we can personalize actions more.

Example:One segment could be "people over 65 that live in a cold climate area that has opened at least 1 email in the past"

2. Use smarter exploration methods

Rather than picking exploration actions based on random chance use statistics to determine which need more sampling

The most common 2 methods of defining exploration:

  • Thompson Sampling: Choose best action based on random belief based on the distributions rather than just the metric. 1 action may be better overall but based on a random sample in the distribution a 'less good' action may be better than a 'best action' based on distribution overlap.
  • Upper Confidence Bound: Choose based on how confident you are in your measurements. The is no need to explore a category you have high confidence in (ie very stable with lots of data).

3. Incorporate business logic

There are 2 main places this can be added

Action Selection

Sometimes not all actions are appropriate or we want to modify the probability an action is selected based on specific customer behavior. Let's look at a couple examples of how we can weave in business logic.

Example:If no phone number on file all actions related to email cannot be selected This is the simplest example of a business rule. It could be that overall sending an email is the best action for customers, but we wouldn't want to choose an email action if we have no way to reach out to them.

Example:If have customer has not answered the last 10 phone calls, reduce call action probabilities by 10% This is a simple example. Making a phone call may be the best way overall to make a sale, but we have information on this specific customer that they do not answer the phone. We may still want to try on occasion, but less often for this specific customer.

Example:If a customer has bought X products before, remove Y offer from consideration This is where it's important to incorporate human expertise and knowledge of a business in. The best technical solutions don't ignore human expertise, but rather enhance human expertise and help scale that expertise to more customers.

Action Execution

You can have some generic actions such as "Send Offer Email" and insert business logic there. For example "Send Offer Email" may send template 1 with cruise destinations for landlocked states or template 2 with resort destinations for people in ocean border states.

4. Deep Learning

This is rarely the best starting point and requires a great deal of expertise. Rather than picking actions based on metric you could pick an action based on predicted outcomes using an output of NN (meaning NN needs to be updated in real time).

Using deep learning can often improve accuracy quite a bit, but it comes with a lot of negatives as well.

Pros:

  • Increased accuracy

Cons:

  • Harder to change, add, and remove actions
  • Less "Interpretable", meaning it is harder to understand rationale behind what/why certain actions are considered "best"
  • Much longer to set up and much harder to maintain over time
  • Higher risk of hidden biases

Think carefully before going this route! Using deep learning for this is an incredibly powerful approach in the right situations, but those situations are not as common as many people believe.

Code it up!

Here we will code the simplest example so we can see a skeleton of how it works and get you started on the path of implementing one for your use-case.

import pandas as pd
from collections import defaultdict 
import numpy as np
import math, statistics, random

First let's load and take a look at our data. The dataset we will be using is website latency. We want to pick the website at each given moment with the lowest latency.

Note: This dataset is easy because it’s very easy to have comparable metrics. We don’t have to worry about comparing phone calls to emails, we have the same universal metric for each action.

df = pd.read_csv('univ-latencies/univ-latencies.txt')
df.head()
acu-edu acadiau-ca adrian-edu agnesscott-edu aims-edu uni-freiburg-de alfred-edu alvernia-edu alverno-edu american-edu ... williams-edu wsc-nodak-edu winona-msus-edu wpi-edu wright-edu yale-edu yu-edu yorku-ca upenn-edu ens-fr
0 396 381 488 506 333 1327 132 70 456 121 ... 220 1898 434 125 304 94 460 347 532 429
1 271 261 488 504 276 1084 89 23 409 34 ... 263 1032 294 74 269 252 98 265 233 293
2 271 141 325 545 266 1078 86 27 837 33 ... 409 891 292 77 288 41 78 261 114 300
3 268 136 324 1946 331 1342 88 24 531 35 ... 361 421 298 76 228 42 153 266 322 532
4 273 136 321 549 290 1192 143 26 434 32 ... 982 522 296 73 764 41 98 262 234 271

5 rows × 760 columns

So first, we need to define out metrics and how we will update the metrics. As new data comes in we need to always keep track of the data so we can determine which is the best action and which we need to explore periodically.

class Metrics:
    def __init__(self):
        self.values = defaultdict(lambda: [])
        self.metrics = dict()
    
    def update_metrics(self):
        '''calculate the metrics for each action'''
        self.metrics = dict() # Reset metrics
        for key in self.values.keys(): # For each action
            self.metrics[key] = statistics.mean(self.values[key]) # average the latencies for our metric
      
    def add_data(self,series):
        for key in series.keys(): # for each action
            self.values[key].append(series[key]) # Append the new data 
            if len(self.values[key]) > 7: self.values[key].pop(0) # only keep last 7 values - older data is no longer relevant
        self.update_metrics() # Update metrics using recent data

Next, we need the actual Multi-Armed Bandit that can select actions based on the metrics the previous class calculates. Let's build this now.

class MAB:
    def __init__(self,exploration_prob):
        self.exploration_prob = exploration_prob # User defined
        self.actions = pd.DataFrame(columns=['action','latency','min_latency','regret'])
    
    def perform_action(self,m):
        
        # Identify the "best" action based on the smallest average latency
        ideal_action = min(m.metrics, key=m.metrics.get) 
        
        # Based on exploration probabiliity choose either random exploration or choose the "best action"
        if np.random.rand() < self.exploration_prob: action = random.choice(list(m.metrics.keys()))
        else: action = ideal_action
            
        # Store results
        idx = mab.actions.index.max()+1
        if math.isnan(idx): idx = 0
        self.actions.loc[idx] = [action,m.metrics[action],m.metrics[ideal_action],m.metrics[action]-m.metrics[ideal_action]]
    

We can now use those to run through the entire dataset and choose actions. Regret measures the difference between the latency we chose and the latency of the best choice, which helps us understand the cost of our exploration.

m = Metrics() # Initialize Metrics
mab = MAB(0.15) # Initionalize our MAB with a 15% exploration probabilty
for i in range(len(df)): # for each point in time
    m.add_data(series=df.iloc[i]) # Add Data
    mab.perform_action(m) # Select an action and calculate results

When we look at the results we see that the "best" action is not always the same but we are pretty good about selecting the best action anyway. We do this by sampling options to monitor how all the actions change over time through constant exploration (15% of the time). The regret column measures the difference between the latency of the option we selected and the ideal option which we can use to measure and tune the cost of our exploration.

mab.actions.sample(10)
action latency min_latency regret
502 uoregon-edu 417 17 400
698 pace-edu 17 17 0
808 fitnyc-edu 13 13 0
954 fitnyc-edu 10 10 0
81 stjohns-edu 17 17 0
947 fitnyc-edu 10 10 0
29 wcupa-edu 15 15 0
604 cmu-edu 70 18 52
573 pace-edu 18 18 0
920 fitnyc-edu 11 11 0

Conclusion

I hope after reading this post you have an good technical foundation in what a multi-armed bandit is and how to code a very simple one. I hope in addition to that you understand what are some of the challenges to a succesful implementation and full solution. I did not code up the variations that make this approach more powerful, but with this as a foundation and a little research you now should have all the tools you need to get started.