Background: Multi Armed Bandits (MAB) are a method of choosing the best action from a bunch of options. In order to choose the best action there are several problems to solve. These are:

How do you know what action is "best"?
What if the "best" action changes over time? How do you know it's changed?
How do I balance choosing the best action with collecting new information?

Purpose: The purpose of this post is to explain what a Multi Armed Bandit (MAB) is, and how it solves those problems.

MAB Setup Requirements

In it's essence, MAB is just A/B testing in an intelligent and automated way. Because it is like A/B testing it has very similar requirements for setup, but because it is automated it often requires defining these items with a bit more precision. Let's start with what we need to have before we can use a MAB.

Things to Define The things needed to set up new actions are the same as needed to set up an A/B test:

What the action is (ie send an email)
When selected, what should be done (ie use x template to send automated email to y customer)
What is the success criteria
What is the metric

Most of the above is business information that business stakeholders decide in conjunction with technical teams. The tricky piece is all metrics must be comparable to each other. This is easy if it's all emails and you just want to measure click through rate, but sometimes comparisons are harder. A common approach is to define each of these in terms of dollars (if possible).

Example:Which is better: A 10% click through email rate, or 2% phone answering rate? Without defining it in terms of dollars (profit or revenue), how do you know which is better for the business?

Terminology

Before we go further there are 2 terms we need to understand:

Exploitation: This simply means we are going to pick the action we think is best based on the metric
Exploration: This means we are not going to pick the best action, but rather we are going to survey other action. One particular action may change in effectiveness over time and we want to spend some amount of time exploring so we can stay on top of these changes.

The Minimal Implementation

Let's describe what the simplest implementation would look like. The point of this is to understand the core of what a MAB is. Once we understand that we can look at all the 'tricks' that are used to make it fit a particular business and work better.

First we will do 2 things:

What percent of the time will be exploiting vs exploring (see terminology section above)
Calculate metrics for all actions

Then loop through this repeatedly:

Pick action (Using probability to define whether we are exploiting or exploring)
Update data with results when available
Recalculate metrics using new data
Repeat

This is the simplest MAB we can create. Let's look at what we need to do to make it 'smarter' and fit a business.

Enhancements & Customization

Now that we understand what a MAB is, let's look at how to make it smarter. Which of these we do is problem dependent, but these are the key ideas to consider.

1. Use segments

Rather than calculate metrics over the entire universe of customers.

Different segments of customers may behave differently. The "best" action for 1 group of customers may be different than the "best action" for a different group of customers. Instead of treating all customers equal, break them into segments so we can personalize actions more.

Example:One segment could be "people over 65 that live in a cold climate area that has opened at least 1 email in the past"

2. Use smarter exploration methods

Rather than picking exploration actions based on random chance use statistics to determine which need more sampling

The most common 2 methods of defining exploration:

Thompson Sampling: Choose best action based on random belief based on the distributions rather than just the metric. 1 action may be better overall but based on a random sample in the distribution a 'less good' action may be better than a 'best action' based on distribution overlap.
Upper Confidence Bound: Choose based on how confident you are in your measurements. The is no need to explore a category you have high confidence in (ie very stable with lots of data).

3. Incorporate business logic

There are 2 main places this can be added

Action Selection

Sometimes not all actions are appropriate or we want to modify the probability an action is selected based on specific customer behavior. Let's look at a couple examples of how we can weave in business logic.

Example:If no phone number on file all actions related to email cannot be selected This is the simplest example of a business rule. It could be that overall sending an email is the best action for customers, but we wouldn't want to choose an email action if we have no way to reach out to them.

Example:If have customer has not answered the last 10 phone calls, reduce call action probabilities by 10% This is a simple example. Making a phone call may be the best way overall to make a sale, but we have information on this specific customer that they do not answer the phone. We may still want to try on occasion, but less often for this specific customer.

Example:If a customer has bought X products before, remove Y offer from consideration This is where it's important to incorporate human expertise and knowledge of a business in. The best technical solutions don't ignore human expertise, but rather enhance human expertise and help scale that expertise to more customers.

Action Execution

You can have some generic actions such as "Send Offer Email" and insert business logic there. For example "Send Offer Email" may send template 1 with cruise destinations for landlocked states or template 2 with resort destinations for people in ocean border states.

4. Deep Learning

This is rarely the best starting point and requires a great deal of expertise. Rather than picking actions based on metric you could pick an action based on predicted outcomes using an output of NN (meaning NN needs to be updated in real time).

Using deep learning can often improve accuracy quite a bit, but it comes with a lot of negatives as well.

Pros:

Increased accuracy

Cons:

Harder to change, add, and remove actions
Less "Interpretable", meaning it is harder to understand rationale behind what/why certain actions are considered "best"
Much longer to set up and much harder to maintain over time
Higher risk of hidden biases

Think carefully before going this route! Using deep learning for this is an incredibly powerful approach in the right situations, but those situations are not as common as many people believe.

Code it up!

Here we will code the simplest example so we can see a skeleton of how it works and get you started on the path of implementing one for your use-case.

import pandas as pd
from collections import defaultdict 
import numpy as np
import math, statistics, random

First let's load and take a look at our data. The dataset we will be using is website latency. We want to pick the website at each given moment with the lowest latency.

Note: This dataset is easy because it’s very easy to have comparable metrics. We don’t have to worry about comparing phone calls to emails, we have the same universal metric for each action.

df = pd.read_csv('univ-latencies/univ-latencies.txt')
df.head()

So first, we need to define out metrics and how we will update the metrics. As new data comes in we need to always keep track of the data so we can determine which is the best action and which we need to explore periodically.

class Metrics:
    def __init__(self):
        self.values = defaultdict(lambda: [])
        self.metrics = dict()
    
    def update_metrics(self):
        '''calculate the metrics for each action'''
        self.metrics = dict() # Reset metrics
        for key in self.values.keys(): # For each action
            self.metrics[key] = statistics.mean(self.values[key]) # average the latencies for our metric
      
    def add_data(self,series):
        for key in series.keys(): # for each action
            self.values[key].append(series[key]) # Append the new data 
            if len(self.values[key]) > 7: self.values[key].pop(0) # only keep last 7 values - older data is no longer relevant
        self.update_metrics() # Update metrics using recent data

Next, we need the actual Multi-Armed Bandit that can select actions based on the metrics the previous class calculates. Let's build this now.

class MAB:
    def __init__(self,exploration_prob):
        self.exploration_prob = exploration_prob # User defined
        self.actions = pd.DataFrame(columns=['action','latency','min_latency','regret'])
    
    def perform_action(self,m):
        
        # Identify the "best" action based on the smallest average latency
        ideal_action = min(m.metrics, key=m.metrics.get) 
        
        # Based on exploration probabiliity choose either random exploration or choose the "best action"
        if np.random.rand() < self.exploration_prob: action = random.choice(list(m.metrics.keys()))
        else: action = ideal_action
            
        # Store results
        idx = mab.actions.index.max()+1
        if math.isnan(idx): idx = 0
        self.actions.loc[idx] = [action,m.metrics[action],m.metrics[ideal_action],m.metrics[action]-m.metrics[ideal_action]]

We can now use those to run through the entire dataset and choose actions. Regret measures the difference between the latency we chose and the latency of the best choice, which helps us understand the cost of our exploration.

m = Metrics() # Initialize Metrics
mab = MAB(0.15) # Initionalize our MAB with a 15% exploration probabilty
for i in range(len(df)): # for each point in time
    m.add_data(series=df.iloc[i]) # Add Data
    mab.perform_action(m) # Select an action and calculate results

When we look at the results we see that the "best" action is not always the same but we are pretty good about selecting the best action anyway. We do this by sampling options to monitor how all the actions change over time through constant exploration (15% of the time). The regret column measures the difference between the latency of the option we selected and the ideal option which we can use to measure and tune the cost of our exploration.

mab.actions.sample(10)

Conclusion

I hope after reading this post you have an good technical foundation in what a multi-armed bandit is and how to code a very simple one. I hope in addition to that you understand what are some of the challenges to a succesful implementation and full solution. I did not code up the variations that make this approach more powerful, but with this as a foundation and a little research you now should have all the tools you need to get started.

	acu-edu	acadiau-ca	adrian-edu	agnesscott-edu	aims-edu	uni-freiburg-de	alfred-edu	alvernia-edu	alverno-edu	american-edu	...	williams-edu	wsc-nodak-edu	winona-msus-edu	wpi-edu	wright-edu	yale-edu	yu-edu	yorku-ca	upenn-edu	ens-fr
0	396	381	488	506	333	1327	132	70	456	121	...	220	1898	434	125	304	94	460	347	532	429
1	271	261	488	504	276	1084	89	23	409	34	...	263	1032	294	74	269	252	98	265	233	293
2	271	141	325	545	266	1078	86	27	837	33	...	409	891	292	77	288	41	78	261	114	300
3	268	136	324	1946	331	1342	88	24	531	35	...	361	421	298	76	228	42	153	266	322	532
4	273	136	321	549	290	1192	143	26	434	32	...	982	522	296	73	764	41	98	262	234	271

	action	latency	min_latency	regret
502	uoregon-edu	417	17	400
698	pace-edu	17	17	0
808	fitnyc-edu	13	13	0
954	fitnyc-edu	10	10	0
81	stjohns-edu	17	17	0
947	fitnyc-edu	10	10	0
29	wcupa-edu	15	15	0
604	cmu-edu	70	18	52
573	pace-edu	18	18	0
920	fitnyc-edu	11	11	0