Multi Armed Bandits
How can a Multi Armed Bandit (probabilistic model) help me choose the "right" action
Background: Multi Armed Bandits (MAB) are a method of choosing the best action from a bunch of options. In order to choose the best action there are several problems to solve. These are:
- How do you know what action is "best"?
- What if the "best" action changes over time? How do you know it's changed?
- How do I balance choosing the best action with collecting new information?
Purpose: The purpose of this post is to explain what a Multi Armed Bandit (MAB) is, and how it solves those problems.
MAB Setup Requirements
In it's essence, MAB is just A/B testing in an intelligent and automated way. Because it is like A/B testing it has very similar requirements for setup, but because it is automated it often requires defining these items with a bit more precision. Let's start with what we need to have before we can use a MAB.
Things to Define The things needed to set up new actions are the same as needed to set up an A/B test:
- What the action is (ie send an email)
- When selected, what should be done (ie use x template to send automated email to y customer)
- What is the success criteria
- What is the metric
Most of the above is business information that business stakeholders decide in conjunction with technical teams. The tricky piece is all metrics must be comparable to each other. This is easy if it's all emails and you just want to measure click through rate, but sometimes comparisons are harder. A common approach is to define each of these in terms of dollars (if possible).
Example:Which is better: A 10% click through email rate, or 2% phone answering rate? Without defining it in terms of dollars (profit or revenue), how do you know which is better for the business?
Terminology
Before we go further there are 2 terms we need to understand:
- Exploitation: This simply means we are going to pick the action we think is best based on the metric
- Exploration: This means we are not going to pick the best action, but rather we are going to survey other action. One particular action may change in effectiveness over time and we want to spend some amount of time exploring so we can stay on top of these changes.
The Minimal Implementation
Let's describe what the simplest implementation would look like. The point of this is to understand the core of what a MAB is. Once we understand that we can look at all the 'tricks' that are used to make it fit a particular business and work better.
First we will do 2 things:
- What percent of the time will be exploiting vs exploring (see terminology section above)
- Calculate metrics for all actions
Then loop through this repeatedly:
- Pick action (Using probability to define whether we are exploiting or exploring)
- Update data with results when available
- Recalculate metrics using new data
- Repeat
This is the simplest MAB we can create. Let's look at what we need to do to make it 'smarter' and fit a business.
1. Use segments
Rather than calculate metrics over the entire universe of customers.
Different segments of customers may behave differently. The "best" action for 1 group of customers may be different than the "best action" for a different group of customers. Instead of treating all customers equal, break them into segments so we can personalize actions more.
Example:One segment could be "people over 65 that live in a cold climate area that has opened at least 1 email in the past"
2. Use smarter exploration methods
Rather than picking exploration actions based on random chance use statistics to determine which need more sampling
The most common 2 methods of defining exploration:
- Thompson Sampling: Choose best action based on random belief based on the distributions rather than just the metric. 1 action may be better overall but based on a random sample in the distribution a 'less good' action may be better than a 'best action' based on distribution overlap.
- Upper Confidence Bound: Choose based on how confident you are in your measurements. The is no need to explore a category you have high confidence in (ie very stable with lots of data).
Example:If no phone number on file all actions related to email cannot be selected This is the simplest example of a business rule. It could be that overall sending an email is the best action for customers, but we wouldn't want to choose an email action if we have no way to reach out to them.
Example:If have customer has not answered the last 10 phone calls, reduce call action probabilities by 10% This is a simple example. Making a phone call may be the best way overall to make a sale, but we have information on this specific customer that they do not answer the phone. We may still want to try on occasion, but less often for this specific customer.
Example:If a customer has bought X products before, remove Y offer from consideration This is where it's important to incorporate human expertise and knowledge of a business in. The best technical solutions don't ignore human expertise, but rather enhance human expertise and help scale that expertise to more customers.
4. Deep Learning
This is rarely the best starting point and requires a great deal of expertise. Rather than picking actions based on metric you could pick an action based on predicted outcomes using an output of NN (meaning NN needs to be updated in real time).
Using deep learning can often improve accuracy quite a bit, but it comes with a lot of negatives as well.
Pros:
- Increased accuracy
Cons:
- Harder to change, add, and remove actions
- Less "Interpretable", meaning it is harder to understand rationale behind what/why certain actions are considered "best"
- Much longer to set up and much harder to maintain over time
- Higher risk of hidden biases
Think carefully before going this route! Using deep learning for this is an incredibly powerful approach in the right situations, but those situations are not as common as many people believe.
import pandas as pd
from collections import defaultdict
import numpy as np
import math, statistics, random
First let's load and take a look at our data. The dataset we will be using is website latency. We want to pick the website at each given moment with the lowest latency.
df = pd.read_csv('univ-latencies/univ-latencies.txt')
df.head()
So first, we need to define out metrics and how we will update the metrics. As new data comes in we need to always keep track of the data so we can determine which is the best action and which we need to explore periodically.
class Metrics:
def __init__(self):
self.values = defaultdict(lambda: [])
self.metrics = dict()
def update_metrics(self):
'''calculate the metrics for each action'''
self.metrics = dict() # Reset metrics
for key in self.values.keys(): # For each action
self.metrics[key] = statistics.mean(self.values[key]) # average the latencies for our metric
def add_data(self,series):
for key in series.keys(): # for each action
self.values[key].append(series[key]) # Append the new data
if len(self.values[key]) > 7: self.values[key].pop(0) # only keep last 7 values - older data is no longer relevant
self.update_metrics() # Update metrics using recent data
Next, we need the actual Multi-Armed Bandit that can select actions based on the metrics the previous class calculates. Let's build this now.
class MAB:
def __init__(self,exploration_prob):
self.exploration_prob = exploration_prob # User defined
self.actions = pd.DataFrame(columns=['action','latency','min_latency','regret'])
def perform_action(self,m):
# Identify the "best" action based on the smallest average latency
ideal_action = min(m.metrics, key=m.metrics.get)
# Based on exploration probabiliity choose either random exploration or choose the "best action"
if np.random.rand() < self.exploration_prob: action = random.choice(list(m.metrics.keys()))
else: action = ideal_action
# Store results
idx = mab.actions.index.max()+1
if math.isnan(idx): idx = 0
self.actions.loc[idx] = [action,m.metrics[action],m.metrics[ideal_action],m.metrics[action]-m.metrics[ideal_action]]
We can now use those to run through the entire dataset and choose actions. Regret measures the difference between the latency we chose and the latency of the best choice, which helps us understand the cost of our exploration.
m = Metrics() # Initialize Metrics
mab = MAB(0.15) # Initionalize our MAB with a 15% exploration probabilty
for i in range(len(df)): # for each point in time
m.add_data(series=df.iloc[i]) # Add Data
mab.perform_action(m) # Select an action and calculate results
When we look at the results we see that the "best" action is not always the same but we are pretty good about selecting the best action anyway. We do this by sampling options to monitor how all the actions change over time through constant exploration (15% of the time). The regret column measures the difference between the latency of the option we selected and the ideal option which we can use to measure and tune the cost of our exploration.
mab.actions.sample(10)
Conclusion
I hope after reading this post you have an good technical foundation in what a multi-armed bandit is and how to code a very simple one. I hope in addition to that you understand what are some of the challenges to a succesful implementation and full solution. I did not code up the variations that make this approach more powerful, but with this as a foundation and a little research you now should have all the tools you need to get started.